Introduction to spelling correction

1 Spelling correction

1.1 Simple version

Assumptions:

  • we have a large dictionary of real words
  • we only correct non-words
  • we only consider corrections that differ from the non-word by a single character

Idea:

  • generate a list of all real words $y$ that differ by one character from the non-word $x$
  • compute the probability of each candidate $y$ and choose the most probable one
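
A minimal sketch of this idea (not the lecture's implementation): `edits1` enumerates candidates one edit away, and `correct` picks the most probable real-word candidate. The names `dictionary` (a set of real words) and `P` (a unigram probability table) are assumed to exist.

```python
import string

def edits1(x):
    """All strings that differ from x by one substitution, deletion, or insertion."""
    letters = string.ascii_lowercase
    splits = [(x[:i], x[i:]) for i in range(len(x) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + substitutions + inserts)

def correct(x, dictionary, P):
    """Pick the real-word candidate y with the highest probability P[y]."""
    candidates = edits1(x) & dictionary
    return max(candidates, key=lambda y: P.get(y, 0.0)) if candidates else x
```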

1.2 Alignments and edit distance

Edit distance:

  • substitution
  • deletion
  • insertion

Need to:

  • find the optimal character alignment between two words

    the optimal alignment is the one with minimum edit distance (MED)

How:

  • Dynamic programming (the same family of algorithms as Viterbi and CKY)
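
For minimum edit distance specifically, the dynamic program fills a table of prefix-to-prefix distances. A minimal sketch, assuming unit costs for substitution, deletion, and insertion:

```python
def min_edit_distance(source, target):
    """d[i][j] = minimum edit distance between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution / match
    return d[n][m]

# min_edit_distance("intention", "execution") == 5 under these unit costs
```

The optimal alignment itself can be recovered by backtracking through the filled table.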

Uses:

  • spelling correction
  • morphological analysis

Brief introduction to word embedding

1 Distributed word representation (Word2Vec)

Can represent the similarity of two words

Early approach (class-based LM):

  • define classes $c$ of words, by hand or automatically
  • $P_{CL}(w_i \mid w_{i-1})=P(c_i \mid c_{i-1})P(w_i \mid c_i)$

Recent idea: use neural networks to project words into a continuous space, so that words appearing in similar contexts have similar representations (usually 50 or 100 dimensions)

Pros:

  • can represent the similarity (distance) between different words
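
For example, similarity is usually measured as the cosine between two vectors. A minimal sketch with made-up 3-dimensional embeddings (real models use 50 or 100 dimensions as noted above):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors (higher = more similar)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical embeddings, only to illustrate the computation
v_cat = np.array([0.8, 0.1, 0.3])
v_dog = np.array([0.7, 0.2, 0.4])
v_car = np.array([0.1, 0.9, 0.2])
cosine_similarity(v_cat, v_dog)   # relatively high
cosine_similarity(v_cat, v_car)   # lower
```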

1.1 How to train?

There is no direct way to train word vectors, but we can obtain them as a by-product of training a language model

Earlier models built on one-hot representations: the neural network language model (NNLM), the recurrent neural network language model (RNNLM)

  • the computations are too expensive
  • too many parameters

Models for Word2Vec:

  • Continuous Bag of Words (CBOW): predict central word based on the context
  • Skip-gram: predict context based on the central word

1.1.1 CBOW

Loss function: $$ L=\sum_{w \in C} \log P(w \mid \text{Context}(w)) $$

  • no hidden layer (just one projection layer)
  • the context window extends in both directions around the target word
  • input: one-hot representations of the context words
  • the projection layer is simplified to a sum and a mean over the context vectors

After training, multiplying a word’s one-hot vector by the weight matrix $W$ selects the corresponding row of $W$, which is that word’s distributed representation

  • accelerating methods: hierarchical softmax, negative sampling
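
A minimal sketch of the CBOW forward pass, assuming hypothetical sizes `V` and `D` and a plain softmax (i.e. without the hierarchical-softmax or negative-sampling speed-ups):

```python
import numpy as np

V, D = 10000, 100                        # vocabulary size, embedding dimension (assumed)
W_in = 0.01 * np.random.randn(V, D)      # projection weights: one row per word
W_out = 0.01 * np.random.randn(D, V)     # output weights

def cbow_forward(context_ids, W_in, W_out):
    """Predict the central word from the mean of its context word vectors."""
    h = W_in[context_ids].mean(axis=0)   # projection layer: sum/mean of context rows
    scores = h @ W_out                   # one score per vocabulary word
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax distribution over the central word
```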

1.1.2 Skip-gram

Loss function: $$ L=\sum_{w \in C} \log P(\text{context}(w) \mid w) $$

  • two parameters

    • skip_window: window size = 2*skip_window + 1

    • num_skips: the number of (central word, context word) pairs sampled for each central word

      e.g. I am a boy who loves football.

      for num_skips = 2, skip_window = 2, central word = a

      => 2 pairs sampled from (a, I), (a, am), (a, boy), (a, who)

  • input: one-hot representation of $w(t)$
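
A minimal sketch of generating skip-gram training pairs under these definitions (`skipgram_pairs` is a hypothetical helper; it samples context words at random):

```python
import random

def skipgram_pairs(tokens, centre_index, skip_window, num_skips):
    """Sample up to num_skips (central word, context word) pairs within +/- skip_window."""
    centre = tokens[centre_index]
    lo = max(0, centre_index - skip_window)
    hi = min(len(tokens), centre_index + skip_window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != centre_index]
    return [(centre, c) for c in random.sample(context, min(num_skips, len(context)))]

tokens = "I am a boy who loves football .".split()
skipgram_pairs(tokens, tokens.index("a"), skip_window=2, num_skips=2)
# e.g. [('a', 'am'), ('a', 'who')] -- 2 of the 4 possible pairs, chosen at random
```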

1.2 Problems

  • using a fixed context window ignores information contained in the rest of the text (global co-occurrence statistics)
  • a single word vector cannot represent the different senses of a polysemous word

Solution: GloVe

2 Noisy channel model

Introduction to smoothing word counts to get the probabilities

1. Evaluation

  • Extrinsic: measure performance on a downstream application
    • plug different models into the system, test on the same dataset, and compare accuracy
  • Intrinsic: design a measure that is inherent to the current task

Entropy: on average, a human needs $H(X)$ yes/no questions to determine the next word $$ H(X)=\sum_x -P(x) \log_2P(x) $$ Cross-entropy: $$ H(P,\hat{P}) = \sum_x -P(x) \log_2 \hat{P}(x) $$

  • cross-entropy $\ge$ entropy
  • lower cross-entropy: model is better at predicting next word

Data compression:

  • If we designed an optimal code based on our LM, we could encode the entire sentence in $H(P,\hat{P}) \times \text{number of words}$ bits ($H(P,\hat{P})$ yes/no questions per word)

Perplexity:

  • LM performance is often reported as perplexity
  • $\text{perplexity} = 2^{\text{cross-entropy}}$
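
A minimal sketch of how cross-entropy and perplexity are computed from the probabilities a model assigns to a test sequence (the probability values below are made up):

```python
import math

def cross_entropy(word_probs):
    """Average -log2 probability the model assigns to each observed word (bits per word)."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

probs = [0.2, 0.05, 0.1, 0.25]       # hypothetical model probabilities for one sentence
ce = cross_entropy(probs)            # bits per word
perplexity = 2 ** ce                 # perplexity = 2^cross-entropy
```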

2. Smoothing 1

Smoothing methods address the problem by stealing probability mass from seen events and reallocating it to unseen events.

Without smoothing, an N-gram model assigns zero probability to any sequence containing an unseen N-gram, so the raw MLE estimates are of little use in practice.

2.1 Add-One (Laplace) Smoothing

$$ P_{+1}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v} \ \ \ \text{where }\sum_{w_i \in V} \frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v}=1 $$

  • $v = \text{vocabulary size}$
  • Cons: changes the denominator, which can have a big effect even on frequent events
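
A minimal sketch of the add-one estimate, assuming `trigram_counts` and `bigram_counts` dictionaries of raw counts (hypothetical names):

```python
def p_add_one(w1, w2, w3, trigram_counts, bigram_counts, vocab_size):
    """P_+1(w3 | w1, w2) = (C(w1, w2, w3) + 1) / (C(w1, w2) + v)."""
    return (trigram_counts.get((w1, w2, w3), 0) + 1) / \
           (bigram_counts.get((w1, w2), 0) + vocab_size)
```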

2.2 Add-$\alpha$ (Lidstone) Smoothing

$$ P_{+\alpha}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+\alpha}{C(w_{i-2},w_{i-1})+\alpha v} $$

  • $\alpha$ is optimised (e.g. on a held-out data set)

  • Cons:

    • changes the denominator, which can have a big effect even on frequent events
    • performs poorly when the vocabulary size is large

2.3 Good-Turing

Idea:

  • change the numerator
  • use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams

Adjusted counts: $$ c^* = (c+1)\frac{N_{c+1}}{N_c} \ \ \ P_{GT}=\frac{c^*}{n} $$ where $N_c$ is the number of N-grams observed exactly $c$ times and $n$ is the total number of observed N-gram tokens

Assumption:

  • the distribution of each bigram is binomial

  • each count is reduced slightly (Zipf): discounting

Cons:

  • assumes we know the vocabulary size
  • does not allow “holes” in the counts (some count value $c$ with $N_c = 0$)
  • applies discounts even to high-frequency items
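
A minimal sketch of the adjusted counts, assuming `counts` maps each bigram to its raw count (hypothetical name; it falls back to the raw count when $N_{c+1}=0$, i.e. it does not handle the “holes” problem properly):

```python
from collections import Counter

def good_turing(counts):
    """c* = (c + 1) * N_{c+1} / N_c, where N_c = number of bigrams observed exactly c times."""
    N = Counter(counts.values())              # N[c]: how many bigrams occur exactly c times
    n = sum(counts.values())                  # total number of observed bigram tokens
    adjusted = {}
    for bigram, c in counts.items():
        if N.get(c + 1, 0) > 0:
            adjusted[bigram] = (c + 1) * N[c + 1] / N[c]
        else:
            adjusted[bigram] = c              # "hole": no bigram with count c+1, keep raw count
    return adjusted, n                        # P_GT(bigram) = adjusted[bigram] / n
```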

3. Smoothing 2

Problem with the previous methods: they assign equal probability to all unseen events

Two ways:

  • interpolation
  • backoff

3.1 Interpolation

Idea: combine higher and lower order N-gram models

  • high-order N-grams are sensitive to more context, but have sparse counts
  • low-order N-grams have limited context, but robust counts

Model: $$ P_{INT}(w_3 \mid w_1,w_2) = \lambda_1P_1(w_3) + \lambda_2P_2(w_3 \mid w_2)+\lambda_3P_3(w_3\mid w_1,w_2) $$

  • $P_N$: N-gram estimate

  • $\lambda_i$: interpolation parameters / mixture weights

    must sum to 1

Picking the values of $\lambda_i$: learn them with a machine learning method, e.g. by optimising perplexity on a held-out data set
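
A minimal sketch of the interpolated estimate, assuming unigram, bigram, and trigram probability functions `p_uni`, `p_bi`, `p_tri` are already available (hypothetical names; the example weights are arbitrary but sum to 1):

```python
def p_interpolated(w1, w2, w3, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_INT(w3 | w1, w2) = l1*P1(w3) + l2*P2(w3 | w2) + l3*P3(w3 | w1, w2)."""
    l1, l2, l3 = lambdas                       # mixture weights, must sum to 1
    return l1 * p_uni(w3) + l2 * p_bi(w3, w2) + l3 * p_tri(w3, w1, w2)
```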

3.2 Katz Back-off

Idea: discount the higher-order (e.g. trigram) probability estimates, and back off to lower-order estimates for unseen N-grams

Model:

$$ P_{BO}(w_i \mid w_{i-N+1},\dots,w_{i-1}) = \begin{cases} P^*(w_i \mid w_{i-N+1},\dots,w_{i-1}) & \text{if count}(w_{i-N+1},\dots,w_i)>0 \\ \alpha(w_{i-N+1},\dots,w_{i-1})\,P_{BO}(w_i \mid w_{i-N+2},\dots,w_{i-1}) & \text{otherwise} \end{cases} $$

  • $P^*(w_i \mid w_{i-N+1},\dots,w_{i-1})$: adjusted prediction model (Good-Turing discounting)
  • $\alpha(w_{i-N+1},\dots,w_{i-1})$: backoff weights
  • use the discounted (Good-Turing) estimate if the count is non-zero; back off to the lower-order model if the count is zero
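
A minimal sketch of the back-off recursion, assuming precomputed lookup functions `count`, `p_star` (discounted estimates), and `alpha` (backoff weights), all hypothetical names:

```python
def p_backoff(word, history, count, p_star, alpha):
    """Katz back-off: discounted estimate if the full N-gram was seen, else back off."""
    if not history:                          # base case: (discounted) unigram estimate
        return p_star(word, ())
    if count(history + (word,)) > 0:
        return p_star(word, history)         # P*(w | history), already discounted
    # unseen N-gram: scale the shorter-history estimate by the backoff weight
    return alpha(history) * p_backoff(word, history[1:], count, p_star, alpha)
```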

3.3 Kneser-Ney smoothing

Idea: take the diversity of histories into account (how many distinct contexts a word appears in)

Model: $$ N_{1+}(\bullet\, w_i)=\left| \{ w_{i-1} : c(w_{i-1}, w_i)>0 \} \right| $$

$$ P_{KN}(w_i)= \frac{N_{1+}(\bullet w_i)}{\sum_w N_{1+}(\bullet w)} $$
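
A minimal sketch of the unigram continuation probability, assuming `bigram_counts` maps `(w_prev, w)` pairs to counts (hypothetical name):

```python
from collections import defaultdict

def continuation_probability(bigram_counts):
    """P_KN(w) is proportional to the number of distinct histories w appears after."""
    histories = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            histories[w].add(w_prev)                 # N_1+(. w): distinct left contexts of w
    total = sum(len(h) for h in histories.values())  # normaliser: sum over all words
    return {w: len(h) / total for w, h in histories.items()}
```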

| Types | Methods |
| --- | --- |
| Uniform probabilities | add-$\alpha$, Good-Turing |
| Probabilities from lower-order n-grams | interpolation, backoff |
| Probability of appearing in new contexts | Kneser-Ney |

Brief introduction to N-gram language model

1 Applications:

  • spelling correction
  • automatic speech recognition
  • machine translation

2 Estimate the probabilities: training

  • training data: a large corpus of general English text
  • MLE: maximum likelihood estimation
    • MLE assigns 0 probability to anything that hasn’t occurred in the training data
  • we don’t know the true probabilities; we estimate them instead
  • sparse data problem: not enough observations to estimate probabilities well simply by counting observed data (most full sentences occur rarely or never in the training data; the smarter way is an N-gram model)

3 N-gram language model (N-gram LM)

3.1 Markov assumption (independence assumption)

The probability of a word only depends on a fixed number of previous words (history)

  • not always a good assumption

3.2 Probability calculation

To estimate $ P(\vec{w}) $, use chain rule: $$ P(w_1,\dots,w_n) = \prod_{i=1}^nP(w_i \mid w_1,w_2,\dots,w_{i-1}) \ \approx P(w_1)P(w_2\mid w_1)\prod_{i=3}^n P(w_i \mid w_{i-2}, w_{i-1}) $$ Problem: sentence boundary
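
A minimal sketch of scoring a sentence under the trigram approximation, assuming estimates `p_unigram`, `p_bigram`, `p_trigram` are available (hypothetical names; sentence-boundary handling is ignored here):

```python
import math

def sentence_log_prob(words, p_unigram, p_bigram, p_trigram):
    """log P(w1..wn) = log P(w1) + log P(w2 | w1) + sum_i log P(wi | wi-2, wi-1)."""
    logp = math.log(p_unigram(words[0]))
    if len(words) > 1:
        logp += math.log(p_bigram(words[1], words[0]))
    for i in range(2, len(words)):
        logp += math.log(p_trigram(words[i], words[i - 2], words[i - 1]))
    return logp
```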

3.3 Beginning / end of sequence

What is a corpus?

1. Corpora

Def: a large or complete collection of writings

  • For NLP, we need corpora with linguistic annotations
  • Markup formats: XML, JSON, CoNLL-style
  • e.g. Brown, WSJ, ECI, BNC, Redwoods, Gigaword, AMI, Google Books N-grams, Flickr 8K, English Visual Genome

Why do we need corpora?

  • alternative: manually built rules or databases (rule-based, symbolic, knowledge-driven approaches)
  • learning: corpora provide example input/output pairs for supervised learning

2. Sentiment Analysis

Goal: predict the opinion expressed in a piece of text (e.g. + or -, positive or negative)

Method: attach gold labels to the reviews (+ or -)

  • can be derived automatically from the original data artifact (metadata such as star ratings)

  • can be added by a human annotator who reads the text

Issue: are annotators consistent with each other (inter-annotator agreement)?

Simple sentiment classification algorithm

  • Use a sentiment lexicon to count positive and negative words (see the sketch after this list)
  • Issues
    • sense ambiguity: lexicon words may be used in a different sense; need more data, e.g. frequency counts, to resolve it
    • sarcasm
    • the text may describe an expectation rather than an opinion about the item itself
    • words may describe a character’s attitude, not the author’s
    • negation in phrases like “can’t believe how great it is” (negative cue word, positive sentiment)
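
A minimal sketch of the lexicon-counting classifier referenced above, assuming `positive_words` and `negative_words` are sets loaded from a sentiment lexicon (hypothetical names; ties default to negative):

```python
def classify_sentiment(text, positive_words, negative_words):
    """Label a review '+' if it contains more positive lexicon words than negative ones."""
    tokens = text.lower().split()
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    return "+" if pos > neg else "-"

classify_sentiment("a great , heartfelt film", {"great", "heartfelt"}, {"dull"})  # '+'
```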

Choice of training / evaluation data

  • Assume that the training data and the test data are sampled from the same distribution
  • When this assumption does not hold: domain adaptation techniques

Brief introduction to NLP

1. Why hard?

  • Ambiguity

  • Sparse data (due to Zipf’s Law)

    not enough observations to estimate probabilities well simply by counting observed data.

    $$f\times r \approx k$$

    $f$ = frequency of a word

    $r$ = rank of a word

    $k$ = a constant

  • Variation

  • Expressivity: one form can have different meanings and the same meaning can be expressed with different forms

  • Context dependence: correct interpretation is context-dependent and often requires world knowledge

  • Unknown representation: we don’t know how to represent the knowledge a human has / needs

2. Models

  • Non-probabilistic methods: FSMs, CKY parsers

  • Probabilistic models: HMMs for POS tagging, PCFGs for syntax

    Probabilistic algorithms: Viterbi, probabilistic CKY

3. Zipf’s Law

Implications:

  • Regardless of how large the corpus is, there will be a lot of infrequent words
  • Holds for many other levels of linguistic structure (e.g. syntactic rules in a CFG)
  • Zipf’s Law shows that we need to find ways to estimate probabilities for things we have rarely or never seen during training
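
A minimal sketch of checking this empirically, assuming `tokens` holds the words of some corpus (hypothetical variable): for the most frequent words, frequency × rank should stay roughly constant.

```python
from collections import Counter

def zipf_check(tokens, top_n=20):
    """Print frequency * rank for the top_n most frequent words; it should be roughly constant."""
    counts = Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        print(f"{word:>15}  rank={rank:<3d} f={freq:<7d} f*r={freq * rank}")
```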