Introduction to spelling correction
1 Spelling correction
## 1.1 Simple version
Assumption:
- have a large dictionary of real words
- only correct non-words
- only consider corrections that differ by a single character from the non-word
Idea:
- generate a list of all words $y$ that differ by one character from $x$
- compute the probability of each $y$ and choose the most probable one (see the sketch below)
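A minimal sketch of this simple corrector, assuming a hypothetical unigram frequency table `word_counts` standing in for the probability of each real word; candidates are all strings one substitution, deletion, or insertion away:

```python
import string

# Hypothetical unigram counts standing in for P(y); in practice this comes
# from a large dictionary / corpus.
word_counts = {"the": 5000, "they": 1200, "then": 800, "than": 600}

def edits1(x):
    """All strings exactly one substitution, deletion, or insertion away from x."""
    letters = string.ascii_lowercase
    splits = [(x[:i], x[i:]) for i in range(len(x) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | substitutes | inserts

def correct(x):
    """Return the candidate real word y with the highest unigram count."""
    candidates = [y for y in edits1(x) if y in word_counts]
    return max(candidates, key=word_counts.get) if candidates else x

print(correct("thw"))  # -> 'the'
```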
1.2 Alignments and edit distance
Edit distance:
- substitution
- deletion
- insertion
Need to:
- find the optimal character alignment between two words
- the optimal alignment is the one with minimum edit distance (MED)
How:
- dynamic programming (the same family of algorithms as Viterbi and CKY)
Uses:
- spelling correction
- morphological analysis
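A minimal dynamic-programming sketch of minimum edit distance with unit costs for substitution, deletion, and insertion (the optimal alignment itself can be recovered by keeping backpointers into the table):

```python
def min_edit_distance(source, target):
    """Fill a (len(source)+1) x (len(target)+1) table where d[i][j] is the
    MED between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        d[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution / match
    return d[n][m]

print(min_edit_distance("intention", "execution"))   # -> 5 with unit costs
```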
Brief introduction to word embeddings
1 Distributed word representation (Word2Vec)
Can represent the similarity of two words
Early:
- define classes $c$ of words, by hand or automatically
- $P_{CL}(w_i \mid w_{i-1})=P(c_i \mid c_{i-1})P(w_i \mid c_i)$
Recent idea: use neural networks to project words into a continuous space (usually 50 or 100 dimensions), so that words appearing in similar contexts have similar representations
Pros:
- can represent the similarity (distance) between different words
1.1 How to train?
There is no direct objective for training word vectors, but we can obtain them as a by-product of training a language model
Models over one-hot representations: neural network language model (NNLM), recurrent neural network language model (RNNLM)
- computation is too expensive
- too many parameters
Models for Word2Vec:
- Continuous Bag of Words (CBOW): predict central word based on the context
- Skip-gram: predict context based on the central word
1.1.1 CBOW
Loss function: $$ L=\sum_{w \in C} \log P(w \mid \text{Context}(w)) $$
- no hidden layer (just one projection layer)
- the context window spans both directions (left and right)
- input: one-hot representations
- the projection layer is simplified to a sum and mean of the context vectors
After training, multiplying a word's one-hot vector by the weight matrix $W$ gives the distributed representation of that word
- acceleration methods: hierarchical softmax, negative sampling
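A minimal NumPy sketch of a single CBOW step with toy dimensions; `V`, `D`, `W_in`, `W_out` and the random initialisation are assumptions for illustration. It shows the projection layer as a plain mean of the context vectors and the word's contribution $-\log P(w \mid \text{Context}(w))$ to the loss:

```python
import numpy as np

# Toy sizes; in practice V is the vocabulary size and D is around 50-300.
V, D = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # projection (embedding) matrix
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights

def cbow_step(context_ids, centre_id):
    """Average the context embeddings, score every vocabulary word with a
    softmax, and return -log P(centre | context)."""
    h = W_in[context_ids].mean(axis=0)        # projection layer: sum + mean
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    return -np.log(probs[centre_id])

loss = cbow_step(context_ids=[1, 2, 4, 5], centre_id=3)
# After training, row i of W_in (one-hot vector times W_in) is word i's embedding.
```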
1.1.2 Skip-gram
Loss function: $$ L=\sum_{w \in C} \log P(\text{context}(w) \mid w) $$
Two parameters:
- skip_window: the window covers 2*skip_window + 1 words (the centre word plus skip_window words on each side)
- num_skips: the number of (centre, context) pairs sampled for each centre word
e.g. "I am a boy who loves football."
For skip_window = 2 and centre word "a", the candidate pairs are (a, I), (a, am), (a, boy), (a, who); num_skips = 2 samples two of these (see the sketch below)
input: one-hot representation of $w(t)$
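A small sketch of generating (centre, context) training pairs under these two parameters; the helper name `skipgram_pairs` and the random sampling are assumptions for illustration:

```python
import random

def skipgram_pairs(tokens, skip_window=2, num_skips=2, seed=0):
    """For each centre word, sample num_skips (centre, context) pairs from a
    window of skip_window words on either side."""
    random.seed(seed)
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - skip_window), min(len(tokens), i + skip_window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        for ctx in random.sample(context, min(num_skips, len(context))):
            pairs.append((centre, ctx))
    return pairs

tokens = "I am a boy who loves football".split()
print(skipgram_pairs(tokens))   # e.g. contains ('a', 'am') and ('a', 'who')
```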
## 1.2 Problems
- a fixed context window ignores information contained in the rest of the text
- a single word vector cannot represent the different meanings (senses) of a word
Solution: GloVe
2 Noisy channel model
Introduction to smoothing word counts to get the probabilities
1. Evaluation
- Extrinsic: measure performance on a downstream application
- plug different models into the system, test on the same dataset and compare the accuracy
- Intrinsic: design a measure that is inherent to the current task
Entropy: on average, a human needs $H(X)$ yes/no questions to determine the next word $$ H(X)=\sum_x -P(x) \log_2P(x) $$ Cross-entropy: $$ H(P,\hat{P}) = \sum_x -P(x) \log_2 \hat{P}(x) $$
- cross-entropy $\ge$ entropy
- lower cross-entropy: model is better at predicting next word
Data compression:
- If we designed an optimal code based on our LM, we could encode the entire sentence in $H(P,\hat{P}) \times \text{number of words}$ bits ($H(P,\hat{P})$ yes/no questions per word)
Perplexity:
- LM performance is often reported as perplexity
- $\text{perplexity} = 2^{\text{cross-entropy}}$
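A small sketch of intrinsic evaluation: the empirical per-word cross-entropy of a model on a test sequence, converted to perplexity. The toy `probs` table is a hypothetical stand-in for an LM's conditional probabilities (history is ignored here for brevity):

```python
import math

def cross_entropy(test_words, model_prob):
    """Per-word cross-entropy: -(1/N) * sum_i log2 P_model(w_i)."""
    return -sum(math.log2(model_prob(w)) for w in test_words) / len(test_words)

probs = {"the": 0.2, "cat": 0.05, "sat": 0.1}
H = cross_entropy(["the", "cat", "sat"], lambda w: probs[w])
print(H, 2 ** H)   # cross-entropy ~3.32 bits/word, perplexity = 2^H ~10
```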
2. Smoothing 1
Smoothing methods address the problem by stealing probability mass from seen events and reallocating it to unseen events.
Without smoothing, the plain MLE N-gram model assigns zero probability to any unseen N-gram, which is far too strict for natural language.
2.1 Add-One (Laplace) Smoothing
$$ P_{+1}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v} \ \ \ \text{where }\sum_{w_i \in V} \frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v}=1 $$
- $v = \text{vocabulary size}$
- Cons: changes the denominator, which can have large effects even on frequent events
2.2 Add-$\alpha$ (Lidstone) Smoothing
$$ P_{+\alpha}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+\alpha}{C(w_{i-2},w_{i-1})+\alpha v} $$
Optimise $\alpha$ on a held-out (development) set
Cons:
- changes the denominator, which can have large effects even on frequent events
- still performs poorly when the vocabulary is large
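A minimal sketch of add-$\alpha$ smoothing for trigrams ($\alpha = 1$ recovers add-one); the helper name and toy counts are assumptions for illustration:

```python
from collections import Counter

def add_alpha_prob(trigram_counts, bigram_counts, vocab_size, alpha=0.1):
    """P_{+alpha}(w3 | w1, w2) = (C(w1,w2,w3) + alpha) / (C(w1,w2) + alpha * V)."""
    def prob(w1, w2, w3):
        return ((trigram_counts[(w1, w2, w3)] + alpha)
                / (bigram_counts[(w1, w2)] + alpha * vocab_size))
    return prob

# Toy counts; in practice these come from a large training corpus.
trigrams = Counter({("the", "cat", "sat"): 2})
bigrams = Counter({("the", "cat"): 3})
p = add_alpha_prob(trigrams, bigrams, vocab_size=10_000)
print(p("the", "cat", "sat"))    # seen trigram
print(p("the", "cat", "flew"))   # unseen trigram still gets non-zero probability
```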
2.3 Good-Turing
Idea:
- change the numerator
- use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams
Adjusted counts: $$ c^* = (c+1)\frac{N_{c+1}}{N_c} \ \ \ P_{GT}=\frac{c^*}{n} $$
Assumption:
- the distribution of each bigram is binomial
- each count is reduced slightly (discounting), consistent with Zipf's law
Cons:
- assumes we know the vocabulary size
- does not allow “holes” in the counts (i.e. some count value $c$ with $N_c = 0$)
- applies discounts even to high-frequency items
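A small sketch of the adjusted-count computation $c^* = (c+1)\,N_{c+1}/N_c$. The toy counts (roughly Zipf-shaped count-of-counts) and the fallback used when $N_{c+1}=0$ (one of the "holes" above) are assumptions:

```python
from collections import Counter

def good_turing_counts(counts):
    """c* = (c+1) * N_{c+1} / N_c, where N_c = number of items seen exactly c times."""
    n_c = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        if n_c[c + 1] > 0:
            adjusted[item] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            adjusted[item] = c        # a "hole": N_{c+1} = 0, so leave c unchanged
    return adjusted

# 7 bigrams seen once, 3 seen twice, 1 seen three times (Zipf-like shape).
bigram_counts = {f"bigram_{i}": c for i, c in enumerate([1] * 7 + [2] * 3 + [3])}
print(good_turing_counts(bigram_counts))
# counts 1 and 2 are discounted (to ~0.86 and 1.0); P_GT = c* / total tokens
```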
3. Smoothing 2
Problem with the previous methods: they assign equal probability to all unseen events
Two ways:
- interpolation
- backoff
3.1 Interpolation
Idea: combine higher and lower order N-gram models
- high-order N-grams are sensitive to more context, but have sparse counts
- low-order N-grams have limited context, but robust counts
Model: $$ P_{INT}(w_3 \mid w_1,w_2) = \lambda_1P_1(w_3) + \lambda_2P_2(w_3 \mid w_2)+\lambda_3P_3(w_3\mid w_1,w_2) $$
- $P_N$: N-gram estimate
- $\lambda_i$: interpolation parameters / mixture weights (must sum to 1)
Picking the values of $\lambda_i$: learn them from data, e.g. by optimising perplexity on a held-out data set (see the sketch below)
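A minimal sketch of linear interpolation; the component estimates passed in are hypothetical stand-ins for unigram/bigram/trigram models, and the $\lambda_i$ would normally be tuned on held-out data:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_INT(w3 | w1, w2) = l1*P1(w3) + l2*P2(w3|w2) + l3*P3(w3|w1,w2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9     # mixture weights must sum to 1
    def prob(w1, w2, w3):
        return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)
    return prob

p = interpolated_prob(
    p_uni=lambda w3: 0.01,
    p_bi=lambda w2, w3: 0.05,
    p_tri=lambda w1, w2, w3: 0.0,    # unseen trigram
)
print(p("the", "cat", "sat"))   # non-zero thanks to the lower-order models
```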
3.2 Katz Back-off
Idea: discount the trigram-based probability estimates
Model:
$$ P_{BO}(w_i \mid w_{i-N+1},\dots,w_{i-1}) = \begin{cases} P^*(w_i \mid w_{i-N+1},\dots,w_{i-1}) & \text{if } \operatorname{count}(w_{i-N+1},\dots,w_i)>0,\\ \alpha(w_{i-N+1},\dots,w_{i-1})\,P_{BO}(w_i \mid w_{i-N+2},\dots,w_{i-1}) & \text{otherwise.} \end{cases} $$
- $P^*(w_i \mid w_{i-N+1},\dots,w_{i-1})$: adjusted prediction model (Good-Turing discounted)
- $\alpha(w_{i-N+1},\dots,w_{i-1})$: backoff weights
- use the Good-Turing estimate if the count is non-zero; back off to the lower-order model if the count is zero (a simplified sketch follows)
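A much-simplified sketch of the backoff idea for bigrams, using a fixed absolute discount in place of the Good-Turing discounted $P^*$ of full Katz backoff; the names and toy counts are assumptions:

```python
from collections import Counter

def backoff_bigram_model(bigram_counts, unigram_counts, discount=0.5):
    """Seen bigram: discounted estimate. Unseen bigram: back off to the unigram
    estimate, scaled by alpha(w1) so the probabilities still sum to one."""
    total = sum(unigram_counts.values())
    p_uni = lambda w: unigram_counts[w] / total

    def prob(w1, w2):
        c_w1 = unigram_counts[w1]
        if bigram_counts[(w1, w2)] > 0:
            return (bigram_counts[(w1, w2)] - discount) / c_w1     # discounted P*
        seen = [v for (u, v) in bigram_counts if u == w1]
        reserved = discount * len(seen) / c_w1                     # held-out mass
        alpha = reserved / (1.0 - sum(p_uni(v) for v in seen))     # backoff weight
        return alpha * p_uni(w2)
    return prob

bigrams = Counter({("the", "cat"): 2, ("the", "dog"): 1})
unigrams = Counter({"the": 3, "cat": 2, "dog": 1, "sat": 1})
p = backoff_bigram_model(bigrams, unigrams)
print(p("the", "cat"), p("the", "sat"))   # seen vs. backed-off estimate
```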
3.3 Kneser-Ney smoothing
Idea: takes diversity of histories into account
Model: $$ N_{1+}(\bullet\, w_i)=\left| \{ w_{i-1} : c(w_{i-1}, w_i)>0 \} \right| $$
$$ P_{KN}(w_i)= \frac{N_{1+}(\bullet w_i)}{\sum_w N_{1+}(\bullet w)} $$
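A small sketch of the continuation probability: count how many distinct histories each word follows and normalise. The toy counts illustrate why a frequent but context-bound word ("Francisco") gets a low continuation probability:

```python
from collections import Counter, defaultdict

def continuation_prob(bigram_counts):
    """P_KN(w) = N_{1+}(. w) / sum_w' N_{1+}(. w'): the number of distinct
    left contexts w appears in, normalised over the vocabulary."""
    histories = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            histories[w].add(w_prev)
    n1plus = {w: len(h) for w, h in histories.items()}    # N_{1+}(. w)
    total = sum(n1plus.values())
    return {w: n / total for w, n in n1plus.items()}

bigrams = Counter({("San", "Francisco"): 50, ("the", "glasses"): 2,
                   ("my", "glasses"): 1, ("new", "glasses"): 1})
print(continuation_prob(bigrams))   # {'Francisco': 0.25, 'glasses': 0.75}
```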
| How probability is assigned to unseen events | Methods |
|---|---|
| Uniform probabilities | add-$\alpha$, Good-Turing |
| Probabilities from lower-order n-grams | interpolation, backoff |
| Probability of appearing in new contexts | Kneser-Ney |
Brief introduction to N-gram language models
1 Applications
- spelling correction
- automatic speech recognition
- machine translation
2 Estimate the probabilities: training
- training data: large corpus of general English text
- MLE: maximum likelihood estimation
- MLE assigns 0 probability to anything that hasn’t occurred in the training data
- we don’t know the true probabilities, we estimate instead
- sparse data problem: not enough observations to estimate probabilities well simply by counting observed data; most sentences occur rarely or never in the training data, so we need a smarter approach: the N-gram model
3 N-gram language model (N-gram LM)
3.1 Markov assumption (independence assumption)
The probability of a word only depends on a fixed number of previous words (history)
not always a good assumption
3.2 Probability calculation
To estimate $P(\vec{w})$, use the chain rule, then apply the Markov assumption: $$ P(w_1,\dots,w_n) = \prod_{i=1}^n P(w_i \mid w_1,w_2,\dots,w_{i-1}) \approx P(w_1)\,P(w_2\mid w_1)\prod_{i=3}^n P(w_i \mid w_{i-2}, w_{i-1}) $$ Problem: sentence boundaries
3.3 Beginning / end of sequence
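A minimal sketch of scoring a sentence with the trigram approximation above, assuming the standard convention of padding with `<s>` / `</s>` boundary markers so every factor has a full two-word history and the sentence end is modelled explicitly; the toy probability table is hypothetical:

```python
import math

def sentence_logprob(words, trigram_prob):
    """log P(w_1..w_n) under the trigram approximation, with <s> / </s> padding."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    return sum(math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
               for i in range(2, len(padded)))

# Hypothetical trigram estimates; in practice they come from counts + smoothing.
toy = {("<s>", "<s>", "the"): 0.2, ("<s>", "the", "cat"): 0.1,
       ("the", "cat", "sat"): 0.3, ("cat", "sat", "</s>"): 0.4}
print(sentence_logprob(["the", "cat", "sat"], lambda a, b, c: toy[(a, b, c)]))
```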
What is a corpus?
1. Corpora
Def: a large or complete collection of writings
- For NLP, we need corpora with linguistic annotations
- Markup formats: XML, JSON, CoNLL-style
- e.g. Brown, WSJ, ECI, BNC, Redwoods, Gigaword, AMI, Google Books N-grams, Flickr 8K, English Visual Genome
Why do we need corpora?
- rules: to build manual rules or databases (rule-based, symbolic, knowledge-driven)
- learning: provide example input/output pairs for supervised learning
2. Sentiment Analysis
Goal: predict the opinion expressed in a piece of text (e.g. + or -, positive or negative)
Method: attach gold labels to the reviews (+ or -)
- can be derived automatically from the original data artifact (metadata such as star ratings)
- can be added by a human annotator who reads the text
Issue: whether annotators are consistent with one another (inter-annotator agreement)
Simple sentiment classification algorithm
- Use a sentiment lexicon to count positive and negative words (see the sketch after this list)
- Issues
- sense ambiguity: need more data, e.g. frequency counts, to determine the intended sense
- sarcasm
- the text could describe an expectation rather than express an opinion
- words may describe a character’s attitude, not the author’s
- phrases like “can’t believe how great it is”
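A minimal sketch of the lexicon-counting classifier referenced above; the tiny `positive` / `negative` sets are hypothetical stand-ins for a real sentiment lexicon, and the second example shows the "expectation, not opinion" failure mode:

```python
def lexicon_sentiment(text, positive_words, negative_words):
    """Count positive vs. negative lexicon words and return the majority label.
    All of the issues listed above will fool this simple counting."""
    tokens = text.lower().split()
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    return "+" if pos > neg else "-" if neg > pos else "neutral"

positive = {"great", "love", "wonderful"}
negative = {"awful", "boring", "hate"}
print(lexicon_sentiment("I love this great film", positive, negative))       # '+'
print(lexicon_sentiment("I was expecting it to be great", positive, negative))
# also '+', even though the text only reports an expectation, not an opinion
```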
Choice of training and evaluation data
- Assume that the training data and the test data are sampled from the same distribution
- If they are not, domain adaptation techniques are needed
Brief introduction to NLP
1. Why hard?
Ambiguity
Sparse data (due to Zipf’s Law)
not enough observations to estimate probabilities well simply by counting observed data.
$$f\times r \approx k$$
$f$ = frequency of a word
$r$ = rank of a word
$k$ = a constant
Variation
Expressivity: one form can have different meanings and the same meaning can be expressed with different forms
Context dependence: correct interpretation is context-dependent and often requires world knowledge
Unknown representation: we don’t know how to represent the knowledge a human has / needs
2. Models
Non-probabilistic methods: FSMs, CKY parsers
Probabilistic models: HMMs for POS tagging, PCFGs for syntax
Probabilistic algorithms: Viterbi, probabilistic CKY
3. Zipf’s Law
Implications:
- Regardless of how large the corpus is, there will be a lot of infrequent words
- Holds for many other levels of linguistic structure (e.g. syntactic rules in a CFG)
- Zipf’s Law shows that we need to find ways to estimate probabilities for things we have rarely or never seen during training
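A small sketch of checking Zipf's law empirically: for the top-ranked words of a corpus, $f \times r$ should stay roughly constant. The file name `corpus.txt` is a hypothetical stand-in for any large text:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Return (rank, word, frequency, rank * frequency) for the most frequent words."""
    ranked = Counter(tokens).most_common(top)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

tokens = open("corpus.txt").read().lower().split()   # hypothetical corpus file
for rank, word, freq, k in zipf_table(tokens):
    print(rank, word, freq, k)   # the f*r column should stay in the same ballpark
```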