Introduction to spelling correction
1 Spelling correction
## 1.1 Simple version
Assumption:
- have a large dictionary of real words
- only correct non-words
- only consider corrections that differ by a single character from the non-word
Idea:
- generate a list of all words $y$ that differ by one character from $x$
- compute the probability of each $y$ and choose the most probable one (see the sketch below)
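A minimal sketch of this simple corrector, assuming a hypothetical unigram frequency table `word_counts` standing in for the probability of each real word; candidates are all strings one substitution, deletion, or insertion away:

```python
import string

# Hypothetical unigram counts standing in for P(y); in practice this comes
# from a large dictionary / corpus.
word_counts = {"the": 5000, "they": 1200, "then": 800, "than": 600}

def edits1(x):
    """All strings exactly one substitution, deletion, or insertion away from x."""
    letters = string.ascii_lowercase
    splits = [(x[:i], x[i:]) for i in range(len(x) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | substitutes | inserts

def correct(x):
    """Return the candidate real word y with the highest unigram count."""
    candidates = [y for y in edits1(x) if y in word_counts]
    return max(candidates, key=word_counts.get) if candidates else x

print(correct("thw"))  # -> 'the'
```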
1.2 Alignments and edit distance
Edit distance:
- substitution
- deletion
- insertion
Need to:
- find the optimal character alignment between two words
- the optimal alignment is the one with minimum edit distance (MED)
How:
- dynamic programming (the same family of algorithms as Viterbi and CKY)
Uses:
- spelling correction
- morphological analysis
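A minimal dynamic-programming sketch of minimum edit distance with unit costs for substitution, deletion, and insertion (the optimal alignment itself can be recovered by keeping backpointers into the table):

```python
def min_edit_distance(source, target):
    """Fill a (len(source)+1) x (len(target)+1) table where d[i][j] is the
    MED between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        d[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)     # substitution / match
    return d[n][m]

print(min_edit_distance("intention", "execution"))   # -> 5 with unit costs
```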
Brief introduction to word embeddings
1 Distributed word representation (Word2Vec)
Can represent the similarity of two words
Early:
- define classes $c$ of words, by hand or automatically
- $P_{CL}(w_i \mid w_{i-1})=P(c_i \mid c_{i-1})P(w_i \mid c_i)$
Recent idea: use neural networks to project words into a continuous space (usually 50 or 100 dimensions), so that words appearing in similar contexts have similar representations
Pros:
- can represent the similarity (distance) between different words
1.1 How to train?
There is no direct objective for training word vectors, but we can obtain them as a by-product of training a language model
Models over one-hot representations: neural network language model (NNLM), recurrent neural network language model (RNNLM)
- computation is too expensive
- too many parameters
Models for Word2Vec:
- Continuous Bag of Words (CBOW): predict central word based on the context
- Skip-gram: predict context based on the central word
1.1.1 CBOW
Loss function: $$ L=\sum_{w \in C} \log P(w \mid \text{Context}(w)) $$
- no hidden layer (just one projection layer)
- the context window spans both directions (left and right)
- input: one-hot representations
- the projection layer is simplified to a sum and mean of the context vectors
After training, multiplying a word's one-hot vector by the weight matrix $W$ gives the distributed representation of that word
- acceleration methods: hierarchical softmax, negative sampling
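A minimal NumPy sketch of a single CBOW step with toy dimensions; `V`, `D`, `W_in`, `W_out` and the random initialisation are assumptions for illustration. It shows the projection layer as a plain mean of the context vectors and the word's contribution $-\log P(w \mid \text{Context}(w))$ to the loss:

```python
import numpy as np

# Toy sizes; in practice V is the vocabulary size and D is around 50-300.
V, D = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # projection (embedding) matrix
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights

def cbow_step(context_ids, centre_id):
    """Average the context embeddings, score every vocabulary word with a
    softmax, and return -log P(centre | context)."""
    h = W_in[context_ids].mean(axis=0)        # projection layer: sum + mean
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    return -np.log(probs[centre_id])

loss = cbow_step(context_ids=[1, 2, 4, 5], centre_id=3)
# After training, row i of W_in (one-hot vector times W_in) is word i's embedding.
```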
1.1.2 Skip-gram
Loss function: $$ L=\sum_{w \in C} \log P(\text{context}(w) \mid w) $$
Two parameters:
- skip_window: the window covers 2*skip_window + 1 words (the centre word plus skip_window words on each side)
- num_skips: the number of (centre, context) pairs sampled for each centre word
e.g. "I am a boy who loves football."
For skip_window = 2 and centre word "a", the candidate pairs are (a, I), (a, am), (a, boy), (a, who); num_skips = 2 samples two of these (see the sketch below)
input: one-hot representation of $w(t)$
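A small sketch of generating (centre, context) training pairs under these two parameters; the helper name `skipgram_pairs` and the random sampling are assumptions for illustration:

```python
import random

def skipgram_pairs(tokens, skip_window=2, num_skips=2, seed=0):
    """For each centre word, sample num_skips (centre, context) pairs from a
    window of skip_window words on either side."""
    random.seed(seed)
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - skip_window), min(len(tokens), i + skip_window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        for ctx in random.sample(context, min(num_skips, len(context))):
            pairs.append((centre, ctx))
    return pairs

tokens = "I am a boy who loves football".split()
print(skipgram_pairs(tokens))   # e.g. contains ('a', 'am') and ('a', 'who')
```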
## 1.2 Problems
- a fixed context window ignores information contained in the rest of the text
- a single word vector cannot represent the different meanings (senses) of a word
Solution: GloVe
2 Noisy channel model
Introduction to smoothing word counts to get the probabilities
1. Evaluation
- Extrinsic: measure performance on a downstream application
- plug different models into the system, test on the same dataset and compare the accuracy
- Intrinsic: design a measure that is inherent to the current task
Entropy: on average, a human needs $H(X)$ yes/no questions to determine the next word $$ H(X)=\sum_x -P(x) \log_2P(x) $$ Cross-entropy: $$ H(P,\hat{P}) = \sum_x -P(x) \log_2 \hat{P}(x) $$
- cross-entropy $\ge$ entropy
- lower cross-entropy: model is better at predicting next word
Data compression:
- If we designed an optimal code based on our LM, we could encode the entire sentence in $H(P,\hat{P}) \times \text{number of words}$ bits ($H(P,\hat{P})$ yes/no questions per word)
Perplexity:
- LM performance is often reported as perplexity
- $\text{perplexity} = 2^{\text{cross-entropy}}$
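A small sketch of intrinsic evaluation: the empirical per-word cross-entropy of a model on a test sequence, converted to perplexity. The toy `probs` table is a hypothetical stand-in for an LM's conditional probabilities (history is ignored here for brevity):

```python
import math

def cross_entropy(test_words, model_prob):
    """Per-word cross-entropy: -(1/N) * sum_i log2 P_model(w_i)."""
    return -sum(math.log2(model_prob(w)) for w in test_words) / len(test_words)

probs = {"the": 0.2, "cat": 0.05, "sat": 0.1}
H = cross_entropy(["the", "cat", "sat"], lambda w: probs[w])
print(H, 2 ** H)   # cross-entropy ~3.32 bits/word, perplexity = 2^H ~10
```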
2. Smoothing 1
Smoothing methods address the problem by stealing probability mass from seen events and reallocating it to unseen events.
Without smoothing, the plain MLE N-gram model assigns zero probability to any unseen N-gram, which is far too strict for natural language.
2.1 Add-One (Laplace) Smoothing
$$ P_{+1}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v} \ \ \ \text{where }\sum_{w_i \in V} \frac{C(w_{i-2},w_{i-1},w_i)+1}{C(w_{i-2},w_{i-1})+v}=1 $$
- $v = \text{vocabulary size}$
- Cons: changes the denominator, which can have large effects even on frequent events
2.2 Add-$\alpha$ (Lidstone) Smoothing
$$ P_{+\alpha}(w_i \mid w_{i-2},w_{i-1}) =\frac{C(w_{i-2},w_{i-1},w_i)+\alpha}{C(w_{i-2},w_{i-1})+\alpha v} $$
Optimise $\alpha$ on a held-out (development) set
Cons:
- changes the denominator, which can have large effects even on frequent events
- still performs poorly when the vocabulary is large
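A minimal sketch of add-$\alpha$ smoothing for trigrams ($\alpha = 1$ recovers add-one); the helper name and toy counts are assumptions for illustration:

```python
from collections import Counter

def add_alpha_prob(trigram_counts, bigram_counts, vocab_size, alpha=0.1):
    """P_{+alpha}(w3 | w1, w2) = (C(w1,w2,w3) + alpha) / (C(w1,w2) + alpha * V)."""
    def prob(w1, w2, w3):
        return ((trigram_counts[(w1, w2, w3)] + alpha)
                / (bigram_counts[(w1, w2)] + alpha * vocab_size))
    return prob

# Toy counts; in practice these come from a large training corpus.
trigrams = Counter({("the", "cat", "sat"): 2})
bigrams = Counter({("the", "cat"): 3})
p = add_alpha_prob(trigrams, bigrams, vocab_size=10_000)
print(p("the", "cat", "sat"))    # seen trigram
print(p("the", "cat", "flew"))   # unseen trigram still gets non-zero probability
```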
2.3 Good-Turing
Idea:
- change the numerator
- use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams
Adjusted counts: $$ c^* = (c+1)\frac{N_{c+1}}{N_c} \ \ \ P_{GT}=\frac{c^*}{n} $$
Assumption:
- the distribution of each bigram is binomial
- each count is reduced slightly (discounting), consistent with Zipf's law
Cons:
- assumes we know the vocabulary size
- does not allow “holes” in the counts (i.e. some count value $c$ with $N_c = 0$)
- applies discounts even to high-frequency items
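A small sketch of the adjusted-count computation $c^* = (c+1)\,N_{c+1}/N_c$. The toy counts (roughly Zipf-shaped count-of-counts) and the fallback used when $N_{c+1}=0$ (one of the "holes" above) are assumptions:

```python
from collections import Counter

def good_turing_counts(counts):
    """c* = (c+1) * N_{c+1} / N_c, where N_c = number of items seen exactly c times."""
    n_c = Counter(counts.values())
    adjusted = {}
    for item, c in counts.items():
        if n_c[c + 1] > 0:
            adjusted[item] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            adjusted[item] = c        # a "hole": N_{c+1} = 0, so leave c unchanged
    return adjusted

# 7 bigrams seen once, 3 seen twice, 1 seen three times (Zipf-like shape).
bigram_counts = {f"bigram_{i}": c for i, c in enumerate([1] * 7 + [2] * 3 + [3])}
print(good_turing_counts(bigram_counts))
# counts 1 and 2 are discounted (to ~0.86 and 1.0); P_GT = c* / total tokens
```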
3. Smoothing 2
Problem with the previous methods: they assign equal probability to all unseen events
Two ways:
- interpolation
- backoff
3.1 Interpolation
Idea: combine higher and lower order N-gram models
- high-order N-grams are sensitive to more context, but have sparse counts
- low-order N-grams have limited context, but robust counts
Model: $$ P_{INT}(w_3 \mid w_1,w_2) = \lambda_1P_1(w_3) + \lambda_2P_2(w_3 \mid w_2)+\lambda_3P_3(w_3\mid w_1,w_2) $$
- $P_N$: N-gram estimate
- $\lambda_i$: interpolation parameters / mixture weights (must sum to 1)
Picking the values of $\lambda_i$: learn them from data, e.g. by optimising perplexity on a held-out data set (see the sketch below)
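A minimal sketch of linear interpolation; the component estimates passed in are hypothetical stand-ins for unigram/bigram/trigram models, and the $\lambda_i$ would normally be tuned on held-out data:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_INT(w3 | w1, w2) = l1*P1(w3) + l2*P2(w3|w2) + l3*P3(w3|w1,w2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9     # mixture weights must sum to 1
    def prob(w1, w2, w3):
        return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)
    return prob

p = interpolated_prob(
    p_uni=lambda w3: 0.01,
    p_bi=lambda w2, w3: 0.05,
    p_tri=lambda w1, w2, w3: 0.0,    # unseen trigram
)
print(p("the", "cat", "sat"))   # non-zero thanks to the lower-order models
```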
3.2 Katz Back-off
Idea: discount the trigram-based probability estimates
Model:
$$ P_{BO}(w_i \mid w_{i-N+1},\dots,w_{i-1}) = \begin{cases} P^*(w_i \mid w_{i-N+1},\dots,w_{i-1}) & \text{if } \operatorname{count}(w_{i-N+1},\dots,w_i)>0,\\ \alpha(w_{i-N+1},\dots,w_{i-1})\,P_{BO}(w_i \mid w_{i-N+2},\dots,w_{i-1}) & \text{otherwise.} \end{cases} $$
- $P^*(w_i \mid w_{i-N+1},\dots,w_{i-1})$: adjusted prediction model (Good-Turing discounted)
- $\alpha(w_{i-N+1},\dots,w_{i-1})$: backoff weights
- use the Good-Turing estimate if the count is non-zero; back off to the lower-order model if the count is zero (a simplified sketch follows)
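A much-simplified sketch of the backoff idea for bigrams, using a fixed absolute discount in place of the Good-Turing discounted $P^*$ of full Katz backoff; the names and toy counts are assumptions:

```python
from collections import Counter

def backoff_bigram_model(bigram_counts, unigram_counts, discount=0.5):
    """Seen bigram: discounted estimate. Unseen bigram: back off to the unigram
    estimate, scaled by alpha(w1) so the probabilities still sum to one."""
    total = sum(unigram_counts.values())
    p_uni = lambda w: unigram_counts[w] / total

    def prob(w1, w2):
        c_w1 = unigram_counts[w1]
        if bigram_counts[(w1, w2)] > 0:
            return (bigram_counts[(w1, w2)] - discount) / c_w1     # discounted P*
        seen = [v for (u, v) in bigram_counts if u == w1]
        reserved = discount * len(seen) / c_w1                     # held-out mass
        alpha = reserved / (1.0 - sum(p_uni(v) for v in seen))     # backoff weight
        return alpha * p_uni(w2)
    return prob

bigrams = Counter({("the", "cat"): 2, ("the", "dog"): 1})
unigrams = Counter({"the": 3, "cat": 2, "dog": 1, "sat": 1})
p = backoff_bigram_model(bigrams, unigrams)
print(p("the", "cat"), p("the", "sat"))   # seen vs. backed-off estimate
```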
3.3 Kneser-Ney smoothing
Idea: takes diversity of histories into account
Model: $$ N_{1+}(\bullet\, w_i)=\left| \{ w_{i-1} : c(w_{i-1}, w_i)>0 \} \right| $$
$$ P_{KN}(w_i)= \frac{N_{1+}(\bullet w_i)}{\sum_w N_{1+}(\bullet w)} $$
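A small sketch of the continuation probability: count how many distinct histories each word follows and normalise. The toy counts illustrate why a frequent but context-bound word ("Francisco") gets a low continuation probability:

```python
from collections import Counter, defaultdict

def continuation_prob(bigram_counts):
    """P_KN(w) = N_{1+}(. w) / sum_w' N_{1+}(. w'): the number of distinct
    left contexts w appears in, normalised over the vocabulary."""
    histories = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            histories[w].add(w_prev)
    n1plus = {w: len(h) for w, h in histories.items()}    # N_{1+}(. w)
    total = sum(n1plus.values())
    return {w: n / total for w, n in n1plus.items()}

bigrams = Counter({("San", "Francisco"): 50, ("the", "glasses"): 2,
                   ("my", "glasses"): 1, ("new", "glasses"): 1})
print(continuation_prob(bigrams))   # {'Francisco': 0.25, 'glasses': 0.75}
```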
| How probability is assigned to unseen events | Methods |
|---|---|
| Uniform probabilities | add-$\alpha$, Good-Turing |
| Probabilities from lower-order n-grams | interpolation, backoff |
| Probability of appearing in new contexts | Kneser-Ney |
Brief introduction to N-gram language models
1 Applications
- spelling correction
- automatic speech recognition
- machine translation
2 Estimate the probabilities: training
- training data: large corpus of general English text
- MLE: maximum likelihood estimation
- MLE assigns 0 probability to anything that hasn’t occurred in the training data
- we don’t know the true probabilities, we estimate instead
- sparse data problem: not enough observations to estimate probabilities well simply by counting observed data; most sentences occur rarely or never in the training data, so we need a smarter approach: the N-gram model
3 N-gram language model (N-gram LM)
3.1 Markov assumption (independence assumption)
The probability of a word only depends on a fixed number of previous words (history)
not always a good assumption
3.2 Probability calculation
To estimate $P(\vec{w})$, use the chain rule, then apply the Markov assumption: $$ P(w_1,\dots,w_n) = \prod_{i=1}^n P(w_i \mid w_1,w_2,\dots,w_{i-1}) \approx P(w_1)\,P(w_2\mid w_1)\prod_{i=3}^n P(w_i \mid w_{i-2}, w_{i-1}) $$ Problem: sentence boundaries
3.3 Beginning / end of sequence
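A minimal sketch of scoring a sentence with the trigram approximation above, assuming the standard convention of padding with `<s>` / `</s>` boundary markers so every factor has a full two-word history and the sentence end is modelled explicitly; the toy probability table is hypothetical:

```python
import math

def sentence_logprob(words, trigram_prob):
    """log P(w_1..w_n) under the trigram approximation, with <s> / </s> padding."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    return sum(math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
               for i in range(2, len(padded)))

# Hypothetical trigram estimates; in practice they come from counts + smoothing.
toy = {("<s>", "<s>", "the"): 0.2, ("<s>", "the", "cat"): 0.1,
       ("the", "cat", "sat"): 0.3, ("cat", "sat", "</s>"): 0.4}
print(sentence_logprob(["the", "cat", "sat"], lambda a, b, c: toy[(a, b, c)]))
```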
What is a corpus?
1. Corpora
Def: a large or complete collection of writings
- For NLP, we need corpora with linguistic annotations
- Markup formats: XML, JSON, CoNLL-style
- e.g. Brown, WSJ, ECI, BNC, Redwoods, Gigaword, AMI, Google Books N-grams, Flickr 8K, English Visual Genome
Why do we need corpora?
- rules: to build manual rules or databases (rule-based, symbolic, knowledge-driven)
- learning: provide example input/output pairs for supervised learning
2. Sentiment Analysis
Goal: predict the opinion expressed in a piece of text (e.g. + or -, positive or negative)
Method: attach gold labels to the reviews (+ or -)
- can be derived automatically from the original data artifact (metadata such as star ratings)
- can be added by a human annotator who reads the text
Issue: whether annotators are consistent with one another (inter-annotator agreement)
Simple sentiment classification algorithm
- Use a sentiment lexicon to count positive and negative words (see the sketch after this list)
- Issues
- sense ambiguity: need more data, e.g. frequency counts, to determine the intended sense
- sarcasm
- the text could describe an expectation rather than express an opinion
- words may describe a character’s attitude, not the author’s
- phrases like “can’t believe how great it is”
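A minimal sketch of the lexicon-counting classifier referenced above; the tiny `positive` / `negative` sets are hypothetical stand-ins for a real sentiment lexicon, and the second example shows the "expectation, not opinion" failure mode:

```python
def lexicon_sentiment(text, positive_words, negative_words):
    """Count positive vs. negative lexicon words and return the majority label.
    All of the issues listed above will fool this simple counting."""
    tokens = text.lower().split()
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    return "+" if pos > neg else "-" if neg > pos else "neutral"

positive = {"great", "love", "wonderful"}
negative = {"awful", "boring", "hate"}
print(lexicon_sentiment("I love this great film", positive, negative))       # '+'
print(lexicon_sentiment("I was expecting it to be great", positive, negative))
# also '+', even though the text only reports an expectation, not an opinion
```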
Choice of training and evaluation data
- Assume that the training data and the test data are sampled from the same distribution
- If they are not, domain adaptation techniques are needed
Brief introduction to NLP
1. Why hard?
Ambiguity
Sparse data (due to Zipf’s Law)
not enough observations to estimate probabilities well simply by counting observed data.
$$f\times r \approx k$$
$f$ = frequency of a word
$r$ = rank of a word
$k$ = a constant
Variation
Expressivity: one form can have different meanings and the same meaning can be expressed with different forms
Context dependence: correct interpretation is context-dependent and often requires world knowledge
Unknown representation: we don’t know how to represent the knowledge a human has / needs
2. Models
Non-probabilistic methods: FSMs, CKY parsers
Probabilistic models: HMMs for POS tagging, PCFGs for syntax
Probabilistic algorithms: Viterbi, probabilistic CKY
3. Zipf’s Law
Implications:
- Regardless of how large the corpus is, there will be a lot of infrequent words
- Holds for many other levels of linguistic structure (e.g. syntactic rules in a CFG)
- Zipf’s Law shows that we need to find ways to estimate probabilities for things we have rarely or never seen during training
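A small sketch of checking Zipf's law empirically: for the top-ranked words of a corpus, $f \times r$ should stay roughly constant. The file name `corpus.txt` is a hypothetical stand-in for any large text:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Return (rank, word, frequency, rank * frequency) for the most frequent words."""
    ranked = Counter(tokens).most_common(top)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

tokens = open("corpus.txt").read().lower().split()   # hypothetical corpus file
for rank, word, freq, k in zipf_table(tokens):
    print(rank, word, freq, k)   # the f*r column should stay in the same ballpark
```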