Brief introduction to N-gram language models
1 Applications:
- spelling correction
- automatic speech recognition
- machine translation
2 Estimating the probabilities: training
- training data: a large corpus of general English text
- MLE: maximum likelihood estimation, i.e., estimate each probability by its relative frequency in the training data (see the sketch after this list)
- MLE assigns probability 0 to anything that hasn't occurred in the training data
- we don't know the true probabilities; we estimate them from data
- sparse data problem: not enough observations to estimate probabilities well by simply counting observed data; whole sentences in particular mostly occur rarely or never in the training data, so we need a smarter unit: the N-gram
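A minimal sketch of MLE estimation by counting, here for bigrams (the toy corpus and the function name are illustrative, not from the source):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (MLE)."""
    history_counts = Counter()  # how often each word occurs as a history
    bigram_counts = Counter()   # how often each (history, word) pair occurs
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # MLE: count(prev, cur) / count(prev)
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = [["i", "want", "to", "eat"], ["i", "want", "to", "read"]]
probs = train_bigram_mle(corpus)
print(probs[("want", "to")])            # 1.0
print(probs.get(("to", "sleep"), 0.0))  # 0.0: unseen, so MLE gives it probability 0
```

The last line shows the sparse-data problem in miniature: any bigram unseen in training gets probability 0 under plain MLE.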
3 N-gram language model (N-gram LM)
3.1 Markov assumption (independence assumption)
The probability of a word depends only on a fixed number of previous words (the history).
Not always a good assumption: language has long-distance dependencies that a short history cannot capture.
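For instance, under a bigram (first-order Markov) assumption, the full history is truncated to a single preceding word (an illustrative example): $$ P(\text{book} \mid \text{I want to read a}) \approx P(\text{book} \mid \text{a}) $$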
3.2 Probability calculation
To estimate $ P(\vec{w}) $, use the chain rule, then apply the Markov assumption (here a trigram model): $$ P(w_1,\dots,w_n) = \prod_{i=1}^n P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^n P(w_i \mid w_{i-2}, w_{i-1}) $$ Problem: sentence boundaries. The first words of a sentence lack a full history, and the model should also assign probability to a sentence ending; the standard fix is to pad each sentence with boundary markers such as <s> and </s> (see the sketch below).
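A minimal sketch of scoring a sentence under a trigram model with boundary padding. The function name is hypothetical, and `trigram_probs` is assumed to be a dict mapping `(w1, w2, w3)` to a smoothed $ P(w_3 \mid w_1, w_2) $, so no queried trigram has probability 0:

```python
import math

def sentence_logprob(tokens, trigram_probs):
    """Log-probability of a sentence under a trigram LM."""
    # Pad with two <s> markers so the first real word has a full trigram
    # history, and with </s> so sentence endings also receive probability.
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    logp = 0.0
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        # Sum log-probabilities instead of multiplying probabilities
        logp += math.log(trigram_probs[(w1, w2, w3)])
    return logp
```

Working in log space avoids numerical underflow when multiplying many small probabilities over a long sentence.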