Brief introduction to N-gram language models

1 Applications

  • spelling correction
  • automatic speech recognition
  • machine translation

2 Estimating the probabilities: training

  • training data: large corpus of general English text
  • MLE: maximum likelihood estimation
    • MLE assigns 0 probability to anything that hasn’t occurred in the training data
  • we don’t know the true probabilities; we estimate them from data instead
  • sparse data problem: not enough observations to estimate probabilities well simply by counting, because many perfectly valid sentences occur rarely or never in the training data; the smarter way is the N-gram model (a counting sketch follows this list)
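
A minimal sketch of MLE by counting, to make the zero-probability issue concrete. The toy corpus and whitespace tokenization here are hypothetical placeholders, not part of the notes:

```python
from collections import Counter

# Toy "corpus" of whitespace-tokenized sentences (hypothetical placeholder data).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

tokens = [w for sent in corpus for w in sent.split()]
unigram_counts = Counter(tokens)
total = len(tokens)

def mle_unigram(word: str) -> float:
    """MLE estimate: P(word) = count(word) / total tokens."""
    return unigram_counts[word] / total

print(mle_unigram("cat"))    # seen word: non-zero estimate
print(mle_unigram("zebra"))  # unseen word: MLE assigns exactly 0.0
```

The same zero-count behavior hits whole sentences even harder, since most sentences never appear verbatim in any corpus; that is the sparse data problem above.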

3 N-gram language model (N-gram LM)

3.1 Markov assumption (independence assumption)

The probability of a word depends only on a fixed number of previous words (the history); see the formal statement after the list below.

  • not always a good assumption (e.g., long-distance dependencies such as subject-verb agreement can span more words than the history window)
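
Written formally (a standard formulation; the symbol $N$ for the N-gram order is an addition to these notes): $$ P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) $$ For $N = 3$ (a trigram model), the history is the two previous words, which is the case worked out in 3.2.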

3.2 Probability calculation

To estimate $P(\vec{w})$, use the chain rule, then apply the Markov assumption (here with a two-word history, i.e. a trigram model):

$$ P(w_1,\dots,w_n) = \prod_{i=1}^n P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^n P(w_i \mid w_{i-2}, w_{i-1}) $$

Problem: sentence boundaries; the first two words have no full two-word history (see 3.3).
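
A minimal sketch of this trigram decomposition with MLE counts, assuming a hypothetical toy corpus (function names and data are illustrative, not from the notes):

```python
from collections import Counter

# Hypothetical toy corpus of tokenized sentences; real models use far more data.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

trigram_counts = Counter()
history_counts = Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[tuple(sent[i - 2:i + 1])] += 1  # count (u, v, w)
        history_counts[tuple(sent[i - 2:i])] += 1      # count (u, v) as a history

def p_cond(w, u, v):
    """MLE estimate of P(w | u, v) = count(u v w) / count(u v)."""
    if history_counts[(u, v)] == 0:
        return 0.0  # unseen history: MLE gives 0 (the sparse data problem again)
    return trigram_counts[(u, v, w)] / history_counts[(u, v)]

def inner_prob(sent):
    """Product of P(w_i | w_{i-2}, w_{i-1}) for i >= 3; the boundary
    terms P(w_1) and P(w_2 | w_1) are deferred to 3.3."""
    p = 1.0
    for i in range(2, len(sent)):
        p *= p_cond(sent[i], sent[i - 2], sent[i - 1])
    return p

print(inner_prob("the cat sat on the log".split()))  # 0.5
```

Note that this sentence never occurs verbatim in the toy corpus, yet it still gets a non-zero probability: that is the payoff of decomposing the sentence into short, frequently observed N-grams.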

3.3 Beginning / end of sequence