1. What is the fundamental task of a Language Model?
A language model estimates the probability of a sequence of words. Given a sentence consisting of words $(w_1, w_2, …, w_m)$, the model estimates:
\[P(w_1, w_2, ..., w_m)\]
Intuitively, a good language model assigns higher probability to fluent, natural sentences and lower probability to unnatural word sequences.
For example, it should assign a higher probability to:
The cat sits on the mat.
than to:
Mat the on sits cat the.
Therefore, the fundamental task of a language model is to model the probability distribution of word sequences.
1.1 Traditional Language Models: N-gram Models
Before deep learning became dominant, language models were mainly based on statistical methods. A classic example is the N-gram language model.
The probability of a sentence can be decomposed using the chain rule of probability:
\[P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, ..., w_{i-1})\]
This means that the probability of a sentence is the product of the conditional probabilities of each word, given all previous words.
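The chain-rule decomposition can be sketched in a few lines of Python. The tiny conditional-probability table below is invented purely for illustration; a real model would estimate these values from data.

```python
# Toy illustration of the chain rule:
# P(w1, ..., wm) = product over i of P(wi | w1, ..., w_{i-1}).
sentence = ["the", "cat", "sits"]

# Invented conditional probabilities; keys are (history, word) pairs.
cond = {
    ((), "the"): 0.2,               # P(the)
    (("the",), "cat"): 0.1,         # P(cat | the)
    (("the", "cat"), "sits"): 0.3,  # P(sits | the, cat)
}

p = 1.0
for i, w in enumerate(sentence):
    # Multiply in the probability of word w given its full history.
    p *= cond[(tuple(sentence[:i]), w)]

print(round(p, 6))  # 0.2 * 0.1 * 0.3 = 0.006
```

Note that each factor conditions on the *entire* history so far, which is exactly what becomes infeasible to estimate as sentences grow longer.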
However, this is difficult to estimate in practice because of data sparsity. For example:
\[P(w_m \mid w_1, ..., w_{m-1})\]
requires the full context $(w_1, …, w_{m-1})$ to appear frequently enough in the training corpus. In real-world data, many long word sequences rarely or never appear, making the estimation unreliable.
1.2 Markov Assumption
To make language modeling practical, N-gram models use the Markov assumption.
The core idea is:
The probability of the current word depends only on a limited number of previous words, rather than the entire history.
Formally, an N-gram model approximates:
\[P(w_i \mid w_1, ..., w_{i-1})\]
with:
\[P(w_i \mid w_{i-n+1}, ..., w_{i-1})\]
Here, $n$ determines the size of the context window: the model conditions on only the previous $n-1$ words.
Bigram Model
When $n = 2$, the model is called a bigram model. It only considers the previous word:
\[P(w_i \mid w_1, ..., w_{i-1}) \approx P(w_i \mid w_{i-1})\]
Trigram Model
When $n = 3$, the model is called a trigram model. It considers the previous two words:
\[P(w_i \mid w_1, ..., w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})\]
In general, an N-gram model uses the previous $n-1$ words to predict the next word.
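The truncated context an N-gram model conditions on is easy to make concrete. A minimal sketch (the function name is illustrative, not a standard API):

```python
def ngram_context(words, i, n):
    """Return the n-1 words preceding position i.

    This is the only context an n-gram model sees when predicting
    words[i]; fewer words are returned near the sentence start.
    """
    return tuple(words[max(0, i - (n - 1)):i])

words = ["the", "cat", "sits", "on", "the", "mat"]
# Predicting words[4] ("the"):
print(ngram_context(words, 4, 2))  # bigram context:  ('on',)
print(ngram_context(words, 4, 3))  # trigram context: ('sits', 'on')
```

Everything before the window, such as "cat sits" in the bigram case, is simply discarded.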
1.3 Estimating N-gram Probabilities with MLE
N-gram probabilities are usually estimated using Maximum Likelihood Estimation (MLE).
The idea is simple:
If something appears more often in the training corpus, the model considers it more likely.
For a bigram model, we estimate the probability of word $w_i$ appearing after word $w_{i-1}$ as:
\[P(w_i \mid w_{i-1}) = \frac{Count(w_{i-1}, w_i)}{Count(w_{i-1})}\]
where:
- $Count(w_{i-1}, w_i)$ is the number of times the word pair $(w_{i-1}, w_i)$ appears consecutively.
- $Count(w_{i-1})$ is the number of times the word $w_{i-1}$ appears.
In other words:
Among all occurrences of $w_{i-1}$, how often is it followed by $w_i$?
Example:
If "deep" appears 100 times, and "deep learning" appears 60 times, then:
\[P(\text{learning} \mid \text{deep}) = \frac{Count(\text{deep}, \text{learning})}{Count(\text{deep})} = \frac{60}{100} = 0.6\]
So the estimated probability of "learning" appearing after "deep" is 0.6.
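This counting procedure takes only a few lines with `collections.Counter`. The toy corpus below is invented for the example:

```python
from collections import Counter

# Minimal MLE bigram estimator over a toy corpus.
corpus = "deep learning is fun and deep learning is powerful".split()

unigrams = Counter(corpus)                    # Count(w)
bigrams = Counter(zip(corpus, corpus[1:]))    # Count(w_prev, w)

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("deep", "learning"))  # 2/2 = 1.0
print(bigram_prob("is", "fun"))         # 1/2 = 0.5
```

In this corpus every occurrence of "deep" is followed by "learning", so the MLE estimate is 1.0, while "is" is followed by "fun" in only one of its two occurrences.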
1.4 Limitations of N-gram Models
N-gram models are simple and interpretable, but they have three major limitations.
(1) Data Sparsity
Imagine you are learning English only by memorizing exact sentences from a textbook. If you see a new sentence that is grammatically correct but never appeared in the textbook, you may mistakenly think it is invalid.
N-gram models rely on exact word sequence counts. If a valid phrase never appears in the corpus, the model may assign it zero probability.
For example, if "neural language model" never appears in the training data, a trigram model may fail to estimate its probability correctly.
This is known as the zero-probability problem.
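The same counting code from above makes the problem visible: any bigram absent from the training corpus receives probability exactly zero under unsmoothed MLE. A minimal sketch:

```python
from collections import Counter

# Unsmoothed MLE assigns zero probability to any unseen bigram.
corpus = "the cat sits on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # Counter returns 0 for missing pairs, so unseen bigrams get P = 0.
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # seen in training:   0.5
print(bigram_prob("the", "dog"))  # never seen:         0.0
```

"the dog" is a perfectly valid English phrase, yet because the product in the chain rule contains this zero factor, the model would assign probability zero to any sentence containing it.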
(2) Limited Context
Imagine reading a long story but only being allowed to remember the last one or two words. You may understand local phrases, but you will easily miss the larger meaning of the sentence.
N-gram models only use a fixed-size context window, so they struggle to capture long-range dependencies.
For example:
The book that I borrowed from the library yesterday was very interesting.
To understand that "was" agrees with the subject "book", the model needs information from much earlier in the sentence. Traditional N-gram models are not good at handling this.
(3) Weak Semantic Representation
Imagine treating every word as a separate ID number, without knowing what the word means. You may know that two words appear in similar places, but you do not truly understand that they have similar meanings.
N-gram models treat words as discrete symbols. They rely on surface-level co-occurrence statistics and do not learn deep semantic relationships.
For example, they cannot naturally understand that "car" and "automobile" are semantically similar unless this relation is reflected in the corpus statistics.