Maximum Likelihood Estimation

Update: 2023-10-13

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.

In the previous section, you used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.

Sequence Probability

Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I am a teacher," and selecting the one with the highest probability.

The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

P(student|I,am,a) = \frac{P(I,am,a,student)}{P(I,am,a)}

The joint probability of the word sequence "I am a student" can be measured as follows:

P(I,am,a,student) = \frac{\#(I,am,a,student)}{\sum_{\forall a}\sum_{\forall b}\sum_{\forall c}\sum_{\forall d}\#(w_a,w_b,w_c,w_d)}
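The following sketch illustrates how these two quantities could be estimated from raw counts, following the equations above. The toy corpus, the `ngrams()` helper, and the variable names are all hypothetical; in practice, the counts would come from the corpus used in the previous section:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus; a real model would count over a large corpus.
tokens = 'I am a student and I am a teacher'.split()

trigram_counts = Counter(ngrams(tokens, 3))
fourgram_counts = Counter(ngrams(tokens, 4))

# MLE estimates of the joint probabilities: the count of each n-gram
# divided by the total count of all n-grams of the same length.
p_joint_4 = fourgram_counts[('I', 'am', 'a', 'student')] / sum(fourgram_counts.values())
p_joint_3 = trigram_counts[('I', 'am', 'a')] / sum(trigram_counts.values())

# Conditional probability P(student | I, am, a) as the ratio of the two.
p_student_given_i_am_a = p_joint_4 / p_joint_3
```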

Counting the occurrences of n-grams, especially when n can be arbitrarily large, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the Chain Rule and the Markov Assumption.

Chain Rule

By applying the chain rule, the above joint probability can be decomposed into:

P(I,am,a,student) = P(I) \cdot P(am|I) \cdot P(a|I,am) \cdot P(student|I,am,a)

Thus, the probability of any given word sequence w_1^n=(w_1,\ldots,w_n) can be measured as:

P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1})

The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring P(w_n|w_1^{n-1}) is as challenging as measuring P(w_1^n).

Markov Assumption

The Markov assumption (also known as the Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given the present state. In the context of language modeling, it implies that the next word generated by the model depends solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})

The joint probability can now be measured as the product of one unigram probability and a series of bigram probabilities.
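As a minimal sketch, this simplified product could be computed as follows, assuming `unigram_probs` and `bigram_probs` are dictionaries of MLE estimates from the previous section (both names are hypothetical):

```python
def sequence_probability(words, unigram_probs, bigram_probs):
    """P(w_1..w_n) under the Markov assumption:
    P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1})."""
    prob = unigram_probs.get(words[0], 0.0)
    for prev, curr in zip(words, words[1:]):
        prob *= bigram_probs.get((prev, curr), 0.0)
    return prob

# e.g., sequence_probability(['I', 'am', 'a', 'student'], unigram_probs, bigram_probs)
```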

How do the chain rule and Markov assumption simplify the estimation of sequence probability?

Initial Word Probability

Let us consider the unigram probabilities P(the) and P(The). In general, "the" appears more frequently than "The" such that:

P(the) > P(The)

Let w_0 be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial words of the text, denoted as P(the|w_0) and P(The|w_0), respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

P(the|w_0) < P(The|w_0)

This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.

Thus, to predict a more probable initial word, it is better to consider the bigram probability P(w_1|w_0) rather than the unigram probability P(w_1) when measuring sequence probability:

P(w_1^n) = P(w_1|w_0) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})

This enhancement allows us to express the sequence probability as a simple product of bigram probabilities:

P(w_1^n) = \prod_{i=1}^n P(w_i|w_{i-1})
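Here is a sketch of this formulation, revising the earlier one so that the initial word is also scored by a bigram probability. It assumes the bigram estimates include counts for the artificial token; the string `'<s>'` standing in for w_0 is an arbitrary choice:

```python
BOS = '<s>'  # artificial token standing in for w_0 (an arbitrary choice)

def sequence_probability(words, bigram_probs):
    """P(w_1..w_n) = prod_i P(w_i | w_{i-1}), where w_0 = BOS."""
    prob, prev = 1.0, BOS
    for curr in words:
        prob *= bigram_probs.get((prev, curr), 0.0)
        prev = curr
    return prob
```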

Is it worth considering the end of the text by introducing another artificial token, w_{n+1}, to improve last-word prediction by multiplying the above product with P(w_{n+1}|w_n)?

The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and numerical underflow, where the product of many small probabilities becomes too small for floating-point numbers to represent. In practice, logarithmic probabilities are calculated instead:

\log P(w_1^n) = \log \left( \prod_{i=1}^n P(w_i|w_{i-1}) \right) = \sum_{i=1}^n \log P(w_i|w_{i-1})
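A log-space version of the sketch above follows the same assumptions; an unseen bigram has an estimated probability of zero here, so its log probability is returned as negative infinity:

```python
import math

BOS = '<s>'  # same artificial token as above

def sequence_log_probability(words, bigram_probs):
    """log P(w_1..w_n) = sum_i log P(w_i | w_{i-1}), where w_0 = BOS."""
    log_prob, prev = 0.0, BOS
    for curr in words:
        p = bigram_probs.get((prev, curr), 0.0)
        if p == 0.0:
            return float('-inf')  # log(0) is undefined; the sequence has zero probability
        log_prob += math.log(p)
        prev = curr
    return log_prob
```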
