Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.

In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
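To make this concrete, below is a minimal Python sketch of MLE for unigram and bigram probabilities. The toy corpus and the helper names `unigram_prob` and `bigram_prob` are illustrative assumptions, not code from the course materials.

```python
# A minimal sketch of MLE for unigram and bigram probabilities (toy corpus assumed).
from collections import Counter

corpus = [
    ["I", "am", "a", "student"],
    ["I", "am", "a", "teacher"],
    ["you", "are", "a", "student"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())

def unigram_prob(w):
    # MLE: count of w divided by the total number of tokens
    return unigram_counts[w] / total_tokens

def bigram_prob(w1, w2):
    # MLE: count of the bigram (w1, w2) divided by the count of w1
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(unigram_prob("a"))       # 3/12 = 0.25
print(bigram_prob("am", "a"))  # 2/2  = 1.0
```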

Sequence Probability

Let us examine a model that takes a sequence of words and generates the next word. Given the word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities of potential continuations, such as "I am a student" or "I am a teacher", and selecting the one with the highest probability.

The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

P(student|I,am,a) = \frac{P(I,am,a,student)}{P(I,am,a)}

The joint probability of the word sequence "I am a student" can be measured as follows:

P(I,am,a,student) = \frac{\#(I,am,a,student)}{\sum_{\forall a}\sum_{\forall b}\sum_{\forall c}\sum_{\forall d}\#(w_a,w_b,w_c,w_d)}
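As a rough illustration of these two estimates, the sketch below (using a hypothetical toy corpus) computes the conditional probability of "student" following "I am a" from raw n-gram counts; in a large corpus, the normalizing totals in the two joint-probability estimates are nearly equal and approximately cancel, leaving a simple ratio of counts.

```python
# A minimal sketch (toy corpus assumed) of estimating P(student | I, am, a)
# from n-gram counts, following the equations above.
from collections import Counter

corpus = [
    ["I", "am", "a", "student"],
    ["I", "am", "a", "teacher"],
    ["I", "am", "a", "student"],
]

def ngrams(sent, n):
    # All contiguous n-grams in a sentence
    return [tuple(sent[i:i + n]) for i in range(len(sent) - n + 1)]

four_grams = Counter(g for sent in corpus for g in ngrams(sent, 4))
tri_grams = Counter(g for sent in corpus for g in ngrams(sent, 3))

# Conditional MLE: #(I, am, a, student) / #(I, am, a)
p_next = four_grams[("I", "am", "a", "student")] / tri_grams[("I", "am", "a")]
print(p_next)  # 2/3
```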

Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the Chain Rule and the Markov Assumption.

Q7: What are the key differences between conditional and joint probabilities in sequence modeling, and how are they practically applied?

Chain Rule

By applying the chain rule, the above joint probability can be decomposed into:

P(I,am,a,student) = P(I) \cdot P(am|I) \cdot P(a|I,am) \cdot P(student|I,am,a)

Thus, the probability of any given word sequence w_1^n = (w_1, \ldots, w_n) can be measured as:

P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1})

The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue, because measuring P(w_n|w_1^{n-1}) is just as challenging as measuring P(w_1^n).

Markov Assumption

The Markov assumption (aka. the Markov property) states that the future state of a system depends only on its present state and is independent of its past states. In the context of language modeling, it implies that the next word generated by the model depends solely on the current word. This assumption dramatically simplifies the chain rule above:

P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})

The joint probability can now be measured by the product of unigram and bigram probabilities.
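The sketch below (reusing a hypothetical toy corpus and ignoring the initial-word issue discussed next) illustrates how the Markov assumption turns the sequence probability into a product of unigram and bigram estimates.

```python
# A minimal sketch (toy corpus assumed) of the Markov approximation:
# P(w_1,...,w_n) ~ P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1})
from collections import Counter

corpus = [
    ["I", "am", "a", "student"],
    ["I", "am", "a", "teacher"],
    ["you", "are", "a", "student"],
]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
total = sum(unigram.values())

def sequence_prob(words):
    # Unigram probability for the first word, bigram probabilities afterwards
    p = unigram[words[0]] / total
    for w1, w2 in zip(words, words[1:]):
        p *= bigram[(w1, w2)] / unigram[w1]
    return p

print(sequence_prob(["I", "am", "a", "student"]))  # (2/12) * 1 * 1 * (2/3) ~ 0.111
```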

Q8: How do the Chain Rule and Markov Assumption simplify the estimation of sequence probability?

Initial Word Probability

Let us consider the unigram probabilities P(the) and P(The). In general, "the" appears more frequently than "The" such that:

P(the) > P(The)

Let w_0 be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial words of the text, denoted as P(the|w_0) and P(The|w_0), respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

P(the|w_0) < P(The|w_0)

Note that this is not necessarily true if the model is trained on informal writing, such as social media data, where conventional capitalization is often neglected.

Thus, to predict a more probable initial word, it is better to consider the bigram probability P(w_1|w_0) rather than the unigram probability P(w_1) when measuring the sequence probability:

P(w_1^n) = P(w_1|w_0) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})

This enhancement allows us to express the sequence probability as a simple product of bigram probabilities:

P(w_1^n) = \prod_{i=1}^n P(w_i|w_{i-1})

The multiplication of numerous probabilities can be computationally problematic due to slow processing and the risk that the product underflows the numerical precision of the system. In practice, logarithmic probabilities are calculated instead:

\log P(w_1^n) = \log \left( \prod_{i=1}^n P(w_i|w_{i-1}) \right) = \sum_{i=1}^n \log P(w_i|w_{i-1})
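The sketch below (with a hypothetical toy corpus and a hypothetical start marker `<s>` standing in for w_0) puts the last two ideas together: the first word is conditioned on the artificial start token, and the bigram probabilities are summed in log space to avoid underflow.

```python
# A minimal sketch (toy corpus and start marker assumed) of the bigram model
# with an artificial token w_0 and log-probability summation.
import math
from collections import Counter

W0 = "<s>"  # hypothetical marker standing in for w_0

corpus = [
    [W0, "The", "teacher", "is", "here"],
    [W0, "The", "student", "is", "here"],
    [W0, "I", "am", "a", "student"],
]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def log_sequence_prob(words):
    # log P(w_1..w_n) = sum_i log P(w_i | w_{i-1}), with w_0 = <s>
    seq = [W0] + words
    return sum(math.log(bigram[(w1, w2)] / unigram[w1])
               for w1, w2 in zip(seq, seq[1:]))

print(log_sequence_prob(["The", "student", "is", "here"]))
# log((2/3) * (1/2) * (1/2) * 1) ~ -1.79
```

Summing log probabilities keeps the computation numerically stable; note that any unseen bigram would make the logarithm undefined, which is where the smoothing techniques from the previous section come in.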

Q9: Is it worth considering the end of the text by introducing another artificial token, w_{n+1}, to improve last-word prediction by multiplying the above product with P(w_{n+1}|w_n)?

References

Maximum likelihood estimation, Wikipedia