
Entropy and Perplexity

Entropy

Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X = \{x_1, \ldots, x_n\}$ is defined as:

$$H(X) = \sum_{i=1}^n P(x_i) \cdot (-\log P(x_i)) = -\sum_{i=1}^n P(x_i) \cdot \log P(x_i)$$

$P(x_i)$ is the probability of $x_i$. The self-information of $x_i$ is defined as $-\log P(x_i)$, which measures how much information is gained when $x_i$ occurs. The negative sign reflects that as the probability of $x_i$ increases, its self-information decreases.

Entropy has several properties, including:

  • It is non-negative: $H(X) \geq 0$.

  • It is at its minimum when $x_i$ is entirely predictable (all probability mass on a single outcome).

  • It is at its maximum when all outcomes of $x_i$ are equally likely.
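As a quick numeric illustration of the definition and these properties, the snippet below computes the entropy of a discrete distribution; the helper function and the coin distributions are arbitrary examples, not part of the course code:

    import math

    def entropy(dist: dict[str, float], base: float = 2.0) -> float:
        # H(X) = -sum_i P(x_i) * log P(x_i); zero-probability outcomes contribute nothing.
        return -sum(p * math.log(p, base) for p in dist.values() if p > 0)

    print(entropy({'heads': 0.5, 'tails': 0.5}))   # 1.0 bit: maximum, all outcomes equally likely
    print(entropy({'heads': 0.9, 'tails': 0.1}))   # ~0.469 bits: more predictable, hence lower entropy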


Q10: Why is logarithmic scale used to measure self-information in entropy calculations?

Sequence Entropy

Sequence entropy measures the unpredictability or information content of a word sequence; it quantifies how uncertain or random the sequence is.

Assume a long sequence of words, $W = [w_1, \ldots, w_n]$, concatenating the entire text from a language $\mathcal{L}$. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ be the set of all possible sequences derived from $W$, where $S_1$ is the shortest sequence (a single word) and $S_m = W$ is the longest sequence. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The entropy rate (per-word entropy), $H'(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H'(W) = -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

In theory, there is an infinite number of unobserved word sequences in the language $\mathcal{L}$. To estimate the true entropy of $\mathcal{L}$, we need to take the limit of $H'(W)$ as $n$ approaches infinity:

$$H(\mathcal{L}) = \lim_{n \rightarrow \infty} H'(W) = \lim_{n \rightarrow \infty} -\frac{1}{n} \sum_{j=1}^m P(S_j) \cdot \log P(S_j)$$

The Shannon-McMillan-Breiman theorem implies that if the language $\mathcal{L}$ is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(\mathcal{L})$, because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its respective probability.


The bigram model in the previous section is stationary because all probabilities rely on the same condition, $P(w_{i+1}, w_i)$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.

By applying this theorem, $H(\mathcal{L})$ can be approximated:

$$H(\mathcal{L}) \approx \lim_{n \rightarrow \infty} -\frac{1}{n} \log P(S_m)$$

Consequently, $H'(W)$ is approximated as follows, where $S_m = W$:

$$H'(W) \approx -\frac{1}{n} \log P(W)$$


Q11: What indicates high entropy in a text corpus?

Perplexity

Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W$ is measured as:

$$PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}$$

Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.

Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H'(W)$:

$$\begin{array}{rcl} H'(W) &\approx& -\frac{1}{n} \log_2 P(W) \\ 2^{H'(W)} &\approx& 2^{-\frac{1}{n} \log_2 P(W)} \\ &=& 2^{\log_2 P(W)^{-\frac{1}{n}}} \\ &=& P(W)^{-\frac{1}{n}} \\ &=& PP(W) \end{array}$$
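To make the relationship concrete, the sketch below converts a sequence probability into an entropy rate and a perplexity; the bigram probabilities are made-up values for a hypothetical three-word sequence, not estimates from a corpus:

    import math

    def perplexity(log2_prob: float, n: int) -> float:
        # H'(W) = -(1/n) * log2 P(W);  PP(W) = 2^{H'(W)}
        entropy_rate = -log2_prob / n
        return 2 ** entropy_rate

    probs = [0.2, 0.1, 0.05]                       # hypothetical bigram probabilities, P(W) = 0.001
    log2_prob = sum(math.log2(p) for p in probs)
    print(perplexity(log2_prob, n=len(probs)))     # ~10.0, i.e., P(W) ** (-1/3)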


Q12: What is the relationship between corpus entropy and language model perplexity?

References

  1. Entropy, Wikipedia

  2. Perplexity, Wikipedia


Homework

HW2: Language Models

Task 1: Bigram Modeling

Your goal is to build a bigram model using (1) Laplace smoothing with normalization and (2) initial word probabilities, by adding the artificial token $w_0$ at the beginning of every sentence.

Implementation

  1. Create a file language_models.py in the src/homework/ directory.

  2. Define a function named bigram_model() that takes a path to a text file and returns a dictionary of bigram probabilities estimated from that file.

  3. Use the following constants to indicate the unknown and initial probabilities:

    UNKNOWN = ''
    INIT = '[INIT]'

Notes

  1. Test your model using dat/chronicles_of_narnia.txt.

  2. Each line should be treated independently for bigram counting, so that the INIT token precedes the first word of each line.

  3. Use smoothing with normalization such that all probabilities sum to 1.0 for any given previous word.

  4. Unknown word probabilities should be retrieved using the UNKNOWN key for both the previous word ($w_{i-1}$) and the current word ($w_i$).

Task 2: Sequence Generation

Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.

Implementation

Under language_models.py, define a function named sequence_generator() that takes the following parameters:

  • A bigram model (the resulting dictionary of Task 1)

  • The initial word (the first word to appear in the sequence)

  • The length of the sequence (the number of tokens in the sequence)

This function aims to generate a sequence of tokens that adheres to the following criteria:

  • It must contain exactly the specified number of tokens.

  • Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of $\mathrm{floor}(21 / 5) = 4$ punctuation tokens are permitted).

  • Excluding punctuation, there should be no redundant tokens in the sequence.

The goal of this task is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one; a minimal sketch of the greedy step appears after the description of the return value below. Given the input word $w$, a potential strategy is as follows:

  1. Identify the next word $w'$ where the bigram probability $P(w'|w)$ is maximized.

  2. If $w'$ fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.

  3. Make $w = w'$ and repeat #1 until you reach the specified sequence length.

Finally, the function returns a tuple comprising the following two elements:

  • The list of tokens in the sequence

  • The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base $e$, provided as the math.log() function in Python.
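The following is a minimal sketch of the greedy selection step described above, assuming the bigram model is the nested dictionary built in Task 1; it omits the punctuation-ratio bookkeeping and the INIT/UNKNOWN handling that a full solution must add:

    import math

    def greedy_next(bigram: dict, prev: str, used: set) -> tuple[str, float] | None:
        # Return the most probable successor of prev that has not been used yet.
        candidates = bigram.get(prev)
        if not candidates:
            return None
        for word, prob in sorted(candidates.items(), key=lambda x: x[1], reverse=True):
            if word not in used:                 # skip redundant tokens
                return word, math.log(prob)      # log-likelihood contribution (base e)
        return None

Calling this helper repeatedly, adding each chosen word to used, and summing the returned log values would yield both the token list and the log-likelihood to return.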

Extra Credit

Create a function called sequence_generator_plus() that takes the same input parameters as the existing sequence_generator() function. This new function should generate sequences with higher probability scores and better semantic coherence compared to the original implementation.

Submission

Commit and push the language_models.py file to your GitHub repository.

Rubric

  • Task 1: Bigram Modeling (5 points)

  • Task 2: Sequence Generator (4.6 points), Extra Credit (2 points)

  • Concept Quiz (2.4 points)

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.


In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.

Sequence Probability

Let us examine a model that takes a sequence of words and generates the next word. Given the word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities of potential continuations, such as "I am a student" or "I am a teacher", and selecting the one with the highest probability.

The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

$$P(student|I,am,a) = \frac{P(I,am,a,student)}{P(I,am,a)}$$

The joint probability of the word sequence "I am a student" can be measured as follows:

$$P(I,am,a,student) = \frac{\#(I,am,a,student)}{\sum_{\forall a}\sum_{\forall b}\sum_{\forall c}\sum_{\forall d}\#(w_a,w_b,w_c,w_d)}$$

Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the Chain Rule and the Markov Assumption.


Q7: What are the key differences between conditional and joint probabilities in sequence modeling, and how are they practically applied?

Chain Rule

By applying the chain rule, the above joint probability can be decomposed into:

$$P(I,am,a,student) = P(I) \cdot P(am|I) \cdot P(a|I,am) \cdot P(student|I,am,a)$$

Thus, the probability of any given word sequence $w_1^n = (w_1, \ldots, w_n)$ can be measured as:

$$P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1})$$

The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring $P(w_n|w_1^{n-1})$ is as challenging as measuring $P(w_1^n)$.

Markov Assumption

The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

$$P(w_1^n) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})$$

The joint probability can now be measured by the product of unigram and bigram probabilities.


Q8: How do the Chain Rule and Markov Assumption simplify the estimation of sequence probability?

Initial Word Probability

Let us consider the unigram probabilities $P(the)$ and $P(The)$. In general, "the" appears more frequently than "The" such that:

$$P(the) > P(The)$$

Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial word of the text, denoted as $P(the|w_0)$ and $P(The|w_0)$, respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

$$P(the|w_0) < P(The|w_0)$$


This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.

Thus, to predict a more probable initial word, it is better to consider the bigram probability $P(w_1|w_0)$ rather than the unigram probability $P(w_1)$ when measuring the sequence probability:

$$P(w_1^n) = P(w_1|w_0) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1})$$

This enhancement allows us to express the sequence probability as a simple product of bigram probabilities:

$$P(w_1^n) = \prod_{i=1}^n P(w_i|w_{i-1})$$

The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and the potential for the product to underflow the limits of floating-point precision. In practice, logarithmic probabilities are used instead:

$$\log P(w_1^n) = \log\Big(\prod_{i=1}^n P(w_i|w_{i-1})\Big) = \sum_{i=1}^n \log P(w_i|w_{i-1})$$
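For example, under this formulation the log-probability of a sequence is just a sum of log bigram probabilities. The sketch below assumes a nested dictionary of bigram probabilities (such as the one estimated in the n-gram section) in which every bigram of the sequence has been observed, and uses '[INIT]' as a stand-in for the artificial token $w_0$:

    import math

    def log_sequence_probability(bigram: dict, words: list, init: str = '[INIT]') -> float:
        # log P(w_1..n) = sum_i log P(w_i | w_{i-1}), with w_0 = init
        log_prob, prev = 0.0, init
        for word in words:
            log_prob += math.log(bigram[prev][word])   # assumes the bigram was observed
            prev = word
        return log_prob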


Q9: Is it worth considering the end of the text by introducing another artificial token, $w_{n+1}$, to improve last-word prediction by multiplying the above product with $P(w_{n+1} | w_n)$?

References

  1. Maximum likelihood estimation, Wikipedia

Smoothing

Unigram Smoothing

The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.

Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{\mathcal{L}}(w_i) = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} (\#(w_k) + 1)} = \frac{\#(w_i) + 1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

Thus, the probability of any unknown word $w_*$ with Laplace smoothing is calculated as follows:

$$P_{\mathcal{L}}(w_*) = \frac{1}{\sum_{\forall w_k \in V} \#(w_k) + |V|}$$

The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.

Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{i=1}^v P(w_i) = \sum_{i=1}^v P_{\mathcal{L}}(w_i) = 1$$
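A quick numeric check of this identity with arbitrary toy counts:

    counts = {'a': 5, 'b': 1}                    # toy corpus: V = {a, b}
    total = sum(counts.values()) + len(counts)   # 6 + 2 = 8
    laplace = {w: (c + 1) / total for w, c in counts.items()}
    print(laplace)                               # {'a': 0.75, 'b': 0.25}
    print(sum(laplace.values()))                 # 1.0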

Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

  • L1: Import the unigram_count() function from the src.ngram_models package.

  • L3: Define a constant representing the unknown word.

  • L7: Increment the total count by the vocabulary size.

  • L8: Increment each unigram count by 1.

  • L9: Add the unknown word to the unigrams with a probability of 1 divided by the total count.

We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:

  • L1: Import test_unigram() from the ngram_models package.

Compared to the unigram results without smoothing (see the comparison output below), the probabilities for these top unigrams have slightly decreased.


Q4: When applying Laplace smoothing, do unigram probabilities always decrease? If not, what conditions can cause a unigram's probability to increase?

The unigram probability of any word (including unknown) can be retrieved using the UNKNOWN key:

  • L2: Use get() to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.

  • L2: Test a known word, 'Aslan', and an unknown word, 'Jinho'.

Bigram Smoothing

The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V|}$$

Thus, the probability of an unknown bigram $(w_{u-1}, w_{*})$, where $w_{u-1}$ is known but $w_{*}$ is unknown, is calculated as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V|}$$


Q5: What does the Laplace smoothed bigram probability of $(w_{u-1}, w_{u})$ represent when $w_{u-1}$ is unknown, and what is a potential problem with this estimation?

Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:

  • L1: Import the bigram_count() function from the src.ngram_models package.

  • L5: Create a set vocab containing all unique $w_{i-1}$.

  • L6-7: Add all unique $w_i$ to vocab.

  • L11: Calculate the total count of all bigrams with the same previous word.

  • L12: Calculate and store the probabilities of each current word given the previous word.

  • L13: Calculate the probability for an unknown current word.

  • L16: Add a probability for an unknown previous word.

We then test bigram_smoothing() with the same text file:

  • L1: Import the test_bigram() function from the ngram_models package.

Finally, we test the bigram estimation using smoothing for unknown sequences:

  • L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.

  • L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.

  • L3: The tuple word is unpacked and passed as the second and third parameters.

Normalization

Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a word is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:

    You are a student
    You and I are students

There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $v = 7$. Before Laplace smoothing, the bigram probabilities of $(w_{i-1} = You, w_{i} = *)$ are estimated as follows:

$$P(are|You) = P(and|You) = 1/2$$
$$P(are|You) + P(and|You) = 1$$

However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:

$$P_{\mathcal{L}}(are|You) = P_{\mathcal{L}}(and|You) = (1+1)/(2+7) = 2/9$$
$$P_{\mathcal{L}}(are|You) + P_{\mathcal{L}}(and|You) = 4/9$$

The bigram distribution for $w_{i-1}$ can be normalized to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_i|$, to the denominator instead of $|V|$:

$$P_{\mathcal{L}}(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i}) + 1}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k) + |V_i|}$$

Consequently, the probability of an unknown bigram $(w_{u-1}, w_{*})$ can be calculated with the normalization as follows:

$$P_{\mathcal{L}}(w_*|w_{u-1}) = \frac{1}{\sum_{\forall w_k \in V_{i}} \#(w_{u-1},w_k) + |V_i|}$$

For the above example, $|V_{i}| = |\{are, and\}| = 2$. Once you apply $|V_{i}|$ to $P_{\mathcal{L}}(*|You)$, the sum of its bigram probabilities becomes 1:

$$P_{\mathcal{L}}(are|You) = P_{\mathcal{L}}(and|You) = (1+1)/(2+2) = 1/2$$
$$P_{\mathcal{L}}(are|You) + P_{\mathcal{L}}(and|You) = 1$$

A major drawback of this normalization is that the probability cannot be measured when $w_{u-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability of $(w_*, w_u)$, where the previous word is unknown, as follows:

$$P_{\mathcal{L}}(w_u|w_*) = \min(\{P_{\mathcal{L}}(w_*|w_k) : \forall w_k \in V\})$$
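A minimal sketch of bigram smoothing with this normalization is given below; it reuses bigram_count() from the course's ngram_models module, as in the earlier snippets, but it is only one possible implementation of the formulas above, not a reference solution:

    from src.ngram_models import bigram_count, Bigram

    UNKNOWN = ''

    def bigram_smoothing_normalized(filepath: str) -> Bigram:
        counts = bigram_count(filepath)
        bigrams, unknown_probs = dict(), []

        for prev, ccs in counts.items():
            # Normalize by |V_i|: the number of word types observed after prev.
            total = sum(ccs.values()) + len(ccs)
            d = {curr: (count + 1) / total for curr, count in ccs.items()}
            d[UNKNOWN] = 1 / total               # unknown current word given prev
            bigrams[prev] = d
            unknown_probs.append(d[UNKNOWN])

        # Unknown previous word: the minimum unknown probability across all bigrams.
        bigrams[UNKNOWN] = min(unknown_probs)
        return bigrams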


Q6: Why is it problematic when bigram probabilities following a given word don't sum to 1?

References

  1. Source: smoothing.py

  2. Additive smoothing, Wikipedia




    from src.ngram_models import unigram_count, Unigram
    
    UNKNOWN = ''
    
    def unigram_smoothing(filepath: str) -> Unigram:
        counts = unigram_count(filepath)
        total = sum(counts.values()) + len(counts)
        unigrams = {word: (count + 1) / total for word, count in counts.items()}
        unigrams[UNKNOWN] = 1 / total
        return unigrams
    from src.ngram_models import test_unigram
    
    corpus = 'dat/chronicles_of_narnia.txt'
    test_unigram(corpus, unigram_smoothing)
             I 0.010225
         Aslan 0.001796
          Lucy 0.001762
        Edmund 0.001369
        Narnia 0.001339
       Caspian 0.001300
          Jill 0.001226
         Peter 0.001005
        Shasta 0.000902
        Digory 0.000899
       Eustace 0.000853
         Susan 0.000636
        Tirian 0.000585
         Polly 0.000533
        Aravis 0.000523
          Bree 0.000479
    Puddleglum 0.000479
        Scrubb 0.000469
        Andrew 0.000396
       Unigram  With Smoothing   W/O Smoothing
             I        0.010225        0.010543
         Aslan        0.001796        0.001850
          Lucy        0.001762        0.001815
        Edmund        0.001369        0.001409
        Narnia        0.001339        0.001379
       Caspian        0.001300        0.001338
          Jill        0.001226        0.001262
         Peter        0.001005        0.001034
        Shasta        0.000902        0.000928
        Digory        0.000899        0.000925
       Eustace        0.000853        0.000877
         Susan        0.000636        0.000654
        Tirian        0.000585        0.000601
         Polly        0.000533        0.000547
        Aravis        0.000523        0.000537
          Bree        0.000479        0.000492
    Puddleglum        0.000479        0.000492
        Scrubb        0.000469        0.000482
        Andrew        0.000396        0.000406
    def smoothed_unigram(probs: Unigram, word: str) -> float:
        return probs.get(word, probs[UNKNOWN])
    unigram = unigram_smoothing(corpus)
    for word in ['Aslan', 'Jinho']:
        print(f'{word} {smoothed_unigram(unigram, word):.6f}')
    Aslan 0.001796
    Jinho 0.000002
    from src.ngram_models import bigram_count, Bigram
    
    def bigram_smoothing(filepath: str) -> Bigram:
        counts = bigram_count(filepath)
        vocab = set(counts.keys())
        for _, css in counts.items():
            vocab.update(css.keys())
    
        bigrams = dict()
        for prev, ccs in counts.items():
            total = sum(ccs.values()) + len(vocab)
            d = {curr: (count + 1) / total for curr, count in ccs.items()}
            d[UNKNOWN] = 1 / total
            bigrams[prev] = d
    
        bigrams[UNKNOWN] = 1 / len(vocab)
        return bigrams
    from src.ngram_models import test_bigram
    
    corpus = 'dat/chronicles_of_narnia.txt'
    test_bigram(corpus, bigram_smoothing)
    I
            'm 0.020590
            do 0.019136
           've 0.011143
           was 0.010598
          have 0.009629
            am 0.009084
           'll 0.008236
         think 0.008115
            'd 0.006661
          know 0.006540
    the
          same 0.008403
         other 0.007591
          King 0.007096
         Witch 0.006673
         whole 0.005119
        others 0.005084
         first 0.004978
         Dwarf 0.004872
          door 0.004837
         great 0.004837
    said
           the 0.039038
             , 0.018270
          Lucy 0.014312
        Edmund 0.011206
       Caspian 0.010049
         Peter 0.009805
          Jill 0.008709
             . 0.008648
        Digory 0.007734
         Aslan 0.007491
    def smoothed_bigram(probs: Bigram, prev: str, curr: str) -> float:
        d = probs.get(prev, None)
        return probs[UNKNOWN] if d is None else d.get(curr, d[UNKNOWN])
    bigram = bigram_smoothing(corpus)
    for word in [('Aslan', 'is'), ('Aslan', 'Jinho'), ('Jinho', 'is')]:
        print(f'{word} {smoothed_bigram(bigram, *word):.6f}')
    ('Aslan', 'is') 0.001146
    ('Aslan', 'Jinho') 0.000076
    ('Jinho', 'is') 0.000081

    Language Models

    A language model is a computational model designed to understand, generate, and predict human language. It captures language patterns, learns the likelihood of a specific term occurring in a given context, and assigns probabilities to word sequences through training on extensive text data.

    Contents

    N-gram Models
    Smoothing
    Maximum Likelihood Estimation
    Entropy and Perplexity
    Assignments

    N-gram Models

    An n-gram is a contiguous sequence of n items from text data. These items can be:

    • Words (most common in language modeling)

    • Characters (useful for morphological analysis)

    • Subword tokens (common in modern NLP)

    • Bytes or other units

    For the sentence "I'm a computer scientist.", we can extract different n-grams:

    • 1-gram (unigram): {"I'm", "a", "computer", "scientist."}

    • 2-gram (bigram): {"I'm a", "a computer", "computer scientist."}

    • 3-gram (trigram): {"I'm a computer", "a computer scientist."}

    In the above example, "I'm" and "scientist." are recognized as individual tokens, which should have been tokenized as ["I", "'m"] and ["scientist", "."].
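    For reference, word n-grams can be extracted from a tokenized sentence with a short helper like the one below; this is a generic sketch, not part of the course package:

        def ngrams(tokens: list, n: int) -> list:
            # Return all contiguous n-grams in tokens as tuples.
            return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        tokens = ['I', "'m", 'a', 'computer', 'scientist', '.']
        print(ngrams(tokens, 2))   # [('I', "'m"), ("'m", 'a'), ('a', 'computer'), ...]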


    Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?

    Unigram Estimation

    Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w_i$ as:

    $$P(w_i) = \frac{\#(w_i)}{\sum_{\forall w_k \in V}\#(w_k)}$$

    Where:

    • $\#(w_i)$: the count of word $w_i$ in the corpus.

    • $V$: the vocabulary (set of all unique words).

    • The denominator represents the total word count.

    Let us define the Unigram type:

    • L1: Import TypeAlias from the typing module (Type aliases).

    • L3: A dictionary where the key is a unigram and the value is its probability.

    Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:

    We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:

    • L3: Calculate the total count of all unigrams in the text.

    • L4: Return a dictionary where each word is a key and its probability is the value.

    Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:

    • L1: Import the Callable type from the collections.abc module.

    • L3: The second argument accepts a function that takes a string and returns a Unigram.

    • L4: Call the estimator with the text file and store the result in unigrams.

    • L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.

    • L7: Iterate through the top 300 unigrams with the highest probabilities.

    • L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).

    • L2: Pass the unigram_estimation() function as the second argument.

    This distribution shows the most common unigrams in the text meeting the conditions in L8, dominated by the first-person pronoun "I", followed by proper nouns - specifically character names such as "Aslan", "Lucy", and "Edmund".


    Q2: What advantages do unigram probabilities have over word frequencies?

    Bigram Estimation

    A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows:

    $$P(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_{i})}{\sum_{\forall w_k \in V_{i}} \#(w_{i-1},w_k)}$$

    Where:

    • $\#(w_{i-1},w_{i})$: the total occurrences of $(w_{i-1},w_{i})$ in the corpus in that order.

    • $V_i$: a set of all word types occurring after $w_{i-1}$.

    Let us define the Bigram type:

    • A dictionary where the key is $w_{i-1}$ and the value is a nested dictionary representing the unigram distribution of all $w_i$ given $w_{i-1}$, or a smoothing probability for $w_{i-1}$.

    Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:

    • L1: Import the defaultdict class from the collections package.

    • L2: Import the Bigram type alias from the src.types package.

    • L5: Create a defaultdict with Counters as default values to store bigram frequencies.

    • L9: Iterate through the words, starting from the second word (index 1) in each line.

    • L10: Update the frequency of the current bigram.

    We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:

    • L8: Calculate the total count of all bigrams with the same previous word.

    • L9: Calculate and store the probabilities of each current word given the previous word.

    Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:

    • L2: Call the bigram_estimation() function with the text file and store the result.

    • L5: Create a bigram list given the previous word, sorted by probability in descending order.

    • L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.


    Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?

    References

    1. Source: ngram_models.py

    2. N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.


    from typing import TypeAlias
    
    Unigram: TypeAlias = dict[str, float]
    from collections import Counter
    
    def unigram_count(filepath: str) -> Counter:
        unigrams = Counter()
    
        for line in open(filepath):
            words = line.split()
            unigrams.update(words)
    
        return unigrams
    def unigram_estimation(filepath: str) -> Unigram:
        counts = unigram_count(filepath)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}
    from collections.abc import Callable
    
    def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
        unigrams = estimator(filepath)
        unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]
    
        for word, prob in unigram_list[:300]:
            if word[0].isupper() and word.lower() not in unigrams:
                print(f'{word:>10} {prob:.6f}')
    corpus = 'dat/chronicles_of_narnia.txt'
    test_unigram(corpus, unigram_estimation)
             I 0.010543
         Aslan 0.001850
          Lucy 0.001815
        Edmund 0.001409
        Narnia 0.001379
       Caspian 0.001338
          Jill 0.001262
         Peter 0.001034
        Shasta 0.000928
        Digory 0.000925
       Eustace 0.000877
         Susan 0.000654
        Tirian 0.000601
         Polly 0.000547
        Aravis 0.000537
          Bree 0.000492
    Puddleglum 0.000492
        Scrubb 0.000482
        Andrew 0.000406
    Bigram: TypeAlias = dict[str, Unigram | float]
    from collections import Counter, defaultdict
    from src.types import Bigram
    
    def bigram_count(filepath: str) -> dict[str, Counter]:
        bigrams = defaultdict(Counter)
    
        for line in open(filepath):
            words = line.split()
            for i in range(1, len(words)):
                bigrams[words[i - 1]].update([words[i]])
    
        return bigrams
    from src.types import Bigram
    
    def bigram_estimation(filepath: str) -> Bigram:
        counts = bigram_count(filepath)
        bigrams = dict()
    
        for prev, ccs in counts.items():
            total = sum(ccs.values())
            bigrams[prev] = {curr: count / total for curr, count in ccs.items()}
    
        return bigrams
    def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
        bigrams = estimator(filepath)
        for prev in ['I', 'the', 'said']:
            print(prev)
            bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
            for curr, prob in bigram_list[:10]:
                print("{:>10} {:.6f}".format(curr, prob))
    test_bigram(corpus, bigram_estimation)
    I
            'm 0.081628
            do 0.075849
           've 0.044065
           was 0.041897
          have 0.038045
            am 0.035878
           'll 0.032507
         think 0.032025
            'd 0.026246
          know 0.025765
    the
          same 0.014846
         other 0.013405
          King 0.012528
         Witch 0.011776
         whole 0.009020
        others 0.008958
         first 0.008770
         Dwarf 0.008582
          door 0.008519
         great 0.008519
    said
           the 0.157635
             , 0.073645
          Lucy 0.057635
        Edmund 0.045074
       Caspian 0.040394
         Peter 0.039409
          Jill 0.034975
             . 0.034729
        Digory 0.031034
         Aslan 0.030049