Update: 2023-10-13
Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X$ is defined as:

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$

where $P(x)$ is the probability distribution of $X$. The self-information of $x$ is defined as $I(x) = -\log_2 P(x)$, which measures how much information is gained when $x$ occurs. The negative sign indicates that as the probability of $x$ occurring increases, its self-information value decreases.
Entropy has several properties, including:
It is non-negative: $H(X) \ge 0$.
It is at its minimum when $X$ is entirely predictable (all probability mass on a single outcome).
It is at its maximum when all outcomes of $X$ are equally likely.
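To make these properties concrete, here is a minimal sketch (illustrative only; the entropy() helper is not part of the course code) that computes base-2 entropy for a few hypothetical distributions over four outcomes:

```python
import math

def entropy(probs):
    """H(X) = sum of p(x) * log2(1/p(x)); zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fully predictable variable has minimum entropy (0 bits).
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
# A uniform distribution over 4 outcomes has maximum entropy (2 bits).
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# Anything in between falls between the two extremes.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```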
Why is the self-information value expressed in a logarithmic scale?
Sequence entropy is a measure of the unpredictability or information content of a word sequence; it quantifies how uncertain or random the sequence is.
Assume a long sequence of words, $W = \{w_1, \ldots, w_n\}$, concatenating the entire text from a language $L$. Let $\mathcal{S}$ be a set of all possible sequences derived from $W$, where the shortest sequence is a single word and the longest sequence is $W$ itself. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{S \in \mathcal{S}} P(S) \log_2 P(S)$$
What does it mean when the entropy of a corpus is high?
How does high entropy affect the perplexity of a language model?
Your goal is to develop a bigram model that uses the following techniques:
Laplace smoothing with normalization
Measure the initial word probability by adding the artificial token at the beginning of every sentence.
Test your model using chronicles_of_narnia.txt.
Create a language_modeling.py file in the src/homework/ directory.
Define a function named bigram_model() that takes a file path pointing to the text file and returns a dictionary of bigram probabilities found in the text file.
Use the following constants to indicate the unknown and initial probabilities:
Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.
Under language_modeling.py, define a function named sequence_generator() that takes the following parameters:
A bigram model (the resulting dictionary of Task 1)
The initial word (the first word to appear in the sequence)
The length of the sequence (the number of tokens in the sequence)
This function aims to generate a sequence of tokens that adheres to the following criteria:
It must have the precise number of tokens as specified.
Excluding punctuation, there should be no redundant tokens in the sequence.
Finally, the function returns a tuple comprising the following two elements:
The list of tokens in the sequence
Define a function named sequence_generator_max() that accepts the same parameters but returns the sequence with the highest probability among all possible sequences using exhaustive search. To generate long sequences, dynamic programming needs to be adopted.
Commit and push the language_modeling.py file to your GitHub repository.
Update: 2024-01-05
The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.
Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{+1}(w_i) = \frac{\#(w_i) + 1}{\sum_{w \in V} (\#(w) + 1)} = \frac{\#(w_i) + 1}{\sum_{w \in V} \#(w) + |V|}$$
Thus, the probability of any unknown word $w_u$ with Laplace smoothing is calculated as follows:

$$P_{+1}(w_u) = \frac{0 + 1}{\sum_{w \in V} \#(w) + |V|} = \frac{1}{\sum_{w \in V} \#(w) + |V|}$$
The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{w \in V} P_{+1}(w) = \sum_{w \in V} \frac{\#(w) + 1}{\sum_{w' \in V} \#(w') + |V|} = \frac{\sum_{w \in V} \#(w) + |V|}{\sum_{w' \in V} \#(w') + |V|} = 1$$
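As a quick sanity check with hypothetical counts, suppose the corpus contains only two word types, $a$ and $b$, with $\#(a) = 3$ and $\#(b) = 1$:

$$P_{+1}(a) = \frac{3 + 1}{4 + 2} = \frac{2}{3}, \qquad P_{+1}(b) = \frac{1 + 1}{4 + 2} = \frac{1}{3}, \qquad P_{+1}(w_u) = \frac{1}{4 + 2} = \frac{1}{6}$$

The smoothed probabilities of the known words still sum to 1, while the unknown word receives the smallest probability.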
Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the unigram_count() function from the src.ngram_models package.
L4: Define a constant representing the unknown word.
L8: Increment the total count by the vocabulary size.
L9: Increment each unigram count by 1.
L10: Add the unknown word to the unigrams with a probability of 1 divided by the total count.
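A minimal sketch of how unigram_smoothing() might be implemented, assuming unigram_count() from src.ngram_models returns a Counter mapping each word to its frequency, and using the empty string as a stand-in for the UNKNOWN constant:

```python
from src.ngram_models import unigram_count   # assumed to return a Counter of word -> count

UNKNOWN = ''   # assumed placeholder; the empty string never appears as a real token

def unigram_smoothing(filepath: str) -> dict[str, float]:
    counts = unigram_count(filepath)
    # Increment the total count by the vocabulary size so every word gains a pseudo-count of 1.
    total = sum(counts.values()) + len(counts)
    # Increment each unigram count by 1 and renormalize.
    probs = {word: (count + 1) / total for word, count in counts.items()}
    # Any unseen word receives a probability of 1 divided by the adjusted total count.
    probs[UNKNOWN] = 1 / total
    return probs
```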
We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:
L1: Import the test_unigram() function from the ngram_models package.
Compared to the unigram results without smoothing (see the "Comparison" tab above), the probabilities for these top unigrams have slightly decreased.
Will the probabilities of all unigrams always decrease when Laplace smoothing is applied? If not, under what circumstances might the unigram probabilities increase after smoothing?
The unigram probability of any word (including unknown) can be retrieved using the UNKNOWN key:
L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.
L5: Test a known word, 'Aslan', and an unknown word, 'Jinho'.
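A sketch of the corresponding lookup, reusing unigram_smoothing() and UNKNOWN from the sketch above; the helper name smoothed_unigram() is hypothetical:

```python
def smoothed_unigram(probs: dict[str, float], word: str) -> float:
    # Fall back to the probability reserved for unknown words.
    return probs.get(word, probs[UNKNOWN])

probs = unigram_smoothing('dat/chronicles_of_narnia.txt')
print(smoothed_unigram(probs, 'Aslan'))   # a known word
print(smoothed_unigram(probs, 'Jinho'))   # an unknown word -> UNKNOWN probability
```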
The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|}$$
Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the bigram_count() function from the src.ngram_models package.
L6-8: Create a set vocab containing all unique words in the bigrams.
L12: Calculate the total count of all bigrams with the same previous word.
L13: Calculate and store the probabilities of each current word given the previous word.
L14: Calculate the probability for an unknown current word.
L17: Add a probability for an unknown previous word.
Why are L7-8 in the above code necessary to retrieve all word types?
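A minimal sketch of how bigram_smoothing() might be implemented, assuming bigram_count() from src.ngram_models returns a dictionary mapping each previous word to a Counter of the words that follow it; the handling of an unknown previous word (a uniform 1/|V| here) is one plausible choice and may differ from the original:

```python
from src.ngram_models import bigram_count   # assumed to return {prev: Counter({curr: count})}

UNKNOWN = ''   # assumed placeholder for the unknown word

def bigram_smoothing(filepath: str) -> dict[str, dict[str, float]]:
    counts = bigram_count(filepath)
    # Collect every word type appearing as either a previous or a current word.
    vocab = set(counts.keys())
    for nexts in counts.values():
        vocab.update(nexts.keys())

    bigrams = {}
    for prev, nexts in counts.items():
        # Total count of all bigrams with this previous word, plus the vocabulary size,
        # so every current word gains a pseudo-count of 1.
        total = sum(nexts.values()) + len(vocab)
        probs = {curr: (count + 1) / total for curr, count in nexts.items()}
        # Probability mass reserved for an unknown current word.
        probs[UNKNOWN] = 1 / total
        bigrams[prev] = probs

    # Probability used when the previous word itself is unknown (assumed uniform here).
    bigrams[UNKNOWN] = {UNKNOWN: 1 / len(vocab)}
    return bigrams
```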
We then test bigram_smoothing() with the same text file:
L1: Import the test_bigram() function from the ngram_models package.
Finally, we test the bigram estimation using smoothing for unknown sequences:
L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.
L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.
L8: The tuple word is unpacked and passed as the second and third parameters.
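A sketch of the corresponding lookup and test, reusing bigram_smoothing() and UNKNOWN from the sketch above; the helper name smoothed_bigram() and the example word pairs are hypothetical:

```python
def smoothed_bigram(bigrams: dict[str, dict[str, float]], prev: str, curr: str) -> float:
    # Retrieve the distribution for the previous word, or None if it is unknown.
    probs = bigrams.get(prev)
    if probs is None:
        # Unknown previous word: fall back to the reserved probability.
        return bigrams[UNKNOWN][UNKNOWN]
    # Unknown current word: fall back to the probability reserved under this previous word.
    return probs.get(curr, probs[UNKNOWN])

bigrams = bigram_smoothing('dat/chronicles_of_narnia.txt')
for pair in [('the', 'Lion'), ('Jinho', 'Choi')]:
    # The tuple is unpacked and passed as the second and third arguments.
    print(pair, smoothed_bigram(bigrams, *pair))
```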
However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:
Source: smoothing.py
Update: 2023-10-13
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.
In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I am a teacher", and selecting the one with the highest probability.
The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

$$P(\text{student} \mid \text{I am a}) = \frac{\#(\text{I am a student})}{\#(\text{I am a})}$$
The joint probability of the word sequence "I am a student" can be measured as follows:

$$P(\text{I am a student}) = \frac{\#(\text{I am a student})}{\sum_{(w_1, w_2, w_3, w_4)} \#(w_1\, w_2\, w_3\, w_4)}$$

where the denominator sums the counts of all 4-grams in the corpus.
Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the chain rule and the Markov assumption.
By applying the chain rule, the above joint probability can be decomposed into:

$$P(\text{I am a student}) = P(\text{I}) \cdot P(\text{am} \mid \text{I}) \cdot P(\text{a} \mid \text{I am}) \cdot P(\text{student} \mid \text{I am a})$$
The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

$$P(\text{I am a student}) \approx P(\text{I}) \cdot P(\text{am} \mid \text{I}) \cdot P(\text{a} \mid \text{am}) \cdot P(\text{student} \mid \text{a})$$
The joint probability can now be measured by the product of unigram and bigram probabilities.
How do the chain rule and Markov assumption simplify the estimation of sequence probability?
This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.
This enhancement allows us to express the sequence probability as a simple product of bigram probabilities:

$$P(w_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

where $w_0$ is the artificial token marking the beginning of the text.
The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and the potential for numerical underflow when the product falls below the smallest value the system can represent. In practice, logarithmic probabilities are computed instead:

$$\log P(w_{1:n}) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$$
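The following is a minimal sketch of this practice, assuming a hypothetical bigram_prob(prev, curr) callback that returns a strictly positive (e.g., smoothed) probability:

```python
import math

def sequence_log_probability(words: list[str], bigram_prob) -> float:
    # bigram_prob(prev, curr) is assumed to return P(curr | prev) > 0, e.g., a smoothed estimate.
    # Summing log probabilities avoids the numerical underflow of multiplying many small numbers.
    return sum(math.log(bigram_prob(prev, curr)) for prev, curr in zip(words, words[1:]))
```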
Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", [1-3]-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word $w_i$ as follows ($\#(w_i)$: the total occurrences of $w_i$ in the corpus, $V$: a set of all word types in the corpus):

$$P(w_i) = \frac{\#(w_i)}{\sum_{w \in V} \#(w)}$$
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
What are the benefits of processing the file line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
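A minimal sketch of how unigram_count() might look, assuming whitespace tokenization as in the examples above and line-by-line processing:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    with open(filepath) as fin:
        # Reading line by line avoids loading the entire file into memory at once.
        for line in fin:
            unigrams.update(line.split())
    return unigrams
```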
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
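A minimal sketch of unigram_estimation() along these lines, reusing the unigram_count() sketch above:

```python
def unigram_estimation(filepath: str) -> dict[str, float]:
    counts = unigram_count(filepath)
    # Total number of tokens in the corpus.
    total = sum(counts.values())
    # Relative frequency of each word type.
    return {word: count / total for word, count in counts.items()}
```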
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
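A minimal sketch of a test_unigram() function along these lines; the Unigram type is approximated here as dict[str, float], and the output formatting is illustrative:

```python
from typing import Callable

def test_unigram(filepath: str, estimator: Callable[[str], dict[str, float]]):
    unigrams = estimator(filepath)
    # Sort unigram-probability pairs by probability in descending order.
    unigram_list = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)
    # Among the 300 most probable unigrams, print those that look like proper nouns:
    # capitalized words whose lowercased forms never occur in the corpus.
    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>12} {prob:.6f}')

test_unigram('dat/chronicles_of_narnia.txt', unigram_estimation)
```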
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
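A minimal sketch of bigram_count(), again assuming whitespace tokenization:

```python
from collections import Counter, defaultdict

def bigram_count(filepath: str) -> dict[str, Counter]:
    # Map each previous word to a Counter of the words that follow it.
    bigrams = defaultdict(Counter)
    with open(filepath) as fin:
        for line in fin:
            words = line.split()
            # Start from the second word (index 1) so that every word has a previous word.
            for i in range(1, len(words)):
                bigrams[words[i - 1]][words[i]] += 1
    return bigrams
```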
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
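A minimal sketch of bigram_estimation(), reusing the bigram_count() sketch above:

```python
def bigram_estimation(filepath: str) -> dict[str, dict[str, float]]:
    counts = bigram_count(filepath)
    bigrams = {}
    for prev, nexts in counts.items():
        # Total count of all bigrams sharing the same previous word.
        total = sum(nexts.values())
        # P(curr | prev) = count(prev, curr) / count(prev, *)
        bigrams[prev] = {curr: count / total for curr, count in nexts.items()}
    return bigrams
```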
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
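A minimal sketch of test_bigram() along these lines; the previous words chosen for display ('I', 'the', 'said') are arbitrary examples:

```python
from typing import Callable

def test_bigram(filepath: str, estimator: Callable[[str], dict[str, dict[str, float]]]):
    bigrams = estimator(filepath)
    # The previous words below are arbitrary; any word occurring in the corpus would do.
    for prev in ['I', 'the', 'said']:
        # Sort the words following `prev` by probability in descending order.
        bigram_list = sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)
        print(prev)
        for curr, prob in bigram_list[:10]:
            print(f'    {curr:>12} {prob:.6f}')

test_bigram('dat/chronicles_of_narnia.txt', bigram_estimation)
```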
The entropy rate (per-word entropy), $H_r(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H_r(W) = \frac{1}{n} H(W)$$
In theory, there is an infinite number of unobserved word sequences in the language $L$. To estimate the true entropy of $L$, we need to take the limit of $H_r(W)$ as $n$ approaches infinity:

$$H(L) = \lim_{n \to \infty} H_r(W) = \lim_{n \to \infty} \frac{1}{n} H(W)$$
The Shannon-McMillan-Breiman theorem implies that if the language $L$ is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(L)$, because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its probability.
The bigram model in the previous section is stationary because all probabilities rely on the same condition, $P(w_i \mid w_{i-1})$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, $H(L)$ can be approximated:

$$H(L) \approx -\lim_{n \to \infty} \frac{1}{n} \log_2 P(W)$$
Consequently, $H_r(W)$ is approximated as follows, where $n$ is sufficiently large:

$$H_r(W) \approx -\frac{1}{n} \log_2 P(W)$$
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W$ is measured as:

$$PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}$$
Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.
Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H_r(W)$:

$$PP(W) = 2^{H_r(W)} = 2^{-\frac{1}{n} \log_2 P(W)} = P(W)^{-\frac{1}{n}}$$
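A minimal sketch of this relationship, converting a base-2 sequence log probability into perplexity; the helper name perplexity() is illustrative:

```python
import math

def perplexity(log2_prob: float, n: int) -> float:
    # log2_prob is the base-2 log probability of a sequence of n words.
    entropy_rate = -log2_prob / n   # approximated per-word entropy H_r(W)
    return 2 ** entropy_rate        # PP(W) = 2^{H_r(W)} = P(W)^(-1/n)

# A model that assigns probability 1/4 to each of 10 words has perplexity 4.
print(perplexity(10 * math.log2(0.25), 10))   # 4.0
```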
Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of 4 punctuation tokens are permitted).
In this task, the goal is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word $w_i$, a potential strategy is as follows:
Identify the next word $w_{i+1}$ whose bigram probability $P(w_{i+1} \mid w_i)$ is maximized.
If $w_{i+1}$ fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.
Make $w_{i+1}$ the new input word and repeat from step #1 until you reach the specified sequence length.
The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base $e$, provided as the math.log() function in Python.
Thus, the probability of an unknown bigram, where $w_{i-1}$ is known but $w_i$ is unknown, is calculated as follows:

$$P_{+1}(w_u \mid w_{i-1}) = \frac{0 + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|} = \frac{1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|}$$
What does the Laplace-smoothed bigram probability of $w_i$ represent when the previous word $w_{i-1}$ is unknown? What is a potential problem with this estimation?
Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a previous word $w_{i-1}$ is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:
There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $|V| = 7$. Before Laplace smoothing, the bigram probabilities given a previous word $w_{i-1}$ in this corpus are estimated as follows:
The bigram distribution for $w_{i-1}$ can be normalized to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_{w_{i-1}}|$, to the denominator instead of $|V|$:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V_{w_{i-1}}|}$$
Consequently, the probability of an unknown bigram can be calculated with the normalization as follows:

$$P_{+1}(w_u \mid w_{i-1}) = \frac{1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V_{w_{i-1}}|}$$
Once you apply this normalization to $w_{i-1}$ in the above example, the sum of its bigram probabilities becomes 1:
A major drawback of this normalization is that the probability cannot be measured when $w_{i-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams, $\min_{w' \in V} P_{+1}(w_u \mid w')$, as the bigram probability when the previous word is unknown.
Thus, the probability of any given word sequence can be measured as:
The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring $P(w_i \mid w_{1:i-1})$ is as challenging as measuring $P(w_{1:n})$.
Let us consider the unigram probabilities of "the" and "The". In general, "the" appears more frequently than "The" such that:

$$P(\text{the}) > P(\text{The})$$
Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial words of the text, denoted as $P(\text{the} \mid w_0)$ and $P(\text{The} \mid w_0)$, respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

$$P(\text{The} \mid w_0) > P(\text{the} \mid w_0)$$
Thus, to predict a more probable initial word, it is better to consider the bigram probability $P(w_1 \mid w_0)$ rather than the unigram probability $P(w_1)$ when measuring the sequence probability.
Is it worth considering the end of the text by introducing another artificial token, $w_{n+1}$, to improve last-word prediction by multiplying the above product with $P(w_{n+1} \mid w_n)$?
L1: Import the Unigram from the package.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file:
L1: Import the type from the typing module.
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows ($\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}$ and $w_i$ in the corpus in that order, $V_{w_{i-1}}$: a set of all word types occurring after $w_{i-1}$):

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w)}$$
L1: Import the class from the package.
L2: Import the Bigram from the package.
Source: Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.