N-gram Models
Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", 1-grams, 2-grams, and 3-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, although they should have been tokenized as ["I", "'m"] and ["scientist", "."].
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word $w$ as follows ($\#(w)$: the total occurrences of $w$ in the corpus, $V$: a set of all word types in the corpus):

$$P(w) = \frac{\#(w)}{\sum_{w' \in V} \#(w')}$$
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
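A minimal sketch of such a function is given below, assuming whitespace tokenization via split(); see ngram_models.py at the end of this section for the full source, whose exact layout the line references below refer to:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()

    for line in open(filepath):        # process the file one line at a time
        unigrams.update(line.split())  # count whitespace-delimited tokens

    return unigrams
```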
What are the benefits of processing line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
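A sketch reusing unigram_count() above may look as follows (the alias Unigram is assumed to be dict[str, float]):

```python
from src.types import Unigram  # assumed alias: Unigram = dict[str, float]

def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())  # total count of all unigrams in the text
    return {word: count / total for word, count in counts.items()}
```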
L1: Import the Unigram type alias from the src.types package.
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:
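A sketch of this function is shown below; the output format is illustrative:

```python
from typing import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)

    for word, prob in unigram_list[:300]:  # top 300 unigrams by probability
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>12} {prob:.4f}')


test_unigram('dat/chronicles_of_narnia.txt', unigram_estimation)
```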
L1: Import the Callable type from the typing module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (a heuristic for finding proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows ($\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}$ and $w_i$ in the corpus in that order, $V(w_{i-1})$: a set of all word types occurring after $w_{i-1}$):

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w' \in V(w_{i-1})} \#(w_{i-1}, w')}$$

Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
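A sketch is given below; the returned dictionary is assumed to map each previous word to a Counter of the words that follow it:

```python
from collections import Counter, defaultdict
from src.types import Bigram  # assumed alias: Bigram = dict[str, dict[str, float]]

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)  # previous word -> counts of following words

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):  # start from the second word
            bigrams[words[i - 1]][words[i]] += 1

    return bigrams
```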
L1: Import the defaultdict class from the collections package.
L2: Import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
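Assuming the Bigram alias noted above, a sketch reusing bigram_count() may look as follows:

```python
from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())  # total count of bigrams sharing this previous word
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
```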
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
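A sketch is shown below; the previous word 'the' is a hypothetical choice for illustration:

```python
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    prev = 'the'  # hypothetical previous word to inspect

    bigram_list = sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)
    for curr, prob in bigram_list[:10]:  # top 10 following words by probability
        print(f'{prev} {curr} {prob:.4f}')


test_bigram('dat/chronicles_of_narnia.txt', bigram_estimation)
```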
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Source: ngram_models.py
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.