N-gram Models

Update: 2024-01-05

An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.

For the sentence "I'm a computer scientist.", 1- to 3-grams can be extracted as follows:

  • 1-gram (unigram): {"I'm", "a", "computer", "scientist."}

  • 2-gram (bigram): {"I'm a", "a computer", "computer scientist."}

  • 3-gram (trigram): {"I'm a computer", "a computer scientist."}

In the above example, "I'm" and "scientist." are treated as single tokens, whereas proper tokenization would split them into ["I", "'m"] and ["scientist", "."], respectively.
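
The n-grams above can be generated with a simple sliding window over the whitespace-split words. Below is a minimal sketch; the helper ngrams() is illustrative and not part of the course code:

def ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of size n over the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "I'm a computer scientist.".split()
print(ngrams(words, 2))  # [("I'm", 'a'), ('a', 'computer'), ('computer', 'scientist.')]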

What are the potential issues of using n-grams without proper tokenization?

Unigram Estimation

Given a large corpus, a unigram model calculates the probability of each word $w_i$ as follows ($\#(w_i)$: the total occurrences of $w_i$ in the corpus, $V$: the set of all word types in the corpus):

$$P(w_i) = \frac{\#(w_i)}{\sum_{\forall w_k \in V} \#(w_k)}$$
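
For example, if a corpus contains 1,000 word tokens in total and "lion" occurs 20 times, the unigram model assigns $P(\text{lion}) = 20 / 1000 = 0.02$ (the counts here are made up purely for illustration).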

Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:

from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()

    for line in open(filepath):
        words = line.split()
        unigrams.update(words)

    return unigrams

What are the benefits of processing line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
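
Before estimating probabilities, the counts themselves can be inspected; the snippet below is a minimal usage sketch (no particular output is assumed), relying on Counter.most_common() to list the most frequent words:

counts = unigram_count('dat/chronicles_of_narnia.txt')
print(counts.most_common(10))  # the 10 most frequent words and their counts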

We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:

from src.types import Unigram

def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
  • L1: Import the Unigram type alias from the src.types package.

  • L5: Calculate the total count of all unigrams in the text.

  • L6: Return a dictionary where each word is a key and its probability is the value.

Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:

from collections.abc import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print("{:>10} {:.6f}".format(word, prob))  

corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
  • L1: Import the Callable type from the collections.abc module.

  • L3: The second argument accepts a function that takes a string and returns a Unigram.

  • L4: Call the estimator with the text file and store the result in unigrams.

  • L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.

  • L7: Iterate through the top 300 unigrams with the highest probabilities.

  • L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).

  • L12: Pass the unigram_estimation() function as the second argument.

What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?

Bigram Estimation

A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows ($\#(w_{i-1}, w_i)$: the total occurrences of $(w_{i-1}, w_i)$ in the corpus in that order, $V_i$: the set of all word types occurring after $w_{i-1}$):

$$P(w_i | w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{\forall w_k \in V_i} \#(w_{i-1}, w_k)}$$
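
For example, if "the lion" occurs 50 times in the corpus and bigrams starting with "the" occur 5,000 times in total, then $P(\text{lion} \mid \text{the}) = 50 / 5000 = 0.01$ (again, the counts are made up purely for illustration).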

Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:

from collections import Counter, defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams
  • L1: Import the defaultdict class from the collections package.

  • L2: Import the Bigram type alias from the src.types package.

  • L5: Create a defaultdict with Counters as default values to store bigram frequencies.

  • L9: Iterate through the words, starting from the second word (index 1) in each line.

  • L10: Update the frequency of the current bigram.
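
Given only the example sentence "I'm a computer scientist.", bigram_count() would map "I'm" to Counter({'a': 1}), "a" to Counter({'computer': 1}), and "computer" to Counter({'scientist.': 1}).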

We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:

from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
  • L8: Calculate the total count of all bigrams with the same previous word.

  • L9: Calculate and store the probabilities of each current word given the previous word.
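
Note that, for each previous word, the probabilities stored in bigrams[prev] sum to 1, since every count is divided by the same total.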

Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:

def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))

test_bigram(corpus, bigram_estimation)
  • L2: Call the bigram_estimation() function with the text file and store the result.

  • L5: Create a bigram list given the previous word, sorted by probability in descending order.

  • L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
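
The estimated probabilities can also be chained to score a short word sequence. Below is a minimal sketch; the helper sequence_probability() and the example sequence are illustrative only, and any adjacent pair never seen in the corpus simply yields a probability of 0:

def sequence_probability(bigrams: Bigram, words: list[str]) -> float:
    # Multiply P(w_i | w_{i-1}) over all adjacent pairs in the sequence.
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        prob *= bigrams.get(prev, {}).get(curr, 0.0)
    return prob

print(sequence_probability(bigram_estimation(corpus), ['the', 'lion', 'said']))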

References

Source: ngram_models.py

  1. N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.
