N-gram Models
An n-gram is a contiguous sequence of n items from text data. These items can be:
Words (most common in language modeling)
Characters (useful for morphological analysis)
Subword tokens (common in modern NLP)
Bytes or other units
For the sentence "I'm a computer scientist.", we can extract different n-grams:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are treated as single tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.
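To make the extraction concrete, here is a minimal sketch of a generic n-gram extractor; the function name ngrams and the simple whitespace tokenization are illustrative assumptions, not part of the accompanying source code:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I'm a computer scientist.".split()
print(ngrams(tokens, 2))  # [("I'm", 'a'), ('a', 'computer'), ('computer', 'scientist.')]
```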
Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?
Unigram Estimation
Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w$ as:

$$P(w) = \frac{\#(w)}{\sum_{w' \in V} \#(w')}$$

Where:
$\#(w)$: the count of word $w$ in the corpus.
$V$: the vocabulary (set of all unique words).
The denominator represents the total word count.
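As a quick sanity check with made-up counts: if the corpus contains 100,000 tokens in total and "Aslan" appears 500 times, then $P(\textrm{Aslan}) = 500 / 100{,}000 = 0.005$.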
Let us define the Unigram type:
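The listing itself is not reproduced on this page; a minimal sketch consistent with the notes below might look like this (placing the alias in src/types.py is an assumption based on the src.types import used later):

```python
# Type aliases

Unigram = dict[str, float]  # unigram -> probability
```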
L1: Type aliases.
L3: A dictionary where the key is a unigram and the value is its probability.
Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
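The full implementation lives in ngram_models.py (see References); a minimal sketch, assuming the file is already tokenized so that whitespace splitting yields tokens, might look like this:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    # Count every whitespace-separated token in the file.
    unigrams = Counter()
    for line in open(filepath):
        unigrams.update(line.split())
    return unigrams
```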
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
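Again, only a sketch is shown here; it reuses unigram_count() and the Unigram alias from above, and its lines roughly correspond to the notes below:

```python
def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```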
L3: Calculate the total count of all unigrams in the text.
L4: Return a dictionary where each word is a key and its probability is the value.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and tests unigram_estimation() with the text file dat/chronicles_of_narnia.txt:
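A sketch of this function, with lines roughly following the notes below (the exact print formatting is an assumption):

```python
from typing import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>12} {prob:.6f}')
```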
L1: Import the Callable type from the typing module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
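The test call itself might look like the following; only the file path and the two function names are given in the original notes:

```python
corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
```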
L2: Pass the unigram_estimation() function as the second argument.
The resulting distribution shows the most common unigrams in the text meeting the conditions in L8; it is dominated by the first-person pronoun "I", followed by proper nouns, specifically character names such as "Aslan", "Lucy", and "Edmund".
Q2: What advantages do unigram probabilities have over word frequencies?
Bigram Estimation
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w' \in V(w_{i-1})} \#(w_{i-1}, w')}$$

Where:
$\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}$ and $w_i$ in the corpus in that order.
$V(w_{i-1})$: the set of all word types occurring after $w_{i-1}$.
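Again with made-up counts: if "said Aslan" occurs 20 times and "said" is followed by some word 400 times in total, then $P(\textrm{Aslan} \mid \textrm{said}) = 20 / 400 = 0.05$.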
Let us define the Bigram type:
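As with Unigram, only a sketch is shown; the union syntax below is one way to express the description that follows:

```python
Bigram = dict[str, Unigram | float]  # previous word -> distribution over next words, or a smoothing probability
```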
A dictionary where the key is $w_{i-1}$ and the value is a nested dictionary representing the unigram distribution of all $w_i$ given $w_{i-1}$, or a smoothing probability for $w_{i-1}$.
Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
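A sketch whose lines roughly follow the notes below (importing Counter alongside defaultdict is an assumption):

```python
from collections import Counter, defaultdict
from src.types import Bigram  # used by bigram_estimation() below

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams
```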
L1: Import the defaultdict class from the collections package.
L2: Import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
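Only a sketch is shown; note that the line numbers in the notes below refer to the original listing and may not align exactly with this version:

```python
def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())  # total count of all bigrams sharing the previous word
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
```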
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and tests bigram_estimation() with the text file:
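A sketch of the test, with lines roughly following the notes below; which previous words are inspected is an assumption made for illustration:

```python
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)

    for prev in ['I', 'the']:  # previous words to inspect (illustrative choice)
        bigram_list = sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)
        for curr, prob in bigram_list[:10]:
            print(prev, curr, f'{prob:.6f}')

test_bigram('dat/chronicles_of_narnia.txt', bigram_estimation)
```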
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?
References
Source: ngram_models.py.
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.