N-gram Models
An n-gram is a contiguous sequence of n items from text data. These items can be:
Words (most common in language modeling)
Characters (useful for morphological analysis)
Subword tokens (common in modern NLP)
Bytes or other units
For the sentence "I'm a computer scientist.", we can extract different n-grams:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are treated as single tokens, although they would ideally be tokenized as ["I", "'m"] and ["scientist", "."], respectively.
Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?
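Before moving on, here is a minimal sketch of how n-grams can be extracted from a pre-tokenized sentence. The ngrams() helper below is illustrative and not part of the source file referenced at the end of this section:

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of size n over the token list and collect each window as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(['I', "'m", 'a', 'computer', 'scientist', '.'], 2))
# [('I', "'m"), ("'m", 'a'), ('a', 'computer'), ('computer', 'scientist'), ('scientist', '.')]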
Unigram Estimation
Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w$ as:

$$P(w) = \frac{\#(w)}{\sum_{w' \in V} \#(w')}$$

Where:
$\#(w)$: the count of word $w$ in the corpus.
$V$: the vocabulary (set of all unique words).
The denominator represents the total word count.
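As a toy illustration (not drawn from the corpus used below): if a corpus consists of only the six tokens "the cat sat on the mat", then $\#(\textit{the}) = 2$ and the denominator is 6, so:

$$P(\textit{the}) = \frac{2}{6} \approx 0.33$$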
Let us define the Unigram type:
from typing import TypeAlias

Unigram: TypeAlias = dict[str, float]

L1: Import TypeAlias from the typing module to define type aliases.
L3: A dictionary where the key is a unigram and the value is its probability.
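For instance, a Unigram maps each word to its probability; the entries below are made up purely for illustration:

example_unigram: Unigram = {'the': 0.0610, 'lion': 0.0021, 'wardrobe': 0.0004}  # illustrative values only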
Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    for line in open(filepath):
        words = line.split()
        unigrams.update(words)
    return unigrams

We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

L3: Calculate the total count of all unigrams in the text.
L4: Return a dictionary where each word is a key and its probability is the value.
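As a usage sketch (not part of the original code), individual word probabilities can be looked up directly from the returned dictionary; the lookup below assumes the corpus file used later in this section:

unigrams = unigram_estimation('dat/chronicles_of_narnia.txt')
p = unigrams.get('Aslan', 0.0)  # probability of 'Aslan'; 0.0 if the word does not appear in the corpus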
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file dat/chronicles_of_narnia.txt:
from collections.abc import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>10} {prob:.6f}')

L1: Import the Callable type from the collections.abc module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)

L2: Pass the unigram_estimation() function as the second argument.
I 0.010543
Aslan 0.001850
Lucy 0.001815
Edmund 0.001409
Narnia 0.001379
Caspian 0.001338
Jill 0.001262
Peter 0.001034
Shasta 0.000928
Digory 0.000925
Eustace 0.000877
Susan 0.000654
Tirian 0.000601
Polly 0.000547
Aravis 0.000537
Bree 0.000492
Puddleglum 0.000492
Scrubb 0.000482
Andrew 0.000406

This distribution shows the most common unigrams in the text meeting the conditions in L8, dominated by the first-person pronoun "I", followed by proper nouns, specifically character names such as "Aslan", "Lucy", and "Edmund".
Q2: What advantages do unigram probabilities have over word frequencies?
Bigram Estimation
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w' \in V(w_{i-1})} \#(w_{i-1}, w')}$$

Where:
$\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}\,w_i$ in the corpus in that order.
$V(w_{i-1})$: a set of all word types occurring after $w_{i-1}$.
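As a toy illustration with made-up counts (not from the corpus below): if "the cat" occurs 3 times and words follow "the" a total of 10 times, then:

$$P(\textit{cat} \mid \textit{the}) = \frac{3}{10} = 0.3$$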
Let us define the Bigram type:
Bigram: TypeAlias = dict[str, Unigram | float]

A dictionary where the key is the previous word $w_{i-1}$ and the value is a nested dictionary representing the unigram distribution of all $w_i$ given $w_{i-1}$, or a smoothing probability for $w_{i-1}$.
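For instance, a Bigram might be structured as follows; the words and probabilities are made up purely for illustration, and the float case (a smoothing probability) is omitted here:

example_bigram: Bigram = {
    'said': {'the': 0.16, 'Lucy': 0.06, 'Edmund': 0.05},  # distribution of words following 'said'
    'the': {'same': 0.015, 'King': 0.013},                # distribution of words following 'the'
}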
Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
from collections import Counter, defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams

L1: Import the defaultdict class from the collections package.
L2: Import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
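As an illustrative usage note (not part of the original code): each value in the returned dictionary is a Counter over the words that follow the key, so the most frequent continuations can be inspected directly:

counts = bigram_count('dat/chronicles_of_narnia.txt')
print(counts['said'].most_common(3))  # the three words that most frequently follow 'said', with their counts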
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams

L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
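As a usage sketch (not part of the original code), conditional probabilities can be looked up from the nested dictionaries, falling back to 0.0 for bigrams that never occur in the corpus; this assumes 'the' appears as a previous word:

bigrams = bigram_estimation('dat/chronicles_of_narnia.txt')
p = bigrams['the'].get('King', 0.0)  # P(King | the); 0.0 if 'King' never follows 'the'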
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))

L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
test_bigram(corpus, bigram_estimation)

I
'm 0.081628
do 0.075849
've 0.044065
was 0.041897
have 0.038045
am 0.035878
'll 0.032507
think 0.032025
'd 0.026246
know 0.025765
the
same 0.014846
other 0.013405
King 0.012528
Witch 0.011776
whole 0.009020
others 0.008958
first 0.008770
Dwarf 0.008582
door 0.008519
great 0.008519
said
the 0.157635
, 0.073645
Lucy 0.057635
Edmund 0.045074
Caspian 0.040394
Peter 0.039409
Jill 0.034975
. 0.034729
Digory 0.031034
Aslan 0.030049

Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?
References
Source: ngram_models.py.
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.