2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as single tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."].
Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?
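For concreteness, here is a minimal sketch showing how such n-grams can be generated from a list of tokens; the helper ngrams() and the whitespace tokenization are illustrative assumptions, not part of the original code:

def ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide a window of size n over the tokens and join each window with spaces.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I'm a computer scientist.".split()
print(ngrams(tokens, 2))  # ["I'm a", 'a computer', 'computer scientist.']
print(ngrams(tokens, 3))  # ["I'm a computer", 'a computer scientist.']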
Unigram Estimation
Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w$ as:

$$P(w) = \frac{\#(w)}{\sum_{w' \in V} \#(w')}$$

Where:
$\#(w)$: the count of the word $w$ in the corpus.
$V$: the vocabulary (set of all unique words).
The denominator represents the total word count in the corpus.
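As a toy illustration with assumed counts (not taken from the corpus): if a corpus contains 1,000 tokens in total and the word "dog" occurs 20 times, then $P(\text{dog}) = 20 / 1000 = 0.02$.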
Let us define the Unigram type:
from typing import TypeAlias

Unigram: TypeAlias = dict[str, float]
L1: Import the TypeAlias type from the typing module.
L3: A dictionary where the key is a unigram and the value is its probability.
Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    for line in open(filepath):
        words = line.split()
        unigrams.update(words)
    return unigrams
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
L3: Calculate the total count of all unigrams in the text.
L4: Return a dictionary where each word is a key and its probability is the value.
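As a quick sanity check (not part of the original code), the probabilities returned by unigram_estimation() should sum to one, since every count is divided by the same total:

probs = unigram_estimation('dat/chronicles_of_narnia.txt')
assert abs(sum(probs.values()) - 1.0) < 1e-6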
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with the text file dat/chronicles_of_narnia.txt:
from collections.abc import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]

    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>10} {prob:.6f}')
L1: Import the Callable type from the collections.abc module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
L2: Pass the unigram_estimation() function as the second argument.
I 0.010543
Aslan 0.001850
Lucy 0.001815
Edmund 0.001409
Narnia 0.001379
Caspian 0.001338
Jill 0.001262
Peter 0.001034
Shasta 0.000928
Digory 0.000925
Eustace 0.000877
Susan 0.000654
Tirian 0.000601
Polly 0.000547
Aravis 0.000537
Bree 0.000492
Puddleglum 0.000492
Scrubb 0.000482
Andrew 0.000406
This distribution shows the most common unigrams in the text meeting the conditions in L8, dominated by the first-person pronoun "I", followed by proper nouns, specifically character names such as "Aslan", "Lucy", and "Edmund".
Q2: What advantages do unigram probabilities have over raw unigram counts?
Bigram Estimation
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w' \in V(w_{i-1})} \#(w_{i-1}, w')}$$

Where:
$\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}$ and $w_i$ in the corpus in that order.
$V(w_{i-1})$: the set of all word types occurring after $w_{i-1}$.
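As a toy illustration with assumed counts (not from the corpus): if "I" occurs 1,000 times as a previous word and is followed by "'m" 82 times, then $P(\text{'m} \mid \text{I}) = 82 / 1000 = 0.082$, which is comparable to the estimate printed by test_bigram() later in this section.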
Let us define the Bigram type:
Bigram: TypeAlias = dict[str, Unigram | float]
A dictionary where the key is the previous word and the value is a nested dictionary representing the unigram distribution of all current words given that previous word (the type also allows a float value).
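For intuition, the structure produced by bigram_estimation() (defined below) looks roughly as follows; this is a truncated sketch showing only two previous words and two entries each, with probabilities taken from the output later in this section:

example: Bigram = {
    'I': {"'m": 0.081628, 'do': 0.075849},        # P(w | 'I') for a few following words w
    'said': {'the': 0.157635, 'Lucy': 0.057635},  # truncated: the full inner dicts each sum to 1
}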
Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
from collections import Counter, defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])

    return bigrams
L1: Import the Counter and defaultdict classes from the collections package.
L2: Import the Bigram type from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}

    return bigrams
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
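As a small usage sketch (assuming the same corpus file as above), a single conditional probability can be read directly from the nested dictionary returned by bigram_estimation():

bigrams = bigram_estimation('dat/chronicles_of_narnia.txt')
print(bigrams['said'].get('Lucy', 0.0))  # P(Lucy | said), about 0.0576 on this corpus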
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
test_bigram(corpus, bigram_estimation)

I
'm 0.081628
do 0.075849
've 0.044065
was 0.041897
have 0.038045
am 0.035878
'll 0.032507
think 0.032025
'd 0.026246
know 0.025765
the
same 0.014846
other 0.013405
King 0.012528
Witch 0.011776
whole 0.009020
others 0.008958
first 0.008770
Dwarf 0.008582
door 0.008519
great 0.008519
said
the 0.157635
, 0.073645
Lucy 0.057635
Edmund 0.045074
Caspian 0.040394
Peter 0.039409
Jill 0.034975
. 0.034729
Digory 0.031034
Aslan 0.030049
Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?