Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", [1-3]-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
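The sketch below shows one way to extract these n-grams programmatically. The helper extract_ngrams() is only for illustration (it is not part of ngram_models.py) and uses a naive whitespace split:

```python
def extract_ngrams(text: str, n: int) -> list[str]:
    """Return all contiguous n-grams from a whitespace-tokenized text (hypothetical helper)."""
    words = text.split()  # naive tokenization by whitespace
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I'm a computer scientist."
for n in range(1, 4):
    print(n, extract_ngrams(sentence, n))
# 1 ["I'm", 'a', 'computer', 'scientist.']
# 2 ["I'm a", 'a computer', 'computer scientist.']
# 3 ["I'm a computer", 'a computer scientist.']
```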
In the above example, "I'm" and "scientist." are treated as single tokens, although they should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word $w$ as follows ($\#(w)$: the total occurrences of $w$ in the corpus, $V$: a set of all word types in the corpus):

$$P(w) = \frac{\#(w)}{\sum_{w' \in V} \#(w')}$$
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
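The full listing lives in ngram_models.py; a minimal sketch consistent with the description above might look like this, where (in this layout) lines 6-8 read and count the file one line at a time:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()

    for line in open(filepath):    # read the file one line at a time
        words = line.split()       # split the line into whitespace-separated tokens
        unigrams.update(words)     # add the tokens' counts to the running Counter

    return unigrams
```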
What are the benefits of processing line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
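A sketch consistent with the line notes below, assuming Unigram is a type alias such as dict[str, float] in src.types and that unigram_count() is defined as above:

```python
from src.types import Unigram

def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())  # total count of all unigrams in the text
    return {word: count / total for word, count in counts.items()}  # word -> probability
```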
L1: Import the Unigram type alias from the src.types package.
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and use it to test unigram_estimation() with the text file dat/chronicles_of_narnia.txt:
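A sketch consistent with the line notes below; the exact print formatting in ngram_models.py may differ, and Unigram and unigram_estimation() come from the snippets above:

```python
from typing import Callable

def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)

    for word, prob in unigram_list[:300]:                      # top 300 most probable unigrams
        if word[0].isupper() and word.lower() not in unigrams:  # rough proper-noun filter
            print(f'{word} {prob:.6f}')


test_unigram('dat/chronicles_of_narnia.txt', unigram_estimation)
```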
L1: Import the Callable type from the typing module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase form does not appear in unigrams (a simple heuristic for spotting proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows ($\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1} w_i$ in the corpus in that order, $V(w_{i-1})$: a set of all word types occurring after $w_{i-1}$):

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w' \in V(w_{i-1})} \#(w_{i-1}, w')}$$

Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
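A sketch consistent with the line notes below, reusing Counter from the earlier snippet; Bigram is assumed to be a nested alias such as dict[str, dict[str, float]] in src.types and is used by bigram_estimation() further down:

```python
from collections import defaultdict
from src.types import Bigram

def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)                # previous word -> Counter of following words

    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):            # start from the second word in each line
            bigrams[words[i - 1]][words[i]] += 1  # count the (previous, current) pair

    return bigrams
```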
L1: Import the defaultdict class from the collections package.
L2: Import the Bigram type alias from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
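A sketch consistent with the line notes below, building on bigram_count() above; exact line positions in the source listing may differ slightly:

```python
from src.types import Bigram

def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()

    for prev, ccounts in counts.items():
        total = sum(ccounts.values())  # total count of all bigrams sharing this previous word
        bigrams[prev] = {curr: count / total for curr, count in ccounts.items()}  # P(curr | prev)

    return bigrams
```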
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and use it to test bigram_estimation() with the same text file:
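A sketch consistent with the line notes below; Callable and Bigram are imported in the earlier snippets, and the previous words inspected here ('I', 'the', 'said') are only an illustrative choice that may differ from the source:

```python
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)

    for prev in ['I', 'the', 'said']:  # sample previous words (illustrative choice)
        bigram_list = sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)
        for curr, prob in bigram_list[:10]:
            print(f'{prev} {curr} {prob:.6f}')


test_bigram('dat/chronicles_of_narnia.txt', bigram_estimation)
```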
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Source: ngram_models.py
N-gram Language Models, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.