Update: 2023-10-13
Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X$ is defined as:

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$

where $P(x)$ is the probability distribution of $X$. The self-information of $x$ is defined as $I(x) = -\log_2 P(x)$, which measures how much information is gained when $x$ occurs. The negative sign indicates that as the probability of $x$ occurring increases, its self-information value decreases.
Entropy has several properties, including:
It is non-negative: $H(X) \ge 0$.
It is at its minimum when $X$ is entirely predictable (all probability mass on a single outcome).
It is at its maximum when all outcomes of $X$ are equally likely.
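To make these properties concrete, here is a minimal sketch (illustrative only; the entropy() helper is not part of the course code) that computes base-2 entropy for a few hypothetical distributions over four outcomes:

```python
import math

def entropy(probs):
    """H(X) = sum of p(x) * log2(1/p(x)); zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fully predictable variable has minimum entropy (0 bits).
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
# A uniform distribution over 4 outcomes has maximum entropy (2 bits).
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# Anything in between falls between the two extremes.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```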
Why is the self-information value expressed in a logarithmic scale?
Sequence entropy is a measure of the unpredictability or information content of a word sequence; it quantifies how uncertain or random the sequence is.
Assume a long sequence of words, $W = \{w_1, \ldots, w_n\}$, concatenating the entire text from a language $L$. Let $\mathcal{S}$ be a set of all possible sequences derived from $W$, where the shortest sequence is a single word and the longest sequence is $W$ itself. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{S \in \mathcal{S}} P(S) \log_2 P(S)$$
What does it mean when the entropy of a corpus is high?
How does high entropy affect the perplexity of a language model?
Your goal is to develop a bigram model that uses the following techniques:
Laplace smoothing with normalization
Measure the initial word probability by adding the artificial token at the beginning of every sentence.
Test your model using chronicles_of_narnia.txt.
Create a language_modeling.py file in the src/homework/ directory.
Define a function named bigram_model() that takes a file path pointing to the text file and returns a dictionary of bigram probabilities found in the text file.
Use the following constants to indicate the unknown and initial probabilities:
Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.
Under language_modeling.py, define a function named sequence_generator() that takes the following parameters:
A bigram model (the resulting dictionary of Task 1)
The initial word (the first word to appear in the sequence)
The length of the sequence (the number of tokens in the sequence)
This function aims to generate a sequence of tokens that adheres to the following criteria:
It must have the precise number of tokens as specified.
Excluding punctuation, there should be no redundant tokens in the sequence.
Finally, the function returns a tuple comprising the following two elements:
The list of tokens in the sequence
Define a function named sequence_generator_max() that accepts the same parameters but returns the sequence with the highest probability among all possible sequences using exhaustive search. To generate long sequences, dynamic programming needs to be adopted.
Commit and push the language_modeling.py file to your GitHub repository.
Update: 2024-01-05
The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.
Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{+1}(w_i) = \frac{\#(w_i) + 1}{\sum_{w \in V} (\#(w) + 1)} = \frac{\#(w_i) + 1}{\sum_{w \in V} \#(w) + |V|}$$
Thus, the probability of any unknown word $w_u$ with Laplace smoothing is calculated as follows:

$$P_{+1}(w_u) = \frac{0 + 1}{\sum_{w \in V} \#(w) + |V|} = \frac{1}{\sum_{w \in V} \#(w) + |V|}$$
The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{w \in V} P_{+1}(w) = \sum_{w \in V} \frac{\#(w) + 1}{\sum_{w' \in V} \#(w') + |V|} = \frac{\sum_{w \in V} \#(w) + |V|}{\sum_{w' \in V} \#(w') + |V|} = 1$$
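As a quick sanity check with hypothetical counts, suppose the corpus contains only two word types, $a$ and $b$, with $\#(a) = 3$ and $\#(b) = 1$:

$$P_{+1}(a) = \frac{3 + 1}{4 + 2} = \frac{2}{3}, \qquad P_{+1}(b) = \frac{1 + 1}{4 + 2} = \frac{1}{3}, \qquad P_{+1}(w_u) = \frac{1}{4 + 2} = \frac{1}{6}$$

The smoothed probabilities of the known words still sum to 1, while the unknown word receives the smallest probability.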
Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the unigram_count() function from the src.ngram_models package.
L4: Define a constant representing the unknown word.
L8: Increment the total count by the vocabulary size.
L9: Increment each unigram count by 1.
L10: Add the unknown word to the unigrams with a probability of 1 divided by the total count.
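A minimal sketch of how unigram_smoothing() might be implemented, assuming unigram_count() from src.ngram_models returns a Counter mapping each word to its frequency, and using the empty string as a stand-in for the UNKNOWN constant:

```python
from src.ngram_models import unigram_count   # assumed to return a Counter of word -> count

UNKNOWN = ''   # assumed placeholder; the empty string never appears as a real token

def unigram_smoothing(filepath: str) -> dict[str, float]:
    counts = unigram_count(filepath)
    # Increment the total count by the vocabulary size so every word gains a pseudo-count of 1.
    total = sum(counts.values()) + len(counts)
    # Increment each unigram count by 1 and renormalize.
    probs = {word: (count + 1) / total for word, count in counts.items()}
    # Any unseen word receives a probability of 1 divided by the adjusted total count.
    probs[UNKNOWN] = 1 / total
    return probs
```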
We then test unigram_smoothing() with a text file dat/chronicles_of_narnia.txt:
L1: Import the test_unigram() function from the ngram_models package.
Compared to the unigram results without smoothing (see the "Comparison" tab above), the probabilities for these top unigrams have slightly decreased.
Will the probabilities of all unigrams always decrease when Laplace smoothing is applied? If not, under what circumstances might the unigram probabilities increase after smoothing?
The unigram probability of any word (including unknown) can be retrieved using the UNKNOWN key:
L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.
L5: Test a known word, 'Aslan', and an unknown word, 'Jinho'.
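A sketch of the corresponding lookup, reusing unigram_smoothing() and UNKNOWN from the sketch above; the helper name smoothed_unigram() is hypothetical:

```python
def smoothed_unigram(probs: dict[str, float], word: str) -> float:
    # Fall back to the probability reserved for unknown words.
    return probs.get(word, probs[UNKNOWN])

probs = unigram_smoothing('dat/chronicles_of_narnia.txt')
print(smoothed_unigram(probs, 'Aslan'))   # a known word
print(smoothed_unigram(probs, 'Jinho'))   # an unknown word -> UNKNOWN probability
```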
The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|}$$
Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the bigram_count() function from the src.ngram_models package.
L6-8: Create a set vocab containing all unique words in the bigrams.
L12: Calculate the total count of all bigrams with the same previous word.
L13: Calculate and store the probabilities of each current word given the previous word.
L14: Calculate the probability for an unknown current word.
L17: Add a probability for an unknown previous word.
Why are L7-8 in the above code necessary to retrieve all word types?
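A minimal sketch of how bigram_smoothing() might be implemented, assuming bigram_count() from src.ngram_models returns a dictionary mapping each previous word to a Counter of the words that follow it; the handling of an unknown previous word (a uniform 1/|V| here) is one plausible choice and may differ from the original:

```python
from src.ngram_models import bigram_count   # assumed to return {prev: Counter({curr: count})}

UNKNOWN = ''   # assumed placeholder for the unknown word

def bigram_smoothing(filepath: str) -> dict[str, dict[str, float]]:
    counts = bigram_count(filepath)
    # Collect every word type appearing as either a previous or a current word.
    vocab = set(counts.keys())
    for nexts in counts.values():
        vocab.update(nexts.keys())

    bigrams = {}
    for prev, nexts in counts.items():
        # Total count of all bigrams with this previous word, plus the vocabulary size,
        # so every current word gains a pseudo-count of 1.
        total = sum(nexts.values()) + len(vocab)
        probs = {curr: (count + 1) / total for curr, count in nexts.items()}
        # Probability mass reserved for an unknown current word.
        probs[UNKNOWN] = 1 / total
        bigrams[prev] = probs

    # Probability used when the previous word itself is unknown (assumed uniform here).
    bigrams[UNKNOWN] = {UNKNOWN: 1 / len(vocab)}
    return bigrams
```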
We then test bigram_smoothing() with the same text file:
L1: Import the test_bigram() function from the ngram_models package.
Finally, we test the bigram estimation using smoothing for unknown sequences:
L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.
L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.
L8: The tuple word is unpacked and passed as the second and third parameters.
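A sketch of the corresponding lookup and test, reusing bigram_smoothing() and UNKNOWN from the sketch above; the helper name smoothed_bigram() and the example word pairs are hypothetical:

```python
def smoothed_bigram(bigrams: dict[str, dict[str, float]], prev: str, curr: str) -> float:
    # Retrieve the distribution for the previous word, or None if it is unknown.
    probs = bigrams.get(prev)
    if probs is None:
        # Unknown previous word: fall back to the reserved probability.
        return bigrams[UNKNOWN][UNKNOWN]
    # Unknown current word: fall back to the probability reserved under this previous word.
    return probs.get(curr, probs[UNKNOWN])

bigrams = bigram_smoothing('dat/chronicles_of_narnia.txt')
for pair in [('the', 'Lion'), ('Jinho', 'Choi')]:
    # The tuple is unpacked and passed as the second and third arguments.
    print(pair, smoothed_bigram(bigrams, *pair))
```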
However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:
Source: smoothing.py
Update: 2023-10-13
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.
In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I am a teacher", and selecting the one with the highest probability.
The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows:

$$P(\text{student} \mid \text{I am a}) = \frac{\#(\text{I am a student})}{\#(\text{I am a})}$$
The joint probability of the word sequence "I am a student" can be measured as follows:

$$P(\text{I am a student}) = \frac{\#(\text{I am a student})}{\sum_{(w_1, w_2, w_3, w_4)} \#(w_1\, w_2\, w_3\, w_4)}$$

where the denominator sums the counts of all 4-grams in the corpus.
Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the chain rule and the Markov assumption.
By applying the chain rule, the above joint probability can be decomposed into:

$$P(\text{I am a student}) = P(\text{I}) \cdot P(\text{am} \mid \text{I}) \cdot P(\text{a} \mid \text{I am}) \cdot P(\text{student} \mid \text{I am a})$$
The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

$$P(\text{I am a student}) \approx P(\text{I}) \cdot P(\text{am} \mid \text{I}) \cdot P(\text{a} \mid \text{am}) \cdot P(\text{student} \mid \text{a})$$
The joint probability can now be measured by the product of unigram and bigram probabilities.
How do the chain rule and Markov assumption simplify the estimation of sequence probability?
This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.
This enhancement allows us to express the sequence probability as a simple product of bigram probabilities:

$$P(w_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

where $w_0$ is the artificial token marking the beginning of the text.
The multiplication of numerous probabilities can often be computationally infeasible due to slow processing and the potential for numerical underflow when the product falls below the smallest value the system can represent. In practice, logarithmic probabilities are computed instead:

$$\log P(w_{1:n}) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$$
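The following is a minimal sketch of this practice, assuming a hypothetical bigram_prob(prev, curr) callback that returns a strictly positive (e.g., smoothed) probability:

```python
import math

def sequence_log_probability(words: list[str], bigram_prob) -> float:
    # bigram_prob(prev, curr) is assumed to return P(curr | prev) > 0, e.g., a smoothed estimate.
    # Summing log probabilities avoids the numerical underflow of multiplying many small numbers.
    return sum(math.log(bigram_prob(prev, curr)) for prev, curr in zip(words, words[1:]))
```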
Update: 2024-01-05
An n-gram is a contiguous sequence of n items from text data. These items are typically words, tokens, or characters, depending on the context and the specific application.
For the sentence "I'm a computer scientist.", [1-3]-grams can be extracted as follows:
1-gram (unigram): {"I'm", "a", "computer", "scientist."}
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, whereas they should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.
What are the potential issues of using n-grams without proper tokenization?
Given a large corpus, a unigram model calculates the probability of each word $w_i$ as follows ($\#(w_i)$: the total occurrences of $w_i$ in the corpus, $V$: a set of all word types in the corpus):

$$P(w_i) = \frac{\#(w_i)}{\sum_{w \in V} \#(w)}$$
Let us define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
What are the benefits of processing the file line by line, as shown in L6-8, as opposed to processing the entire file at once using unigrams.update(open(filepath).read().split())?
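A minimal sketch of how unigram_count() might look, assuming whitespace tokenization as in the examples above and line-by-line processing:

```python
from collections import Counter

def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    with open(filepath) as fin:
        # Reading line by line avoids loading the entire file into memory at once.
        for line in fin:
            unigrams.update(line.split())
    return unigrams
```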
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
L5: Calculate the total count of all unigrams in the text.
L6: Return a dictionary where each word is a key and its probability is the value.
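A minimal sketch of unigram_estimation() along these lines, reusing the unigram_count() sketch above:

```python
def unigram_estimation(filepath: str) -> dict[str, float]:
    counts = unigram_count(filepath)
    # Total number of tokens in the corpus.
    total = sum(counts.values())
    # Relative frequency of each word type.
    return {word: count / total for word, count in counts.items()}
```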
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
L12: Pass the unigram_estimation() function as the second argument.
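A minimal sketch of a test_unigram() function along these lines; the Unigram type is approximated here as dict[str, float], and the output formatting is illustrative:

```python
from typing import Callable

def test_unigram(filepath: str, estimator: Callable[[str], dict[str, float]]):
    unigrams = estimator(filepath)
    # Sort unigram-probability pairs by probability in descending order.
    unigram_list = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)
    # Among the 300 most probable unigrams, print those that look like proper nouns:
    # capitalized words whose lowercased forms never occur in the corpus.
    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>12} {prob:.6f}')

test_unigram('dat/chronicles_of_narnia.txt', unigram_estimation)
```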
What are the top 10 unigrams with the highest probabilities? What practical value do these unigrams have in terms of language modeling?
Let us define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
L9: Iterate through the words, starting from the second word (index 1) in each line.
L10: Update the frequency of the current bigram.
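A minimal sketch of bigram_count(), again assuming whitespace tokenization:

```python
from collections import Counter, defaultdict

def bigram_count(filepath: str) -> dict[str, Counter]:
    # Map each previous word to a Counter of the words that follow it.
    bigrams = defaultdict(Counter)
    with open(filepath) as fin:
        for line in fin:
            words = line.split()
            # Start from the second word (index 1) so that every word has a previous word.
            for i in range(1, len(words)):
                bigrams[words[i - 1]][words[i]] += 1
    return bigrams
```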
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
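A minimal sketch of bigram_estimation(), reusing the bigram_count() sketch above:

```python
def bigram_estimation(filepath: str) -> dict[str, dict[str, float]]:
    counts = bigram_count(filepath)
    bigrams = {}
    for prev, nexts in counts.items():
        # Total count of all bigrams sharing the same previous word.
        total = sum(nexts.values())
        # P(curr | prev) = count(prev, curr) / count(prev, *)
        bigrams[prev] = {curr: count / total for curr, count in nexts.items()}
    return bigrams
```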
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
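A minimal sketch of test_bigram() along these lines; the previous words chosen for display ('I', 'the', 'said') are arbitrary examples:

```python
from typing import Callable

def test_bigram(filepath: str, estimator: Callable[[str], dict[str, dict[str, float]]]):
    bigrams = estimator(filepath)
    # The previous words below are arbitrary; any word occurring in the corpus would do.
    for prev in ['I', 'the', 'said']:
        # Sort the words following `prev` by probability in descending order.
        bigram_list = sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)
        print(prev)
        for curr, prob in bigram_list[:10]:
            print(f'    {curr:>12} {prob:.6f}')

test_bigram('dat/chronicles_of_narnia.txt', bigram_estimation)
```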
The entropy rate (per-word entropy), $H_r(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H_r(W) = \frac{1}{n} H(W)$$
In theory, there is an infinite number of unobserved word sequences in the language $L$. To estimate the true entropy of $L$, we need to take the limit of $H_r(W)$ as $n$ approaches infinity:

$$H(L) = \lim_{n \to \infty} H_r(W) = \lim_{n \to \infty} \frac{1}{n} H(W)$$
The Shannon-McMillan-Breiman theorem implies that if the language $L$ is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(L)$, because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its probability.
The bigram model in the previous section is stationary because all probabilities rely on the same condition, $P(w_i \mid w_{i-1})$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, $H(L)$ can be approximated:

$$H(L) \approx -\lim_{n \to \infty} \frac{1}{n} \log_2 P(W)$$
Consequently, $H_r(W)$ is approximated as follows, where $n$ is sufficiently large:

$$H_r(W) \approx -\frac{1}{n} \log_2 P(W)$$
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W$ is measured as:

$$PP(W) = P(W)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(W)}}$$
Hence, the higher $P(W)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.
Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H_r(W)$:

$$PP(W) = 2^{H_r(W)} = 2^{-\frac{1}{n} \log_2 P(W)} = P(W)^{-\frac{1}{n}}$$
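A minimal sketch of this relationship, converting a base-2 sequence log probability into perplexity; the helper name perplexity() is illustrative:

```python
import math

def perplexity(log2_prob: float, n: int) -> float:
    # log2_prob is the base-2 log probability of a sequence of n words.
    entropy_rate = -log2_prob / n   # approximated per-word entropy H_r(W)
    return 2 ** entropy_rate        # PP(W) = 2^{H_r(W)} = P(W)^(-1/n)

# A model that assigns probability 1/4 to each of 10 words has perplexity 4.
print(perplexity(10 * math.log2(0.25), 10))   # 4.0
```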
Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of 4 punctuation tokens are permitted).
In this task, the goal is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word $w_i$, a potential strategy is as follows:
Identify the next word $w_{i+1}$ whose bigram probability $P(w_{i+1} \mid w_i)$ is maximized.
If $w_{i+1}$ fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.
Make $w_{i+1}$ the new input word and repeat from step #1 until you reach the specified sequence length.
The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to the base $e$, provided as the math.log() function in Python.
Thus, the probability of an unknown bigram, where $w_{i-1}$ is known but $w_i$ is unknown, is calculated as follows:

$$P_{+1}(w_u \mid w_{i-1}) = \frac{0 + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|} = \frac{1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V|}$$
What does the Laplace-smoothed bigram probability of $w_i$ represent when the previous word $w_{i-1}$ is unknown? What is a potential problem with this estimation?
Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a previous word $w_{i-1}$ is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:
There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $|V| = 7$. Before Laplace smoothing, the bigram probabilities given a previous word $w_{i-1}$ in this corpus are estimated as follows:
The bigram distribution for $w_{i-1}$ can be normalized to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_{w_{i-1}}|$, to the denominator instead of $|V|$:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V_{w_{i-1}}|}$$
Consequently, the probability of an unknown bigram can be calculated with the normalization as follows:

$$P_{+1}(w_u \mid w_{i-1}) = \frac{1}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w) + |V_{w_{i-1}}|}$$
Once you apply this normalization to $w_{i-1}$ in the above example, the sum of its bigram probabilities becomes 1:
A major drawback of this normalization is that the probability cannot be measured when $w_{i-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams, $\min_{w' \in V} P_{+1}(w_u \mid w')$, as the bigram probability when the previous word is unknown.
Thus, the probability of any given word sequence can be measured as:
The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue because measuring $P(w_i \mid w_{1:i-1})$ is as challenging as measuring $P(w_{1:n})$.
Let us consider the unigram probabilities of "the" and "The". In general, "the" appears more frequently than "The" such that:

$$P(\text{the}) > P(\text{The})$$
Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial words of the text, denoted as $P(\text{the} \mid w_0)$ and $P(\text{The} \mid w_0)$, respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

$$P(\text{The} \mid w_0) > P(\text{the} \mid w_0)$$
Thus, to predict a more probable initial word, it is better to consider the bigram probability $P(w_1 \mid w_0)$ rather than the unigram probability $P(w_1)$ when measuring the sequence probability.
Is it worth considering the end of the text by introducing another artificial token, $w_{n+1}$, to improve last-word prediction by multiplying the above product with $P(w_{n+1} \mid w_n)$?
L1: Import the Unigram from the package.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with a text file:
L1: Import the type from the typing module.
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows ($\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}$ and $w_i$ in the corpus in that order, $V_{w_{i-1}}$: a set of all word types occurring after $w_{i-1}$):

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w)}$$
L1: Import the class from the package.
L2: Import the Bigram from the package.
Source: Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.