Entropy is a measure of the uncertainty, randomness, or information content of a random variable or a probability distribution. The entropy of a random variable $X = \{x_1, \ldots, x_n\}$ is defined as:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
$P(x_i)$ is the probability of $x_i$ under the distribution. The self-information of $x_i$ is defined as $-\log_2 P(x_i)$, which measures how much information is gained when $x_i$ occurs. The negative sign indicates that as the probability of $x_i$ increases, its self-information decreases.
Entropy has several properties, including:
It is non-negative: $H(X) \geq 0$.
It is at its minimum, $H(X) = 0$, when $X$ is entirely predictable (all probability mass on a single outcome).
It is at its maximum, $H(X) = \log_2 n$, when all outcomes of $X$ are equally likely, as illustrated in the sketch below.
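To make these properties concrete, here is a minimal sketch (in Python, with base-2 logarithms) that computes the entropy of a few toy distributions; the entropy() helper below is illustrative and not part of the course code:

import math

def entropy(probs: list[float]) -> float:
    # H(X) = -sum P(x) * log2 P(x); zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0   -> minimum: the outcome is fully predictable
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0   -> maximum for 4 outcomes: log2(4)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 -> somewhere in between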
Q10: Why is logarithmic scale used to measure self-information in entropy calculations?
Sequence Entropy
Sequence entropy measures the unpredictability or information content of a word sequence, i.e., how uncertain or random the sequence is.
Assume a long sequence of words, $W = w_1 \cdots w_n$, concatenating the entire text from a language $L$. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ be a set of all possible sequences derived from $L$, where $S_1$ is the shortest sequence (a single word) and $S_m$ is the longest sequence. Then, the entropy of $W$ can be measured as follows:

$$H(W) = -\sum_{S \in \mathcal{S}} P(S) \log_2 P(S)$$
The entropy rate (per-word entropy), $H_r(W)$, can be measured by dividing $H(W)$ by the total number of words $n$:

$$H_r(W) = \frac{1}{n} H(W)$$
In theory, there is an infinite number of unobserved word sequences in the language $L$. To estimate the true entropy of $L$, we need to take the limit of $H_r(W)$ as $n$ approaches infinity:

$$H(L) = \lim_{n \to \infty} H_r(W) = \lim_{n \to \infty} \frac{1}{n} H(W)$$
The Shannon-McMillan-Breiman theorem implies that if the language is both stationary and ergodic, considering a single sequence that is sufficiently long can be as effective as summing over all possible sequences to measure $H(L)$, because a long sequence of words naturally contains numerous shorter sequences, and each of these shorter sequences reoccurs within the longer sequence according to its respective probability.
The bigram model in the previous section is stationary because all of its probabilities rely on the same condition, the immediately preceding word $w_{i-1}$. In reality, however, this assumption does not hold. The probability of a word's occurrence often depends on a range of other words in the context, and this contextual influence can vary significantly from one word to another.
By applying this theorem, $H(L)$ can be approximated:

$$H(L) \approx -\frac{1}{n} \log_2 P(w_1 \cdots w_n)$$
Consequently, $H(L)$ is approximated as follows, where $P(w_1 \cdots w_n)$ is estimated by a language model such as the bigram model:

$$H(L) \approx -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{i-1})$$
Q11: What indicates high entropy in a text corpus?
Perplexity
Perplexity measures how well a language model can predict a set of words based on the likelihood of those words occurring in a given text. The perplexity of a word sequence $W = w_1 \cdots w_n$ is measured as:

$$PP(W) = P(w_1 \cdots w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1 \cdots w_n)}}$$
Hence, the higher $P(w_1 \cdots w_n)$ is, the lower its perplexity becomes, implying that the language model is "less perplexed" and more confident in generating $W$.
Perplexity, $PP(W)$, can be directly derived from the approximated entropy rate, $H(L)$:

$$PP(W) = 2^{H(L)} = 2^{-\frac{1}{n} \log_2 P(w_1 \cdots w_n)} = P(w_1 \cdots w_n)^{-\frac{1}{n}}$$
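The sketch below (an illustration, not part of the course code) assumes the per-word conditional probabilities of a sequence are already available, and shows numerically that computing $2^{H}$ from the average negative log2 probability gives the same value as $P(W)^{-1/n}$:

import math

def perplexity_from_probs(word_probs: list[float]) -> float:
    # Average negative log2 probability = approximated entropy rate H.
    n = len(word_probs)
    h = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** h

probs = [0.2, 0.1, 0.25, 0.05]                # made-up P(w_i | w_{i-1}) values
print(perplexity_from_probs(probs))           # ~7.95
print(math.prod(probs) ** (-1 / len(probs)))  # same value, computed as P(W)^(-1/n)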
Q12: What is the relationship between corpus entropy and language model perplexity?
Your goal is to build a bigram model using (1) Laplace smoothing with normalization and (2) initial word probabilities obtained by adding the artificial token $w_0$ at the beginning of every sentence.
Implementation
Create a file named language_models.py in the designated directory.
Define a function named bigram_model() that takes a file path pointing to the text file, and returns a dictionary of bigram probabilities estimated in the text file.
Use the following constants to indicate the unknown and initial probabilities:
Notes
Test your model using the provided corpus file (e.g., dat/chronicles_of_narnia.txt used in the previous sections).
Each line should be treated independently for bigram counting, such that the INIT token precedes the first word of each line.
Use normalization such that all probabilities sum to 1.0 for any given previous word.
Task 2: Sequence Generation
Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.
Implementation
In the same language_models.py file, define a function named sequence_generator() that takes the following parameters:
A bigram model (the resulting dictionary of Task 1)
The initial word (the first word to appear in the sequence)
The length of the sequence (the number of tokens in the sequence)
This function aims to generate a sequence of tokens that adheres to the following criteria:
It must have the precise number of tokens as specified.
Not more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of 4 punctuation tokens are permitted).
Excluding punctuation, there should be no redundant tokens in the sequence.
The goal of this task is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the current word $w_i$, a potential strategy is as follows (a sketch of this loop is given after the return-value description below):
Identify the next word $w_{i+1}$ such that the bigram probability $P(w_{i+1} \mid w_i)$ is maximized.
If $w_{i+1}$ fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, search for the next word whose bigram probability is the second highest. Repeat this process until you encounter a word that meets all the specified conditions.
Make $w_{i+1}$ the new $w_i$ and repeat step 1 until you reach the specified sequence length.
Finally, the function returns a tuple comprising the following two elements:
The list of tokens in the sequence
The log-likelihood estimating the sequence probability using the bigram model. Use the logarithm to the base $e$ (natural log), provided as the math.log() function in Python.
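Below is a rough sketch of one such greedy loop. It assumes the bigram model is the nested dictionary built in Task 1, treats a token as punctuation when all of its characters are punctuation marks, and ignores corner cases such as the model's unknown-word entries; it illustrates the strategy above rather than serving as the reference solution.

import math
import string

def greedy_sequence(bigram: dict, init_word: str, length: int) -> tuple[list[str], float]:
    max_punct = length // 5                       # floor of 20% of the sequence length
    sequence, log_prob = [init_word], 0.0
    used, num_punct = {init_word}, 0
    while len(sequence) < length:
        prev = sequence[-1]
        # Candidate next words, sorted by bigram probability in descending order.
        candidates = sorted(bigram.get(prev, {}).items(), key=lambda x: x[1], reverse=True)
        for curr, prob in candidates:
            is_punct = all(c in string.punctuation for c in curr)
            if is_punct and num_punct >= max_punct:
                continue                          # would exceed the punctuation cap
            if not is_punct and curr in used:
                continue                          # no redundant non-punctuation tokens
            sequence.append(curr)
            log_prob += math.log(prob)            # natural log, as with math.log()
            used.add(curr)
            num_punct += int(is_punct)
            break
        else:
            break                                 # no valid continuation exists
    return sequence, log_prob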
Extra Credit
Create a function called sequence_generator_plus() that takes the same input parameters as the existing sequence_generator() function. This new function should generate sequences with higher probability scores and better semantic coherence compared to the original implementation.
Submission
Commit and push the language_models.py file to your GitHub repository.
Rubric
Task 1: Bigram Modeling (5 points)
Task 2: Sequence Generator (4.6 points), Extra Credit (2 points)
Concept Quiz (2.4 points)
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution based on observed data. MLE aims to find the values of the model's parameters that make the observed data most probable under the assumed statistical model.
In the previous section, you have already used MLE to estimate unigram and bigram probabilities. In this section, we will apply MLE to estimate sequence probabilities.
Sequence Probability
Let us examine a model that takes a sequence of words and generates the next word. Given a word sequence "I am a", the model aims to predict the most likely next word by estimating the probabilities associated with potential continuations, such as "I am a student" or "I am a teacher", and selecting the one with the highest probability.
The conditional probability of the word "student" occurring after the word sequence "I am a" can be estimated as follows, where $\#(\cdot)$ denotes the number of times a word sequence occurs in the corpus:

$$P(\textit{student} \mid \textit{I am a}) = \frac{\#(\textit{I am a student})}{\#(\textit{I am a})}$$
The joint probability of the word sequence "I am a student" can be measured as follows, where the denominator sums the counts of all 4-word sequences in the corpus:

$$P(\textit{I am a student}) = \frac{\#(\textit{I am a student})}{\sum_{w_1 w_2 w_3 w_4} \#(w_1\, w_2\, w_3\, w_4)}$$
Counting the occurrences of n-grams, especially when n can be indefinitely long, is neither practical nor effective, even with a vast corpus. In practice, we address this challenge by employing two techniques: the chain rule and the Markov assumption.
Q7: What are the key differences between conditional and joint probabilities in sequence modeling, and how are they practically applied?
Chain Rule
By applying the chain rule, the above joint probability can be decomposed into:

$$P(\textit{I am a student}) = P(\textit{I}) \cdot P(\textit{am} \mid \textit{I}) \cdot P(\textit{a} \mid \textit{I am}) \cdot P(\textit{student} \mid \textit{I am a})$$
Thus, the probability of any given word sequence $w_1, \ldots, w_n$ can be measured as:

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
The chain rule effectively decomposes the original problem into subproblems; however, it does not resolve the issue, because measuring $P(w_i \mid w_1, \ldots, w_{i-1})$ is as challenging as measuring $P(w_1, \ldots, w_n)$.
Markov Assumption
The Markov assumption (aka. Markov property) states that the future state of a system depends only on its present state and is independent of its past states, given its present state. In the context of language modeling, it implies that the next word generated by the model should depend solely on the current word. This assumption dramatically simplifies the chain rule mentioned above:

$$P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
The joint probability can now be measured by the product of unigram and bigram probabilities.
Q8: How do the Chain Rule and Markov Assumption simplify the estimation of sequence probability?
Initial Word Probability
Let us consider the unigram probabilities of $P(\textit{the})$ and $P(\textit{The})$. In general, "the" appears more frequently than "The" such that:

$$P(\textit{the}) > P(\textit{The})$$
Let $w_0$ be an artificial token indicating the beginning of the text. We can then measure the bigram probabilities of "the" and "The" appearing as the initial words of the text, denoted as $P(\textit{the} \mid w_0)$ and $P(\textit{The} \mid w_0)$, respectively. Since the first letter of the initial word in formal English writing is conventionally capitalized, it is likely that:

$$P(\textit{The} \mid w_0) > P(\textit{the} \mid w_0)$$
This is not necessarily true if the model is trained on informal writings, such as social media data, where conventional capitalization is often neglected.
Thus, to predict a more probable initial word, it is better to consider the bigram probability $P(w_1 \mid w_0)$ rather than the unigram probability $P(w_1)$ when measuring sequence probability:

$$P(w_1, \ldots, w_n) \approx P(w_1 \mid w_0) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
This enhancement allows us to elaborate the sequence probability as a simple product of bigram probabilities:

$$P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
The multiplication of numerous probabilities can often be computationally problematic, both because it is slow and because the product can underflow the limits of floating-point precision. In practice, logarithmic probabilities are summed instead:

$$\log P(w_1, \ldots, w_n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$$
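As a small illustration (with made-up probabilities and a hypothetical bigram lookup, not taken from the corpus), the log-probability of a sentence is simply the sum of the logs of its bigram probabilities:

import math

# Hypothetical bigram probabilities P(w_i | w_{i-1}), including the artificial initial token w0.
bigram = {('w0', 'The'): 0.09, ('The', 'lion'): 0.02, ('lion', 'roared'): 0.05}

sentence = ['The', 'lion', 'roared']
log_prob, prev = 0.0, 'w0'
for word in sentence:
    log_prob += math.log(bigram[(prev, word)])   # sum of log probabilities
    prev = word

print(log_prob)               # ~ -9.32
print(math.exp(log_prob))     # 9e-05 = 0.09 * 0.02 * 0.05, the product of the probabilities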
Q9: Is it worth considering the end of the text by introducing another artificial token, $w_{n+1}$, to improve last-word prediction by multiplying the above product with $P(w_{n+1} \mid w_n)$?
Smoothing
Unigram Smoothing
The unigram model in the previous section faces a challenge when confronted with words that do not occur in the corpus, resulting in a probability of 0. One common technique to address this challenge is smoothing, which tackles issues such as zero probabilities, data sparsity, and overfitting that emerge during probability estimation and predictive modeling with limited data.
Laplace smoothing (aka. add-one smoothing) is a simple yet effective technique that avoids zero probabilities and distributes the probability mass more evenly. It adds a count of 1 to every word and recalculates the unigram probabilities:

$$P_{+1}(w_i) = \frac{\#(w_i) + 1}{\left(\sum_{w \in V} \#(w)\right) + |V|}$$

where $\#(w)$ is the count of $w$ in the corpus and $V$ is the vocabulary.
Thus, the probability of any unknown word $u \notin V$ with Laplace smoothing is calculated as follows:

$$P_{+1}(u) = \frac{0 + 1}{\left(\sum_{w \in V} \#(w)\right) + |V|}$$
The unigram probability of an unknown word is guaranteed to be lower than the unigram probabilities of any known words, whose counts have been adjusted to be greater than 1.
Note that the sum of all unigram probabilities adjusted by Laplace smoothing is still 1:

$$\sum_{w \in V} P_{+1}(w) = \frac{\sum_{w \in V} (\#(w) + 1)}{\left(\sum_{w \in V} \#(w)\right) + |V|} = 1$$
Let us define a function unigram_smoothing() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the unigram_count() function from the src.ngram_models package.
L3: Define a constant representing the unknown word.
L7: Increment the total count by the vocabulary size.
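Since the code block for unigram_smoothing() is not reproduced here, the following is a sketch consistent with the notes above; the value of the UNKNOWN constant and the src.types import are assumptions:

from src.ngram_models import unigram_count
from src.types import Unigram

UNKNOWN = ''  # placeholder for the constant representing the unknown word

def unigram_smoothing(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    # Add 1 to every count; the total therefore grows by the vocabulary size.
    total = sum(counts.values()) + len(counts)
    probs = {word: (count + 1) / total for word, count in counts.items()}
    probs[UNKNOWN] = 1 / total   # probability reserved for any unknown word
    return probs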
We then test unigram_smoothing() with the text file dat/chronicles_of_narnia.txt:
L1: Import test_unigram() from the src.ngram_models package.
Compared to the unigram results without smoothing, the probabilities for these top unigrams have slightly decreased.
Q4: When applying Laplace smoothing, do unigram probabilities always decrease? If not, what conditions can cause a unigram's probability to increase?
The unigram probability of any word, including unknown ones, can then be retrieved by falling back to the UNKNOWN key:
L2: Use the get() method to retrieve the probability of the target word from probs. If the word is not present, default to the probability of the UNKNOWN token.
L2: Test a known word, 'Aslan', and an unknown word, 'Jinho'.
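The smoothed_unigram() function referenced above is likewise not reproduced here; a sketch consistent with the notes, reusing the Unigram type and UNKNOWN constant from the previous sketch, might look like this:

def smoothed_unigram(probs: Unigram, word: str) -> float:
    # Fall back to the UNKNOWN probability when the word is not in probs.
    return probs.get(word, probs[UNKNOWN])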
Bigram Smoothing
The bigram model can also be enhanced by applying Laplace smoothing:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\#(w_{i-1}) + |V|}$$

where $\#(w_{i-1}) = \sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w)$ is the total count of bigrams starting with $w_{i-1}$, and $V$ is the vocabulary containing all previous and current words.
Thus, the probability of an unknown bigram, where $w_{i-1}$ is known but $w_i$ is unknown, is calculated as follows:

$$P_{+1}(w_i \mid w_{i-1}) = \frac{0 + 1}{\#(w_{i-1}) + |V|}$$
Q5: What does the Laplace smoothed bigram probability of $w_i$ given $w_{i-1}$ represent when $w_{i-1}$ is unknown, and what is a potential problem with this estimation?
Let us define a function bigram_smoothing() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively, estimated by Laplace smoothing:
L1: Import the bigram_count() function from the src.ngram_models package.
L5: Create a set vocab containing all unique previous words $w_{i-1}$.
L6-7: Add all unique current words $w_i$ to vocab.
We then test bigram_smoothing() with the same text file:
L1: Import the test_bigram() function from the src.ngram_models package.
Finally, we test the bigram estimation using smoothing for unknown sequences:
L2: Retrieve the bigram probabilities of the previous word, or set it to None if not present.
L3: Return the probability of the current word given the previous word with smoothing. If the previous word is not present, return the probability for an unknown previous word.
L3: The tuple word is unpacked and passed as the second and third parameters.
Normalization
Unlike the unigram case, the sum of all bigram probabilities adjusted by Laplace smoothing given a word is not guaranteed to be 1. To illustrate this point, let us consider the following corpus comprising only two sentences:
There are seven word types in this corpus, {"I", "You", "a", "and", "are", "student", "students"}, such that $|V| = 7$. Before Laplace smoothing, the bigram probabilities of the words following any given previous word $w_{i-1}$ sum to 1, since they are estimated as:

$$\sum_{w \in V_{w_{i-1}}} P(w \mid w_{i-1}) = \sum_{w \in V_{w_{i-1}}} \frac{\#(w_{i-1}, w)}{\#(w_{i-1})} = 1$$

However, after applying Laplace smoothing, the bigram probabilities undergo significant changes, and their sum no longer equals 1:

$$\sum_{w \in V_{w_{i-1}}} P_{+1}(w \mid w_{i-1}) = \frac{\#(w_{i-1}) + |V_{w_{i-1}}|}{\#(w_{i-1}) + |V|} < 1 \quad \text{whenever } |V_{w_{i-1}}| < |V|$$
The bigram distribution for a previous word $w_{i-1}$ can be normalized to 1 by adding the total number of word types occurring after $w_{i-1}$, denoted as $|V_{w_{i-1}}|$, to the denominator instead of $|V|$:

$$P_{N}(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + 1}{\#(w_{i-1}) + |V_{w_{i-1}}|}$$
Consequently, the probability of an unknown bigram, where $u$ is a word that never occurs after $w_{i-1}$, can be calculated with the normalization as follows:

$$P_{N}(u \mid w_{i-1}) = \frac{1}{\#(w_{i-1}) + |V_{w_{i-1}}|}$$
For the above example, once you apply this normalization, the sum of the bigram probabilities for any given previous word becomes 1 again.
A major drawback of this normalization is that the probability cannot be measured when the previous word $w_{i-1}$ is unknown. Thus, we assign the minimum unknown probability across all bigrams as the bigram probability for the case where the previous word $u$ is unknown:

$$P_{N}(w_i \mid u) = \min_{w \in V} \frac{1}{\#(w) + |V_{w}|}$$
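A sketch of how this normalized variant might be implemented, mirroring the structure of bigram_smoothing() shown later in this section (the function name and the UNKNOWN placeholder are illustrative):

from src.ngram_models import bigram_count
from src.types import Bigram

UNKNOWN = ''  # placeholder for the constant representing the unknown word

def bigram_smoothing_normalized(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()
    for prev, ccs in counts.items():
        # Normalize with |V_prev| (word types seen after prev) instead of |V|,
        # so that the probabilities of the known words following prev sum to 1.
        total = sum(ccs.values()) + len(ccs)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d
    # Unknown previous word: the minimum unknown probability across all bigrams.
    bigrams[UNKNOWN] = min(d[UNKNOWN] for d in bigrams.values())
    return bigrams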
Q6: Why is it problematic when bigram probabilities following a given word don't sum to 1?
unigram = unigram_smoothing(corpus)
for word in ['Aslan', 'Jinho']:
    print(f'{word} {smoothed_unigram(unigram, word):.6f}')
Aslan 0.001796
Jinho 0.000002
from src.ngram_models import bigram_count, Bigram
def bigram_smoothing(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    vocab = set(counts.keys())
    for _, css in counts.items():
        vocab.update(css.keys())
    bigrams = dict()
    for prev, ccs in counts.items():
        total = sum(ccs.values()) + len(vocab)
        d = {curr: (count + 1) / total for curr, count in ccs.items()}
        d[UNKNOWN] = 1 / total
        bigrams[prev] = d
    bigrams[UNKNOWN] = 1 / len(vocab)
    return bigrams
from src.ngram_models import test_bigram
corpus = 'dat/chronicles_of_narnia.txt'
test_bigram(corpus, bigram_smoothing)
I
'm 0.020590
do 0.019136
've 0.011143
was 0.010598
have 0.009629
am 0.009084
'll 0.008236
think 0.008115
'd 0.006661
know 0.006540
the
same 0.008403
other 0.007591
King 0.007096
Witch 0.006673
whole 0.005119
others 0.005084
first 0.004978
Dwarf 0.004872
door 0.004837
great 0.004837
said
the 0.039038
, 0.018270
Lucy 0.014312
Edmund 0.011206
Caspian 0.010049
Peter 0.009805
Jill 0.008709
. 0.008648
Digory 0.007734
Aslan 0.007491
def smoothed_bigram(probs: Bigram, prev: str, curr: str) -> float:
    d = probs.get(prev, None)
    return probs[UNKNOWN] if d is None else d.get(curr, d[UNKNOWN])
bigram = bigram_smoothing(corpus)
for word in [('Aslan', 'is'), ('Aslan', 'Jinho'), ('Jinho', 'is')]:
    print(f'{word} {smoothed_bigram(bigram, *word):.6f}')
A language model is a computational model designed to understand, generate, and predict human language. It captures language patterns, learns the likelihood of a specific term occurring in a given context, and assigns probabilities to word sequences through training on extensive text data.
For example, the following n-grams can be extracted from the word sequence "I'm a computer scientist." when it is tokenized by whitespace:
2-gram (bigram): {"I'm a", "a computer", "computer scientist."}
3-gram (trigram): {"I'm a computer", "a computer scientist."}
In the above example, "I'm" and "scientist." are recognized as individual tokens, which should have been tokenized as ["I", "'m"] and ["scientist", "."], respectively.
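As a quick illustration (not part of the course code), whitespace splitting followed by window slicing produces exactly the n-grams above, including the problematic tokens "I'm" and "scientist.":

def ngrams(tokens: list[str], n: int) -> list[str]:
    # All contiguous n-token windows, joined back into strings.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I'm a computer scientist.".split()
print(ngrams(tokens, 2))   # ["I'm a", 'a computer', 'computer scientist.']
print(ngrams(tokens, 3))   # ["I'm a computer", 'a computer scientist.']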
Q1: What are the advantages of splitting "I" and "'m" as two separate tokens, versus recognizing "I'm" as one token?
Unigram Estimation
Given a large corpus, a unigram model assumes word independence and calculates the probability of each word $w_i$ as:

$$P(w_i) = \frac{\#(w_i)}{\sum_{w \in V} \#(w)}$$
Where:
$\#(w_i)$: the count of word $w_i$ in the corpus.
$V$: the vocabulary (set of all unique words).
The denominator represents the total word count.
Let us define the Unigram type:
L1: Import the TypeAlias type from the typing module.
L3: A dictionary where the key is a unigram and the value is its probability.
Let us also define a function unigram_count() that takes a file path and returns a Counter with all unigrams and their counts in the file as keys and values, respectively:
We then define a function unigram_estimation() that takes a file path and returns a dictionary with unigrams and their probabilities as keys and values, respectively:
L3: Calculate the total count of all unigrams in the text.
L4: Return a dictionary where each word is a key and its probability is the value.
Finally, let us define a function test_unigram() that takes a file path as well as an estimator function, and test unigram_estimation() with the text file dat/chronicles_of_narnia.txt:
L1: Import the Callable type from the collections.abc module.
L3: The second argument accepts a function that takes a string and returns a Unigram.
L4: Call the estimator with the text file and store the result in unigrams.
L2: Pass the unigram_estimation() function as the second argument.
This distribution shows the most common unigrams in the text meeting the conditions in L8, dominated by the first-person pronoun "I", followed by proper nouns - specifically character names such as "Aslan", "Lucy", and "Edmund".
Q2: What advantages do unigram probabilities have over raw word counts?
Bigram Estimation
A bigram model calculates the conditional probability of the current word $w_i$ given the previous word $w_{i-1}$ as follows:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\sum_{w \in V_{w_{i-1}}} \#(w_{i-1}, w)}$$

Where:
$\#(w_{i-1}, w_i)$: the total occurrences of $w_{i-1}\, w_i$ in the corpus in that order.
$V_{w_{i-1}}$: a set of all word types occurring after $w_{i-1}$.
Let us define the Bigram type:
A dictionary where the key is a previous word $w_{i-1}$ and the value is either a nested dictionary representing the unigram distribution of all $w_i$ given $w_{i-1}$, or a float for the unknown-word probability.
Let us also define a function bigram_count() that takes a file path and returns a dictionary with all bigrams and their counts in the file as keys and values, respectively:
L1: Import the Counter and defaultdict classes from the collections package.
L2: Import the Bigram type from the src.types package.
L5: Create a defaultdict with Counters as default values to store bigram frequencies.
We then define a function bigram_estimation() that takes a file path and returns a dictionary with bigrams and their probabilities as keys and values, respectively:
L8: Calculate the total count of all bigrams with the same previous word.
L9: Calculate and store the probabilities of each current word given the previous word.
Finally, let us define a function test_bigram() that takes a file path and an estimator function, and test bigram_estimation() with the text file:
L2: Call the bigram_estimation() function with the text file and store the result.
L5: Create a bigram list given the previous word, sorted by probability in descending order.
L6: Iterate through the top 10 bigrams with the highest probabilities for the previous word.
Q3: What NLP tasks can benefit from bigram estimation over unigram estimation?
References
Speech and Language Processing (3rd ed. draft), Jurafsky and Martin.
L5: Create a list of unigram-probability pairs, unigram_list, sorted by probability in descending order.
L7: Iterate through the top 300 unigrams with the highest probabilities.
L8: Check if the word starts with an uppercase letter and its lowercase version is not in unigrams (aiming to search for proper nouns).
L9: Iterate through the words, starting from the second word (index 1) in each line.
from typing import TypeAlias
Unigram: TypeAlias = dict[str, float]
from collections import Counter
def unigram_count(filepath: str) -> Counter:
    unigrams = Counter()
    for line in open(filepath):
        words = line.split()
        unigrams.update(words)
    return unigrams
def unigram_estimation(filepath: str) -> Unigram:
    counts = unigram_count(filepath)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
from collections.abc import Callable
def test_unigram(filepath: str, estimator: Callable[[str], Unigram]):
    unigrams = estimator(filepath)
    unigram_list = [(word, prob) for word, prob in sorted(unigrams.items(), key=lambda x: x[1], reverse=True)]
    for word, prob in unigram_list[:300]:
        if word[0].isupper() and word.lower() not in unigrams:
            print(f'{word:>10} {prob:.6f}')
corpus = 'dat/chronicles_of_narnia.txt'
test_unigram(corpus, unigram_estimation)
I 0.010543
Aslan 0.001850
Lucy 0.001815
Edmund 0.001409
Narnia 0.001379
Caspian 0.001338
Jill 0.001262
Peter 0.001034
Shasta 0.000928
Digory 0.000925
Eustace 0.000877
Susan 0.000654
Tirian 0.000601
Polly 0.000547
Aravis 0.000537
Bree 0.000492
Puddleglum 0.000492
Scrubb 0.000482
Andrew 0.000406
Bigram: TypeAlias = dict[str, Unigram | float]
from collections import Counter, defaultdict
from src.types import Bigram
def bigram_count(filepath: str) -> dict[str, Counter]:
    bigrams = defaultdict(Counter)
    for line in open(filepath):
        words = line.split()
        for i in range(1, len(words)):
            bigrams[words[i - 1]].update([words[i]])
    return bigrams
from src.types import Bigram
def bigram_estimation(filepath: str) -> Bigram:
    counts = bigram_count(filepath)
    bigrams = dict()
    for prev, ccs in counts.items():
        total = sum(ccs.values())
        bigrams[prev] = {curr: count / total for curr, count in ccs.items()}
    return bigrams
def test_bigram(filepath: str, estimator: Callable[[str], Bigram]):
    bigrams = estimator(filepath)
    for prev in ['I', 'the', 'said']:
        print(prev)
        bigram_list = [(curr, prob) for curr, prob in sorted(bigrams[prev].items(), key=lambda x: x[1], reverse=True)]
        for curr, prob in bigram_list[:10]:
            print("{:>10} {:.6f}".format(curr, prob))
test_bigram(corpus, bigram_estimation)
I
'm 0.081628
do 0.075849
've 0.044065
was 0.041897
have 0.038045
am 0.035878
'll 0.032507
think 0.032025
'd 0.026246
know 0.025765
the
same 0.014846
other 0.013405
King 0.012528
Witch 0.011776
whole 0.009020
others 0.008958
first 0.008770
Dwarf 0.008582
door 0.008519
great 0.008519
said
the 0.157635
, 0.073645
Lucy 0.057635
Edmund 0.045074
Caspian 0.040394
Peter 0.039409
Jill 0.034975
. 0.034729
Digory 0.031034
Aslan 0.030049