Homework

Task 1

Your goal is to develop a bigram model that uses the following techniques:

  • Laplace smoothing with normalization

  • Initial word probability, measured by adding the artificial token w_0 at the beginning of every sentence.
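
Concretely, with Laplace (add-one) smoothing, the bigram probability can be estimated as P(w'|w) = (count(w, w') + 1) / (count(w) + V), where V is the number of distinct word types; the added counts guarantee that every bigram, including unseen ones, receives a small nonzero probability, and the enlarged denominator keeps each distribution normalized.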

Test your model using chronicles_of_narnia.txt.

Steps

  1. Create a language_modeling.py file in the src/homework/ directory.

  2. Define a function named bigram_model() that takes a file path pointing to the text file and returns a dictionary of bigram probabilities found in the text file.

  3. Use the following constants to indicate the unknown and initial probabilities:

```python
UNKNOWN = ''
INIT = '[INIT]'
```
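
The following is a minimal sketch of one way bigram_model() could look, assuming whitespace tokenization with one sentence per line; the tokenization and the exact way UNKNOWN is folded into the smoothing are assumptions, not requirements of the assignment.

```python
from collections import Counter, defaultdict

UNKNOWN = ''
INIT = '[INIT]'

def bigram_model(filepath: str) -> dict:
    """Sketch: bigram probabilities with Laplace smoothing (assumptions noted above)."""
    # Count bigrams; INIT is prepended so P(w | INIT) captures initial-word probability.
    counts = defaultdict(Counter)
    with open(filepath) as fin:
        for line in fin:
            tokens = line.split()  # assumption: whitespace-separated tokens, one sentence per line
            if not tokens:
                continue
            prev = INIT
            for token in tokens:
                counts[prev][token] += 1
                prev = token

    # Laplace smoothing with normalization: add one count per word type,
    # plus one extra type (UNKNOWN) that absorbs all unseen continuations.
    vocab_size = len({w for nexts in counts.values() for w in nexts}) + 1
    model = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values()) + vocab_size
        probs = {w: (c + 1) / total for w, c in nexts.items()}
        probs[UNKNOWN] = 1 / total  # shared probability of any unseen word
        model[prev] = probs
    return model
```

Under this sketch, model[INIT] holds the initial-word distribution, and model[w][UNKNOWN] gives the smoothed probability of any word never observed after w.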

Task 2

Your goal is to write a function that takes a word and generates a sequence that begins with that word.

Steps

In language_modeling.py, define a function named sequence_generator() that takes the following parameters:

  1. A bigram model (the resulting dictionary of Task 1)

  2. The initial word (the first word to appear in the sequence)

  3. The length of the sequence (the number of tokens in the sequence)

This function aims to generate a sequence of tokens that adheres to the following criteria:

  • It must contain exactly the specified number of tokens.

  • No more than 20% of the tokens can be punctuation; use the floor of 20%. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens is permitted within the sequence; if the sequence length is 21, a maximum of floor(21/5) = 4 punctuation tokens is permitted.

  • Excluding punctuation, there should be no redundant tokens in the sequence.

In this task, the goal is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search approach rather than an exhaustive one. Given the input word w, a potential strategy is as follows (a code sketch appears after the output specification below):

  1. Identify the next word w' such that the bigram probability P(w'|w) is maximized.

  2. If w' fulfills all the stipulated conditions, include it in the sequence and proceed. Otherwise, move on to the word with the next-highest bigram probability. Repeat this process until you encounter a word that meets all the specified conditions.

  3. Set w = w' and repeat from step 1 until you reach the specified sequence length.

Finally, the function returns a tuple comprising the following two elements:

  • The list of tokens in the sequence

  • The log-likelihood estimating the sequence probability under the bigram model. Use the natural logarithm (base e), provided as the math.log() function in Python.
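
One possible greedy implementation of the strategy above is sketched below. The punctuation test (every character in string.punctuation) and the treatment of the initial word are assumptions; the model dictionary and the UNKNOWN constant come from Task 1.

```python
import math
import string

def sequence_generator(model: dict, init_word: str, length: int) -> tuple:
    """Sketch: greedy generation over the Task 1 bigram model."""
    def is_punct(token):
        return all(ch in string.punctuation for ch in token)  # assumption

    max_punct = length // 5                       # floor of 20%
    sequence, log_prob, word = [init_word], 0.0, init_word
    num_punct = 1 if is_punct(init_word) else 0
    used = set() if is_punct(init_word) else {init_word}

    while len(sequence) < length:
        # Candidates in descending order of bigram probability.
        candidates = sorted(model.get(word, {}).items(),
                            key=lambda kv: kv[1], reverse=True)
        for cand, prob in candidates:
            if cand == UNKNOWN:
                continue
            if is_punct(cand):
                if num_punct >= max_punct:
                    continue                      # punctuation budget exhausted
                num_punct += 1
            elif cand in used:
                continue                          # no repeated non-punctuation tokens
            else:
                used.add(cand)
            sequence.append(cand)
            log_prob += math.log(prob)            # natural log, per the spec
            word = cand
            break
        else:
            break  # dead end; a complete solution might handle this case
    return sequence, log_prob
```

For example, a hypothetical call sequence_generator(model, 'I', 20) would return a 20-token sequence starting with 'I' together with its log-likelihood under the model.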

Extra Credit

Define a function named sequence_generator_max() that accepts the same parameters but returns the sequence with the highest overall probability among all valid sequences, found via exhaustive search. To generate long sequences, dynamic programming needs to be adopted.
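
For reference, here is a brute-force sketch of the exhaustive search; it enumerates every valid sequence by depth-first search and is therefore only feasible for short lengths, which is exactly why a dynamic-programming formulation is needed for longer ones. The UNKNOWN constant is assumed from Task 1, and the punctuation test mirrors the greedy sketch.

```python
import math
import string

def sequence_generator_max(model: dict, init_word: str, length: int) -> tuple:
    """Sketch: exhaustive search for the highest-probability valid sequence."""
    def is_punct(token):
        return all(ch in string.punctuation for ch in token)

    max_punct = length // 5
    best = ([], float('-inf'))  # stays empty if no valid sequence exists

    def search(word, seq, used, punct, logp):
        nonlocal best
        if len(seq) == length:
            if logp > best[1]:
                best = (seq, logp)
            return
        for cand, prob in model.get(word, {}).items():
            if cand == UNKNOWN:
                continue
            if is_punct(cand):
                if punct < max_punct:
                    search(cand, seq + [cand], used, punct + 1,
                           logp + math.log(prob))
            elif cand not in used:
                search(cand, seq + [cand], used | {cand}, punct,
                       logp + math.log(prob))

    init_punct = 1 if is_punct(init_word) else 0
    init_used = frozenset() if is_punct(init_word) else frozenset([init_word])
    search(init_word, [init_word], init_used, init_punct, 0.0)
    return best
```

A dynamic-programming version might memoize over states such as (last word, remaining length, punctuation budget), though the no-repeat constraint makes the state space considerably larger.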

Submission

Commit and push the language_modeling.py file to your GitHub repository.
