HW2: Language Models

Task 1: Bigram Modeling

Your goal is to build a bigram model using (1) Laplace smoothing with normalization and (2) initial word probabilities estimated by adding the artificial token $w_0$ at the beginning of every sentence.

Implementation

  1. Create a language_models.py file in the src/homework/ directory.

  2. Define a function named bigram_model() that takes a file path pointing to a text file, and returns a dictionary of bigram probabilities estimated from that text file.

  3. Use the following constants to indicate the unknown and initial probabilities:

UNKNOWN = ''
INIT = '[INIT]'
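
One plausible layout for the returned dictionary (an illustrative assumption; only the use of the two constants as keys is prescribed) is a nested mapping from each previous word to a probability distribution over current words:

# Illustrative sketch of an assumed structure; the word keys are hypothetical.
# model = {
#     INIT:    {'The': ..., 'It': ..., UNKNOWN: ...},  # distribution over line-initial words
#     'The':   {'lion': ..., UNKNOWN: ...},            # distribution over words following 'The'
#     UNKNOWN: {UNKNOWN: ..., ...},                    # fallback for unseen previous words
# }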

Notes

  1. Test your model using dat/chronicles_of_narnia.txt.

  2. Each line should be treated independently for bigram counting, such that the INIT token precedes the first word of every line.

  3. Use smoothing with normalization such that all probabilities sum to 1.0 for any given previous word.

  4. Unknown word probabilities should be retrieved using the UNKNOWN key for both the previous word ($w_{i-1}$) and the current word ($w_i$).
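
Below is a minimal sketch of how bigram_model() might be implemented under the notes above. The whitespace tokenization, the exact add-one (Laplace) formulation, the nested-dictionary layout, and the bigram_prob() helper are illustrative assumptions, not required design choices.

from collections import Counter, defaultdict

UNKNOWN = ''
INIT = '[INIT]'

def bigram_model(filepath):
    """Sketch: Laplace-smoothed bigram probabilities with INIT prepended to every line.

    Returns {previous word: {current word: probability}}; every inner dictionary
    carries an UNKNOWN entry, and the outer dictionary carries an UNKNOWN entry
    for previous words never observed in the text.
    """
    bigrams = defaultdict(Counter)
    with open(filepath) as fin:
        for line in fin:
            prev = INIT                           # each line starts with the INIT token
            for curr in line.split():             # assumption: whitespace tokenization
                bigrams[prev][curr] += 1
                prev = curr

    vocab = {w for counts in bigrams.values() for w in counts}
    v = len(vocab) + 1                            # +1 reserves mass for unknown words

    model = {}
    for prev in set(bigrams) | {UNKNOWN}:
        counts = bigrams.get(prev, Counter())
        total = sum(counts.values()) + v          # add-one smoothing over vocabulary + unknown
        dist = {curr: (c + 1) / total for curr, c in counts.items()}
        dist[UNKNOWN] = 1 / total                 # shared by every unseen current word
        model[prev] = dist
    return model

def bigram_prob(model, prev, curr):
    """Hypothetical helper illustrating Note 4: UNKNOWN fallback at both levels."""
    dist = model.get(prev, model[UNKNOWN])
    return dist.get(curr, dist[UNKNOWN])

if __name__ == '__main__':
    model = bigram_model('dat/chronicles_of_narnia.txt')
    print(bigram_prob(model, INIT, 'The'))

Storing only observed bigrams plus the UNKNOWN entry keeps the model compact: unseen current words share the UNKNOWN probability, so each conditional distribution over the vocabulary plus the unknown still sums to 1.0.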

Task 2: Sequence Generation

Your goal is to write a function that takes a word and generates a sequence that includes the input as the initial word.

Implementation

Under language_models.py, define a function named sequence_generator() that takes the following parameters:

  • A bigram model (the resulting dictionary of Task 1)

  • The initial word (the first word to appear in the sequence)

  • The length of the sequence (the number of tokens in the sequence)

This function aims to generate a sequence of tokens that adheres to the following criteria:

  • It must have the precise number of tokens as specified.

  • Excluding punctuation, there should be no redundant tokens in the sequence.

  • No more than 20% of the tokens can be punctuation. For instance, if the sequence length is 20, a maximum of 4 punctuation tokens are permitted within the sequence. Use the floor of 20% (e.g., if the sequence length is 21, a maximum of $\mathrm{floor}(21/5) = 4$ punctuation tokens are permitted).

Finally, the function returns a tuple comprising the following two elements:

  • The list of tokens in the sequence

  • The log-likelihood estimating the sequence probability using the bigram model. Use the logarithmic function to base $e$, provided as the math.log() function in Python.

The goal of this task is not to discover a sequence that maximizes the overall sequence probability, but rather to optimize individual bigram probabilities. Hence, it entails a greedy search rather than an exhaustive one. Given the input word $w$, a potential strategy is as follows:

  1. Identify the next word $w'$ whose bigram probability $P(w' \mid w)$ is maximized.

  2. If $w'$ fulfills all the stipulated conditions, append it to the sequence and proceed. Otherwise, move on to the word with the next highest bigram probability, repeating until you find a word that meets all the specified conditions.

  3. Set $w = w'$ and repeat from step 1 until the sequence reaches the specified length.
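
The following is a minimal sketch of this greedy strategy, assuming the nested-dictionary model from the Task 1 sketch above; the punctuation test based on string.punctuation and the handling of the degenerate case where no candidate qualifies are illustrative assumptions.

import math
import string

UNKNOWN = ''                                      # same constant as in Task 1

def sequence_generator(model, init_word, length):
    """Sketch: greedily extend a sequence from init_word using bigram probabilities.

    Returns (tokens, log_likelihood), where the log-likelihood sums
    log P(w_i | w_{i-1}) over the generated bigrams using math.log (base e).
    """
    def is_punct(token):
        return all(ch in string.punctuation for ch in token)  # assumption

    tokens = [init_word]
    log_likelihood = 0.0
    max_punct = length // 5                       # floor of 20% of the tokens
    num_punct = int(is_punct(init_word))
    prev = init_word

    while len(tokens) < length:
        dist = model.get(prev, model[UNKNOWN])
        # Try candidates from the highest to the lowest bigram probability.
        for curr, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
            if curr == UNKNOWN:
                continue
            punct = is_punct(curr)
            if punct and num_punct >= max_punct:
                continue                          # would exceed the punctuation cap
            if not punct and curr in tokens:
                continue                          # no redundant non-punctuation tokens
            tokens.append(curr)
            log_likelihood += math.log(prob)
            num_punct += int(punct)
            prev = curr
            break
        else:
            break                                 # degenerate case: no candidate qualifies

    return tokens, log_likelihood

For example, calling sequence_generator(bigram_model('dat/chronicles_of_narnia.txt'), 'the', 20) would be expected to return a 20-token list together with its log-likelihood.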

Extra Credit

Create a function called sequence_generator_plus() that takes the same input parameters as the existing sequence_generator() function. This new function should generate sequences with higher probability scores and better semantic coherence compared to the original implementation.

Submission

Commit and push the language_models.py file to your GitHub repository.

Rubric

  • Task 1: Bigram Modeling (5 points)

  • Task 2: Sequence Generation (4.6 points), Extra Credit (2 points)

  • Concept Quiz (2.4 points)

