HW4: Distributional Semantics

Create a file called distributional_semantics.py in the src/homework/ directory.

Task 1

Your task is to read word embeddings trained by Word2Vec:

  1. Define a function called read_word_embeddings() that takes a path to the file containing word embeddings, word_embeddings.txt.

  2. Return a dictionary where the key is a word and the value is its corresponding embedding as a numpy.array. A sketch of one possible implementation follows the format specification below.

Each line in the file adheres to the following format:

[WORD](\t[FLOAT]){50}
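Here is a minimal sketch of one way to implement Task 1, assuming the file is plain text in the format above; the exact parsing details (e.g., splitting on tabs) are not mandated by the assignment:

```python
import numpy as np

def read_word_embeddings(path: str) -> dict[str, np.ndarray]:
    """Reads tab-separated word embeddings into a {word: vector} dictionary."""
    embeddings = {}
    with open(path) as fin:
        for line in fin:
            fields = line.rstrip('\n').split('\t')
            # fields[0] is the word; the remaining 50 fields are floats.
            embeddings[fields[0]] = np.array(fields[1:], dtype=float)
    return embeddings
```

For example, read_word_embeddings('word_embeddings.txt') would return a dictionary mapping each word to a 50-dimensional numpy.array.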

Task 2

Your task is to retrieve a list of the most similar words to a given target word:

  1. Define a function called similar_words() that takes the word embeddings from Task 1, a target word (string), and a threshold (float).

  2. Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must include only words with similarity scores greater than or equal to the threshold, sorted in descending order by similarity score (see the sketch after this list).
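A minimal sketch of Task 2 follows. It assumes the target word itself is excluded from the results (the task does not say either way), and computes cosine similarity directly with numpy rather than a library helper:

```python
import numpy as np

def similar_words(embeddings: dict[str, np.ndarray], target: str,
                  threshold: float) -> list[tuple[str, float]]:
    """Returns (word, cosine similarity) tuples for words whose similarity to
    the target is at least the threshold, sorted in descending order."""
    t = embeddings[target]
    results = []
    for word, v in embeddings.items():
        if word == target:
            continue  # assumption: the target word itself is excluded
        sim = float(np.dot(t, v) / (np.linalg.norm(t) * np.linalg.norm(v)))
        if sim >= threshold:
            results.append((word, sim))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```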

Task 3

Your task is to measure a similarity score between two documents:

  1. Define a function called document_similarity() that takes the word embeddings and two documents (strings). Assume that the documents are already tokenized.

  2. For each document, generate a document embedding by averaging the embeddings of all words within the document.

  3. Return the cosine similarity between the two document embeddings (see the sketch after this list).
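Here is a minimal sketch of Task 3, assuming "already tokenized" means the tokens are whitespace-delimited and that out-of-vocabulary tokens may be skipped; neither detail is stated in the task:

```python
import numpy as np

def document_similarity(embeddings: dict[str, np.ndarray],
                        doc1: str, doc2: str) -> float:
    """Averages the word embeddings in each document, then returns the cosine
    similarity between the two document embeddings."""
    def doc_embedding(document: str) -> np.ndarray:
        # Assumption: tokens are whitespace-delimited, and tokens missing
        # from the embedding dictionary are skipped.
        vectors = [embeddings[token] for token in document.split()
                   if token in embeddings]
        return np.mean(vectors, axis=0)

    e1, e2 = doc_embedding(doc1), doc_embedding(doc2)
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```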

Submission

Commit and push the distributional_semantics.py file to your GitHub repository.

Rubric

  • Task 1: Read Word Embeddings (2.8 points)

  • Task 2: Similar Words (3 points)

  • Task 3: Document Similarity (3 points)

  • Concept Quiz (3.2 points)
