Create a distributional_semantics.py file in the src/homework/ directory.
Task 1. Your task is to read word embeddings trained by Word2Vec:
Define a function called read_word_embeddings()
that takes the path to the file containing the word embeddings, word_embeddings.txt.
Return a dictionary where each key is a word and its value is the corresponding embedding as a numpy array.
Each line in the file adheres to the following format:
Task 2. Your task is to retrieve a list of the words most similar to a given target word:
Define a function called similar_words()
that takes the word embeddings from Task 1, a target word (string), and a threshold (float).
Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must include only words whose similarity score is greater than or equal to the threshold, sorted in descending order of similarity.
Task 3. Your task is to measure the similarity between two documents:
Define a function called document_similarity()
that takes the word embeddings and two documents (each a string). Assume that the documents are already tokenized.
For each document, generate a document embedding by averaging the embeddings of all words within the document.
Return the cosine similarity between the two document embeddings.
Commit and push the distributional_semantics.py file to your GitHub repository.