The distributional hypothesis suggests that words that occur in similar contexts tend to have similar meanings [1]. Let us examine the following two sentences with blanks:
A: I sat on a __
B: I played with my __
The blank for A can be filled with words such as {bench, chair, sofa, stool}, which convey the meaning "something to sit on" in this context. On the other hand, the blank for B can be filled with words such as {child, dog, friend, toy}, carrying the meaning of "someone/thing to play with." However, these two sets of words are not interchangeable, as it is unlikely that you would sit on a "friend" or play with a "sofa".
This hypothesis provides a potent framework for understanding how meaning is encoded in language and has become a cornerstone of modern computational linguistics and natural language processing.
Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word "chair" according to the distributional hypothesis?
I sat on a chair.
I will chair the meeting.
I am the chair of my department.
Distributional Structure, Zellig S. Harris, Word, 10 (2-3): 146-162, 1954.
Distributional semantics represents the meaning of words based on their distributional properties in large corpora of text. It follows the distributional hypothesis, which states that "words with similar meanings tend to occur in similar contexts".
Create a distributional_semantics.py file in the src/homework/ directory.
Your task is to read word embeddings trained by Word2Vec:
Define a function called read_word_embeddings()
that takes the path to a file containing word embeddings, word_embeddings.txt.
Return a dictionary where the key is a word and the value is its corresponding embedding in numpy.array.
Each line in the file adheres to the following format:
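The format specification itself is not reproduced here. As a rough sketch only, assuming each line holds a word followed by its space-separated embedding values (a common plain-text layout for Word2Vec output), the function might look as follows:

```python
import numpy as np

def read_word_embeddings(path: str) -> dict:
    """Read word embeddings into a dictionary of word -> numpy array.

    Assumes each line holds a word followed by space-separated float values.
    """
    embeddings = {}
    with open(path) as fin:
        for line in fin:
            fields = line.split()
            if not fields:          # skip blank lines
                continue
            embeddings[fields[0]] = np.array(fields[1:], dtype=float)
    return embeddings
```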
Your task is to retrieve a list of the most similar words to a given target word:
Define a function called similar_words()
that takes the word embeddings from Task 1, a target word (string), and a threshold (float).
Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must only include words with similarity scores greater than or equal to the threshold, sorted in descending order based on the similarity scores.
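A minimal sketch of this function, assuming cosine similarity is computed with NumPy and that the target word itself is excluded from the results (the exclusion is an assumption, not a stated requirement):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_words(embeddings: dict, target: str, threshold: float) -> list:
    """Return (word, similarity) pairs whose similarity to the target word
    is >= threshold, sorted by similarity in descending order."""
    t = embeddings[target]
    pairs = [(w, cosine(t, e)) for w, e in embeddings.items() if w != target]
    pairs = [(w, s) for w, s in pairs if s >= threshold]
    return sorted(pairs, key=lambda x: x[1], reverse=True)
```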
Your task is to measure a similarity score between two documents:
Define a function called document_similarity()
that takes the word embeddings and two documents (strings). Assume that the documents are already tokenized.
For each document, generate a document embedding by averaging the embeddings of all words within the document.
Return the cosine similarity between the two document embeddings.
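A sketch under the same assumptions, where the documents are whitespace-tokenized and words missing from the embedding dictionary are simply skipped (the skipping behavior is an assumption, not specified above):

```python
import numpy as np

def document_similarity(embeddings: dict, doc1: str, doc2: str) -> float:
    """Cosine similarity between the averaged word embeddings of two documents."""
    def doc_embedding(doc: str) -> np.ndarray:
        # average the embeddings of all (known) words in the document
        vectors = [embeddings[w] for w in doc.split() if w in embeddings]
        return np.mean(vectors, axis=0)

    e1, e2 = doc_embedding(doc1), doc_embedding(doc2)
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```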
Commit and push the distributional_semantics.py file to your GitHub repository.
One-hot encoding represents words as binary vectors such that each word is represented as a vector where all dimensions are zero except for one, which is set to one, indicating the presence of that word.
Consider the following vocabulary:
Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:
One-hot encoding has been widely adopted in traditional NLP models due to its simple and efficient representation of words as sparse vectors.
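The original vocabulary and its illustration are not reproduced above; as an illustration with a hypothetical four-word vocabulary, one-hot vectors can be constructed as follows:

```python
import numpy as np

# hypothetical vocabulary of size 4 (the original vocabulary is not shown above)
vocab = {'chair': 0, 'sofa': 1, 'dog': 2, 'toy': 3}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    v = np.zeros(len(vocab), dtype=int)
    v[vocab[word]] = 1
    return v

print(one_hot('chair', vocab))  # [1 0 0 0]
print(one_hot('dog', vocab))    # [0 0 1 0]
```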
What are the drawbacks of using one-hot encoding to represent word vectors?
Word embeddings are dense vector representations of words in a continuous vector space. Each word is represented in a high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.
Consider the embeddings for three words, 'king', 'male', and 'female':
Based on these distributions, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female respectively, such that the embedding for the word 'queen' can be estimated as follows:
The key idea is to capture semantic relationships between words by representing them in a way that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the context in which words appear.
In the above examples, each dimension represents a distinct type of meaning. However, in practice, a dimension can encapsulate multiple types of meanings. Furthermore, a single type of meaning can be depicted by a weighted combination of several dimensions, making it challenging to precisely interpret what each dimension implies.
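The actual vectors for 'king', 'male', and 'female' are not shown above; with hypothetical 4-dimensional embeddings, the kind of estimate described (e.g., queen ≈ king − male + female) can be sketched as:

```python
import numpy as np

# hypothetical embeddings; dimensions loosely read as [royalty, gender, male, female]
king   = np.array([0.9, 0.8, 0.9, 0.1])
male   = np.array([0.1, 0.9, 0.9, 0.1])
female = np.array([0.1, 0.9, 0.1, 0.9])

# estimate 'queen' by removing the "male" component from 'king' and adding "female"
queen = king - male + female
print(queen)  # [0.9 0.8 0.1 0.9]
```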
Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.
LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:
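The original implementation is not shown here. A minimal sketch, assuming each line of the corpus file is one whitespace-tokenized document (the function and variable names below are illustrative, not the original ones):

```python
def read_corpus(path: str):
    """Return (documents, term_index): a list of tokenized documents and a
    dictionary mapping each term to a unique index."""
    documents, term_index = [], {}
    with open(path) as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue
            documents.append(tokens)
            for token in tokens:
                term_index.setdefault(token, len(term_index))
    return documents, term_index
```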
We then define a function that takes the document list and the term-index dictionary returned above, and returns the document-term matrix whose $(i, j)$ cell indicates the frequency of the $j$'th term within the $i$'th document:
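The original code block is not reproduced here (its accompanying notes mention the numpy package and numpy.array). One plausible naive version, which scans every document once per vocabulary term and would explain the slow timing reported below, is:

```python
import numpy as np

def document_term_matrix(documents: list, term_index: dict) -> np.ndarray:
    """Build a (documents x terms) frequency matrix by counting each vocabulary
    term in each document; simple, but re-scans every document once per term."""
    matrix = []
    for document in documents:
        matrix.append([document.count(term) for term in term_index])
    return np.array(matrix)
```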
Let us create a document-term matrix from the corpus, dat/chronicles_of_narnia.txt:
With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:
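A sketch of such a NumPy-based version: allocate the full matrix first, then increment frequencies in place.

```python
import numpy as np

def document_term_matrix_np(documents: list, term_index: dict) -> np.ndarray:
    """Pre-allocate the (documents x terms) matrix, then update frequencies in place."""
    matrix = np.zeros((len(documents), len(term_index)), dtype=int)
    for i, document in enumerate(documents):
        for token in document:
            matrix[i, term_index[token]] += 1
    return matrix
```

Timing both versions with time.time() on dat/chronicles_of_narnia.txt should reproduce the gap discussed here: this version touches each token exactly once, whereas the naive version above scans every document once per vocabulary term.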
Using this updated function, we see a noticeable improvement in speed, taking only about 0.5 seconds to create the document-term matrix:
Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
Singular values are non-negative values listed in decreasing order that represent the importance of each topic.
For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:
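The original code (whose notes reference numpy.diag()) and the eight-document corpus are not reproduced here; a sketch with a small stand-in matrix and NumPy's SVD:

```python
import numpy as np

# a stand-in document-term matrix with eight documents (rows) and six terms (columns);
# the actual eight-document corpus is not reproduced here
X = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(S)                                     # singular values in decreasing order
print(np.allclose(X, U @ np.diag(S) @ Vt))   # True: U @ diag(S) @ Vt reconstructs X
```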
What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.
By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?
From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.
Source: latent_semantic_analysis.py
Latent Semantic Analysis, Wikipedia.
Singular Value Decomposition, Wikipedia.
Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.
Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
Consider a sequence of words, $w_1, w_2, \ldots, w_n$. We can predict $w_i$ by leveraging its contextual words using a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):
$\hat{w}_i = \underset{w \in V}{\arg\max}\; P(w \mid w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k})$
This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW) using a multilayer perceptron. Let $x \in \{0, 1\}^{|V|}$ be an input vector. $x$ is created by the bag-of-words model on the set of context words, $C_i = \{w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}\}$, such that only the dimensions of $x$ representing words in $C_i$ have a value of $1$; otherwise, they are set to $0$.
Let $y \in \{0, 1\}^{|V|}$ be an output vector, where all dimensions have the value of $0$ except for the one representing $w_i$, which is set to $1$.
Let $h \in \mathbb{R}^{d}$ be a hidden layer between $x$ and $y$, and $W_1 \in \mathbb{R}^{|V| \times d}$ be the weight matrix between $x$ and $h$, where the sigmoid function is used as the activation function:
$h = \sigma(x \cdot W_1)$
Finally, let $W_2 \in \mathbb{R}^{|V| \times d}$ be the weight matrix between $h$ and $y$:
$y = \operatorname{softmax}(h \cdot W_2^{\top})$
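To make this formulation concrete, here is a minimal NumPy sketch of a single CBOW forward pass, following the notation above; the sizes and weights are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 10, 4                      # vocabulary size and hidden (embedding) dimension
W1 = rng.normal(size=(V, d))      # weights between x and h; rows serve as word embeddings
W2 = rng.normal(size=(V, d))      # weights between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# bag-of-words input over the context words (e.g., words with indices 2, 5, 7)
x = np.zeros(V)
x[[2, 5, 7]] = 1

h = sigmoid(x @ W1)               # hidden layer: one feature per dimension
y = softmax(h @ W2.T)             # probability of each word being the target
print(y.argmax(), y.sum())        # predicted target index; probabilities sum to 1
```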
What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?
What are the advantages of CBOW models compared to Skip-gram models, and vice versa?
What limitations does the Word2Vec model have, and how can these limitations be addressed?
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.
GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
Let $x = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that $x$ belongs to $y$:
$P(y = 1 \mid x) = \sigma(w \cdot x + b) = \dfrac{1}{1 + e^{-(w \cdot x + b)}}$
The weight vector $w$ assigns a weight to each dimension of the input vector $x$ for the label $y$, such that a higher magnitude of the weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y$ within the training distribution.
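A small numerical sketch of this estimate, using arbitrary placeholder weights and bias:

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1, 1, 0, 1, 1])              # a bag-of-words input vector (placeholder)
w = np.array([0.0, 1.2, -1.2, 0.0, 0.0])   # weight vector (placeholder values)
b = 0.0                                    # bias

p = sigmoid(w @ x + b)                     # estimated probability that x belongs to y = 1
print(p)
```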
What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors $x_1$ and $x_2$ can be created for these two sentences using the bag-of-words model:
Let $y_1$ and $y_2$ be the output labels of $x_1$ and $x_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $w$ can be trained using logistic regression:
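The resulting vectors and trained weights are not reproduced here. As an illustration of the setup, one could build the bag-of-words vectors and fit a logistic regression model with scikit-learn (the library choice is an assumption; the text does not prescribe one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

vocab = {'I': 0, 'love': 1, 'hate': 2, 'this': 3, 'movie': 4}

def bag_of_words(sentence: str) -> np.ndarray:
    """Count each vocabulary word in the sentence."""
    v = np.zeros(len(vocab))
    for token in sentence.split():
        v[vocab[token]] += 1
    return v

X = np.array([bag_of_words('I love this movie'),   # D1
              bag_of_words('I hate this movie')])  # D2
y = np.array([1, 0])                               # 1: positive, 0: negative

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)               # learned weights and bias
print(model.predict_proba(X))                      # per-class probabilities
```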
What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
What are the limitations of the softmax regression model?
Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
What are the limitations of a multilayer perceptron?
Neural Network Methodologies and their Potential Application to Cloud Pattern Recognition, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.
The $i$'th row of the document-term matrix is considered the document vector of the $i$'th document in the corpus, while the transpose of the $j$'th column is considered the term vector of the $j$'th term in the vocabulary.
LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $\Sigma$, and $V^{\top}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing singular values, such that $X = U \Sigma V^{\top}$.
An orthogonal matrix is a square matrix whose rows and columns are orthonormal, such that $Q^{\top} Q = Q Q^{\top} = I$, where $I$ is the identity matrix.
This results in $U$, $\Sigma$, and $V^{\top}$ such that:
In $U$, each row represents a document and each column represents a topic.
In $\Sigma$, each diagonal cell represents the weight of the corresponding topic.
In $V^{\top}$, each column represents a term and each row represents a topic.
The last two singular values in $\Sigma$ are, in fact, non-negative values.
The first four singular values in $\Sigma$ appear to be sufficiently larger than the others; thus, let us reduce the dimensions to $k = 4$ such that $U$ keeps only its first $k$ columns, $\Sigma$ becomes a $k \times k$ diagonal matrix, and $V^{\top}$ keeps only its first $k$ rows:
Given the LSA results, an embedding of the $i$'th document can be obtained as $U_{i,:}\,\Sigma$:
Finally, an embedding of the $j$'th term can be achieved as $\Sigma\,V^{\top}_{:,j}$:
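The original code is not reproduced here; a self-contained sketch of the truncation and of the document/term embeddings just described, using a stand-in matrix in place of the eight-document corpus:

```python
import numpy as np

# stand-in 8-document matrix (the real corpus is not reproduced here)
X = np.random.default_rng(0).integers(0, 3, size=(8, 12))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 4                                    # keep the four largest singular values (topics)
U_k, S_k, Vt_k = U[:, :k], np.diag(S[:k]), Vt[:k, :]

doc_embedding  = U_k[0] @ S_k            # embedding of the 0th document
term_embedding = S_k @ Vt_k[:, 0]        # embedding of the 0th term
print(doc_embedding.shape, term_embedding.shape)   # (4,) and (4,)
```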
$\Sigma$ is not transposed in the code above. Should we use S.transpose() instead?
Thus, each dimension in $y$ represents the probability of the corresponding word being the target word $w_i$ given the set of context words $C_i$.
In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $C_i$ given $w_i$. Formally, the objective of a Skip-gram model is to maximize $P(w_c \mid w_i)$ for every context word $w_c \in C_i$.
Let $x \in \{0, 1\}^{|V|}$ be an input vector, where only the dimension representing $w_i$ is set to $1$; all the other dimensions have the value of $0$ (thus, $x$ in Skip-gram is the same as $y$ in CBOW). Let $y \in \{0, 1\}^{|V|}$ be an output vector, where only the dimension representing a context word $w_c \in C_i$ is set to $1$; all the other dimensions have the value of $0$. All the other components, such as the hidden layer $h$ and the weight matrices $W_1$ and $W_2$, stay the same as the ones in CBOW.
What does each dimension in the hidden layer $h$ represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $C_i$, deemed valuable for predicting the target word $w_i$. Formally, each dimension $h_j$ is computed as the sigmoid activation of the weighted sum between the input vector $x$ and the $j$'th column vector of $W_1$ such that:
$h_j = \sigma\!\left(\sum_{k=1}^{|V|} x_k \, W_1[k, j]\right)$
Then, what does each row vector $W_1[k, :]$ represent? The $j$'th dimension in $W_1[k, :]$ denotes the weight of the $j$'th feature in $h$ with respect to the $k$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $k$'th word. Thus, $W_1[k, :]$ can serve as an embedding for the $k$'th word in $V$.
What about the other weight matrix $W_2$? The $j$'th column vector $W_2[:, j]$ denotes the weights of the $j$'th feature in $h$ for all words in the vocabulary. Thus, the $k$'th dimension of $W_2[:, j]$ indicates the importance of the $j$'th feature for the $k$'th word being predicted as the target word $w_i$.
On the other hand, the $k$'th row vector $W_2[k, :]$ denotes the weights of all features for the $k$'th word in the vocabulary, enabling it to be utilized as an embedding for that word. However, in practice, only the row vectors of the first weight matrix $W_1$ are employed as word embeddings because the weights in $W_2$ are often optimized for the downstream task, in this case predicting $w_i$, whereas the weights in $W_1$ are optimized for finding representations that are generalizable across various tasks.
What are the implications of the weight matrices $W_1$ and $W_2$ in the Skip-gram model?
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights , , and are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" () contributes positively to the label , the weight for "hate" () has a negative impact on the label . Furthermore, as positive and negative sentiment labels are equally presented in this corpus, the bias is also set to 0.
Given the weight vector and the bias, we can compute $w \cdot x_1 + b$ and $w \cdot x_2 + b$, resulting in the following probabilities:
As the probability of $x_1$ being positive exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being positive is below 50%.
Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $x$ and its output label $y$, the model uses the softmax function to estimate the probability that $x$ belongs to each class $c$ separately:
$P(y = c \mid x) = \dfrac{\exp(w_c \cdot x + b_c)}{\sum_{c'} \exp(w_{c'} \cdot x + b_{c'})}$
The weight vector $w_c$ assigns weights to $x$ for the label $c$, while $b_c$ represents the bias associated with the label $c$.
Then, the input vectors $x_1$, $x_2$, and $x_3$ for the sentences can be created using the bag-of-words model:
Let $y_1$, $y_2$, and $y_3$ be the output labels of $x_1$, $x_2$, and $x_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $w_1$, $w_2$, and $w_3$ can be trained using softmax regression as follows:
Unlike the case of logistic regression, where all weights are oriented to a single label (the weights for both "love" and "hate" are positive and negative with respect to the positive label, respectively, rather than with respect to their own labels), the values in each weight vector are oriented to its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for $x_1$:
Since the probability of the positive label is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $x_3$, the following probabilities can be estimated:
Since the probability of the neutral label is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression always predicts a probability for every class, so that its prediction is represented by an output vector $y$, wherein the $c$'th value in $y$ contains the probability of the input belonging to the $c$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $W$, where the $c$'th row represents the weight vector for the $c$'th label.
With this new formulation, softmax regression can be defined as $y = \operatorname{softmax}(W \cdot x + b)$, and the optimal prediction can be achieved as $\arg\max_c y_c$, which returns the set of labels with the highest probabilities.
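A compact sketch of this matrix formulation, with placeholder weights and an arbitrary bag-of-words input:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

# placeholder weight matrix W (3 labels x 7 features), bias vector b, and input x
W = np.random.default_rng(0).normal(size=(3, 7))
b = np.zeros(3)
x = np.array([1, 1, 0, 0, 0, 1, 1])   # bag-of-words input (placeholder)

y = softmax(W @ x + b)                # probability for each of the 3 labels
print(y, y.argmax())                  # distribution and the predicted label index
```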
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $x$ and an output vector $y$, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let $h$ be a hidden layer between $x$ and $y$. To connect $x$ and $h$, we need a weight matrix $W_1$ such that $h = \alpha(x \cdot W_1)$, where $\alpha$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, implying whether or not the neuron's output should be passed on to the next layer.
Similarly, to connect $h$ and $y$, we need a weight matrix $W_2$ such that $y = \operatorname{softmax}(h \cdot W_2)$. Thus, a multilayer perceptron with one hidden layer can be represented as:
$y = \operatorname{softmax}(\alpha(x \cdot W_1) \cdot W_2)$
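A minimal NumPy sketch of this one-hidden-layer formulation, with random placeholder weights; the sigmoid is used here as one common choice for the activation $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 7, 5, 5                     # input features, hidden units, output labels
W1 = rng.normal(size=(n, d))          # weights between x and h
W2 = rng.normal(size=(d, k))          # weights between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1, 1, 0, 0, 0, 1, 1])   # bag-of-words input (e.g., "I love this movie")
h = sigmoid(x @ W1)                   # hidden layer: alpha(x . W1)
y = softmax(h @ W2)                   # output distribution over the k labels
print(y.argmax())                     # predicted label index
```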
Consider a corpus comprising the following five sentences and their corresponding labels:
D1: I love this movie (positive)
D2: I hate this movie (negative)
D3: I watched this movie (neutral)
D4: I truly love this movie (very positive)
D5: I truly hate this movie (very negative)
The input vectors $x_1, \ldots, x_5$ can be created using the bag-of-words model:
The first weight matrix $W_1$ can be trained by an MLP as follows:
Given the values in $W_1$, we can infer that the first, second, and third columns represent "love", "hate", and "watch", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each $x_i$ is multiplied by $W_1$ to achieve the corresponding hidden layer $h_i$, where the activation function $\alpha$ is designed as follows:
The second weight matrix $W_2$ can also be trained by an MLP as follows:
By applying the softmax function to each $h_i \cdot W_2$, we achieve the corresponding output vector $y_i$:
The prediction can be made by taking the argmax of each $y_i$.