The distributional hypothesis suggests that words that occur in similar contexts tend to have similar meanings [1]. Let us examine the following two sentences with blanks:
A: I sat on a __
B: I played with my __
The blank for A can be filled with words such as {bench, chair, sofa, stool}, which convey the meaning "something to sit on" in this context. On the other hand, the blank for B can be filled with words such as {child, dog, friend, toy}, carrying the meaning of "someone/thing to play with." However, these two sets of words are not interchangeable, as it is unlikely that you would sit on a "friend" or play with a "sofa".
This hypothesis provides a potent framework for understanding how meaning is encoded in language and has become a cornerstone of modern computational linguistics and natural language processing.
Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word "chair" according to the distributional hypothesis?
I sat on a chair.
I will chair the meeting.
I am the chair of my department.
Distributional Structure, Zellig S. Harris, Word, 10 (2-3): 146-162, 1954.
Distributional semantics represents the meaning of words based on their distributional properties in large corpora of text. It follows the distributional hypothesis, which states that "words with similar meanings tend to occur in similar contexts".
Create a distributional_semantics.py file in the src/homework/ directory.
Your task is to read word embeddings trained by Word2Vec:
Define a function called read_word_embeddings()
that takes the path to a file containing word embeddings, word_embeddings.txt.
Return a dictionary where the key is a word and the value is its corresponding embedding in numpy.array.
Each line in the file adheres to the following format:
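The format specification itself is not reproduced here. As a rough sketch only, assuming each line holds a word followed by its space-separated embedding values (a common plain-text layout for Word2Vec output), the function might look as follows:

```python
import numpy as np

def read_word_embeddings(path: str) -> dict:
    """Read word embeddings into a dictionary of word -> numpy array.

    Assumes each line holds a word followed by space-separated float values.
    """
    embeddings = {}
    with open(path) as fin:
        for line in fin:
            fields = line.split()
            if not fields:          # skip blank lines
                continue
            embeddings[fields[0]] = np.array(fields[1:], dtype=float)
    return embeddings
```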
Your task is to retrieve a list of the most similar words to a given target word:
Define a function called similar_words()
that takes the word embeddings from Task 1, a target word (string), and a threshold (float).
Return a list of tuples, where each tuple contains a word similar to the target word and the cosine similarity between them as determined by the embeddings. The returned list must only include words with similarity scores greater than or equal to the threshold, sorted in descending order based on the similarity scores.
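A minimal sketch of this function, assuming cosine similarity is computed with NumPy and that the target word itself is excluded from the results (the exclusion is an assumption, not a stated requirement):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similar_words(embeddings: dict, target: str, threshold: float) -> list:
    """Return (word, similarity) pairs whose similarity to the target word
    is >= threshold, sorted by similarity in descending order."""
    t = embeddings[target]
    pairs = [(w, cosine(t, e)) for w, e in embeddings.items() if w != target]
    pairs = [(w, s) for w, s in pairs if s >= threshold]
    return sorted(pairs, key=lambda x: x[1], reverse=True)
```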
Your task is to measure a similarity score between two documents:
Define a function called document_similarity()
that takes the word embeddings and two documents (strings). Assume that the documents are already tokenized.
For each document, generate a document embedding by averaging the embeddings of all words within the document.
Return the cosine similarity between the two document embeddings.
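A sketch under the same assumptions, where the documents are whitespace-tokenized and words missing from the embedding dictionary are simply skipped (the skipping behavior is an assumption, not specified above):

```python
import numpy as np

def document_similarity(embeddings: dict, doc1: str, doc2: str) -> float:
    """Cosine similarity between the averaged word embeddings of two documents."""
    def doc_embedding(doc: str) -> np.ndarray:
        # average the embeddings of all (known) words in the document
        vectors = [embeddings[w] for w in doc.split() if w in embeddings]
        return np.mean(vectors, axis=0)

    e1, e2 = doc_embedding(doc1), doc_embedding(doc2)
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```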
Commit and push the distributional_semantics.py file to your GitHub repository.
One-hot encoding represents words as binary vectors such that each word is represented as a vector where all dimensions are zero except for one, which is set to one, indicating the presence of that word.
Consider the following vocabulary:
Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:
One-hot encoding has been widely adopted in traditional NLP models due to its simple and efficient representation of words as sparse vectors.
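The original vocabulary and its illustration are not reproduced above; as an illustration with a hypothetical four-word vocabulary, one-hot vectors can be constructed as follows:

```python
import numpy as np

# hypothetical vocabulary of size 4 (the original vocabulary is not shown above)
vocab = {'chair': 0, 'sofa': 1, 'dog': 2, 'toy': 3}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    v = np.zeros(len(vocab), dtype=int)
    v[vocab[word]] = 1
    return v

print(one_hot('chair', vocab))  # [1 0 0 0]
print(one_hot('dog', vocab))    # [0 0 1 0]
```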
What are the drawbacks of using one-hot encoding to represent word vectors?
Word embeddings are dense vector representations of words in a continuous vector space. Each word is represented in a high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.
Consider the embeddings for three words, 'king', 'male', and 'female':
Based on these distributions, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female respectively, such that the embedding for the word 'queen' can be estimated as follows:
The key idea is to capture semantic relationships between words by representing them in a way that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the context in which words appear.
In the above examples, each dimension represents a distinct type of meaning. However, in practice, a dimension can encapsulate multiple types of meanings. Furthermore, a single type of meaning can be depicted by a weighted combination of several dimensions, making it challenging to precisely interpret what each dimension implies.
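The actual vectors for 'king', 'male', and 'female' are not shown above; with hypothetical 4-dimensional embeddings, the kind of estimate described (e.g., queen ≈ king − male + female) can be sketched as:

```python
import numpy as np

# hypothetical embeddings; dimensions loosely read as [royalty, gender, male, female]
king   = np.array([0.9, 0.8, 0.9, 0.1])
male   = np.array([0.1, 0.9, 0.9, 0.1])
female = np.array([0.1, 0.9, 0.1, 0.9])

# estimate 'queen' by removing the "male" component from 'king' and adding "female"
queen = king - male + female
print(queen)  # [0.9 0.8 0.1 0.9]
```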
Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.
LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:
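The original implementation is not shown here. A minimal sketch, assuming each line of the corpus file is one whitespace-tokenized document (the function and variable names below are illustrative, not the original ones):

```python
def read_corpus(path: str):
    """Return (documents, term_index): a list of tokenized documents and a
    dictionary mapping each term to a unique index."""
    documents, term_index = [], {}
    with open(path) as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue
            documents.append(tokens)
            for token in tokens:
                term_index.setdefault(token, len(term_index))
    return documents, term_index
```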
We then define a function that takes the document list and the term-index dictionary returned above, and returns the document-term matrix whose $(i, j)$ cell indicates the frequency of the $j$'th term within the $i$'th document:
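The original code block is not reproduced here (its accompanying notes mention the numpy package and numpy.array). One plausible naive version, which scans every document once per vocabulary term and would explain the slow timing reported below, is:

```python
import numpy as np

def document_term_matrix(documents: list, term_index: dict) -> np.ndarray:
    """Build a (documents x terms) frequency matrix by counting each vocabulary
    term in each document; simple, but re-scans every document once per term."""
    matrix = []
    for document in documents:
        matrix.append([document.count(term) for term in term_index])
    return np.array(matrix)
```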
Let us create a document-term matrix from the corpus, dat/chronicles_of_narnia.txt:
With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:
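A sketch of such a NumPy-based version: allocate the full matrix first, then increment frequencies in place.

```python
import numpy as np

def document_term_matrix_np(documents: list, term_index: dict) -> np.ndarray:
    """Pre-allocate the (documents x terms) matrix, then update frequencies in place."""
    matrix = np.zeros((len(documents), len(term_index)), dtype=int)
    for i, document in enumerate(documents):
        for token in document:
            matrix[i, term_index[token]] += 1
    return matrix
```

Timing both versions with time.time() on dat/chronicles_of_narnia.txt should reproduce the gap discussed here: this version touches each token exactly once, whereas the naive version above scans every document once per vocabulary term.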
Using this updated function, we see a noticeable improvement in speed, taking only about 0.5 seconds to create the document-term matrix:
Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
Singular values are non-negative values listed in decreasing order that represent the importance of each topic.
For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:
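The original code (whose notes reference numpy.diag()) and the eight-document corpus are not reproduced here; a sketch with a small stand-in matrix and NumPy's SVD:

```python
import numpy as np

# a stand-in document-term matrix with eight documents (rows) and six terms (columns);
# the actual eight-document corpus is not reproduced here
X = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1],
])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(S)                                     # singular values in decreasing order
print(np.allclose(X, U @ np.diag(S) @ Vt))   # True: U @ diag(S) @ Vt reconstructs X
```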
What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.
By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?
From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.
Source: latent_semantic_analysis.py
Latent Semantic Analysis, Wikipedia.
Singular Value Decomposition, Wikipedia.
Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.
Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
Consider a sequence of words, $w_1, w_2, \ldots, w_n$. We can predict $w_i$ by leveraging its contextual words using a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):
$\hat{w}_i = \underset{w \in V}{\arg\max}\; P(w \mid w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k})$
This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW) using a multilayer perceptron. Let $x \in \{0, 1\}^{|V|}$ be an input vector. $x$ is created by the bag-of-words model on the set of context words, $C_i = \{w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}\}$, such that only the dimensions of $x$ representing words in $C_i$ have a value of $1$; otherwise, they are set to $0$.
Let $y \in \{0, 1\}^{|V|}$ be an output vector, where all dimensions have the value of $0$ except for the one representing $w_i$, which is set to $1$.
Let $h \in \mathbb{R}^{d}$ be a hidden layer between $x$ and $y$, and $W_1 \in \mathbb{R}^{|V| \times d}$ be the weight matrix between $x$ and $h$, where the sigmoid function is used as the activation function:
$h = \sigma(x \cdot W_1)$
Finally, let $W_2 \in \mathbb{R}^{|V| \times d}$ be the weight matrix between $h$ and $y$:
$y = \operatorname{softmax}(h \cdot W_2^{\top})$
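To make this formulation concrete, here is a minimal NumPy sketch of a single CBOW forward pass, following the notation above; the sizes and weights are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 10, 4                      # vocabulary size and hidden (embedding) dimension
W1 = rng.normal(size=(V, d))      # weights between x and h; rows serve as word embeddings
W2 = rng.normal(size=(V, d))      # weights between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# bag-of-words input over the context words (e.g., words with indices 2, 5, 7)
x = np.zeros(V)
x[[2, 5, 7]] = 1

h = sigmoid(x @ W1)               # hidden layer: one feature per dimension
y = softmax(h @ W2.T)             # probability of each word being the target
print(y.argmax(), y.sum())        # predicted target index; probabilities sum to 1
```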
What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?
What are the advantages of CBOW models compared to Skip-gram models, and vice versa?
What limitations does the Word2Vec model have, and how can these limitations be addressed?
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.
GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
Let $x = [x_1, \ldots, x_n]$ be a vector representing an input instance, where $x_i$ denotes the $i$'th feature of the input, and let $y \in \{0, 1\}$ be its corresponding output label. Logistic regression uses the logistic function, aka. the sigmoid function, to estimate the probability that $x$ belongs to $y$:
$P(y = 1 \mid x) = \sigma(w \cdot x + b) = \dfrac{1}{1 + e^{-(w \cdot x + b)}}$
The weight vector $w$ assigns a weight to each dimension of the input vector $x$ for the label $y$, such that a higher magnitude of the weight $w_i$ indicates greater importance of the feature $x_i$. Finally, $b$ represents the bias of the label $y$ within the training distribution.
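A small numerical sketch of this estimate, using arbitrary placeholder weights and bias:

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1, 1, 0, 1, 1])              # a bag-of-words input vector (placeholder)
w = np.array([0.0, 1.2, -1.2, 0.0, 0.0])   # weight vector (placeholder values)
b = 0.0                                    # bias

p = sigmoid(w @ x + b)                     # estimated probability that x belongs to y = 1
print(p)
```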
What role does the sigmoid function play in the logistic regression model?
Consider a corpus consisting of two sentences:
D1: I love this movie
D2: I hate this movie
The input vectors $x_1$ and $x_2$ can be created for these two sentences using the bag-of-words model:
Let $y_1$ and $y_2$ be the output labels of $x_1$ and $x_2$, representing positive and negative sentiments of the input sentences, respectively. Then, a weight vector $w$ can be trained using logistic regression:
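The resulting vectors and trained weights are not reproduced here. As an illustration of the setup, one could build the bag-of-words vectors and fit a logistic regression model with scikit-learn (the library choice is an assumption; the text does not prescribe one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

vocab = {'I': 0, 'love': 1, 'hate': 2, 'this': 3, 'movie': 4}

def bag_of_words(sentence: str) -> np.ndarray:
    """Count each vocabulary word in the sentence."""
    v = np.zeros(len(vocab))
    for token in sentence.split():
        v[vocab[token]] += 1
    return v

X = np.array([bag_of_words('I love this movie'),   # D1
              bag_of_words('I hate this movie')])  # D2
y = np.array([1, 0])                               # 1: positive, 0: negative

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)               # learned weights and bias
print(model.predict_proba(X))                      # per-class probabilities
```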
What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
Consider a corpus consisting of three sentences:
D1: I love this movie
D2: I hate this movie
D3: I watched this movie
What are the limitations of the softmax regression model?
Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
What would be the weight assigned to the feature "truly" learned by softmax regression for the above example?
What are the limitations of a multilayer perceptron?
Neural Network Methodologies and their Potential Application to Cloud Pattern Recognition, J. E. Peak, Defense Technical Information Center, ADA239214, 1991.
The $i$'th row of the document-term matrix is considered the document vector of the $i$'th document in the corpus, while the transpose of the $j$'th column is considered the term vector of the $j$'th term in the vocabulary.
LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $\Sigma$, and $V^{\top}$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing singular values, such that $X = U \Sigma V^{\top}$.
An orthogonal matrix is a square matrix whose rows and columns are orthonormal, such that $Q^{\top} Q = Q Q^{\top} = I$, where $I$ is the identity matrix.
This results in $U$, $\Sigma$, and $V^{\top}$ such that:
In $U$, each row represents a document and each column represents a topic.
In $\Sigma$, each diagonal cell represents the weight of the corresponding topic.
In $V^{\top}$, each column represents a term and each row represents a topic.
The last two singular values in $\Sigma$ are, in fact, non-negative values.
The first four singular values in $\Sigma$ appear to be sufficiently larger than the others; thus, let us reduce the dimensions to $k = 4$ such that $U$ keeps only its first $k$ columns, $\Sigma$ becomes a $k \times k$ diagonal matrix, and $V^{\top}$ keeps only its first $k$ rows:
Given the LSA results, an embedding of the $i$'th document can be obtained as $U_{i,:}\,\Sigma$:
Finally, an embedding of the $j$'th term can be achieved as $\Sigma\,V^{\top}_{:,j}$:
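The original code is not reproduced here; a self-contained sketch of the truncation and of the document/term embeddings just described, using a stand-in matrix in place of the eight-document corpus:

```python
import numpy as np

# stand-in 8-document matrix (the real corpus is not reproduced here)
X = np.random.default_rng(0).integers(0, 3, size=(8, 12))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 4                                    # keep the four largest singular values (topics)
U_k, S_k, Vt_k = U[:, :k], np.diag(S[:k]), Vt[:k, :]

doc_embedding  = U_k[0] @ S_k            # embedding of the 0th document
term_embedding = S_k @ Vt_k[:, 0]        # embedding of the 0th term
print(doc_embedding.shape, term_embedding.shape)   # (4,) and (4,)
```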
$\Sigma$ is not transposed in the code above. Should we use S.transpose() instead?
Thus, each dimension in $y$ represents the probability of the corresponding word being the target word $w_i$ given the set of context words $C_i$.
In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $C_i$ given $w_i$. Formally, the objective of a Skip-gram model is to maximize $P(w_c \mid w_i)$ for every context word $w_c \in C_i$.
Let $x \in \{0, 1\}^{|V|}$ be an input vector, where only the dimension representing $w_i$ is set to $1$; all the other dimensions have the value of $0$ (thus, $x$ in Skip-gram is the same as $y$ in CBOW). Let $y \in \{0, 1\}^{|V|}$ be an output vector, where only the dimension representing a context word $w_c \in C_i$ is set to $1$; all the other dimensions have the value of $0$. All the other components, such as the hidden layer $h$ and the weight matrices $W_1$ and $W_2$, stay the same as the ones in CBOW.
What does each dimension in the hidden layer $h$ represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $C_i$, deemed valuable for predicting the target word $w_i$. Formally, each dimension $h_j$ is computed as the sigmoid activation of the weighted sum between the input vector $x$ and the $j$'th column vector of $W_1$ such that:
$h_j = \sigma\!\left(\sum_{k=1}^{|V|} x_k \, W_1[k, j]\right)$
Then, what does each row vector $W_1[k, :]$ represent? The $j$'th dimension in $W_1[k, :]$ denotes the weight of the $j$'th feature in $h$ with respect to the $k$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $k$'th word. Thus, $W_1[k, :]$ can serve as an embedding for the $k$'th word in $V$.
What about the other weight matrix $W_2$? The $j$'th column vector $W_2[:, j]$ denotes the weights of the $j$'th feature in $h$ for all words in the vocabulary. Thus, the $k$'th dimension of $W_2[:, j]$ indicates the importance of the $j$'th feature for the $k$'th word being predicted as the target word $w_i$.
On the other hand, the $k$'th row vector $W_2[k, :]$ denotes the weights of all features for the $k$'th word in the vocabulary, enabling it to be utilized as an embedding for that word. However, in practice, only the row vectors of the first weight matrix $W_1$ are employed as word embeddings because the weights in $W_2$ are often optimized for the downstream task, in this case predicting $w_i$, whereas the weights in $W_1$ are optimized for finding representations that are generalizable across various tasks.
What are the implications of the weight matrices $W_1$ and $W_2$ in the Skip-gram model?
Since the terms "I", "this", and "movie" appear with equal frequency across both labels, their weights , , and are neutralized. On the other hand, the terms "love" and "hate" appear only with the positive and negative labels, respectively. Therefore, while the weight for "love" () contributes positively to the label , the weight for "hate" () has a negative impact on the label . Furthermore, as positive and negative sentiment labels are equally presented in this corpus, the bias is also set to 0.
Given the weight vector and the bias, we can compute $w \cdot x_1 + b$ and $w \cdot x_2 + b$, resulting in the following probabilities:
As the probability of $x_1$ being positive exceeds 0.5 (50%), the model predicts the first sentence to convey a positive sentiment. Conversely, the model predicts the second sentence to convey a negative sentiment, as its probability of being positive is below 50%.
Under what circumstances would the bias be negative in the above example? Additionally, when might neutral terms such as "this" or "movie" exhibit non-neutral weights?
Softmax regression, aka. multinomial logistic regression, is an extension of logistic regression to handle classification problems with more than two classes. Given an input vector $x$ and its output label $y$, the model uses the softmax function to estimate the probability that $x$ belongs to each class $c$ separately:
$P(y = c \mid x) = \dfrac{\exp(w_c \cdot x + b_c)}{\sum_{c'} \exp(w_{c'} \cdot x + b_{c'})}$
The weight vector $w_c$ assigns weights to $x$ for the label $c$, while $b_c$ represents the bias associated with the label $c$.
Then, the input vectors $x_1$, $x_2$, and $x_3$ for the sentences can be created using the bag-of-words model:
Let $y_1$, $y_2$, and $y_3$ be the output labels of $x_1$, $x_2$, and $x_3$, representing positive, negative, and neutral sentiments of the input sentences, respectively. Then, weight vectors $w_1$, $w_2$, and $w_3$ can be trained using softmax regression as follows:
Unlike the case of logistic regression, where all weights are oriented to a single label (the weights for both "love" and "hate" are positive and negative with respect to the positive label, respectively, rather than with respect to their own labels), the values in each weight vector are oriented to its corresponding label.
Given the weight vectors and the biases, we can estimate the following probabilities for $x_1$:
Since the probability of the positive label is the highest among all labels, the model predicts the first sentence to convey a positive sentiment. For $x_3$, the following probabilities can be estimated:
Since the probability of the neutral label is the highest among all labels, the model predicts the third sentence to convey a neutral sentiment.
Softmax regression always predicts a probability for every class, so that its prediction is represented by an output vector $y$, wherein the $c$'th value in $y$ contains the probability of the input belonging to the $c$'th class. Similarly, the weight vectors for all labels can be stacked into a weight matrix $W$, where the $c$'th row represents the weight vector for the $c$'th label.
With this new formulation, softmax regression can be defined as $y = \operatorname{softmax}(W \cdot x + b)$, and the optimal prediction can be achieved as $\arg\max_c y_c$, which returns the set of labels with the highest probabilities.
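A compact sketch of this matrix formulation, with placeholder weights and an arbitrary bag-of-words input:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

# placeholder weight matrix W (3 labels x 7 features), bias vector b, and input x
W = np.random.default_rng(0).normal(size=(3, 7))
b = np.zeros(3)
x = np.array([1, 1, 0, 0, 0, 1, 1])   # bag-of-words input (placeholder)

y = softmax(W @ x + b)                # probability for each of the 3 labels
print(y, y.argmax())                  # distribution and the predicted label index
```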
A multilayer perceptron (MLP) is a type of feedforward neural network consisting of multiple layers of neurons, where all neurons in one layer are fully connected to all neurons in its adjacent layers. Given an input vector $x$ and an output vector $y$, the model allows zero to many hidden layers to generate intermediate representations of the input.
Let $h$ be a hidden layer between $x$ and $y$. To connect $x$ and $h$, we need a weight matrix $W_1$ such that $h = \alpha(x \cdot W_1)$, where $\alpha$ is an activation function applied to the output of each neuron; it introduces non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated or not, implying whether or not the neuron's output should be passed on to the next layer.
Similarly, to connect $h$ and $y$, we need a weight matrix $W_2$ such that $y = \operatorname{softmax}(h \cdot W_2)$. Thus, a multilayer perceptron with one hidden layer can be represented as:
$y = \operatorname{softmax}(\alpha(x \cdot W_1) \cdot W_2)$
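A minimal NumPy sketch of this one-hidden-layer formulation, with random placeholder weights; the sigmoid is used here as one common choice for the activation $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 7, 5, 5                     # input features, hidden units, output labels
W1 = rng.normal(size=(n, d))          # weights between x and h
W2 = rng.normal(size=(d, k))          # weights between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1, 1, 0, 0, 0, 1, 1])   # bag-of-words input (e.g., "I love this movie")
h = sigmoid(x @ W1)                   # hidden layer: alpha(x . W1)
y = softmax(h @ W2)                   # output distribution over the k labels
print(y.argmax())                     # predicted label index
```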
Consider a corpus comprising the following five sentences and their corresponding labels:
D1: I love this movie (positive)
D2: I hate this movie (negative)
D3: I watched this movie (neutral)
D4: I truly love this movie (very positive)
D5: I truly hate this movie (very negative)
The input vectors $x_1, \ldots, x_5$ can be created using the bag-of-words model:
The first weight matrix $W_1$ can be trained by an MLP as follows:
Given the values in $W_1$, we can infer that the first, second, and third columns represent "love", "hate", and "watch", while the fourth and fifth columns learn combined features such as {"truly", "love"} and {"truly", "hate"}, respectively.
Each $x_i$ is multiplied by $W_1$ to achieve the corresponding hidden layer $h_i$, where the activation function $\alpha$ is designed as follows:
The second weight matrix $W_2$ can also be trained by an MLP as follows:
By applying the softmax function to each $h_i \cdot W_2$, we achieve the corresponding output vector $y_i$:
The prediction can be made by taking the argmax of each $y_i$.