Word2Vec

Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.

Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.

Continuous Bag-of-Words

Consider a sequence of words, $\{w_{k-2}, w_{k-1}, w_k, w_{k+1}, w_{k+2}\}$. We can predict $w_k$ by leveraging its contextual words using a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):

$$w_k = \arg\max_{\forall w_* \in V} P(w_*|w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2})$$
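For instance, the sketch below enumerates the (context, target) pairs that such a model is trained on, using a window of two words on each side; the tokenized sentence and helper name are hypothetical illustrations, not part of the original formulation.

```python
# Minimal sketch: enumerate (context, target) pairs with a window of two
# words on each side of the target position.
def context_target_pairs(tokens, window=2):
    pairs = []
    for k, target in enumerate(tokens):
        # Collect up to `window` words on each side of position k, excluding k itself.
        context = [tokens[j]
                   for j in range(max(0, k - window), min(len(tokens), k + window + 1))
                   if j != k]
        pairs.append((context, target))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in context_target_pairs(tokens):
    print(context, "->", target)
```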

This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW), implemented as a multilayer perceptron. Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where $n = |V|$. $\mathrm{x}$ is created by the bag-of-words model on the set of context words $I = \{w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2}\}$, such that only the dimensions of $\mathrm{x}$ representing words in $I$ have a value of $1$; all the other dimensions are set to $0$.

Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where all dimensions have the value of $0$ except for the one representing $w_k$, which is set to $1$.
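To make these vectors concrete, the following sketch, assuming a small hypothetical vocabulary, encodes a context set $I$ as the multi-hot input $\mathrm{x}$ and the target word $w_k$ as the one-hot output $\mathrm{y}$:

```python
import numpy as np

# Hypothetical vocabulary; in practice V is built from the corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

def encode_cbow_example(context_words, target_word):
    """Multi-hot input x over the context words, one-hot output y for the target."""
    x = np.zeros(n)
    for w in context_words:
        x[word_to_id[w]] = 1.0
    y = np.zeros(n)
    y[word_to_id[target_word]] = 1.0
    return x, y

x, y = encode_cbow_example(["the", "cat", "on", "the"], "sat")
```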

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$, and $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{x}$ and $\mathrm{h}$, where the sigmoid function is used as the activation function:

$$\mathrm{h} = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{W}_x)$$

Finally, let $\mathrm{W}_h \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{h}$ and $\mathrm{y}$:

$$\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^{T})$$

Thus, each dimension in $\mathrm{y}$ represents the probability of the corresponding word being $w_k$ given the set of context words $I$.
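Putting these pieces together, below is a minimal NumPy sketch of the CBOW forward pass described above; the vocabulary size, embedding dimension, and random weight initialization are illustrative assumptions rather than part of the original model specification.

```python
import numpy as np

n, d = 5, 3                                 # |V| and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(n, d))    # weights between x and h
W_h = rng.normal(scale=0.1, size=(n, d))    # weights between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def cbow_forward(x):
    h = sigmoid(x @ W_x)                    # h in R^{1 x d}
    y = softmax(h @ W_h.T)                  # y in R^{1 x n}: P(w_* | I) for every word in V
    return h, y

x = np.zeros(n); x[[0, 1, 3]] = 1.0         # multi-hot context vector over I
h, y = cbow_forward(x)
print(y.argmax())                           # index of the predicted target word w_k
```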

What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?

Skip-gram

In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $I$ given $w_k$. Formally, the objective of a Skip-gram model is as follows:

$$
\begin{align*}
w_{k-2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k-1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k+1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k+2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)
\end{align*}
$$

Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where only the dimension representing $w_k$ is set to $1$; all the other dimensions have the value of $0$ (thus, $\mathrm{x}$ in Skip-gram is the same as $\mathrm{y}$ in CBOW). Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where only the dimension representing $w_j \in I$ is set to $1$; all the other dimensions have the value of $0$. All the other components, such as the hidden layer $\mathrm{h} \in \mathbb{R}^{1 \times d}$ and the weight matrices $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ and $\mathrm{W}_h \in \mathbb{R}^{n \times d}$, stay the same as the ones in CBOW.
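Under the same illustrative assumptions as the CBOW sketch, a Skip-gram forward pass reuses the identical architecture, feeding a one-hot vector for $w_k$ and scoring each context word $w_j \in I$ separately:

```python
import numpy as np

n, d = 5, 3                                 # |V| and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(n, d))
W_h = rng.normal(scale=0.1, size=(n, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(k):
    x = np.zeros(n); x[k] = 1.0             # one-hot input for the target word w_k
    h = sigmoid(x @ W_x)                    # same hidden layer as CBOW
    y = softmax(h @ W_h.T)                  # P(w_* | w_k) for every word in V
    return y

# One training example per context word: w_k predicts each w_j in I.
k, context_ids = 2, [0, 1, 3, 4]
y = skipgram_forward(k)
losses = [-np.log(y[j]) for j in context_ids]   # cross-entropy against each one-hot y
```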

What are the advantages of CBOW models compared to Skip-gram models, and vice versa?

Distributional Embeddings

What does each dimension in the hidden layer $\mathrm{h}$ represent for CBOW? It represents a feature obtained by aggregating specific aspects of each context word in $I$ that are deemed valuable for predicting the target word $w_k$. Formally, each dimension $\mathrm{h}_j$ is computed as the sigmoid activation of the dot product between the input vector $\mathrm{x}$ and the $j$'th column vector $\mathrm{cx}_j = \mathrm{W}_x[:,j] \in \mathbb{R}^{n \times 1}$ of $\mathrm{W}_x$:

$$\mathrm{h}_j = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{cx}_j)$$

Then, what does each row vector $\mathrm{rx}_i = \mathrm{W}_x[i,:] \in \mathbb{R}^{1 \times d}$ represent? The $j$'th dimension in $\mathrm{rx}_i$ denotes the weight of the $j$'th feature in $\mathrm{h}$ with respect to the $i$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $i$'th word. Thus, $\mathrm{rx}_i$ can serve as an embedding for the $i$'th word in $V$.

What about the other weight matrix $\mathrm{W}_h$? The $j$'th column vector $\mathrm{ch}_j = \mathrm{W}_h[:,j] \in \mathbb{R}^{n \times 1}$ denotes the weights of the $j$'th feature in $\mathrm{h}$ for all words in the vocabulary. Thus, the $i$'th dimension of $\mathrm{ch}_j$ indicates the importance of the $j$'th feature for the $i$'th word being predicted as the target word $w_k$.

On the other hand, the $i$'th row vector $\mathrm{rh}_i = \mathrm{W}_h[i,:] \in \mathbb{R}^{1 \times d}$ denotes the weights of all features for the $i$'th word in the vocabulary, enabling it to be utilized as an embedding for $w_i \in V$. However, in practice, only the row vectors of the first weight matrix $\mathrm{W}_x$ are employed as word embeddings, because the weights in $\mathrm{W}_h$ are often optimized for the downstream task, in this case predicting $w_k$, whereas the weights in $\mathrm{W}_x$ are optimized for finding representations that generalize across various tasks.
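As a sketch of how embeddings are read off after training, each row of $\mathrm{W}_x$ is taken as a word vector and compared with cosine similarity; the vocabulary and the (untrained) $\mathrm{W}_x$ below are hypothetical stand-ins for a model trained on a real corpus.

```python
import numpy as np

# Hypothetical vocabulary and a toy, untrained W_x; in practice W_x comes
# from training CBOW or Skip-gram on a corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
W_x = np.random.default_rng(0).normal(size=(len(vocab), 3))

def embedding(word):
    # rx_i = W_x[i, :] serves as the embedding of the i'th word.
    return W_x[word_to_id[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embedding("cat"), embedding("mat")))
```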

What are the implications of the weight matrices Wx\mathrm{W}_x and Wh\mathrm{W}_h in the Skip-gram model?

What limitations does the Word2Vec model have, and how can these limitations be addressed?

References

  1. Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.

  2. GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

  3. Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
