Word2Vec

Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.

Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
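
For a concrete feel of this embedding space, the sketch below trains a small model with the gensim library and queries its nearest neighbors. The toy corpus and all parameter values are illustrative assumptions, not part of this chapter; any Word2Vec implementation would do.

```python
# A minimal sketch, assuming the gensim library is installed.
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
corpus = [
    ["the", "dog", "barked", "at", "the", "mailman"],
    ["the", "cat", "chased", "the", "dog"],
    ["a", "dog", "and", "a", "cat", "played", "outside"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=200)

print(model.wv["dog"].shape)          # 50-dimensional embedding for "dog"
print(model.wv.most_similar("dog"))   # words positioned closest to "dog" in the space
```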

Continuous Bag-of-Words

Consider a sequence of words, $\{w_{k-2}, w_{k-1}, w_k, w_{k+1}, w_{k+2}\}$. We can predict $w_k$ by leveraging its contextual words with a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):

$$
w_k = \arg\max_{\forall w_* \in V} P(w_* | w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2})
$$

This objective can also be achieved by a discriminative model such as Continuous Bag-of-Words (CBOW), implemented as a multilayer perceptron. Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where $n = |V|$. $\mathrm{x}$ is created by the bag-of-words model on the set of context words $I = \{w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2}\}$, such that only the dimensions of $\mathrm{x}$ representing words in $I$ have a value of $1$; otherwise, they are set to $0$.

Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where all dimensions have the value of $0$ except for the one representing $w_k$, which is set to $1$.

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$, and $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{x}$ and $\mathrm{h}$, where the sigmoid function is used as the activation function:

$$
\mathrm{h} = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{W}_x)
$$

Finally, let $\mathrm{W}_h \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{h}$ and $\mathrm{y}$:

$$
\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^T)
$$

Thus, each dimension in $\mathrm{y}$ represents the probability of the corresponding word being $w_k$ given the set of context words $I$.
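
To make the forward pass above concrete, here is a minimal NumPy sketch of CBOW with a toy vocabulary and randomly initialized (untrained) weight matrices. All names and values are illustrative assumptions, not model code from the references.

```python
import numpy as np

np.random.seed(0)

V = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
n, d = len(V), 3                          # n = |V|, d = hidden dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_x = np.random.randn(n, d) * 0.1   # weight matrix between x and h
W_h = np.random.randn(n, d) * 0.1   # weight matrix between h and y

# Bag-of-words input for the context I = {the, sat, on, mat}, predicting "cat".
x = np.zeros((1, n))
for w in ["the", "sat", "on", "mat"]:
    x[0, V.index(w)] = 1

h = sigmoid(x @ W_x)        # h = sigmoid(x . W_x), shape (1, d)
y = softmax(h @ W_h.T)      # y = softmax(h . W_h^T), shape (1, n)

print({w: round(float(p), 3) for w, p in zip(V, y[0])})   # P(w | I) for every w in V
```

With untrained weights the output distribution is near-uniform; training adjusts W_x and W_h so that the dimension of y corresponding to the true target word receives most of the probability mass.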


Q13: What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?

Skip-gram

In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $I$ given $w_k$. Formally, the objective of a Skip-gram model is as follows:

$$
\begin{align*}
w_{k-2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k-1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k+1} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)\\
w_{k+2} &= \arg\max_{\forall w_* \in V} P(w_*|w_k)
\end{align*}
$$

Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where only the dimension representing $w_k$ is set to $1$; all the other dimensions have the value of $0$ (thus, $\mathrm{x}$ in Skip-gram is the same as $\mathrm{y}$ in CBOW). Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where only the dimension representing $w_j \in I$ is set to $1$; all the other dimensions have the value of $0$. All the other components, such as the hidden layer $\mathrm{h} \in \mathbb{R}^{1 \times d}$ and the weight matrices $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ and $\mathrm{W}_h \in \mathbb{R}^{n \times d}$, stay the same as the ones in CBOW.
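
The sketch below, a hypothetical helper for illustration only, shows the one-hot $(\mathrm{x}, \mathrm{y})$ training pairs a Skip-gram model would see for a single target word and its ±2 context window.

```python
import numpy as np

V = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
n = len(V)

def one_hot(word):
    v = np.zeros((1, n))
    v[0, V.index(word)] = 1
    return v

sentence = ["the", "cat", "sat", "on", "mat"]
k, window = 2, 2                          # target word "sat", context window of +/-2

target = sentence[k]
context = [sentence[j] for j in range(max(0, k - window), min(len(sentence), k + window + 1)) if j != k]

# One (x, y) pair per context word: x encodes w_k, y encodes one w_j in I.
pairs = [(one_hot(target), one_hot(w)) for w in context]
for x, y in pairs:
    print("x:", x[0].astype(int), "->", "y:", y[0].astype(int))
```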


Q14: What are the advantages of CBOW models compared to Skip-gram models, and vice versa?

Distributional Embeddings

What does each dimension in the hidden layer $\mathrm{h}$ represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $I$ that are deemed valuable for predicting the target word $w_k$. Formally, each dimension $\mathrm{h}_j$ is computed as the sigmoid activation of the weighted sum between the input vector $\mathrm{x}$ and the $j$'th column vector $\mathrm{cx}_j = \mathrm{W}_x[:,j]$ such that:

$$
\mathrm{h}_j = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{cx}_j)
$$

Then, what does each row vector $\mathrm{rx}_i = \mathrm{W}_x[i,:] \in \mathbb{R}^{1 \times d}$ represent? The $j$'th dimension in $\mathrm{rx}_i$ denotes the weight of the $j$'th feature in $\mathrm{h}$ with respect to the $i$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $i$'th word. Thus, $\mathrm{rx}_i$ can serve as an embedding for the $i$'th word in $V$.

What about the other weight matrix $\mathrm{W}_h$? The $j$'th column vector $\mathrm{ch}_j = \mathrm{W}_h[:,j] \in \mathbb{R}^{n \times 1}$ denotes the weights of the $j$'th feature in $\mathrm{h}$ for all words in the vocabulary. Thus, the $i$'th dimension of $\mathrm{ch}_j$ indicates the importance of the $j$'th feature for the $i$'th word being predicted as the target word $w_k$.

On the other hand, the $i$'th row vector $\mathrm{rh}_i = \mathrm{W}_h[i,:] \in \mathbb{R}^{1 \times d}$ denotes the weights of all features for the $i$'th word in the vocabulary, enabling it to be utilized as an embedding for $w_i \in V$. However, in practice, only the row vectors of the first weight matrix $\mathrm{W}_x$ are employed as word embeddings because the weights in $\mathrm{W}_h$ are often optimized for the downstream task, in this case predicting $w_k$, whereas the weights in $\mathrm{W}_x$ are optimized for finding representations that are generalizable across various tasks.
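
As a small illustration of this point, the sketch below reads word embeddings off the rows of $\mathrm{W}_x$ and compares them with cosine similarity. The random, untrained $\mathrm{W}_x$ here is only a stand-in for a trained matrix, so the similarity value itself is meaningless; the names are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

V = ["the", "cat", "sat", "on", "mat"]
n, d = len(V), 3
W_x = np.random.randn(n, d)   # in practice, W_x comes from a trained CBOW/Skip-gram model

def embedding(word):
    # rx_i = W_x[i, :] serves as the embedding of the i'th word in V.
    return W_x[V.index(word), :]

print(cosine(embedding("cat"), embedding("mat")))
```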


Q15: What are the implications of the weight matrices $\mathrm{W}_x$ and $\mathrm{W}_h$ in the Skip-gram model?


Q16: What limitations does the Word2Vec model have, and how can these limitations be addressed?

References

  1. Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.

  2. GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

  3. Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
