Word2Vec

Neural language models leverage neural networks trained on extensive text data, enabling them to discern patterns and connections between terms and documents. Through this training, neural language models gain the ability to comprehend and generate human-like language with remarkable fluency and coherence.

Word2Vec is a neural language model that maps words into a high-dimensional embedding space, positioning similar words closer to each other.
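
The following is a minimal sketch of training and querying a Word2Vec model with the gensim library; the toy corpus, hyperparameters, and queried word are illustrative assumptions rather than part of the course material.

```python
# A minimal sketch, assuming gensim is installed (pip install gensim).
from gensim.models import Word2Vec

# Hypothetical toy corpus: a list of tokenized sentences.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(model.wv["cat"])               # the embedding vector for "cat"
print(model.wv.most_similar("cat"))  # words positioned closest to "cat" in the embedding space
```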

Continuous Bag-of-Words

Consider a sequence of words, $\{w_{k-2}, w_{k-1}, w_k, w_{k+1}, w_{k+2}\}$. We can predict $w_k$ by leveraging its contextual words, using a generative model similar to the n-gram models discussed previously ($V$: a vocabulary list comprising all unique words in the corpus):

$$w_k = \arg\max_{\forall w_* \in V} P(w_* \mid w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2})$$

This objective can also be achieved by using a discriminative model such as Continuous Bag-of-Words (CBOW), implemented as a multilayer perceptron. Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where $n = |V|$. $\mathrm{x}$ is created by applying the bag-of-words model to the set of context words $I = \{w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2}\}$, such that only the dimensions of $\mathrm{x}$ representing words in $I$ have a value of $1$; otherwise, they are set to $0$.

Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where all dimensions have the value of $0$ except for the one representing $w_k$, which is set to $1$.
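
As a concrete illustration, the sketch below constructs $\mathrm{x}$ and $\mathrm{y}$ with NumPy for a hypothetical five-word vocabulary; the vocabulary, context words, and target word are assumptions made only for this example.

```python
import numpy as np

# Hypothetical vocabulary; in practice, V contains every unique word in the corpus.
V = ["the", "cat", "sat", "on", "mat"]
word2idx = {w: i for i, w in enumerate(V)}
n = len(V)

def bag_of_words(words):
    """Return a 1-by-n vector with 1 in the dimensions of `words` and 0 elsewhere."""
    v = np.zeros((1, n))
    for w in words:
        v[0, word2idx[w]] = 1
    return v

# Context words I = {w_{k-2}, w_{k-1}, w_{k+1}, w_{k+2}} and target word w_k = "sat".
x = bag_of_words(["the", "cat", "on", "mat"])  # input vector for CBOW
y = bag_of_words(["sat"])                      # output vector (one-hot for w_k)
```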

Let $\mathrm{h} \in \mathbb{R}^{1 \times d}$ be a hidden layer between $\mathrm{x}$ and $\mathrm{y}$, and $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{x}$ and $\mathrm{h}$, where the sigmoid function is used as the activation function:

$$\mathrm{h} = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{W}_x)$$

Finally, let $\mathrm{W}_h \in \mathbb{R}^{n \times d}$ be the weight matrix between $\mathrm{h}$ and $\mathrm{y}$:

$$\mathrm{y} = \mathrm{softmax}(\mathrm{h} \cdot \mathrm{W}_h^{T})$$

Thus, each dimension in $\mathrm{y}$ represents the probability of the corresponding word being $w_k$ given the set of context words $I$.
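
Continuing the NumPy sketch above, a single CBOW forward pass follows directly from the two equations; $\mathrm{W}_x$ and $\mathrm{W}_h$ are randomly initialized here purely for illustration, whereas in practice they are learned by minimizing a loss (e.g., cross-entropy) over many training examples.

```python
d = 10  # hidden (embedding) dimension, chosen arbitrarily for this sketch

rng = np.random.default_rng(0)
Wx = rng.normal(size=(n, d))  # weight matrix between x and h
Wh = rng.normal(size=(n, d))  # weight matrix between h and y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = sigmoid(x @ Wx)        # hidden layer, shape (1, d)
y_hat = softmax(h @ Wh.T)  # predicted distribution over V, shape (1, n)

print(V[int(np.argmax(y_hat))])  # word with the highest predicted probability
```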

Q13: What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?

Skip-gram

In CBOW, a word is predicted by considering its surrounding context. Another approach, known as Skip-gram, reverses the objective such that instead of predicting a word given its context, it predicts each of the context words in $I$ given $w_k$. Formally, the objective of a Skip-gram model is as follows:

$$\begin{align*} w_{k-2} &= \arg\max_{\forall w_* \in V} P(w_* \mid w_k)\\ w_{k-1} &= \arg\max_{\forall w_* \in V} P(w_* \mid w_k)\\ w_{k+1} &= \arg\max_{\forall w_* \in V} P(w_* \mid w_k)\\ w_{k+2} &= \arg\max_{\forall w_* \in V} P(w_* \mid w_k) \end{align*}$$

Let $\mathrm{x} \in \mathbb{R}^{1 \times n}$ be an input vector, where only the dimension representing $w_k$ is set to $1$; all the other dimensions have the value of $0$ (thus, $\mathrm{x}$ in Skip-gram is the same as $\mathrm{y}$ in CBOW). Let $\mathrm{y} \in \mathbb{R}^{1 \times n}$ be an output vector, where only the dimension representing $w_j \in I$ is set to $1$; all the other dimensions have the value of $0$. All the other components, such as the hidden layer $\mathrm{h} \in \mathbb{R}^{1 \times d}$ and the weight matrices $\mathrm{W}_x \in \mathbb{R}^{n \times d}$ and $\mathrm{W}_h \in \mathbb{R}^{n \times d}$, stay the same as the ones in CBOW.
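
To make the Skip-gram objective concrete, the sketch below enumerates (target, context) training pairs from a toy sentence with a window size of 2; each pair corresponds to one of the arg max terms above. The sentence and window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (w_k, context word) pairs for every position k in the sentence."""
    pairs = []
    for k, target in enumerate(tokens):
        for j in range(max(0, k - window), min(len(tokens), k + window + 1)):
            if j != k:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# e.g., ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the'), ...
```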

Q14: What are the advantages of CBOW models compared to Skip-gram models, and vice versa?

Distributional Embeddings

What does each dimension in the hidden layer $\mathrm{h}$ represent for CBOW? It represents a feature obtained by aggregating specific aspects from each context word in $I$, deemed valuable for predicting the target word $w_k$. Formally, each dimension $\mathrm{h}_j$ is computed as the sigmoid activation of the weighted sum between the input vector $\mathrm{x}$ and the $j$'th column vector $\mathrm{cx}_j = \mathrm{W}_x[:,j] \in \mathbb{R}^{n \times 1}$ such that:

$$\mathrm{h}_j = \mathrm{sigmoid}(\mathrm{x} \cdot \mathrm{cx}_j)$$

Then, what does each row vector $\mathrm{rx}_i = \mathrm{W}_x[i,:] \in \mathbb{R}^{1 \times d}$ represent? The $j$'th dimension in $\mathrm{rx}_i$ denotes the weight of the $j$'th feature in $\mathrm{h}$ with respect to the $i$'th word in the vocabulary. In other words, it indicates the importance of the corresponding feature in representing the $i$'th word. Thus, $\mathrm{rx}_i$ can serve as an embedding for the $i$'th word in $V$.

What about the other weight matrix $\mathrm{W}_h$? The $j$'th column vector $\mathrm{ch}_j = \mathrm{W}_h[:,j] \in \mathbb{R}^{n \times 1}$ denotes the weights of the $j$'th feature in $\mathrm{h}$ for all words in the vocabulary. Thus, the $i$'th dimension of $\mathrm{ch}_j$ indicates the importance of the $j$'th feature for the $i$'th word being predicted as the target word $w_k$.

On the other hand, the $i$'th row vector $\mathrm{rh}_i = \mathrm{W}_h[i,:] \in \mathbb{R}^{1 \times d}$ denotes the weights of all features for the $i$'th word in the vocabulary, enabling it to be utilized as an embedding for $w_i \in V$. However, in practice, only the row vectors of the first weight matrix $\mathrm{W}_x$ are employed as word embeddings, because the weights in $\mathrm{W}_h$ are optimized for the downstream task (in this case, predicting $w_k$), whereas the weights in $\mathrm{W}_x$ are optimized for finding representations that are generalizable across various tasks.
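
Following this convention, the sketch below (reusing the NumPy variables from the CBOW sketch) treats each row of $\mathrm{W}_x$ as a word embedding and compares two words with cosine similarity; with randomly initialized weights the score is meaningless, but after training, words that appear in similar contexts should score higher.

```python
def embedding(word):
    """Return the row of Wx for `word`, used as its d-dimensional embedding."""
    return Wx[word2idx[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embedding("cat"), embedding("mat")))
```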

Q15: What are the implications of the weight matrices $\mathrm{W}_x$ and $\mathrm{W}_h$ in the Skip-gram model?

Q16: What limitations does the Word2Vec model have, and how can these limitations be addressed?

References

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Proceedings of the International Conference on Learning Representations (ICLR), 2013.

GloVe: Global Vectors for Word Representation, Jeffrey Pennington, Richard Socher, Christopher Manning, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.