One-hot encoding represents each word as a binary vector in which every dimension is zero except one, which is set to one to indicate the presence of that word.
Consider the following vocabulary:
Given a vocabulary size of 4, each word is represented as a 4-dimensional vector as illustrated below:
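As a minimal sketch of this idea, the snippet below builds one-hot vectors for a hypothetical 4-word vocabulary (the specific words are assumptions chosen for illustration):

```python
import numpy as np

# Hypothetical 4-word vocabulary, for illustration only.
vocab = ["king", "queen", "male", "female"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector of zeros with a single one at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("queen"))  # [0. 1. 0. 0.]
```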
One-hot encoding has been widely adopted in traditional NLP models because it provides a simple and efficient sparse-vector representation of words.
What are the drawbacks of using one-hot encoding to represent word vectors?
Word embeddings are dense vector representations of words in a continuous vector space. Each word is mapped to a point in this high-dimensional space, where the dimensions correspond to different contextual features of the word's meaning.
Consider the embeddings for three words, 'king', 'male', and 'female':
Based on these values, we can infer that the four dimensions in this vector space represent royalty, gender, male, and female, respectively, so the embedding for the word 'queen' can be estimated as follows:
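A minimal sketch of this estimate, assuming illustrative (not learned) 4-dimensional embeddings for 'king', 'male', and 'female':

```python
import numpy as np

# Hypothetical embedding values, for illustration only.
# Dimensions (as inferred above): royalty, gender, male, female.
king   = np.array([0.95, 0.90, 0.90, 0.05])
male   = np.array([0.05, 0.90, 0.95, 0.05])
female = np.array([0.05, 0.90, 0.05, 0.95])

# Estimate 'queen' by removing the male component from 'king'
# and adding the female component.
queen = king - male + female
print(queen)  # approximately [0.95, 0.90, 0.00, 0.95]: royal and female
```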
The key idea is to capture semantic relationships between words by representing them such that similar words have similar vector representations. These embeddings are learned from large amounts of text data, where the model aims to predict or capture the contexts in which words appear.
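As a sketch of how such embeddings can be learned from context, the example below trains skip-gram vectors with the gensim library's Word2Vec; the toy corpus and hyperparameters are assumptions chosen for illustration:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only).
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "male", "heir", "became", "king"],
    ["the", "female", "heir", "became", "queen"],
]

# Train skip-gram embeddings; vector_size sets the embedding dimensionality.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

# Words that appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("king", topn=3))
```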
In the above examples, each dimension represents a distinct type of meaning. In practice, however, a single dimension can encapsulate multiple types of meaning, and a single type of meaning can be spread across a weighted combination of several dimensions, making it difficult to interpret precisely what each dimension represents.