In the bag-of-words model, a document is represented as a multiset, or "bag", of its words, disregarding word order and grammatical structure but maintaining information about the frequency of every word.
Consider a corpus containing the following two tokenized documents:
The corpus contains a total of 14 word tokens, and the entire vocabulary can be represented as a list of all distinct word types in this corpus:
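As a concrete illustration, consider the following two hypothetical tokenized documents (invented for this sketch; they are chosen so that the corpus contains 14 tokens in total):

```python
# Two hypothetical tokenized documents (14 word tokens in total).
D1 = ["john", "bought", "a", "book", "yesterday"]
D2 = ["mary", "bought", "a", "book", "and", "a", "pen", "today", "!"]
corpus = [D1, D2]

# The vocabulary: a list of all distinct word types in the corpus.
word_types = sorted({token for document in corpus for token in document})
print(word_types)
# ['!', 'a', 'and', 'book', 'bought', 'john', 'mary', 'pen', 'today', 'yesterday']
```

Note that the 14 tokens collapse to only 10 word types, since words such as "bought", "a", and "book" occur in both documents.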
One limitation of the bag-of-words model is its inability to capture word order. Is there a way to extend the bag-of-words model so that it preserves word order?
How does the bag-of-words model handle unknown words that are not in the vocabulary?
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
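A minimal sketch of such a function (the name `build_vocab` and the first-occurrence ID ordering are assumptions, not necessarily what the original bag_of_words_model.py uses):

```python
def build_vocab(documents):
    """Map every word type in the tokenized documents to a unique ID.

    IDs are assigned in order of first occurrence, starting at 0.
    """
    vocab = {}
    for document in documents:
        for token in document:
            # setdefault inserts the token only if it is not yet present.
            vocab.setdefault(token, len(vocab))
    return vocab
```

For example, `build_vocab([["a", "b", "a"]])` returns `{"a": 0, "b": 1}`: the repeated "a" does not receive a new ID.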
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
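One possible implementation, assuming the sparse vector is encoded as a sorted list of (word ID, count) pairs (this encoding is an assumption for illustration):

```python
from collections import Counter

def bag_of_words(vocab, document):
    """Return the bag-of-words of `document` as a sparse vector:
    a sorted list of (word ID, count) pairs, one per word type that
    occurs in the document. Words missing from `vocab` are skipped.
    """
    counts = Counter(document)
    return sorted((vocab[token], count) for token, count in counts.items() if token in vocab)
```

For example, `bag_of_words({"a": 0, "b": 1}, ["b", "a", "b"])` returns `[(0, 1), (1, 2)]`. Skipping out-of-vocabulary words is only one possible policy; see the question about unknown words above.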
Finally, let us test our bag-of-words model with the examples above:
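The pipeline can be exercised end to end as follows (a self-contained sketch using hypothetical documents; the vocabulary and sparse-vector construction are inlined so the snippet runs on its own):

```python
from collections import Counter

# Hypothetical example documents (invented for this sketch).
D1 = ["john", "bought", "a", "book", "yesterday"]
D2 = ["mary", "bought", "a", "book", "and", "a", "pen", "today", "!"]

# Vocabulary: word type -> unique ID, in first-occurrence order.
vocab = {}
for document in (D1, D2):
    for token in document:
        vocab.setdefault(token, len(vocab))

# Bag-of-words of each document as sorted (word ID, count) pairs.
sparse_vectors = [
    sorted((vocab[token], count) for token, count in Counter(document).items())
    for document in (D1, D2)
]
print(sparse_vectors[0])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
print(sparse_vectors[1])  # [(1, 1), (2, 2), (3, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
```

Note that in the second vector the pair `(2, 2)` records that word ID 2 ("a") occurs twice, while every dimension for a word absent from the document is simply omitted.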
Source: bag_of_words_model.py
Let $D = w_1 w_2 \cdots w_n$ be a document, where $w_i$ is the $i$'th word in $D$. A vector representation for $D$ can be defined as $v = [v_1, v_2, \ldots, v_m]$, where $t_j$ is the $j$'th word in the vocabulary $V$ and each dimension $v_j$ in $v$ is the frequency of $t_j$'s occurrences in $D$ such that:

$$v_j = \mathrm{count}(t_j, D) = |\{\, i \mid w_i = t_j \,\}|$$
Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v$ being 0 in practice, as most words in the vocabulary do not occur in document $D$. Therefore, it is more efficient to represent $v$ as a sparse vector:
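For instance, over a hypothetical 8-word vocabulary, only the nonzero dimensions of $v$ need to be stored:

```python
# Dense bag-of-words over a hypothetical 8-word vocabulary: mostly zeros.
dense = [2, 0, 0, 1, 0, 0, 0, 1]

# Sparse representation: (index, count) pairs for nonzero dimensions only.
sparse = [(i, c) for i, c in enumerate(dense) if c > 0]
print(sparse)  # [(0, 2), (3, 1), (7, 1)]
```

With a realistic vocabulary of tens of thousands of word types, a short document yields a dense vector that is almost entirely zeros, so the sparse form saves both memory and computation.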