Bag-of-Words Model
Overview
In the bag-of-words model, a document is represented as a multiset, or "bag", of words: word order and any other structure are discarded, but the frequency of every word is kept.
Consider a corpus containing the following two tokenized documents:
D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
The corpus contains 21 tokens in total but only 14 distinct word types, and the entire vocabulary can be represented as a list of all word types in this corpus:
W = [
'.', # 0
'John', # 1
'Mary', # 2
'The', # 3
'a', # 4
'book', # 5
'bought', # 6
'funny', # 7
'gave', # 8
'it', # 9
'liked', # 10
'the', # 11
'to', # 12
'was' # 13
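]
Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension in $v_i$ is the frequency of $w_j$'s occurrences in $D_i$ such that:
$v_1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]$
$v_2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]$
Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v_i$ being 0 in practice, as most words in the vocabulary do not occur in document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector that stores only the non-zero dimensions as index-count pairs:
$v_1 = \{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1\}$
$v_2 = \{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1\}$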
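The relationship between the dense and the sparse representation can be checked with a few lines of Python. This is only a minimal sketch: the helper names `dense_vector` and `to_sparse` are hypothetical, and the vocabulary is assumed to be the sorted list of word types from the two example documents.

```python
D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

# Vocabulary: every distinct word type, sorted so each one gets a stable index.
W = sorted(set(D1) | set(D2))

def dense_vector(vocab: list[str], document: list[str]) -> list[int]:
    """One count per vocabulary entry, zeros included (hypothetical helper)."""
    return [document.count(word) for word in vocab]

def to_sparse(vector: list[int]) -> dict[int, int]:
    """Keep only non-zero dimensions as {index: count} (hypothetical helper)."""
    return {i: count for i, count in enumerate(vector) if count > 0}

v1 = dense_vector(W, D1)
v2 = dense_vector(W, D2)
print(v1)             # [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
print(to_sparse(v1))  # {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
```

For this tiny corpus the saving is small (8 stored entries instead of 14 for D1), but with a realistic vocabulary of tens of thousands of word types almost every dimension of a dense vector would be zero.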
Q1: One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?
Implementation
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
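One way such a function might look is sketched below. The name `vocabulary` and the choice to sort the word types before numbering them are assumptions made for illustration; sorting simply makes the IDs deterministic and reproduces the index order of the list W shown earlier.

```python
def vocabulary(documents: list[list[str]]) -> dict[str, int]:
    """Map every distinct token in the corpus to a unique integer ID."""
    word_types = set()
    for document in documents:
        word_types.update(document)
    # Sorting makes the IDs deterministic across runs.
    return {word: i for i, word in enumerate(sorted(word_types))}

corpus = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.'],
]
vocab = vocabulary(corpus)
print(vocab)
```

Note that the mapping is case-sensitive, so 'The' and 'the' receive different IDs, exactly as in W.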
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
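A minimal sketch of this step follows. The name `bag_of_words` is an assumption, as is the decision to silently skip tokens that are missing from the vocabulary; the small vocabulary in the usage lines is hand-written for illustration only.

```python
from collections import Counter

def bag_of_words(vocab: dict[str, int], document: list[str]) -> dict[int, int]:
    """Return {word ID: count} for the tokens of `document` found in `vocab`."""
    counts = Counter(document)
    # Assumption: tokens that are not in the vocabulary are skipped silently.
    return {vocab[word]: count for word, count in counts.items() if word in vocab}

# Tiny hand-written vocabulary, for illustration only.
vocab = {'.': 0, 'a': 1, 'book': 2, 'funny': 3}
bow = bag_of_words(vocab, ['a', 'funny', 'book', '.', 'a', 'book', '!'])
print(bow)
```

Here 'a' and 'book' each get a count of 2, while the unknown token '!' does not appear in the result at all, which is the limitation that Q1 above asks about.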
Finally, let us test our bag-of-words model with the examples above:
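A possible end-to-end check on D1 and D2 is sketched here. So that the block runs on its own, it repeats compact stand-ins for the two functions described above; the names `vocabulary` and `bag_of_words` are assumptions, not necessarily those used in the referenced source file.

```python
from collections import Counter

def vocabulary(documents):
    # Compact stand-in: sorted word types mapped to consecutive IDs.
    return {w: i for i, w in enumerate(sorted({t for d in documents for t in d}))}

def bag_of_words(vocab, document):
    # Compact stand-in: sparse {word ID: count}, skipping unknown tokens.
    counts = Counter(document)
    return {vocab[w]: c for w, c in sorted(counts.items()) if w in vocab}

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

vocab = vocabulary([D1, D2])
b1 = bag_of_words(vocab, D1)
b2 = bag_of_words(vocab, D2)
print(b1)  # {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
print(b2)  # {0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
```

The keys are indices into the vocabulary and the values are word frequencies, so the entry `5: 2` for D1 says that 'book' occurs twice in that document.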
Q2: Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
References
Source: bag_of_words_model.py
Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials