Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a multiset, or "bag", of its words, disregarding any structure but preserving the frequency of every word.

Consider a corpus containing the following two tokenized documents:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

The corpus contains a total of 21 word tokens over 14 unique word types (note that 'The' and 'the' are distinct types), and the entire vocabulary can be represented as a list of all word types in this corpus:

W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]

Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension of $v_i$ is the frequency of $w_j$'s occurrences in $D_i$, such that:

v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]

Notice that the bag-of-words model often results in a highly sparse vector: in practice, many dimensions of $v_i$ are 0, as most words in the vocabulary $W$ do not occur in the document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector that stores only the non-zero dimensions as mappings from word IDs to counts:

v1 = {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
v2 = {0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}

Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
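A minimal sketch of such a function is given below; the name create_vocabulary and the choice to assign IDs in sorted order (which reproduces the IDs shown in W above) are assumptions, and any consistent naming and ID scheme would work:

def create_vocabulary(documents):
    """Map every word type in the corpus to a unique ID.

    documents: a list of documents, each a list of tokens.
    Returns a dictionary whose keys are words and whose values are unique IDs.
    """
    # create_vocabulary is an assumed name; IDs follow sorted order, as in W above.
    words = sorted({word for document in documents for word in document})
    return {word: word_id for word_id, word in enumerate(words)}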

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
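One way to sketch this function, assuming the name bag_of_words and using collections.Counter to tally word frequencies:

from collections import Counter

def bag_of_words(vocab, document):
    """Return the bag-of-words of a tokenized document as a sparse vector:
    a dictionary mapping word IDs to their counts, omitting zero counts.

    vocab: a dictionary mapping words to unique IDs.
    document: a list of tokens.
    """
    # bag_of_words is an assumed name; words missing from vocab are skipped.
    counts = Counter(token for token in document if token in vocab)
    return dict(sorted((vocab[token], count) for token, count in counts.items()))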

Finally, let us test our bag-of-words model with the examples above:
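Putting the sketches above together with D1 and D2 from the overview reproduces the sparse vectors derived earlier:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

vocab = create_vocabulary([D1, D2])
print(bag_of_words(vocab, D1))
# {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
print(bag_of_words(vocab, D2))
# {0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}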

