Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a multiset, or a "bag", of its words: any word order or grammatical structure is disregarded, but the frequency of every word is retained.

Consider a corpus containing the following two tokenized documents:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

The corpus contains a total of 21 word tokens, of which 14 are unique. The entire vocabulary can be represented as a sorted list of all word types in this corpus:

W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]

Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension in $v_i$ is the frequency of $w_j$'s occurrences in $D_i$ such that:

#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
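
This dense representation can be computed directly from the vocabulary list. The following is a minimal sketch, assuming W is the sorted list of word types defined above; the function name dense_vector is ours, not part of the original source:

from collections import Counter

def dense_vector(W: list[str], document: list[str]) -> list[int]:
    # Count the tokens, then read off one count per vocabulary word;
    # Counter returns 0 for words that do not occur in the document.
    counts = Counter(document)
    return [counts[word] for word in W]

# dense_vector(W, D1) -> [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]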

One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model so that it preserves word order?
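
One common enhancement, though not the only one, is to count contiguous word pairs (bigrams) or longer n-grams instead of individual words, so that local word order is preserved. Below is a minimal sketch; bag_of_bigrams is a hypothetical helper, not part of the model above:

from collections import Counter

def bag_of_bigrams(document: list[str]) -> Counter:
    # Pair every token with its successor; counting these pairs
    # preserves which words appear next to each other.
    return Counter(zip(document, document[1:]))

# bag_of_bigrams(D1) includes ('John', 'bought'): 1 and ('The', 'book'): 1.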

Notice that the bag-of-words model often results in a highly sparse vector: in practice, many dimensions in $v_i$ are 0 because most words in the vocabulary $W$ do not occur in document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector:

v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}

How does the bag-of-words model handle unknown words that are not in the vocabulary?
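
The implementation below simply drops words that are not in the vocabulary. An alternative, sketched here as an illustrative assumption rather than part of this section's code, is to reserve a special token such as '<unk>' in the vocabulary and map every out-of-vocabulary word to its ID:

UNK = '<unk>'

def word_id(vocab: dict[str, int], word: str) -> int:
    # Fall back to the reserved unknown-word ID; this assumes '<unk>'
    # was added to the vocabulary when it was built (a hypothetical setup).
    return vocab.get(word, vocab[UNK])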

Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary whose keys are words and whose values are their corresponding unique IDs:

from src.types import Document, Vocab

def vocabulary(documents: list[Document]) -> Vocab:
    # Collect every unique word type across all documents.
    vocab = set()

    for document in documents:
        vocab.update(document)

    # Sort the word types and assign each one a unique ID by its sorted position.
    return {word: i for i, word in enumerate(sorted(vocab))}
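
Sorting the word types before enumerating them makes the ID assignment deterministic, so building the vocabulary from the same corpus always produces the same mapping.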

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

from collections import Counter
from src.types import SparseVector

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    # Count the occurrences of every word in the document.
    counts = Counter(document)
    # Map each word to its ID; words that are not in the vocabulary are dropped.
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
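
Note that the "if word in vocab" condition silently discards any out-of-vocabulary word, which answers the unknown-word question raised earlier: this implementation simply ignores such words.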

Finally, let us test our bag-of-words model with the examples above:

documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))
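
Running this code prints the vocabulary mapping followed by the two sparse vectors derived above:

{'.': 0, 'John': 1, 'Mary': 2, 'The': 3, 'a': 4, 'book': 5, 'bought': 6, 'funny': 7, 'gave': 8, 'it': 9, 'liked': 10, 'the': 11, 'to': 12, 'was': 13}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}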

References

Source: bag_of_words_model.py
