In the bag-of-words model, a document is represented as a multiset, or "bag", of its words, disregarding word order and any other structure while maintaining the frequency of every word.
Consider a corpus containing the following two tokenized documents:
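For illustration, assume the following two hypothetical token lists (any tokenized texts would do):

```python
D1 = ["the", "cat", "sat", "on", "the", "mat"]
D2 = ["the", "dog", "ate", "the", "cat"]
```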
Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in the vocabulary $W$ and each dimension of $v_i$ is the frequency of $w_j$'s occurrences in $D_i$ such that:

$$v_i[j] = \mathrm{count}(w_j \in D_i)$$
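For instance, with the hypothetical documents above, the sorted vocabulary is $W = [\text{ate}, \text{cat}, \text{dog}, \text{mat}, \text{on}, \text{sat}, \text{the}]$, which gives

$$v_1 = [0, 1, 0, 1, 1, 1, 2], \qquad v_2 = [1, 1, 1, 0, 0, 0, 2]$$

where the last dimension, for example, counts the two occurrences of "the" in each document.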
One limitation of the bag-of-words model is its inability to capture word order. Is there a way to enhance the model so that it preserves word order?
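One common approach is to count contiguous n-grams instead of individual words, so that local word order is retained. A minimal sketch, using a hypothetical helper `bag_of_ngrams` that is not part of the accompanying code:

```python
from collections import Counter

def bag_of_ngrams(document: list[str], n: int = 2) -> Counter:
    # Count contiguous n-grams; each n-gram preserves the order of its n words.
    return Counter(tuple(document[i:i + n]) for i in range(len(document) - n + 1))

bag_of_ngrams(["the", "cat", "sat", "on", "the", "mat"])
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1,
#          ('on', 'the'): 1, ('the', 'mat'): 1})
```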
Notice that the bag-of-words model often results in a highly sparse vector: in practice, most words in the vocabulary $W$ do not occur in document $D_i$, so many dimensions of $v_i$ are 0. Therefore, it is more efficient to represent $v_i$ as a sparse vector that stores only the nonzero dimensions:

$$v_i = \{(j,\ \mathrm{count}(w_j \in D_i)) : w_j \in W,\ \mathrm{count}(w_j \in D_i) > 0\}$$
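With the hypothetical documents above, this yields

$$v_1 = \{1{:}1,\ 3{:}1,\ 4{:}1,\ 5{:}1,\ 6{:}2\}, \qquad v_2 = \{0{:}1,\ 1{:}1,\ 2{:}1,\ 6{:}2\}$$

using the word IDs from the sorted vocabulary.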
How does the bag-of-words model handle unknown words that are not in the vocabulary?
Implementation
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
```python
from src.types import Document, Vocab

def vocabulary(documents: list[Document]) -> Vocab:
    vocab = set()
    for document in documents:
        vocab.update(document)
    # Assign a unique ID to each word in alphabetical order.
    return {word: i for i, word in enumerate(sorted(vocab))}
```
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
```python
from collections import Counter
from src.types import SparseVector

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    counts = Counter(document)
    # Map each known word to its ID; words outside the vocabulary are skipped.
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
```
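Note that the `if word in vocab` filter silently discards words that are not in the vocabulary: unknown words simply do not appear in the resulting sparse vector, which answers the question raised earlier.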
Finally, let us test our bag-of-words model with the examples above:
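A minimal run, assuming the hypothetical documents from earlier:

```python
documents = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "cat"],
]

vocab = vocabulary(documents)
print(vocab)
# {'ate': 0, 'cat': 1, 'dog': 2, 'mat': 3, 'on': 4, 'sat': 5, 'the': 6}

print(bag_of_words(vocab, documents[0]))
# {1: 1, 3: 1, 4: 1, 5: 1, 6: 2}
print(bag_of_words(vocab, documents[1]))
# {0: 1, 1: 1, 2: 1, 6: 2}
```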