In the bag-of-words model, a document is represented as a multiset, or "bag", of words, disregarding word order and any other structure while keeping the frequency of every word.
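As a minimal sketch of this idea, the snippet below builds a bag for two hypothetical documents (not taken from the corpus in this section) using Python's `collections.Counter`. It illustrates that word frequencies are kept while word order is discarded.

```python
from collections import Counter

# Two hypothetical documents with the same words in a different order.
doc_a = ["the", "dog", "chased", "the", "cat"]
doc_b = ["the", "cat", "chased", "the", "dog"]

# A Counter behaves like a multiset: it maps each word to its frequency.
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a["the"])    # the frequency of "the" is kept: 2
print(bag_a == bag_b)  # word order is lost, so the two bags are equal: True
```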
Consider a corpus containing the following two tokenized documents:
The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:
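Let $D = [d_1, \ldots, d_n]$ be a document, where $d_i$ is the $i$'th word in $D$. A vector representation for $D$ can be defined as $v = [\mathrm{count}(w_j, D) : w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in the vocabulary $W$ and each dimension in $v$ is the frequency of $w_j$'s occurrences in $D$.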
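The definition can be sketched in a few lines of Python. The vocabulary `W` and the document `D` below are hypothetical examples chosen for illustration, not the corpus from this section, and `list.count` is used only for brevity.

```python
# Hypothetical vocabulary W (sorted word types) and tokenized document D.
W = [".", "a", "book", "bought", "funny", "john", "mary", "the", "was"]
D = ["john", "bought", "a", "book", ".", "the", "book", "was", "funny", "."]

# v[j] is the number of times the j'th vocabulary word occurs in D.
v = [D.count(w) for w in W]

print(v)  # [2, 1, 2, 1, 1, 1, 0, 1, 1]
```

The vector has one dimension per vocabulary word, so "mary", which does not occur in `D`, still takes up a dimension with value 0.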
Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v$ being 0 in practice, as most words in the vocabulary do not occur in document $D$. Therefore, it is more efficient to represent $v$ as a sparse vector that stores only its non-zero dimensions.
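One possible sparse format, assumed here because it matches the dictionary type used in the implementation later in this section, maps each non-zero dimension index to its count. The dense vector below is a hypothetical example.

```python
# Hypothetical dense bag-of-words vector over a 12-word vocabulary.
dense = [0, 2, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0]

# Keep only the non-zero dimensions as {dimension index: count}.
sparse = {j: count for j, count in enumerate(dense) if count != 0}

print(sparse)  # {1: 2, 4: 1, 9: 3}

# A missing key stands for 0, so the dense vector can be rebuilt on demand.
rebuilt = [sparse.get(j, 0) for j in range(len(dense))]
print(rebuilt == dense)  # True
```

Here the sparse form stores 3 entries instead of 12, and the saving grows with the size of the vocabulary.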
Q1: One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?
Implementation
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
Finally, let us test our bag-of-words model with the examples above:
Q2: Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
from typing import TypeAlias

Document: TypeAlias = list[str]
Vocab: TypeAlias = dict[str, int]

def vocabulary(documents: list[Document]) -> Vocab:
    # Collect every distinct word type in the corpus.
    vocab = set()
    for document in documents:
        vocab.update(document)
    # Sort the word types so that each word always receives the same ID.
    return {word: i for i, word in enumerate(sorted(vocab))}

from collections import Counter

SparseVector: TypeAlias = dict[int, int | float]

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    # Count every word in the document, then keep only the words found in the vocabulary.
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
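A quick way to try the two functions is sketched below. The two-document corpus is a hypothetical stand-in, not the corpus from this section, and the function bodies are repeated in condensed form only so that the snippet runs on its own.

```python
from collections import Counter

# Condensed copies of the two functions above, repeated only so that this
# snippet is self-contained.
def vocabulary(documents):
    return {w: i for i, w in enumerate(sorted({w for d in documents for w in d}))}

def bag_of_words(vocab, document):
    return {vocab[w]: c for w, c in sorted(Counter(document).items()) if w in vocab}

# Hypothetical corpus of two tokenized documents.
corpus = [
    ["john", "bought", "a", "book", "."],
    ["mary", "liked", "a", "book", "."],
]

vocab = vocabulary(corpus)
print(vocab)
# {'.': 0, 'a': 1, 'book': 2, 'bought': 3, 'john': 4, 'liked': 5, 'mary': 6}

# Bag-of-words for a document from the corpus.
print(bag_of_words(vocab, corpus[0]))  # {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}

# A new document: "about" is not in the vocabulary, so it is silently dropped.
unseen = ["mary", "bought", "a", "book", "about", "mary", "."]
print(bag_of_words(vocab, unseen))  # {0: 1, 1: 1, 2: 1, 3: 1, 6: 2}
```

The last call shows the behavior of the `if word in vocab` filter: words outside the vocabulary leave no trace in the resulting vector.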