Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a multiset, or "bag", of its words, disregarding word order and grammatical structure but retaining the frequency of every word.

Consider a corpus containing the following two documents:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

The corpus contains a total of 21 word tokens comprising 14 word types, and the entire vocabulary can be represented as a sorted list of all word types in this corpus:

W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]

Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation of $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension of $v_i$ is the frequency of $w_j$'s occurrences in $D_i$, such that:

#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
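
As a quick check, the dense vector v1 above can be reproduced directly from W and D1. The snippet below is a minimal sketch, separate from the implementation later in this section:

from collections import Counter

# Count each token in D1, then read off one count per vocabulary word in W.
counts = Counter(D1)
v1 = [counts[w] for w in W]
print(v1)   # [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]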

Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions of $v_i$ being 0 in practice, as most words in the vocabulary $W$ do not occur in document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector:

v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}

Q1: One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?
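
One possible direction, sketched below rather than prescribed, is to reserve an extra dimension for unknown words so that out-of-vocabulary tokens are at least counted instead of silently dropped. The token name '<unk>' and the document D3 are hypothetical examples:

from collections import Counter

UNK = '<unk>'               # hypothetical placeholder for out-of-vocabulary words
W_unk = W + [UNK]           # extend the vocabulary above with one extra entry (ID 14)

D3 = ['John', 'sold', 'the', 'book', '.']    # 'sold' never appears in the corpus
counts = Counter(w if w in W else UNK for w in D3)
v3 = [counts[w] for w in W_unk]
print(v3)   # 'sold' is counted in the <unk> dimension instead of being lost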

Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:

from typing import TypeAlias

Document: TypeAlias = list[str]
Vocab: TypeAlias = dict[str, int]

def vocabulary(documents: list[Document]) -> Vocab:
    # Collect every distinct word type across all documents.
    vocab = set()

    for document in documents:
        vocab.update(document)

    # Assign each word type a unique ID in alphabetical order.
    return {word: i for i, word in enumerate(sorted(vocab))}

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

from collections import Counter

SparseVector: TypeAlias = dict[int, int | float]

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    # Count the occurrences of each word in the document.
    counts = Counter(document)
    # Keep only in-vocabulary words, mapping each word to its ID.
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

Finally, let us test our bag-of-words model with the examples above:

documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))
{
    '.': 0,
    'John': 1,
    'Mary': 2,
    'The': 3,
    'a': 4,
    'book': 5,
    'bought': 6,
    'funny': 7,
    'gave': 8,
    'it': 9,
    'liked': 10,
    'the': 11,
    'to': 12,
    'was': 13
}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
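
For comparison, the same counts can be obtained with scikit-learn's CountVectorizer, which is covered in the scikit-learn tutorial cited in the references. This sketch assumes a recent version of scikit-learn is installed and passes a custom analyzer so that the pre-tokenized documents above are used as-is:

from sklearn.feature_extraction.text import CountVectorizer

# The analyzer callable makes CountVectorizer treat each document as an already-tokenized list.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(documents)      # sparse count matrix of shape (2, 14)

print(vectorizer.get_feature_names_out())    # alphabetically sorted vocabulary
print(X.toarray())                           # dense count vectors matching v1 and v2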

Q2: Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
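
One possible direction, sketched here rather than given as a definitive answer, is to count contiguous word pairs (bigrams) in addition to, or instead of, single words, so that local word order is partially preserved. The helper bigrams below is a hypothetical addition that reuses the vocabulary and bag_of_words functions defined above:

def bigrams(document: Document) -> Document:
    # Join each pair of adjacent tokens into a single feature, e.g., 'John bought'.
    return [f'{w1} {w2}' for w1, w2 in zip(document, document[1:])]

bigram_documents = [bigrams(document) for document in documents]
bigram_vocab = vocabulary(bigram_documents)
print(bag_of_words(bigram_vocab, bigram_documents[0]))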

References

Source: bag_of_words_model.py

Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials