# Bag-of-Words Model

## Overview

In the **bag-of-words model**, a document is represented as a multiset, or "bag", of words: word order and any other structure are discarded, but the frequency of every word is kept.

Consider a corpus containing the following two [tokenized](/nlp-essentials/text-processing/tokenization.md) documents:

```python
D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
```

The corpus contains 21 tokens in total but only 14 distinct word types, and the entire vocabulary can be represented as a list of all word types in this corpus:

```python
W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]
```
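The distinction between tokens (word occurrences) and word types (distinct word forms) can be checked directly. The following is a minimal sketch; the variable names `num_tokens` and `word_types` are chosen here for illustration only:

```python
D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

# Tokens are word occurrences, so every repetition counts.
num_tokens = len(D1) + len(D2)

# Word types are distinct word forms; sorting gives the same order as W.
word_types = sorted(set(D1) | set(D2))

print(num_tokens)       # 21
print(len(word_types))  # 14
```

Note that `'The'` and `'the'` are counted as different word types because no case normalization is applied.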

Let $$D\_i = \[w\_{i,1}, \ldots, w\_{i,n}]$$ be a document, where $$w\_{i,j}$$ is the $$j$$'th word in $$D\_i$$. A vector representation for $$D\_i$$ can be defined as $$v\_i = \[\mathrm{count}(w\_j \in D\_i) : \forall w\_j \in W] \in \mathbb{R}^{|W|}$$, where $$w\_j$$ is the $$j$$'th word in $$W$$ and each dimension in $$v\_i$$ is the number of times $$w\_j$$ occurs in $$D\_i$$, such that:

```python
#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
```
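As a rough sketch of the definition above, each dense vector can be computed by counting, for every word type in `W`, how often it appears in the document. The helper name `dense_vector` is hypothetical and used only for illustration:

```python
W = ['.', 'John', 'Mary', 'The', 'a', 'book', 'bought',
     'funny', 'gave', 'it', 'liked', 'the', 'to', 'was']

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

def dense_vector(word_types: list[str], document: list[str]) -> list[int]:
    # One dimension per word type, holding that word's count in the document.
    return [document.count(w) for w in word_types]

v1 = dense_vector(W, D1)
v2 = dense_vector(W, D2)

print(v1)  # [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
print(v2)  # [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
```

Calling `list.count` once per word type rescans the document each time, which is fine for a small example but slow for large vocabularies.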

Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $$v\_i$$ being 0 in practice, as most words in the vocabulary $$W$$ do not occur in document $$D\_i$$. Therefore, it is more efficient to represent $$v\_i$$ as a sparse vector:

```python
v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}
```
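The two representations carry the same information. A minimal sketch of converting between them is shown below; the helper names `to_sparse` and `to_dense` are assumptions made for this example:

```python
def to_sparse(dense: list[int]) -> dict[int, int]:
    # Keep only the dimensions whose count is non-zero.
    return {i: count for i, count in enumerate(dense) if count != 0}

def to_dense(sparse: dict[int, int], size: int) -> list[int]:
    # Dimensions missing from the sparse vector are implicitly 0.
    return [sparse.get(i, 0) for i in range(size)]

dense_v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
sparse_v1 = to_sparse(dense_v1)

print(sparse_v1)  # {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
print(to_dense(sparse_v1, len(dense_v1)) == dense_v1)  # True
```

The sparse form stores 8 entries instead of 14 here; the saving grows with the vocabulary, since a document typically uses only a small fraction of all word types.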

{% hint style="warning" %}
**Q1**: One limitation of the bag-of-words model is its inability to handle **unknown words**. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?
{% endhint %}

## Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:

{% code lineNumbers="true" %}

```python
from typing import TypeAlias

Document: TypeAlias = list[str]
Vocab: TypeAlias = dict[str, int]

def vocabulary(documents: list[Document]) -> Vocab:
    vocab = set()

    for document in documents:
        vocab.update(document)

    return {word: i for i, word in enumerate(sorted(vocab))}
```

{% endcode %}

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

{% code lineNumbers="true" %}

```python
from collections import Counter

SparseVector: TypeAlias = dict[int, int | float]

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    # Count every token, then keep only the words that have an ID in the vocabulary.
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
```

{% endcode %}

Finally, let us test our bag-of-words model with the examples above:

{% tabs %}
{% tab title="Code" %}
{% code lineNumbers="true" %}

```python
documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))
```

{% endcode %}
{% endtab %}

{% tab title="Output" %}

```python
{
    '.': 0,
    'John': 1,
    'Mary': 2,
    'The': 3,
    'a': 4,
    'book': 5,
    'bought': 6,
    'funny': 7,
    'gave': 8,
    'it': 9,
    'liked': 10,
    'the': 11,
    'to': 12,
    'was': 13
}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}
```

{% endtab %}
{% endtabs %}
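Because `bag_of_words` keeps a word only `if word in vocab`, any word that did not appear when the vocabulary was built is silently dropped. The sketch below illustrates this with a made-up document containing the unseen word `'sold'`; the two functions are repeated in simplified form so that the snippet runs on its own:

```python
from collections import Counter

def vocabulary(documents: list[list[str]]) -> dict[str, int]:
    vocab = set()
    for document in documents:
        vocab.update(document)
    return {word: i for i, word in enumerate(sorted(vocab))}

def bag_of_words(vocab: dict[str, int], document: list[str]) -> dict[int, int]:
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]
vocab = vocabulary(documents)

# Hypothetical new document: 'sold' never occurred in the corpus above.
unseen = ['Mary', 'sold', 'the', 'book', '.']
vector = bag_of_words(vocab, unseen)

print(vector)                             # {0: 1, 2: 1, 5: 1, 11: 1}
print(len(unseen), sum(vector.values()))  # 5 4
```

The vector accounts for only 4 of the 5 tokens, so the information carried by `'sold'` is lost entirely.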

{% hint style="warning" %}
**Q2**: Another limitation of the bag-of-words model is its inability to capture **word order**. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
{% endhint %}

## References

1. Source: [bag\_of\_words\_model.py](https://github.com/emory-courses/nlp-essentials/blob/main/src/bag_of_words_model.py)
2. [Bag-of-Words Model](https://en.wikipedia.org/wiki/Bag-of-words_model), Wikipedia
3. [Bags of words](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#bags-of-words), Working With Text Data, scikit-learn Tutorials


