1 of 6

Vector Space Models

A vector space model is a computational framework to represent text documents as vectors in a high-dimensional space such that each document is represented as a vector, and each dimension of the vector corresponds to a particular term in the vocabulary.

Bag-of-Words Model
Term Weighting
Similarity Metrics
Document Classification
Homework

Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.

Consider a corpus containing the following two tokenized documents:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:

W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]

#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]

One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?

v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}

How does the bag-of-words model handle unknown words that are not in the vocabulary?

Implementation

Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:

from src.types import Document, Vocab

def vocabulary(documents: list[Document]) -> Vocab:
    vocab = set()

    for document in documents:
        vocab.update(document)

    return {word: i for i, word in enumerate(sorted(list(vocab)))}

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

from collections import Counter
from src.types import SparseVector

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

Finally, let us our bag-of-words model with the examples above:

documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))

{
    '.': 0,
    'John': 1,
    'Mary': 2,
    'The': 3,
    'a': 4,
    'book': 5,
    'bought': 6,
    'funny': 7,
    'gave': 8,
    'it': 9,
    'liked': 10,
    'the': 11,
    'to': 12,
    'was': 13
}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}

References

Source: bag_of_words_model.py

Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials

Term Weighting

Term Frequency

The frequency of a word 's occurrences in a document is called the Term Frequency (TF) of . TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.

However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:

L1: , , .
L3: Define a tokenizer function using a expression.

We then retrieve documents in and create a vocaburay dictionary:

Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document using the :

L6: The underscore (_) is used to indicate that the variable is not being used in the loop.

At last, let us print the TFs of all terms in the following three documents:

In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.

If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?

Stopwords

One simple approach to addressing this issue is to discard common terms with little sematnic values, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.

Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:

Finally, let us print the TFs of the same documents using the updated vocabulary:

Note that stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Do both approaches produce the same results? Which approach is preferable?

Document Frequency

Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.

Document Frequency (DF) is a measure to quantify how often a term appears across a set of documents within a corpus such that it represents the number of documents within the corpus that contain a particular term.

Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:

We then compare the term and document frequencies of all terms in the above documents:

L9: Sort the list of tuples by the second item first in descending order then by the third item in ascending order.

Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.

What are the implications when a term has a high document frequency?

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).

In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.

Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:

We then compute the TF-IDF scores of terms in the above documents:

Should we still apply stop words when using TF-IDF scores to represent the documents?

Various of TF-IDF have been proposed to enhance the representation in certain contexts:

References

Document Similarity

Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the corpus:

Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the and .

Euclidean Similarity

Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.

L6: ** k represents the power of k.

We then measure the Euclidean distance between the two vectors above:

Cosine Similarity

Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).

The cosine similarity between two vectors can be measured as follow:

Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

We then measure the Euclidean distance between the two vectors above:

Why do these metrics determine that D1 is more similar to D2 than to D3?

The following diagram illustrates the difference between the two metrics. The Euclidean distance measures the magnitude between two vectors, while the Cosine similarity measures their angle to the origin.

Document Classification

Document classification, also known as text classification, is a task that involves assigning predefined categories or labels to documents based on their content, used to automatically organize, categorize, or label large collections of textual documents.

Supervised Learning

Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.

Data Split

Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.

It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice involves splitting the dataset such as 80/10/10 or 75/10/15 for training, development, and evaluation sets, respectively.

The directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.

Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:

We then print the number of documents in each set:

Vectorization

To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:

Why do we use only the training set to collect the vocabulary?

Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:

We then vectorize all documents in each set:

Classification

Finally, we test our classification model on the development set:

What are the potential weaknesses or limitations of this classification model?

References

Homework

Your task is to develop a sentiment analyzer train on the Stanford Sentiment Treebank:

Create a vector_space_models.py file in the src/homework/ directory.
Define a function named sentiment_analyzer() that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.
Use the $k$ -nearest neighbors algorithm for the classification. Find the optimal value of $k$ using the development set, and then hardcode this value into your function before submission.

Data

The sentiment_treebank directory contains the following two files:

sst_trn.tst: a training set consisting of 8,544 labeled documents.
sst_dev.tst: a development set consisting of 1,101 labeled documents.

Each line is a document, which is formatted as follows:

[Label]\t[Document]

Below are the explanations of what each label signifies:

0: Very negative
1: Negative
2: Neutral
3: Positive
4: Very positive

Submission

Commit and push the vector_space_models.py file to your GitHub repository.

Extra Credit

Define a function named sentiment_analyzer_extra() that gives an improved sentiment analyzer.

Bag-of-Words Model

Overview

In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.

Consider a corpus containing the following two tokenized documents:

D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']

The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:

W = [
    '.',        # 0
    'John',     # 1
    'Mary',     # 2
    'The',      # 3
    'a',        # 4
    'book',     # 5
    'bought',   # 6
    'funny',    # 7
    'gave',     # 8
    'it',       # 9
    'liked',    # 10
    'the',      # 11
    'to',       # 12
    'was'       # 13
]

Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$ 'th word in $D_i$ . A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$ , where $w_j$ is the $j$ 'th word in $W$ and each dimension in $v_i$ is the frequency of $w_j$ 's occurrences in $D_i$ such that:

#     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]

One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?

Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v_i$ being 0 in practice, as most words in the vocabulary $W$ do not occur in document $D_i$ . Therefore, it is more efficient to represent $v_i$ as a sparse vector:

v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}

How does the bag-of-words model handle unknown words that are not in the vocabulary?

Implementation

from src.types import Document, Vocab

def vocabulary(documents: list[Document]) -> Vocab:
    vocab = set()

    for document in documents:
        vocab.update(document)

    return {word: i for i, word in enumerate(sorted(list(vocab)))}

We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

from collections import Counter
from src.types import SparseVector

def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
    counts = Counter(document)
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

Finally, let us our bag-of-words model with the examples above:

documents = [
    ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
    ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
]

vocab = vocabulary(documents)
print(vocab)

print(bag_of_words(vocab, documents[0]))
print(bag_of_words(vocab, documents[1]))

{
    '.': 0,
    'John': 1,
    'Mary': 2,
    'The': 3,
    'a': 4,
    'book': 5,
    'bought': 6,
    'funny': 7,
    'gave': 8,
    'it': 9,
    'liked': 10,
    'the': 11,
    'to': 12,
    'was': 13
}
{0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
{0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}

References

Source: bag_of_words_model.py

Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials

Document Similarity

Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the corpus:

Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the and .

Euclidean Similarity

Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.

Let and be two vectors representing documents and . The Euclidean distance between the two vectors can be measured as follow:

\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^n (v_{ik} - v_{jk})^2}

Let us define a function that takes two vectors in our notation and returns the Euclidean distance between them:

import math
from src.types import SparseVector

def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
    d = sum((v - v2.get(k, 0)) ** 2 for k, v in v1.items())
d += sum(v ** 2 for k, v in v2.items() if k not in v1)
return math.sqrt(d)

L6: ** k represents the power of k.

We then measure the Euclidean distance between the two vectors above:

print(euclidean_distance(vs[0], vs[0]))
print(euclidean_distance(vs[0], vs[1]))
print(euclidean_distance(vs[0], vs[2]))

0.0
1.347450458032576
1.3756015678855296

The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between and is shorter than the distance between and , implying that is more similar to than , which contradicts our intuition.

Cosine Similarity

The cosine similarity between two vectors can be measured as follow:

\frac{V_i\cdot V_j}{\lVert V_i\rVert\lVert V_j\rVert} = \frac{\sum_{\forall k} (v_{ik} \cdot v_{jk})}{\sqrt{\sum_{\forall k} (v_{ik})^2} \cdot \sqrt{\sum_{\forall k} (v_{jk})^2}}

Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
    n = sum(v * v2.get(k, 0) for k, v in v1.items())
    d = math.sqrt(sum(v ** 2 for k, v in v1.items()))
    d *= math.sqrt(sum(v ** 2 for k, v in v2.items()))
    return n / d

We then measure the Euclidean distance between the two vectors above:

print(cosine_similarity(vs[0], vs[0]))
print(cosine_similarity(vs[0], vs[1]))
print(cosine_similarity(vs[0], vs[2]))

0.9999999999999999
0.5775130451716284
0.4826178600593854

The Cosine similarity between two identical vectors is 1, although it is calculated as 0.99 due to limitations in decimal points (L1). Similar to the Euclidean distance case, the similarity between and is greater than the similarity between and , which again contradicts our intuition.

Why do these metrics determine that D1 is more similar to D2 than to D3?

Document Classification

Supervised Learning

Data Split

L7: the module, the module.

We then print the number of documents in each set:

def join_documents(dataset: dict[int, list[Document]]) -> list[Document]:
    return [document for documents in dataset.values() for document in documents]

trn = collect('dat/document_classification/trn')
dev = collect('dat/document_classification/dev')
tst = collect('dat/document_classification/tst')
print(len(join_documents(trn)), len(join_documents(dev)), len(join_documents(tst)))

82 14 14

Vectorization

To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:

corpus = join_documents(trn)
vocab = vocabulary(join_documents(trn))
dfs = document_frequencies(vocab, corpus)
D = len(corpus)

Why do we use only the training set to collect the vocabulary?

def vectorize(vocab: Vocab, dfs: SparseVector, D: int, docset: dict[int, list[Document]]) -> list[tuple[int, SparseVector]]:
    vs = []

    for book_id, documents in docset.items():
        for document in documents:
            vs.append((book_id, tf_idf(vocab, dfs, D, document)))

    return vs

We then vectorize all documents in each set:

trn_vs = vectorize(vocab, dfs, D, trn)
dev_vs = vectorize(vocab, dfs, D, dev)
tst_vs = vectorize(vocab, dfs, D, tst)

Classification

Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and , and returns the predicted book ID of the document and its similarity score:

def knn(trn_vs: list[tuple[int, SparseVector]], v: SparseVector, k: int = 1) -> tuple[int, float]:
    sims = [(book_id, cosine_similarity(v, t)) for book_id, t in trn_vs]
    sims.sort(key=lambda x: x[1], reverse=True)
    return Counter(sims[:k]).most_common(1)[0][0]

L2: Measure the similarity between the input document and every document in the training set and save it with the book ID of .
L3-4: Return the most common book ID among the top- documents in the training set that are most similar to .

Finally, we test our classification model on the development set:

correct = 0
    
for g_book_id, document in dev_vs:
    p_book_id, p_score = knn(trn_vs, document)
    if g_book_id == p_book_id: correct += 1
    print('Gold: {}, Auto: {}, Score: {:.2f}'.format(g_book_id, p_book_id, p_score))

print('Accuracy: {} ({}/{})'.format(100 * correct / len(dev_vs), correct, len(dev_vs)))

Gold: 1, Auto: 1, Score: 0.49
Gold: 1, Auto: 1, Score: 0.27
Gold: 3, Auto: 3, Score: 0.36
Gold: 3, Auto: 3, Score: 0.32
Gold: 5, Auto: 5, Score: 0.29
Gold: 5, Auto: 5, Score: 0.54
Gold: 0, Auto: 0, Score: 0.32
Gold: 0, Auto: 0, Score: 0.26
Gold: 6, Auto: 6, Score: 0.48
Gold: 6, Auto: 6, Score: 0.49
Gold: 2, Auto: 2, Score: 0.37
Gold: 2, Auto: 2, Score: 0.31
Gold: 4, Auto: 4, Score: 0.56
Gold: 4, Auto: 4, Score: 0.60
Accuracy: 100.0 (14/14)

What are the potential weaknesses or limitations of this classification model?

References

Source:

, Wikipedia.

Term Weighting

Term Frequency

L1: , , .
L3: Define a tokenizer function using a expression.

We then retrieve documents in and create a vocaburay dictionary:

Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document using the :

L6: The underscore (_) is used to indicate that the variable is not being used in the loop.

At last, let us print the TFs of all terms in the following three documents:

from elit_tokenizer import EnglishTokenizer

ds = [
    "As dawn broke, the first light kissed the golden mane of Aslan, the rightful king of Narnia.",
    "The White Witch's icy breath froze the once lush meadows, casting a shadow over Narnia.",
    "Lucy's footsteps echoed in the halls of Cair Paravel, where legends were born."
]

etok = EnglishTokenizer()
documents = [etok.decode(d).tokens  for d in ds]
print_tfs(vocab, documents)

[('the', 3), (',', 2), ('of', 2), ('.', 1), ('As', 1), ('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('first', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
[("'s", 1), (',', 1), ('.', 1), ('Narnia', 1), ('The', 1), ('White', 1), ('Witch', 1), ('a', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('once', 1), ('over', 1), ('shadow', 1), ('the', 1)]
[("'s", 1), (',', 1), ('.', 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('in', 1), ('legends', 1), ('of', 1), ('the', 1), ('were', 1), ('where', 1)]

If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?

Stopwords

Let us retrieve a set of commonly used stop words from and define a function to determine if a term should be considered a stop word:

from string import punctuation

stopwords = {line.strip().lower() for line in open('dat/stopwords.txt')}
is_stopwords = lambda w: w.lower() in stopwords or w in punctuation

L1: .

Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:

sw_tokenizer = lambda s: [word for word in s.split() if not is_stopwords(word)]
corpus = read_corpus('dat/chronicles_of_narnia.txt', sw_tokenizer)
vocab = vocabulary(corpus)

Finally, let us print the TFs of the same documents using the updated vocabulary:

print_tfs(vocab, documents)

[('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
[("'s", 1), ('Narnia', 1), ('White', 1), ('Witch', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('shadow', 1)]
[("'s", 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('legends', 1)]

Document Frequency

Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:

from collections import Counter
from src.types import SparseVector

def document_frequencies(vocab: Vocab, corpus: list[Document]) -> SparseVector:
    counts = Counter()
    for document in corpus:
        counts.update(set(document))
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}

We then compare the term and document frequencies of all terms in the above documents:

corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)
words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]

dfs = document_frequencies(vocab, corpus)
for document in documents:
    bow = bag_of_words(vocab, document)
    tf_df = [(words[tid], tf, dfs[tid]) for tid, tf in sorted(bow.items())]
    tf_df = sorted(tf_df, key=lambda x: (-x[1], x[2]))
    print(' '.join(document))
    print('\n'.join(['{:>10}  {}  {:>5}'.format(*t) for t in tf_df]))

L9: Sort the list of tuples by the second item first in descending order then by the third item in ascending order.
L11: The tuple t is into three arguments and passed to the function.

As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
       the  3   9574
        of  2   5355
         ,  2  10578
  rightful  1      1
      dawn  1      6
    kissed  1     26
     broke  1     35
      king  1     40
      mane  1     40
    golden  1     52
        As  1    161
     light  1    203
     first  1    401
    Narnia  1    512
     Aslan  1    706
         .  1  19747
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
   casting  1      1
     froze  1      2
       icy  1      3
    shadow  1     38
     White  1     44
    breath  1     86
     Witch  1    246
      once  1    378
      over  1    431
    Narnia  1    512
       The  1   1352
        's  1   2404
         a  1   5456
       the  1   9574
         ,  1  10578
         .  1  19747
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
 footsteps  1      1
   legends  1      1
    echoed  1      2
     halls  1      4
      born  1     14
   Paravel  1     84
      Cair  1     86
     where  1    360
      Lucy  1    704
      were  1   1684
        's  1   2404
        in  1   3513
        of  1   5355
       the  1   9574
         ,  1  10578
         .  1  19747

What are the implications when a term has a high document frequency?

TF-IDF

Given a term in a document where is a set of all documents in a corpus, its TF-IDF score can be measured as follow:

\begin{align*}     \mathbf{TF-IDF}(t, d, D) &= \mathbf{TF}(t, d) \cdot \mathbf{IDF}(t, D) \\     \mathbf{TF}(t, d) &= \frac{\#(t, d)}{|d|} \\     \mathbf{IDF}(t, D) &= \log\frac{|D|}{\mathbf{DF}(t, D)} \\ \end{align*}

Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:

def tf_idf(vocab: Vocab, dfs: SparseVector, D: int, document: Document) -> SparseVector:
    tf = lambda count: count / len(document)
    idf = lambda tid: math.log(D / dfs[tid])
    return {tid: tf(count) * idf(tid) for tid, count in bag_of_words(vocab, document).items()}

We then compute the TF-IDF scores of terms in the above documents:

for document in documents:
    tfidf = tf_idf(vocab, dfs, len(corpus), document)
    print(' '.join(document))
    print('\n'.join(['{:>10}  {:.2f}'.format(words[tid], score) for tid, score in sorted(tfidf.items(), key=lambda x: x[1], reverse=True)]))

As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
  rightful  0.50
      dawn  0.41
    kissed  0.34
     broke  0.32
      king  0.32
      mane  0.32
    golden  0.30
        As  0.25
     light  0.24
     first  0.20
    Narnia  0.19
     Aslan  0.17
        of  0.14
       the  0.13
         ,  0.08
         .  0.01
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
   casting  0.56
     froze  0.52
       icy  0.50
    shadow  0.35
     White  0.35
    breath  0.31
     Witch  0.25
      once  0.23
      over  0.22
    Narnia  0.21
       The  0.16
        's  0.12
         a  0.08
       the  0.05
         ,  0.04
         .  0.01
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
 footsteps  0.63
   legends  0.63
    echoed  0.58
     halls  0.54
      born  0.46
   Paravel  0.35
      Cair  0.35
     where  0.26
      Lucy  0.22
      were  0.16
        's  0.14
        in  0.12
        of  0.09
       the  0.05
         ,  0.05
         .  0.01

Should we still apply stop words when using TF-IDF scores to represent the documents?

Various of TF-IDF have been proposed to enhance the representation in certain contexts:

Sublinear scaling on TF:
Normalized TF:
Normalized IDF:
Probabilistic IDF:

References

Source:

Vector Space Models

Contents

Bag-of-Words Model

Overview

Implementation

References

Term Weighting

Term Frequency

Stopwords

Document Frequency

TF-IDF

References

Document Similarity

Euclidean Similarity

Cosine Similarity

Document Classification

Supervised Learning

Data Split

Vectorization

Classification

References

Homework

Data

Submission

Extra Credit

Vector Space Models

Contents

Bag-of-Words Model

Overview

Implementation

References

Homework

Data

Submission

Extra Credit

Document Similarity

Euclidean Similarity

Cosine Similarity

Document Classification

Supervised Learning

Data Split

Vectorization

Classification

References

Term Weighting

Term Frequency

Stopwords

Document Frequency

TF-IDF

References