
Term Weighting

Term Frequency

The number of times a word $w$ occurs in a document $D$ is called the Term Frequency (TF) of $w \in D$. TF is often used to determine the importance of a term within a document, on the assumption that terms appearing more frequently are more relevant to the document's content.

However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:

L5: Define a tokenizer function using a lambda expression.

We then retrieve the documents from the corpus and create a vocabulary dictionary:
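A minimal sketch of this step, assuming each document is already a list of tokens. The helper name build_vocabulary and the two inline documents are hypothetical stand-ins, since the corpus file itself is not shown here:

```python
def build_vocabulary(documents: list[list[str]]) -> dict[str, int]:
    """Map each distinct term to an integer ID (hypothetical helper)."""
    # Sorting makes the ID assignment deterministic across runs.
    terms = sorted({term for document in documents for term in document})
    return {term: index for index, term in enumerate(terms)}

# Illustrative documents only; the real ones would come from the corpus file.
documents = [
    "the lion and the witch".split(),
    "the witch and the wardrobe".split(),
]
vocab = build_vocabulary(documents)
print(vocab)
```

Each term receives exactly one ID no matter how often it occurs, so the vocabulary records which terms exist, not how frequent they are.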

Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document:

L5: The underscore (_) is used to indicate that the variable is not being used in the loop.
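One way such a function might look is sketched below. It is not the original listing: the names term_frequencies and print_tfs, the use of str.split as the tokenizer, and the sample sentence are all assumptions, and the vocabulary IDs are assumed to run contiguously from 0:

```python
from collections import Counter
from typing import Callable

def term_frequencies(vocab: dict[str, int], tokenizer: Callable[[str], list[str]], document: str) -> dict[int, int]:
    """Count the occurrences of each in-vocabulary term, keyed by term ID."""
    counts = Counter(tokenizer(document))
    return {vocab[term]: count for term, count in counts.items() if term in vocab}

def print_tfs(vocab: dict[str, int], tokenizer: Callable[[str], list[str]], documents: list[str]) -> None:
    # Terms ordered by ID; the underscore marks the unused ID variable.
    words = [term for term, _ in sorted(vocab.items(), key=lambda x: x[1])]
    for document in documents:
        tf = term_frequencies(vocab, tokenizer, document)
        # Highest counts first; ties are broken by term ID.
        ranked = sorted(tf.items(), key=lambda x: (-x[1], x[0]))
        print([(words[index], count) for index, count in ranked])

vocab = {"the": 0, "lion": 1, "roars": 2}
sentence = "the lion roars at the other lion"
tfs = term_frequencies(vocab, str.split, sentence)
print_tfs(vocab, str.split, [sentence])
```

Terms outside the vocabulary ("at" and "other" here) are silently skipped, which is one possible design choice among several.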

Finally, let us print the TFs of all terms in the following three documents:

In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.

Q3: If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?

Stopwords

One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. Removing stop words shifts the focus to the more meaningful content words, which are often more informative for downstream tasks.

Let us retrieve a set of commonly used stop words and define a function to determine if a term should be considered a stop word:
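As a rough illustration, the check could be sketched as follows. The stop word set is a tiny hand-picked sample (an assumption, not the list used in the source), and treating punctuation-only tokens as stop words is likewise a choice made for this sketch:

```python
# A tiny, hand-picked stop word set for illustration only (an assumption);
# real lists typically contain a few hundred entries.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def is_stopword(term: str) -> bool:
    """Return True for known stop words and for tokens with no alphanumeric character."""
    return term.lower() in STOPWORDS or not any(ch.isalnum() for ch in term)

kept = [t for t in ["The", "lion", ",", "of", "Narnia", "."] if not is_stopword(t)]
print(kept)
```

Lowercasing before the lookup means "The" and "the" are treated alike.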


Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
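A sketch of such a tokenizer, under stated assumptions: the regular expression, the function name tokenize_without_stopwords, the small stop word subset, and the sample sentence are all illustrative, not taken from the source:

```python
import re

STOPWORDS = {"a", "and", "of", "the"}  # illustrative subset (assumption)

def tokenize_without_stopwords(text: str) -> list[str]:
    """Split text into word and punctuation tokens, then drop stop words and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t for t in tokens if t.lower() not in STOPWORDS and t[0].isalnum()]

tokens = tokenize_without_stopwords("The lion, the witch and the wardrobe.")
# Build the vocabulary from the filtered tokens only.
vocab = {term: index for index, term in enumerate(sorted(set(tokens)))}
print(tokens)
print(vocab)
```

Because filtering happens inside the tokenizer, stop words never enter the vocabulary in the first place.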

Finally, let us print the TFs of the same documents using the updated vocabulary:

Q4: Stop words can be filtered either when creating the vocabulary dictionary or when generating the bag-of-words representations. Which approach is preferable and why?

Document Frequency

Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, so they are treated as equally important. A more sophisticated weighting approach incorporates information about terms across multiple documents.

Document Frequency (DF) quantifies how widely a term is used across a corpus: it is the number of documents in the corpus that contain the term.

Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
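A minimal sketch of such a function, assuming the corpus is a list of tokenized documents. The name document_frequencies and the toy corpus are hypothetical:

```python
from collections import Counter

def document_frequencies(vocab: dict[str, int], corpus: list[list[str]]) -> dict[int, int]:
    """Return {term ID: number of documents that contain the term}."""
    dfs = Counter()
    for document in corpus:
        # set() ensures each term is counted at most once per document.
        dfs.update(vocab[term] for term in set(document) if term in vocab)
    return dict(dfs)

vocab = {"lion": 0, "the": 1, "witch": 2}
corpus = [["the", "lion", "the"], ["the", "witch"], ["lion"]]
dfs = document_frequencies(vocab, corpus)
print(dfs)
```

Note that "the" occurs twice in the first document but still contributes only 1 to its DF.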

We then compare the term and document frequencies of all terms in the above documents:

L9: Sort the list of tuples by the second item in descending order, then by the third item in ascending order.
L11: The tuple t is unpacked into three arguments and passed to the function.
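The sorting and unpacking described in these notes can be sketched as follows. This is not the original listing, so its line numbers will not match the notes; the toy corpus and variable names are assumptions:

```python
from collections import Counter

corpus = [["the", "lion", "the"], ["the", "witch"], ["lion"]]
# Document frequency: number of documents that contain each term.
dfs = Counter(term for document in corpus for term in set(document))

document = corpus[0]
tfs = Counter(document)
# Tuples of (term, TF, DF) for every term in the chosen document.
rows = [(term, tf, dfs[term]) for term, tf in tfs.items()]
# Sort by the second item (TF) descending, then by the third item (DF) ascending.
rows.sort(key=lambda t: (-t[1], t[2]))
for t in rows:
    # *t unpacks the tuple into three positional arguments of format().
    print("{:>6} {:>3} {:>3}".format(*t))
```

Negating the TF in the sort key is a compact way to mix a descending and an ascending criterion in a single pass.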

Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, term scores that better reflect semantic importance can be estimated by appropriately weighting these two types of frequencies against each other.

Q5: What are the implications when a term has a high document frequency?

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).

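Given a term $w$ in a document $D \in C$, where $C$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:

$$
\begin{align*}
\mathrm{TF}(w, D) &= \frac{\#(w, D)}{|D|} \\
\mathrm{IDF}(w) &= \log \frac{|C|}{\mathrm{DF}(w)} \\
\mathrm{TFIDF}(w, D) &= \mathrm{TF}(w, D) \cdot \mathrm{IDF}(w)
\end{align*}
$$

Here $\#(w, D)$ is the number of occurrences of $w$ in $D$, $|D|$ is the number of terms in $D$, and $|C|$ is the number of documents in the corpus.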

In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.

Let us define a function that takes a vocabulary dictionary, a DF dictionary, the total number of documents, and a document, and returns the TF-IDF scores of all terms in the document:
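A sketch of how this might be written, assuming TF is the raw count divided by the document length and IDF is the natural logarithm of the number of documents over the DF. The function name tf_idf, the parameter names, and the toy values are hypothetical:

```python
import math
from collections import Counter

def tf_idf(vocab: dict[str, int], dfs: dict[int, int], num_docs: int, document: list[str]) -> dict[int, float]:
    """TF-IDF per term ID: (count / document length) * log(num_docs / DF)."""
    counts = Counter(vocab[term] for term in document if term in vocab)
    return {
        term_id: (count / len(document)) * math.log(num_docs / dfs[term_id])
        for term_id, count in counts.items()
    }

vocab = {"lion": 0, "the": 1, "witch": 2}
dfs = {0: 2, 1: 3, 2: 1}  # "the" appears in all three documents
scores = tf_idf(vocab, dfs, 3, ["the", "witch", "the"])
print(scores)
```

In this toy setup "the" scores exactly 0 despite being the most frequent term, because a term found in every document has an IDF of log(1) = 0.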

We then compute the TF-IDF scores of terms in the above documents:

Several variants of TF-IDF have been proposed to enhance the representation in certain contexts:

Sublinear scaling on TF:
Normalized TF:
Normalized IDF:
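The formulas for these variants are not reproduced above. As a hedged illustration, the sketch below uses formulations commonly found in information retrieval textbooks for the first two variants: sublinear scaling as 1 + log(count), and maximum-TF normalization with a smoothing parameter alpha. Whether these match the definitions intended here is an assumption, and normalized IDF is left out because its intended form is unclear:

```python
import math

def sublinear_tf(count: int) -> float:
    """Sublinear scaling: 1 + log(count) for count > 0, otherwise 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def max_normalized_tf(count: int, max_count: int, alpha: float = 0.5) -> float:
    """Maximum-TF normalization: alpha + (1 - alpha) * count / max_count."""
    return alpha + (1.0 - alpha) * count / max_count

# A term seen 100 times is weighted far less than 100x a term seen once.
print(sublinear_tf(1), sublinear_tf(100))
print(max_normalized_tf(2, 4))
```

Both variants dampen the effect of very high raw counts, so a single heavily repeated term does not dominate a document's representation.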

Q6: Should we still apply stop words when using TF-IDF scores to represent the documents?


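Listing: the read_corpus function described in the Term Frequency section.

from typing import Callable

def read_corpus(filename: str, tokenizer: Callable[[str], list[str]] | None = None) -> list[list[str]]:
    with open(filename) as fin: lines = fin.readlines()  # read all lines, then close the file
    if tokenizer is None: tokenizer = lambda s: s.split()  # default: split on whitespace
    return [tokenizer(line) for line in lines]  # one token list per line (document)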
