Term Weighting

Term Frequency

The number of occurrences of a word $w$ in a document $D$ is called the Term Frequency (TF) of $w \in D$. TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.

However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:

from typing import Callable
from src.bag_of_words_model import Document

def read_corpus(filename: str, tokenizer: Callable[[str], list[str]] | None = None) -> list[Document]:
    # Each line in the file is treated as one document
    if tokenizer is None:
        tokenizer = lambda s: s.split()
    with open(filename) as fin:
        return [tokenizer(line) for line in fin]

We then retrieve the documents in chronicles_of_narnia.txt and create a vocabulary dictionary:

from src.bag_of_words_model import vocabulary

corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)

Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document using the bag of words model:

from src.bag_of_words_model import bag_of_words, Document, Vocab

def print_tfs(vocab: Vocab, documents: list[Document]):
    tfs = [bag_of_words(vocab, document) for document in documents]
    words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
    for tf in tfs:
        print([(words[index], count) for index, count in sorted(tf.items(), key=lambda x: x[1], reverse=True)])
  • The underscore (_) indicates that the index of each (word, index) pair from vocab.items() is not used in the comprehension.

At last, let us print the TFs of all terms in the following three documents:
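
The original three example documents are not reproduced here; as a stand-in, the sketch below prints the TFs of the first three documents in the corpus:

documents = corpus[:3]
print_tfs(vocab, documents)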

In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.

Stopwords

One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.

Let us retrieve a set of commonly used stop words from stopwords.txt and define a function to determine if a term should be considered a stop word:
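
A minimal sketch is shown below; the file path and the helper names read_stopwords and is_stopword are assumptions, and stopwords.txt is assumed to list one stop word per line:

def read_stopwords(filename: str) -> set[str]:
    with open(filename) as fin:
        return {line.strip().lower() for line in fin}

stopwords = read_stopwords('dat/stopwords.txt')

def is_stopword(word: str) -> bool:
    # Case-insensitive membership test against the stop word set
    return word.lower() in stopwords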

Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
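
One way to sketch this step (the names tokenizer_stopwords, corpus_sw, and vocab_sw are assumptions; separate names keep the original corpus and vocab available for the document frequency comparison later, and, as a further assumption, punctuation-only tokens are dropped as well):

import string

def tokenizer_stopwords(s: str) -> list[str]:
    # Split on whitespace and discard stop words and punctuation-only tokens
    return [t for t in s.split() if not is_stopword(t) and t not in string.punctuation]

corpus_sw = read_corpus('dat/chronicles_of_narnia.txt', tokenizer_stopwords)
vocab_sw = vocabulary(corpus_sw)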

Finally, let us print the TFs of the same documents using the updated vocabulary:
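
Using the same stand-in documents with the stop-word-filtered vocabulary:

print_tfs(vocab_sw, corpus_sw[:3])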

Document Frequency

Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them as equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.

Document Frequency (DF) quantifies how often a term appears across the documents in a corpus: it is the number of documents in the corpus that contain a particular term.

Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
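
A minimal sketch (the function name document_frequencies is an assumption); the original, unfiltered vocabulary and corpus are used here so that frequent functional terms remain visible in the comparison below:

def document_frequencies(vocab: Vocab, corpus: list[Document]) -> dict[int, int]:
    dfs: dict[int, int] = {}
    for document in corpus:
        # Count each term at most once per document
        for tid in {vocab[word] for word in document if word in vocab}:
            dfs[tid] = dfs.get(tid, 0) + 1
    return dfs

dfs = document_frequencies(vocab, corpus)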

We then compare the term and document frequencies of all terms in the above documents:
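
A sketch of the comparison (the function name print_tf_df is an assumption), again using the stand-in documents:

def print_tf_df(vocab: Vocab, dfs: dict[int, int], documents: list[Document]):
    words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
    for document in documents:
        tf = bag_of_words(vocab, document)
        ts = [(words[tid], count, dfs[tid]) for tid, count in tf.items()]
        ts.sort(key=lambda t: (-t[1], t[2]))
        for t in ts:
            print('{:>12}  TF = {:>3}  DF = {:>6}'.format(*t))

print_tf_df(vocab, dfs, documents)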

  • The list of tuples is sorted by the second item (TF) in descending order, then by the third item (DF) in ascending order.

  • Each tuple t is unpacked into three arguments and passed to the format() method.

Notice that functional terms with high TFs, such as "the" or "of," as well as punctuation, also have high DFs. Thus, scores that better reflect semantic importance can be estimated by appropriately weighting these two types of frequencies.

TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).

Given a term $t$ in a document $d \in D$, where $D$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:

$$
\begin{align*}
\mathbf{TF\text{-}IDF}(t, d, D) &= \mathbf{TF}(t, d) \cdot \mathbf{IDF}(t, D) \\
\mathbf{TF}(t, d) &= \frac{\#(t, d)}{|d|} \\
\mathbf{IDF}(t, D) &= \log\frac{|D|}{\mathbf{DF}(t, D)}
\end{align*}
$$

In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.

Let us define a function that takes a vocabulary dictionary, a DF dictionary, the size of all documents, and a document, and returns the TF-IDF scores of all terms in the document:
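
A minimal sketch (the function name tf_idf is an assumption); it returns a dictionary mapping each term ID in the document to its TF-IDF score, where D is the total number of documents:

import math

def tf_idf(vocab: Vocab, dfs: dict[int, int], D: int, document: Document) -> dict[int, float]:
    counts = bag_of_words(vocab, document)
    dlen = len(document)
    return {tid: (count / dlen) * math.log(D / dfs[tid]) for tid, count in counts.items()}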

We then compute the TF-IDF scores of terms in the above documents:
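
Again using the stand-in documents, a sketch that prints the scores from highest to lowest:

words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
for document in documents:
    scores = tf_idf(vocab, dfs, len(corpus), document)
    print([(words[tid], round(score, 4)) for tid, score in sorted(scores.items(), key=lambda x: x[1], reverse=True)])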

Several variants of TF-IDF have been proposed to enhance the representation in certain contexts (a sketch of these variants follows the list):

  • Sublinear scaling on TF: $1 + \log\mathbf{TF}(t,d)$ if $\mathbf{TF}(t,d) > 0$, otherwise $0$

  • Normalized TF: $\alpha + (1-\alpha)\frac{\mathbf{TF}(t,d)}{\max(\mathbf{TF}(*,d))}$

  • Normalized IDF: $\frac{\max(\mathbf{IDF}(*,D))}{\mathbf{IDF}(t,D)}$

  • Probabilistic IDF: $\frac{|D| - \mathbf{DF}(t,D)}{\mathbf{DF}(t,D)}$
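
A sketch of these variants as plain functions (the function names and the default value of alpha are assumptions); each takes raw TF, DF, or IDF values rather than the dictionaries used above:

import math

def sublinear_tf(tf: float) -> float:
    # 1 + log(TF) for nonzero counts, 0 otherwise
    return (1 + math.log(tf)) if tf > 0 else 0.0

def normalized_tf(tf: float, max_tf: float, alpha: float = 0.5) -> float:
    # alpha + (1 - alpha) * TF / max TF in the document
    return alpha + (1 - alpha) * tf / max_tf

def normalized_idf(idf: float, max_idf: float) -> float:
    # max IDF in the corpus divided by the term's IDF
    return max_idf / idf

def probabilistic_idf(df: int, D: int) -> float:
    # (|D| - DF) / DF, where D is the total number of documents
    return (D - df) / df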
