The frequency of a word w's occurrences in a document D is called the Term Frequency (TF) of w∈D. TF is often used to gauge the importance of a term within a document: terms that appear more frequently are considered more relevant to the document's content.
However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, where each line in the corpus file is treated as a separate document:
from typing import Callable
from src.types import Document

def read_corpus(filename: str, tokenizer: Callable[[str], list[str]] | None = None) -> list[Document]:
    fin = open(filename)
    if tokenizer is None:
        tokenizer = lambda s: s.split()
    return [tokenizer(line) for line in fin]
from src.bag_of_words_model import vocabulary

corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)
Let us define a function that takes a vocabulary dictionary and a list of documents, and prints the TFs of all terms in each document using the bag-of-words model:
from src.bag_of_words_model import bag_of_words
from src.types import Vocab, Document

def print_tfs(vocab: Vocab, documents: list[Document]):
    tfs = [bag_of_words(vocab, document) for document in documents]
    words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
    for tf in tfs:
        print([(words[index], count) for index, count in sorted(tf.items(), key=lambda x: x[1], reverse=True)])
L6: The underscore (_) is used to indicate that the variable is not being used in the loop.
Lastly, let us print the TFs of all terms in the following three documents:
from elit_tokenizer import EnglishTokenizer

ds = [
    "As dawn broke, the first light kissed the golden mane of Aslan, the rightful king of Narnia.",
    "The White Witch's icy breath froze the once lush meadows, casting a shadow over Narnia.",
    "Lucy's footsteps echoed in the halls of Cair Paravel, where legends were born."
]
etok = EnglishTokenizer()
documents = [etok.decode(d).tokens for d in ds]
print_tfs(vocab, documents)
In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.
If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?
Stopwords
One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.
Let us retrieve a set of commonly used stop words from stopwords.txt and define a function to determine if a term should be considered a stop word:
from string import punctuation

stopwords = {line.strip().lower() for line in open('dat/stopwords.txt')}
is_stopwords = lambda w: w.lower() in stopwords or w in punctuation
Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
sw_tokenizer = lambda s: [word for word in s.split() if not is_stopwords(word)]
corpus = read_corpus('dat/chronicles_of_narnia.txt', sw_tokenizer)
vocab = vocabulary(corpus)
Finally, let us print the TFs of the same documents using the updated vocabulary:
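A minimal sketch of this step, reusing the documents tokenized above:

print_tfs(vocab, documents)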
Note that stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Do both approaches produce the same results? Which approach is preferable?
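For comparison, the second approach might look like the following sketch, where bag_of_words_no_stopwords() is a hypothetical helper that keeps the full vocabulary but filters each document before counting:

from src.types import SparseVector

# Hypothetical helper (not part of the original code): drop stop words from the
# document itself before generating its bag-of-words representation.
def bag_of_words_no_stopwords(vocab: Vocab, document: Document) -> SparseVector:
    return bag_of_words(vocab, [word for word in document if not is_stopwords(word)])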
Document Frequency
Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them as equally important. A more sophisticated weighting approach incorporates information about terms across multiple documents.
Document Frequency (DF) quantifies how widely a term is used across a corpus: it is the number of documents in the corpus that contain the term.
Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
from collections import Counter
from src.types import SparseVector

def document_frequencies(vocab: Vocab, corpus: list[Document]) -> SparseVector:
    counts = Counter()
    for document in corpus:
        counts.update(set(document))
    return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
We then compare the term and document frequencies of all terms in the above documents:
corpus = read_corpus('dat/chronicles_of_narnia.txt')
vocab = vocabulary(corpus)
words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
dfs = document_frequencies(vocab, corpus)

for document in documents:
    bow = bag_of_words(vocab, document)
    tf_df = [(words[tid], tf, dfs[tid]) for tid, tf in sorted(bow.items())]
    tf_df = sorted(tf_df, key=lambda x: (-x[1], x[2]))
    print(' '.join(document))
    print('\n'.join(['{:>10} {} {:>5}'.format(*t) for t in tf_df]))
L9: Sort the list of tuples first by the second item in descending order, then by the third item in ascending order.
L11: The tuple t is unpacked into three arguments and passed to the format() function.
As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
the 3 9574
of 2 5355
, 2 10578
rightful 1 1
dawn 1 6
kissed 1 26
broke 1 35
king 1 40
mane 1 40
golden 1 52
As 1 161
light 1 203
first 1 401
Narnia 1 512
Aslan 1 706
. 1 19747
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
casting 1 1
froze 1 2
icy 1 3
shadow 1 38
White 1 44
breath 1 86
Witch 1 246
once 1 378
over 1 431
Narnia 1 512
The 1 1352
's 1 2404
a 1 5456
the 1 9574
, 1 10578
. 1 19747
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
footsteps 1 1
legends 1 1
echoed 1 2
halls 1 4
born 1 14
Paravel 1 84
Cair 1 86
where 1 360
Lucy 1 704
were 1 1684
's 1 2404
in 1 3513
of 1 5355
the 1 9574
, 1 10578
. 1 19747
Notice that functional terms with high TFs, such as "the" or "of," as well as punctuation, also have high DFs. Thus, more semantically meaningful term weights can be estimated by appropriately balancing these two types of frequencies.
What are the implications when a term has a high document frequency?
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).
In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.
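In symbols, one common formulation (assumed here; implementations differ in their exact normalization) is:

\mathrm{TFIDF}(t, d, C) = \frac{\mathrm{count}(t, d)}{|d|} \cdot \log \frac{|C|}{\mathrm{DF}(t, C)}

where count(t, d) is the number of occurrences of term t in document d, |d| is the number of tokens in d, |C| is the number of documents in the corpus C, and DF(t, C) is the document frequency of t.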
Let us define a function that takes a vocabulary dictionary, a DF dictionary, the total number of documents in the corpus, and a document, and returns the TF-IDF scores of all terms in the document:
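A minimal sketch of such a function, assuming the normalized-count TF and logarithmic IDF described above (the exact normalization used to produce the scores below may differ):

import math

def tf_idf(vocab: Vocab, dfs: SparseVector, num_docs: int, document: Document) -> SparseVector:
    # TF: raw counts from the bag-of-words model, normalized by the document length.
    bow = bag_of_words(vocab, document)
    total = len(document)
    # IDF: logarithm of the corpus size divided by the term's document frequency.
    return {tid: (count / total) * math.log(num_docs / dfs[tid]) for tid, count in bow.items()}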
We then compute the TF-IDF scores of terms in the above documents:
for document in documents:
    tfidf = tf_idf(vocab, dfs, len(corpus), document)
    print(' '.join(document))
    print('\n'.join(['{:>10} {:.2f}'.format(words[tid], score) for tid, score in sorted(tfidf.items(), key=lambda x: x[1], reverse=True)]))
As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
rightful 0.50
dawn 0.41
kissed 0.34
broke 0.32
king 0.32
mane 0.32
golden 0.30
As 0.25
light 0.24
first 0.20
Narnia 0.19
Aslan 0.17
of 0.14
the 0.13
, 0.08
. 0.01
The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
casting 0.56
froze 0.52
icy 0.50
shadow 0.35
White 0.35
breath 0.31
Witch 0.25
once 0.23
over 0.22
Narnia 0.21
The 0.16
's 0.12
a 0.08
the 0.05
, 0.04
. 0.01
Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
footsteps 0.63
legends 0.63
echoed 0.58
halls 0.54
born 0.46
Paravel 0.35
Cair 0.35
where 0.26
Lucy 0.22
were 0.16
's 0.14
in 0.12
of 0.09
the 0.05
, 0.05
. 0.01
Should we still apply stop words when using TF-IDF scores to represent the documents?
Various variants of TF-IDF have been proposed to enhance the representation in certain contexts: