
Document Classification

Document classification, also known as text classification, is the task of assigning predefined categories or labels to documents based on their content. It is used to automatically organize, categorize, or label large collections of textual documents.

Supervised Learning

Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.

Data Split

Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.

It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice is to split the dataset in ratios such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.
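The sketch below (not part of the course code) shows one way to produce such a split: shuffle the data once, then slice it by the desired ratios.

import random

def split_dataset(data: list, trn_ratio: float = 0.8, dev_ratio: float = 0.1) -> tuple[list, list, list]:
    # Shuffle a copy so the original order is left untouched, then slice by ratio.
    data = data[:]
    random.shuffle(data)
    n_trn = int(len(data) * trn_ratio)
    n_dev = int(len(data) * dev_ratio)
    return data[:n_trn], data[n_trn:n_trn + n_dev], data[n_trn + n_dev:]

trn, dev, tst = split_dataset(list(range(100)))
print(len(trn), len(dev), len(tst))  # 80 10 10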

The document_classification directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from chronicles_of_narnia.txt, following the file-naming convention A_B, where A denotes the book ID and B indicates the chapter ID.

Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:

  • L7: retrieve all text files in the directory using the glob module and construct their paths with the os module.

We then print the number of documents in each set:


Q8: What potential problems might arise from the above data splitting approach, and what alternative method could mitigate these issues?

Vectorization

To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:

Let us create a function that takes the vocabulary, document frequencies, the total number of training documents, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:

We then vectorize all documents in each set:


Q9: Why do we use only the training set to collect the vocabulary?

Classification

Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and k, and returns the predicted book ID of the document and its similarity score:

  • L2: Measure the similarity between the input document v and every document t in the training set, and save it with the book ID of t.

  • L3-5: Return the most common book ID among the top-k training documents most similar to v, along with the highest similarity score achieved by that book.

Finally, we test our classification model on the development set:


Q10: What are the primary weaknesses and limitations of the K-Nearest Neighbors (KNN) classification model when applied to document classification?

References

Source: document_classification.py

  1. K-nearest neighbors, Wikipedia.

from src.bag_of_words_model import Document
import glob, os

def collect(dirpath: str) -> dict[int, list[Document]]:
    books = dict()

    for filename in glob.glob(os.path.join(dirpath, '*.txt')):
        t = os.path.basename(filename).split('_')
        book_id = int(t[0])
        fin = open(filename)
        books.setdefault(book_id, list()).append(fin.read().split())

    return books
def join_documents(dataset: dict[int, list[Document]]) -> list[Document]:
    return [document for documents in dataset.values() for document in documents]

trn = collect('dat/document_classification/trn')
dev = collect('dat/document_classification/dev')
tst = collect('dat/document_classification/tst')
print(len(join_documents(trn)), len(join_documents(dev)), len(join_documents(tst)))
82 14 14
corpus = join_documents(trn)
vocab = vocabulary(corpus)
dfs = document_frequencies(vocab, corpus)
D = len(corpus)
def vectorize(vocab: Vocab, dfs: SparseVector, D: int, docset: dict[int, list[Document]]) -> list[tuple[int, SparseVector]]:
    vs = []

    for book_id, documents in docset.items():
        for document in documents:
            vs.append((book_id, tf_idf(vocab, dfs, D, document)))

    return vs
trn_vs = vectorize(vocab, dfs, D, trn)
dev_vs = vectorize(vocab, dfs, D, dev)
tst_vs = vectorize(vocab, dfs, D, tst)
def knn(trn_vs: list[tuple[int, SparseVector]], v: SparseVector, k: int = 1) -> tuple[int, float]:
    sims = [(book_id, cosine_similarity(v, t)) for book_id, t in trn_vs]
    sims.sort(key=lambda x: x[1], reverse=True)
    book_id = Counter(bid for bid, _ in sims[:k]).most_common(1)[0][0]
    return book_id, max(score for bid, score in sims[:k] if bid == book_id)
correct = 0
    
for g_book_id, document in dev_vs:
    p_book_id, p_score = knn(trn_vs, document)
    if g_book_id == p_book_id: correct += 1
    print('Gold: {}, Auto: {}, Score: {:.2f}'.format(g_book_id, p_book_id, p_score))

print('Accuracy: {} ({}/{})'.format(100 * correct / len(dev_vs), correct, len(dev_vs)))
Gold: 1, Auto: 1, Score: 0.49
Gold: 1, Auto: 1, Score: 0.27
Gold: 3, Auto: 3, Score: 0.36
Gold: 3, Auto: 3, Score: 0.32
Gold: 5, Auto: 5, Score: 0.29
Gold: 5, Auto: 5, Score: 0.54
Gold: 0, Auto: 0, Score: 0.32
Gold: 0, Auto: 0, Score: 0.26
Gold: 6, Auto: 6, Score: 0.48
Gold: 6, Auto: 6, Score: 0.49
Gold: 2, Auto: 2, Score: 0.37
Gold: 2, Auto: 2, Score: 0.31
Gold: 4, Auto: 4, Score: 0.56
Gold: 4, Auto: 4, Score: 0.60
Accuracy: 100.0 (14/14)

Homework

HW3: Vector Space Models

Your task is to develop a sentiment analyzer trained on the Stanford Sentiment Treebank:

  • Create a vector_space_models.py file in the src/homework/ directory.

  • Define a function named sentiment_analyzer() that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.

  • Use the k-nearest neighbors algorithm for the classification. Find the optimal value of k using the development set (see the tuning sketch below), and then hardcode this value into your function before submission.
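One way to pick k is a simple grid search over the development set. The sketch below is illustrative only; it assumes a knn() classifier and vectorized (label, vector) pairs like those built in the Document Classification section, not code provided with the homework.

def tune_k(trn_vs, dev_vs, candidates=range(1, 16, 2)):
    # Evaluate each candidate k on the development set and keep the most accurate one.
    best_k, best_acc = 1, 0.0
    for k in candidates:
        correct = sum(1 for gold, v in dev_vs if knn(trn_vs, v, k)[0] == gold)
        acc = correct / len(dev_vs)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc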

Data

The sentiment_treebank directory contains the following two files:

  • sst_trn.tst: a training set consisting of 8,544 labeled documents.

  • sst_dev.tst: a development set consisting of 1,101 labeled documents.

Each line is a document, formatted as follows:

[Label]\t[Document]

Below are the explanations of what each label signifies:

  • 0: Very negative

  • 1: Negative

  • 2: Neutral

  • 3: Positive

  • 4: Very positive
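A minimal sketch (not part of the provided materials) for loading this tab-separated format into (label, tokens) pairs might look like the following; the path shown is hypothetical.

def read_sst(filepath: str) -> list[tuple[int, list[str]]]:
    data = []
    for line in open(filepath):
        # Split only on the first tab: the label comes first, the document follows.
        label, text = line.split('\t', 1)
        data.append((int(label), text.split()))
    return data

trn = read_sst('dat/sentiment_treebank/sst_trn.tst')  # hypothetical path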

Submission

Commit and push the vector_space_models.py file to your GitHub repository.

Extra Credit

Define a function named sentiment_analyzer_extra() that gives an improved sentiment analyzer.

Rubric

  • Code Submission (1 point)

  • Program Execution (1 point)

  • Development Set Accuracy (4 points)

  • Evaluation Set Accuracy (4 points)


  • Concept Quiz (2 points)


    Vector Space Models

    A vector space model is a computational framework that represents text documents as vectors in a high-dimensional space, where each dimension of a document vector corresponds to a particular term in the vocabulary.

    Contents

    • Bag-of-Words Model
    • Term Weighting
    • Similarity Metrics
    • Document Classification
    • Homework

    Document Similarity

    Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the chronicles_of_narnia.txt corpus:

    from src.bag_of_words_model import vocabulary
    from src.term_weighting import read_corpus, document_frequencies, tf_idf

    if __name__ == '__main__':
        corpus = read_corpus('dat/chronicles_of_narnia.txt')
        vocab = vocabulary(corpus)
        dfs = document_frequencies(vocab, corpus)
        D = len(corpus)

        documents = [
            'I like this movie very much'.split(),
            'I hate this movie very much'.split(),
            'I love this movie so much'.split()
        ]

        vs = [tf_idf(vocab, dfs, D, document) for document in documents]
        for v in vs: print(v)
    {980: 0.31, 7363: 0.52, 7920: 0.70, 11168: 0.51, 11833: 0.51}
    {980: 0.31, 6423: 1.24, 7920: 0.70, 10325: 0.53, 11168: 0.51}

    Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and Cosine similarity.

    Euclidean Similarity

    Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space; it represents the magnitude of the differences between the two vectors.

    Let $V_i = [v_{i1}, \dots, v_{in}]$ and $V_j = [v_{j1}, \dots, v_{jn}]$ be two vectors representing documents $D_i$ and $D_j$, respectively. The Euclidean distance between the two vectors can be measured as follows:

    $$\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^n (v_{ik} - v_{jk})^2}$$

    Let us define a function that takes two vectors in our notation and returns the Euclidean distance between them:

    • L6: x ** k raises x to the power of k.

    We then measure the Euclidean distances among the vectors above:

    The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between $v_0$ and $v_1$ is shorter than the distance between $v_0$ and $v_2$, implying that $v_1$ is more similar to $v_0$ than $v_2$ is, which contradicts our intuition.

    Cosine Similarity

    Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).

    The cosine similarity between two vectors can be measured as follows:

    $$\frac{V_i \cdot V_j}{\lVert V_i \rVert \lVert V_j \rVert} = \frac{\sum_{\forall k} (v_{ik} \cdot v_{jk})}{\sqrt{\sum_{\forall k} (v_{ik})^2} \cdot \sqrt{\sum_{\forall k} (v_{jk})^2}}$$

    Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

    We then measure the cosine similarities among the vectors above:

    The cosine similarity between two identical vectors is 1, although it is computed as 0.9999999999999999 here due to floating-point precision (L1). Similar to the Euclidean distance case, the similarity between $v_0$ and $v_1$ is greater than the similarity between $v_0$ and $v_2$, which again contradicts our intuition.

    The following diagram illustrates the difference between the two metrics: the Euclidean distance measures the difference in magnitude between two vectors, while the cosine similarity measures the angle between them at the origin.
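    As a small numerical illustration (not from the original page), consider a vector w that points in the same direction as u but with twice the magnitude: the Euclidean distance between them is large, yet their cosine similarity is exactly 1.

    u = {0: 1.0, 1: 2.0}
    w = {0: 2.0, 1: 4.0}
    print(euclidean_distance(u, w))   # ~2.24: the vectors differ in magnitude
    print(cosine_similarity(u, w))    # 1.0: the vectors point in the same direction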


    Q7: Why is Cosine Similarity generally preferred over Euclidean Distance in most NLP applications?

    import math
    from src.bag_of_words_model import SparseVector
    
    def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
        d = sum((v - v2.get(k, 0)) ** 2 for k, v in v1.items())
        d += sum(v ** 2 for k, v in v2.items() if k not in v1)
        return math.sqrt(d)
    print(euclidean_distance(vs[0], vs[0]))
    print(euclidean_distance(vs[0], vs[1]))
    print(euclidean_distance(vs[0], vs[2]))
    0.0
    1.347450458032576
    1.3756015678855296
    def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
        n = sum(v * v2.get(k, 0) for k, v in v1.items())
        d = math.sqrt(sum(v ** 2 for k, v in v1.items()))
        d *= math.sqrt(sum(v ** 2 for k, v in v2.items()))
        return n / d
    print(cosine_similarity(vs[0], vs[0]))
    print(cosine_similarity(vs[0], vs[1]))
    print(cosine_similarity(vs[0], vs[2]))
    0.9999999999999999
    0.5775130451716284
    0.4826178600593854

    Term Weighting

    Term Frequency

    The frequency of a word $w$'s occurrences in a document $D$ is called the Term Frequency (TF) of $w \in D$. TF is often used to determine the importance of a term within a document; terms that appear more frequently are considered more relevant to the document's content.

    However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:

    • L3: uses union types, callable objects, and default arguments.

    • L5: Define a tokenizer function using a lambda expression.

    We then retrieve documents in chronicles_of_narnia.txt and create a vocabulary dictionary:

    Let us define a function that takes a vocabulary dictionary and a list of documents, and prints the TFs of all terms in each document using the bag-of-words model:

    • L5: The underscore (_) is used to indicate that the variable is not being used in the loop.

    Finally, let us print the TFs of all terms in the following three documents:

    In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.


    Q3: If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?

    Stopwords

    One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.

    Let us retrieve a set of commonly used stop words from stopwords.txt and define a function to determine if a term should be considered a stop word:

    • L1: string.punctuation provides a string of common punctuation characters.

    Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:

    Finally, let us print the TFs of the same documents using the updated vocabulary:


    Q4: Stop words can be filtered either during the creation of vocabulary dictionary or when generating the bag-of-words representations. Which approach is preferable and why?

    Document Frequency

    Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them as equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.

    Document Frequency (DF) quantifies how often a term appears across a set of documents within a corpus; it is the number of documents in the corpus that contain a particular term.

    Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:

    We then compare the term and document frequencies of all terms in the above documents:

    • L9: Sort the list of tuples by the second item (TF) in descending order, then by the third item (DF) in ascending order.

    • L11: The tuple t is unpacked into three arguments and passed to the format() function.

    Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.


    Q5: What are the implications when a term has a high document frequency?

    TF-IDF

    Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).

    Given a term $t$ in a document $d \in D$, where $D$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:

    $$\begin{align*} \mathbf{TF\text{-}IDF}(t, d, D) &= \mathbf{TF}(t, d) \cdot \mathbf{IDF}(t, D) \\ \mathbf{TF}(t, d) &= \frac{\#(t, d)}{|d|} \\ \mathbf{IDF}(t, D) &= \log\frac{|D|}{\mathbf{DF}(t, D)} \end{align*}$$

    In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.
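    For example (an illustrative case, not drawn from the corpus above), if a term occurs 2 times in a 20-token document and appears in 10 out of 1,000 documents, then using the natural logarithm:

    $$\mathbf{TF} = \frac{2}{20} = 0.1, \quad \mathbf{IDF} = \log\frac{1000}{10} \approx 4.61, \quad \mathbf{TF\text{-}IDF} \approx 0.46$$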

    Let us define a function that takes a vocabulary dictionary, a DF dictionary, the total number of documents in the corpus, and a document, and returns the TF-IDF scores of all terms in the document:

    We then compute the TF-IDF scores of terms in the above documents:

    Several variants of TF-IDF have been proposed to enhance the representation in certain contexts:

    • Sublinear scaling on TF: $\left\{\begin{array}{cl} 1 + \log\mathbf{TF}(t,d) & \mbox{if } \mathbf{TF}(t,d) > 0 \\ 0 & \mbox{otherwise} \end{array}\right.$

    • Normalized TF: $\alpha + (1-\alpha)\frac{\mathbf{TF}(t,d)}{\max(\mathbf{TF}(*,d))}$

    • Normalized IDF: $\frac{\max(\mathbf{IDF}(*,D))}{\mathbf{IDF}(t,D)}$

    • Probabilistic IDF: $\frac{|D| - \mathbf{DF}(t,D)}{\mathbf{DF}(t,D)}$
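    As an illustration, the sketch below (not part of the course code) adapts the tf_idf() function to sublinear TF scaling by replacing the raw term count with 1 + log(count); the function name is hypothetical.

    import math

    def tf_idf_sublinear(vocab: Vocab, dfs: SparseVector, D: int, document: Document) -> SparseVector:
        # Sublinear scaling applied to the raw count: repeated terms get diminishing weight.
        idf = lambda tid: math.log(D / dfs[tid])
        return {tid: (1 + math.log(count)) * idf(tid) for tid, count in bag_of_words(vocab, document).items()}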


    Q6: Should we still apply stop words when using TF-IDF scores to represent the documents?

    References

    1. Source: term_weighting.py

    from typing import Callable
    
    def read_corpus(filename: str, tokenizer: Callable[[str], list[str]] | None = None) -> list[Document]:
        fin = open(filename)
        if tokenizer is None: tokenizer = lambda s: s.split()
        return [tokenizer(line) for line in fin]
    from src.bag_of_words_model import vocabulary
    
    corpus = read_corpus('dat/chronicles_of_narnia.txt')
    vocab = vocabulary(corpus)
    from src.bag_of_words_model import bag_of_words, Document, Vocab
    
    def print_tfs(vocab: Vocab, documents: list[Document]):
        tfs = [bag_of_words(vocab, document) for document in documents]
        words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
        for tf in tfs:
            print([(words[index], count) for index, count in sorted(tf.items(), key=lambda x: x[1], reverse=True)])
    from elit_tokenizer import EnglishTokenizer
    
    ds = [
        "As dawn broke, the first light kissed the golden mane of Aslan, the rightful king of Narnia.",
        "The White Witch's icy breath froze the once lush meadows, casting a shadow over Narnia.",
        "Lucy's footsteps echoed in the halls of Cair Paravel, where legends were born."
    ]
    
    etok = EnglishTokenizer()
    documents = [etok.decode(d).tokens  for d in ds]
    print_tfs(vocab, documents)
    [('the', 3), (',', 2), ('of', 2), ('.', 1), ('As', 1), ('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('first', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
    [("'s", 1), (',', 1), ('.', 1), ('Narnia', 1), ('The', 1), ('White', 1), ('Witch', 1), ('a', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('once', 1), ('over', 1), ('shadow', 1), ('the', 1)]
    [("'s", 1), (',', 1), ('.', 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('in', 1), ('legends', 1), ('of', 1), ('the', 1), ('were', 1), ('where', 1)]
    from string import punctuation
    
    stopwords = {line.strip().lower() for line in open('dat/stopwords.txt')}
    is_stopwords = lambda w: w.lower() in stopwords or w in punctuation
    sw_tokenizer = lambda s: [word for word in s.split() if not is_stopwords(word)]
    corpus = read_corpus('dat/chronicles_of_narnia.txt', sw_tokenizer)
    vocab = vocabulary(corpus)
    print_tfs(vocab, documents)
    [('Aslan', 1), ('Narnia', 1), ('broke', 1), ('dawn', 1), ('golden', 1), ('king', 1), ('kissed', 1), ('light', 1), ('mane', 1), ('rightful', 1)]
    [("'s", 1), ('Narnia', 1), ('White', 1), ('Witch', 1), ('breath', 1), ('casting', 1), ('froze', 1), ('icy', 1), ('shadow', 1)]
    [("'s", 1), ('Cair', 1), ('Lucy', 1), ('Paravel', 1), ('born', 1), ('echoed', 1), ('footsteps', 1), ('halls', 1), ('legends', 1)]
    from collections import Counter
    from src.types import SparseVector
    
    def document_frequencies(vocab: Vocab, corpus: list[Document]) -> SparseVector:
        counts = Counter()
        for document in corpus:
            counts.update(set(document))
        return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
    corpus = read_corpus('dat/chronicles_of_narnia.txt')
    vocab = vocabulary(corpus)
    words = [word for word, _ in sorted(vocab.items(), key=lambda x: x[1])]
    
    dfs = document_frequencies(vocab, corpus)
    for document in documents:
        bow = bag_of_words(vocab, document)
        tf_df = [(words[tid], tf, dfs[tid]) for tid, tf in sorted(bow.items())]
        tf_df = sorted(tf_df, key=lambda x: (-x[1], x[2]))
        print(' '.join(document))
        print('\n'.join(['{:>10}  {}  {:>5}'.format(*t) for t in tf_df]))
    As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
           the  3   9574
            of  2   5355
             ,  2  10578
      rightful  1      1
          dawn  1      6
        kissed  1     26
         broke  1     35
          king  1     40
          mane  1     40
        golden  1     52
            As  1    161
         light  1    203
         first  1    401
        Narnia  1    512
         Aslan  1    706
             .  1  19747
    The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
       casting  1      1
         froze  1      2
           icy  1      3
        shadow  1     38
         White  1     44
        breath  1     86
         Witch  1    246
          once  1    378
          over  1    431
        Narnia  1    512
           The  1   1352
            's  1   2404
             a  1   5456
           the  1   9574
             ,  1  10578
             .  1  19747
    Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
     footsteps  1      1
       legends  1      1
        echoed  1      2
         halls  1      4
          born  1     14
       Paravel  1     84
          Cair  1     86
         where  1    360
          Lucy  1    704
          were  1   1684
            's  1   2404
            in  1   3513
            of  1   5355
           the  1   9574
             ,  1  10578
             .  1  19747
    def tf_idf(vocab: Vocab, dfs: SparseVector, D: int, document: Document) -> SparseVector:
        tf = lambda count: count / len(document)
        idf = lambda tid: math.log(D / dfs[tid])
        return {tid: tf(count) * idf(tid) for tid, count in bag_of_words(vocab, document).items()}
    for document in documents:
        tfidf = tf_idf(vocab, dfs, len(corpus), document)
        print(' '.join(document))
        print('\n'.join(['{:>10}  {:.2f}'.format(words[tid], score) for tid, score in sorted(tfidf.items(), key=lambda x: x[1], reverse=True)]))
    As dawn broke , the first light kissed the golden mane of Aslan , the rightful king of Narnia .
      rightful  0.50
          dawn  0.41
        kissed  0.34
         broke  0.32
          king  0.32
          mane  0.32
        golden  0.30
            As  0.25
         light  0.24
         first  0.20
        Narnia  0.19
         Aslan  0.17
            of  0.14
           the  0.13
             ,  0.08
             .  0.01
    The White Witch 's icy breath froze the once lush meadows , casting a shadow over Narnia .
       casting  0.56
         froze  0.52
           icy  0.50
        shadow  0.35
         White  0.35
        breath  0.31
         Witch  0.25
          once  0.23
          over  0.22
        Narnia  0.21
           The  0.16
            's  0.12
             a  0.08
           the  0.05
             ,  0.04
             .  0.01
    Lucy 's footsteps echoed in the halls of Cair Paravel , where legends were born .
     footsteps  0.63
       legends  0.63
        echoed  0.58
         halls  0.54
          born  0.46
       Paravel  0.35
          Cair  0.35
         where  0.26
          Lucy  0.22
          were  0.16
            's  0.14
            in  0.12
            of  0.09
           the  0.05
             ,  0.05
             .  0.01

    Bag-of-Words Model

    Overview

    In the bag-of-words model, a document is represented as a set or a "bag" of words, disregarding any structure but maintaining information about the frequency of every word.

    Consider a corpus containing the following two tokenized documents:

    The corpus contains a total of 14 distinct words (word types), and the entire vocabulary can be represented as a list of all word types in this corpus:

    Let $D_i = [w_{i,1}, \ldots, w_{i,n}]$ be a document, where $w_{i,j}$ is the $j$'th word in $D_i$. A vector representation for $D_i$ can be defined as $v_i = [\mathrm{count}(w_j \in D_i) : \forall w_j \in W] \in \mathbb{R}^{|W|}$, where $w_j$ is the $j$'th word in $W$ and each dimension in $v_i$ is the frequency of $w_j$'s occurrences in $D_i$ such that:

    Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $v_i$ being 0 in practice, as most words in the vocabulary $W$ do not occur in document $D_i$. Therefore, it is more efficient to represent $v_i$ as a sparse vector:


    Q1: One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?

    Implementation

    Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:

    We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:

    Finally, let us test our bag-of-words model with the examples above:


    Q2: Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?

    References

    1. Source: bag_of_words_model.py

    2. Bag-of-Words Model, Wikipedia.

    3. Bags of words, Working With Text Data, scikit-learn Tutorials.

    D1 = ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.']
    D2 = ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
    W = [
        '.',        # 0
        'John',     # 1
        'Mary',     # 2
        'The',      # 3
        'a',        # 4
        'book',     # 5
        'bought',   # 6
        'funny',    # 7
        'gave',     # 8
        'it',       # 9
        'liked',    # 10
        'the',      # 11
        'to',       # 12
        'was'       # 13
    ]
    #     0  1  2  3  4  5  6  7  8  9 10 11 12 13 
    v1 = [2, 1, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1]
    v2 = [2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]
    v1 = {0:2, 1:1, 3:1, 4:1, 5:2, 6:1, 7:1, 13:1}
    v2 = {0:2, 1:1, 2:2, 5:1, 8:1, 9:1, 10:1, 11:1, 12:1}
    from typing import TypeAlias
    
    Document: TypeAlias = list[str]
    Vocab: TypeAlias = dict[str, int]
    
    def vocabulary(documents: list[Document]) -> Vocab:
        vocab = set()
    
        for document in documents:
            vocab.update(document)
    
        return {word: i for i, word in enumerate(sorted(list(vocab)))}
    from collections import Counter
    
    SparseVector: TypeAlias = dict[int, int | float]
    
    def bag_of_words(vocab: Vocab, document: Document) -> SparseVector:
        counts = Counter(document)
        return {vocab[word]: count for word, count in sorted(counts.items()) if word in vocab}
    documents = [
        ['John', 'bought', 'a', 'book', '.', 'The', 'book', 'was', 'funny', '.'],
        ['Mary', 'liked', 'the', 'book', '.', 'John', 'gave', 'it', 'to', 'Mary', '.']
    ]
    
    vocab = vocabulary(documents)
    print(vocab)
    
    print(bag_of_words(vocab, documents[0]))
    print(bag_of_words(vocab, documents[1]))
    {
        '.': 0,
        'John': 1,
        'Mary': 2,
        'The': 3,
        'a': 4,
        'book': 5,
        'bought': 6,
        'funny': 7,
        'gave': 8,
        'it': 9,
        'liked': 10,
        'the': 11,
        'to': 12,
        'was': 13
    }
    {0: 2, 1: 1, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 13: 1}
    {0: 2, 1: 1, 2: 2, 5: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1}