A vector space model is a computational framework that represents text documents as vectors in a high-dimensional space, where each dimension of a document's vector corresponds to a particular term in the vocabulary.
In the bag-of-words model, a document is represented as a multiset, or "bag," of its words, disregarding any structure such as word order but maintaining information about the frequency of every word.
Consider a corpus containing the following two tokenized documents:
The corpus contains a total of 14 words, and the entire vocabulary can be represented as a list of all word types in this corpus:
One limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
How does the bag-of-words model handle unknown words that are not in the vocabulary?
Let us define a function that takes a list of documents, where each document is represented as a list of tokens, and returns a dictionary, where keys are words and values are their corresponding unique IDs:
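A minimal sketch of such a function, assuming each document is a list of token strings; the name create_vocabulary is illustrative rather than the original implementation:

```python
def create_vocabulary(documents):
    """Assign a unique ID to every word type found in the documents.

    :param documents: a list of documents, each a list of tokens.
    :return: a dictionary mapping each word to a unique integer ID.
    """
    vocab = {}
    for document in documents:
        for token in document:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab
```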
We then define a function that takes the vocabulary dictionary and a document, and returns a bag-of-words in a sparse vector representation:
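A sketch under the same assumptions, representing the bag-of-words as a sorted list of (term ID, count) pairs; the name bag_of_words is likewise illustrative:

```python
from collections import Counter

def bag_of_words(vocab, document):
    """Return a sparse bag-of-words vector as a sorted list of (term ID, count) pairs.

    :param vocab: a dictionary mapping words to unique IDs.
    :param document: a list of tokens; tokens not in the vocabulary are ignored.
    """
    counts = Counter(token for token in document if token in vocab)
    return sorted((vocab[token], count) for token, count in counts.items())
```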
Finally, let us test our bag-of-words model with the examples above:
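For example, using the two sketched helpers above with a pair of hypothetical tokenized documents (the actual example documents are the ones given earlier):

```python
# Hypothetical example documents; the actual two documents are shown above.
D1 = ['John', 'bought', 'a', 'book', '.']
D2 = ['Mary', 'read', 'the', 'book', '.']

vocab = create_vocabulary([D1, D2])
print(vocab)                      # word -> ID mapping
print(bag_of_words(vocab, D1))    # sparse vector of D1
print(bag_of_words(vocab, D2))    # sparse vector of D2
```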
Source: bag_of_words_model.py
Bag-of-Words Model, Wikipedia
Bags of words, Working With Text Data, scikit-learn Tutorials
Your task is to develop a sentiment analyzer trained on the Stanford Sentiment Treebank:
Create a vector_space_models.py file in the src/homework/ directory.
Define a function named sentiment_analyzer()
that takes two parameters, a list of training documents and a list of test documents for classification, and returns the predicted sentiment labels along with the respective similarity scores.
Use the k-nearest neighbors algorithm for the classification. Find the optimal value of k using the development set, and then hardcode this value into your function before submission.
The sentiment_treebank directory contains the following two files:
sst_trn.tst: a training set consisting of 8,544 labeled documents.
sst_dev.tst: a development set consisting of 1,101 labeled documents.
Each line is a document, which is formatted as follows:
Below are the explanations of what each label signifies:
0: Very negative
1: Negative
2: Neutral
3: Positive
4: Very positive
Commit and push the vector_space_models.py file to your GitHub repository.
Define a function named sentiment_analyzer_extra()
that implements an improved sentiment analyzer.
Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the corpus:
Once the documents are vectorized, they can be compared within the respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and the cosine similarity.
Euclidean distance is a measure of the straight-line distance between two vectors in Euclidean space such that it represents the magnitude of the differences between the two vectors.
L6: ** k represents the power of k.
We then measure the Euclidean distance between the two vectors above:
Cosine similarity is a measure of similarity between two vectors in an inner product space such that it calculates the cosine of the angle between two vectors, where a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular to each other).
The cosine similarity between two vectors $X_i$ and $X_j$ can be measured as follows:

$$\cos(X_i, X_j) = \frac{X_i \cdot X_j}{\|X_i\| \, \|X_j\|}$$
Let us define a function that takes two sparse vectors and returns the cosine similarity between them:
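A minimal sketch, assuming the sparse vectors are lists of (term ID, value) pairs as above; the function name is illustrative:

```python
import math

def cosine_similarity(v1, v2):
    """Compute the cosine similarity between two sparse vectors.

    :param v1: a sparse vector as a list of (term ID, value) pairs.
    :param v2: a sparse vector as a list of (term ID, value) pairs.
    """
    d1, d2 = dict(v1), dict(v2)
    dot = sum(value * d2.get(term_id, 0) for term_id, value in d1.items())
    norm1 = math.sqrt(sum(value ** 2 for value in d1.values()))
    norm2 = math.sqrt(sum(value ** 2 for value in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```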
We then measure the cosine similarity between the two vectors above:
Why do these metrics determine that D1 is more similar to D2 than to D3?
The following diagram illustrates the difference between the two metrics: the Euclidean distance measures the magnitude of the difference between two vectors, while the cosine similarity measures the angle between them with respect to the origin.
Document classification, also known as text classification, is the task of assigning predefined categories or labels to documents based on their content; it is used to automatically organize, categorize, or label large collections of textual documents.
Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.
Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.
It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice is to split the dataset into ratios such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.
The directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.
Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
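A sketch of such a loader, assuming the A_B file-naming convention described above and simple whitespace tokenization; the name read_books is illustrative:

```python
import glob
import os

def read_books(dirpath):
    """Group documents (chapters) by their book label.

    :param dirpath: path to a directory whose files are named 'A_B' (book ID, chapter ID).
    :return: a dictionary mapping each book ID to a list of whitespace-tokenized chapters.
    """
    books = {}
    for filepath in glob.glob(os.path.join(dirpath, '*')):
        book_id = os.path.basename(filepath).split('_')[0]
        with open(filepath) as fin:
            books.setdefault(book_id, []).append(fin.read().split())
    return books
```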
We then print the number of documents in each set:
To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:
Why do we use only the training set to collect the vocabulary?
Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:
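A sketch of such a vectorizer, interpreting "document length" as the total number of training documents used for the IDF term; the name and the exact TF-IDF weighting are assumptions:

```python
import math
from collections import Counter

def vectorize_books(vocab, dfs, num_docs, book_docs):
    """Convert every document into a (book ID, sparse TF-IDF vector) tuple.

    :param vocab: a dictionary mapping words to term IDs.
    :param dfs: a dictionary mapping term IDs to document frequencies.
    :param num_docs: the total number of documents used to estimate the DFs.
    :param book_docs: a dictionary mapping book IDs to lists of tokenized documents.
    :return: a list of (book ID, sparse vector) tuples, where each sparse vector is a
             sorted list of (term ID, TF-IDF score) pairs.
    """
    vectors = []
    for book_id, documents in book_docs.items():
        for document in documents:
            counts = Counter(t for t in document if t in vocab)
            vector = sorted((vocab[t], (c / len(document)) * math.log(num_docs / dfs[vocab[t]]))
                            for t, c in counts.items())
            vectors.append((book_id, vector))
    return vectors
```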
We then vectorize all documents in each set:
Finally, we test our classification model on the development set:
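A sketch of the evaluation loop, assuming the classifier is a function that takes the training vectors, a document vector, and k, and returns a predicted book ID with a score; all names are illustrative:

```python
def evaluate(classifier, trn_vectors, dev_vectors, k=3):
    """Return the accuracy of a classifier on the development set.

    :param classifier: a function (trn_vectors, vector, k) -> (predicted book ID, score).
    :param trn_vectors: a list of (book ID, sparse vector) tuples from the training set.
    :param dev_vectors: a list of (book ID, sparse vector) tuples from the development set.
    """
    correct = sum(1 for book_id, vector in dev_vectors
                  if classifier(trn_vectors, vector, k)[0] == book_id)
    return correct / len(dev_vectors)
```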
What are the potential weaknesses or limitations of this classification model?
The frequency of a word's occurrences in a document is called the Term Frequency (TF) of that word in the document. TF is often used to determine the importance of a term within a document such that terms that appear more frequently are considered more relevant to the document's content.
However, TF alone does not always reflect the semantic importance of a term. To demonstrate this limitation, let us define a function that takes a filepath to a corpus and returns a list of documents, with each document represented as a separate line in the corpus:
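A minimal sketch, assuming each non-empty line in the corpus file is one document; the default whitespace tokenizer is an assumption:

```python
def read_corpus(filepath, tokenizer=None):
    """Read a corpus in which each line is a document.

    :param filepath: path to the corpus file.
    :param tokenizer: a function mapping a line to a list of tokens;
                      defaults to whitespace splitting (an assumption).
    :return: a list of tokenized documents.
    """
    if tokenizer is None:
        tokenizer = lambda s: s.split()
    with open(filepath) as fin:
        return [tokenizer(line) for line in fin if line.strip()]
```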
L3: Define a tokenizer function using a expression.
We then retrieve the documents from the corpus and create a vocabulary dictionary:
Let us define a function that takes a vocabulary dictionary, a tokenizer, and a list of documents, and prints the TFs of all terms in each document:
L6: The underscore (_) is used to indicate that the variable is not being used in the loop.
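A sketch of such a function (the original implementation, whose lines are annotated above, is not shown here); it assumes the documents are raw strings and the names are illustrative:

```python
from collections import Counter

def print_term_frequencies(vocab, tokenizer, documents):
    """Print the term frequency (TF) of every in-vocabulary term in each document.

    :param vocab: a dictionary mapping words to term IDs.
    :param tokenizer: a function mapping a raw document string to a list of tokens.
    :param documents: a list of raw document strings.
    """
    for document in documents:
        counts = Counter(t for t in tokenizer(document) if t in vocab)
        print(sorted(counts.items(), key=lambda item: item[1], reverse=True))
```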
Finally, let us print the TFs of all terms in the following three documents:
In the first document, terms that are typically considered semantically important, such as "Aslan" or "Narnia," receive a TF of 1, whereas functional terms such as "the" or punctuation like "," or "." receive higher TFs.
If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?
One simple approach to addressing this issue is to discard common terms with little semantic value, referred to as stop words, which occur frequently but do not convey significant information about the content of the text. By removing stop words, the focus can be placed on the more meaningful content words, which are often more informative for downstream tasks.
Next, we define a tokenizer that excludes stop words during the tokenization process, and use it to retrieve the vocabulary:
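A sketch of such a tokenizer, assuming the stop words are available as a plain Python set (the tiny set below is illustrative only) and that punctuation-only tokens are also dropped:

```python
import string

# An illustrative stop word set; the list used in the original materials is larger.
stop_words = {'the', 'a', 'an', 'of', 'and', 'to', 'in', 'it', 'is', 'was'}

def tokenize_without_stop_words(text):
    """Split on whitespace and drop stop words and punctuation-only tokens (an assumed scheme)."""
    return [t for t in text.split()
            if t.lower() not in stop_words
            and not all(c in string.punctuation for c in t)]

# The vocabulary can then be rebuilt from the filtered tokens, e.g.:
# vocab = create_vocabulary([tokenize_without_stop_words(doc) for doc in raw_documents])
```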
Finally, let us print the TFs of the same documents using the updated vocabulary:
Note that stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Do both approaches produce the same results? Which approach is preferable?
Filtering out stop words allows us to generate less noisy vector representations. However, in the above examples, all terms now have the same TF of 1, treating them equally important. A more sophisticated weighting approach involves incorporating information about terms across multiple documents.
Document Frequency (DF) is a measure to quantify how often a term appears across a set of documents within a corpus such that it represents the number of documents within the corpus that contain a particular term.
Let us define a function that takes a vocabulary dictionary and a corpus, and returns a dictionary whose keys are term IDs and values are their corresponding document frequencies:
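A minimal sketch matching the description above; the name document_frequencies is illustrative:

```python
def document_frequencies(vocab, corpus):
    """Count, for every term ID, the number of documents that contain the term.

    :param vocab: a dictionary mapping words to term IDs.
    :param corpus: a list of tokenized documents.
    :return: a dictionary mapping term IDs to their document frequencies.
    """
    dfs = {}
    for document in corpus:
        for token in set(document):           # count each term at most once per document
            if token in vocab:
                dfs[vocab[token]] = dfs.get(vocab[token], 0) + 1
    return dfs
```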
We then compare the term and document frequencies of all terms in the above documents:
L9: Sort the list of tuples first by the second item in descending order, then by the third item in ascending order.
Notice that functional terms with high TFs such as "the" or "of," as well as punctuation, also have high DFs. Thus, it is possible to estimate more semantically important term scores through appropriate weighting between these two types of frequencies.
What are the implications when a term has a high document frequency?
Term Frequency - Inverse Document Frequency (TF-IDF) is used to measure the importance of a term in a document relative to a corpus of documents by combining two metrics: term frequency (TF) and inverse document frequency (IDF).
In this formulation, TF is calculated using the normalized count of the term's occurrences in the document instead of the raw count. IDF measures how rare a term is across a corpus of documents and is calculated as the logarithm of the ratio of the total number of documents in the corpus to the DF of the term.
Let us define a function that takes a vocabulary dictionary, a DF dictionary, the total number of documents, and a document, and returns the TF-IDF scores of all terms in the document:
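A sketch following the definition above (normalized TF multiplied by the logarithm of the ratio of the total number of documents to the DF); names are illustrative:

```python
import math
from collections import Counter

def tf_idf(vocab, dfs, num_docs, document):
    """Return the TF-IDF score of every in-vocabulary term in the document.

    :param vocab: a dictionary mapping words to term IDs.
    :param dfs: a dictionary mapping term IDs to document frequencies.
    :param num_docs: the total number of documents in the corpus.
    :param document: a list of tokens.
    :return: a dictionary mapping term IDs to TF-IDF scores.
    """
    counts = Counter(t for t in document if t in vocab)
    return {vocab[t]: (c / len(document)) * math.log(num_docs / dfs[vocab[t]])
            for t, c in counts.items()}
```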
We then compute the TF-IDF scores of terms in the above documents:
Should we still filter out stop words when using TF-IDF scores to represent the documents?
Various variants of TF-IDF have been proposed to enhance the representation in certain contexts:
Let $D = [w_1, \ldots, w_n]$ be a document, where $w_i$ is the $i$'th word in $D$. A vector representation for $D$ can be defined as $X = [x_1, \ldots, x_m]$, where $t_j$ is the $j$'th word in the vocabulary $V$ (with $|V| = m$) and each dimension $x_j$ in $X$ is the frequency of $t_j$'s occurrences in $D$ such that:

$$x_j = \mathrm{count}(t_j, D)$$

Notice that the bag-of-words model often results in a highly sparse vector, with many dimensions in $X$ being 0 in practice, as most words in the vocabulary do not occur in document $D$. Therefore, it is more efficient to represent $X$ as a sparse vector:
Let $X_1$ and $X_2$ be two vectors representing documents $D_1$ and $D_2$. The Euclidean distance between the two vectors can be measured as follows:

$$\|X_1 - X_2\| = \sqrt{\sum_{j=1}^{m} (x_{1,j} - x_{2,j})^2}$$
Let us define a function that takes two vectors in our notation and returns the Euclidean distance between them:
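A minimal sketch over the same sparse (term ID, value) representation; the name is illustrative:

```python
import math

def euclidean_distance(v1, v2):
    """Compute the Euclidean distance between two sparse vectors.

    :param v1: a sparse vector as a list of (term ID, value) pairs.
    :param v2: a sparse vector as a list of (term ID, value) pairs.
    """
    d1, d2 = dict(v1), dict(v2)
    return math.sqrt(sum((d1.get(t, 0) - d2.get(t, 0)) ** 2 for t in set(d1) | set(d2)))
```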
The Euclidean distance between two identical vectors is 0 (L1). Interestingly, the distance between D1 and D3 is shorter than the distance between D1 and D2, implying that D1 is more similar to D3 than to D2, which contradicts our intuition.
The cosine similarity between two identical vectors is 1, although it is calculated here as 0.99 due to floating-point precision (L1). As in the Euclidean distance case, the similarity between D1 and D3 is greater than the similarity between D1 and D2, which again contradicts our intuition.
Let us develop a classification model using the k-nearest neighbors algorithm [1] that takes the training vector set, a document, and k, and returns the predicted book ID of the document and its similarity score:
L2: Measure the similarity between the input document and every document in the training set, and save it with the book ID of that training document.
L3-4: Return the most common book ID among the top-k documents in the training set that are most similar to the input document.
Source: k-nearest neighbors algorithm, Wikipedia.
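A sketch of such a classifier, reusing the cosine_similarity sketch from earlier and assuming the training vectors are stored as (book ID, sparse vector) tuples; it is not the original implementation annotated above:

```python
from collections import Counter

def knn(trn_vectors, document, k):
    """Predict the book ID of a document with the k-nearest neighbors algorithm.

    :param trn_vectors: a list of (book ID, sparse vector) tuples from the training set.
    :param document: the sparse vector of the input document.
    :param k: the number of nearest neighbors to consider.
    :return: the most common book ID among the k most similar training documents and
             the highest similarity among those neighbors (one possible notion of score).
    """
    similarities = [(cosine_similarity(document, vector), book_id)
                    for book_id, vector in trn_vectors]
    top_k = sorted(similarities, reverse=True)[:k]
    book_id, _ = Counter(b for _, b in top_k).most_common(1)[0]
    return book_id, top_k[0][0]
```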
Let us retrieve a set of commonly used stop words and define a function to determine whether a term should be considered a stop word:
L11: The tuple t is unpacked into three arguments and passed to the function.
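A sketch, assuming the stop words come from a plain text file with one word per line (the actual source is not shown here) and that punctuation-only terms are also treated as stop words:

```python
import string

def load_stop_words(filepath):
    """Load a set of lower-cased stop words from a file with one word per line."""
    with open(filepath) as fin:
        return {line.strip().lower() for line in fin if line.strip()}

def is_stop_word(term, stop_words):
    """Return True if the term is a stop word or consists only of punctuation."""
    return term.lower() in stop_words or all(c in string.punctuation for c in term)
```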
Given a term $t$ in a document $D \in \mathcal{D}$, where $\mathcal{D}$ is the set of all documents in a corpus, its TF-IDF score can be measured as follows:

$$\mathrm{tfidf}(t, D, \mathcal{D}) = \frac{\mathrm{count}(t, D)}{|D|} \cdot \log\frac{|\mathcal{D}|}{\mathrm{df}(t)}$$
Sublinear scaling on TF:
Normalized TF:
Normalized IDF:
Probabilistic IDF:
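The exact formulations used here are not shown in this text; the following are standard forms of these variants from the TF-IDF literature, where $N$ is the total number of documents and $\mathrm{df}(t)$ is the document frequency of term $t$:

```latex
% Sublinear scaling on TF (dampens the effect of very frequent terms):
\mathrm{tf}(t, D) = 1 + \log \mathrm{count}(t, D)

% Normalized TF (maximum normalization with a smoothing constant a, commonly 0.4 or 0.5):
\mathrm{tf}(t, D) = a + (1 - a) \cdot \frac{\mathrm{count}(t, D)}{\max_{t'} \mathrm{count}(t', D)}

% Normalized (smoothed) IDF:
\mathrm{idf}(t) = \log \frac{N}{1 + \mathrm{df}(t)}

% Probabilistic IDF:
\mathrm{idf}(t) = \log \frac{N - \mathrm{df}(t)}{\mathrm{df}(t)}
```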