Document classification, also known as text classification, is the task of assigning predefined categories or labels to documents based on their content. It is used to automatically organize, categorize, or label large collections of textual documents.
Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.
Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.
It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice is to split the dataset in proportions such as 80/10/10 or 75/10/15 for the training, development, and evaluation sets, respectively.
The document_classification directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the chronicles_of_narnia.txt file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.
Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
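The full implementation is provided in document_classification.py (see the source link at the end of this section); a minimal sketch, in which the function name read_documents and the whitespace tokenization are assumptions, might look like this:

```python
import os
from collections import defaultdict

def read_documents(dirpath):
    """Group the documents under dirpath by book ID, parsed from 'A_B' filenames."""
    books = defaultdict(list)
    for filename in sorted(os.listdir(dirpath)):
        book_id = filename.split('_')[0]               # 'A' in the 'A_B' convention
        with open(os.path.join(dirpath, filename)) as fin:
            books[book_id].append(fin.read().split())  # whitespace tokenization (assumed)
    return books
```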
We then print the number of documents in each set:
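For instance, with the directory layout described above (the paths below are illustrative):

```python
trn = read_documents('document_classification/trn')
dev = read_documents('document_classification/dev')
tst = read_documents('document_classification/tst')

for name, books in [('trn', trn), ('dev', dev), ('tst', tst)]:
    print(name, sum(len(docs) for docs in books.values()))
```

This should report 82, 14, and 14 documents for the training, development, and evaluation sets, respectively.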
To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:
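One possible sketch, assuming the dictionary returned by read_documents above (the function name vocabulary_and_df is hypothetical):

```python
from collections import Counter

def vocabulary_and_df(books):
    """Collect the vocabulary and the document frequency of every training word."""
    df = Counter()
    for docs in books.values():
        for doc in docs:
            df.update(set(doc))         # count each word at most once per document
    vocab = {word: i for i, word in enumerate(sorted(df))}   # word -> vector index
    return vocab, df

vocab, df = vocabulary_and_df(trn)
```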
Why do we use only the training set to collect the vocabulary?
Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:
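A sketch of such a function, under two assumptions: "document length" is interpreted as the total number of training documents D (used for the IDF term), and each document is represented as a scipy sparse row vector:

```python
import math
from collections import Counter
from scipy.sparse import csr_matrix

def vectorize(vocab, df, D, books):
    """Return (book ID, sparse TF-IDF vector) tuples for every document in books."""
    vectors = []
    for book_id, docs in books.items():
        for doc in docs:
            tf = Counter(w for w in doc if w in vocab)   # drop out-of-vocabulary words
            cols = [vocab[w] for w in tf]
            vals = [c * math.log(D / df[w]) for w, c in tf.items()]   # TF-IDF weights
            vec = csr_matrix((vals, ([0] * len(cols), cols)), shape=(1, len(vocab)))
            vectors.append((book_id, vec))
    return vectors
```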
We then vectorize all documents in each set:
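Using the helpers sketched above, this step might look as follows:

```python
D = sum(len(docs) for docs in trn.values())   # total number of training documents
trn_vectors = vectorize(vocab, df, D, trn)
dev_vectors = vectorize(vocab, df, D, dev)
tst_vectors = vectorize(vocab, df, D, tst)
```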
Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and k, and returns the predicted book ID of the document and its similarity score:
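The full implementation is in document_classification.py; below is a minimal sketch under the same assumptions as before (scipy sparse row vectors), with a cosine() helper defined for the similarity measure. The L2 and L3-4 notes that follow refer to lines within the knn function.

```python
import math
from collections import Counter

def cosine(v1, v2):
    """Cosine similarity between two scipy sparse row vectors."""
    denom = math.sqrt(v1.multiply(v1).sum()) * math.sqrt(v2.multiply(v2).sum())
    return v1.multiply(v2).sum() / denom if denom else 0.0

def knn(trn_vectors, doc, k=1):
    sims = [(book_id, cosine(doc, vec)) for book_id, vec in trn_vectors]
    tops = sorted(sims, key=lambda t: t[1], reverse=True)[:k]
    return Counter(b for b, _ in tops).most_common(1)[0]
```

Note that this sketch returns the winning book ID with its vote count among the top k neighbors; returning the corresponding cosine similarity instead, as the description above suggests, is a one-line variation.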
L2: Measure the similarity between the input document and every document in the training set, and save it with the book ID of that training document.

L3-4: Return the most common book ID among the top-k documents in the training set that are most similar to the input document.

Finally, we test our classification model on the development set:
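A minimal sketch of this evaluation loop, where the choice of k is illustrative rather than taken from the source:

```python
correct = 0
for book_id, vec in dev_vectors:
    pred, _ = knn(trn_vectors, vec, k=5)   # predicted book ID for this chapter
    correct += int(pred == book_id)
print(f'Accuracy: {100 * correct / len(dev_vectors):.1f}% ({correct}/{len(dev_vectors)})')
```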
What are the potential weaknesses or limitations of this classification model?

Source: document_classification.py

References: [1] K-nearest neighbors, Wikipedia.