Document Classification
Document classification, also known as text classification, is the task of assigning predefined categories or labels to documents based on their content. It is used to automatically organize, categorize, or label large collections of textual documents.
Supervised Learning
Supervised learning is a machine learning paradigm where the algorithm is trained on a labeled dataset, with each data point (instance) being associated with a corresponding target label or output. The goal of supervised learning is to learn a mapping function from input features to output labels, which enables the algorithm to make predictions or decisions on unseen data.
Data Split
Supervised learning typically involves dividing the entire dataset into training, development, and evaluation sets. The training set is used to train a model, the development set to tune the model's hyperparameters, and the evaluation set to assess the best model tuned on the development set.
It is critical to ensure that the evaluation set is never used to tune the model during training. Common practice is to split the dataset 80/10/10 or 75/10/15 across the training, development, and evaluation sets, respectively.
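As an illustration (not the course's actual code), a random 80/10/10 split along the lines described above might be sketched as follows; the function name and ratios are chosen here for the example:

```python
import random

def split_dataset(instances, trn_ratio=0.8, dev_ratio=0.1, seed=42):
    """Shuffle and split instances into training, development, and evaluation sets."""
    random.seed(seed)
    data = list(instances)
    random.shuffle(data)
    t = int(len(data) * trn_ratio)
    d = int(len(data) * dev_ratio)
    return data[:t], data[t:t + d], data[t + d:]

trn, dev, tst = split_dataset(range(100))
print(len(trn), len(dev), len(tst))  # 80 10 10
```

Fixing the random seed makes the split reproducible across runs, which matters when comparing models tuned on the same development set.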
The document_classification directory contains the training (trn), development (dev), and evaluation (tst) sets comprising 82, 14, and 14 documents, respectively. Each document is a chapter from the chronicles_of_narnia.txt file, following a file-naming convention of A_B, where A denotes the book ID and B indicates the chapter ID.
Let us define a function that takes a path to a directory containing training documents and returns a dictionary, where each key in the dictionary corresponds to a book label, and its associated value is a list of documents within that book:
We then print the number of documents in each set:
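For example, given document sets keyed by book ID as returned by the loading function just described, the counts can be printed as follows (the toy dictionaries here are placeholders):

```python
# Hypothetical document sets: book ID -> list of chapter texts.
trn_docs = {'1': ['chapter text'] * 2, '2': ['chapter text']}
dev_docs = {'1': ['chapter text']}
tst_docs = {'2': ['chapter text']}

for name, docs in [('trn', trn_docs), ('dev', dev_docs), ('tst', tst_docs)]:
    total = sum(len(chapters) for chapters in docs.values())
    print(f'{name}: {total}')  # trn: 3, dev: 1, tst: 1
```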
Q8: What potential problems might arise from the above data splitting approach, and what alternative method could mitigate these issues?
Vectorization
To vectorize the documents, let us gather the vocabulary and their document frequencies from the training set:
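One way to sketch this step, assuming whitespace tokenization and the function name collect_vocabulary for illustration:

```python
from collections import Counter

def collect_vocabulary(trn_docs):
    """Return (vocab, dfs): vocab maps each training-set word to a unique
    index; dfs maps each word to its document frequency, i.e., the number
    of training documents containing it."""
    dfs = Counter()
    for chapters in trn_docs.values():
        for document in chapters:
            dfs.update(set(document.split()))  # count each word once per document
    vocab = {word: i for i, word in enumerate(sorted(dfs))}
    return vocab, dfs
```

Converting each document to a set before updating the counter ensures a word is counted at most once per document, which is what "document frequency" requires.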
Let us create a function that takes the vocabulary, document frequencies, document length, and a document set, and returns a list of tuples, where each tuple consists of a book ID and a sparse vector representing a document in the corresponding book:
We then vectorize all documents in each set:
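The two steps above might be sketched as follows; the TF-IDF weighting, the dict-based sparse representation, and the parameter name D (total number of training documents) are assumptions for this example, not necessarily the course's exact scheme:

```python
import math
from collections import Counter

def vectorize(vocab, dfs, D, docs):
    """Represent every document in docs as a sparse TF-IDF vector
    ({term index: weight}); D is the total number of training documents.
    Returns a list of (book ID, sparse vector) tuples."""
    vectors = []
    for book_id, chapters in docs.items():
        for document in chapters:
            tfs = Counter(document.split())
            sparse = {vocab[w]: tf * math.log(D / dfs[w])
                      for w, tf in tfs.items() if w in vocab}
            vectors.append((book_id, sparse))
    return vectors

# Hypothetical usage: vectorize every set with the same vocabulary and
# document frequencies collected from the training set only.
# trn_vecs = vectorize(vocab, dfs, D, trn_docs)
# dev_vecs = vectorize(vocab, dfs, D, dev_docs)
# tst_vecs = vectorize(vocab, dfs, D, tst_docs)
```

Words absent from the training vocabulary are simply skipped, so development and evaluation documents never introduce new dimensions.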
Q9: Why do we use only the training set to collect the vocabulary?
Classification
Let us develop a classification model using the K-nearest neighbors algorithm [1] that takes the training vector set, a document, and k, and returns the predicted book ID of the document and its similarity score:
First, the function measures the similarity between the input document and every document in the training set, saving each score together with the book ID of that training document.
It then returns the most common book ID among the top-k training documents that are most similar to the input document.
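The steps above can be sketched as follows; cosine similarity over the sparse vectors is assumed here, along with the function names cosine and knn:

```python
import math
from collections import Counter

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors ({index: weight})."""
    dot = sum(w * v2.get(i, 0.0) for i, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(trn_vecs, doc_vec, k=1):
    """Predict the book ID of doc_vec as the most common label among the
    k training documents most similar to it; also return the top similarity."""
    sims = sorted(((cosine(doc_vec, vec), book_id) for book_id, vec in trn_vecs),
                  reverse=True)
    top = sims[:k]
    book_id = Counter(b for _, b in top).most_common(1)[0][0]
    return book_id, top[0][0]
```

Note that this is a lazy learner: there is no training step, and every prediction scans the entire training set.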
Finally, we test our classification model on the development set:
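A simple accuracy computation for this test might look as follows; the helper takes any classifier mapping a sparse vector to a (label, score) pair, such as the kNN model described above with the training vectors and k fixed:

```python
def accuracy(classify, vec_set):
    """Fraction of documents in vec_set whose predicted book ID matches
    the gold label; classify maps a sparse vector to (label, score)."""
    correct = sum(1 for gold, vec in vec_set if classify(vec)[0] == gold)
    return correct / len(vec_set)

# Toy usage with a stub classifier that always predicts book '1':
dev_vecs = [('1', {0: 1.0}), ('2', {1: 1.0})]
print(accuracy(lambda v: ('1', 1.0), dev_vecs))  # 0.5
```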
Q10: What are the primary weaknesses and limitations of the K-Nearest Neighbors (KNN) classification model when applied to document classification?
References
Source: document_classification.py
[1] K-nearest neighbors, Wikipedia.