Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.
LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:
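A minimal sketch of such a function follows; the name read_corpus and the whitespace tokenization are assumptions, since the actual implementation in latent_semantic_analysis.py may differ:

```python
def read_corpus(filename: str):
    """Return (documents, vocab): documents is a list of token lists, and
    vocab maps each unique term to a unique index."""
    with open(filename) as fin:
        # Treat each non-empty line as one document, tokenized on whitespace.
        documents = [line.split() for line in fin if line.strip()]
    terms = sorted({term for document in documents for term in document})
    vocab = {term: i for i, term in enumerate(terms)}
    return documents, vocab
```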
We then define a function that takes documents and vocab, and returns the document-term matrix X such that X[i][j] indicates the frequency of the j'th term in vocab within the i'th document:
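The following sketch (an assumption, not necessarily the original implementation) builds each document's frequency vector as a plain Python list and converts the nested lists to a NumPy array at the end:

```python
import numpy as np

def document_term_matrix(documents: list, vocab: dict) -> np.ndarray:
    def doc_vector(document: list) -> list:
        # Build a frequency vector over the whole vocabulary in pure Python.
        v = [0] * len(vocab)
        for term in document:
            v[vocab[term]] += 1
        return v
    # Convert the nested Python lists into a 2D NumPy array.
    return np.array([doc_vector(document) for document in documents])
```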
Let us create a document-term matrix from the corpus, dat/chronicles_of_narnia.txt:
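Continuing from the sketches above, the matrix can be created and timed along the following lines:

```python
import time

documents, vocab = read_corpus('dat/chronicles_of_narnia.txt')

start = time.time()
X = document_term_matrix(documents, vocab)
print(f'{time.time() - start:.2f} seconds, matrix shape: {X.shape}')
```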
With the current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:
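Below is a sketch of such a function, document_term_matrix_np(), which allocates the entire matrix up front and increments the counts in place:

```python
import numpy as np

def document_term_matrix_np(documents: list, vocab: dict) -> np.ndarray:
    # Allocate the full document-term matrix once, then update counts in place.
    X = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, document in enumerate(documents):
        for term in document:
            X[i, vocab[term]] += 1
    return X
```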
Using this updated function, we see a noticeable improvement in speed: it now takes only about 0.5 seconds to create the document-term matrix:
Why is the performance of document_term_matrix() significantly slower than that of document_term_matrix_np()?
LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix X into three matrices, U, Σ, and V^T, where U and V^T are orthogonal matrices and Σ is a diagonal matrix containing singular values, such that X = U Σ V^T.

An orthogonal matrix is a square matrix whose rows and columns are orthonormal, such that A A^T = A^T A = I, where I is the identity matrix.

Singular values are non-negative values listed in decreasing order that represent the importance of each topic.

The i'th row of U is considered the document vector of the i'th document in the corpus, while the transpose of the j'th column of V^T is considered the term vector of the j'th term in the vocabulary.
For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:
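The eight documents of the original corpus are not reproduced here; the hypothetical corpus below mixes "animal", "sentiment", and "color" terms so that the decomposition exhibits a similar structure, though its exact output will differ from the values discussed below. It reuses document_term_matrix_np() from the sketch above:

```python
import numpy as np

corpus = ['dog cat lion', 'cat lion tiger', 'happy glad happy', 'sad gloomy sad',
          'red blue green', 'green red blue', 'dog happy red', 'cat sad blue']
documents = [document.split() for document in corpus]
vocab = {term: i for i, term in enumerate(sorted({t for d in documents for t in d}))}

X = document_term_matrix_np(documents, vocab)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.round(2))
print(np.diag(S).round(2))  # the singular values shown as a diagonal matrix
print(Vt.round(2))
```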
This results in U, S, and Vt such that:

In U, each row represents a document and each column represents a topic.
In S, each diagonal cell represents the weight of the corresponding topic.
In Vt, each column represents a term and each row represents a topic.

The last two singular values in S, although displayed as zeros, are actually small non-negative values.

What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
The first four singular values in S appear to be sufficiently larger than the others; thus, let us reduce the dimensions to k = 4, such that U becomes n × k, S holds only the top k singular values, and Vt becomes k × m:
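Continuing from the sketch above, the truncation may look as follows:

```python
k = 4
# Keep the first k columns of U, the first k singular values, and the first k rows of Vt.
U, S, Vt = U[:, :k], S[:k], Vt[:k, :]
print(U.round(2))
print(np.diag(S).round(2))
print(Vt.round(2))
```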
From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA: higher singular values do not necessarily guarantee the discovery of more meaningful topics.

By discarding the first topic, you can observe document embeddings that are opposite to each other (e.g., documents 4 and 5). What characteristics make these documents opposite to each other?
Given the LSA results, an embedding of the i'th document can be obtained as U[i] · S:
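A sketch, continuing from above; representing the singular values as a diagonal matrix via numpy.diag() is an assumption:

```python
S = np.diag(S)  # k x k diagonal matrix of the top k singular values

def document_embedding(U: np.ndarray, S: np.ndarray, i: int) -> np.ndarray:
    # The i'th document embedding: the i'th row of U weighted by the topic weights in S.
    return U[i] @ S

print(document_embedding(U, S, 4).round(2))
```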
S is not transposed in the embedding code above. Should we use S.transpose() instead?
Finally, an embedding of the j'th term can be achieved as S · Vt[:, j]:
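Continuing the same sketch:

```python
def term_embedding(S: np.ndarray, Vt: np.ndarray, j: int) -> np.ndarray:
    # The j'th term embedding: the j'th column of Vt weighted by the topic weights in S.
    return S @ Vt[:, j]

print(term_embedding(S, Vt, vocab['red']).round(2))
```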
From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it may be a combination of multiple columns that represents a set of topics.
Source: latent_semantic_analysis.py

References:
[1] Latent Semantic Analysis, Wikipedia.
[2] Singular Value Decomposition, Wikipedia.