Latent Semantic Analysis

Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.

Document-Term Matrix

LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.
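
For example, assuming a toy corpus of just two documents, "love white cat" and "hate white dog", over the vocabulary {cat, dog, hate, love, white} (in that column order), the document-term matrix would be $X = \begin{bmatrix} 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \end{bmatrix}$, where the first row counts the terms of the first document and the second row those of the second.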

Let us define a function that reads a corpus, and returns a list $D$ of all documents in the corpus and a dictionary $T$ whose keys and values are terms and their unique indices, respectively:

from src.types import Document, Vocab

def retrieve(filename: str) -> tuple[list[Document], Vocab]:
    documents = [line.split() for line in open(filename)]
    t = {word for document in documents for word in document}
    terms = {term: j for j, term in enumerate(sorted(list(t)))}
    return documents, terms

We then define a function that takes $D$ and $T$, and returns the document-term matrix $X \in \mathbb{R}^{|D| \times |T|}$ such that $X_{i,j}$ indicates the frequency of the $j$'th term in $T$ within the $i$'th document:

import numpy as np

def document_term_matrix(documents: list[Document], terms: Vocab) -> np.ndarray:
    def doc_vector(document: list[str]) -> list[int]:
        v = [0] * len(terms)
        for term in document:
            v[terms[term]] += 1
        return v
    return np.array([doc_vector(document) for document in documents])

Let us create a document-term matrix from the corpus, dat/chronicles_of_narnia.txt:

import time

D, T = retrieve('dat/chronicles_of_narnia.txt')

st = time.time()
X = document_term_matrix(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))

With the current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first allocating a 2D matrix in NumPy and then updating its frequency values in place:

def document_term_matrix_np(documents: list[Document], terms: Vocab) -> np.ndarray:
    # Preallocate the matrix and increment term counts in place.
    X = np.zeros((len(documents), len(terms)), dtype=int)
    for i, document in enumerate(documents):
        for term in document:
            X[i, terms[term]] += 1
    return X

Using this updated function, the document-term matrix is created in about 0.5 seconds, a noticeable improvement in speed:

st = time.time()
X = document_term_matrix_np(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))

The $i$'th row of $X$ is considered the document vector of the $i$'th document in the corpus, while the transpose of the $j$'th column of $X$ is considered the term vector of the $j$'th term in the vocabulary.
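
For instance, these vectors can be sliced directly from the matrix (a minimal sketch assuming the X created above; the indices are arbitrary examples):

i, j = 0, 0            # arbitrary example indices
doc_vector = X[i]      # i'th row of X: the document vector of the i'th document
term_vector = X[:, j]  # j'th column of X: the term vector of the j'th term (already 1D, so no explicit transpose is needed)
print(doc_vector.shape, term_vector.shape)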

Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
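
One way to investigate is to time the two stages of document_term_matrix() separately, building the per-document Python lists and converting them into a NumPy array (a sketch assuming the D and T loaded above):

st = time.time()
rows = [[0] * len(T) for _ in D]   # one Python list of |T| zeros per document
for i, document in enumerate(D):
    for term in document:
        rows[i][T[term]] += 1
lt = time.time()                   # end of list building
X2 = np.array(rows)                # conversion of the nested lists into a 2D array
et = time.time()
print('list building = {:.2f} sec, np.array conversion = {:.2f} sec'.format(lt - st, et - lt))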

Dimensionality Reduction

LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $\Sigma$, and $V$, where $U \in \mathbb{R}^{|D| \times \sigma}$ and $V \in \mathbb{R}^{|T| \times \sigma}$ have orthonormal columns, $\Sigma \in \mathbb{R}^{\sigma \times \sigma}$ is a diagonal matrix containing singular values, and $\sigma = \min(|D|, |T|)$, such that $X = U \Sigma V^T$.

  • An orthogonal matrix is a square matrix whose rows and columns are orthonormal, such that $O O^T = O^T O = I$, where $I$ is the identity matrix. When $X$ is not square, $U$ and $V$ are not square either, but their columns remain orthonormal ($U^T U = V^T V = I$).

  • Singular values are non-negative values listed in decreasing order that represent the importance of each topic.

For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:

dat/latent_semantic_analysis.txt
love white cat
love black cat
hate white cat
hate black cat
love white dog
love black dog
hate white dog
hate black dog
D, T = retrieve('dat/latent_semantic_analysis.txt')
X = document_term_matrix_np(D, T)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)
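
As a quick sanity check (a sketch assuming the U, S, and Vt computed above), we can verify that the decomposition reconstructs $X$ and that the columns of $U$ and $V$ are orthonormal:

print(np.allclose(U @ S @ Vt, X))                   # X == U Sigma V^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # rows of V^T (columns of V) are orthonormal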

This decomposition results in $U \in \mathbb{R}^{8 \times 6}$, $\Sigma \in \mathbb{R}^{6 \times 6}$, and $V^T \in \mathbb{R}^{6 \times 6}$ such that:

  • In $U$, each row represents a document and each column represents a topic.

  • In $\Sigma$, each diagonal cell represents the weight of the corresponding topic.

  • In $V^T$, each column represents a term and each row represents a topic.

def print_np(matrix: np.ndarray):
    print(matrix.shape)
    for row in matrix:
        print(' '.join(['{:8.4f}'.format(c) for c in row]))
        
print_np(U)
print_np(S)
print_np(Vt)

The last two singular values in $\Sigma$ are actually non-negative values, $3.72393692 \times 10^{-16}$ and $2.29674157 \times 10^{-16}$, respectively; they are effectively zero up to floating-point precision.

The first four singular values in $\Sigma$ appear to be sufficiently larger than the others; thus, let us reduce their dimensions to $k = 4$ such that $U' \in \mathbb{R}^{8 \times 4}$, $\Sigma' \in \mathbb{R}^{4 \times 4}$, and $V'^T \in \mathbb{R}^{4 \times 6}$:

k = 4
U = U[:, :k]
S = S[:k, :k]
Vt = Vt[:k, :]
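
Since the discarded singular values are nearly zero, the truncated matrices should still reconstruct $X$ almost exactly (a sketch assuming the truncated U, S, and Vt from above):

X_k = U @ S @ Vt   # rank-k approximation of X
print('reconstruction error = {:.6f}'.format(np.linalg.norm(X - X_k)))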

What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?

Document Embedding

Given the LSA results, an embedding of the $i$'th document can be obtained as $d_i = U_{i,*} \cdot \Sigma \in \mathbb{R}^{1 \times 4}$:

for i, document in enumerate(D):
    t = np.dot(U[i], S)
    print('{}: {}'.format(' '.join(document), ['{:5.2f}'.format(f) for f in t]))

From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.

By discarding the first topic, you can observe pairs of document embeddings that are opposite to each other (e.g., documents 4 and 5). What characteristics of these documents make them opposites?
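
One way to explore this is to compare the document embeddings pairwise with cosine similarity; pairs with opposite embeddings show strongly negative similarities. A minimal sketch, assuming the D, U, and S from above and discarding the first topic as suggested:

E = np.dot(U, S)[:, 1:]                            # document embeddings without the first topic
En = E / np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize each embedding
print_np(En @ En.T)                                # pairwise cosine similarities between documents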

Word Embedding

Finally, an embedding of the $j$'th term can be obtained as $t_j = \Sigma \cdot V^T_{*,j} = V_{j,*} \cdot \Sigma^T \in \mathbb{R}^{1 \times 4}$:

V = Vt.transpose()
for term, j in sorted(T.items(), key=lambda x: x[1]):
    t = np.dot(V[j], S)
    print('{:>5}: {}'.format(term, ['{:5.2f}'.format(f) for f in t]))

From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.
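
To probe how terms relate in this latent space, we can also compare the term embeddings pairwise with cosine similarity (a sketch assuming the T, V, and S from above):

terms = [term for term, _ in sorted(T.items(), key=lambda x: x[1])]
E = np.dot(V, S)                                   # one embedding per term (row-wise)
En = E / np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize each embedding
print(' '.join('{:>8}'.format(t) for t in terms))
print_np(En @ En.T)                                # pairwise cosine similarities between terms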

$\Sigma$ is not transposed in the line t = np.dot(V[j], S) of the code above. Should we use S.transpose() instead?

References

Source: latent_semantic_analysis.py
