arrow-left

All pages
gitbookPowered by GitBook
1 of 1

Loading...

Latent Semantic Analysis

Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words that are used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.

hashtag
Document-Term Matrix

LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.

Let us define a function that reads a corpus, and returns a list of all documents in the corpus and a dictionary whose keys and values are terms and their unique indices, respectively:

We then define a function that takes and , and returns the document-term matrix such that indicates the frequency of the 'th term in within the 'th document:

  • L1: The package.

  • L3:

Let us create a document-term matrix from the corpus, :

  • L1: .

  • L3:

With this current implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values:

Using this updated function, we see a noticeable enhancement in speed, about 0.5 seconds, to create the document-term matrix:

The 'th row of is considered the document vector of the 'th document in the corpus, while the transpose of 'th column of is considered the term vector of the 'th term in the vocabulary.

circle-exclamation

Q3: Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?

hashtag
Dimensionality Reduction

LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix into three matrices, , , and , where are orthogonal matrices and is a diagonal matrix containing singular values, such that .

  • An is a square matrix whose rows and columns are orthogonal such that , where is the identity matrix.

  • are non-negative values listed in decreasing order that represent the importance of each topic.

For simplicity, let us create a document-term matrix from a small corpus consisting of only eight documents and apply SVD to it:

chevron-righthashtag
  • L3:

  • L4:

This results , , and such that:

  • In , each row represents a document and each column represent a topic.

  • In , each diagonal cell represents the weight of the corresponding topic.

  • In , each column represents a term and each row represents a topic.

circle-info

The last two singular values in are actually non-negative values, and , respectively.

The first four singular values in appear to be sufficiently larger than the others; thus, let us reduce their dimensions to such that , , and :

circle-exclamation

Q4: What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?

hashtag
Document Embedding

Given the LSA results, an embedding of the 'th document can be obtained as :

From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.

circle-exclamation

Q5: By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?

hashtag
Word Embedding

Finally, an embedding of the 'th term can be achieved as :

From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.

circle-exclamation

Q6: is not transposed in L3 of the above code. Should we use S.transpose() instead?

hashtag
References

Source:

  1. , Wikipedia.

  2. , Wikipedia.

DDD
TTT
DDD
TTT
X∈R∣D∣×∣T∣X \in \mathbb{R}^{|D| \times |T|}X∈R∣D∣×∣T∣
Xi,jX_{i,j}Xi,j​
jjj
TTT
iii
iii
XXX
iii
jjj
XXX
jjj
XXX
UUU
Σ\SigmaΣ
VVV
U,V∈R∣D∣×σU, V \in \mathbb{R}^{|D| \times \sigma}U,V∈R∣D∣×σ
Σ∈Rσ×σ\Sigma \in \mathbb{R}^{\sigma \times \sigma}Σ∈Rσ×σ
X=UΣVTX = U\Sigma V^TX=UΣVT
OOT=OTO=IO O^T = O^T O = IOOT=OTO=I
III
U∈R8×6U \in \mathbb{R}^{8 \times 6}U∈R8×6
Σ∈R6×6\Sigma \in \mathbb{R}^{6 \times 6}Σ∈R6×6
VT∈R6×6V^T \in \mathbb{R}^{6 \times 6}VT∈R6×6
UUU
Σ\SigmaΣ
VTV^TVT
Σ\SigmaΣ
3.72393692×10−163.72393692 \times 10^{−16}3.72393692×10−16
2.29674157×10−162.29674157 \times 10^{−16}2.29674157×10−16
Σ\SigmaΣ
k=4k = 4k=4
U′∈R8×4U' \in \mathbb{R}^{8 \times 4}U′∈R8×4
Σ′∈R4×4\Sigma' \in \mathbb{R}^{4 \times 4}Σ′∈R4×4
V′T∈R4×6V'^{T} \in \mathbb{R}^{4 \times 6}V′T∈R4×6
iii
di=Ui,∗⋅Σ∈R1×4d_i = U_{i,*} \cdot \Sigma \in \mathbb{R}^{1 \times 4}di​=Ui,∗​⋅Σ∈R1×4
jjj
tj=Σ⋅V∗,jT=Vj,∗⋅ΣT∈R1×4t_j = \Sigma \cdot V^T_{*,j} = V_{j,*} \cdot \Sigma^T \in \mathbb{R}^{1 \times 4}tj​=Σ⋅V∗,jT​=Vj,∗​⋅ΣT∈R1×4
Σ\SigmaΣ
numpyarrow-up-right
numpy.arrayarrow-up-right
dat/chronicles_of_narnia.txtarrow-up-right
Basic date and time typesarrow-up-right
time.time()arrow-up-right
orthogonal matrixarrow-up-right
Singular valuesarrow-up-right
dat/latent_semantic_analysis.txtarrow-up-right
numpy.linalg.svd()arrow-up-right
numpy.diag()arrow-up-right
numpy.dot()arrow-up-right
latent_semantic_analysis.pyarrow-up-right
Latent Semantic Analysisarrow-up-right
Singular Value Decompositionarrow-up-right
from src.types import Document, Vocab

def retrieve(filename: str) -> tuple[list[Document], Vocab]:
    documents = [line.split() for line in open(filename)]
    t = {word for document in documents for word in document}
    terms = {term: j for j, term in enumerate(sorted(list(t)))}
    return documents, terms
import numpy as np

def document_term_matrix(documents: list[Document], terms: Vocab) -> np.array:
    def doc_vector(document: list[str]) -> list[int]:
        v = [0] * len(terms)
        for term in document:
            v[terms[term]] += 1
        return v
    return np.array([doc_vector(document) for document in documents])
import time

D, T = retrieve('dat/chronicles_of_narnia.txt')

st = time.time()
X = document_term_matrix(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) = 17.87
def document_term_matrix_np(documents: list[Document], terms: Vocab) -> np.array:
    X = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, document in enumerate(documents):
        for term in document:
            X[i, vocab[term]] += 1
    return X
st = time.time()
X = document_term_matrix_np(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) =  0.55
love white cat
love black cat
hate white cat
hate black cat
love white dog
love black dog
hate white dog
hate black dog
D, T = retrieve('dat/latent_semantic_analysis.txt')
X = document_term_matrix_np(D, T)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)
def print_np(matrix: np.array):
    print(matrix.shape)
    for row in matrix:
        print(' '.join(['{:8.4f}'.format(c) for c in row]))
        
print_np(U)
print_np(S)
print_np(Vt)
(8, 6)
 -0.3536  -0.4969  -0.0552   0.3536  -0.6648  -0.0396
 -0.3536  -0.4969  -0.0552  -0.3536   0.3750   0.5911
 -0.3536  -0.0552   0.4969   0.3536   0.4847  -0.3598
 -0.3536  -0.0552   0.4969  -0.3536  -0.1949  -0.1917
 -0.3536   0.0552  -0.4969   0.3536   0.3187  -0.1450
 -0.3536   0.0552  -0.4969  -0.3536  -0.0289  -0.4065
 -0.3536   0.4969   0.0552   0.3536  -0.1386   0.5444
 -0.3536   0.4969   0.0552  -0.3536  -0.1512   0.0071
(6, 6)
  3.4641   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   2.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   2.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   2.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
(6, 6)
 -0.4082  -0.4082  -0.4082  -0.4082  -0.4082  -0.4082
  0.0000  -0.5522   0.5522   0.4417  -0.4417   0.0000
  0.0000   0.4417  -0.4417   0.5522  -0.5522   0.0000
 -0.7071  -0.0000  -0.0000  -0.0000  -0.0000   0.7071
 -0.2614  -0.3151  -0.3151   0.5765   0.5765  -0.2614
 -0.5148   0.4838   0.4838   0.0310   0.0310  -0.5148
k = 4
U = U[:, :k]
S = S[:k, :k]
Vt = Vt[:k, :]
for i, document in enumerate(D):
    t = np.dot(U[i], S)
    print('{}: {}'.format(' '.join(document), ['{:5.2f}'.format(f) for f in t]))
love white cat: [-1.22, -0.99, -0.11,  0.71]
love black cat: [-1.22, -0.99, -0.11, -0.71]
hate white cat: [-1.22, -0.11,  0.99,  0.71]
hate black cat: [-1.22, -0.11,  0.99, -0.71]
love white dog: [-1.22,  0.11, -0.99,  0.71]
love black dog: [-1.22,  0.11, -0.99, -0.71]
hate white dog: [-1.22,  0.99,  0.11,  0.71]
hate black dog: [-1.22,  0.99,  0.11, -0.71]
V = Vt.transpose()
for term, j in sorted(T.items(), key=lambda x: x[1]):
    t = np.dot(V[j], S)
    print('{:>5}: {}'.format(term, ['{:5.2f}'.format(f) for f in t]))
black: [-1.41,  0.00,  0.00, -1.41]
  cat: [-1.41, -1.10,  0.88, -0.00]
  dog: [-1.41,  1.10, -0.88, -0.00]
 hate: [-1.41,  0.88,  1.10, -0.00]
 love: [-1.41, -0.88, -1.10, -0.00]
white: [-1.41,  0.00,  0.00,  1.41]