Latent Semantic Analysis

Latent Semantic Analysis (LSA) [1] analyzes relationships between a set of documents and the terms they contain. It is based on the idea that words used in similar contexts tend to have similar meanings, which is in line with the distributional hypothesis.

Document-Term Matrix

LSA starts with a matrix representation of the documents in a corpus and the terms (words) they contain. This matrix, known as the document-term matrix, has documents as rows and terms as columns, with each cell representing the frequency of a term in a document.

Let us define a function that reads a corpus, and returns a list $D$ of all documents in the corpus and a dictionary $T$ whose keys and values are terms and their unique indices, respectively:

from src.types import Document, Vocab

def retrieve(filename: str) -> tuple[list[Document], Vocab]:
    documents = [line.split() for line in open(filename)]
    t = {word for document in documents for word in document}
    terms = {term: j for j, term in enumerate(sorted(list(t)))}
    return documents, terms

We then define a function that takes $D$ and $T$, and returns the document-term matrix $X \in \mathbb{R}^{|D| \times |T|}$ such that $X_{i,j}$ indicates the frequency of the $j$'th term in $T$ within the $i$'th document:

import numpy as np

def document_term_matrix(documents: list[Document], terms: Vocab) -> np.array:
    def doc_vector(document: list[str]) -> list[int]:
        v = [0] * len(terms)
        for term in document:
            v[terms[term]] += 1
        return v
    return np.array([doc_vector(document) for document in documents])

import time

D, T = retrieve('dat/chronicles_of_narnia.txt')

st = time.time()
X = document_term_matrix(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) = 17.87

With this implementation, it takes over 17 seconds to create the document-term matrix, which is unacceptably slow given the small size of the corpus. Let us improve this function by first creating a 2D matrix in NumPy and then updating the frequency values in place:

def document_term_matrix_np(documents: list[Document], terms: Vocab) -> np.array:
    # Preallocate the matrix and update the counts in place.
    X = np.zeros((len(documents), len(terms)), dtype=int)
    for i, document in enumerate(documents):
        for term in document:
            X[i, terms[term]] += 1
    return X

Using this updated function, creating the document-term matrix takes only about 0.5 seconds, a noticeable improvement in speed:

st = time.time()
X = document_term_matrix_np(D, T)
et = time.time()
print("|D| = {}, |T| = {}, Process Time (sec) = {:.2f}".format(len(X), len(X[0]), et - st))
|D| = 22603, |T| = 12361, Process Time (sec) =  0.55

The $i$'th row of $X$ is considered the document vector of the $i$'th document in the corpus, while the transpose of the $j$'th column of $X$ is considered the term vector of the $j$'th term in the vocabulary.
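
For instance, both types of vectors can be sliced directly from $X$; the snippet below is a small illustration (not part of the original code), where the indices are arbitrary:

i, j = 0, 0
doc_vector = X[i]       # document vector of the i'th document, shape (|T|,)
term_vector = X[:, j]   # term vector of the j'th term, shape (|D|,)
print(doc_vector.shape, term_vector.shape)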

Q3: Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
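
As an aside, most cells in a document-term matrix are zero, so for larger corpora a sparse representation can save both time and memory. The following is a minimal sketch under that assumption; document_term_matrix_sparse() is a hypothetical helper that requires SciPy, which is not used in the original code:

from collections import Counter
from scipy.sparse import csr_matrix

def document_term_matrix_sparse(documents: list[Document], terms: Vocab) -> csr_matrix:
    rows, cols, data = [], [], []
    for i, document in enumerate(documents):
        # Count each term once per document instead of incrementing cells one by one.
        for term, count in Counter(document).items():
            rows.append(i)
            cols.append(terms[term])
            data.append(count)
    # Only the nonzero counts are stored.
    return csr_matrix((data, (rows, cols)), shape=(len(documents), len(terms)), dtype=int)

Note that a sparse matrix cannot be passed to numpy.linalg.svd() used below; it would need scipy.sparse.linalg.svds() or densification, so this option mainly matters when memory becomes a concern.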

Dimensionality Reduction

LSA applies Singular Value Decomposition (SVD) [2] to decompose the document-term matrix $X$ into three matrices, $U$, $\Sigma$, and $V$, where $U \in \mathbb{R}^{|D| \times \sigma}$ and $V \in \mathbb{R}^{|T| \times \sigma}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{\sigma \times \sigma}$ is a diagonal matrix containing singular values, such that $X = U \Sigma V^T$.

An orthogonal matrix is a square matrix whose rows and columns are orthogonal such that $O O^T = O^T O = I$, where $I$ is the identity matrix. Singular values are non-negative values listed in decreasing order that represent the importance of each topic.

For simplicity, let us create a document-term matrix from a small corpus, dat/latent_semantic_analysis.txt, consisting of only the following eight documents, and apply SVD to it:

love white cat
love black cat
hate white cat
hate black cat
love white dog
love black dog
hate white dog
hate black dog

D, T = retrieve('dat/latent_semantic_analysis.txt')
X = document_term_matrix_np(D, T)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)

This results in $U \in \mathbb{R}^{8 \times 6}$, $\Sigma \in \mathbb{R}^{6 \times 6}$, and $V^T \in \mathbb{R}^{6 \times 6}$ such that:

  • In $U$, each row represents a document and each column represents a topic.

  • In $\Sigma$, each diagonal cell represents the weight of the corresponding topic.

  • In $V^T$, each column represents a term and each row represents a topic.

def print_np(matrix: np.array):
    print(matrix.shape)
    for row in matrix:
        print(' '.join(['{:8.4f}'.format(c) for c in row]))
        
print_np(U)
print_np(S)
print_np(Vt)
(8, 6)
 -0.3536  -0.4969  -0.0552   0.3536  -0.6648  -0.0396
 -0.3536  -0.4969  -0.0552  -0.3536   0.3750   0.5911
 -0.3536  -0.0552   0.4969   0.3536   0.4847  -0.3598
 -0.3536  -0.0552   0.4969  -0.3536  -0.1949  -0.1917
 -0.3536   0.0552  -0.4969   0.3536   0.3187  -0.1450
 -0.3536   0.0552  -0.4969  -0.3536  -0.0289  -0.4065
 -0.3536   0.4969   0.0552   0.3536  -0.1386   0.5444
 -0.3536   0.4969   0.0552  -0.3536  -0.1512   0.0071
(6, 6)
  3.4641   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   2.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   2.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   2.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
(6, 6)
 -0.4082  -0.4082  -0.4082  -0.4082  -0.4082  -0.4082
  0.0000  -0.5522   0.5522   0.4417  -0.4417   0.0000
  0.0000   0.4417  -0.4417   0.5522  -0.5522   0.0000
 -0.7071  -0.0000  -0.0000  -0.0000  -0.0000   0.7071
 -0.2614  -0.3151  -0.3151   0.5765   0.5765  -0.2614
 -0.5148   0.4838   0.4838   0.0310   0.0310  -0.5148

The last two singular values in $\Sigma$ are not exactly zero; they are tiny non-negative values, $3.72393692 \times 10^{-16}$ and $2.29674157 \times 10^{-16}$, respectively, that only print as 0.0000 because of the 4-decimal formatting.
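
This can be confirmed by printing the diagonal of $\Sigma$ directly (a quick check, not part of the original code):

print(np.diag(S))  # the last two values are on the order of 1e-16, not exactly zero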

The first four singular values in $\Sigma$ appear to be sufficiently larger than the others; thus, let us reduce their dimensions to $k = 4$ such that $U' \in \mathbb{R}^{8 \times 4}$, $\Sigma' \in \mathbb{R}^{4 \times 4}$, and $V'^T \in \mathbb{R}^{4 \times 6}$:

k = 4
U = U[:, :k]
S = S[:k, :k]
Vt = Vt[:k, :]
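
Since the two discarded singular values are numerically zero, the truncated matrices should still reconstruct $X$ almost exactly; here is a quick sanity check (not part of the original code):

# Rank-4 reconstruction of the original document-term matrix.
X_approx = np.dot(np.dot(U, S), Vt)
print(np.allclose(X, X_approx))  # expected: True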

Q4: What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?

Document Embedding

Given the LSA results, an embedding of the $i$'th document can be obtained as $d_i = U_{i,*} \cdot \Sigma \in \mathbb{R}^{1 \times 4}$:

for i, document in enumerate(D):
    t = np.dot(U[i], S)
    print('{}: {}'.format(' '.join(document), ['{:5.2f}'.format(f) for f in t]))
love white cat: [-1.22, -0.99, -0.11,  0.71]
love black cat: [-1.22, -0.99, -0.11, -0.71]
hate white cat: [-1.22, -0.11,  0.99,  0.71]
hate black cat: [-1.22, -0.11,  0.99, -0.71]
love white dog: [-1.22,  0.11, -0.99,  0.71]
love black dog: [-1.22,  0.11, -0.99, -0.71]
hate white dog: [-1.22,  0.99,  0.11,  0.71]
hate black dog: [-1.22,  0.99,  0.11, -0.71]

From the output, although interpreting the meaning of the first topic (column) is challenging, we can infer that the second, third, and fourth topics represent "animal", "sentiment", and "color", respectively. This reveals a limitation of LSA, as higher singular values do not necessarily guarantee the discovery of more meaningful topics.

Q5: After discarding the first topic, some document embeddings become the opposites of each other (e.g., documents 4 and 5). What characteristics make these documents opposite to each other?
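
Although not part of the original example, these document embeddings can be compared with cosine similarity, as in the earlier Document Similarity section; the sketch below stacks all embeddings row-wise:

def cosine(u: np.array, v: np.array) -> float:
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

E = np.dot(U, S)           # one embedding per document
print(cosine(E[0], E[1]))  # "love white cat" vs. "love black cat" (high similarity)
print(cosine(E[0], E[7]))  # "love white cat" vs. "hate black dog" (much lower)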

Word Embedding

Finally, an embedding of the $j$'th term can be obtained as $t_j = \Sigma \cdot V^T_{*,j} = V_{j,*} \cdot \Sigma^T \in \mathbb{R}^{1 \times 4}$:

V = Vt.transpose()
for term, j in sorted(T.items(), key=lambda x: x[1]):
    t = np.dot(V[j], S)
    print('{:>5}: {}'.format(term, ['{:5.2f}'.format(f) for f in t]))
black: [-1.41,  0.00,  0.00, -1.41]
  cat: [-1.41, -1.10,  0.88, -0.00]
  dog: [-1.41,  1.10, -0.88, -0.00]
 hate: [-1.41,  0.88,  1.10, -0.00]
 love: [-1.41, -0.88, -1.10, -0.00]
white: [-1.41,  0.00,  0.00,  1.41]

From the output, we can infer that the fourth topic still represents "color", whereas the meanings of "animal" and "sentiment" are distributed across the second and third topics. This suggests that each column does not necessarily represent a unique topic; rather, it is a combination across multiple columns that may represent a set of topics.

Q6: $\Sigma$ is not transposed in the third line of the above code, np.dot(V[j], S). Should we use S.transpose() instead?
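
One way to probe Q6 empirically (a sketch, not in the original code) is to compare both computations directly:

print(np.allclose(S, S.transpose()))                        # is Σ symmetric?
print(np.allclose(np.dot(V, S), np.dot(V, S.transpose())))  # do both computations agree?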

Source: latent_semantic_analysis.py

References

[1] Latent Semantic Analysis, Wikipedia.

[2] Singular Value Decomposition, Wikipedia.