Latent Semantic Analysis
Document-Term Matrix
from src.types import Document, Vocab

def retrieve(filename: str) -> tuple[list[Document], Vocab]:
    # Each line of the file is one document, tokenized on whitespace.
    with open(filename) as fin:
        documents = [line.split() for line in fin]
    # Collect the vocabulary and map each term to a column index (sorted order).
    t = {word for document in documents for word in document}
    terms = {term: j for j, term in enumerate(sorted(t))}
    return documents, terms

import numpy as np
def document_term_matrix(documents: list[Document], terms: Vocab) -> np.ndarray:
    def doc_vector(document: list[str]) -> list[int]:
        # Count how often each term occurs in the document.
        v = [0] * len(terms)
        for term in document:
            v[terms[term]] += 1
        return v
    # One row per document, one column per term.
    return np.array([doc_vector(document) for document in documents])

Dimensionality Reduction
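In LSA, the document-term matrix is commonly reduced with a truncated singular value decomposition (SVD), which keeps only the k largest singular values. The following is a minimal sketch of that idea rather than a definitive implementation: the helper name `truncated_svd` and the toy matrix `X` are assumptions made for illustration, and only `np.linalg.svd` comes from NumPy itself.

```python
import numpy as np

def truncated_svd(X: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Hypothetical helper: return the rank-k factors (U_k, s_k, Vt_k) of X."""
    # full_matrices=False returns the compact SVD; singular values come sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy document-term matrix (assumed data): 4 documents (rows) x 5 terms (columns).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Rank-2 approximation of X built from the retained factors.
X_k = U_k @ np.diag(s_k) @ Vt_k
```

Here `U_k` has one row per document, `Vt_k` has one column per term, and `X_k` has the same shape as `X` but rank 2.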
Document Embedding
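One common way to obtain document embeddings in LSA is to take the rows of U_k scaled by the top-k singular values, so that each document becomes a k-dimensional vector. The sketch below assumes this convention; `document_embeddings`, `cosine`, and the toy matrix `X` are hypothetical names and data used only for illustration.

```python
import numpy as np

def document_embeddings(X: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical helper: one k-dimensional vector per document (row of X)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    # Scale each of the first k left singular vectors by its singular value.
    return U[:, :k] * s[:k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document-term matrix (assumed data): documents 0-1 and 2-3 share no terms.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

D = document_embeddings(X, k=2)
same_topic = cosine(D[0], D[1])   # documents that use the same terms
cross_topic = cosine(D[0], D[2])  # documents with disjoint terms
```

In this toy example, documents that share terms end up closer in the reduced space than documents with disjoint vocabularies.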
Word Embedding
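Term (word) embeddings can be read off the same decomposition from the other side: a common convention is to take the rows of V_k scaled by the top-k singular values. The sketch below assumes that convention; `term_embeddings`, `cosine`, the vocabulary `terms`, and the matrix `X` are hypothetical and chosen only to mirror the term-to-index mapping used earlier.

```python
import numpy as np

def term_embeddings(X: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical helper: one k-dimensional vector per term (column of X)."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Transpose so rows correspond to terms, then scale by the singular values.
    return Vt[:k, :].T * s[:k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed vocabulary (term -> column index) and toy document-term matrix.
terms = {"bank": 0, "loan": 1, "river": 2, "stream": 3, "water": 4}
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

T = term_embeddings(X, k=2)
co_occurring = cosine(T[terms["bank"]], T[terms["loan"]])  # appear in the same documents
unrelated = cosine(T[terms["bank"]], T[terms["river"]])    # never appear together
```

Terms that occur in the same documents receive similar vectors, which is the sense in which LSA treats co-occurrence as evidence of relatedness.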
References