Document Similarity


Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the corpus:

from src.bag_of_words_model import vocabulary
from src.term_weighing import read_corpus, document_frequencies, tf_idf

if __name__ == '__main__':
    corpus = read_corpus('dat/chronicles_of_narnia.txt')
    vocab = vocabulary(corpus)
    dfs = document_frequencies(vocab, corpus)
    D = len(corpus)

    documents = [
        'I like this movie very much'.split(),
        'I hate this movie very much'.split(),
        'I love this movie so much'.split()
    ]

    vs = [tf_idf(vocab, dfs, D, document) for document in documents]
    for v in vs: print(v)
{980: 0.31, 7363: 0.52, 7920: 0.70, 11168: 0.51, 11833: 0.51}
{980: 0.31, 6423: 1.24, 7920: 0.70, 10325: 0.53, 11168: 0.51}
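
Each key in these sparse vectors is a vocabulary index rather than a term. As a quick sanity check, the indices can be mapped back to terms; this is a minimal sketch, assuming vocabulary() returns a term-to-index dictionary as in the previous sections:

index_to_term = {i: t for t, i in vocab.items()}   # assumed: vocab maps each term to its index
for v in vs:
    print({index_to_term[k]: round(s, 2) for k, s in v.items()})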

Once the documents are vectorized, they can be compared within their respective vector space. Two common metrics for comparing document vectors are the Euclidean distance and the Cosine similarity.

Euclidean Similarity

Euclidean distance measures the straight-line distance between two vectors in Euclidean space, representing the magnitude of the difference between them.

Let $V_i = [v_{i1}, \dots, v_{in}]$ and $V_j = [v_{j1}, \dots, v_{jn}]$ be two vectors representing documents $D_i$ and $D_j$. The Euclidean distance between the two vectors can be measured as follows:

$$\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^n (v_{ik} - v_{jk})^2}$$

Let us define a function that takes two sparse vectors and returns the Euclidean distance between them:

import math
from src.bag_of_words_model import SparseVector

def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
    # squared differences over the keys in v1 (0 for keys missing from v2)
    d = sum((v - v2.get(k, 0)) ** 2 for k, v in v1.items())
    # squared values for the keys that appear only in v2
    d += sum(v ** 2 for k, v in v2.items() if k not in v1)
    return math.sqrt(d)
  • `** k` denotes exponentiation (raising to the power of k).

We then measure the Euclidean distance between the document vectors above:

print(euclidean_distance(vs[0], vs[0]))
print(euclidean_distance(vs[0], vs[1]))
print(euclidean_distance(vs[0], vs[2]))
0.0
1.347450458032576
1.3756015678855296

The Euclidean distance between two identical vectors is 0 (the first output). Interestingly, the distance between $v_0$ and $v_1$ is shorter than the distance between $v_0$ and $v_2$, implying that $v_1$ is more similar to $v_0$ than $v_2$, which contradicts our intuition.

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors in an inner product space: a value of 1 indicates that the vectors are identical (i.e., pointing in the same direction), a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that they are orthogonal (i.e., perpendicular to each other).

The cosine similarity between two vectors can be measured as follows:

$$\frac{V_i \cdot V_j}{\lVert V_i \rVert \lVert V_j \rVert} = \frac{\sum_{\forall k} (v_{ik} \cdot v_{jk})}{\sqrt{\sum_{\forall k} (v_{ik})^2} \cdot \sqrt{\sum_{\forall k} (v_{jk})^2}}$$

Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
    # dot product: only keys present in both vectors contribute
    n = sum(v * v2.get(k, 0) for k, v in v1.items())
    # product of the Euclidean norms of the two vectors
    d = math.sqrt(sum(v ** 2 for k, v in v1.items()))
    d *= math.sqrt(sum(v ** 2 for k, v in v2.items()))
    return n / d
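
Note that this function assumes neither vector is empty: if either has no nonzero entries, the denominator becomes 0 and the division raises a ZeroDivisionError.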

We then measure the cosine similarity between the document vectors above:

print(cosine_similarity(vs[0], vs[0]))
print(cosine_similarity(vs[0], vs[1]))
print(cosine_similarity(vs[0], vs[2]))
0.9999999999999999
0.5775130451716284
0.4826178600593854

The Cosine similarity between two identical vectors is 1, although it is calculated as 0.9999999999999999 here due to floating-point precision (the first output). As in the Euclidean distance case, the similarity between $v_0$ and $v_1$ is greater than the similarity between $v_0$ and $v_2$, which again contradicts our intuition.

The following diagram illustrates the difference between the two metrics: the Euclidean distance measures the magnitude of the difference between two vectors, whereas the Cosine similarity measures the angle between them, regardless of their magnitudes.
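
To illustrate the difference numerically, here is a minimal sketch (the vectors v and w below are hypothetical, not taken from the corpus): scaling a vector changes its Euclidean distance to the original but not its Cosine similarity.

v = {0: 1.0, 1: 2.0}                  # a hypothetical sparse vector
w = {k: 3 * s for k, s in v.items()}  # same direction, three times the magnitude

print(euclidean_distance(v, w))       # 4.47..., far from 0
print(cosine_similarity(v, w))        # ≈ 1.0, identical direction

Intuitively, w could represent the same document repeated three times: its term frequencies triple, but its direction in the vector space is unchanged.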

Q7: Why is Cosine Similarity generally preferred over Euclidean Distance in most NLP applications?

