Document Similarity

Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the chronicles_of_narnia.txt corpus:

from src.bag_of_words_model import vocabulary
from src.term_weighing import read_corpus, document_frequencies, tf_idf

if __name__ == '__main__':
    corpus = read_corpus('dat/chronicles_of_narnia.txt')
    vocab = vocabulary(corpus)
    dfs = document_frequencies(vocab, corpus)
    D = len(corpus)
    documents = [
        'I like this movie very much'.split(),
        'I hate this movie very much'.split(),
        'I love this movie so much'.split()
    ]
    vs = [tf_idf(vocab, dfs, D, document) for document in documents]
    for v in vs: print(v)

{980: 0.31, 7363: 0.52, 7920: 0.70, 11168: 0.51, 11833: 0.51}
{980: 0.31, 6423: 1.24, 7920: 0.70, 10325: 0.53, 11168: 0.51}
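Each printed vector is a sparse dictionary mapping a term's vocabulary ID to its TF-IDF weight. To inspect which words the IDs stand for, one can invert the vocabulary. The following is a minimal sketch, assuming vocabulary() returns a term-to-ID dictionary; the decode() helper is hypothetical, not part of src:

# Hypothetical helper (not in src): translate a sparse vector's term IDs back
# to the words they index, assuming vocab maps {term: id}.
def decode(vocab: dict, v: dict) -> dict:
    id_to_term = {i: t for t, i in vocab.items()}
    return {id_to_term[k]: round(w, 2) for k, w in v.items()}

# e.g., inside the main block above:
# print(decode(vocab, vs[0]))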

Once the documents are vectorized, they can be compared within the same vector space. Two common metrics for comparing document vectors are Euclidean distance and Cosine similarity.

Euclidean Similarity

Euclidean distance measures the straight-line distance between two vectors in Euclidean space; it represents the magnitude of the difference between the two vectors.

Let $V_i = [v_{i1}, \dots, v_{in}]$ and $V_j = [v_{j1}, \dots, v_{jn}]$ be two vectors representing documents $D_i$ and $D_j$. The Euclidean distance between the two vectors can be measured as follows:

$$\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^n (v_{ik} - v_{jk})^2}$$
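For example, if $V_i = [3, 4]$ and $V_j = [0, 0]$, then $\lVert V_i - V_j \rVert = \sqrt{(3-0)^2 + (4-0)^2} = 5$.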

Let us define a function that takes two vectors in our notation and returns the Euclidean distance between them:

import math
from src.bag_of_words_model import SparseVector

def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
    # Squared differences over all keys in v1 (keys missing from v2 count as 0).
    d = sum((v - v2.get(k, 0)) ** 2 for k, v in v1.items())
    # Add squared weights for keys that appear only in v2.
    d += sum(v ** 2 for k, v in v2.items() if k not in v1)
    return math.sqrt(d)

  • `x ** k` raises `x` to the power of `k`.

We then measure the Euclidean distance between the vectors above:

print(euclidean_distance(vs[0], vs[0]))
print(euclidean_distance(vs[0], vs[1]))
print(euclidean_distance(vs[0], vs[2]))

0.0
1.347450458032576
1.3756015678855296

The Euclidean distance between two identical vectors is 0 (the first line of the output). Interestingly, the distance between $v_0$ and $v_1$ is shorter than the distance between $v_0$ and $v_2$, implying that $v_1$ is more similar to $v_0$ than $v_2$ is. This contradicts our intuition: "I love this movie so much" ($v_2$) should be closer in meaning to "I like this movie very much" ($v_0$) than "I hate this movie very much" ($v_1$).

Cosine Similarity

Cosine similarity measures the similarity between two vectors in an inner product space by calculating the cosine of the angle between them: a value of 1 indicates that the vectors are identical in direction, a value of -1 indicates that they are exactly opposite, and a value of 0 indicates that they are orthogonal (i.e., perpendicular to each other).

The cosine similarity between two vectors can be measured as follows:

$$\frac{V_i \cdot V_j}{\lVert V_i \rVert \lVert V_j \rVert} = \frac{\sum_{\forall k} (v_{ik} \cdot v_{jk})}{\sqrt{\sum_{\forall k} (v_{ik})^2} \cdot \sqrt{\sum_{\forall k} (v_{jk})^2}}$$
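For example, $V_i = [1, 1]$ and $V_j = [2, 2]$ point in the same direction, so their cosine similarity is $\frac{1 \cdot 2 + 1 \cdot 2}{\sqrt{2} \cdot \sqrt{8}} = \frac{4}{4} = 1$, even though their magnitudes differ.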

Let us define a function that takes two sparse vectors and returns the cosine similarity between them:

def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
    # Dot product over the dimensions shared by both vectors.
    n = sum(v * v2.get(k, 0) for k, v in v1.items())
    # Product of the two vectors' Euclidean norms.
    d = math.sqrt(sum(v ** 2 for k, v in v1.items()))
    d *= math.sqrt(sum(v ** 2 for k, v in v2.items()))
    return n / d

We then measure the Cosine similarity between the vectors above:

print(cosine_similarity(vs[0], vs[0]))
print(cosine_similarity(vs[0], vs[1]))
print(cosine_similarity(vs[0], vs[2]))

0.9999999999999999
0.5775130451716284
0.4826178600593854

The Cosine similarity between two identical vectors is 1, although it is calculated as 0.99… here due to floating-point precision (the first line of the output). Similar to the Euclidean distance case, the similarity between $v_0$ and $v_1$ is greater than the similarity between $v_0$ and $v_2$, which again contradicts our intuition.
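One way to probe why both metrics misbehave here is to inspect which dimensions each pair of vectors actually shares: both metrics reward overlap in surface words, not in meaning. A quick diagnostic sketch using the vectors above:

# Term IDs with nonzero TF-IDF weight shared by each pair of documents.
print(sorted(set(vs[0]) & set(vs[1])))  # dimensions shared by v0 and v1
print(sorted(set(vs[0]) & set(vs[2])))  # dimensions shared by v0 and v2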

The following diagram illustrates the difference between the two metrics: the Euclidean distance measures the magnitude of the difference between two vectors, while the Cosine similarity measures the angle between them at the origin.
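To make the contrast concrete, consider scaling a document vector: doubling every weight changes its magnitude but not its direction. A small sketch using the functions above:

# Doubling every weight preserves the vector's direction but not its magnitude:
# the Euclidean distance becomes the norm of vs[0], while the cosine stays ~1.
v0_doubled = {k: 2 * w for k, w in vs[0].items()}
print(euclidean_distance(vs[0], v0_doubled))  # nonzero despite identical direction
print(cosine_similarity(vs[0], v0_doubled))   # ~1.0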


Q7: Why is Cosine Similarity generally preferred over Euclidean Distance in most NLP applications?
