Document Similarity
Let us vectorize the following three documents using the bag-of-words model with TF-IDF scores estimated from the chronicles_of_narnia.txt corpus:
Once the documents are vectorized, they can be compared within the same vector space. Two common metrics for comparing document vectors are Euclidean distance and cosine similarity.
Euclidean Distance
Euclidean distance is the straight-line distance between two vectors in Euclidean space. It represents the magnitude of the difference between the two vectors, so a smaller distance means the vectors are more similar.
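In the standard formulation, for two n-dimensional vectors V_i and V_j (the symbol names are chosen here for illustration), the distance is the square root of the sum of squared component-wise differences:

```latex
\lVert V_i - V_j \rVert = \sqrt{\sum_{k=1}^{n} \left( v_{i,k} - v_{j,k} \right)^2}
```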
Let us define a function that takes two vectors in our SparseVector notation and returns the Euclidean distance between them:
L6: ** k represents the power of k.
We then measure the Euclidean distance between the two vectors above:
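The function and its use can be sketched as follows. This is a minimal sketch under stated assumptions: SparseVector is taken to be a dict from term ID to TF-IDF score, and the vectors a and b are made-up toy values, not the document vectors discussed above.

```python
import math

# Assumption: a sparse vector maps term IDs to TF-IDF scores;
# terms that do not occur in a document are simply absent (score 0).
SparseVector = dict[int, float]


def euclidean_distance(v1: SparseVector, v2: SparseVector) -> float:
    # A term missing from one vector contributes its full score in the
    # other, so iterate over the union of the keys.
    keys = v1.keys() | v2.keys()
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical toy vectors for illustration only.
a: SparseVector = {0: 1.0, 1: 2.0}
b: SparseVector = {1: 2.0, 2: 2.0}

print(euclidean_distance(a, b))  # square root of 1^2 + 0^2 + 2^2
print(euclidean_distance(a, a))  # a vector is at distance 0 from itself
```

Iterating over the union of keys matters for sparse vectors: using only the shared keys would ignore terms that appear in just one document.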
Cosine Similarity
Cosine similarity is a measure of similarity between two vectors in an inner product space, computed as the cosine of the angle between them. A value of 1 indicates that the vectors point in the same direction (regardless of their magnitudes), a value of -1 indicates that they point in exactly opposite directions, and a value of 0 indicates that they are orthogonal (i.e., perpendicular to each other).
The cosine similarity between two vectors can be measured as follows:
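In the standard formulation, for two n-dimensional vectors V_i and V_j (the symbol names are chosen here for illustration), the dot product is divided by the product of the two norms:

```latex
\cos(V_i, V_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}
= \frac{\sum_{k=1}^{n} v_{i,k} \, v_{j,k}}{\sqrt{\sum_{k=1}^{n} v_{i,k}^2} \, \sqrt{\sum_{k=1}^{n} v_{j,k}^2}}
```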
Let us define a function that takes two sparse vectors and returns the cosine similarity between them:
We then measure the cosine similarity between the two vectors above:
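The function and its use can be sketched as follows. This is a minimal sketch under stated assumptions: SparseVector is taken to be a dict from term ID to TF-IDF score, the vectors a and b are made-up toy values, and returning 0.0 for an all-zero vector is a convention chosen here rather than something the text prescribes.

```python
import math

# Assumption: a sparse vector maps term IDs to TF-IDF scores.
SparseVector = dict[int, float]


def cosine_similarity(v1: SparseVector, v2: SparseVector) -> float:
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v2[k] for k, w in v1.items() if k in v2)
    norm1 = math.sqrt(sum(w ** 2 for w in v1.values()))
    norm2 = math.sqrt(sum(w ** 2 for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # The angle is undefined for a zero vector; 0.0 is a chosen convention.
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical toy vectors for illustration only.
a: SparseVector = {0: 1.0, 1: 2.0}
b: SparseVector = {1: 2.0, 2: 2.0}

print(cosine_similarity(a, b))  # 4 / (sqrt(5) * sqrt(8))
```

Because TF-IDF scores are non-negative, cosine similarity between document vectors of this kind falls between 0 and 1.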
Why do these metrics determine that D1 is more similar to D2 than to D3?
The following diagram illustrates the difference between the two metrics. Euclidean distance measures the magnitude of the difference between two vectors, while cosine similarity measures the angle between them at the origin, independent of their lengths.
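The contrast can also be seen numerically. The sketch below uses made-up two-dimensional vectors (u, v, and w are hypothetical, not the document vectors above) and compact helper functions written for this example: v points in the same direction as u but is ten times longer, while w is short but points in a different direction.

```python
import math

# Assumption: a sparse vector maps term IDs to scores.
SparseVector = dict[int, float]


def euclidean(v1: SparseVector, v2: SparseVector) -> float:
    keys = v1.keys() | v2.keys()
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))


def cosine(v1: SparseVector, v2: SparseVector) -> float:
    dot = sum(w * v2[k] for k, w in v1.items() if k in v2)
    n1 = math.sqrt(sum(w ** 2 for w in v1.values()))
    n2 = math.sqrt(sum(w ** 2 for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


u = {0: 1.0, 1: 1.0}
v = {0: 10.0, 1: 10.0}  # same direction as u, much larger magnitude
w = {0: 1.0}            # close to u in space, but a different direction

# Euclidean distance ranks w as closer to u than v is.
print(euclidean(u, v), euclidean(u, w))

# Cosine similarity ranks v as more similar to u than w is.
print(cosine(u, v), cosine(u, w))
```

The two metrics can therefore disagree: a long document and a short one with the same term proportions are far apart in Euclidean terms yet have a cosine similarity of about 1.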