- cosine similarity
- A measure that quantifies the similarity between two vectors.
- It is the cosine of the angle between the two vectors.
- Mathematically, cosine similarity is the dot product of the vectors divided by the product of their Euclidean norms (magnitudes): cos(θ) = (A · B) / (‖A‖ ‖B‖).
- Reference : towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a
- In general, cosine similarity ranges from -1 to 1. For vectors with non-negative components, such as word-count vectors, it is bounded between 0 and 1.
- The measure is defined only for non-zero vectors A and B, because a zero vector has no direction and a magnitude of 0.
- Suppose the angle between the two vectors was 90 degrees. In that case, the cosine similarity will have a value of 0; this means that the two vectors are orthogonal or perpendicular to each other.
- Smaller angle: more similar, value close to 1
- Larger angle (approaching 90 degrees): less similar, value close to 0
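The relationship between angle and score can be checked with a minimal sketch. The helper name `cosine_similarity` below is chosen for illustration and uses only the Python standard library; it is not taken from any of the referenced articles. The last call shows vectors pointing in opposite directions, a case that only arises when components can be negative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # same direction (0 degrees): 1.0
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular (90 degrees): 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite (180 degrees): -1.0
```

Note that the magnitudes of the vectors do not affect the result; only their directions do.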
Let’s work through an example.
Document 1: [1, 1, 1, 1, 1, 0] (let’s refer to this as A)
Document 2: [1, 1, 1, 1, 0, 1] (let’s refer to this as B)
1. Calculate the dot product of A and B: 1·1 + 1·1 + 1·1 + 1·1 + 1·0 + 0·1 = 4
2. Calculate the magnitude of vector A: √(1² + 1² + 1² + 1² + 1² + 0²) = √5 ≈ 2.2360679775
3. Calculate the magnitude of vector B: √(1² + 1² + 1² + 1² + 0² + 1²) = √5 ≈ 2.2360679775
4. Calculate the cosine similarity: 4 / (2.2360679775 × 2.2360679775) = 4 / 5 = 0.80 (80% similarity between the two documents)
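The four steps above can be reproduced with a short sketch using only the standard library. The variable names are chosen here for illustration.

```python
import math

A = [1, 1, 1, 1, 1, 0]  # Document 1
B = [1, 1, 1, 1, 0, 1]  # Document 2

# Step 1: dot product of A and B.
dot = sum(a * b for a, b in zip(A, B))

# Steps 2 and 3: Euclidean norm (magnitude) of each vector.
norm_a = math.sqrt(sum(a * a for a in A))
norm_b = math.sqrt(sum(b * b for b in B))

# Step 4: cosine similarity.
similarity = dot / (norm_a * norm_b)

print(dot)                   # 4
print(round(norm_a, 10))     # 2.2360679775
print(round(similarity, 2))  # 0.8
```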
doc_soup = "Soup is a primarily liquid food, generally served warm or hot (but may be cool or cold), that is made by combining ingredients of meat or vegetables with stock, juice, water, or another liquid. "
doc_noodles = "Noodles are a staple food in many cultures. They are made from unleavened dough which is stretched, extruded, or rolled flat and cut into one of a variety of shapes."
doc_dosa = "Dosa is a type of pancake from the Indian subcontinent, made from a fermented batter. It is somewhat similar to a crepe in appearance. Its main ingredients are rice and black gram."
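# Note: doc_trump, doc_election and doc_putin are not defined in this excerpt;
# they must be assigned document strings before this line runs.
documents = [doc_trump, doc_election, doc_putin, doc_soup, doc_noodles, doc_dosa]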
import gensim
# Note: softcossim and KeyedVectors.similarity_matrix belong to the gensim 3.x API;
# newer gensim releases may not provide them, so check the version printed below.
from gensim.matutils import softcossim
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
print(gensim.__version__)

# Download (on first use) and load the pretrained FastText word vectors.
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

# Build a dictionary from the tokens of all documents, then the term similarity matrix.
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in documents])
similarity_matrix = fasttext_model300.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

# Convert each document to a bag-of-words vector.
sent_1 = dictionary.doc2bow(simple_preprocess(doc_trump))
sent_2 = dictionary.doc2bow(simple_preprocess(doc_election))
sent_3 = dictionary.doc2bow(simple_preprocess(doc_putin))
sent_4 = dictionary.doc2bow(simple_preprocess(doc_soup))
sent_5 = dictionary.doc2bow(simple_preprocess(doc_noodles))
sent_6 = dictionary.doc2bow(simple_preprocess(doc_dosa))
sentences = [sent_1, sent_2, sent_3, sent_4, sent_5, sent_6]

# Soft cosine similarity between the first two documents.
print(softcossim(sent_1, sent_2, similarity_matrix))
>>>0.5842470477718544
import numpy as np
import pandas as pd

def create_soft_cossim_matrix(sentences):
    # Index grids so that every (i, j) pair of documents is compared.
    len_array = np.arange(len(sentences))
    xx, yy = np.meshgrid(len_array, len_array)
    # Soft cosine similarity for each pair, rounded to two decimals.
    cossim_mat = pd.DataFrame([[round(softcossim(sentences[i], sentences[j], similarity_matrix), 2) for i, j in zip(x, y)] for y, x in zip(xx, yy)])
    return cossim_mat

create_soft_cossim_matrix(sentences)
>>>
      0     1     2     3     4     5
0  1.00  0.58  0.56  0.28  0.34  0.40
1  0.58  1.00  0.54  0.25  0.31  0.43
2  0.56  0.54  1.00  0.19  0.25  0.36
3  0.28  0.25  0.19  1.00  0.50  0.38
4  0.34  0.31  0.25  0.50  1.00  0.56
5  0.40  0.43  0.36  0.38  0.56  1.00
Reference : www.machinelearningplus.com/nlp/cosine-similarity/
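For intuition about what the soft cosine score in the gensim code measures, here is a minimal pure-Python sketch. It assumes the usual soft cosine definition, in which a term-similarity matrix S weights every pair of features: (Σ S[i][j]·a[i]·b[j]) divided by the square root of (Σ S[i][j]·a[i]·a[j]) × (Σ S[i][j]·b[i]·b[j]). The function name `soft_cosine` and the matrix values are made up for illustration; they are not part of gensim, and gensim derives its matrix from word embeddings instead.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine similarity of a and b under term-similarity matrix S."""
    def weighted_inner(u, v):
        # Sum of S[i][j] * u[i] * v[j] over all feature pairs (i, j).
        return sum(S[i][j] * u[i] * v[j]
                   for i in range(len(u)) for j in range(len(v)))
    denom = math.sqrt(weighted_inner(a, a) * weighted_inner(b, b))
    if denom == 0:
        raise ValueError("soft cosine is undefined for zero-norm vectors")
    return weighted_inner(a, b) / denom

A = [1, 1, 1, 1, 1, 0]
B = [1, 1, 1, 1, 0, 1]

# Identity matrix: no two distinct features are related, so this is plain cosine.
identity = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]

# Hypothetical matrix: the last two features (indices 4 and 5) are 50% similar.
related = [row[:] for row in identity]
related[4][5] = related[5][4] = 0.5

print(round(soft_cosine(A, B, identity), 2))  # 0.8
print(round(soft_cosine(A, B, related), 2))   # 0.9
```

With the identity matrix the score matches ordinary cosine similarity. Once the two differing features are treated as related, the score rises, which is why documents that use different but related words can still score as similar.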