How to detect if two news articles have the same topic? (Python semantic similarity)
I'm trying to scrape headlines and body text from articles on a few specific sites, similar to what Google does with Google News.
The problem is that different sites may have articles on the same subject worded slightly differently.
Can anyone tell me what I need to know in order to write a comparison algorithm to auto-detect similar articles? Or, is there any library that can be used for text comparisons and return some type of similarity rating? Solutions that use Python are desired.
I think the easiest way to do that would be to use a sentence-similarity model from Hugging Face, for example all-mpnet-base-v1.
First you have to install the library:
pip install sentence_transformers
Then the code is pretty simple, as you can see in the provided link:
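from sentence_transformers import SentenceTransformer
import numpy as np
sentences = ["Text number 1", "Text number 2"]
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')
# encode() returns one embedding vector per input sentence
embeddings = model.encode(sentences)
# the dot product of the two embeddings is the similarity score
similarity = np.dot(embeddings[0], embeddings[1])
print(similarity)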
The result of the dot product will be the similarity score between the two strings. This model normalizes its embeddings, so the dot product equals the cosine similarity. Basically, 1 means they are the same and -1 means they are opposite (for more details look here).
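If you have more than two articles, one way to extend this is to score every pair of headlines and keep the pairs above a cutoff. Here is a rough sketch of that idea, not a tuned solution. The function names (cosine_similarity, similar_pairs, find_similar_headlines) are made up for this example, and the 0.7 threshold is an arbitrary assumption that you would need to adjust on your own data.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold=0.7):
    """Return (i, j, score) for every pair scoring at or above the threshold.

    The 0.7 default is an assumption for illustration, not a recommended value.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


def find_similar_headlines(headlines, threshold=0.7):
    """Embed the headlines with the model from the answer and pair them up."""
    # Imported here so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v1")
    embeddings = model.encode(headlines)
    return [
        (headlines[i], headlines[j], score)
        for i, j, score in similar_pairs(embeddings, threshold)
    ]
```

Calling find_similar_headlines(list_of_headlines) returns the matching pairs with their scores. Comparing every pair is quadratic in the number of articles, which should be fine for a few hundred headlines but will get slow for large collections.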