
How to detect if two news articles have the same topic? (Python semantic similarity)

I'm trying to scrape headlines and body text from articles on a few specific sites, similar to what Google does with Google News.

The problem is that different sites may carry articles on the same subject, worded slightly differently.

Can anyone tell me what I need to know in order to write a comparison algorithm to auto-detect similar articles? Or, is there any library that can be used for text comparisons and return some type of similarity rating? Solutions that use Python are desired.


I think the easiest way to do that would be to use a sentence-similarity model from the Hugging Face Hub through the sentence-transformers library, for example sentence-transformers/all-mpnet-base-v1.

First, install the package:

pip install sentence_transformers

Then the code is pretty simple, as you can see in the provided link:

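from sentence_transformers import SentenceTransformer
import numpy as np

sentences = ["Text number 1", "Text number 2"]
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')
# normalize_embeddings=True returns unit-length vectors,
# so the dot product below is the cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = np.dot(embeddings[0], embeddings[1])
print(similarity)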

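The result of the dot product is the similarity score between the two strings. For unit-length embeddings this is the cosine similarity: 1 means the texts point in the same direction (effectively the same meaning), values near 0 mean they are unrelated, and -1 means they are opposite (for more details look here)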
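To go from a single score to detecting similar articles, one option is to compute the cosine similarity for every pair of embeddings and keep the pairs above a cutoff. The following is only a minimal sketch in plain Python with no dependencies: the helper names cosine_similarity and similar_pairs, the toy 3-dimensional vectors, and the 0.7 threshold are all assumptions made for illustration, not part of sentence-transformers. In practice the vectors would come from model.encode(...) and the threshold would need tuning on real articles.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold=0.7):
    """Return (i, j, score) for every pair scoring at or above threshold.

    The default threshold is an arbitrary illustrative value.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy vectors standing in for the output of model.encode(...).
toy_embeddings = [
    [0.9, 0.1, 0.0],  # article 0
    [0.8, 0.2, 0.1],  # article 1, close to article 0
    [0.0, 0.1, 0.9],  # article 2, a different direction
]

for i, j, score in similar_pairs(toy_embeddings):
    print(f"articles {i} and {j} look similar: {score:.3f}")
```

Dividing by both norms makes the score independent of vector length, so this version also works when the embeddings are not normalized. Comparing every pair is quadratic in the number of articles, which is fine for a few sites but may need an index for large collections.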
