Using Python NLTK to find similarity between two web pages?

I want to find out whether two web pages are similar or not. Can someone suggest whether Python NLTK with its WordNet similarity functions would help here, and how? What is the best similarity function to use in this case?


The SpotSigs paper mentioned by joyceschan addresses content-duplication detection, and it contains plenty of food for thought.

If you are looking for a quick comparison of key terms, NLTK's standard functions might suffice.

With NLTK you can pull synonyms of your terms by looking up the synsets contained in WordNet:

>>> from nltk.corpus import wordnet

>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]

>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]

It understands plurals, and it also tells you which part of speech the synonym corresponds to.

Synsets are stored in a tree, with more specific terms at the leaves and more general ones toward the root. The more general ancestors of a term are called hypernyms.

You can measure similarity by how close the terms are to their common hypernym.

Watch out for different parts of speech: according to the NLTK cookbook, they don't have overlapping paths, so you shouldn't try to measure similarity between them.

Say you have the two terms donation and gift. You could get their synsets from wordnet.synsets(), but in this example I initialize them directly:

>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')

The cookbook recommends the Wu-Palmer similarity method:

>>> d.wup_similarity(g)
0.93333333333333335

This approach gives you a quick way to determine if the terms used correspond to related concepts. Take a look at Natural Language Processing with Python to see what else you can do to help your analysis of text.


Consider implementing SpotSigs.
