Detecting similar words among n text documents

2022-12-23 13:33

I have n documents and want to find the words they have in common. For example, I want to be able to say that (n-3) of the documents include the word "web".

Certainly I can do this with basic data structures, but there may be a more efficient algorithm, or a way to treat the same word with different suffixes as one word. Is there an algorithm for this purpose?

I am unfamiliar with the data mining world. Is there a general term for the task of finding similarities between different documents? If there is, it will make my research easier.

Thanks.

I suppose that you are talking about stemming. If you want to use the R language, you'll have to work with the tm package.

Introduction to the tm Package
Text Mining Infrastructure in R

If not, I can only suggest this list of text mining tools
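To make the stemming idea concrete outside R, here is a minimal Python sketch. It assumes a deliberately crude suffix stripper (`naive_stem` and its suffix list are made up for illustration and are not a real stemming algorithm such as Porter's) and counts, for each stem, how many documents contain it. The sample documents and the threshold of 2 are also just assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_frequency(documents):
    """Count, for each stem, how many documents contain it at least once."""
    df = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        # A set ensures each stem is counted once per document.
        df.update({naive_stem(w) for w in words})
    return df


docs = [
    "Web pages and web servers",
    "Crawling the webs of links",
    "Servers log every request",
]
df = document_frequency(docs)

# Stems that appear in at least `threshold` documents (n - 3 in the question).
threshold = 2
common = sorted(stem for stem, count in df.items() if count >= threshold)
```

With a real stemmer swapped in for `naive_stem`, the same document-frequency count answers questions like "how many documents include the word web".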

You can do it by producing a word list with counts for each document, sorting each word list alphabetically, and comparing the lists two at a time. Sorting dominates the cost, so this is O(n lg n), where n is the number of words.
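The sorted-list comparison can be sketched in Python roughly as follows. The function names and the two sample sentences are assumptions made for illustration; once both lists are sorted, the comparison itself is a single linear pass over them.

```python
import re
from collections import Counter


def sorted_word_counts(text):
    """Return (word, count) pairs for one document, sorted alphabetically."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(counts.items())


def common_words(list_a, list_b):
    """Walk two alphabetically sorted lists in step and collect shared words."""
    i = j = 0
    shared = []
    while i < len(list_a) and j < len(list_b):
        word_a, count_a = list_a[i]
        word_b, count_b = list_b[j]
        if word_a == word_b:
            shared.append((word_a, count_a, count_b))
            i += 1
            j += 1
        elif word_a < word_b:
            i += 1  # word_a cannot appear later in list_b
        else:
            j += 1  # word_b cannot appear later in list_a
    return shared


a = sorted_word_counts("the web is a web of pages")
b = sorted_word_counts("a spider spins a web")
shared = common_words(a, b)
```

Each entry in `shared` holds a word together with its count in the first and second document.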

Another approach is to use the full-text search provided by your database of choice.
