N-Gram, tf-idf and Cosine similarity in Perl

I am trying to do some pattern 'mining' on a piece of text that has several words on each line. I have done an n-gram analysis using the Text::Ngrams module in Perl, which gives me the frequency of each word. However, I am quite confused about how to find patterns in this text.

I presume tf-idf also measures frequency, but how does it differ from the n-gram analysis I did, and how does a similarity measure help?

Are there any Perl modules or code snippets I could use to understand some of these concepts?

I come from a physics background but have to do some pattern recognition, so I am new to some of this. A good reference on these topics would be appreciated.


Assume you have a collection of N documents and you want to find out whether Document X (containing an article on how to be a bodybuilder) is similar to another Document Y whose contents you do not know. If Document Y is "similar" to Document X, it will probably contain the terms one usually associates with bodybuilding, e.g. weight-lifting, barbells, dumbbells and maybe Arnold.

In that case, the similarity between Document X and Document Y would be fairly high. One way to measure this similarity is the cosine of the angle between the two documents when each is represented as a vector of term counts.

Cosine Similarity Reference: http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
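To make the idea concrete, here is a minimal sketch of cosine similarity on raw word counts. It is written in Python for brevity rather than Perl; the same arithmetic maps directly onto Perl hashes. The function names (term_counts, cosine_similarity) and the sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption, not a proper tokenizer.

```python
import math
from collections import Counter

def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents
doc_x = term_counts("barbells and dumbbells build muscle")
doc_y = term_counts("dumbbells and barbells for weight lifting")
doc_z = term_counts("quantum field theory lecture notes")

print(cosine_similarity(doc_x, doc_y))  # shares three terms with doc_x
print(cosine_similarity(doc_x, doc_z))  # shares no terms with doc_x
```

With non-negative counts the result lies between 0 (no shared terms) and 1 (same direction), so documents that share many terms score closer to 1.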

Use CPAN to search for Perl modules. For example, to compute cosine similarity you could try the Text::Document module.
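On how tf-idf differs from plain n-gram counts: an n-gram count only says how often a term occurs, whereas tf-idf scales that count down for terms that occur in many documents, so that common words stop dominating. The sketch below uses one common variant, weight = tf * log(N / df), where tf is the raw count in the document, N is the number of documents and df is the number of documents containing the term. Other variants exist. It is in Python rather than Perl, and the function name tf_idf and the sample texts are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    counts = [Counter(d.lower().split()) for d in docs]
    n_docs = len(counts)
    # df[t] = number of documents in which term t appears at least once
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

# Hypothetical sample documents
docs = [
    "the barbells and the dumbbells",
    "the lecture on the theory",
    "the notes on the barbells",
]
weights = tf_idf(docs)

# "the" occurs in every document, so its weight is zero;
# "dumbbells" occurs in only one, so it gets the largest idf.
for term, w in sorted(weights[0].items()):
    print(term, round(w, 3))
```

The resulting weight dictionaries can be used in place of raw counts when computing a cosine similarity, which is the usual way the two ideas are combined.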
