N-Gram, tf-idf and Cosine similarity in Perl
I am trying to do some pattern 'mining' on a piece of text with multiple words on each line. I have done an N-gram analysis using the Text::Ngrams module in Perl, which gives me the frequency of each word. However, I am quite confused about how to find patterns in this text.
I presume tf-idf also measures frequency, but how does it differ from the N-gram analysis I did, and how does a similarity measure help?
Are there any Perl modules or code snippets that would help me understand these concepts?
I come from a physics background but have to do some pattern recognition, so I am new to some of this. A good reference on these topics would be appreciated.
Assume you have a collection of N documents and want to find out whether Document X (an article on how to be a bodybuilder) is similar to another Document Y whose contents you do not know. If Document Y is similar to Document X, it might contain the usual terms one associates with bodybuilding, e.g. weight-lifting, barbells, dumbbells and maybe Arnold.
In that case, the similarity between Document X and Document Y would be fairly high. One way to measure this similarity is the cosine of the angle between the two documents, with each document represented as a vector of term counts.
Cosine Similarity Reference: http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
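To make the cosine measure concrete, here is a minimal sketch in core Perl. The helper names term_counts and cosine and the three sample strings are made up for illustration; it uses plain word counts as the vector components, which is only one of several possible weightings.

```perl
use strict;
use warnings;

# Count how often each lowercased word occurs in a string.
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Cosine of the angle between two sparse vectors stored as hash refs.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $norm_u, $norm_v) = (0, 0, 0);
    for my $term (keys %$u) {
        $norm_u += $u->{$term} ** 2;
        $dot    += $u->{$term} * $v->{$term} if exists $v->{$term};
    }
    $norm_v += $_ ** 2 for values %$v;
    return 0 unless $norm_u && $norm_v;    # an empty document matches nothing
    return $dot / sqrt($norm_u * $norm_v);
}

# Hypothetical sample documents.
my $x = term_counts("barbells and dumbbells for weight lifting");
my $y = term_counts("weight lifting with barbells");
my $z = term_counts("quantum field theory lecture notes");

printf "X vs Y: %.3f\n", cosine($x, $y);    # prints 0.612 (3 shared words)
printf "X vs Z: %.3f\n", cosine($x, $z);    # prints 0.000 (no shared words)
```

A value near 1 means the two documents use the same words in similar proportions, and 0 means they share no words at all.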
Use CPAN to search for Perl modules. For example, to compute cosine similarity you could try the Text::Document module.
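On how tf-idf differs from the raw frequencies an N-gram count gives you: tf-idf multiplies each term's count by log(N / number of documents containing the term), so words that occur everywhere are weighted down. The sketch below assumes each line of your text is treated as one document and uses the common tf * log(N/df) form; the sample lines and variable names are hypothetical, and other tf-idf variants exist.

```perl
use strict;
use warnings;

# Hypothetical input: one "document" per line.
my @lines = (
    "barbells and dumbbells",
    "barbells and protein",
    "lecture notes and equations",
);

# Term counts per line, plus the number of lines each term appears in.
my (@tf, %df);
for my $line (@lines) {
    my %count;
    $count{ lc $1 }++ while $line =~ /(\w+)/g;
    $df{$_}++ for keys %count;
    push @tf, \%count;
}

# tf-idf weight = count in this line * log(N / lines containing the term).
my $n = @lines;
my @tfidf;
for my $count (@tf) {
    my %w;
    $w{$_} = $count->{$_} * log($n / $df{$_}) for keys %$count;
    push @tfidf, \%w;
}

# Print the weights for each line, terms in alphabetical order.
for my $i (0 .. $#tfidf) {
    my $w = $tfidf[$i];
    print "line $i: ",
        join(", ", map { sprintf "%s=%.2f", $_, $w->{$_} } sort keys %$w),
        "\n";
}
```

Here "and" appears in every line and gets weight 0, while "dumbbells" appears in only one line and gets the highest weight. Feeding these weights, rather than raw counts, into a cosine similarity is a common way to compare documents.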