
Algorithm for sentence analysis and tokenization

I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not of single words but of recurring groups of words). I read that compression algorithms do something similar to what I want: they build dictionaries of blocks of text, each with a piece of information recording its frequency. It should be something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx Do you have anything written in C#?


This is very simple to implement.

  1. Use Split (a member function of the string class) to split the text into words. (You can use the delimiters from the CodeProject article linked in the question.)

  2. Use a for loop to enumerate all the n-grams (every run of n consecutive words) and a Dictionary<string, int> to keep the count of each one.
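The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `NGramCounter`, the method `Count`, the delimiter set, and the choice to lowercase the text are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Hypothetical helper: counts every run of n consecutive words in text.
    public static Dictionary<string, int> Count(string text, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Assumed delimiter set; adjust it to the documents being analyzed.
        char[] delimiters = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        // Step 1: split into words (lowercased so "The" and "the" match).
        string[] words = text.ToLowerInvariant()
                             .Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Step 2: slide a window of n words over the array and count each n-gram.
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= words.Length; i++)
        {
            // Join words i .. i+n-1 into a single dictionary key.
            string gram = string.Join(" ", words, i, n);
            counts.TryGetValue(gram, out int seen); // seen is 0 if the key is new
            counts[gram] = seen + 1;
        }
        return counts;
    }
}
```

For example, `NGramCounter.Count("the cat sat on the mat the cat", 2)` yields a dictionary in which `"the cat"` maps to 2 and every other two-word sequence maps to 1. Because punctuation is treated as a plain delimiter here, n-grams can span sentence boundaries; if that is not wanted, split the text into sentences first and count within each one.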
