Algorithm for sentence analysis and tokenization
I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not of single words but of batches of recurring words). I read that compression algorithms do something similar to what I want: they build dictionaries of blocks of text, with a piece of information reporting each block's frequency. It should be something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx. Do you have anything written in C#?
This is very simple to implement.
Use Split (a member function of the string class) to split the string into words (you can use the delimiters from the CodeProject URL).
Then use a for loop to enumerate all the n-grams (each run of n consecutive words), and use a
Dictionary<string, int>
keyed by the n-gram to keep its count.
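A minimal sketch of the steps above, assuming the whole document fits in memory as one string. The class name NGramCounter, the method name Count, and the delimiter list are hypothetical choices for illustration, not taken from the CodeProject article. Because punctuation is treated as a delimiter, n-grams here can span sentence boundaries.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Assumed delimiter set; adjust it to the text being analyzed.
    private static readonly char[] Delimiters =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    // Returns how many times each run of n consecutive words occurs in text.
    public static Dictionary<string, int> Count(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Lower-case first so "The cat" and "the cat" share one entry.
        string[] words = text.ToLowerInvariant()
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();

        // Slide a window of n words across the array.
        for (int i = 0; i + n <= words.Length; i++)
        {
            string key = string.Join(" ", words, i, n);

            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }

        return counts;
    }
}
```

Calling NGramCounter.Count(text, 3) would give the count of every three-word sequence. To find recurring phrases of several lengths, one option is to call it once per value of n and keep only the entries whose count is greater than 1.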