Ruby Text Analysis
Is there any Ruby gem or else for text analysis? Word f开发者_JS百科requency, pattern detection and so forth (preferably with an understanding of french)
the generalization of word frequencies are Language Models, e.g. uni-grams (= single word frequency), bi-grams (= frequency of word pairs), tri-grams (=frequency of world triples), ..., in general: n-grams
You should look for an existing toolkit for Language Models — not a good idea to re-invent the wheel here.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed!! because you have to process huge corpora) and generate standard output format ARPA n-gram files (those are typically a text format)
Check the following thread, which contains more details and links:
Building openears compatible language model
Once you generated your Language Model with one of these toolkits, you will need either a Ruby Gem which makes the language model accessible in Ruby, or you need to convert the ARPA format into your own format.
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info
Last not least check Google's online N-gram tool. They built n-grams based on the books they digitized — also available in French and other languages!
The Mendicant Bug: NLP Resources for Ruby
contains lots of useful Ruby NLP links.
I had tried using the Ruby Linguistics stuff a long time ago, and remember having a lot of problems with it... I don't recommend jumping into that.
If most of your text analysis involves stuff like counting ngrams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt stuff to the idiosyncrasies of the problem you are trying to solve.
Like the Stanford parser gem, its possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so probably not the best way to solve a problem.
I wrote the gem words_counted
for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented and can be found in the readme on Github.
精彩评论