Java Open Source Text Mining Frameworks [closed]

2022-12-20 15:18

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 11 years ago.

I want to know what the best open source Java-based framework for text mining is, one that supports both machine learning and dictionary methods.

I'm using Mallet, but there is not much documentation and I do not know whether it will fit all my requirements.

I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I get the advantage of the UIMA framework, which is very well organized and fast.

Thank you all for your interesting answers.

Best Regards, ukrania

Although it is not a specialized text mining framework, Weka has a number of classifiers commonly used in text mining tasks, such as SVM, kNN, and multinomial Naive Bayes, among others.

It also has a few filters to work with textual data, like the StringToWordVector filter, which can perform a TF-IDF transformation.
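To make the TF-IDF idea concrete, here is a minimal plain-Java sketch of the weighting itself. It is not Weka's StringToWordVector code and does not use the Weka API; the class name, the naive tokenizer, and the formula tf * ln(N / df) are assumptions chosen for illustration, and Weka's filter has its own options and may weight terms differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a plain-Java TF-IDF computation, not Weka's implementation.
class TfIdfSketch {

    // Lowercase the text and split on anything that is not an ASCII letter (deliberately naive).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one term -> weight map per document, with weight = tf * ln(N / df).
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String token : tokenize(doc)) tf.merge(token, 1, Integer::sum);
            // Each document contributes at most 1 to a term's document frequency.
            for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
            counts.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "text mining with Java",
                "machine learning with Java",
                "text text classification");
        List<Map<String, Double>> vectors = tfIdf(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(docs.get(i) + " -> " + vectors.get(i));
        }
    }
}
```

With this weighting, a term that occurs in every document gets weight 0, which is the point of the IDF factor: words that appear everywhere carry little information for classification.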

Check out the Weka wiki website for more information.

Maybe have a look at Java Open Source NLP and Text Mining tools.

I've used LingPipe -- a suite of Java libraries for the linguistic analysis of human language -- for text mining (and other related) tasks.

It is a very well documented software package, and the site contains several tutorials which thoroughly explain how to do a certain task with LingPipe, such as named entity recognition. There is also a newsgroup, wherein you can post any question you have about the software (or NLP related tasks), and have a prompt reply from the authors of the package themselves; and of course, a blog.

The source code is also very easy to follow and well documented which, for me, is always a big plus.

As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. For dictionary matching, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).

In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package out there, so I can't say it's the best), and I definitely recommend it for the task you have at hand.
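For readers unfamiliar with Aho-Corasick, the sketch below shows the general technique in plain Java: build a trie of the dictionary entries, add failure links with a breadth-first pass, then scan the text once. It is not LingPipe's code and does not use the LingPipe API; the class name, method names, and the "start:end:entry" result format are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of Aho-Corasick dictionary matching; not LingPipe's implementation.
class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie path
        final List<String> outputs = new ArrayList<>();  // dictionary entries that end at this node
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> dictionary) {
        // Step 1: build a trie of all dictionary entries.
        for (String word : dictionary) {
            if (word.isEmpty()) continue;
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // Step 2: breadth-first pass that sets failure links and inherits outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns matches as "start:end:entry" (end exclusive).
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node nextNode = node.next.get(c);
            node = (nextNode != null) ? nextNode : root;
            for (String word : node.outputs) {
                matches.add((i + 1 - word.length()) + ":" + (i + 1) + ":" + word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher = new AhoCorasickSketch(List.of("he", "she", "his", "hers"));
        System.out.println(matcher.findAll("ushers"));
    }
}
```

For the dictionary {he, she, his, hers} and the text "ushers", this reports "she", "he", and "hers", including the overlapping hits. The speed comes from the failure links: the text is read left to right exactly once, no matter how many dictionary entries there are.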

You may already know about GATE: http://gate.ac.uk/

...but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.

I built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt http://sourceforge.net/projects/maxent/ for a course once.

It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice, neat numerical vectors, though.
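To give an idea of what that preprocessing step looks like, here is a rough sketch in Java rather than Perl. It is not the code from that course project and does not use the OpenNLP MaxEnt API; the feature templates (word, capitalization, digits, neighbouring words) and all names are hypothetical examples of the kind of features often used for named entity recognition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch: turn per-token string features into sparse numeric feature indices.
class TokenFeatureSketch {

    // Maps each distinct feature string to an integer index, assigned in first-seen order.
    private final Map<String, Integer> featureIndex = new LinkedHashMap<>();

    // Hypothetical feature templates for token i of a sentence (tokens must be non-empty strings).
    static List<String> features(String[] tokens, int i) {
        String tok = tokens[i];
        List<String> f = new ArrayList<>();
        f.add("word=" + tok.toLowerCase(Locale.ROOT));
        f.add("isCapitalized=" + Character.isUpperCase(tok.charAt(0)));
        f.add("hasDigit=" + tok.chars().anyMatch(Character::isDigit));
        f.add("prev=" + (i > 0 ? tokens[i - 1].toLowerCase(Locale.ROOT) : "<s>"));
        f.add("next=" + (i < tokens.length - 1 ? tokens[i + 1].toLowerCase(Locale.ROOT) : "</s>"));
        return f;
    }

    // Returns the indices of the active (value 1) features for token i, growing the index as needed.
    int[] encode(String[] tokens, int i) {
        List<String> feats = features(tokens, i);
        int[] indices = new int[feats.size()];
        for (int k = 0; k < feats.size(); k++) {
            String name = feats.get(k);
            Integer idx = featureIndex.get(name);
            if (idx == null) {
                idx = featureIndex.size();
                featureIndex.put(name, idx);
            }
            indices[k] = idx;
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "lives", "in", "Paris"};
        TokenFeatureSketch encoder = new TokenFeatureSketch();
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + Arrays.toString(encoder.encode(tokens, i)));
        }
    }
}
```

Each token ends up as a short list of integer indices, which is the sparse form of a binary feature vector. Features shared between tokens (for example isCapitalized=true) map to the same index, so the classifier can learn one weight per feature.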

We use Lucene to process live streams from the internet. It has a native Java API.

http://lucene.apache.org/java/docs/

You can then use Mahout, which is a collection of machine learning algorithms that operate on top of Lucene.

http://lucene.apache.org/mahout/

Tags: frameworks information-retrieval machine-learning
