Can you suggest a good Java library for text classification with the Vector Space Model?
I need to extract the vector space representation of several documents and then compute the cosine distance between them.
I'd like to use that distance to classify some new documents using a k-Nearest-Neighbor approach.
Do you have any suggestions on the libraries I could use?
So far I have seen that both Weka and Apache Lucene should support the Vector Space Model. Which one do you think best fits my needs?
Weka and Lucene are two different approaches.
Weka is a general-purpose toolbox for machine learning. It is a good option if you want to build a flexible machine learning system, have the time and energy to do so, want to be able to change anything and fine-tune parameters, and scale is not an issue.
Lucene is specialised for text. Go for it if you want a quick solution that handles text easily, searches for similar documents, and copes with large amounts of data. Being specialised does not make Lucene inferior; for text it is quite the opposite. So to implement kNN easily, I would go for Lucene. Good luck with scale, though: brute-force kNN compares each new document with all N stored ones, so classifying N documents takes on the order of N^2 comparisons.
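To make the steps concrete, here is a minimal sketch of the pipeline in plain Java, with no library at all: term-frequency vectors, cosine similarity, and a brute-force kNN vote. It uses neither the Weka nor the Lucene API. The class name `VsmKnn`, its methods, and the toy documents are all made up for illustration, and it assumes raw term counts (no TF-IDF weighting, stemming, or stop-word removal), which a real library would probably handle for you.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not part of Weka or Lucene.
public class VsmKnn {

    // Hypothetical holder for a labeled training document.
    static final class Labeled {
        final Map<String, Integer> vector;
        final String label;

        Labeled(String text, String label) {
            this.vector = termFrequencies(text);
            this.label = label;
        }
    }

    // Sparse term-frequency vector: lowercase the text, split on anything
    // that is not a-z, and count each token (assumption: raw counts only).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity; returns 0 when either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Brute-force kNN: score the query against every training document,
    // take the k most similar, and return the majority label.
    static String classify(String text, List<Labeled> training, int k) {
        Map<String, Integer> query = termFrequencies(text);
        double[] score = new double[training.size()];
        Integer[] order = new Integer[training.size()];
        for (int i = 0; i < training.size(); i++) {
            score[i] = cosine(query, training.get(i).vector);
            order[i] = i;
        }
        // Sort document indices by descending similarity.
        Arrays.sort(order, (i, j) -> Double.compare(score[j], score[i]));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int r = 0; r < Math.min(k, order.length); r++) {
            String label = training.get(order[r]).label;
            int count = votes.merge(label, 1, Integer::sum);
            if (best == null || count > votes.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Labeled> training = Arrays.asList(
            new Labeled("the striker scored a goal in the match", "sports"),
            new Labeled("the team won the football match", "sports"),
            new Labeled("the parliament passed a new law", "politics"),
            new Labeled("the minister discussed the election law", "politics"));
        System.out.println(classify("a late goal decided the football match", training, 3));
    }
}
```

The loop in `classify` is exactly the scaling problem mentioned above: every query is compared with every stored document. An inverted index such as Lucene's avoids most of those comparisons by only looking at documents that share terms with the query.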