Apache Lucene: how to convert collection index to another format?
I need to convert an index generated by Apache Lucene into another collection representation.
I currently have a collection of documents with many attributes.
I need to create document pairs with similarity measures from it, in order to pass them to classifiers.
Do you know of any tutorial I could use to do this?
Thanks
The similarity measures need to be based on a query: you query your Lucene document set and get back a set of documents with relative scores.
If you want to compare every document with every other one (is that right? It's hard to tell from the question), then you need to use a feature of each document as the basis for the queries.
For example, you could extract the top N terms (by frequency, excluding stop words) from each document. If you have X documents then you will have X queries. You then execute each of the X queries against the index and get back the relative similarity of each document to every other. The result is an X-by-X matrix you could use for classification.
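To make the idea concrete, here is a rough sketch in plain Python of the top-N-terms approach. It does not use Lucene at all: the `score` function is a simple tf-idf stand-in for whatever scores your Lucene searches would return, and the stop-word list, the function names and the three sample documents are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def top_terms(text, n):
    """Return the n most frequent non-stop-word terms of one document."""
    return [term for term, _ in Counter(tokenize(text)).most_common(n)]

def score(query_terms, doc_counts, doc_freq, num_docs):
    """Stand-in relevance score (NOT Lucene's formula): sum of tf * idf."""
    total = 0.0
    for term in query_terms:
        tf = doc_counts.get(term, 0)
        if tf:
            total += tf * math.log(1 + num_docs / doc_freq[term])
    return total

def similarity_matrix(docs, n=3):
    """Row i holds the scores of every document against the query built from document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Iterating a Counter yields each distinct term once, so this counts documents per term.
    doc_freq = Counter(term for c in counts for term in c)
    queries = [top_terms(d, n) for d in docs]
    return [[score(q, c, doc_freq, len(docs)) for c in counts] for q in queries]

# Made-up sample collection.
docs = [
    "lucene index search index query",
    "lucene query scoring and ranking of search results",
    "cooking pasta with tomato sauce",
]
matrix = similarity_matrix(docs)

# Flatten the matrix into (query_doc, other_doc, score) pairs for a classifier.
pairs = [(i, j, matrix[i][j])
         for i in range(len(docs)) for j in range(len(docs)) if i != j]
```

Note that a matrix built this way is generally not symmetric, because the query derived from document i differs from the one derived from document j. If your classifier expects a single value per pair, you would have to combine the two directions yourself, for example by averaging them.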
Another alternative would be to use the title or synopsis of each document as the basis for the query (again, excluding stop words).
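A minimal sketch of that variant, again in plain Python and independent of Lucene. It assumes each document is available as a dict with a `title` field; that layout, the `title_query` name and the stop-word list are illustrative assumptions, not anything Lucene prescribes.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}

def title_query(doc, field="title"):
    """Build a list of query terms from one field of a document, dropping stop words."""
    tokens = re.findall(r"[a-z0-9]+", doc[field].lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up sample document.
doc = {"title": "An Introduction to the Lucene Index", "body": "full text goes here"}
query_terms = title_query(doc)
```

The resulting terms would then be run against the index as one query per document, and the returned scores collected into a matrix in the same way as with the top N terms.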