
Apache Lucene: how to convert a collection index to another format?

I need to convert an index generated by Apache Lucene into another collection representation.

I currently have a collection of documents with many attributes.

I need to create document pairs with similarity measures from it, in order to pass them to classifiers.

Do you know of any tutorial I could use to do this?

thanks


The similarity measures need to be based on a query, i.e. you query your Lucene document set and get back a set of documents with relative scores.

If you want to compare every document with every other (is that right? it's hard to tell from the question) then you need to use a feature of each document as the basis for the queries.

For example, you could extract the top N terms (by frequency, excluding stop words) from each document. If you have X documents then you will have X queries. Then you execute each of your X queries against the index and you get back relative similarities of each document with every other. This is a matrix you could use for classification.
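As a rough illustration of that idea, here is a minimal sketch in plain Python. It does not use Lucene at all: the `score` function is a simple cosine-style stand-in for whatever scores Lucene would return, and the stop list, the function names and the sample documents are all made up for the example. It only shows the shape of the process: one query per document, one row of scores per query.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(text, n):
    """Return the n most frequent non-stop-word terms of a document."""
    return [term for term, _ in Counter(tokenize(text)).most_common(n)]


def score(query_terms, doc_text):
    """Stand-in for a search engine score (NOT Lucene's formula):
    cosine between a binary query vector and the document's term counts."""
    tf = Counter(tokenize(doc_text))
    if not tf or not query_terms:
        return 0.0
    dot = sum(tf[t] for t in query_terms)
    norm = math.sqrt(len(query_terms)) * math.sqrt(sum(c * c for c in tf.values()))
    return dot / norm


def similarity_matrix(docs, n=3):
    """Build one query per document and score it against every document.
    Row i holds the scores of query i against documents 0..X-1."""
    queries = [top_terms(d, n) for d in docs]
    return [[score(q, d) for d in docs] for q in queries]


def document_pairs(matrix):
    """Flatten the matrix into (query_doc, other_doc, score) rows, skipping i == j."""
    size = len(matrix)
    return [(i, j, matrix[i][j]) for i in range(size) for j in range(size) if i != j]


if __name__ == "__main__":
    docs = [
        "lucene index search lucene",
        "lucene search engine",
        "cooking pasta recipes pasta",
    ]
    for row in document_pairs(similarity_matrix(docs, n=2)):
        print(row)
```

Note that the matrix is generally not symmetric, because the query built from document i differs from the one built from document j. With Lucene the scores are also only meaningful relative to other results of the same query, so you may want to normalise each row before feeding the pairs to a classifier.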

Another alternative would be to use the title or synopsis of each document as the basis for the query (again, excluding stop words).
