Using a Lucene Index as an Input For Hadoop
I am trying to build an adjacency list out of a corpus. I am thinking of using MapReduce because in-memory solutions have proven to be extremely expensive. The sequence of jobs I have in mind starts from an inverted index, followed by a map job that takes the index as input and computes similarities. I don't particularly want to go through the pain of building my own inverted index --- I want to use a Lucene index, which seems rather easy to generate. However, I am not really clear on how I could take the Lucene index and generate the key/value pairs that a Hadoop Mapper can consume. Could someone clarify how one goes about doing that?
What you need to do is use IndexReader.terms() to enumerate the terms, IndexReader.docFreq(Term t) to get the number of documents that contain each term (for IDF), and IndexReader.termDocs(Term t) to get the term frequency (TF) for each (term, document) pair. Using that info, you should be able to feed the data to the Mapper, which would then do its counting. Note that the termDocs call represents a document by its internal integer number, so you cannot modify the index while doing this computation, as you won't be able to map the document numbers back to documents after the index changes. To get around this, either don't change the index until the results of the reduce step are processed, or, once you have a document number, convert it to an external id by reading the appropriate stored field from the document, and feed that to the Mapper.
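To make the above concrete, here is a minimal sketch that walks a Lucene index and emits one tab-separated line per (term, document) pair, suitable as Mapper input. It assumes the pre-4.0 Lucene API described above (IndexReader.terms()/termDocs()); the index path "index" and the stored id field name "id" are assumptions you would replace with your own.

```java
// Sketch only: dumps (field, term, externalId, tf, df) lines from a
// Lucene 3.x index. Paths and the "id" field name are assumptions.
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexDump {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
        try {
            TermEnum terms = reader.terms();            // enumerate all terms
            while (terms.next()) {
                Term t = terms.term();
                int df = reader.docFreq(t);             // document frequency, for IDF
                TermDocs td = reader.termDocs(t);       // (doc, tf) postings for this term
                while (td.next()) {
                    int docNum = td.doc();              // internal doc number: not stable
                    int tf = td.freq();                 // term frequency in that document
                    // Convert the internal number to a stable external id by
                    // reading a stored field (the field name is hypothetical):
                    String extId = reader.document(docNum).get("id");
                    System.out.println(t.field() + "\t" + t.text()
                            + "\t" + extId + "\t" + tf + "\t" + df);
                }
                td.close();
            }
            terms.close();
        } finally {
            reader.close();
        }
    }
}
```

You could redirect this output to a file on HDFS and let a TextInputFormat-based Mapper split each line, or fold the same enumeration loop directly into a custom InputFormat. The key design point is that all index-dependent lookups (the stored-id conversion in particular) happen while the index is still unchanged.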