Using a Lucene Index as an Input For Hadoop
I am trying to build an adjacency list out of a corpus. I am thinking of using MapReduce because in-memory solutions have proven to be extremely expensive. The sequence of jobs I have in mind starts from an inverted index, followed by a map job that takes the index as input and computes similarities. I don't particularly want to go through the pain of building my own inverted index --- I want to use a Lucene index, which seems rather easy to generate. However, I am not really clear on how I could take the Lucene index and generate the key/value pairs that a Hadoop Mapper can consume. Could someone clarify how one goes about doing that?
What you need to do is use IndexReader.terms() to enumerate the terms, IndexReader.docFreq(Term t) to get the number of documents that contain each term (for IDF), and IndexReader.termDocs(Term t) to get the term frequency (TF) for each (term, document) pair. Using that info, you should be able to feed the data to the Mapper, which would then do its counting. Note that the termDocs call represents a document by its internal integer number, so you cannot modify the index while doing this computation, as you won't be able to map the document numbers back to documents after the index changes. To get around this, either don't change the index until the results of the reduce step are processed, or, once you have a document number, convert it to an external id by reading the appropriate stored field from the document, and feed that to the Mapper.
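To make the above concrete, here is a minimal sketch that walks a Lucene index and emits one tab-separated line per (term, document) pair, suitable as Mapper input. It assumes the pre-4.0 Lucene API described above (IndexReader.terms()/termDocs()); the index path "index" and the stored id field name "id" are assumptions you would replace with your own.

```java
// Sketch only: dumps (field, term, externalId, tf, df) lines from a
// Lucene 3.x index. Paths and the "id" field name are assumptions.
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexDump {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
        try {
            TermEnum terms = reader.terms();            // enumerate all terms
            while (terms.next()) {
                Term t = terms.term();
                int df = reader.docFreq(t);             // document frequency, for IDF
                TermDocs td = reader.termDocs(t);       // (doc, tf) postings for this term
                while (td.next()) {
                    int docNum = td.doc();              // internal doc number: not stable
                    int tf = td.freq();                 // term frequency in that document
                    // Convert the internal number to a stable external id by
                    // reading a stored field (the field name is hypothetical):
                    String extId = reader.document(docNum).get("id");
                    System.out.println(t.field() + "\t" + t.text()
                            + "\t" + extId + "\t" + tf + "\t" + df);
                }
                td.close();
            }
            terms.close();
        } finally {
            reader.close();
        }
    }
}
```

You could redirect this output to a file on HDFS and let a TextInputFormat-based Mapper split each line, or fold the same enumeration loop directly into a custom InputFormat. The key design point is that all index-dependent lookups (the stored-id conversion in particular) happen while the index is still unchanged.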