Indexing Lucene with Parallel Extensions

I'd like to speed up the indexing of 10GB of data into a Lucene index. Would the TPL (Task Parallel Library) be a good way to do this? Would I need to divide the data up into chunks and then have each thread index a chunk?

To keep the UI responsive would BackgroundWorker be the best approach, or Task, or something else?

Does Solr already do something like this, or would it still be worthwhile to code this myself?


Assuming you are using Java: I've had good experiences indexing with multiple threads. In my experience Lucene indexing is basically CPU-bound, so if you spawn N threads you can keep all N cores busy.

The Lucene IndexWriter handles the concurrency so you don't need to worry about that. Your threads can just call indexWriter.addDocument whenever they are ready to do so.

In one project, the documents came from a SELECT statement from a database. I created N threads and each one took the next document from the ResultSet and added it to the index. The thread exited when there were no more rows and the main thread waited on a CountDownLatch.
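That pattern can be sketched with standard JDK classes. Here a trivial `StubIndexWriter` stands in for Lucene's `IndexWriter` (whose `addDocument` is safe to call from multiple threads), and a shared queue plays the role of the `ResultSet`:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    // Stand-in for Lucene's IndexWriter; the real addDocument is thread-safe.
    static class StubIndexWriter {
        final AtomicInteger docCount = new AtomicInteger();
        void addDocument(String doc) { docCount.incrementAndGet(); }
    }

    static int indexAll(Iterable<String> rows, int nThreads) throws InterruptedException {
        StubIndexWriter writer = new StubIndexWriter();
        // The queue plays the role of the ResultSet: each worker takes the next row.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (String row : rows) queue.add(row);

        CountDownLatch latch = new CountDownLatch(nThreads);
        for (int i = 0; i < nThreads; i++) {
            new Thread(() -> {
                String doc;
                while ((doc = queue.poll()) != null) {  // exit when no more rows
                    writer.addDocument(doc);
                }
                latch.countDown();
            }).start();
        }
        latch.await();  // main thread waits for all workers to finish
        return writer.docCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        java.util.List<String> rows = new java.util.ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add("doc" + i);
        System.out.println(indexAll(rows, 4));  // prints 1000
    }
}
```

With a real `IndexWriter` only the body of `addDocument` changes; the latch-and-worker skeleton stays the same.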

The second project was more complex. The system was "crawling" a set of documents, i.e. it was not clear at the outset how many documents there would be, so it was necessary to maintain a queue of documents that had already been discovered. Analyzing and indexing those documents could uncover further documents, which were then also added to the queue; the queue was seeded at the start with the initial document. I created a class AutoStopThreadPool to manage the threads, and you are welcome to download it if you like. (With the JVM's built-in thread pools you have to add all the tasks and then wait for completion, which wasn't suitable here because processing one task could result in the discovery of new tasks.)
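A minimal reconstruction of that idea (not the actual AutoStopThreadPool class) tracks a count of in-flight tasks: workers may enqueue newly discovered documents, and all threads stop once the pending count drops to zero. The document names and link structure below are made up for illustration:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlIndexer {
    // links maps each document to the documents it "discovers" when analyzed.
    static Set<String> crawl(Map<String, List<String>> links, String seed, int nThreads)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Set<String> indexed = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger();  // tasks enqueued but not yet finished

        indexed.add(seed);
        pending.incrementAndGet();
        queue.add(seed);

        CountDownLatch latch = new CountDownLatch(nThreads);
        for (int i = 0; i < nThreads; i++) {
            new Thread(() -> {
                try {
                    while (pending.get() > 0) {
                        String doc = queue.poll(10, TimeUnit.MILLISECONDS);
                        if (doc == null) continue;  // queue empty, but others may add more
                        // "Analyzing" a document may discover new documents to index.
                        for (String next : links.getOrDefault(doc, List.of())) {
                            if (indexed.add(next)) {      // first time we've seen it
                                pending.incrementAndGet();
                                queue.add(next);
                            }
                        }
                        pending.decrementAndGet();        // this task is done
                    }
                } catch (InterruptedException ignored) {
                } finally {
                    latch.countDown();
                }
            }).start();
        }
        latch.await();
        return indexed;
    }
}
```

The key invariant is that `pending` is incremented for a discovered document *before* the current task decrements itself, so the count can never hit zero while undiscovered work remains.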


If you want multiple threads to write to a single IndexWriter, then I would just spawn one thread that does something like:

Parallel.ForEach(docs, d => writer.AddDocument(d, analyzer));

so that .NET deals with splitting up the data.
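For readers on the JVM side, the rough Java equivalent of that snippet is a parallel stream; the `StubWriter` below is an illustrative stand-in for a real thread-safe `IndexWriter`:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelForEach {
    // Thread-safe stand-in for Lucene's IndexWriter.
    static class StubWriter {
        final AtomicInteger count = new AtomicInteger();
        void addDocument(String doc) { count.incrementAndGet(); }
    }

    static int indexAll(List<String> docs) {
        StubWriter writer = new StubWriter();
        // parallelStream splits the work across the common fork/join pool,
        // much as Parallel.ForEach splits it across the .NET thread pool.
        docs.parallelStream().forEach(writer::addDocument);
        return writer.count.get();
    }

    public static void main(String[] args) {
        List<String> docs = IntStream.range(0, 500)
                .mapToObj(i -> "doc" + i)
                .collect(Collectors.toList());
        System.out.println(indexAll(docs));  // prints 500
    }
}
```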

At large index sizes, some people see performance improvements from writing to several separate indexes and then merging them together. My understanding is that this is only really useful for truly massive indexes, but if you want to do it you will probably need to split up the data yourself. In that case a more full-featured library like the TPL might be useful.
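A sketch of that partition-then-merge shape, using plain lists as stand-in "indexes" (with real Lucene each partition would get its own IndexWriter and Directory, and the merge step would be IndexWriter.addIndexes):

```java
import java.util.*;
import java.util.concurrent.*;

public class PartitionedIndexing {
    // Each "index" here is just a list; the point is the split/build/merge shape.
    static List<String> indexInPartitions(List<String> docs, int nPartitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nPartitions);
        List<Future<List<String>>> partials = new ArrayList<>();
        int chunk = (docs.size() + nPartitions - 1) / nPartitions;  // ceiling division

        // Split the data ourselves and build one small index per chunk in parallel.
        for (int i = 0; i < docs.size(); i += chunk) {
            List<String> slice = docs.subList(i, Math.min(i + chunk, docs.size()));
            partials.add(pool.submit(() -> new ArrayList<>(slice)));
        }

        // Merge the partial indexes into a single one.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : partials) merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }
}
```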

Solr is inherently multi-threaded, so you would use the exact same snippet as before, except that instead of calling the writer directly you would call your REST/SolrNet method.

As a general rule, if you ask "Should I use Solr or build it myself?" the answer is almost always "use Solr". I can't think of any reason you would want to build it yourself here, unless your JVM is really bad or you really hate Java.
