How to reindex all docs in Solr data
I am going to change some field types in the schema, so it seems I must re-index all the docs in the current Solr index after this kind of change.
The question is how to "re-index" all the docs. One solution I can think of is to "query" all docs through the search interface and dump them to a large XML or JSON file, then convert that to the input XML format for Solr, and load it back into Solr so the schema change takes effect.
Is there a better, more efficient way to do this? Thanks for your suggestions.
First of all, dumping the results of a query may not give you the original data if you have fields that are indexed but not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete>, then <commit/>, and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
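But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. Keep in mind that an optimize only merges index segments and does not re-analyze documents that are already indexed, so it would be good to have a backup where you can test that it works for your configuration.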
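The delete, commit, optimize sequence described above can be sketched in Python as follows. This is only an illustration: the helper names are made up here, and the update URL passed in (host, port, core name) is an assumption about your setup, not something taken from the answer.

```python
import urllib.request

# Update commands from the answer above, in the order they are sent.
RESET_COMMANDS = [
    "<delete><query>*:*</query></delete>",  # delete every document
    "<commit/>",                            # make the deletion visible
    "<optimize/>",                          # merge segments, purging deleted docs
]


def build_request(update_url, xml_command):
    """Build a POST request carrying one XML update command."""
    return urllib.request.Request(
        update_url,
        data=xml_command.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def reset_index(update_url, send=urllib.request.urlopen):
    """Send each reset command in order; `send` is injectable for dry runs."""
    for command in RESET_COMMANDS:
        response = send(build_request(update_url, command))
        # urlopen returns a closable response; a dry-run stub may return None
        if response is not None:
            response.close()
```

With a real server this would be called as reset_index("http://<solr_host>:<port>/solr/<core_name>/update"), after which the index is empty and ready for documents that match the new schema.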
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.
The idea of dumping all the results of a query could give you incomplete or invalid data since you might not surface all of the data within your index.
While the idea of keeping a copy of your index in a form in which you can re-insert it would work well in a situation where the data doesn't change, it becomes more complicated when you've added a new field to the schema. In such a situation, you'll need to collect all the data from the source, format the data to match the new schema and then insert it.
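As a rough illustration of "format the data to match the new schema", the sketch below maps one source record onto a new schema. All field names are hypothetical: it assumes a price field whose type changed from string to float and a newly added title_sort field derived from the title.

```python
def to_new_schema(record):
    """Map one source record onto the (hypothetical) new schema."""
    title = record["title"].strip()
    return {
        "id": str(record["id"]),
        "title": title,
        # example of a field whose type changed from string to float
        "price": float(record["price"]),
        # example of a field that only exists in the new schema
        "title_sort": title.lower(),
    }
```

Each record read from the original data source would go through such a function before being posted to Solr, so the new field is populated from source data rather than from whatever the old index happened to store.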
If the number of documents in Solr is large and you need to keep the Solr server available for querying, the indexing job can be run in the background to re-add/re-index the documents.
It is helpful to introduce a new field that stores a last-indexed timestamp for each document, so that if any indexing/re-indexing issues occur, you can identify the documents that are still waiting to be reindexed.
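A minimal sketch of the timestamp idea, assuming a date field named last_indexed_dt (a hypothetical name that would have to be added to the schema): each document is stamped when it is (re)indexed, and a range query finds the ones the background job has not reached yet.

```python
from datetime import datetime, timezone

LAST_INDEXED_FIELD = "last_indexed_dt"  # hypothetical field name


def solr_timestamp(moment):
    """Format an aware datetime as a UTC timestamp with a 'Z' suffix."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def stamp(doc, moment):
    """Return a copy of doc carrying the time it was (re)indexed."""
    stamped = dict(doc)
    stamped[LAST_INDEXED_FIELD] = solr_timestamp(moment)
    return stamped


def pending_query(reindex_started):
    """Query matching documents not reindexed since `reindex_started`."""
    cutoff = solr_timestamp(reindex_started)
    # The closing curly brace makes the upper bound exclusive; the second
    # clause catches documents that never received the field at all.
    return (
        f"{LAST_INDEXED_FIELD}:[* TO {cutoff}}} "
        f"OR (*:* -{LAST_INDEXED_FIELD}:[* TO *])"
    )
```

Running the query from pending_query with the time the reindexing job started shows how much work is left, and an empty result means every document has been re-added.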
To improve query latency, you can tune the configuration parameters (for example, cache autowarming) so that the caches stay warm after every commit.
There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
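For readers who prefer Python, the same fetch-and-reinsert idea might look roughly like the sketch below. It is not a port of that PHP script. It assumes a Solr version that supports cursorMark paging, a unique key field named id, and the usual select and update handler URLs, and it can only recover stored fields, as noted in the earlier answers.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(select_url, cursor, rows=500, unique_key="id"):
    """Fetch one page of stored documents using cursorMark paging."""
    params = urllib.parse.urlencode({
        "q": "*:*",
        "rows": rows,
        "sort": f"{unique_key} asc",  # cursors need a sort on the unique key
        "cursorMark": cursor,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{select_url}?{params}") as response:
        return json.load(response)


def iter_documents(select_url, fetch=fetch_page):
    """Yield every stored document until the cursor stops advancing."""
    cursor = "*"
    while True:
        page = fetch(select_url, cursor)
        for doc in page["response"]["docs"]:
            # _version_ is maintained by Solr itself, so it is not sent back
            yield {k: v for k, v in doc.items() if k != "_version_"}
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:
            return
        cursor = next_cursor


def build_add_request(update_url, docs):
    """Build a POST that re-adds a batch of documents as a JSON array."""
    return urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reindex(select_url, update_url, batch_size=500,
            fetch=fetch_page, send=urllib.request.urlopen):
    """Re-add every stored document in batches; return how many were sent."""
    total = 0
    batch = []

    def flush():
        nonlocal total
        if batch:
            response = send(build_add_request(update_url, batch))
            if response is not None:  # a test stub may return None
                response.close()
            total += len(batch)
            batch.clear()

    for doc in iter_documents(select_url, fetch):
        batch.append(doc)
        if len(batch) >= batch_size:
            flush()
    flush()
    return total
```

The sketch does not commit; a <commit/> would still have to be sent once reindex returns. Batching keeps each request small, which matters when the index is large.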
To optimize, call from the command line:
curl 'http://<solr_host>:<port>/solr/<core_name>/update' -F stream.body='<optimize />'