Solr incremental backup on a real-time system with a heavy index

I am implementing a search engine with Solr that imports at least 2 million documents per day. Users must be able to search the imported documents as soon as possible (near real-time).

I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr in shard mode). Each server indexes about 120 million documents, about 220 GB (total 500 GB).

I want to take incremental backups of the Solr index files while updates and searches are running.

After searching, I found rsync for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get an error (vanished) during updates.

How can I solve this problem?

Note 1: Plain file copy is really slow when the files are very large, so I can't use that approach.

Note 2: Can I prevent index file corruption during an update if Windows crashes, the hardware resets, or some other problem occurs?


You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:

http://host:8080/solr/replication?command=backup&location=/home/jboss/backup

Obviously you could script that with wget+cron.
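If wget is not at hand (for example on Windows), the same request could be sent from a small script. The following is only a sketch: the function names `build_backup_url` and `trigger_backup` are made up for illustration, and the base URL and backup location are simply the ones from the example URL above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_backup_url(base_url, location):
    """Return the ReplicationHandler URL that starts a backup into `location`."""
    # safe="/" leaves slashes unescaped so the URL looks like the one shown above.
    query = urlencode({"command": "backup", "location": location}, safe="/")
    return f"{base_url}/replication?{query}"


def trigger_backup(base_url, location, timeout=30):
    """Send the backup request and return the HTTP status code."""
    # Hypothetical helper: it only fires the request, it does not wait
    # for the backup itself to finish.
    with urlopen(build_backup_url(base_url, location), timeout=timeout) as resp:
        return resp.status
```

Calling `trigger_backup("http://host:8080/solr", "/home/jboss/backup")` from a scheduled job would then play the role of wget+cron; on Windows the Task Scheduler can take the place of cron.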

More details can be found here:

http://wiki.apache.org/solr/SolrReplication

The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.


Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.

Some ideas to work around it:

  • Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
  • Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
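The passive-core idea could be scripted roughly as follows. This is a sketch under assumptions: the core URL and the helper names `replication_url` and `passive_backup_plan` are hypothetical, and it assumes the passive core is a replication slave, so that polling can be paused with the `disablepoll` and `enablepoll` commands described on the SolrReplication wiki. It only builds the three requests in order and does not send them.

```python
from urllib.parse import urlencode


def replication_url(core_url, command, **params):
    """Build a ReplicationHandler URL for one command on the given core."""
    query = urlencode({"command": command, **params}, safe="/")
    return f"{core_url}/replication?{query}"


def passive_backup_plan(passive_core_url, location):
    """Return the requests to issue, in order, against the passive core."""
    return [
        # 1. Stop the passive core from pulling new index versions.
        replication_url(passive_core_url, "disablepoll"),
        # 2. Snapshot the now-quiet index into `location`.
        replication_url(passive_core_url, "backup", location=location),
        # 3. Resume replication so the passive core catches up again.
        replication_url(passive_core_url, "enablepoll"),
    ]
```

The backup request may return before the snapshot is complete, so a real script would probably need to confirm that the backup has finished before sending the final `enablepoll` request.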