Solr caching with EHCache/BigMemory

We are implementing a large Lucene/Solr setup with more than 150 million documents. We will also have a moderate volume of document updates every day.

My question is really a two-part one:

What are the implications of using a different caching implementation within Solr, e.g. EHCache instead of the native Solr LRUCache/FastLRUCache?

Terracotta has announced BigMemory, which is meant to be used in conjunction with EHCache as an in-process off-heap cache. According to Terracotta, this allows you to store large amounts of data without the GC overhead of the JVM. Is this a good idea to use with Solr? Will it actually help?

I would especially like to hear from people with real production experience with EHCache/BigMemory and/or Solr cache tuning.


I have lots of thoughts on this topic, though my answer doesn't involve EhCache in any way.

First, I don't believe documents should be stored in your search index. Searchable content should be stored there, not the entire document. What I mean is that a search query should return document IDs, not the contents of the documents themselves. The documents themselves should be stored in and retrieved from a second system, probably the original file store they were indexed from in the first place. This will reduce index size, decrease your document cache size, decrease master-slave replication time (this can become a bottleneck if you update often), and decrease the overhead of writing search responses.
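For illustration only (the field names here are hypothetical), a minimal schema.xml along these lines indexes the body text but stores only the ID:

    <!-- schema.xml sketch: index the text, store only the unique ID -->
    <fields>
      <field name="id"   type="string" indexed="true" stored="true" required="true"/>
      <field name="body" type="text"   indexed="true" stored="false"/>
    </fields>
    <uniqueKey>id</uniqueKey>

A query like /select?q=body:foo&fl=id then returns nothing but IDs, which you resolve against your document store.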

Next, consider putting a reverse HTTP proxy in front of Solr. Although the query caches allow Solr to respond quickly, a cache like Varnish sitting in front of Solr is even faster. This offloads Solr, allowing it to spend its time responding to queries it hasn't seen before. The second effect is that you can now throw most of your memory at document caches instead of query caches. If you follow my first suggestion, your documents will be incredibly small, allowing you to keep most, if not all, of them in memory.
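If you go this route and let the proxy honor standard HTTP cache headers, Solr needs to emit them; a rough sketch of the relevant solrconfig.xml setting (the max-age value is purely illustrative) looks like this:

    <!-- solrconfig.xml, inside <requestDispatcher>: let HTTP caches reuse responses -->
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
      <cacheControl>max-age=60, public</cacheControl>
    </httpCaching>

With that in place, Varnish (or any HTTP cache) can serve repeated queries without touching Solr at all.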

A quick back-of-the-envelope calculation for document sizes: a 32-bit int is easily enough as an ID for 150 million documents, and still leaves 10x headroom for document growth. 150 million IDs take up 600MB. Add in a fudge factor for Solr wrapping documents, and you can probably have all your Solr documents cached in 1-2GB. Considering that getting 12GB-24GB of RAM is easy nowadays, I'd say you could do this all on one box and get incredible performance. No need for anything extraneous like EhCache. Just make sure you use your search index as efficiently as possible.

Regarding GC: I didn't see a lot of GC time spent on my Solr servers. Most of what needed to be collected were the very short-lived objects involved in the HTTP request/response cycle, which never get out of eden space. The caches didn't have high turnover when tuned correctly; the only large changes were when a new index was loaded and the caches were flushed, but that wasn't happening constantly.
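For reference, the tuning I mean happens in solrconfig.xml; the sizes below are purely illustrative, not a recommendation for your data:

    <!-- solrconfig.xml sketch: native Solr caches; size/autowarmCount depend on your query mix -->
    <filterCache      class="solr.FastLRUCache" size="4096"    initialSize="1024"  autowarmCount="256"/>
    <queryResultCache class="solr.LRUCache"     size="16384"   initialSize="4096"  autowarmCount="1024"/>
    <documentCache    class="solr.LRUCache"     size="1000000" initialSize="65536"/>

Autowarming the filter and query-result caches keeps hit rates up after a new searcher is opened; the document cache can't be autowarmed because internal document IDs change between index versions.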

EDIT: For background, I spent considerable time tuning Solr caching for a large company that sells consoles and serves millions of searches per day from its Solr servers.


I'm not sure anyone has tried this yet. Certainly we would love to partner up with the Solr guys to find out how useful this would be. We might even be able to optimize it for the use case.
