How to determine what the best dataset to cache?
I have a simple web service which serves XML datasets (they can be as big as 250MB). This data comes from complex queries executed against a database. To speed up the service, I would like to cache the result of some of t开发者_如何学JAVAhe queries. However I have a limited amount of RAM (~2GB). I don’t know in advance what the most requested XML dataset is. In addition, this can change over time (e.g. yesterday the dataset X is the most often requested, tomorrow it can be dataset Y).
I would like an "intelligent" cache algorithm which would cache the datasets which are the most likely to be requested. In this case, I cannot simply go with counters and cache the most often requested piece of data. I need some sort of time decay of the number of requests.
One option would be to do http://en.wikipedia.org/wiki/Exponential_smoothing of the time between requests, or of the number of requests in successive minutes. If your documents are indeed large, you have the option of keeping some information with the document when it is out of cache, so you could at least try a wider set of approaches than those typically used for page replacement in VM, such as LRU, which track requests only for objects in the cache.
Assuming you have web logs, you could work out what the hit rate of any number of different approaches would have been, just by trying them on the series of requests recorded in the logs.
You can use LRU. Every time you access something not in the cache, replace the the thing is the cache used longest ago, and set its age to 0, incrementing all other ages. Every time you have a cache hit, reset the element's age and increment all others. Can also be done by setting equal to current timestamp.
Note: LRU is often used as an approximation of the optimal algorithm which requires oracular knowledge: replace the one which will be not be used for the longest time. LRU works well when locality is good, and does not suffer from Belady's anomaly.
Why don't you read some articles on general cache structures?:
http://en.wikipedia.org/wiki/Cache
I'd like to recommend to read an article regarding CPU cache as well:
http://en.wikipedia.org/wiki/CPU_cache
For example, based on the notations of CPU cache, you could implement your cache as a fully associative cache with LRU replacement algorithm. You could also try a cache with 4-way set associative cache. (The definition of set in your case might be ambiguous, though)
In general, LRU is a near optimal cache replacement algorithm. LRU can be simply implemented by using time stamp, or there's a couple of approximated algorithms.
However, it really depends on the locality patterns (both spatial and temporal) of your workload. We can't simply say LRU is always good. So, you need to have better understanding on the behavior of your workload.
精彩评论