How to determine what the best dataset to cache?

2023-03-22 16:47 问答作者：

I have a simple web service which serves XML datasets (they can be as big as 250MB). This data comes from complex queries executed against a database. To speed up the service, I would like to cache the result of some of t开发者_如何学JAVAhe queries. However I have a limited amount of RAM (~2GB). I don’t know in advance what the most requested XML dataset is. In addition, this can change over time (e.g. yesterday the dataset X is the most often requested, tomorrow it can be dataset Y).

I would like an "intelligent" cache algorithm which would cache the datasets which are the most likely to be requested. In this case, I cannot simply go with counters and cache the most often requested piece of data. I need some sort of time decay of the number of requests.

One option would be to do http://en.wikipedia.org/wiki/Exponential_smoothing of the time between requests, or of the number of requests in successive minutes. If your documents are indeed large, you have the option of keeping some information with the document when it is out of cache, so you could at least try a wider set of approaches than those typically used for page replacement in VM, such as LRU, which track requests only for objects in the cache.

Assuming you have web logs, you could work out what the hit rate of any number of different approaches would have been, just by trying them on the series of requests recorded in the logs.

You can use LRU. Every time you access something not in the cache, replace the the thing is the cache used longest ago, and set its age to 0, incrementing all other ages. Every time you have a cache hit, reset the element's age and increment all others. Can also be done by setting equal to current timestamp.

Note: LRU is often used as an approximation of the optimal algorithm which requires oracular knowledge: replace the one which will be not be used for the longest time. LRU works well when locality is good, and does not suffer from Belady's anomaly.

Why don't you read some articles on general cache structures?:

http://en.wikipedia.org/wiki/Cache

I'd like to recommend to read an article regarding CPU cache as well:

http://en.wikipedia.org/wiki/CPU_cache

For example, based on the notations of CPU cache, you could implement your cache as a fully associative cache with LRU replacement algorithm. You could also try a cache with 4-way set associative cache. (The definition of set in your case might be ambiguous, though)

In general, LRU is a near optimal cache replacement algorithm. LRU can be simply implemented by using time stamp, or there's a couple of approximated algorithms.

However, it really depends on the locality patterns (both spatial and temporal) of your workload. We can't simply say LRU is always good. So, you need to have better understanding on the behavior of your workload.

继续阅读：algorithm caching

How to determine what the best dataset to cache?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？