When is the size of the database call more expensive than the frequency of calls?
Can someone give me a relative idea o开发者_Python百科f when it makes more sense to hit the database many times for small query results vs caching a large number of rows and querying that?
For example, if I have a query returning 2,000 results. And then I have additional queries on those results that take maybe 10-20 items, would it be better to cache the 2000 results or hit the database every time for each set of 10 or 20 results?
Other answers here are correct -- the RDBMS and your data are key factors. However, another key factor is how much time it will take to sort and/or index your data in memory versus in the database. We have one application where, for performance, we added code to grab about 10,000 records into an in-memory DataSet
and then do subqueries on that. As it turns out, keeping that data up to date and selecting out subsets is actually slower than just leaving all the data in the database.
So my advice is: do it the simplest possible way first, then profile it and see if you need to optimize for performance.
It depends on a variety of things. I will list some points that come to mind:
If you have a .Net web app that is caching data in the client, you do not want to pull 2k rows.
If you have a web service, they are almost always better Chunky than Chatty because of the added overhead of XML on the transport.
In a fairly decently normalized and optimized database, there really should be very few times that you have to pull 2k rows out at a time unless you are doing reports.
If the underlying data is changing at a rapid pace, then you should really be careful caching it on the middle tier or the presentation layer because what you present will you will be out of date.
Reports (any DSS) will pull and chomp through much larger data sets, but since they are not interactive, we denormalize and let them have their fun.
In cases of cascading dropdowns and such, AJAX techniques will prove to be more efficient and effective.
I guess I'm not really giving you one answer to your question. "It depends" is the best I can do.
Unless there is a big performance problem (e.g. a highly latent db connection), I'd stick with leaving the data in the database and letting the db take care of things for you. A lot of things are done efficiently on the database level, for example
- isolation levels (what happens if other transactions update the data you're caching)
- fast access using indexes (the db may be quicker to access a few rows than you searching through your cached items, especially if that data already is in the db cache like in your scenario)
- updates in your transaction to the cached data (do you want to deal with updating your cached data as well or do you "refresh" everything from the db)
There are a lot of potential issues you may run into if you do your own caching. You need to have a very good performance reason befor starting to take care of all that complexity.
So, the short answer: It depends, but unless you have some good reasons, this smells like premature optimizaton to me.
in general, network round trip latency is several orders of magnitude greater than the capacity of a database to generate and feed data onto the network, and the capacity of a client box to consume it from a network connection.
But look at the width of your network bus ( Bits/sec ) and compare that to the average round trip time for a database call...
On 100baseT ethernet, for example you are about 12 MBytes / sec data transfer rate. If your average round trip time is say, 200 ms, then your network bus can deliver 3 MBytes in each 200 ms round trip call..
If you're on gigabit ethernet, that number jumps to 30 Mbytes per round trip...
So if you split up a request for data into two round trips, well that's 400 ms, and each query would have to be over 3Mb (or 30Mb for gigibit ) before that would be faster...
This likely varies from RDBMS to RDBMS, but my experience has been that pulling in bulk is almost always better. After all, you're going to have to pull the 2000 records anyway, so you might as well do it all at once. And 2000 records isn't really a large amount, but that depends largely on what you're doing.
My advice is to profile and see what works best. RDBMSes can be tricky beasts performance-wise and caching can be just as tricky.
"I guess I'm not really giving you one answer to your question. "It depends" is the best I can do."
yes, "it depends". It depends on the volatility of the data that you are intending to cache, and it depends on the level of "accuracy" and reliability that you need for the responses that you generate from the data that you intend to cache.
If volatility on your "base" data is low, then any caching you do on those data has a higher probability of remaining valid and correct for a longer time.
If "caching-fault-tolerance" on the results you return to your users is zero percent, you have no option.
The type of data your bringing back affects the decision as well. You don't want to be caching volatile data or data for potential updates that may get stale.
精彩评论