Which is the fastest performing way to avoid n+1 issues and why?
I'm looking to add some utility methods to help avoid a lot of n+1 issues in a legacy application.
The common pattern is this:
select a.* /* over 10 columns */
from [table-A] a
where /* something */
Retrieved into a collection of ClassA
record instances
Then the sub-instances are lazy retrieved:
select b.* /* over 10 columns */
from [sub-table-B] b
where b.ParentId = @ClassA_ID
This results in an n+1 selects issue. Mostly this isn't a major problem as only a couple of ClassA
instances are being retrieved on an infrequently hit page, but in an increasing number of places this n+1 issue causes the pages to become too slow as the application has scaled.
I'm looking to replace a part of the existing data access code of this application so that the ClassA
instances and ClassB
instances are retrieved together.
I think there are 3 ways this could be done:
1) Get the ClassA
instances as we do now, then get the ClassB
instances in one aggregated call:
select b.*
from [sub-table-B] b
where b.ParentId in ( /* list of parent IDs */ )
This is two separate DB calls, and the query plan of the dynamic SQL will not be cacheable (due to the list of IDs).
2) Get the ClassB
instances with a sub query:
select b.*
from [sub-table-B] b
inner join [table-A] a
on b.ParentId = a.[ID]
where /* something */
This is also two DB calls, and query against [table-A]
has to be evaluated twice.
3) Get all together and de-dupicate the ClassA
instances:
select a.*, b.*
from [table-A] a
left outer join [sub-table-B] b
on a.[ID] = b.ParentId
where /* something */
This is just one DB call, but now we get the contents of [table-A]
repeated - the result set will be larger and the time sending the data from the DB to the client will be more.
So really this is 3 possible compromises:
- 2 DB calls, no query caching
- 2 DB calls, complex query evaluated twice
- 1 DB call, significantl开发者_Python百科y larger result set
I can test these three patterns for any one parent-child pair of tables, but I have loads of them. What I want to know is which pattern is consistently quicker? More importantly why? Is one of these compromises an obvious performance-killer?
What do existing mechanisms like Linq, EF and NHibernate use?
Is there a 4th way that's better than all 3?
I think EF and L2S use your third approach - there is definitly just one db call.
Normally more db roundtrips take more time than less db roundtrips with bigger resultsets.
Maybe there are some edge cases where you have massive data in table A and the bigger resultset increases transfer time to the clien too much.
But thats mainly a question of latency and bandwith between db server and client.
A 4th way might be to write a stored proc which returns more than one resultset. One for each table you query with just the records you need. That fits to your 1st approach but reduced to one roundtrip. But that would complicate things a bit and is not as flexible as the other approaches.
In my opinion, "which is the fastest way" depends on the latency and bandwidth to your database server, and also how big are your resultsets.
In a scenario where latency is the bottleneck (ADSL network?), and if your resultsets are not huge, you better have send one single query to your server. Bandwidth used will be bigger, due to the fact [table-A] record are sent multiple times, but globally speaking this might be the fastest way to get your data to the client.
Most modern databases (Oracle for sure if you use parameterised queries) will cache the query evaluation and you will encounter very little hit on them.
Some ORMs like Django's will allow you to create a custom query and return only partial results that you need to render the page. This is a good approach - if you see a DB hotspot optimise it, but otherwise leave ORM to do its bidding.
Remember, hardware is cheap (two days of consultant's work cost the same as a server upgrade), regardless what your finance manager says.
精彩评论