Specific SQL Query Optimization Help Needed
So I'm working on a data mining project where we're looking at code elements and their relationships and changes to these things over time. What we want is to ask some questions about how often related elements are changed. I've set it up as a view, but it's taking like 10 min to run. I believe the problem is that I'm having to do a lot of subtraction, concatenation, and string comparisons to compare entries (for our window size), but I don't know a good way to fix this. The query looks like
select aw.same
, rw.k
, count(distint concat_ws(',', r1.id, r2.id)) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id=FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
join authorwindow aw
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (开发者_如何转开发dmd2.FKrevID - dmd1.FKrevID) < rw.k
and case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;
Ok, so revisionwindow stores the revision windows we're interested in (10, 20, 50, 100) and authorwindow stores which author types we want (same, different, and don't care). Part of the problem is, we could have the same revision pair with different elements matching, so the only hack i could come up with was that ugly count(distinct concat()) thing. This should return a table with 12 rows, one for each combination of the author and revision windows. The entries under 'num' are the unique pairs of revisions related in the manner specified (in this case, both change methods and one of the methods calls the other). It works perfectly, it's just crazy slow (~10 min running time). I'm basically looking for any advice or help to make this work better without sacrificing accuracy.
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
The most damaging about this statement is the less than operator
<
not the arithmetic. B-trees cannot use this and forces a full table scan every time, any time. Gory details why this true: http://explainextended.com/2010/05/19/things-sql-needs-determining-range-cardinality/I doubt your
CASE
statement can be optimized by the backend and<>
operator suffers from the same problem as above. I would think about ways to join with=
operators, perhaps breaking up the query and usingUNION
statements so you can always use indexes.Your not using
EXPLAIN
. You need to start using it to optimize queries. You have no idea what indexes are being used and what are not, or if your condition is selective enough where they would even be helpful (if its not very selective see the last point) http://dev.mysql.com/doc/refman/5.0/en/explain.htmlSince this a data mining application you have a great opportunity to use temp tables of intermediate values. Since the data is probably dumped at periodic intervals (or maybe even only once!) it is easy to rebuild the long running temp table every so often without running the risk of data corruption (Or it may just not matter since you looking for aggregate patterns.)
I have taken queries that were running over 60 minutes and reduced them to less than 100 ms (instant) by building temp tables that cached the hard stuff. If you are not able to use any of the ideas above, this is probably the lowest lying fruit. Take all the 'hard stuff' - case joins and non equality joins and do it one place. Then add an index to your temp table :-) The trick is to make it general enough that you can query the temp table so you still have flexibility to ask different questions.
I suspect the two joins (join revisionwindow rw
) and (join authorwindow aw
) that do not have an ON condition but use the WHERE, cause this.
How many records do these two tables have? MySQL probably does first a CROSS JOIN on these and only later checks the complex (WHERE) conditions.
But please post the results of EXPLAIN
.
--EDIT--
Oops, I missed your last paragraph which explains that the two tables have 4 and 3 rows.
Can you try this: (where the concat has been replaced and the where clauses have been moved as JOIN ON ...)
select aw.same
, rw.k
, count(distint r1.id, r2.id) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id = dmd1.FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
on (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
join authorwindow aw
on case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;
精彩评论