开发者

Specific SQL Query Optimization Help Needed

So I'm working on a data mining project where we're looking at code elements and their relationships and changes to these things over time. What we want is to ask some questions about how often related elements are changed. I've set it up as a view, but it's taking like 10 min to run. I believe the problem is that I'm having to do a lot of subtraction, concatenation, and string comparisons to compare entries (for our window size), but I don't know a good way to fix this. The query looks like

select aw.same
     , rw.k
     , count(distint concat_ws(',', r1.id, r2.id)) as num  
  from deltamethoddeclaration dmd1
    join revision r1
      on r1.id=FKrevID 
    join methodinvocation mi
      on mi.FKcallerID = dmd1.FKMDID 
    join deltamethoddeclaration dmd2 
      on mi.FKcalleeID = dmd2.FKMDID
    join revision r2 
      on r2.id = dmd2.FKrevID
    join revisionwindow rw
    join authorwindow aw
  where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
    and (开发者_如何转开发dmd2.FKrevID - dmd1.FKrevID) < rw.k
    and case aw.same
          when 1 then
            r1.author = r2.author
          when 0 then
            r1.author <> r2.author
          else
            1=1
         end
  group by aw.same
         , rw.k
;

Ok, so revisionwindow stores the revision windows we're interested in (10, 20, 50, 100) and authorwindow stores which author types we want (same, different, and don't care). Part of the problem is, we could have the same revision pair with different elements matching, so the only hack i could come up with was that ugly count(distinct concat()) thing. This should return a table with 12 rows, one for each combination of the author and revision windows. The entries under 'num' are the unique pairs of revisions related in the manner specified (in this case, both change methods and one of the methods calls the other). It works perfectly, it's just crazy slow (~10 min running time). I'm basically looking for any advice or help to make this work better without sacrificing accuracy.


  • where (dmd1.FKrevID - dmd2.FKrevID) < rw.k

    The most damaging about this statement is the less than operator < not the arithmetic. B-trees cannot use this and forces a full table scan every time, any time. Gory details why this true: http://explainextended.com/2010/05/19/things-sql-needs-determining-range-cardinality/

  • I doubt your CASE statement can be optimized by the backend and <> operator suffers from the same problem as above. I would think about ways to join with = operators, perhaps breaking up the query and using UNION statements so you can always use indexes.

  • Your not using EXPLAIN. You need to start using it to optimize queries. You have no idea what indexes are being used and what are not, or if your condition is selective enough where they would even be helpful (if its not very selective see the last point) http://dev.mysql.com/doc/refman/5.0/en/explain.html

  • Since this a data mining application you have a great opportunity to use temp tables of intermediate values. Since the data is probably dumped at periodic intervals (or maybe even only once!) it is easy to rebuild the long running temp table every so often without running the risk of data corruption (Or it may just not matter since you looking for aggregate patterns.)

    I have taken queries that were running over 60 minutes and reduced them to less than 100 ms (instant) by building temp tables that cached the hard stuff. If you are not able to use any of the ideas above, this is probably the lowest lying fruit. Take all the 'hard stuff' - case joins and non equality joins and do it one place. Then add an index to your temp table :-) The trick is to make it general enough that you can query the temp table so you still have flexibility to ask different questions.


I suspect the two joins (join revisionwindow rw) and (join authorwindow aw) that do not have an ON condition but use the WHERE, cause this.

How many records do these two tables have? MySQL probably does first a CROSS JOIN on these and only later checks the complex (WHERE) conditions.

But please post the results of EXPLAIN.

--EDIT--
Oops, I missed your last paragraph which explains that the two tables have 4 and 3 rows.

Can you try this: (where the concat has been replaced and the where clauses have been moved as JOIN ON ...)

select aw.same
     , rw.k
     , count(distint r1.id, r2.id) as num  
  from deltamethoddeclaration dmd1
    join revision r1
      on r1.id = dmd1.FKrevID 
    join methodinvocation mi
      on mi.FKcallerID = dmd1.FKMDID 
    join deltamethoddeclaration dmd2 
      on mi.FKcalleeID = dmd2.FKMDID
    join revision r2 
      on r2.id = dmd2.FKrevID
    join revisionwindow rw
      on (dmd1.FKrevID - dmd2.FKrevID) < rw.k
         and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
    join authorwindow aw
      on case aw.same
           when 1 then
             r1.author = r2.author
           when 0 then
             r1.author <> r2.author
           else
             1=1
          end
  group by aw.same
         , rw.k
;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜