开发者

Remove redundant SQL code

Code

The following code calculates the slope and intercept for a linear regression against a slathering of data. It then applies the equation y = mx + b against the same result set to calculate the value of the regression line for each row.

How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE clause twice?

The general form of the problem is:

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines

I have SOME_CONDITION executing twice. This is shown below (updated with a STRAIGHT_JOIN optimization):

SELECT STRAIGHT_JOIN
  AVG(D.AMOUNT) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  CITY C,
  STATION S,
  STATION_DISTRICT SD,
  YEAR_REF Y,
  MONTH_REF M,
  DAILY D,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,

    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      --
      $X{ IN, C.ID, CityCode } AND

      -- Find all the stations within a specific unit radius ...
      --
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

      SD.ID = S.STATION_DISTRICT_ID AND

      -- Gather all known years for that station ...
      --
      Y.STATION_DISTRICT_ID = SD.ID AND

      -- The data before 1900 is shaky; insufficient after 2009.
      --
      Y.YEAR BETWEEN 1900 AND 2009 AND

      -- Filtered by all known months ...
      --
      M.YEAR_REF_ID = Y.ID AND

      -- Whittled down by category ...
      --
      M.CATEGORY_ID = $P{CategoryCode} AND

      -- Into the valid daily climate data.
      --
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb
WHERE
  -- For a specific city ...
  --
  $X{ IN, C.ID, CityCode } AND

  -- Find all the stations within a specific unit radius ...
  --
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

  SD.ID = S.STATION_DISTRICT_ID AND

  -- Gather all known years for that station ...
  --
  Y.STATION_DISTRICT_ID = SD.ID AND

  -- The data before 1900 is shaky; insufficient after 2009.
  --
  Y.YEAR BETWEEN 1900 AND 2009 AND

  -- Filtered by all known months ...
  --
  M.YEAR_REF_ID = Y.ID AND

  -- Whittled down by category ...
  --
  M.CATEGORY_ID = $P{CategoryCode} AND

  -- Into the valid daily climate data.
  --
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

Question

How do I execute the duplicate bits only once per query, instead of twice? The duplicate code:

  $X{ IN, C.ID, CityCode } AND
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
  SD.ID = S.STATION_DISTRICT_ID AND
  Y.STATION_DISTRICT_ID = SD.ID AND
  Y.YEAR BETWEEN 1900 AND 2009 AND
  M.YEAR_REF_ID = Y.ID AND
  M.CATEGORY_ID = $P{CategoryCode} AND
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

Update 1

Using variables and splitting the query seems to allow the cache to kick in as this now runs in 3.5 seconds, whereas it used to run in 7. Still, if there is any way to remove the duplicate code, I'd be grateful for any help.

Update 2

The above code does not run in JasperReports, and a VIEW, while a possible fix, would probably be extremely inefficient (because the WHERE clauses are parameterized).

Update 3

Validating distance using Unreason's suggestion of the Pythagorean formula with converging meridians:

  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
    POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )

(This is unrelated to the question, but should someone else want to know ...)

Update 4

The code, as shown, works in JasperReports, running against a MySQL database. JasperReports does not allow variables or multiple queries.

Update 5

Am looking for a solution that executes cleanly. ;-) I have written a number of partially working solutions, but MySQL, sadly, does not understand partially correct. See the discussions with Unreason for answers that almost work.

Update 6

I might be able to reuse variables from the first WHERE clause and compare them to the second (thereby eliminating some duplication -- the checks against $P{} values), but I'd really like the duplication eliminated.

Update 7

Comparing the YEAR clause, as hypothesized in the previous update, to eliminate the duplicate BETWEEN, does not work.

Related

How to eliminate dupli开发者_JAVA百科cate calculation in SQL?

Thank you!


You should be able to get everything you need in one go:

 SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR,
    MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
    Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,             
    ((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /                  
    (stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,                     
    ((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,   
    ((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
    (sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) / 
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
 FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
 WHERE
    $X{ IN, C.ID, CityCode } AND
    SQRT(
        POW( C.LATITUDE - S.LATITUDE, 2 ) +
        POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
    Y.YEAR BETWEEN 1900 AND 2009 AND
    M.YEAR_REF_ID = Y.ID AND
    M.CATEGORY_ID = $P{CategoryCode} AND
    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
 GROUP BY
    Y.YEAR

The things will not work straight from the query above (it has nonsensically combined aggregates and other errors); this can be a good time to check your formulas

If you decide to do sub queries do simplify the formulas, then:

  • you can grab (you do grab) all the necessary data in the inner most query and you don't have to repeat all the tables in the outer queries any more (just select the relevant columns from the t, they are already at your disposal)
  • you don't have to repeat the where condition


The problem is a bit more difficult than in your generalization. I would state it as the following:

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (YOUR_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines

You have a subset of data (limited by your condition), which is grouped and an aggregation is made for each group. Then, you merge down aggregations to a single value and you want to apply a function of the value to each group again. Obviously, you can not reuse the condition until the result of the grouped subquery can be referenced as an entity.

In MSSQL and Oracle, you would use WITH operator. In MySQL the only option is to use a temporary table. I assume that there is more than one year in your report (otherwise, the query would be much simplier).

UPD: I am sorry, I can not post the ready code now (can do it tomorrow), but I have an idea:

You can concatenate the data you need to output in the subquery with GROUP_CONCAT AND split it back in the outer query with FIND_IN_SET, and SUBSTRING_INDEX functions. the outer query will JOIN only the YEAR_REF and the result of the aggregation.

The condition in the outer query then will be just WHERE FIND_IN_SET(year, concatenated_years).

UPD:

Here is the version that uses GROUP_CONCAT to pass the required data to the outer JOIN.

My comments start with --newtover:. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2) COUNT(*) has a special meaning in MySQL and should be used when you want to count rows.

SELECT STRAIGHT_JOIN
  -- newtover: extract the corresponding amount back
  SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  -- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
  YEAR_REF Y,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,
    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
    -- newtover: grouped fields for matching years and the corresponding amounts
    GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
    GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      $X{ IN, C.ID, CityCode } AND
      -- Find all the stations within a specific unit radius ...
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
      SD.ID = S.STATION_DISTRICT_ID AND
      -- Gather all known years for that station ...
      Y.STATION_DISTRICT_ID = SD.ID AND
      -- The data before 1900 is shaky; insufficient after 2009.
      Y.YEAR BETWEEN 1900 AND 2009 AND
      -- Filtered by all known months ...
      M.YEAR_REF_ID = Y.ID AND
      -- Whittled down by category ...
      M.CATEGORY_ID = $P{CategoryCode} AND
      -- Into the valid daily climate data.
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
    -- newtover: check if the year is in the list and store the index into the variable
    @pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
  Y.YEAR


As the SQL in the question was substantially hanged (now showing only relevant parts) here is my new answer

Assumption: The condition is really the same and no tricky column aliasing occurs between subquery and outer query

Answer: You can remove the where in the outer query.

SELECT
  /* aggregate data */
  ymxb.*
FROM (
  SELECT
    /* similar aggregate data */
  WHERE
    /* some condition */
  GROUP BY
    YEAR
) ymxb
GROUP BY
  YEAR

This should give you the same result.

(Also note that you could have removed the inner where and kept the outer one - results should be the same, however performance might not).

Finally, repeating the where clause probably does not have big impact on the performance - evaluating extra conditions (even expressions such as sqrt, etc) is very cheap compared to any I/O (and these conditions do not operate on any new columns, so all I/O had already been done)

Furthermore, your inner query and outer query use the same GROUP BY and the outer query gets it all data from subquery.

This makes any aggregate functions in the outer query redundant (the rows from the subquery, which are the source for the outer query, have already been grouped by year).

That makes the whole subselect redundant.


Are you able to use a temp table in your situation? Although it still requires you to use the WHERE clause twice, it should considerably boost your performance.

DROP TEMPORARY TABLE IF EXISTS TEMP_DATA

CREATE TEMPORARY TABLE TEMP_DATA 
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group)
    )

SELECT t.group, func(t.group, t.avg_avg)
FROM TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY t.group

Hope this helps! --Dubs

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜