Remove redundant SQL code
Code
The following code calculates the slope and intercept for a linear regression against a slathering of data. It then applies the equation y = mx + b
against the same result set to calculate the value of the regression line for each row.
How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE
clause twice?
The general form of the problem is:
SELECT a.group, func(a.group, avg_avg)
FROM a
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines
I have SOME_CONDITION
executing twice. This is shown below (updated with a STRAIGHT_JOIN
optimization):
SELECT STRAIGHT_JOIN
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Question
How do I execute the duplicate bits only once per query, instead of twice? The duplicate code:
$X{ IN, C.ID, CityCode } AND
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
Y.STATION_DISTRICT_ID = SD.ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update 1
Using variables and splitting the query seems to allow the cache to kick in as this now runs in 3.5 seconds, whereas it used to run in 7. Still, if there is any way to remove the duplicate code, I'd be grateful for any help.
Update 2
The above code does not run in JasperReports, and a VIEW, while a possible fix, would probably be extremely inefficient (because the WHERE clauses are parameterized).
Update 3
Validating distance using Unreason's suggestion of the Pythagorean formula with converging meridians:
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
(This is unrelated to the question, but should someone else want to know ...)
Update 4
The code, as shown, works in JasperReports, running against a MySQL database. JasperReports does not allow variables or multiple queries.
Update 5
Am looking for a solution that executes cleanly. ;-) I have written a number of partially working solutions, but MySQL, sadly, does not understand partially correct. See the discussions with Unreason for answers that almost work.
Update 6
I might be able to reuse variables from the first WHERE
clause and compare them to the second (thereby eliminating some duplication -- the checks against $P{}
values), but I'd really like the duplication eliminated.
Update 7
Comparing the YEAR
clause, as hypothesized in the previous update, to eliminate the duplicate BETWEEN
, does not work.
Related
How to eliminate dupli开发者_JAVA百科cate calculation in SQL?
Thank you!
You should be able to get everything you need in one go:
SELECT
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /
(stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,
((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
(power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,
((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
(sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) /
(power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
FROM
CITY C,
STATION S,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
$X{ IN, C.ID, CityCode } AND
SQRT(
POW( C.LATITUDE - S.LATITUDE, 2 ) +
POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
The things will not work straight from the query above (it has nonsensically combined aggregates and other errors); this can be a good time to check your formulas
If you decide to do sub queries do simplify the formulas, then:
- you can grab (you do grab) all the necessary data in the inner most query and you don't have to repeat all the tables in the outer queries any more (just select the relevant columns from the t, they are already at your disposal)
- you don't have to repeat the where condition
The problem is a bit more difficult than in your generalization. I would state it as the following:
SELECT a.group, func(a.group, avg_avg)
FROM a
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (YOUR_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines
You have a subset of data (limited by your condition), which is grouped and an aggregation is made for each group. Then, you merge down aggregations to a single value and you want to apply a function of the value to each group again. Obviously, you can not reuse the condition until the result of the grouped subquery can be referenced as an entity.
In MSSQL and Oracle, you would use WITH
operator. In MySQL the only option is to use a temporary table. I assume that there is more than one year in your report (otherwise, the query would be much simplier).
UPD: I am sorry, I can not post the ready code now (can do it tomorrow), but I have an idea:
You can concatenate the data you need to output in the subquery with GROUP_CONCAT
AND split it back in the outer query with FIND_IN_SET
, and SUBSTRING_INDEX
functions. the outer query will JOIN only the YEAR_REF and the result of the aggregation.
The condition in the outer query then will be just WHERE FIND_IN_SET(year, concatenated_years)
.
UPD:
Here is the version that uses GROUP_CONCAT to pass the required data to the outer JOIN.
My comments start with --newtover:
. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2) COUNT(*)
has a special meaning in MySQL and should be used when you want to count rows.
SELECT STRAIGHT_JOIN
-- newtover: extract the corresponding amount back
SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
-- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
YEAR_REF Y,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
-- newtover: grouped fields for matching years and the corresponding amounts
GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
-- newtover: check if the year is in the list and store the index into the variable
@pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
Y.YEAR
As the SQL in the question was substantially hanged (now showing only relevant parts) here is my new answer
Assumption: The condition is really the same and no tricky column aliasing occurs between subquery and outer query
Answer: You can remove the where in the outer query.
SELECT
/* aggregate data */
ymxb.*
FROM (
SELECT
/* similar aggregate data */
WHERE
/* some condition */
GROUP BY
YEAR
) ymxb
GROUP BY
YEAR
This should give you the same result.
(Also note that you could have removed the inner where and kept the outer one - results should be the same, however performance might not).
Finally, repeating the where clause probably does not have big impact on the performance - evaluating extra conditions (even expressions such as sqrt, etc) is very cheap compared to any I/O (and these conditions do not operate on any new columns, so all I/O had already been done)
Furthermore, your inner query and outer query use the same GROUP BY and the outer query gets it all data from subquery.
This makes any aggregate functions in the outer query redundant (the rows from the subquery, which are the source for the outer query, have already been grouped by year).
That makes the whole subselect redundant.
Are you able to use a temp table in your situation? Although it still requires you to use the WHERE clause twice, it should considerably boost your performance.
DROP TEMPORARY TABLE IF EXISTS TEMP_DATA
CREATE TEMPORARY TABLE TEMP_DATA
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group)
)
SELECT t.group, func(t.group, t.avg_avg)
FROM TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY t.group
Hope this helps! --Dubs
精彩评论