Database behavior HAVING-SUM vs WHERE / DISTINCT vs GROUP BY
Suppose I have a very large summary table where we keep the sum of activity points: one row per user per day, with a separate column holding that day's sum for each activity type:
CREATE TABLE summery_data
(
UserID INT NOT NULL,
ActivityDate DATE,
t1 INT NOT NULL,
t2 INT NOT NULL,
t3 INT NOT NULL,
t4 INT NOT NULL,
PRIMARY KEY(UserID, ActivityDate)
)
Every morning we populate the previous day's data. We insert a row for every user:
INSERT summery_data
SELECT UserID, '2010-12-16'
, SUM(IF(TypeID = 1, Points, 0))
, SUM(IF(TypeID = 2, Points, 0))
, SUM(IF(TypeID = 3, Points, 0))
, SUM(IF(TypeID = 4, Points, 0))
FROM activities
WHERE ActivityDate >= '2010-12-16' AND ActivityDate < '2010-12-17'
GROUP BY UserID
The table data looks something like this:
UserID ActivityDate t1 t2 t3 t4
1 2010-01-01 0 82 0 0
1 2010-01-02 100 1 12 0
2 2010-01-01 0 0 0 41
2 2010-01-02 0 0 0 1
3 2010-01-02 0 0 0 106
3 2010-01-03 2 5 0 4
The table is very large (10M+ rows). I want to get a list of user IDs who had any activity points for t1, t2 or t3 (we do not want to count t4) on any day. For the sample data above, the result would include UserID 1 and 3.
Which of the following queries is better?
SELECT DISTINCT UserID
FROM summery_data
WHERE t1 > 0 OR t2 > 0 OR t3 > 0
vs
SELECT UserID
FROM summery_data
GROUP BY UserID
HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0
To understand which will be faster, I have some questions about what goes on behind the scenes:
1. In a DISTINCT query, how does the database ensure that each UserID is added to the result set only once? Does it check each UserID to see whether it already exists in the set? Or, since the table is clustered by UserID anyway, does it just keep a variable, while scanning the rows, holding the last UserID added to the result set?
2. In a DISTINCT query, once the database finds a single row that matches the criteria for the current UserID, does it stop checking the predicate in the WHERE clause until it hits the next UserID?
3. In a GROUP BY query, while summing the t1 column, once the database finds a record with t1 > 0, which would satisfy the HAVING clause, does it stop summing the other t1 rows for the current UserID (since the predicate > 0 is already true)? Or does it at least skip summing the other columns (t2 and t3), since there is no need for that? Or does the database first sum t1, t2 and t3 completely before evaluating the HAVING clause?
Note: I am using MySQL as the database server, but I would like to know whether SQL Server or other database systems would work differently.
Any help is greatly appreciated.
Your queries are not equivalent if you allow negative numbers in any of (t1, t2, t3). Consider the following data:
user_id T1 T2 T3 T4
------- --- --- --- ---
1 -2 0 0 0
1 2 0 0 0
2 1 0 0 0
2 2 0 0 0
Your first query (DISTINCT) will include both user 1 and user 2, as there is at least one row for each user with a T1 value > 0.
The second query (GROUP BY ... HAVING) will exclude user 1, as the sum of that user's T1 values is 0 (even though one value within the group is > 0). This is also a good example of the difference between HAVING and WHERE: WHERE operates on individual rows; HAVING operates on the group as a whole.
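To see the difference concretely, here is a minimal sketch that runs both queries against the four sample rows above. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, which is an assumption made only so the example runs anywhere. The ActivityDate column and the primary key are left out because the sample data has no dates, and an ORDER BY is added purely to make the output order deterministic.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Simplified table: ActivityDate and the primary key are omitted because
# the sample rows from the answer carry no dates.
conn.execute(
    "CREATE TABLE summery_data ("
    "UserID INT NOT NULL, t1 INT NOT NULL, t2 INT NOT NULL, "
    "t3 INT NOT NULL, t4 INT NOT NULL)"
)
conn.executemany(
    "INSERT INTO summery_data VALUES (?, ?, ?, ?, ?)",
    [(1, -2, 0, 0, 0), (1, 2, 0, 0, 0), (2, 1, 0, 0, 0), (2, 2, 0, 0, 0)],
)

# WHERE filters individual rows before duplicates are removed.
distinct_ids = [row[0] for row in conn.execute(
    "SELECT DISTINCT UserID FROM summery_data "
    "WHERE t1 > 0 OR t2 > 0 OR t3 > 0 ORDER BY UserID"
)]

# HAVING filters whole groups after the sums have been computed.
having_ids = [row[0] for row in conn.execute(
    "SELECT UserID FROM summery_data GROUP BY UserID "
    "HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0 ORDER BY UserID"
)]

print(distinct_ids)  # [1, 2]
print(having_ids)    # [2]
```

User 1 shows up only in the DISTINCT result, because the -2 and 2 cancel out in SUM(t1).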
The rest of the answer is not only highly vendor dependent but also irrelevant from a pure SQL perspective, since it is the database that ultimately makes these choices. Having said that, knowing a little about it lets you influence the optimizer by writing your queries in a certain way.
Question 1
I know of three strategies a database can use to produce a list of distinct values. Which one it uses is determined by the estimated cost of each operation.
Sorting. Sort the resultset. Run through the sorted result, and keep track of the previous value. This is potentially very expensive (slow) if it cannot fit into memory.
Hashing. A hash function is applied to all rows in the resultset. The result is stored in an intermediate hashtable. This is often faster than sorting.
Index walk. This is basically the same technique as sorting, but as the index is already sorted, that step is skipped.
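The three strategies above can be sketched in plain Python. This is only a toy illustration of the ideas, not how MySQL or any other engine actually implements them; the function names are made up for this example.

```python
from itertools import groupby

def distinct_by_sorting(values):
    """Sort first, then emit a value only when it differs from the previous one."""
    result = []
    previous = object()  # sentinel that compares unequal to every real value
    for value in sorted(values):
        if value != previous:
            result.append(value)
            previous = value
    return result

def distinct_by_hashing(values):
    """Keep a hash set of values seen so far; no sort is needed."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

def distinct_by_index_walk(sorted_values):
    """Input is assumed to arrive already sorted (as from an index),
    so only the 'compare with previous' step remains."""
    return [key for key, _group in groupby(sorted_values)]

user_ids = [3, 1, 2, 1, 3, 2]
print(distinct_by_sorting(user_ids))                 # [1, 2, 3]
print(distinct_by_hashing(user_ids))                 # [3, 1, 2]
print(distinct_by_index_walk([1, 1, 2, 2, 3, 3]))    # [1, 2, 3]
```

Note that the hashing variant returns values in first-seen order rather than sorted order, which is one reason a DISTINCT without ORDER BY gives no ordering guarantee.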
Question 2
The database is free to evaluate your predicates in any order it wants, and you cannot easily control this yourself. The optimizer can use heuristics or statistics to find the optimal evaluation order. It has to obey the same boolean logic as the rest of us: as soon as any one of (t1=1 or t2=2 or t3=3) is true, it can stop evaluating the others.
Question 3
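No. This is explained by my example above regarding WHERE/HAVING: the sum has to cover every row in the group, because a later negative value can bring a positive running total back down to 0 or below.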
A lot of your specific questions are implementation dependent.
SQL queries are declarative. They do not specify the means of obtaining the answer; they just indicate what you are looking for. The DBMS (database management system) determines how these are put into practice. Most SELECT queries involve some kind of table-scanning iteration (unless an index on the field in question makes that unnecessary), but you don't see looping explicitly in the middle of the query.
What I can recommend to you definitively is that you do not use aggregate functions such as sums if you are not interested in the actual values of the sums. Use DISTINCT if what you want is to get those UserIDs that have positive values in any of those three fields in any row. This at least gives the DBMS a chance to do the right thing and optimize that query.
It is possible that an index could help this query, but not that substantially. Where indexing really helps is with things like equality joins across tables (without an index, equi-joining a table with m rows to a table with n rows could take m*n time). Here all you want to do is keep users for whom at least one of those three fields is positive in some row. You will, worst case, look at every row once. An index on UserID could help, in conjunction with DISTINCT, to skip checking rows for a user that you've already decided to include.
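As a rough illustration of that last point, here is a toy Python loop over the sample rows from the question, assumed to be sorted by UserID as they would be when read in primary-key order. Once a user has been added to the result, the predicate is not evaluated again for that user's remaining rows. This is only a sketch of what an engine could do; whether MySQL or any other DBMS actually does it depends on the optimizer and the plan it chooses. The function name and the check counter are made up for this example.

```python
# Sample rows from the question: (UserID, ActivityDate, t1, t2, t3, t4),
# assumed to arrive sorted by UserID.
rows = [
    (1, "2010-01-01", 0, 82, 0, 0),
    (1, "2010-01-02", 100, 1, 12, 0),
    (2, "2010-01-01", 0, 0, 0, 41),
    (2, "2010-01-02", 0, 0, 0, 1),
    (3, "2010-01-02", 0, 0, 0, 106),
    (3, "2010-01-03", 2, 5, 0, 4),
]

def users_with_activity(sorted_rows):
    """Return (matching user IDs, number of predicate evaluations)."""
    result = []
    predicate_checks = 0
    for user_id, _date, t1, t2, t3, _t4 in sorted_rows:
        # This user is already in the result: skip the predicate entirely.
        if result and result[-1] == user_id:
            continue
        predicate_checks += 1
        if t1 > 0 or t2 > 0 or t3 > 0:
            result.append(user_id)
    return result, predicate_checks

matched, checks = users_with_activity(rows)
print(matched)  # [1, 3]
print(checks)   # 5 (the second row of user 1 is never tested)
```

The loop still touches every row once, which matches the worst case described above; a real index could go further and seek directly to the next UserID.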