Database behavior HAVING-SUM vs WHERE / DISTINCT vs GROUP BY

2023-01-30 20:25

Suppose I have a very large summary table that holds one row per user per day. Each row stores the sum of the activity points the user earned that day, with a separate column for each activity type:

CREATE TABLE summery_data
(
    UserID INT NOT NULL,
    ActivityDate DATE,
    t1 INT NOT NULL,
    t2 INT NOT NULL,
    t3 INT NOT NULL,
    t4 INT NOT NULL,
    PRIMARY KEY(UserID, ActivityDate)
)

Every morning we populate the previous day's data. We insert a row for every user:

INSERT summery_data
SELECT UserID, '2010-12-16'
    , SUM(IF(TypeID = 1, Points, 0))
    , SUM(IF(TypeID = 2, Points, 0))
    , SUM(IF(TypeID = 3, Points, 0))
    , SUM(IF(TypeID = 4, Points, 0))
FROM activities
WHERE ActivityDate >= '2010-12-16' AND ActivityDate < '2010-12-17'
GROUP BY UserID

The table data looks something like this:

UserID  ActivityDate  t1   t2  t3  t4
1       2010-01-01    0    82  0   0
1       2010-01-02    100  1   12  0
2       2010-01-01    0    0   0   41
2       2010-01-02    0    0   0   1
3       2010-01-02    0    0   0   106
3       2010-01-03    2    5   0   4

The table is very large (10M+ rows). I want to get a list of the user IDs that had any activity points for t1, t2 or t3 (t4 should not be counted) on any day. For the sample data above, the result would include UserIDs 1 and 3.

Which of the following queries is better?

SELECT DISTINCT UserID
FROM summery_data
WHERE t1 > 0 OR t2 > 0 OR t3 > 0

SELECT UserID
FROM summery_data
GROUP BY UserID
HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0

To understand which will be faster, I have some questions about what goes on behind the scenes:

1. In a DISTINCT query, how does the database ensure that each UserID is added to the result set only once? Does it check each UserID against the set to see whether it already exists? Or, since the table is clustered by UserID anyway, does it just keep a variable, while scanning the rows, holding the last UserID added to the result set?
2. In a DISTINCT query, once the database finds a single row that matches the criteria for the current UserID, does it stop checking the predicate in the WHERE clause until it hits the next UserID?
3. In a GROUP BY query, while summing the t1 column, once the database finds a record where t1 > 0, which would satisfy the HAVING clause, does it stop summing the remaining t1 rows for the current UserID (since the predicate > 0 is already true)? Or does it at least skip summing the other columns (t2 and t3), since there is no need for that? Or does the database first sum t1, t2 and t3 in full before evaluating the HAVING clause?

Note: I am using MySQL as the database server, but I would also like to know whether SQL Server or other database systems work differently.

Any help is greatly appreciated.

Your queries are not equivalent if you allow negative numbers in any of (t1, t2, t3). Consider the following data:

user_id   T1   T2   T3   T4
-------  ---  ---  ---  ---
   1      -2   0    0    0
   1       2   0    0    0
   2       1   0    0    0
   2       2   0    0    0

Your first query (DISTINCT) will include both user 1 and user 2, as there is at least one row for each user with a T1 value > 0.

The second query (GROUP BY ... HAVING) will exclude user 1, because the sum of that user's T1 values is 0 (even though one value within the group is > 0). This is also a good example of the difference between HAVING and WHERE: WHERE operates on individual rows; HAVING operates on the group as a whole.
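To check this difference yourself, here is a minimal sketch that loads the four example rows and runs both queries. It assumes Python with the built-in sqlite3 module as a stand-in for MySQL (the two queries use only standard SQL), and it leaves out the ActivityDate column, as the example table above does.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE summery_data (UserID INT, t1 INT, t2 INT, t3 INT, t4 INT)"
)
conn.executemany(
    "INSERT INTO summery_data VALUES (?, ?, ?, ?, ?)",
    [(1, -2, 0, 0, 0), (1, 2, 0, 0, 0), (2, 1, 0, 0, 0), (2, 2, 0, 0, 0)],
)

# Query 1: filter individual rows, then remove duplicate users.
distinct_users = [row[0] for row in conn.execute(
    "SELECT DISTINCT UserID FROM summery_data "
    "WHERE t1 > 0 OR t2 > 0 OR t3 > 0 ORDER BY UserID"
)]

# Query 2: aggregate per user, then filter whole groups.
having_users = [row[0] for row in conn.execute(
    "SELECT UserID FROM summery_data GROUP BY UserID "
    "HAVING SUM(t1) > 0 OR SUM(t2) > 0 OR SUM(t3) > 0 ORDER BY UserID"
)]

print(distinct_users)  # [1, 2]
print(having_users)    # [2]
```

User 1 shows up only in the first result, because the -2 and 2 cancel out in the sum.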

The rest of the answer is not only highly vendor dependent, but also largely irrelevant from a SQL perspective, since it is the database that ultimately makes these choices. Having said that, knowing a little about them lets you influence the optimizer by writing your queries in a certain way.

Question 1

I know of three strategies a database can use to produce a list of distinct values. Which one it picks is determined by the estimated cost of each operation.

Sorting. Sort the resultset. Run through the sorted result, and keep track of the previous value. This is potentially very expensive (slow) if it cannot fit into memory.

Hashing. A hash function is applied to all rows in the resultset. The result is stored in an intermediate hashtable. This is often faster than sorting.

Index walk. This is basically the same technique as sorting, but as the index is already sorted, that step is skipped.
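The three strategies can be sketched in a few lines of Python. This is only an illustration of the ideas, not how any particular engine implements them; the function names and the sample list are made up for the example.

```python
from itertools import groupby


def distinct_by_sort(values):
    """Sort first, then emit a value whenever it differs from the previous one."""
    result = []
    previous = object()  # sentinel that compares unequal to every real value
    for value in sorted(values):
        if value != previous:
            result.append(value)
            previous = value
    return result


def distinct_by_hash(values):
    """Keep a hash table of the values seen so far; no sorting needed."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def distinct_by_index_walk(sorted_values):
    """Input already arrives in order (as an index delivers it), so skip the sort."""
    return [key for key, _ in groupby(sorted_values)]


user_ids = [3, 1, 2, 1, 3, 3]
print(distinct_by_sort(user_ids))                # [1, 2, 3]
print(distinct_by_hash(user_ids))                # [3, 1, 2]
print(distinct_by_index_walk(sorted(user_ids)))  # [1, 2, 3]
```

Note that the hash variant returns the values in first-seen order, while the other two return them sorted. This matches the general rule that DISTINCT by itself promises no particular order.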

Question 2

The database is free to evaluate your predicates in any order it wants. You cannot easily control this yourself. The optimizer can use heuristics or statistics to find the optimal evaluation order. It has to obey the same boolean principles as the rest of us: as soon as any one of (t1 > 0 OR t2 > 0 OR t3 > 0) is true, it can stop evaluating the others.
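As a small illustration of that boolean principle, the sketch below evaluates the OR predicate for one row and records which columns it actually looked at. The fixed t1, t2, t3 order and the helper name are assumptions made for the example; a real optimizer may pick a different order.

```python
def row_matches(row, evaluated):
    """Evaluate t1 > 0 OR t2 > 0 OR t3 > 0, recording each column checked."""
    for column in ("t1", "t2", "t3"):
        evaluated.append(column)
        if row[column] > 0:
            return True  # one true operand settles the whole OR
    return False


evaluated = []
matched = row_matches({"t1": 0, "t2": 5, "t3": 7}, evaluated)
print(matched)    # True
print(evaluated)  # ['t1', 't2']
```

The t3 comparison is never performed, because t2 > 0 already decided the result.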

Question 3

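No. This is explained by my example above regarding WHERE/HAVING: the database has to sum every row in the group before it can evaluate the HAVING clause, because a later negative value can bring a sum that was already positive back down to 0.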
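To make this concrete, the sketch below compares a full sum with the early-exit shortcut the question asks about, using user 1's T1 values from the example. Both function names are hypothetical, and the second one is deliberately wrong.

```python
def group_qualifies(values):
    """HAVING SUM(col) > 0: the whole group is summed before comparing."""
    return sum(values) > 0


def group_qualifies_early_exit(values):
    """Incorrect shortcut: stop as soon as the running sum becomes positive."""
    running = 0
    for value in values:
        running += value
        if running > 0:
            return True
    return False


t1_user_1 = [2, -2]
print(group_qualifies(t1_user_1))             # False
print(group_qualifies_early_exit(t1_user_1))  # True
```

The shortcut gives the wrong answer here because it stops after the first value and never sees the -2.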

A lot of your specific questions are implementation dependent.

SQL queries are declarative. They do not specify the means of obtaining the answer; they just state what you are looking for. The DBMS (database management system) determines how this is put into practice. Most SELECT queries involve some kind of table-scanning iteration (unless an index on the field in question makes it unnecessary), but you don't see the looping explicitly in the query.

What I can definitely recommend is that you do not use aggregate functions such as SUM if you are not interested in the actual values of the sums. Use DISTINCT if what you want is the UserIDs that have a positive value in any of those three fields in any row. This at least gives the DBMS a chance to do the right thing and optimize that query.

It is possible that an index could help this query, but not substantially. Where indexing really helps is in things like equality joins across different tables (without an index, equi-joining a table with m rows to a table with n rows could take m*n time). Here all you want to do is keep the rows where at least one of those 3 fields is positive. In the worst case, you will look at every row once. An index on UserID could help, in conjunction with DISTINCT, by letting the database skip the remaining rows of a user that it has already decided to include.
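The idea of skipping the rest of a user's rows can be sketched as follows. This is only an illustration, assuming the rows arrive sorted by UserID (as the primary key would order them) and using a binary search in place of an index seek. Whether MySQL or any other engine actually does this for a given query is up to its optimizer. The rows are the sample data from the question, without ActivityDate and t4.

```python
from bisect import bisect_right


def users_with_activity(rows):
    """rows: (UserID, t1, t2, t3) tuples sorted by UserID.

    Returns the qualifying UserIDs and the number of rows examined.
    """
    user_ids = [row[0] for row in rows]
    result = []
    rows_examined = 0
    i = 0
    while i < len(rows):
        user_id, t1, t2, t3 = rows[i]
        rows_examined += 1
        if t1 > 0 or t2 > 0 or t3 > 0:
            result.append(user_id)
            # Jump past this user's remaining rows (stands in for an index seek).
            i = bisect_right(user_ids, user_id, lo=i)
        else:
            i += 1
    return result, rows_examined


rows = [
    (1, 0, 82, 0), (1, 100, 1, 12),
    (2, 0, 0, 0), (2, 0, 0, 0),
    (3, 0, 0, 0), (3, 2, 5, 0),
]
print(users_with_activity(rows))  # ([1, 3], 5)
```

Only 5 of the 6 rows are examined, because user 1's second row is skipped once the first one matches. Users with no matching rows (user 2 here) still have every row checked, which is the worst case mentioned above.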
