
Is it safe to include extra columns in the SELECT list of a SQLite GROUP BY query?

I have a simple SQLite table called "message":

sequence INTEGER PRIMARY KEY
type TEXT
content TEXT

I want to get the content of the last message of each type (as determined by its sequence). To my surprise, the following simple query works:

SELECT MAX(sequence), type, content
FROM message
GROUP BY type

This surprised me, because I know that MSSQL or Postgres would refuse a column in the SELECT list that is neither part of the GROUP BY clause nor wrapped in an aggregate function, and I'd have to do a join, like this:

SELECT m.sequence, m.type, m.content
FROM
(
    SELECT MAX(sequence) as sequence, type
    FROM message
    GROUP BY type
) g
JOIN message m
ON g.sequence = m.sequence

My question is: is it safe to use the first, much simpler, form of the query in SQLite? It intuitively makes sense that it selects the "content" value that matches the "MAX(sequence)" value, but the documentation doesn't seem to talk about this at all. Of course, if sequence was not unique then the result would be undefined. But if sequence is unique, as in my case, is this guaranteed or is it simply a lucky implementation detail that's subject to change?


You can use these queries "safely," that is, without getting ambiguous results, if the extra columns are functionally dependent on the column(s) you group by:

SELECT c.parent_id, COUNT(*), p.any_column
FROM child_table c 
JOIN parent_table p USING (parent_id)
GROUP BY c.parent_id;

The example above would work in SQLite, and produce an unambiguous result, because there's no way p.any_column could have multiple values per group. However, this query is strictly in violation of the SQL standard, and most brands of RDBMS would raise an error.
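
As a rough check of the functionally dependent case, here is a minimal Python sketch using the standard sqlite3 module. The table layouts and sample rows are invented for illustration (the query above only names child_table, parent_table, parent_id and any_column), and an ORDER BY is added so the row order is fixed.

```python
import sqlite3


def grouped_parent_rows():
    """Run the GROUP BY query against a small in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: parent_id is the primary key of parent_table,
    # so any_column can only have one value per parent_id.
    conn.executescript("""
        CREATE TABLE parent_table (
            parent_id  INTEGER PRIMARY KEY,
            any_column TEXT
        );
        CREATE TABLE child_table (
            child_id  INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES parent_table(parent_id)
        );
        INSERT INTO parent_table VALUES (1, 'alpha'), (2, 'beta');
        INSERT INTO child_table (parent_id) VALUES (1), (1), (2);
    """)
    rows = conn.execute("""
        SELECT c.parent_id, COUNT(*), p.any_column
        FROM child_table c
        JOIN parent_table p USING (parent_id)
        GROUP BY c.parent_id
        ORDER BY c.parent_id
    """).fetchall()
    conn.close()
    return rows


print(grouped_parent_rows())
```

Each group contains exactly one distinct any_column value, so there is nothing for SQLite to choose between.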

It's too easy to write a query that produces ambiguous results, though. When you name a column that has multiple values per group, you can't control which value is returned in your result set.

In practice, MySQL returns the value from the first row with respect to physical storage, and SQLite returns the value from the last row. But it's totally implementation-dependent and not reliable. If the next version of either software changes its internals, you could get different query results after you upgrade. So it's best not to rely on this behavior.


Regarding your example, you say that content should "intuitively" have the value from the row where sequence is MAX. But is this really intuitive? Consider these other cases:

SELECT MAX(sequence), MIN(sequence), type, content
FROM message
GROUP BY type

So which row now supplies the value for content? The row where sequence is MAX, or the row where sequence is MIN?

What if you use a non-unique column (e.g. date), and there are multiple rows with the same MAX value for date, but different values for content?

SELECT MAX(date), type, content
FROM message
GROUP BY type

What about other aggregate functions like AVG() or SUM()? It could be that the value of the aggregate corresponds to no individual row in the table. So now which row should supply the value for content?

SELECT AVG(sequence), type, content
FROM message
GROUP BY type
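
To make the tie case concrete, here is a small Python sketch with the standard sqlite3 module. The date column and the sample rows are assumptions added for illustration. It runs the bare-column query and, for comparison, an explicit join that returns every row sharing the maximum date, so the tie is visible instead of being silently resolved.

```python
import sqlite3


def tied_rows():
    """Return (bare-column result, explicit-join result) for a tied MAX(date)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical variant of the message table with a non-unique date column.
    conn.execute("""
        CREATE TABLE message (
            sequence INTEGER PRIMARY KEY,
            type     TEXT,
            content  TEXT,
            date     TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO message (type, content, date) VALUES (?, ?, ?)",
        [
            ("info", "first", "2024-01-01"),
            ("info", "second", "2024-01-02"),  # tied for the latest date
            ("info", "third", "2024-01-02"),   # tied for the latest date
        ],
    )
    # Bare-column form: one row per type, content picked by the engine.
    bare = conn.execute(
        "SELECT MAX(date), type, content FROM message GROUP BY type"
    ).fetchall()
    # Explicit join: every row whose date equals the group's maximum.
    joined = conn.execute("""
        SELECT m.date, m.type, m.content
        FROM (SELECT MAX(date) AS date, type FROM message GROUP BY type) g
        JOIN message m ON m.type = g.type AND m.date = g.date
        ORDER BY m.sequence
    """).fetchall()
    conn.close()
    return bare, joined


bare, joined = tied_rows()
print("bare column:", bare)
print("explicit join:", joined)
```

The bare-column query yields a single row whose content is whichever candidate the engine happened to pick, while the join returns both tied rows and leaves the tie-breaking decision to you.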


I don't know of any database which will "intuitively" solve this sort of problem, where you want to get related row values for a group based upon the result of an aggregate for a specific column. For SQLite, I think you had better stick with your second query.
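
For reference, a minimal Python sketch of that join form, run against an in-memory SQLite database with the standard sqlite3 module. The sample rows are made up, the join condition is written against the sequence column as defined in the table, and the ORDER BY is only there to make the output order predictable.

```python
import sqlite3

# The subquery-and-join form: find MAX(sequence) per type, then join back
# to the table to fetch the matching content.
LATEST_PER_TYPE = """
    SELECT m.sequence, m.type, m.content
    FROM (
        SELECT MAX(sequence) AS sequence, type
        FROM message
        GROUP BY type
    ) g
    JOIN message m ON g.sequence = m.sequence
    ORDER BY m.type
"""


def latest_messages(rows):
    """Load (sequence, type, content) rows and return the last row of each type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE message (
            sequence INTEGER PRIMARY KEY,
            type     TEXT,
            content  TEXT
        )
    """)
    conn.executemany("INSERT INTO message VALUES (?, ?, ?)", rows)
    result = conn.execute(LATEST_PER_TYPE).fetchall()
    conn.close()
    return result


# Illustrative data only.
sample = [
    (1, "info", "started"),
    (2, "error", "disk full"),
    (3, "info", "finished"),
    (4, "error", "retry failed"),
]
print(latest_messages(sample))
```

Because sequence is the primary key, the join matches exactly one row per type, and the result does not depend on how the engine treats bare columns.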

Since you mentioned PostgreSQL, it's worth noting that it supports some non-standard syntax which accomplishes this, in the form of DISTINCT ON:

select distinct on (type) sequence, type, content
from message
order by type, sequence desc

(There could be some errors in that, as I don't have a psql prompt in front of me, but that's the gist of it.)

See http://www.postgresql.org/docs/8.4/interactive/queries-select-lists.html


I bet it just uses an arbitrary value for the content field. The MySQL docs, for instance, explicitly say so.
