MySQL claims that I can use columns in SELECT that aren't in GROUP BY, but I can't with equal performance
The MySQL documents state in section 11.5.3 that, despite what the SQL standard may say, it's just fine to use columns in the SELECT clause that aren't in the GROUP BY clause, so long as they are functionally dependent on the grouped key.
MySQL extends the use of GROUP BY so that you can use nonaggregated columns or calculations in the select list that do not appe开发者_开发百科ar in the GROUP BY clause. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. For example, you need not group on customer.name in the following query:
SELECT order.custid, customer.name, MAX(payments) FROM order,customer WHERE order.custid = customer.custid GROUP BY order.custid;
In standard SQL, you would have to add customer.name to the GROUP BY clause. In MySQL, the name is redundant.
Sounds reasonable. However, though I can select those columns, it seems to have an adverse effect on performance.
EXPLAIN SELECT o.id FROM objects o GROUP BY o.id;
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | o | range | NULL | PRIMARY | 3 | NULL | 5262 | Using index for group-by |
+----+-------------+-------+-------+---------------+---------+---------+------+------+--------------------------+
(I realize that this query is pretty silly; it's just the simplest version of a more complex query that has the same issue.) When selecting just the primary key ID I group by, then MySQL uses the primary key index. However, when I include other columns, MySQL does not.
EXPLAIN SELECT o.id, o.name FROM objects o GROUP BY o.id;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | o | ALL | NULL | NULL | NULL | NULL | 5261 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
That use of filesort instead of the index really sets me back. I'm currently looking to select *
from this table, so would like to avoid having to repeat all columns in the group and index them. Is there any way to get MySQL to use the primary key index, as I expect it to?
use a derived table for the group by and join back on whatever table you want to select from
Since it doesn't look like there's a simple answer, I'm going with a cheap solution for the moment.
What I would do would be something like the following:
SELECT o1.* FROM objects o1 WHERE o1.id IN (SELECT o2.id FROM objects o2 WHERE mycondition GROUP BY o2.id)
However, according to how it gets EXPLAIN
ed, the MySQL optimizer views the subquery as being dependent, which is always a really, really nasty performance killer. I think that's a bug in the query optimizer brought about by the fact that it's the same table, even though it's aliased. As such, I'll be using one query to fetch the IDs, and putting them IN
the second query that fetches o.*
. It gets reasonable performance, and isn't too painful.
This question is still open to answers with cleaner solutions that perform as well, if not better :)
In the first query, the only field you're accessing is in an index, so mysql only has to look at the index file. But in the second query, you're now pulling a column from the table itself which requires also reading the table data. The first query isn't really using your primary key index the way it might if you had a WHERE
clause. It's only using it for the group by, but it's still looking at every entry in the index.
The difference between the first query and the second is just that the second has to look at every row in the full table (aka a table scan), rather than just every primary key value in the index.
As far as optimizing goes, if your real query has no cumulative functions (SUM, COUNT, etc) as in your example then should see a major improvement just doing:
SELECT DISTINCT o.id, o.name FROM objects o
However, if that's only true of your simple example and your query really does require a GROUP BY then your next best bet is to increase your tmp_table_size and max_heap_table_size variables to allow for more rows to fit in memory at once.
精彩评论