Get count matches in query on large table very slow
I have a mysql table "items" with 2 integer fields: seid and tiid
The table has about 35000000 records, so it's very large.seid tiid
-----------
1 1
2 2
2 3
2 4
3 4
4 1
4 2
The table has a primary key on both fields, an index on seid and an index on tiid.
Someone types in 1 or more tiid values and now I would like to get the seid with most results.
For example when someone types 1,2,3, I would like to get seid 2 and 4 as result. They both have 开发者_C百科2 matches on the tiid values.
My query so far:
SELECT COUNT(*) as c, seid
FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid
HAVING c = (SELECT COUNT(*) as c, seid
FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid
ORDER BY c DESC
LIMIT 1)
But this query is extremly slow, because of the large table.
Does anyone know how to construct a better query for this purpose?
So I found 2 solutions, 1st one:
SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list
FROM (
SELECT COUNT(*) as c, seid FROM items
WHERE tiid IN (1,2,3)
GROUP BY seid ORDER BY c DESC
) T1
GROUP BY c
ORDER BY c DESC
LIMIT 1;
+---+-----------+
| c | seid_list |
+---+-----------+
| 2 | 2,4 |
+---+-----------+
Edit:
EXPLAIN SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list FROM ( SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3) GROUP BY seid ORDER BY c DESC ) T1 GROUP BY c ORDER BY c DESC LIMIT 1;
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | Using filesort |
| 2 | DERIVED | items | range | PRIMARY,tiid_idx | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
Re-Edit:
This 1st solution has one problem, with billions of rows the result field may be too big. So Here's another solution which avoid as well the double-rainbow effect by applying the clasical max memorisation/check with a MySQl variable:
SELECT c,seid
FROM (
SELECT c,seid,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
FROM (
SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3)
GROUP BY seid
ORDER BY c DESC
) res1
,(SELECT @mmax:=0) initmax
ORDER BY c DESC
) res2 WHERE mymax>0;
+---+------+
| c | seid |
+---+------+
| 2 | 4 |
| 2 | 2 |
+---+------+
explain:
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | Using where |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | items | range | PRIMARY,tiid_idx | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
That requires you to traverse the large table twice. Maybe caching the result will help halve the time taken, but there doesn't look like much more optization is possible.
DROP temporary table if exists TMP_COUNTED;
create temporary table TMP_COUNTED
select seid, COUNT(*) as C
from items
where tiid in (1,2,3)
group by seid;
CREATE INDEX IX_TMP_COUNTED on TMP_COUNTED(C);
SELECT *
FROM TMP_COUNTED
WHERE C = (SELECT MAX(C) FROM seid)
Pre-calculate the count of all the unique tiid values and store them.
Refresh this count hourly, daily or weekly. Or try and keep the count correct by updating them. This then will eliminate the need to do a count. Counts are always slow.
I have a table called product_category which has a composite primary key consisting of 2 unsigned integer fields and with no additional secondary indexes:
create table product_category
(
prod_id int unsigned not null,
cat_id mediumint unsigned not null,
primary key (cat_id, prod_id) -- note the clustered composite index !!
)
engine = innodb;
The table currently has 125 million rows
select count(*) as c from product_category;
c
=
125,524,947
with the following index/cardinalities:
show indexes from product_category;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality
===== ========== ======== ============ =========== ========= ===========
product_category 0 PRIMARY 1 cat_id A 1162276
product_category 0 PRIMARY 2 prod_id A 125525826
If I run a query similar to yours (1st run nothing cached and with cold/empty buffers):
select
prod_id, count(*) as c
from
product_category
where
cat_id between 1600 and 2000 -- using between to include a wider range of data
group by
prod_id
having c = (
select count(*) as c from product_category
where cat_id between 1600 and 2000
group by prod_id order by c desc limit 1
)
order by prod_id;
I get the following results:
(cold run)
+---------+---+
| prod_id | c |
+---------+---+
| 34957 | 4 |
| 717812 | 4 |
| 816612 | 4 |
| 931111 | 4 |
+---------+---+
4 rows in set (0.18 sec)
(2nd run)
+---------+---+
| prod_id | c |
+---------+---+
| 34957 | 4 |
| 717812 | 4 |
| 816612 | 4 |
| 931111 | 4 |
+---------+---+
4 rows in set (0.14 sec)
The explain plan is as follows:
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
| 2 | SUBQUERY | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
If I run regilero's query:
SELECT c,prod_id
FROM (
SELECT c,prod_id,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
FROM (
SELECT COUNT(*) as c, prod_id FROM product_category WHERE
cat_id between 1600 and 2000
GROUP BY prod_id
ORDER BY c DESC
) res1
,(SELECT @mmax:=0) initmax
ORDER BY c DESC
) res2 WHERE mymax>0;
I get the following results:
(cold)
+---+---------+
| c | prod_id |
+---+---------+
| 4 | 931111 |
| 4 | 34957 |
| 4 | 717812 |
| 4 | 816612 |
+---+---------+
4 rows in set (0.17 sec)
(2nd run)
+---+---------+
| c | prod_id |
+---+---------+
| 4 | 34957 |
| 4 | 717812 |
| 4 | 816612 |
| 4 | 931111 |
+---+---------+
4 rows in set (0.13 sec)
The explain plan is as follows:
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 92760 | Using where |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 92760 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
Finally trying cyberwiki's approach:
drop procedure if exists cyberkiwi_variant;
delimiter #
create procedure cyberkiwi_variant()
begin
create temporary table tmp engine=memory
select prod_id, count(*) as c from
product_category where cat_id between 1600 and 2000
group by prod_id order by c desc;
select max(c) into @max from tmp;
select * from tmp where c = @max;
drop temporary table if exists tmp;
end#
delimiter ;
call cyberkiwi_variant();
I get the following results:
(cold and 2nd run)
+---------+---+
| prod_id | c |
+---------+---+
| 816612 | 4 |
| 931111 | 4 |
| 34957 | 4 |
| 717812 | 4 |
+---------+---+
4 rows in set (0.14 sec)
The explain plan is as follows:
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| 1 | SIMPLE | product_category | range | PRIMARY | PRIMARY | 3 | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
So it seems that all of the methods tested have approx. the same runtimes of between 0.14 and 0.18 seconds which seems pretty performant to me considering the size of the table and the number of rows queried.
Hope this helps - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
If I'm understanding your requirements, you might try something like this
select seid, tiid, count(*) from items where tiid in (1,2,3)
group by seid, tiid
order by seid
精彩评论