
Getting match counts in a query on a large table is very slow

I have a MySQL table "items" with two integer fields: seid and tiid.

The table has about 35,000,000 records, so it's very large.

seid  tiid  
-----------
1     1  
2     2  
2     3  
2     4  
3     4  
4     1  
4     2

The table has a composite primary key on (seid, tiid), an index on seid and an index on tiid.

A user enters one or more tiid values, and I would like to find the seid values with the most matches.

For example, when someone types 1,2,3, I would like to get seid 2 and 4 as the result: they both have 2 matches on the tiid values.
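As a side note on the expected semantics, the desired result can be sketched in plain Python (illustrative only; best_seids is a made-up helper, not part of any MySQL solution):

```python
from collections import Counter

# Sample (seid, tiid) rows from the table above
rows = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)]

def best_seids(rows, tiids):
    """Count matching tiid values per seid; return the max count
    and every seid tied for that maximum."""
    counts = Counter(seid for seid, tiid in rows if tiid in tiids)
    top = max(counts.values())
    return top, sorted(s for s, c in counts.items() if c == top)

print(best_seids(rows, {1, 2, 3}))  # (2, [2, 4])
```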

My query so far:

SELECT COUNT(*) as c, seid
  FROM items
 WHERE tiid IN (1,2,3) 
GROUP BY seid
HAVING c = (SELECT COUNT(*) as c
              FROM items
             WHERE tiid IN (1,2,3) 
          GROUP BY seid
          ORDER BY c DESC 
             LIMIT 1)

But this query is extremely slow because of the size of the table.

Does anyone know how to construct a better query for this purpose?


So I found two solutions. The first one:

SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list 
FROM (
    SELECT COUNT(*) as c, seid FROM items 
    WHERE tiid IN (1,2,3) 
    GROUP BY seid ORDER BY c DESC
) T1 
GROUP BY c 
ORDER BY c DESC
LIMIT 1;
+---+-----------+
| c | seid_list |
+---+-----------+
| 2 | 2,4       | 
+---+-----------+
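To sanity-check the shape of this query without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 on the sample data (CAST AS CHAR becomes CAST AS TEXT in SQLite; the semantics are otherwise the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (seid INTEGER, tiid INTEGER, PRIMARY KEY (seid, tiid))")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)])

# Derived table counts matches per seid; the outer query groups the
# counts and keeps only the highest one, concatenating the tied seids.
row = con.execute("""
    SELECT c, GROUP_CONCAT(CAST(seid AS TEXT)) AS seid_list
      FROM (SELECT COUNT(*) AS c, seid
              FROM items
             WHERE tiid IN (1, 2, 3)
             GROUP BY seid) T1
     GROUP BY c
     ORDER BY c DESC
     LIMIT 1
""").fetchone()
print(row)  # c = 2, seid_list contains 2 and 4
```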

Edit:

EXPLAIN SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list  FROM (     SELECT COUNT(*) as c, seid FROM items      WHERE tiid IN (1,2,3)      GROUP BY seid ORDER BY c DESC ) T1  GROUP BY c  ORDER BY c DESC LIMIT 1;
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table      | type  | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL             | NULL    | NULL    | NULL |    3 | Using filesort                                            | 
|  2 | DERIVED     | items      | range | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+

Re-Edit:

This first solution has one problem: with billions of rows, the concatenated result field may become too big. So here is another solution, which also avoids scanning the grouped result twice by applying the classical running-maximum check with a MySQL user variable:

SELECT c,seid
  FROM (
   SELECT c,seid,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
     FROM (
       SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3)
        GROUP BY seid
        ORDER BY c DESC
    ) res1
   ,(SELECT @mmax:=0) initmax
   ORDER BY c DESC
 ) res2 WHERE mymax>0;
+---+------+
| c | seid |
+---+------+
| 2 |    4 | 
| 2 |    2 | 
+---+------+
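The @mmax trick is just a single pass over the counts in descending order, keeping rows while the count still equals the running maximum. The same idea in plain Python (illustrative only):

```python
from collections import Counter
from itertools import takewhile

rows = [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)]

# Count matches per seid, order by count descending, then keep rows
# while the count equals the top count -- the @mmax pattern in Python.
counts = Counter(seid for seid, tiid in rows if tiid in {1, 2, 3})
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
top = ordered[0][1]
winners = list(takewhile(lambda kv: kv[1] == top, ordered))
print(winners)  # [(2, 2), (4, 2)]
```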

explain:

+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL    | NULL    | NULL |    3 | Using where                                               | 
|  2 | DERIVED     | <derived4> | system | NULL             | NULL    | NULL    | NULL |    1 | Using filesort                                            | 
|  2 | DERIVED     | <derived3> | ALL    | NULL             | NULL    | NULL    | NULL |    3 |                                                           | 
|  4 | DERIVED     | NULL       | NULL   | NULL             | NULL    | NULL    | NULL | NULL | No tables used                                            | 
|  3 | DERIVED     | items      | range  | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+


That requires you to traverse the large table twice. Caching the intermediate result in a temporary table should roughly halve the time taken, but there doesn't look to be much more optimization possible.

DROP temporary table if exists TMP_COUNTED;

create temporary table TMP_COUNTED
select seid, COUNT(*) as C
from items
where tiid in (1,2,3)
group by seid;

CREATE INDEX IX_TMP_COUNTED on TMP_COUNTED(C);

SELECT *
FROM TMP_COUNTED
WHERE C = (SELECT MAX(C) FROM TMP_COUNTED);
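A runnable sketch of this temporary-table approach, using Python's built-in sqlite3 on the sample data (lowercased identifiers are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (seid INTEGER, tiid INTEGER, PRIMARY KEY (seid, tiid))")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)])

# Materialize the per-seid counts once, index them, then query the
# maximum from the small temporary table instead of re-scanning items.
con.execute("""
    CREATE TEMPORARY TABLE tmp_counted AS
    SELECT seid, COUNT(*) AS c FROM items
     WHERE tiid IN (1, 2, 3)
     GROUP BY seid
""")
con.execute("CREATE INDEX ix_tmp_counted ON tmp_counted(c)")
result = con.execute("""
    SELECT seid, c FROM tmp_counted
     WHERE c = (SELECT MAX(c) FROM tmp_counted)
     ORDER BY seid
""").fetchall()
print(result)  # [(2, 2), (4, 2)]
```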


Pre-calculate the counts for all the unique tiid values and store them.

Refresh these counts hourly, daily or weekly, or keep them correct by updating them as rows change. This eliminates the need to count at query time; counting on the fly is always slow.
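A toy sketch of the maintained-counts idea (all names are hypothetical; in MySQL this would be a summary table kept in step via application code or triggers):

```python
from collections import Counter

class Items:
    """Keep per-tiid counts in step with inserts/deletes, so queries
    never have to re-count the big table."""
    def __init__(self):
        self.rows = set()
        self.tiid_counts = Counter()   # maintained incrementally, never recomputed

    def insert(self, seid, tiid):
        if (seid, tiid) not in self.rows:
            self.rows.add((seid, tiid))
            self.tiid_counts[tiid] += 1

    def delete(self, seid, tiid):
        if (seid, tiid) in self.rows:
            self.rows.remove((seid, tiid))
            self.tiid_counts[tiid] -= 1

items = Items()
for r in [(1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2)]:
    items.insert(*r)
print(items.tiid_counts[1])  # 2 -- tiid 1 appears in two rows
```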


I have a table called product_category which has a composite primary key consisting of 2 unsigned integer fields and with no additional secondary indexes:

create table product_category
(
prod_id int unsigned not null,
cat_id mediumint unsigned not null,
primary key (cat_id, prod_id) -- note the clustered composite index !!
)
engine = innodb;

The table currently has 125 million rows

select count(*) as c from product_category;
c
=
125,524,947

with the following index/cardinalities:

show indexes from product_category;

Table              Non_unique   Key_name    Seq_in_index    Column_name Collation   Cardinality
=====              ==========   ========    ============    =========== =========   ===========
product_category    0            PRIMARY                1    cat_id      A           1162276
product_category    0            PRIMARY                2    prod_id     A           125525826

If I run a query similar to yours (1st run nothing cached and with cold/empty buffers):

select 
 prod_id, count(*) as c
from
 product_category 
where 
  cat_id between 1600 and 2000 -- using between to include a wider range of data
group by
 prod_id 
having c = (
  select count(*) as c from product_category 
  where cat_id between 1600 and 2000
  group by  prod_id order by c desc limit 1
)
order by prod_id;

I get the following results:

(cold run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.18 sec)

(2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.14 sec)

The explain plan is as follows:

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
|  2 | SUBQUERY    | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

If I run regilero's query:

SELECT c,prod_id
  FROM (
   SELECT c,prod_id,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
     FROM (
       SELECT COUNT(*) as c, prod_id FROM product_category WHERE
        cat_id between 1600 and 2000
        GROUP BY prod_id
        ORDER BY c DESC
    ) res1
   ,(SELECT @mmax:=0) initmax
   ORDER BY c DESC
 ) res2 WHERE mymax>0;

I get the following results:

(cold) 
+---+---------+
| c | prod_id |
+---+---------+
| 4 |  931111 |
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
+---+---------+
4 rows in set (0.17 sec)

(2nd run)
+---+---------+
| c | prod_id |
+---+---------+
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
| 4 |  931111 |
+---+---------+
4 rows in set (0.13 sec)

The explain plan is as follows:

+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type   | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL    | NULL   | NULL |  92760 | Using where                                               |
|  2 | DERIVED     | <derived4>       | system | NULL          | NULL    | NULL   | NULL |      1 | Using filesort                                            |
|  2 | DERIVED     | <derived3>       | ALL    | NULL          | NULL    | NULL   | NULL |  92760 |                                                           |
|  4 | DERIVED     | NULL             | NULL   | NULL          | NULL    | NULL   | NULL |   NULL | No tables used                                            |
|  3 | DERIVED     | product_category | range  | PRIMARY       | PRIMARY | 3      | NULL | 194622 | Using where; Using index; Using temporary; Using filesort   |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

Finally, trying cyberkiwi's approach:

drop procedure if exists cyberkiwi_variant;

delimiter #

create procedure cyberkiwi_variant()
begin

create temporary table tmp engine=memory
 select prod_id, count(*) as c from
 product_category where cat_id between 1600 and 2000
 group by prod_id order by c desc; 

select max(c) into @max from tmp;

select * from tmp where c = @max;

drop temporary table if exists tmp;

end#

delimiter ;

call cyberkiwi_variant();

I get the following results:

(cold and 2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|  816612 | 4 |
|  931111 | 4 |
|   34957 | 4 |
|  717812 | 4 |
+---------+---+
4 rows in set (0.14 sec)

The explain plan is as follows:

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | SIMPLE      | product_category | range | PRIMARY       | PRIMARY | 3  | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

So it seems that all of the methods tested have approximately the same runtime, between 0.14 and 0.18 seconds, which seems pretty performant to me considering the size of the table and the number of rows queried.

Hope this helps - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html


If I'm understanding your requirements correctly, you might try something like this:

select seid, tiid, count(*) from items where tiid in (1,2,3)
group by seid, tiid
order by seid;
