Optimizing MySQL GROUP BY/ORDER BY for calculating set intersection

This scenario is simplified somewhat to make the problem clearer. My situation involves a set of data records in MySQL.

CREATE TABLE `records` (
  `id` bigint(20) NOT NULL,
  `property1` bigint(20) NOT NULL,
  `property2` bigint(20) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `property1` (`property1`),
  KEY `property2` (`property2`)
);

From each record, we generate and store a variable number of keys (hashes) based on the record data.

CREATE TABLE `rkeys` (
  `rKey` bigint(20) NOT NULL,
  `rId` bigint(20) NOT NULL,
  KEY `rKey` (`rKey`),
  KEY `rId` (`rId`),
  FOREIGN KEY (`rId`) REFERENCES `records` (`id`)
);

(The key values are hashes to distribute them over the keyspace more evenly.)

There may be, for example, 5 million records and 50 million keys.

What I'm attempting to do is a fuzzy search on the key set -- match a record against the records in the database with the most keys in common. The results also need to be filtered against the properties in the records table.

The query I've been working from looks like this:

SELECT rkeys.rId, records.property1, SUM(1) as score 
FROM rkeys, records
WHERE 
   (rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14) AND 
    rkeys.rId = records.id AND 
    records.property1 = 1 AND
    records.property2 = 2 
GROUP BY rId ORDER BY score DESC;

The performance is ok if the number of records with any given key is fairly small; the problem is if I hit a key that appears in several thousand records (say 5000). All of a sudden, the GROUP BY/ORDER BY performance falls off a cliff (15-20s per query). Note that smoothing out the key distribution is not really an option -- the record data itself is unevenly distributed.

The join against the records table doesn't seem to be the core of the problem -- I'm just including it for context. I still see the same problem if all I want to do is this:

SELECT rId, SUM(1) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC;

EXPLAIN output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rkeys
         type: index
possible_keys: rKey
          key: rKey
      key_len: 8
          ref: NULL
         rows: 1
        Extra: Using where; Using temporary; Using filesort

Is there a way I can restructure this table or query in order to speed this operation up?


Have you tried adding nonclustered indices (indexes) to those fields? I haven't seen Keys do this automatically in the past, aside from some implicit clustered index creation that primary key declaration does in some SQL engines.


I'm a noob, but try a compound index on (rKey, rId) or (rId, rKey)


Try something like this:

SELECT rId, Count(*) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC

and add an index to the table on

(rKey,rId)

Replacing SUM with COUNT shouldn't make too much difference, though (in MSSQL, anyway).
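
As a rough sketch of the suggestion above, the composite index could be added like this. The index name `rKey_rId` is made up for illustration, and the IN list is just a shorter way of writing the same OR chain. With the columns in the order (rKey, rId), the WHERE clause can use the leading column and rId can be read straight from the index, so MySQL may not need to touch the table rows at all. The temporary table and filesort will most likely remain, because the sort is on an aggregate; what to look for in the EXPLAIN output is "Using index" appearing in the Extra column.

```sql
-- Hypothetical index name; the filtered column (rKey) goes first,
-- and rId is included so the index covers the whole query.
ALTER TABLE `rkeys` ADD INDEX `rKey_rId` (`rKey`, `rId`);

-- Same query as above, with IN instead of the OR chain.
-- Run it with EXPLAIN first to see whether the new index is picked.
EXPLAIN
SELECT rId, COUNT(*) AS score
FROM rkeys
WHERE rKey IN (10, 11, 13, 14)
GROUP BY rId
ORDER BY score DESC;
```

The reverse order (rId, rKey) would not help this particular WHERE clause, since the filter is on rKey and an index is only usable for lookups from its leftmost column onward.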


"The key values are hashes to distribute them over the keyspace more evenly" is actually a bad idea for performance. Once the data gets too large to be cached, you will be slowed down by the randomness.

The `rkeys` table smells a lot like a many:many mapping table. Here are several tips on improving the performance of such a table. Following them is likely to speed up your SELECT.

Your SELECT should improve significantly with this 'composite' and 'covering' index:

INDEX(property1, property2, id)

(More tips in creating optimal indexes.)
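
A minimal sketch of adding that index, assuming it is meant for the `records` table (the index name `p1_p2_id` is made up for illustration). The original query is rewritten here with an explicit JOIN, which should behave the same as the comma join. If the table is InnoDB, the primary key is stored in every secondary index anyway, so listing `id` explicitly mostly documents the intent.

```sql
-- Hypothetical index name; covers both property filters plus the join column.
ALTER TABLE `records` ADD INDEX `p1_p2_id` (`property1`, `property2`, `id`);

-- The query from the question, written with an explicit JOIN.
-- property1 is added to GROUP BY so the statement is also valid under
-- ONLY_FULL_GROUP_BY; it is constant here, so the groups are unchanged.
SELECT rkeys.rId, records.property1, COUNT(*) AS score
FROM rkeys
JOIN records ON records.id = rkeys.rId
WHERE rkeys.rKey IN (10, 11, 13, 14)
  AND records.property1 = 1
  AND records.property2 = 2
GROUP BY rkeys.rId, records.property1
ORDER BY score DESC;
```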
