Optimizing MySQL GROUP BY/ORDER BY for calculating set intersection

2023-02-03 17:10 Q&A author:

This scenario is simplified somewhat to make the problem clearer. My situation involves a set of data records in MySQL.

CREATE TABLE `records` (                                          
  `id` bigint(20) NOT NULL,                                                       
  `property1` bigint(20) NOT NULL,
  `property2` bigint(20) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `property1` (`property1`),
  KEY `property2` (`property2`)
);

From each record, we generate and store a variable number of keys (hashes) based on the record data.

CREATE TABLE `rkeys` (
  `rKey` bigint(20) NOT NULL,
  `rId` bigint(20) NOT NULL,
  KEY `rKey` (`rKey`),
  KEY `rId` (`rId`),
  FOREIGN KEY (`rId`) REFERENCES `records` (`id`)
);

(The key values are hashes to distribute them over the keyspace more evenly.)

There may be, for example, 5 million records and 50 million keys.

What I'm attempting to do is a fuzzy search on the key set -- match a record against the records in the database with the most keys in common. The results also need to be filtered against the properties in the records table.
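To make the goal concrete, here is a minimal sketch in plain Python of the ranking described above: score every record by the size of the intersection between its key set and the probe record's key set. The data and the function name rank_by_shared_keys are invented for illustration and are not part of the actual system.

```python
from collections import Counter

# Hypothetical data: record id -> set of key hashes generated for that record.
record_keys = {
    1: {10, 11, 13, 20},
    2: {10, 14},
    3: {11, 99},
    4: {98},
}


def rank_by_shared_keys(probe, record_keys):
    """Return (record id, shared key count) pairs, highest count first."""
    scores = Counter()
    for rid, keys in record_keys.items():
        shared = len(probe & keys)  # size of the set intersection
        if shared:  # records with no key in common are left out
            scores[rid] = shared
    return scores.most_common()


ranking = rank_by_shared_keys({10, 11, 13, 14}, record_keys)
print(ranking)  # [(1, 3), (2, 2), (3, 1)]
```

The SQL queries in this question compute the same thing: the WHERE clause selects the rows whose key is in the probe set, and the per-rId count is the intersection size.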

The query I've been working from looks like this:

SELECT rkeys.rId, records.property1, SUM(1) as score 
FROM rkeys, records
WHERE 
   (rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14) AND 
    rkeys.rId = records.id AND 
    records.property1 = 1 AND
    records.property2 = 2 
GROUP BY rId ORDER BY score DESC;

The performance is ok if the number of records with any given key is fairly small; the problem is if I hit a key that appears in several thousand records (say 5000). All of a sudden, the GROUP BY/ORDER BY performance falls off a cliff (15-20s per query). Note that smoothing out the key distribution is not really an option -- the record data itself is unevenly distributed.

The join against the records table doesn't seem to be the core of the problem -- I'm just including it for context. I still see the same problem if all I want to do is this:

SELECT rId, SUM(1) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC;

EXPLAIN output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rkeys
         type: index
possible_keys: rKey
          key: rKey
      key_len: 8
          ref: NULL
         rows: 1
        Extra: Using where; Using temporary; Using filesort

Is there a way I can restructure this table or query in order to speed this operation up?

Have you tried adding nonclustered indices (indexes) to those fields? I haven't seen Keys do this automatically in the past, aside from some implicit clustered index creation that primary key declaration does in some SQL engines.

I'm a noob, but try a compound index on (rKey, rId) or (rId, rKey)

Try something like this:

SELECT rId, Count(*) as score 
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC

And add this index to the table:

(rKey,rId)

Replacing SUM with COUNT shouldn't make too much difference, though (in MSSQL, anyway).
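As a rough sketch of this suggestion, the snippet below builds a tiny rkeys table with the composite (rKey, rId) index and runs the query from this answer. It uses an in-memory SQLite database purely as a runnable stand-in for MySQL, so it says nothing about MySQL's timings or its optimizer. The rows and the index name rKey_rId are assumptions.

```python
import sqlite3

# SQLite in memory is only a stand-in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rkeys (rKey INTEGER NOT NULL, rId INTEGER NOT NULL)")

# Composite index from the answer. In MySQL this would be something like:
#   ALTER TABLE rkeys ADD INDEX rKey_rId (rKey, rId);
# (the index name is an assumption).
conn.execute("CREATE INDEX rKey_rId ON rkeys (rKey, rId)")

conn.executemany(
    "INSERT INTO rkeys (rKey, rId) VALUES (?, ?)",
    [
        (10, 1), (11, 1), (13, 1),  # record 1 matches three probe keys
        (10, 2), (14, 2),           # record 2 matches two
        (11, 3), (99, 3),           # record 3 matches one (99 is not probed)
        (98, 4),                    # record 4 matches none
    ],
)

query = """
    SELECT rId, COUNT(*) AS score
    FROM rkeys
    WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
    GROUP BY rId ORDER BY score DESC
"""
scores = conn.execute(query).fetchall()
print(scores)

# Plan output differs between engines and versions, so it is only printed here.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

Because the index contains both columns the query touches, the engine can in principle answer it from the index alone without reading table rows. Whether MySQL actually does so should be checked with EXPLAIN on the real data.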

"The key values are hashes to distribute them over the keyspace more evenly" is actually a bad idea for performance. Once the data gets too large to be cached, you will be slowed down by the randomness.

The rkeys table smells a lot like a many:many mapping table. Here are several tips on improving the performance of such a table. And it is likely to speed up your SELECT.

Your SELECT should improve significantly with this 'composite' and 'covering' index:

INDEX(property1, property2, id)

(More tips in creating optimal indexes.)
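A minimal sketch of how that index fits the original join query, again using in-memory SQLite only as a runnable stand-in for MySQL. The rows and the index names p1_p2_id and rKey_rId are invented. The point is that the query filters on property1 and property2 and reads only id and property1 from records, so every column it needs from that table is present in the index.

```python
import sqlite3

# SQLite in memory is only a stand-in for MySQL; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        property1 INTEGER NOT NULL,
        property2 INTEGER NOT NULL
    );
    CREATE TABLE rkeys (
        rKey INTEGER NOT NULL,
        rId INTEGER NOT NULL REFERENCES records (id)
    );
    -- In MySQL: ALTER TABLE records ADD INDEX p1_p2_id (property1, property2, id);
    CREATE INDEX p1_p2_id ON records (property1, property2, id);
    CREATE INDEX rKey_rId ON rkeys (rKey, rId);
""")

conn.executemany(
    "INSERT INTO records (id, property1, property2) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 2), (3, 1, 5), (4, 7, 2)],
)
conn.executemany(
    "INSERT INTO rkeys (rKey, rId) VALUES (?, ?)",
    [
        (10, 1), (11, 1), (13, 1),
        (10, 2), (14, 2),
        (10, 3), (11, 3), (13, 3), (14, 3),  # filtered out: property2 is 5
        (10, 4),                             # filtered out: property1 is 7
    ],
)

rows = conn.execute("""
    SELECT rkeys.rId, records.property1, COUNT(*) AS score
    FROM rkeys
    JOIN records ON rkeys.rId = records.id
    WHERE rkeys.rKey IN (10, 11, 13, 14)
      AND records.property1 = 1
      AND records.property2 = 2
    GROUP BY rkeys.rId, records.property1
    ORDER BY score DESC
""").fetchall()
print(rows)
```

Record 3 shares the most keys but is dropped by the property filter, which is the case where resolving the filter from an index rather than from full rows matters most.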
