Optimizing MySQL GROUP BY/ORDER BY for calculating set intersection
This scenario is simplified somewhat to make the problem clearer. My situation involves a set of data records in MySQL.
CREATE TABLE `records` (
`id` bigint(20) NOT NULL,
`property1` bigint(20) NOT NULL,
`property2` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `property1` (`property1`),
KEY `property2` (`property2`)
);
From each record, we generate and store a variable number of keys (hashes) based on the record data.
CREATE TABLE `rkeys` (
`rKey` bigint(20) NOT NULL,
`rId` bigint(20) NOT NULL,
KEY `rKey` (`rKey`),
KEY `rId` (`rId`),
FOREIGN KEY (`rId`) REFERENCES `records` (`id`)
);
(The key values are hashes to distribute them over the keyspace more evenly.)
There may be, for example, 5 million records and 50 million keys.
What I'm attempting to do is a fuzzy search on the key set -- match a record against the records in the database with the most keys in common. The results also need to be filtered against the properties in the records table.
The query I've been working from looks like this:
SELECT rkeys.rId, records.property1, SUM(1) as score
FROM rkeys, records
WHERE
(rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14) AND
rkeys.rId = records.id AND
records.property1 = 1 AND
records.property2 = 2
GROUP BY rId ORDER BY score DESC;
The performance is ok if the number of records with any given key is fairly small; the problem is if I hit a key that appears in several thousand records (say 5000). All of a sudden, the GROUP BY/ORDER BY performance falls off a cliff (15-20s per query). Note that smoothing out the key distribution is not really an option -- the record data itself is unevenly distributed.
The join against the records table doesn't seem to be the core of the problem -- I'm just including it for context. I still see the same problem if all I want to do is this:
SELECT rId, SUM(1) as score
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC;
EXPLAIN output:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: rkeys
type: index
possible_keys: rKey
key: rKey
key_len: 8
ref: NULL
rows: 1
Extra: Using where; Using temporary; Using filesort
Is there a way I can restructure this table or query in order to speed this operation up?
Have you tried adding nonclustered indices (indexes) to those fields? I haven't seen Keys do this automatically in the past, aside from some implicit clustered index creation that primary key declaration does in some SQL engines.
I'm a noob, but try a compound index on (rKey, rId) or (rId, rKey).
Try something like this:
SELECT rId, Count(*) as score
FROM rkeys
WHERE rKey = 10 OR rKey = 11 OR rKey = 13 OR rKey = 14
GROUP BY rId ORDER BY score DESC
and add an index to the table
(rKey,rId)
Replacing SUM with COUNT shouldn't make much difference, though (in MSSQL, anyway).
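To make the suggestion above concrete, here is a minimal sketch. The index name idx_rkey_rid is made up for illustration, and the IN list is just a shorter way of writing the OR chain from the question. With (rKey, rId) in one index, MySQL can read both columns from the index without touching the table rows; the temporary table and filesort will probably still appear in EXPLAIN, since the sort is on an aggregate.

```sql
-- Hypothetical index name; the column order (rKey, rId) follows the suggestion above.
ALTER TABLE rkeys ADD INDEX idx_rkey_rid (rKey, rId);

-- Same query as in the question, with IN instead of the OR chain.
SELECT rId, COUNT(*) AS score
FROM rkeys
WHERE rKey IN (10, 11, 13, 14)
GROUP BY rId
ORDER BY score DESC;

-- Check whether the new index is picked up: look for "Using index" in Extra.
EXPLAIN
SELECT rId, COUNT(*) AS score
FROM rkeys
WHERE rKey IN (10, 11, 13, 14)
GROUP BY rId
ORDER BY score DESC;
```

Whether this removes the 15-20s cliff depends on how many rows the hot keys match, so it is worth timing before and after.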
"The key values are hashes to distribute them over the keyspace more evenly" is actually a bad idea for performance. Once the data gets too large to be cached, you will be slowed down by the randomness.
The rkeys table smells a lot like a many:many mapping table. Here are several tips on improving the performance of such a table. And it is likely to speed up your SELECT.
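The linked tips are not reproduced here, so the following is only a sketch of one common layout for a many:many mapping table, not necessarily what that link recommends. It assumes InnoDB and assumes each (rKey, rId) pair occurs at most once; the table name rkeys_new and the index name are hypothetical.

```sql
-- Assumption: (rKey, rId) pairs are unique, so they can serve as the primary key.
CREATE TABLE rkeys_new (
  rKey bigint NOT NULL,
  rId  bigint NOT NULL,
  PRIMARY KEY (rKey, rId),      -- rows for one key are stored together (InnoDB clustered index)
  KEY idx_rid_rkey (rId, rKey), -- lookups in the other direction; also serves the foreign key
  FOREIGN KEY (rId) REFERENCES records (id)
) ENGINE=InnoDB;

-- Copy the existing data across; this fails on duplicate pairs if the assumption does not hold.
INSERT INTO rkeys_new (rKey, rId)
SELECT rKey, rId FROM rkeys;
```

With this layout, all rows for one rKey sit next to each other in the primary key, so a lookup on a hot key reads one contiguous range.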
Your SELECT should improve significantly with this 'composite' and 'covering' index:
INDEX(property1, property2, id)
(More tips in creating optimal indexes.)
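As a sketch of how that index could be added and used with the query from the question (the index name idx_prop1_prop2_id is made up, and the explicit JOIN is just a rewrite of the comma join, not a different query):

```sql
-- Hypothetical index name; columns as given in the answer.
ALTER TABLE records ADD INDEX idx_prop1_prop2_id (property1, property2, id);

-- The question's join query with an explicit JOIN and an IN list.
-- property1 is added to GROUP BY so the query also runs under ONLY_FULL_GROUP_BY.
SELECT k.rId, r.property1, COUNT(*) AS score
FROM rkeys AS k
JOIN records AS r ON r.id = k.rId
WHERE k.rKey IN (10, 11, 13, 14)
  AND r.property1 = 1
  AND r.property2 = 2
GROUP BY k.rId, r.property1
ORDER BY score DESC;
```

On InnoDB, a secondary index already carries the primary key columns, so listing id explicitly mostly documents the intent. Run EXPLAIN on the query to see which index the optimizer actually picks.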