Optimizing join query to fetch data from A with condition on B sorted by B
I have an item-to-item similarity matrix set up with these tables:
items (id, ...) (Primary key `id`)
similarities (item1_id, item2_id, similarity) (Index on `item1_id` and `item2_id`)
The similarities
tables contains pairs of ids with a similarity index, i.e:
item1_id item2_id similarity
1 2 0.3143
2 3 0.734
For efficient storage "reverse pairs" are omitted, i.e. there's only one pair (1,2), there's no redundant pair (2,1). That means the foreign key for an item may be either item1_id
or item2_id
.
Now I want to find items that are similar to a bunch of other items, sorted by descending similarity. I'm using this query:
SELECT `Item`.*
FROM `items` AS `Item`
LEFT JOIN `similarities` AS `Similarity`
ON (`Item`.`id` = `Similarity`.`item1_id`
AND开发者_Python百科 `Similarity`.`item2_id` IN (1, 2, 3, ...))
OR (`Item`.`id` = `Similarity`.`item2_id`
AND `Similarity`.`item1_id` IN (1, 2, ,3, ...))
WHERE `Similarity`.`item1_id` IN (1, 2, 3, ...)
OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
GROUP BY `Item`.`id`
ORDER BY `Similarity`.`similarity` desc
It's extremely slow though, it takes 4-5 seconds for ~100,000 items and ~30,000 similarity pairs. It seems the JOIN is extremely costly. Here's the query EXPLAIN
ed:
select_type table type possible_keys key key_len ref rows Extra
SIMPLE Similarity index_merge item1_id,item2_id item1_id,item2_id 110,110 NULL 31 Using sort_union(item1_id,...
SIMPLE Item ALL PRIMARY NULL NULL NULL 136600 Using where; Using join buffer
What can I do to speed this up? Worst case I would do it in two separate queries, but I'd prefer one JOIN query if possible.
I didn't actually try this but maybe it points you in the right direction. The idea is to make a temp result of the UNION
of (unique) id, similarity pairs from similarities
, then join items with that.
SELECT Item.*, s.other_item_id, s.similarity
FROM items AS Item
JOIN
(
SELECT item1_id AS id, item2_id AS other_item_id, similarity FROM similarities
UNION
SELECT item2_id AS id, item1_id AS other_item_id, similarity FROM similarities
) AS s ON s.id = items.id
WHERE items.id IN (1, 2, 3, ...)
ORDER BY s.similarity DESC;
In your original query you don't need to restrict the ids from similarities
in both the JOIN
condition and the WHERE
clause.
I am wondering whether joining to the items table twice will perform better than the two queries. Pardon the psuedo-code-ish SELECT portion of this statement - I think you'll actually need a CASE for every field value...
SELECT
CASE WHEN `Item2`.`id` IS NULL THEN
`Item1`.`id`
ELSE `Item2`.`id`
END,
SELECT
CASE WHEN `Item2`.`id` IS NULL THEN
`Item1`.`name`
ELSE `Item2`.`name`
END,
SELECT
CASE WHEN `Item2`.`id` IS NULL THEN
`Item1`.`description`
ELSE `Item2`.`description`
END,
[and so on]
FROM `items` AS `Item1`
LEFT OUTER JOIN `similarities` AS `Similarity`
ON (`Item1`.`id` = `Similarity`.`item1_id`
RIGHT OUTER JOIN `items` AS `Item2`
ON (`Item2`.`id` = `Similarity`.`item2_id`
WHERE `Similarity`.`item1_id` IN (1, 2, 3, ...)
OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
ORDER BY `Similarity`.`similarity` desc
Thanks to the inspirations, I ended up with this query:
SELECT `Item`.*
FROM `items` AS `Item`
JOIN (
SELECT `item1_id` AS `id`, `similarity`
FROM `similarities`
WHERE `similarities`.`item2_id` IN (1, 2, 3, ...)
UNION
SELECT `item2_id` AS `id`, `similarity`
FROM `similarities`
WHERE `similarities`.`item1_id` IN (1, 2, 3, ...)
) AS `SimilarityUnion` ON `SimilarityUnion`.`id` = `Item`.`id`
GROUP BY `SimilarityUnion`.`id`
ORDER BY `SimilarityUnion`.`similarity` DESC
精彩评论