Optimizing join query to fetch data from A with condition on B sorted by B

2023-01-20 17:58 问答作者：

I have an item-to-item similarity matrix set up with these tables:

items (id, ...) (Primary key `id`)
similarities (item1_id, item2_id, similarity) (Index on `item1_id` and `item2_id`)

The similarities tables contains pairs of ids with a similarity index, i.e:

item1_id  item2_id  similarity
1         2         0.3143
2         3         0.734

For efficient storage "reverse pairs" are omitted, i.e. there's only one pair (1,2), there's no redundant pair (2,1). That means the foreign key for an item may be either item1_id or item2_id.

Now I want to find items that are similar to a bunch of other items, sorted by descending similarity. I'm using this query:

SELECT    `Item`.*
FROM      `items` AS `Item`
LEFT JOIN `similarities` AS `Similarity`
       ON (`Item`.`id` = `Similarity`.`item1_id`
              AND开发者_Python百科 `Similarity`.`item2_id` IN (1, 2, 3, ...))
          OR (`Item`.`id` = `Similarity`.`item2_id`
              AND `Similarity`.`item1_id` IN (1, 2, ,3, ...))
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
GROUP BY  `Item`.`id`
ORDER BY  `Similarity`.`similarity` desc

It's extremely slow though, it takes 4-5 seconds for ~100,000 items and ~30,000 similarity pairs. It seems the JOIN is extremely costly. Here's the query EXPLAINed:

select_type  table       type         possible_keys      key                key_len  ref   rows    Extra
SIMPLE       Similarity  index_merge  item1_id,item2_id  item1_id,item2_id  110,110  NULL  31      Using sort_union(item1_id,...
SIMPLE       Item        ALL          PRIMARY            NULL               NULL     NULL  136600  Using where; Using join buffer

What can I do to speed this up? Worst case I would do it in two separate queries, but I'd prefer one JOIN query if possible.

I didn't actually try this but maybe it points you in the right direction. The idea is to make a temp result of the UNION of (unique) id, similarity pairs from similarities, then join items with that.

SELECT Item.*, s.other_item_id, s.similarity
FROM items AS Item
JOIN
    (
    SELECT item1_id AS id, item2_id AS other_item_id, similarity FROM similarities
    UNION
    SELECT item2_id AS id, item1_id AS other_item_id, similarity FROM similarities
    ) AS s ON s.id = items.id
WHERE items.id IN (1, 2, 3, ...)
ORDER BY s.similarity DESC;

In your original query you don't need to restrict the ids from similarities in both the JOIN condition and the WHERE clause.

I am wondering whether joining to the items table twice will perform better than the two queries. Pardon the psuedo-code-ish SELECT portion of this statement - I think you'll actually need a CASE for every field value...

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`id`
ELSE `Item2`.`id`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`name`
ELSE `Item2`.`name`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`description`
ELSE `Item2`.`description`
END,

[and so on]

FROM      `items` AS `Item1`
LEFT OUTER JOIN `similarities` AS `Similarity`
       ON (`Item1`.`id` = `Similarity`.`item1_id`
RIGHT OUTER JOIN `items` AS `Item2`
       ON (`Item2`.`id` = `Similarity`.`item2_id`       
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
ORDER BY  `Similarity`.`similarity` desc

Thanks to the inspirations, I ended up with this query:

SELECT `Item`.*
FROM `items` AS `Item`
JOIN (
    SELECT `item1_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item2_id` IN (1, 2, 3, ...)
    UNION
    SELECT `item2_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item1_id` IN (1, 2, 3, ...)
) AS `SimilarityUnion` ON `SimilarityUnion`.`id` = `Item`.`id`
GROUP BY `SimilarityUnion`.`id`
ORDER BY `SimilarityUnion`.`similarity` DESC

继续阅读：join

Optimizing join query to fetch data from A with condition on B sorted by B

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？