开发者

Optimizing join query to fetch data from A with condition on B sorted by B

I have an item-to-item similarity matrix set up with these tables:

items (id, ...) (Primary key `id`)
similarities (item1_id, item2_id, similarity) (Index on `item1_id` and `item2_id`)

The similarities tables contains pairs of ids with a similarity index, i.e:

item1_id  item2_id  similarity
1         2         0.3143
2         3         0.734

For efficient storage "reverse pairs" are omitted, i.e. there's only one pair (1,2), there's no redundant pair (2,1). That means the foreign key for an item may be either item1_id or item2_id.

Now I want to find items that are similar to a bunch of other items, sorted by descending similarity. I'm using this query:

SELECT    `Item`.*
FROM      `items` AS `Item`
LEFT JOIN `similarities` AS `Similarity`
       ON (`Item`.`id` = `Similarity`.`item1_id`
              AND开发者_Python百科 `Similarity`.`item2_id` IN (1, 2, 3, ...))
          OR (`Item`.`id` = `Similarity`.`item2_id`
              AND `Similarity`.`item1_id` IN (1, 2, ,3, ...))
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
GROUP BY  `Item`.`id`
ORDER BY  `Similarity`.`similarity` desc

It's extremely slow though, it takes 4-5 seconds for ~100,000 items and ~30,000 similarity pairs. It seems the JOIN is extremely costly. Here's the query EXPLAINed:

select_type  table       type         possible_keys      key                key_len  ref   rows    Extra
SIMPLE       Similarity  index_merge  item1_id,item2_id  item1_id,item2_id  110,110  NULL  31      Using sort_union(item1_id,...
SIMPLE       Item        ALL          PRIMARY            NULL               NULL     NULL  136600  Using where; Using join buffer

What can I do to speed this up? Worst case I would do it in two separate queries, but I'd prefer one JOIN query if possible.


I didn't actually try this but maybe it points you in the right direction. The idea is to make a temp result of the UNION of (unique) id, similarity pairs from similarities, then join items with that.

SELECT Item.*, s.other_item_id, s.similarity
FROM items AS Item
JOIN
    (
    SELECT item1_id AS id, item2_id AS other_item_id, similarity FROM similarities
    UNION
    SELECT item2_id AS id, item1_id AS other_item_id, similarity FROM similarities
    ) AS s ON s.id = items.id
WHERE items.id IN (1, 2, 3, ...)
ORDER BY s.similarity DESC;

In your original query you don't need to restrict the ids from similarities in both the JOIN condition and the WHERE clause.


I am wondering whether joining to the items table twice will perform better than the two queries. Pardon the psuedo-code-ish SELECT portion of this statement - I think you'll actually need a CASE for every field value...

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`id`
ELSE `Item2`.`id`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`name`
ELSE `Item2`.`name`
END,

SELECT    
CASE WHEN `Item2`.`id` IS NULL THEN 
  `Item1`.`description`
ELSE `Item2`.`description`
END,

[and so on]

FROM      `items` AS `Item1`
LEFT OUTER JOIN `similarities` AS `Similarity`
       ON (`Item1`.`id` = `Similarity`.`item1_id`
RIGHT OUTER JOIN `items` AS `Item2`
       ON (`Item2`.`id` = `Similarity`.`item2_id`       
WHERE     `Similarity`.`item1_id` IN (1, 2, 3, ...)
          OR `Similarity`.`item2_id` IN (1, 2, 3, ...)
ORDER BY  `Similarity`.`similarity` desc


Thanks to the inspirations, I ended up with this query:

SELECT `Item`.*
FROM `items` AS `Item`
JOIN (
    SELECT `item1_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item2_id` IN (1, 2, 3, ...)
    UNION
    SELECT `item2_id` AS `id`, `similarity`
    FROM   `similarities`
    WHERE  `similarities`.`item1_id` IN (1, 2, 3, ...)
) AS `SimilarityUnion` ON `SimilarityUnion`.`id` = `Item`.`id`
GROUP BY `SimilarityUnion`.`id`
ORDER BY `SimilarityUnion`.`similarity` DESC
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜