The fastest SQL expression for complex searching in many-to-many relationships?

2023-02-27 09:08 问答作者：

In a product_tag table, the columns are

id, product_id, tag_id

If I would like to search for a product that is tag1 OR tag2 OR tag3, the direct way is:

SELECT DISTINCT productId FROM product_tags WHERE tagId IN (2,4);

If I would like to search for a product that is tag1 AND tag2 AND tag3, the direct way is:

SELECT productId FROM product_tag WHERE tag_id IN (tag1, tag2, tag3) GROUP BY productId HAVING COUNT(*) = 3

But the question is if I would like to search a product that has a complex tag relationship, such as:

开发者_高级运维
product that is (tag1 OR tag2 OR tag3) AND (tag 4 OR tag5 OR tag 6) AND (tag 7 OR tag8 OR tag9)

What is the SQL expression with best performance? (and preferably elegant).

Edit:

The most important performance gain was to add indexes, as Remus in the comments recommended.

You really can't do this directly with a set-based language such as SQL.

Your simple "AND" version also won't work unless you have no duplicates of (productId,tagId).

For your complex relationship, it would be necessary to break your query apart into several subqueries. First break along all "AND" clauses:

WHERE tag_id IN (tag1, tag2, tag3)
WHERE tag_id IN (tag4, tag5, tag6)
WHERE tag_id IN (tag7, tag8, tag9)

Then do an INTERSECTION of the query results.

If any of these subqueries is not a simply OR'ed list but in turn contain AND's in a more complex logic structure, you need to further break down these subqueries recursively.

In other words, you can recursively break down the logic tree along "AND" clauses, then at each tree level do an INTERSECT of the query results.

Doing this is likely to be much faster than generating a huge SQL that will return the result in one go -- because each of the simple OR'ed list can leverage an index you have on tag_id.

Union all 3 groups. They're 3 selects, but they're really simple ones.

The performance wouldn't be that great but you could do a nested query

SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag1, tag2, tag3)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag4, tag5, tag6)
)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag7, tag8, tag9)
)

I noticed Select values that meet different conditions on different rows?

How about

SELECT DISTINCT t1.productId FROM product_tags t1
JOIN product_tags t2 ON t1.productId=t2.productId AND t2.tagId IN (tag4,tag5,tag6)
JOIN product_tags t3 ON t1.productId=t3.productId AND t3.tagId IN (tag7, tag8, tag9)
AND t1.tagId IN (tag1,tag2,tag3)

It would be even better if DISTINCT could be removed somehow.

Is the number of tags known in advance? If it isn't something that will grow over time, I would change tag_id to a bitset.

WITH T AS 
 (SELECT product_id, bit_or((1<<tag_id)::bigint) tagset 
  FROM product_tag GROUP BY product_id) 
SELECT product_id 
WHERE (tagset & 7)>0 AND (tagset & 56)>0 AND (tagset & 448)>0;

I have used Postgres here, where & is known as bitwise AND; bit_or is an aggregate function (SUM would work just as well here, assuming no duplicates allowed in the product_tag table). The magic numbers in the masks are just bit_ors of powers of two. The double-colon is a Postgres cast. Everything here should be available under slightly different names elsewhere. But PG also has bitstrings of indefinite size, and the same logic with bitstrings can be implemented for a large number of tags.

By the way, the situation of matching all tags is just (tagset & mask)=mask.

This is in fact why your indices are working so fast; they are probably being merged into this type of test.

继续阅读：many-to-many performance sql

The fastest SQL expression for complex searching in many-to-many relationships?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？