开发者

The fastest SQL expression for complex searching in many-to-many relationships?

In a product_tag table, the columns are

id, product_id, tag_id

If I would like to search for a product that is tag1 OR tag2 OR tag3, the direct way is:

SELECT DISTINCT productId FROM product_tags WHERE tagId IN (2,4);

If I would like to search for a product that is tag1 AND tag2 AND tag3, the direct way is:

SELECT productId FROM product_tag WHERE tag_id IN (tag1, tag2, tag3) GROUP BY productId HAVING COUNT(*) = 3

But the question is if I would like to search a product that has a complex tag relationship, such as:

开发者_高级运维

product that is (tag1 OR tag2 OR tag3) AND (tag 4 OR tag5 OR tag 6) AND (tag 7 OR tag8 OR tag9)

What is the SQL expression with best performance? (and preferably elegant).

Edit:

The most important performance gain was to add indexes, as Remus in the comments recommended.


You really can't do this directly with a set-based language such as SQL.

Your simple "AND" version also won't work unless you have no duplicates of (productId,tagId).

For your complex relationship, it would be necessary to break your query apart into several subqueries. First break along all "AND" clauses:

WHERE tag_id IN (tag1, tag2, tag3)
WHERE tag_id IN (tag4, tag5, tag6)
WHERE tag_id IN (tag7, tag8, tag9)

Then do an INTERSECTION of the query results.

If any of these subqueries is not a simply OR'ed list but in turn contain AND's in a more complex logic structure, you need to further break down these subqueries recursively.

In other words, you can recursively break down the logic tree along "AND" clauses, then at each tree level do an INTERSECT of the query results.

Doing this is likely to be much faster than generating a huge SQL that will return the result in one go -- because each of the simple OR'ed list can leverage an index you have on tag_id.


Union all 3 groups. They're 3 selects, but they're really simple ones.


The performance wouldn't be that great but you could do a nested query

SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag1, tag2, tag3)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag4, tag5, tag6)
)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag7, tag8, tag9)
)


I noticed Select values that meet different conditions on different rows?

How about

SELECT DISTINCT t1.productId FROM product_tags t1
JOIN product_tags t2 ON t1.productId=t2.productId AND t2.tagId IN (tag4,tag5,tag6)
JOIN product_tags t3 ON t1.productId=t3.productId AND t3.tagId IN (tag7, tag8, tag9)
AND t1.tagId IN (tag1,tag2,tag3)

It would be even better if DISTINCT could be removed somehow.


Is the number of tags known in advance? If it isn't something that will grow over time, I would change tag_id to a bitset.

WITH T AS 
 (SELECT product_id, bit_or((1<<tag_id)::bigint) tagset 
  FROM product_tag GROUP BY product_id) 
SELECT product_id 
WHERE (tagset & 7)>0 AND (tagset & 56)>0 AND (tagset & 448)>0;

I have used Postgres here, where & is known as bitwise AND; bit_or is an aggregate function (SUM would work just as well here, assuming no duplicates allowed in the product_tag table). The magic numbers in the masks are just bit_ors of powers of two. The double-colon is a Postgres cast. Everything here should be available under slightly different names elsewhere. But PG also has bitstrings of indefinite size, and the same logic with bitstrings can be implemented for a large number of tags.

By the way, the situation of matching all tags is just (tagset & mask)=mask.

This is in fact why your indices are working so fast; they are probably being merged into this type of test.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜