开发者

MySQL INTERSECT via joins table?

So essentially I have t开发者_运维问答wo tables, containing URLS and TAGS, with a has-and-belongs-to-many relationship between the two via a joins tables TAGS_URLS.

A simple query to find URL's by tags would be:

SELECT urls.id FROM urls 
  INNER JOIN tags_urls ON urls.id=tags_urls.url_id
  INNER JOIN tags ON tags_urls.tag_id=tags.id 
WHERE tags.tag IN ("sample","tag","list");

However, I'm trying to recover an intersection of all URL's that contain all of a set of tags. I.e., only URL's that contain the tag "sample" AND "tag" AND "list".

I have a working query, but I cannot get the query to execute in less than 30 seconds.

SELECT a.id
  FROM
    (SELECT DISTINCT urls.id FROM urls
      INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
      WHERE tags.tag = 'sample') a
  JOIN
     (SELECT DISTINCT urls.id FROM urls
      INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
      WHERE tags.tag = 'list') b
  ON a.id = b.id;

The result set is correct, but the performance is horrific.

I do also currently have the data duplicated in a Redis database as a list of URL id's stored in tag sets so I can do something like this and get a result set VERY quickly.

SINTER "tag-sample" "tag-list"

Would it be possible, with reasonable effort, to bring the MySQL performance for this task up to the Redis levels with SINTER?


I am not 100% sure, but I think the underlying engine is creating a temp table for each of your subselects. Depending on the size of your data, this can be quite costly. If they are big (and they are in your case) temp tables have to write their contents out to disk because they are too big to hold in memory at once. So basically your query is copying huge amounts of data as it tries to build out two temporary tables that match the selection criteria for your two subselects. Once this is done, it finally executes the outer select and this most likely rather fast.

I would try to factor the subselects out for inner joins. I think the following will give you what you are looking for:

select urls.id from urls
inner join tags_urls tu1 on tu1.url_id = urls.id
inner join tags t1 on tu1.tag_id = t1.id and t1.tag = 'sample'
inner join tag_urls tu2 on tu2.url_id = urls.id
inner join tags t2 on t2.id = tu2.tag_id and t2.tag = 'list'

You would continue to add pairs of inner joins to tag_urls and tags for each 'tag' you wanted to intersect with. Again, run this through explain and make sure everything has the right index.

DBMS's can do pretty well with a several inner joins but as you increase the number of intersections, your performance will decrease.


You might try replacing the sql subqueries in the 2nd statmenet with joins. Robert Vieira claims in his Sql Server books that sometimes joins are faster and sometimes subqueries are faster. Hard to believe that the same would not be true for MySql as well. Also, if there ia quite a bit of other data in the table, other than 'list' or 'sample', then you might want to insert this data into a temp table and run your queries from that table instead. This is especially true if you are going to be running multiple queries on that data.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜