开发者

Score algorithm in multiple join

I have a list of publications stored in publications table. Each publication has a many-to-many relation with categories and also a many-to-many relation with keywords.

Given a publication I'd like to find related one开发者_如何学Cs based on a score value computed with the following algorithm:

  • each shared category with other publications counts as one point
  • each shared keyword with other publications counts as one point
  • the score value is the sum of the points computed with previous steps

I want to retrieve with a single query the list of related publications ordered by this score.

Now I have these two queries which compute the score for both categories and keyword

SELECT c.publication_id, (COUNT(c.category_id)) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC

and for the keyword score

SELECT k.publication_id, (COUNT(k.keyword_id)) AS kscore
FROM key_pub k
WHERE k.keyword IN <list of category ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC

Finally I need to JOIN the resulting query with a SELECT query which should retrieve publications data (title, intro, etc,) ordering them by score and with a limit clause to get the most relevant publications related to the selected one.

Currently I tried to use these two queries as subtables in a join:

SELECT mydata.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

EXPLAIN shows me that a couple of temporary table will be used. Could it be a performance problem? Is there any better way to implement this?

Update

To answer to Johan's comment

Your solution is wrong. Use a LIMIT clause in subqueries could lead to inconsistent results with every value for the limit. What if I have the following results for the subqueries (I'll show 11 records, but your query will fetch only the first ten)

+-------+--------+ +-------+--------+
| p.id  | cscore | | p.id  | kscore |
+-------+--------+ +-------+--------+
| 27854 | 100    | | 27865 | 100    |
| 27853 | 100    | | 27864 | 100    |
| 27852 | 100    | | 27863 | 100    |
| 27851 | 100    | | 27862 | 100    |
| 27850 | 100    | | 27861 | 100    |
| 27849 | 100    | | 27860 | 100    |
| 27848 | 100    | | 27859 | 100    |
| 27847 | 100    | | 27858 | 100    |
| 27846 | 100    | | 27857 | 100    |
| 27845 | 100    | | 27856 | 100    |
| 27844 | 100    | | 27855 | 100    |
| 1000  | 99     | | 1000  | 99     |
+-------+--------+ +-------+--------+

If I have ten record with 100 as cscore and ten different records with 100 as kscore the join will produce an empty set. So I'm not getting any result, while the publication with id 1000 should be the solution and it's left out from the result set.

Furthermore I could consider your solution with a LEFT JOIN, in this case only records from the left table will be fetched, and each record will get a total score of 100 (because of the NULL given by the empty kscore field in the second table). Again, the result is wrong because the highest scored record should be p1000 with a total score of 198 (= 99 + 99)

Your solution cannot produce reliable results.


You only want 5 results each from the subqueries.
I think it is best to only select 5 from then and use that in the query.

Rewrite q1 as:

SELECT c.publication_id, COUNT(*) AS cscore
FROM cat_pub c
WHERE c.publication_id = p.id  
AND c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
LIMIT 10

Rewrite q2 as:

SELECT k.publication_id, COUNT(*) AS kscore
FROM key_pub k
WHERE p.id = k.publication_id
  AND k.keyword IN <list of category ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
LIMIT 10

Leave the join as is:

SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

Note that count(*) is usually a faster choice, because it will not test of null If you can have null values and don't want to include those in the count, then name the count(field) explicitly.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜