Score algorithm in multiple join

2023-04-04 18:35 问答作者：

I have a list of publications stored in publications table. Each publication has a many-to-many relation with categories and also a many-to-many relation with keywords.

Given a publication I'd like to find related one开发者_如何学Cs based on a score value computed with the following algorithm:

each shared category with other publications counts as one point
each shared keyword with other publications counts as one point
the score value is the sum of the points computed with previous steps

I want to retrieve with a single query the list of related publications ordered by this score.

Now I have these two queries which compute the score for both categories and keyword

SELECT c.publication_id, (COUNT(c.category_id)) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC

and for the keyword score

SELECT k.publication_id, (COUNT(k.keyword_id)) AS kscore
FROM key_pub k
WHERE k.keyword IN <list of category ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC

Finally I need to JOIN the resulting query with a SELECT query which should retrieve publications data (title, intro, etc,) ordering them by score and with a limit clause to get the most relevant publications related to the selected one.

Currently I tried to use these two queries as subtables in a join:

SELECT mydata.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

EXPLAIN shows me that a couple of temporary table will be used. Could it be a performance problem? Is there any better way to implement this?

Update

To answer to Johan's comment

Your solution is wrong. Use a LIMIT clause in subqueries could lead to inconsistent results with every value for the limit. What if I have the following results for the subqueries (I'll show 11 records, but your query will fetch only the first ten)

+-------+--------+ +-------+--------+
| p.id  | cscore | | p.id  | kscore |
+-------+--------+ +-------+--------+
| 27854 | 100    | | 27865 | 100    |
| 27853 | 100    | | 27864 | 100    |
| 27852 | 100    | | 27863 | 100    |
| 27851 | 100    | | 27862 | 100    |
| 27850 | 100    | | 27861 | 100    |
| 27849 | 100    | | 27860 | 100    |
| 27848 | 100    | | 27859 | 100    |
| 27847 | 100    | | 27858 | 100    |
| 27846 | 100    | | 27857 | 100    |
| 27845 | 100    | | 27856 | 100    |
| 27844 | 100    | | 27855 | 100    |
| 1000  | 99     | | 1000  | 99     |
+-------+--------+ +-------+--------+

If I have ten record with 100 as cscore and ten different records with 100 as kscore the join will produce an empty set. So I'm not getting any result, while the publication with id 1000 should be the solution and it's left out from the result set.

Furthermore I could consider your solution with a LEFT JOIN, in this case only records from the left table will be fetched, and each record will get a total score of 100 (because of the NULL given by the empty kscore field in the second table). Again, the result is wrong because the highest scored record should be p1000 with a total score of 198 (= 99 + 99)

Your solution cannot produce reliable results.

You only want 5 results each from the subqueries.
I think it is best to only select 5 from then and use that in the query.

Rewrite q1 as:

SELECT c.publication_id, COUNT(*) AS cscore
FROM cat_pub c
WHERE c.publication_id = p.id  
AND c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
LIMIT 10

Rewrite q2 as:

SELECT k.publication_id, COUNT(*) AS kscore
FROM key_pub k
WHERE p.id = k.publication_id
  AND k.keyword IN <list of category ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
LIMIT 10

Leave the join as is:

SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

Note that count(*) is usually a faster choice, because it will not test of null If you can have null values and don't want to include those in the count, then name the count(field) explicitly.

继续阅读：count join

Score algorithm in multiple join

Update

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Update

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？