What formula is used for building a list of related items in a tag-based system?

2023-03-08 07:55 问答作者：

There are a lot of sites out there that use 'tags' to categorize items in their system. For example, YouTube uses keywords to categorize videos, Stack Overflow uses tags to categorize questions, etc.

What formulas do these sites use (especially SO) to build a list of items related to another item based on the tags it has? I'm building a system much like the one on SO and I'd like to find a way to generate a list of 20 items or so based on the tags of one item, but also make it spread e开发者_运维百科nough so that each photo generates a vastly different list, and so that clicking an item in any given related list could eventually lead you to almost every item in the database.

The technical term for an organization based on user tags is a folksonomy. A google search for that term brings up a huge amount of material on how these systems are put together. A good place to start is the Wikipedia article.

I had to solve this exact problem for a contract a few years back, and the company was nice enough to let me blog about how I did it at http://bentilly.blogspot.com/2011/02/finding-related-items.html.

You'll note that if you get a decent volume of data then you'll really, really want to do this out of the database.

Similarity between items is often represented as dot products between the vectors representing the items. So if you have a tag based system, each tag will define one dimension. The vector then for an item becomes 1 in dimension i if tag i is set for this item (or higher numbers if you allow multiple tagging). If you calculate the dot product of the vectors of two items you will get the similarity for those items (N.b. the vectors have to be normalized so that the absolute value is 1).

Note that the dimensionality will get very large (several tens of thousands of tags are common). This sounds like a show stopper for this kind of thing. But you will also not that the vectors are really sparse and multiple dot product become one big matrix multiplication of a sparse matrix with it's own transposition. Using efficient algorithms for sparse matrix multiplication, this can be done relatively fast.

Also note, that most systems do not only rely on tags, but rather on "user behavior" (whatever that means). I.e. for Youtube user behavior would be "Watching a video", "Subscribing to a channel", "looking for similar videos as video X" or "tagging video x with tag y".

I ended up using the following code (with different names), which finds all other items with at least one tag in common, and orders the results by number of common tags, descending, and subsorts by other criteria specific to my problem:

SELECT PT.WidgetID, COUNT(*) AS CommonTags, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date FROM WidgetTags PT INNER JOIN WidgetStatistics PS ON PT.WidgetID = PS.WidgetID
    WHERE PT.TagID IN (SELECT PTInner.TagID FROM WidgetTags PTInner WHERE PTInner.WidgetID = @WidgetID)
    AND PT.WidgetID != @WidgetID
GROUP BY PT.WidgetID, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
ORDER BY CommonTags DESC, PS.OtherOrderingCriteria1 DESC, PS.OtherOrderingCriteria2 DESC, PS.OtherOrderingCriteria3 DESC, PS.Date DESC, PT.WidgetID DESC

继续阅读：algorithm sql tags

What formula is used for building a list of related items in a tag-based system?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？