Compare large sets of weighted tag clouds?

2023-01-04 05:48 Q&A author:

I have thousands of large sets of tag cloud data; I can retrieve a weighted tag cloud for each set with a simple select/group statement (for example):

SELECT tag, COUNT( * ) AS weight
FROM tags
WHERE set_id = $set_id
GROUP BY tag
ORDER BY COUNT( * ) DESC

What I'd like to know is this: what is the best way to compare weighted tag clouds and find the other sets that are most similar, taking the weight (the number of occurrences within the set) into account and possibly even computing a comparison score, all in one reasonably efficient statement?

I found the web to be lacking quality literature on the topic, thought it somewhat broadly relevant and tried to abstract my example to keep it generally applicable.

First you need to normalize every tag cloud as you would a vector, treating a tag cloud as an n-dimensional vector in which every dimension represents a word and its value represents the weight of that word.

You can do it by calculating the norm (or magnitude) of every cloud, that is, the square root of the sum of the squared weights:

m = sqrt( w1*w1 + w2*w2 + ... + wn*wn)

Then you generate your normalized tag cloud by dividing each weight by the norm of the cloud.

After this you can easily calculate similarity by taking the scalar (dot) product of two clouds: multiply the weights of each matching tag and add all the products together. E.g.:
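The normalization step can be sketched in Python as follows. This is only an illustration: the function name `normalize`, the dictionary representation of a cloud, and the sample weights are assumptions, not part of the original answer.

```python
import math

def normalize(cloud):
    """Scale a {tag: weight} mapping so its Euclidean norm becomes 1."""
    norm = math.sqrt(sum(w * w for w in cloud.values()))
    if norm == 0:
        # An empty or all-zero cloud has no direction; return zeros.
        return {tag: 0.0 for tag in cloud}
    return {tag: w / norm for tag, w in cloud.items()}

# Made-up weights: the norm is sqrt(3*3 + 4*4) = 5.
cloud = {"python": 3, "sql": 4}
unit = normalize(cloud)
```

After normalization, the squared weights of a non-empty cloud sum to 1, which is what keeps the later similarity score bounded.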

v1 = { a: 0.12, b: 0.31, c: 0.17, e: 0.11 }
v2 = { a: 0.21, b: 0.11, d: 0.08, e: 0.28 }

similarity = v1.a*v2.a + v1.b*v2.b + 0 + 0 + v1.e*v2.e

If one vector has a tag that the other doesn't, then that specific product is 0.

For normalized clouds this similarity lies within the range [0, 1]: 0 means no correlation (no shared tags), while 1 means the normalized clouds are equal.
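A minimal Python sketch of this sparse product, assuming each cloud is stored as a dictionary. The helper name `dot` is hypothetical. The values are copied from the example above as given, so they are used without checking that they are unit-length.

```python
def dot(cloud_a, cloud_b):
    """Dot product of two sparse {tag: weight} mappings."""
    # Iterate over the smaller cloud; tags missing from the other
    # cloud contribute 0 and are simply skipped.
    if len(cloud_a) > len(cloud_b):
        cloud_a, cloud_b = cloud_b, cloud_a
    return sum(w * cloud_b[tag] for tag, w in cloud_a.items() if tag in cloud_b)

v1 = {"a": 0.12, "b": 0.31, "c": 0.17, "e": 0.11}
v2 = {"a": 0.21, "b": 0.11, "d": 0.08, "e": 0.28}
similarity = dot(v1, v2)  # only a, b and e are shared
```

Skipping unshared tags instead of multiplying by zero keeps the cost proportional to the size of the smaller cloud.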
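Since the question asks for a single statement, the whole computation might be pushed into SQL roughly as below. This is a hedged sketch, not a tuned solution: it assumes the `tags(set_id, tag)` table from the question, uses an in-memory SQLite database with made-up rows, and registers `sqrt` from Python because not every SQLite build ships math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("sqrt", 1, math.sqrt)  # not all SQLite builds have sqrt()
conn.execute("CREATE TABLE tags (set_id INTEGER, tag TEXT)")
# Made-up sample data: sets 1 and 2 have the same proportions, set 3 is unrelated.
rows = [(1, "a"), (1, "a"), (1, "b"),
        (2, "a"), (2, "a"), (2, "b"),
        (3, "c")]
conn.executemany("INSERT INTO tags VALUES (?, ?)", rows)

QUERY = """
WITH weights AS (              -- weighted tag cloud of every set
    SELECT set_id, tag, COUNT(*) AS weight
    FROM tags
    GROUP BY set_id, tag
),
norms AS (                     -- magnitude of every cloud
    SELECT set_id, sqrt(SUM(weight * weight)) AS norm
    FROM weights
    GROUP BY set_id
)
SELECT b.set_id AS other_set,
       SUM(a.weight * b.weight) / (na.norm * nb.norm) AS similarity
FROM weights AS a
JOIN weights AS b ON b.tag = a.tag AND b.set_id <> a.set_id
JOIN norms AS na ON na.set_id = a.set_id
JOIN norms AS nb ON nb.set_id = b.set_id
WHERE a.set_id = ?
GROUP BY b.set_id, na.norm, nb.norm
ORDER BY similarity DESC
"""

similar = conn.execute(QUERY, (1,)).fetchall()
```

Dividing the raw dot product by both norms is equivalent to normalizing each cloud first. Sets that share no tag with the target never survive the join, so they are absent from the result rather than listed with a score of 0. With thousands of large sets, precomputing the `weights` and `norms` results into indexed tables would likely matter more than the exact form of the query.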
