Compare large sets of weighted tag clouds?
I have thousands of large sets of tag cloud data; I can retrieve a weighted tag clouds for ea开发者_StackOverflow中文版ch set with a simple select/group statement (for example)
SELECT tag, COUNT( * ) AS weight
FROM tags
WHERE set_id = $set_id
GROUP BY tag
ORDER BY COUNT( * ) DESC
What I'd like to know is this -- what is the best way to compare weighted tag clouds and find other sets that are most similar, taking the weight (the number of occurrences within the set) into account and possibly even computing a comparison score, all in one somewhat effiecient statement?
I found the web to be lacking quality literature on the topic, thought it somewhat broadly relevant and tried to abstract my example to keep it generally applicable.
First you need to normalize every tag cloud like you would do for a vector, assuming that a tag cloud is a n-dimensional vector in which every dimension rapresents a word and its value rapresents the weight of the word.
You can do it by calculating the norm (or magnitude) of every cloud, that is the square root of all the weights squared:
m = sqrt( w1*w1 + w2*w2 + ... + wn*wn)
then you generate your normalized tag cloud by dividing each weight for the norm of the cloud.
After this you can easily calculate similarity by using a scalar product between the clouds, that is just multiply every component of each pair and all all of them together. Eg:
v1 = { a: 0.12, b: 0.31; c: 0.17; e: 0.11 }
v2 = { a: 0.21, b: 0.11; d: 0.08; e: 0.28 }
similarity = v1.a*v2.a + v1.b*v1.b + 0 + 0 + v1.e*v2.e
if a vector has a tag that the other one doesn't then that specific product is obviously 0.
This similarity in within range [0,1]
, 0 means no correlation while 1 means equality.
精彩评论