Euclidean distance between posts based on tags
I am playing with the Euclidean distance example from the book Programming Collective Intelligence:
# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1
    # if they have no ratings in common, return 0
    if len(si)==0: return 0
    # Add up the squares of all the differences
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                        for item in prefs[person1] if item in prefs[person2]])
This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags, so I build a map such as:
url1 -> tag1 tag2
url2 -> tag1 tag3
But if I apply this to the function,
pow(prefs[person1][item]-prefs[person2][item],2)
this becomes 0 cause tags don't have weight same tags has ranking 1. I modified the code to manually create a difference to test,
pow(prefs[1,2)
Then I got a lot of 0.5 similarities, but the similarity of a post to itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation.
Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    p1, p2 = prefs[person1], prefs[person2]
    si = set(p1).intersection(set(p2))
    # Add up the squares of all the differences
    matches = (p1[item] - p2[item] for item in si)
    return sum(a * a for a in matches)
Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."
Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell us what you are getting and what you expect to get.
Edit: based on my comment below, I would use code like this:
def sim_distance(prefs, person1, person2):
    p1, p2 = prefs[person1], prefs[person2]
    s, t = set(p1), set(p2)
    # float() keeps this a true division under Python 2 as well
    return len(s.intersection(t)) / float(len(s.union(t)))
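As a quick check of the function above, here is a minimal sketch with made-up data. The URLs and tag names are hypothetical, and `tag_similarity` is only an illustrative name. It computes the same intersection-over-union ratio (often called the Jaccard index), with an added guard for two posts that have no tags at all, which is an assumption about how that case should be handled.

```python
def tag_similarity(prefs, post1, post2):
    """Shared tags divided by all distinct tags of the two posts."""
    s, t = set(prefs[post1]), set(prefs[post2])
    union = s | t
    if not union:
        # Assumption: two untagged posts are treated as not similar
        return 0.0
    return len(s & t) / float(len(union))


# Hypothetical sample data: each URL maps to its list of tags
prefs = {
    "url1": ["tag1", "tag2"],
    "url2": ["tag1", "tag3"],
    "url3": ["tag4"],
}

print(tag_similarity(prefs, "url1", "url1"))  # identical tag sets
print(tag_similarity(prefs, "url1", "url2"))  # 1 shared tag out of 3 distinct
print(tag_similarity(prefs, "url1", "url3"))  # no shared tags
```

With this measure a post compared with itself always scores 1.0, and posts with no tags in common score 0.0.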
Basically, tags don't have weights and can't be represented by numerical values. So you can't define a distance between two tags.
If you want to find the similarity between two posts using their tags, I would suggest that you use the ratio of shared tags. For example, if you have
url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6
then you have 2 shared tags, giving 2 (shared tags) / 4 (tags per post) = 0.5. I think this would be a good measure of similarity, as long as you have more than 2 tags per post.
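The ratio described above could be sketched as follows. The function name `shared_tag_ratio` is hypothetical, and the example does not say what to divide by when the two posts have different numbers of tags, so dividing by the larger tag count is an assumption made here.

```python
def shared_tag_ratio(tags1, tags2):
    """Number of shared tags divided by the larger of the two tag counts."""
    s, t = set(tags1), set(tags2)
    larger = max(len(s), len(t))
    if larger == 0:
        # Assumption: two untagged posts are treated as not similar
        return 0.0
    return len(s & t) / float(larger)


# The two posts from the example above
url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]

print(shared_tag_ratio(url1, url2))  # 2 shared tags / 4 tags per post = 0.5
```

A post compared with itself gives 1.0 under this definition, which avoids the problem of self-similarity dropping below the similarity to other posts.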