Euclidean distance between posts based on tags

2022-12-13 12:29

I am playing with the Euclidean distance example from the book Programming Collective Intelligence:


# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags. I build a map such as:

url1 -> tag1 tag2
url2 -> tag1 tag3

But if I apply this to the function,

pow(prefs[person1][item]-prefs[person2][item],2)

this becomes 0 cause tags don't have weight same tags has ranking 1. I modified the code to manually create a difference to test,

pow(prefs[1,2)

Then I got a lot of 0.5 similarities, but the similarity of a post to itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation.

Okay, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:

def sim_distance(prefs, person1, person2): 
  # Get the list of shared_items
  p1, p2 = prefs[person1], prefs[person2]
  si = set(p1).intersection(set(p2))

  # Add up the squares of all the differences 
  matches = (p1[item] - p2[item] for item in si)
  return sum(a * a for a in matches)

Next, your post needs a bit of editing for clarity. I don't know what this means: "this becomes 0 cause tags don't have weight same tags has ranking 1."

Lastly, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could tell us what you are getting and what you expect to get.
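As a guess at what such sample data might look like, here is a small sketch. The dictionary layout (each tag mapped to a constant weight of 1) and both helper names are assumptions for illustration, not something taken from the question. It shows that the squared differences over shared tags always sum to 0, and that summing over every tag of either post, with a missing tag counted as 0, does give a non-zero value.

```python
# Hypothetical sample data: each post maps its tags to a constant weight of 1.
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
}

def sum_of_squares(prefs, a, b):
    # Squared differences over the tags both posts share.
    shared = set(prefs[a]) & set(prefs[b])
    return sum((prefs[a][t] - prefs[b][t]) ** 2 for t in shared)

def sum_of_squares_all_tags(prefs, a, b):
    # Same sum over every tag either post has; a missing tag counts as 0.
    tags = set(prefs[a]) | set(prefs[b])
    return sum((prefs[a].get(t, 0) - prefs[b].get(t, 0)) ** 2 for t in tags)

print(sum_of_squares(prefs, "url1", "url2"))           # 0
print(sum_of_squares_all_tags(prefs, "url1", "url2"))  # 2
```

In the second variant each tag that only one post has contributes 1, so the result is simply the number of tags the two posts do not share.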

Edit: based on my comment below, I would use code like this:

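def sim_distance(prefs, person1, person2):
    # Similarity = number of shared tags / number of distinct tags overall
    s, t = set(prefs[person1]), set(prefs[person2])
    union = s.union(t)
    # float() avoids integer division on Python 2; two untagged posts score 0
    return float(len(s.intersection(t))) / len(union) if union else 0.0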
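This ratio of intersection size to union size is commonly called the Jaccard index. As a rough sketch of how it could be used to rank posts, the example below scores every other post against a given one. The names jaccard and top_matches and the sample posts map are made up for illustration.

```python
def jaccard(tags_a, tags_b):
    # Shared tags divided by all distinct tags; 0.0 if neither post has tags.
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def top_matches(posts, url, n=3):
    # Score every other post against `url`, best match first.
    scores = [(jaccard(posts[url], tags), other)
              for other, tags in posts.items() if other != url]
    return sorted(scores, reverse=True)[:n]

# Hypothetical sample data: url -> list of tags.
posts = {
    "url1": ["tag1", "tag2"],
    "url2": ["tag1", "tag3"],
    "url3": ["tag4"],
}

print(top_matches(posts, "url1"))             # url2 first (1 shared of 3 distinct tags), then url3
print(jaccard(posts["url1"], posts["url1"]))  # 1.0
```

A post compared with itself always scores 1.0, and two posts with no tags in common score 0.0.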

Basically, tags don't have weights and can't be represented by numerical values. So you can't define a distance between two tags.

If you want to find the similarity between two posts using their tags, I would suggest that you use the ratio of shared tags. For example, if you have

url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6

then you have 2 shared tags, giving 2 (shared tags) / 4 (tags per post) = 0.5. I think this is a good measure of similarity, as long as you have more than 2 tags per post.
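The arithmetic for this example can be sketched as follows. The answer does not say what to divide by when the two posts have different numbers of tags, so using the larger of the two counts is an assumption here, and overlap_ratio is a made-up name.

```python
def overlap_ratio(tags_a, tags_b):
    # Shared tags divided by the tag count of the larger post (an assumption).
    a, b = set(tags_a), set(tags_b)
    larger = max(len(a), len(b))
    return len(a & b) / float(larger) if larger else 0.0

url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]

print(overlap_ratio(url1, url2))  # 0.5
print(overlap_ratio(url1, url1))  # 1.0
```

With this measure a post is always fully similar to itself, which avoids the 0.3 self-similarity seen in the question.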
