Converting python collaborative filtering code to use Map Reduce

2022-12-31 07:15 问答作者：

Using Python, I'm computing cosine similarity across items.

given event data that represents a purchase (user,item), I have a list of all items 'bought' by my users.

Given this input data

(user,item)
X,1
X,2
Y,1
Y,2
Z,2
Z,3

I build a python dictionary

{1: ['X','Y'], 2 : ['X','Y','Z'], 3 : ['Z']}

From that dictionary, I generate a bought/not bought matrix, also another dictionary(bnb).

{1 : [1,1,0], 2 : [1,1,1], 3 : [0,0,1]}

From there, I'm computing similarity between (1,2) by calculating cosine between (1,1,0) and (1,1,1), yielding 0.816496

I'm doing this by:

items=[1,2,3]
for item in items:
  for sub in items:
    if sub >= item:    #as to not calculate similarity on the inverse
      sim = coSim( bnb[item], bnb[sub] )

I think the brute force approach is killing me and it only runs slower as the data gets larger. Using my trusty laptop, this calculation runs for hours when dealing with 8500 users and 3500 items.

I'm trying to compute similarity for all items in my dict and it's taking longer than I'd like it to. I think this is a good candidate for MapReduce but I'm having trouble 'thinking' in terms of key/value pairs.

Alternatively, is the issue with my approach and not necessa开发者_StackOverflow中文版rily a candidate for Map Reduce?

This is not actually a "MapReduce" function but it should give you some significant speedup without all of the hassle.

I would actually use numpy to "vectorize" the operation and make your life easier. From this you'll just need to loop through this dictionary and apply the vectorized function comparing this item against all others.

import numpy as np
bnb_items = bnb.values()
for num in xrange(len(bnb_items)-1):
    sims = cosSim(bnb_items[num], bnb_items[num+1:]

def cosSim(User, OUsers):
""" Determinnes the cosine-similarity between 1 user and all others.
Returns an array the size of OUsers with the similarity measures

User is a single array of the items purchased by a user.
OUsers is a LIST of arrays purchased by other users.

"""

    multidot = np.vectorize(np.vdot)
    multidenom = np.vectorize(lambda x: np.sum(x)*np.sum(User))

    #apply the dot-product between this user and all others
    num = multidot(OUsers, User)

    #apply the magnitude multiplication across this user and all others
    denom = multidenom(OUsers)

    return num/denom

I haven't tested this code so there may be some silly errors but the idea should get you 90% of the way.

This should have a SIGNIFICANT speedup. If you still need a speed up there is a wonderful blog post which implements a "Slope One" recommendation system here.

Hope that helps, Will

继续阅读：collaborative-filtering optimization python similarity

Converting python collaborative filtering code to use Map Reduce

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？