Collaborative Filtering: Non-Personalized item-to-item similarity

2022-12-22 05:20

I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for computing item similarity for ranked items, for finding user-user similarity, or for finding recommended items based on the current user's history. I'd like to start off with a non-targeted approach before factoring in the current user's preferences.

The Amazon.com recommendations white paper uses the following logic for offline item-item similarity:

For each item in product catalog, I1 
  For each customer C who purchased I1
    For each item I2 purchased by customer C
       Record that a customer purchased I1 and I2
  For each item I2 
    Compute the similarity between I1 and I2

If I understand correctly, by the time we're at "Compute the similarity between I1 and I2", I have a list of items (I2) purchased in conjunction with a single item I1 (the outer loop).

How is this calculation performed?

Alternatively, I may be overthinking this and making it more difficult than it needs to be: would it be enough to do a top-n query on the count of I2 bought in conjunction with I1?

I'd also appreciate suggestions on whether or not this approach is a correct one. My product database has about 150k items at any time. Since the bulk of the reading material I've seen shows user-item similarity or even user-user similarity, should I be looking to go that route instead?

I've worked with similarity algorithms in the past but they've always involved a rank or a score. I think the only way this would work would be to build a customer-product matrix scoring 0/1 for not purchased/purchased. Given the purchase history and the item size, this could get really large.

Edit: although I listed python as a tag, I'd prefer to keep the logic inside of a DB, preferably using Oracle PL/SQL.

Let's first understand item-to-item collaborative filtering. Suppose we have a purchase matrix:

        Item1  Item2 ... ItemN
 User1  0        1   ...  0
 User2  1        1   ...  0 
  .
  .
  .
 UserM  1        0   ...  0

Then we can calculate item similarity from the column vectors, e.g. using cosine similarity. This gives a symmetric item similarity matrix like the one below:

        Item1  Item2 ... ItemN
 Item1  1       1/M  ...  0
 Item2  1/M     1    ...  0 
  .
  .
  .
 ItemN  0       0    ...  1

This can be interpreted as "Customers who viewed/purchased X have also viewed/purchased Y, Z, ..." (collaborative filtering), because each item's vector is built from the users' purchases.
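To make the column-vector idea concrete, here is a minimal sketch in plain Python. The 3x3 `purchases` matrix and the helper names are made up for illustration; with 150k items you would not build a dense matrix like this.

```python
import math

# Hypothetical toy purchase matrix: rows are users, columns are items,
# 1 = purchased, 0 = not purchased (values invented for illustration).
purchases = [
    [0, 1, 0],  # User1
    [1, 1, 0],  # User2
    [1, 0, 1],  # User3
]


def column(matrix, j):
    """Return column j of the matrix, i.e. the vector for item j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


n_items = len(purchases[0])
# Symmetric item-by-item similarity matrix built from the item columns.
similarity = [
    [cosine(column(purchases, i), column(purchases, j)) for j in range(n_items)]
    for i in range(n_items)
]
```

For 0/1 data the dot product is simply the number of users who bought both items, and each norm is the square root of the number of buyers of that item.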

Amazon's logic is exactly the same as the above; its goal is to improve efficiency. As they put it:

We could build a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair. However, many product pairs have no common customers, and thus the approach is inefficient in terms of processing time and memory usage. The iterative algorithm provides a better approach by calculating the similarity between a single product and all related products.
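A rough sketch of that iterative loop in Python, assuming the purchase history fits in memory as a dict from customer to a set of items. The function name `item_similarities`, the sample data, and the choice of cosine on 0/1 purchases are my assumptions, not code from the paper.

```python
import math
from collections import defaultdict


def item_similarities(purchases_by_customer):
    """For each item, score only the items that share at least one buyer with it."""
    # Inverted index: item -> set of customers who purchased it.
    customers_by_item = defaultdict(set)
    for customer, items in purchases_by_customer.items():
        for item in items:
            customers_by_item[item].add(customer)

    similar = {}
    for i1, buyers in customers_by_item.items():
        # "Record that a customer purchased I1 and I2"
        co_counts = defaultdict(int)
        for customer in buyers:
            for i2 in purchases_by_customer[customer]:
                if i2 != i1:
                    co_counts[i2] += 1
        # Cosine on 0/1 vectors: co-purchases / sqrt(buyers(I1) * buyers(I2)).
        similar[i1] = {
            i2: n / math.sqrt(len(buyers) * len(customers_by_item[i2]))
            for i2, n in co_counts.items()
        }
    return similar


# Hypothetical purchase history, invented for illustration.
purchases_by_customer = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "mug"},
    "carol": {"book"},
    "dave": {"mug"},
}
sims = item_similarities(purchases_by_customer)
# Items most similar to "book", best first.
top = sorted(sims["book"].items(), key=lambda kv: kv[1], reverse=True)
```

Pairs with no common customer never show up in `co_counts`, which is where the savings over the full item-by-item matrix come from.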

There's a good O'Reilly book on this topic. While the whitepaper might lay the logic out in pseudo-code like that, I don't think that approach would scale very well. The calculations are all probability calculations, so things like Bayes' Theorem get used to say, "Given Person A purchased X, what's the likelihood they purchased Z?" Straightforward looping over the data is working too hard. You have to go through it all for each person.
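As a small illustration of the "given X, how likely is Z" question, the simplest estimate is the fraction of X buyers who also bought Z. This is only the plain empirical conditional probability on made-up baskets, with a hypothetical helper name; it is not necessarily the method the book describes.

```python
def conditional_purchase_probability(baskets, given, target):
    """Estimate P(target purchased | given purchased) from per-customer baskets."""
    bought_given = [basket for basket in baskets if given in basket]
    if not bought_given:
        return 0.0  # No evidence at all for this item.
    both = sum(1 for basket in bought_given if target in basket)
    return both / len(bought_given)


# Hypothetical per-customer purchase sets, invented for illustration.
baskets = [
    {"X", "Z"},
    {"X"},
    {"X", "Y", "Z"},
    {"Z"},
    {"Z"},
]
p_z_given_x = conditional_purchase_probability(baskets, "X", "Z")  # 2 of 3 X buyers
p_x_given_z = conditional_purchase_probability(baskets, "Z", "X")  # 2 of 4 Z buyers
```

Unlike cosine, this score is not symmetric: P(Z|X) and P(X|Z) generally differ.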

@Neil or whoever comes to this question later on:

The choice of similarity metric is up to you, and you might want to keep it swappable for the future. Check out the Wikipedia article on the Frobenius norm for a start. Or, as in the link you submitted, use the cosine measure cos(I1, I2); the Jaccard coefficient is another option for binary purchase data.
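For reference, the Jaccard coefficient on purchase data can be sketched as below, treating each item as the set of customers who bought it. The customer ids and the function name are hypothetical.

```python
def jaccard(buyers_a, buyers_b):
    """Jaccard coefficient: shared buyers divided by buyers of either item."""
    union = buyers_a | buyers_b
    if not union:
        return 0.0  # Neither item has been purchased.
    return len(buyers_a & buyers_b) / len(union)


# Hypothetical customer ids per item, invented for illustration.
buyers_i1 = {1, 2, 3}
buyers_i2 = {2, 3, 4}
score = jaccard(buyers_i1, buyers_i2)  # 2 shared buyers out of 4 distinct buyers
```

Keeping the metric behind one small function like this makes it easy to swap for cosine or something else later.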

Whether to use user-item, user-user, item-item, or some combination cannot be answered objectively. It depends on what kind of data you can get from your users, how the UI draws information out of them, what parts of your data you consider reliable, and your own time constraints (as far as hybrids go).

Since many people have written master's theses on the questions above, you probably want to start with the easiest solution to implement while leaving room for the algorithm to grow in complexity.

This may not be a perfect answer to your question, but another way to look at this problem is Frequent Itemset Mining, which computes all the frequently co-purchased product pairs/groups given a minimum frequency threshold. You can then map a customer's purchase to its commonly co-purchased products.

There is no model training or Bayesian probability prediction involved, because it's a pure counting problem: you just need to count the frequency of all product pairs purchased together in your transaction database. For itemsets of arbitrary size the search space is exponential, but there are a lot of efficient algorithms and implementations out there to use (SPMF is a very good one written in Java). This could work as a quick baseline model.
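For pairs only, the counting can be sketched in a few lines of Python. This is a brute-force count with a minimum-support filter, not one of the optimized mining algorithms and not SPMF's API; the transactions and the function name are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count co-purchased item pairs and keep those seen at least min_support times."""
    counts = Counter()
    for items in transactions:
        # Sort the de-duplicated items so each pair has one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transactions, invented for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "eggs", "milk"],
]
pairs = frequent_pairs(transactions, min_support=3)
```

Here ("bread", "eggs") appears in only two transactions, so it is dropped at `min_support=3`, while the other two pairs are kept.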

Tags: algorithm collaborative-filtering python recommendation-engine similarity
