"People who watched this also watched" algorithm
I am trying to code an algor开发者_Python百科ithm that acts a bit like Amazon's "People who bought this also bought".
The difference between the two is that mine is just counting the "products" you watched in a single session, while Amazon is counting every purchase/checkout.
I have a bit of difficulty in implementing and figuring out what the algo should be.
- So far, I am counting by SessionID the productID that was watched.
- By the end of a day, I have many ProductIDs watched by many SessionIDs.
- Now, I need to create some sort of cliques in the DB. That is, going one-by-one on the SessionsIDs and extracting all the products they've viewed. then, writing it as a clique in a DB table.
- Once I have cliques, and a product is being viewed, I am scanning this table to look in which clique it is in, and then extracting all the rest of the productIDs.
Do you have any reference/idea if my algorithm is correct? Is there a better one?
I was able to achieve your desired result using a simple DB structure, and a pretty simple query:
Table
TABLE `exa`
| sesh_id | prod_id |
---------------------
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 3 |
| 3 | 4 |
| 4 | 1 |
| 4 | 2 |
| 4 | 5 |
Query
SELECT c.prod_id, COUNT(*)
FROM `exa` a
JOIN `exa` b ON a.prod_id=b.prod_id
JOIN `exa` c ON b.sesh_id=c.sesh_id
WHERE a.`prod_id`=3 AND c.prod_id!=3
GROUP BY c.prod_id
ORDER BY 2 DESC;
Result
| prod_id | COUNT |
| 4 | 9 |
| 2 | 6 |
| 1 | 3 |
The idea is that every time a session views a product, it gets inserted into the table [in this case exa
]
Then, on any particular product view, you can check and see what other products people who viewed this product also viewed, weighted by frequency. So in this particular example, EVERYONE that viewed product #3 viewed product #4, so it comes up first in the sort. Product #5 was only viewed by session #4, and session #4 didn't view product #3, so product #5 doesn't come up in the results.
The NetFlix prize was won with a solution based on SVD. Implementing a covariance matrix in a database table is a challenge. Implementing SVD in a database might be a research problem. But most people would call it insane.
I would make one improvement to your idea. When you're figuring out the cliques that go together and deciding which make the strongest relationships you should add a weight against each connection. The easiest way to calculate the weight would be to see how many people who looked at product X also looked at Y. The more views, the stronger the relationship.
You are not going to need any (any!) +1
s.
What you do need is a history. In case of "customers who bought X also bought Y" it's a history of purchases. In case of "customers who saw X also were interested in Y" it's a history of who saw what.
As soon as you have a history, you're ready for making sense of your data. Here, you'll need to adjust your mind to the nearest neighbour problem. What you need is the nearest neighbours for X. The coordinates are users, the values are 0 and 1 depending on whether a user saw/bought an item.
The easiest way to calculate the distance is just sum of squared deltas; it can be easily calculated once an hour for example (as soon as you have plenty of views, distances will stop changing too often), and after that you'll always have a table of distances handy.
The examples for this approach can be found (in Python) in Programming Collective Intelligence published by O'Reilly
Ok. I think i figured it out. part of the work is code implementation.
What i did was to group by the sessionID, productID.
then, in my code I iterate thought each sessionID, and I make a dictionary with pairs.
for instance, if I have pid 10 and 20 and 30, which are a clique basically.
so I insert to the dictionary in the following way:
1. 10-20, weight 1
2. 20-10, weight 1
3. 10-30, weight 1
4. 30-10, weight 1
5. 20-30, eight 1.
6. 30-20, weight 1.
in case I encounter one of the values again, i will add +1 to the appropriate pair/s.
at the end, i will have the weights and pairs aligned.
all I have to do now is by a given productID, to scan the table and find the clique it is inside.
if you have improvements suggestions, please let me know!
thanks all!
精彩评论