Translating a score into a probabilty
People visit my website, and I have an algorithm that produces a score between 1 and 0. The higher the score, the greater the probability that this person will buy something, but the score isn't a probability, and it may not be a linear relationship with the purchase probability.
I have a bunch of data about what scores I gave people in the past, and whether or not those people actually make a purchase.
Using this data about what happened with scores in the past, I want to be able to take a score and translate it into the corresponding probability based on this past data.
Any ideas?
edit: A few people are suggesting bucketing, and I should have mentioned that I had considered this approach, but I'm sure there must be a way to do it "smoothly". A while ago I asked a question about a different but possibly related problem here, I have a feeling that something similar may be applicable but I'm not sure.
edit2: Let's say I told you that of the 100 customers with a score above 0.5, 12 of them purchased, and of the 25 customers with a score below 0.5, 开发者_运维知识库2 of them purchased. What can I conclude, if anything, about the estimated purchase probability of someone with a score of 0.5?
Draw a chart - plot the ratio of buyers to non buyers on the Y axis and the score on the X axis - fit a curve - then for a given score you can get the probability by the hieght of the curve.
(you don't need to phyically create a chart - but the algorithm should be evident from the exercise)
Simples.
That is what logistic regression, probit regression, and company were invented for. Nowdays most people would use logistic regression, but fitting involves iterative algorithms - there are, of course, lots of implementations, but you might not want to write one yourself. Probit regression has an approximate explicit solution described at the link that might be good enough for your purposes.
A possible way to assess whether logistic regression would work for your data, would be to look at a plot of each score versus the logit of the probability of purchase (log(p/(1-p)), and see whether these form a straight line.
I eventually found exactly what I was looking for, an algorithm called “pair-adjacent violators”. I initially found it in this paper, however be warned that there is a flaw in their description of the implementation.
I describe the algorithm, this flaw, and the solution to it on my blog.
Well, the straightforward way to do this would be to calculate which percentage of people in a score interval purchased something and do this for all intervals (say, every .05 points).
Have you noticed an actual correlation between a higher score and an increased likelihood of purchases in your data?
I'm not an expert in statistics and there might be a better answer though.
You could divide the scores into a number of buckets, e.g. 0.0-0.1, 0.1-0.2,... and count the number of customers who purchased and did not purchase something for each bucket.
Alternatively, you may want to plot each score against the amount spent (as a scattergram) and see if there is any obvious relationship.
You could use exponential decay to produce a weighted average.
Take your users, arrange them in order of scores (break ties randomly).
Working from left to right, start with a running average of 0. Each user you get, change the average to average = (1-p) * average + p * (sale ? 1 : 0)
. Do the same thing from the right to the left, except start with 1.
The smaller you make p
, the smoother your curve will become. Play around with your data until you have a value of p
that gives you results that you like.
Incidentally this is the key idea behind how load averages get calculated by Unix systems.
Based upon your edit2 comment you would not have enough data to make a statement. Your overall purchase rate is 11.2% That is not statistically different from your 2 purchase rates which are above/below .5 Additionally to validate your score, you would have to insure that the purchase percentages were monotonically increasing as your score increased. You could bucket but you would need to check your results against a probability calculator to make sure they did not occur by chance.
http://stattrek.com/Tables/Binomial.aspx
精彩评论