A Digg-like rotating homepage of popular content, how to include date as a factor?
I am building an advanced image sharing web application. As you may expect, users can upload images and others can comments on it, vote on it, and favorite it. These events will determine the popularity of the image, which I capture in a "karma" field.
Now I want to create a Digg-like homepage system, showing the most popular images. It's开发者_运维百科 easy, since I already have the weighted Karma score. I just sort on that descendingly to show the 20 most valued images.
The part that is missing is time. I do not want extremely popular images to always be on the homepage. I guess an easy solution is to restrict the result set to the last 24 hours. However, I'm also thinking that in order to keep the image rotation occur throughout the day, time can be some kind of variable where its offset has an influence on the image's sorting.
Specific questions:
- Would you recommend the easy scenario (just sort for best images within 24 hours) or the more sophisticated one (use datetime offset as part of the sorting)? If you advise the latter, any help on the mathematical solution to this?
- Would it be best to run a scheduled service to mark images for the homepage, or would you advise a direct query (I'm using MySQL)
- As an extra note, the homepage should support paging and on a quiet day should include entries of days before in order to make sure it is always "filled"
I'm not asking the community to build this algorithm, just looking for some advise :)
I would go with a function that decreases the "effective karma" of each item after a given amount of time elapses. This is a bit like Eric's method.
Determine how often you want the "effective karma" to be decreased. Then multiply the karma by a scaling factor based on this period.
effective karma = karma * (1 - percentage_decrease)
where percentage_decrease
is determined by yourfunction. For instance, you could do
percentage_decrease = min(1, number_of_hours_since_posting / 24)
to make it so the effective karma of each item decreases to 0 over 24 hours. Then use the effective karma to determine what images to show. This is a bit more of a stable solution than just subtracting the time since posting, as it scales the karma between 0 and its actual value. The min is to keep the scaling at a 0 lower bound, as once a day passes, you'll start getting values greater than 1.
However, this doesn't take into account popularity in the strict sense. Tim's answer gives some ideas into how to take strict popularity (i.e. page views) into account.
For your first question, I would go with the slightly more complicated method. You will want some "All time favorites" in the mix. But don't go by time alone, go by the number of actual views the image has. Keep in mind that not everyone is going to login and vote, but that doesn't make the image any less popular. An image that is two years old with 10 votes and 100k views is obviously more important to people than an image that is 1 year old with 100 votes and 1k views.
For your second question, yes, you want some kind of caching going on in your front page. That's a lot of queries to produce the entry point into your site. However, much like SO, your type of site will tend to draw traffic to inner pages through search engines .. so try and watch / optimize your queries everywhere.
For your third question, going by factors other than time (i.e. # of views) helps to make sure you always have a full and dynamic page. I'm not sure about paginating on the front page, leading people to tags or searches might be a better strategy.
You could just calculate an "adjusted karma" type field that would take the time into account:
adjusted karma = karma - number of hours/days since posted
You could then calculate and sort by that directly in your query, or you could make it an actual field in the database that you update via a nightly process or something. Personally I would go with a nightly process that updates it since that will probably make it easier to make the algorithm a bit more sophisticated in the future.
This, i've found it, the Lower bound of Wilson score confidence interval for a Bernoulli parameter
Look at this: http://www.derivante.com/2009/09/01/php-content-rating-confidence/
At the second example he explains how to use time as a "freshness factor".
精彩评论