
Clever way to estimate URL clicks per hour without logging every click?

I have a site with millions of URLs. Each time a URL is clicked, a database row corresponding to that URL is updated with the timestamp of that click. I would like to estimate the number of clicks per hour each URL receives, using additional columns if necessary, but without inserting a distinct row for every click. Some ideas include storing a handful of timestamps aligned to the most recent second, minute, 15-minute and hour boundaries (though I'm fuzzy on how that idea actually gets what I want), or the nastier solution of serializing a "log" of time deltas into a single row.

A naive approach would be to measure the time between the current click and the previous one to derive a rate, but that would only produce a useful estimate if the link is clicked at a very consistent rate. In reality the link could receive a flurry of clicks in one minute and nothing at all for another 20.

The reason I don't want to log each click distinctly is simply so that the database isn't weighed down with thousands of additional INSERT statements per hour (and the corresponding DELETEs of data more than an hour old), and so that I don't have to fire up an additional storage system (Tokyo Tyrant, grepping Apache logs, etc.) just to log these clicks.


How about storing a counter in memcached, keyed by the URL, and a last_counter_reset_time in the DB?

Memcached has a lightweight atomic incr operation. Call that on each request. Periodically reset the counter, updating the last_counter_reset_time.

I'm no memcached veteran, but I imagine there are ways of being fairly sure that the counters for all your URLs stay cached. There's no persistence so you may lose the counter at any time, but occasional data loss of that kind might be acceptable.
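
A minimal sketch of that idea, assuming the pymemcache client and a hypothetical last_counter_reset_time column that you read and write yourself:

```python
# Sketch of the memcached counter + DB reset-time approach described above.
# Assumes pymemcache; last_counter_reset_time lives in your URL table.
import time
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def record_click(url_id):
    key = f"clicks:{url_id}"
    mc.add(key, "0")           # incr fails on a missing key, so create it (no-op if present)
    mc.incr(key, 1)            # lightweight atomic increment per request

def estimate_clicks_per_hour(url_id, last_reset_ts):
    """last_reset_ts is the value of the last_counter_reset_time column."""
    raw = mc.get(f"clicks:{url_id}")
    count = int(raw) if raw is not None else 0     # counter may have been evicted
    elapsed = max(time.time() - last_reset_ts, 1.0)
    return count * 3600.0 / elapsed                # scale observed window to one hour

def reset_counter(url_id):
    mc.set(f"clicks:{url_id}", "0")
    return time.time()         # write this back to last_counter_reset_time
```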


Have you tried another approach, like an external stats service, maybe Google Analytics? It could give you the info you're looking for without any extra load on your servers.


Is there any reason why you've disregarded processing the Apache access logs? They have the benefit of being timestamped and created automatically by the server, and they're fairly lightweight. A fairly simple perl or awk script can then keep a running summary of the logs for simple parsing.
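
For example, a small script in that spirit (shown in Python rather than perl/awk), assuming the Apache common log format, that tallies hits per URL per hour:

```python
# Tally hits per (URL, hour) from an Apache common-format access log.
import re
from collections import Counter

# Captures the "dd/Mon/yyyy:hh" part of the timestamp and the request path.
LOG_LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2})[^\]]*\] "(?:GET|POST|HEAD) (\S+)')

def hourly_counts(path):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            m = LOG_LINE.search(line)
            if m:
                hour, url = m.groups()
                counts[(url, hour)] += 1
    return counts

# e.g. hourly_counts("/var/log/apache2/access.log")[("/some/page", "04/Oct/2010:13")]
```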


First of all, why keep timestamps at all? You could keep exact counts by having one record in the database for each URL, and just incrementing a count each time it is clicked.

If even that is too much load, I would think the next most obvious answer is statistical sampling. Pick a time slice, say ten minutes. For each ten minute slice, pick one URL. Count the number of clicks for that URL. Assume the rate for that ten minutes is consistent and multiply by a constant to get the estimated rate for whatever desired time period. Then for the next ten minute slice pick a different URL. Etc.

Realistically you could probably count more than one URL at a time without overburdening the server, so you could pick some convenient number of URLs, ten or a hundred or whatever your system can handle.

You would also want to consider time of day. If most of your users are in, say, California, then a URL that is sampled at 4:00 pm Pacific time would likely get a much higher number of hits than if it were sampled at 4:00 am. So you'd want to cycle through URLs in a way that ensures that when you come back to a given URL it is at a different time of day than when you first sampled it. If your users were spread evenly over the entire world this wouldn't be an issue, but that sounds unlikely.
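
A rough sketch of that rotation, where count_clicks_during() and all_url_ids() are hypothetical hooks into your own click path and database:

```python
# Rotate through batches of URLs, count clicks for one ten-minute slice,
# and scale each observed count up to an hourly estimate.
import itertools
import random

SLICE_SECONDS = 10 * 60     # ten-minute sampling window
URLS_PER_SLICE = 100        # however many your system can handle at once

def sample_forever(all_url_ids, count_clicks_during):
    ids = list(all_url_ids())
    random.shuffle(ids)     # helps vary the time of day at which each URL is sampled
    for start in itertools.cycle(range(0, len(ids), URLS_PER_SLICE)):
        batch = ids[start:start + URLS_PER_SLICE]
        # count_clicks_during() is assumed to block for the slice and
        # return {url_id: clicks observed in that window}.
        observed = count_clicks_during(batch, SLICE_SECONDS)
        for url_id, clicks in observed.items():
            est_per_hour = clicks * (3600 / SLICE_SECONDS)   # 10 min -> 1 hour
            print(url_id, est_per_hour)
```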


This may not be a practical solution, but since you asked for a "clever" way, here is some academic research on a question that is not exactly your problem but could probably be adapted. Some papers in the "Cited by" list might be even closer.


If you want exact counts, Redis is ideal for this task. It's roughly comparable in speed to memcached, but offers persistence. The persistence is based on forking and writing sequentially to disk, so it avoids the high I/O load of keeping this sort of information in your database.
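
For example, a minimal sketch with the redis-py client (the key names are my own invention):

```python
# Exact per-URL click counters in Redis; INCR is atomic and creates the key at 0.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_click(url_id):
    r.incr(f"clicks:{url_id}")

def click_count(url_id):
    value = r.get(f"clicks:{url_id}")
    return int(value) if value else 0
```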

If you want a very simple approach: just discard samples in a non-biased way (i.e. log_request(foo) if rand(1) < 0.1 to sample 10% of traffic). You'll lose any signal on URLs accessed less often than the ratio you're subsampling by, but if you're mostly interested in the highly accessed URLs this can be very simple and efficient.
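
A literal sketch of that (log_request stands in for whatever your existing logging call is):

```python
# Unbiased subsampling: log only a fixed fraction of requests.
import random

SAMPLE_RATE = 0.1                      # log roughly 10% of traffic

def maybe_log(url, log_request):
    if random.random() < SAMPLE_RATE:  # independent coin flip per request
        log_request(url)

# When reading counts back, divide by SAMPLE_RATE to estimate the true totals.
```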

There are more complicated variations on the above scheme where you update a counter with a probability that lessens as the count grows (and then weight counters appropriately via the probability function when reading them), which is a sort of bastardized form of importance sampling. These are almost as simple and preserve counts on the tail of the distribution better.
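
For illustration, here is one well-known shape this can take, a Morris-style approximate counter (my example, not necessarily the exact scheme meant above): the stored value is bumped with probability 2^-c, and reading it back inverts that weighting.

```python
# Morris-style approximate counting: store a small value c, increment it
# with probability 2**-c, and estimate the true count as 2**c - 1.
import random

def maybe_increment(c):
    """c is the stored counter value; returns the (possibly bumped) value."""
    if random.random() < 2.0 ** -c:
        return c + 1
    return c

def estimate(c):
    """Unbiased estimate of the true number of events given the stored value c."""
    return 2 ** c - 1
```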

Edit:

Ah, sorry, I see now from the comments that you're after rates for some period of time. The approach I've used for this is basically the same as the sampling/counter idea, just store individual counters for each time bracket (i.e. hourly). For keeping long-term archives, have additional summary tables for larger time periods (daily, weekly) that a batch job populates from the fine-grained (hourly) table, letting you delete old data from the fine-grained table.
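
A sketch of the hourly-bucket variant, reusing the Redis counters from above (the key naming is made up; a per-hour column or row in your existing table would work the same way):

```python
# One counter key per (URL, hour), plus a simple daily roll-up a batch job can run.
import time
import redis

r = redis.Redis()

def record_click(url_id, now=None):
    hour = time.strftime("%Y%m%d%H", time.gmtime(now or time.time()))
    key = f"clicks:{url_id}:{hour}"
    r.incr(key)
    r.expire(key, 8 * 24 * 3600)        # keep fine-grained hourly data for about a week

def roll_up_day(url_id, day):           # day like "20240101"
    hourly = [int(r.get(f"clicks:{url_id}:{day}{h:02d}") or 0) for h in range(24)]
    r.set(f"clicks:daily:{url_id}:{day}", sum(hourly))
```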

RRDTool is a more general implementation of this idea, and several OSS monitoring solutions use it.
