How to increment a counter in Cassandra?
I'd like to use Cassandra to store a counter, for example how many times a given page has been viewed. The counter will never decrement. The value of the counter does not need to be exact, but it should be accurate over time.
My first thought was to store the value as a column and just read the current count, increment it by one, and then put it back. However, if another operation is also trying to increment the counter, I think the final value would just be the one with the latest timestamp.
Another thought would be to store each page load as a new column in a CF. Then I could just run get_count() on that key and get the number of columns. Reading through the documentation, it appears that it is not a very efficient operation at all.
Am I approaching the problem incorrectly?
Counters were added in Cassandra 0.8.
Use the incr command to increment the value of a counter column by 1:
[default@app] incr counterCF [ascii('a')][ascii('x')];
Value incremented.
[default@app] incr counterCF [ascii('a')][ascii('x')];
Value incremented.
Described here: http://www.jointhegrid.com/highperfcassandra/?p=79
Or it can be done programmatically:
// Assumes c is a connected Thrift Cassandra.Client, bucketByMinute and bucketByDay
// are SimpleDateFormat instances, and r carries the request's date and URL.
CounterColumn counter = new CounterColumn();
counter.setName(ByteBufferUtil.bytes(bucketByMinute.format(r.date)));
counter.setValue(1);
ColumnParent cp = new ColumnParent("page_counts_by_minute");
c.add(ByteBufferUtil.bytes(bucketByDay.format(r.date) + "-" + r.url),
      cp, counter, ConsistencyLevel.ONE);
Described here: http://www.jointhegrid.com/highperfcassandra/?cat=7
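For completeness, reading the counter back over Thrift would look roughly like the sketch below. This is not from the linked post; it just reuses the connected client c, the SimpleDateFormat buckets, and the request object r assumed above.
ColumnPath path = new ColumnPath("page_counts_by_minute");
path.setColumn(ByteBufferUtil.bytes(bucketByMinute.format(r.date)));
ColumnOrSuperColumn result = c.get(
        ByteBufferUtil.bytes(bucketByDay.format(r.date) + "-" + r.url),
        path, ConsistencyLevel.ONE);
long views = result.getCounter_column().getValue();  // counter value as a long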
[Update] Looks like counter support will be ready for prime time in 0.8!
I definitely wouldn't use get_count, as that is an O(n) operation which is run every time you read the "counter." Worse than it being just O(n), it may span multiple nodes, which would introduce network latency. And finally, why tie up all that disk space when all you care about is a single number?
For right now, I wouldn't use Cassandra for counters at all. They are working on this functionality, but it's not ready for prime time yet.
https://issues.apache.org/jira/browse/CASSANDRA-1072
You've got a few options in the meantime.
1) (Bad) Store your count in a single record and have one and only one thread of your application be responsible for counter management.
2) (Better) Split the counter into n shards, and have n threads manage each shard as a separate counter. You can randomize which thread is used by your app each time for stateless load balancing across these threads. Just make sure that each thread is responsible for exactly one shard (see the sketch after this list).
3a) (Best) Use a separate tool that is either transactional (aka an RDBMS) or that supports atomic increment operations (memcached, redis).
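To make option 2 concrete, here is a rough sketch of a sharded counter over the Thrift API. The class name, the "page_counts" column family, and the shard column naming are all made up for illustration; the important property is that each shard has exactly one writer, so its read-increment-write never races with another writer.
import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.cassandra.utils.ByteBufferUtil;

public class ShardedPageCounter {
    private final Cassandra.Client client;
    public ShardedPageCounter(Cassandra.Client client) { this.client = client; }

    // Thread i calls this with its own shard index; no other writer touches that column.
    public void increment(String page, int shard) throws Exception {
        long current = read(page, shard);
        Column col = new Column();
        col.setName(ByteBufferUtil.bytes("views_shard_" + shard));
        col.setValue(ByteBufferUtil.bytes(current + 1));
        col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds by convention
        client.insert(ByteBufferUtil.bytes(page), new ColumnParent("page_counts"),
                      col, ConsistencyLevel.QUORUM);
    }

    // Read a single shard, treating a missing column as zero.
    public long read(String page, int shard) throws Exception {
        ColumnPath path = new ColumnPath("page_counts");
        path.setColumn(ByteBufferUtil.bytes("views_shard_" + shard));
        try {
            ColumnOrSuperColumn cosc = client.get(ByteBufferUtil.bytes(page), path,
                                                  ConsistencyLevel.QUORUM);
            return ByteBufferUtil.toLong(ByteBuffer.wrap(cosc.getColumn().getValue()));
        } catch (NotFoundException e) {
            return 0L;
        }
    }

    // The total is just the sum over all shards.
    public long total(String page, int numShards) throws Exception {
        long sum = 0;
        for (int i = 0; i < numShards; i++) sum += read(page, i);
        return sum;
    }
}
The trade-off is that a read of the total touches numShards columns, but writes never contend with each other.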
[Update.2] I would avoid using a distributed lock (see memcached and zookeeper mutexes), as this is very intolerant of node failure or network partitioning if improperly implemented.
What I ended up doing was using get_count() and caching the result in a caching ColumnFamily.
This way I could get a general guess at the count but still get the exact count whenever I wanted.
Additionally, I was able to adjust how stale a count I was willing to accept on a per-request basis.
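A rough sketch of that pattern over the Thrift API, assuming a column family page_views holding one column per view and a made-up cache column family view_count_cache (names are illustrative, not from the answer):
import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.cassandra.utils.ByteBufferUtil;

class CachedPageViewCounter {
    private final Cassandra.Client client;
    CachedPageViewCounter(Cassandra.Client client) { this.client = client; }

    // Exact count: O(n) in the number of view columns, so call it sparingly.
    long exactCount(String page) throws Exception {
        SlicePredicate all = new SlicePredicate();
        all.setSlice_range(new SliceRange(ByteBufferUtil.bytes(""), ByteBufferUtil.bytes(""),
                                          false, Integer.MAX_VALUE));
        return client.get_count(ByteBufferUtil.bytes(page),
                                new ColumnParent("page_views"), all, ConsistencyLevel.ONE);
    }

    // Cached count: accept a result up to maxStaleMillis old, otherwise recount and re-cache.
    long cachedCount(String page, long maxStaleMillis) throws Exception {
        ColumnPath path = new ColumnPath("view_count_cache");
        path.setColumn(ByteBufferUtil.bytes("count"));
        try {
            ColumnOrSuperColumn cosc = client.get(ByteBufferUtil.bytes(page), path,
                                                  ConsistencyLevel.ONE);
            Column cached = cosc.getColumn();
            long ageMillis = System.currentTimeMillis() - cached.getTimestamp() / 1000;
            if (ageMillis <= maxStaleMillis)
                return ByteBufferUtil.toLong(ByteBuffer.wrap(cached.getValue()));
        } catch (NotFoundException ignored) { /* nothing cached yet */ }

        long fresh = exactCount(page);
        Column col = new Column();
        col.setName(ByteBufferUtil.bytes("count"));
        col.setValue(ByteBufferUtil.bytes(fresh));
        col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds by convention
        client.insert(ByteBufferUtil.bytes(page), new ColumnParent("view_count_cache"),
                      col, ConsistencyLevel.ONE);
        return fresh;
    }
}
The maxStaleMillis parameter is how the per-request staleness tolerance mentioned above would be expressed.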
We are going to address a similar problem by keeping the current value of a counter in a distributed cache (for example - memcached). When the counter is updated, we will store its value in Cassandra. Therefore even if some cache node fails, we will be able to get the value from the database.
This solution is not perfect. However, data such as a visit counter is not very sensitive, so minor inconsistencies are acceptable in my opinion.
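As a sketch of that design, using spymemcached for the cache side and the Thrift API for Cassandra; the key prefix and the "page_view_counts" column family are invented for illustration:
import net.spy.memcached.MemcachedClient;
import org.apache.cassandra.thrift.*;
import org.apache.cassandra.utils.ByteBufferUtil;

class WriteThroughCounter {
    private final MemcachedClient cache;
    private final Cassandra.Client cassandra;
    WriteThroughCounter(MemcachedClient cache, Cassandra.Client cassandra) {
        this.cache = cache;
        this.cassandra = cassandra;
    }

    long increment(String page) throws Exception {
        // Atomic increment in memcached; the third argument seeds the key with 1 if missing.
        long value = cache.incr("views:" + page, 1, 1L);

        // Write the latest value through to Cassandra so it survives a cache node failure.
        Column col = new Column();
        col.setName(ByteBufferUtil.bytes("views"));
        col.setValue(ByteBufferUtil.bytes(value));
        col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds by convention
        cassandra.insert(ByteBufferUtil.bytes(page),
                         new ColumnParent("page_view_counts"), col, ConsistencyLevel.ONE);
        return value;
    }
}
If memcached loses the key, it can be re-seeded from the Cassandra column before accepting new increments, which is what makes the minor inconsistency tolerable.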
Interestingly enough, I do not see anyone mentioning the possibility of counting on a per-app-computer basis. Say your app runs on 5 machines named a1, a2, ... a5. Then you can have a lock on a per-machine basis (i.e. a file you open with O_EXCL, or a lock used to wait for other instances to be done with the counter) and add either one row per machine or one column, depending on your implementation. Something like
machine_lock();
this_column_family[machine-name][my-counter] += 1;
machine_unlock();
That way, you get one counter per machine. When you need the total, you just read a1, a2, ... a5 and sum them.
total = 0;
foreach(machines as m) {
total += this_column_family[m][my-counter];
}
(this is pseudo code that would more or less work with libQtCassandra.)
This way you avoid a lock that locks all the nodes and yet you still get safe/consistent counting (obviously the read + sum is not perfect and it only gives you an approximation of the total, but it still remains consistent.)
I'm not too sure whether what Ben Burns pointed out in regard to having n shards and n threads would be the same thing, but it doesn't sound exactly like it to me.
And since 0.8.x, you can use Cassandra counters, which are certainly a lot easier to use, although they may not always fit your needs.