Row count of a column family in Cassandra

2022-12-14 14:46 问答作者：

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.

For instance, if I have a column family conta开发者_开发百科ining users and wanted to get the number of users. How could I do it? Each user is it's own row.

If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:

nodetool --host <hostname> cfstats

This will dump out a list for each column family looking like this:

Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Postives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634

The "Number of Keys (estimate)" row is a good guess across the cluster and the performance is a lot faster than explicit count approaches.

If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.

If you are not, you will need to store your user ids in a special row.

I found an excellent article on this here.. http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra

select count(*) from cf limit 1000000

Above statement can be used if we have an approximate upper bound known before hand. I found this useful for my case.

[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]

I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.

I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.

Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.

http://www.mail-archive.com/user@cassandra.apache.org/msg03965.html

There is always map/reduce but that probably goes without saying. If you have that with hive or pig, then you can do it for any table across the cluster though I am not sure tasktrackers know about cassandra locality and so it may have to stream the whole table across the network so you get task trackers on cassandra nodes but the data they receive may be from another cassandra node :(. I would love to hear if anyone knows for sure though.

NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.

I have been getting the counts like this after I convert the data into a hash in PHP.

继续阅读：cassandra count database rowcount

Row count of a column family in Cassandra

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？