Efficient retrieval of column families
Recently I've been working on efficient retrieval of several columns from a single row in a single column family. I'm currently using Pelops as my Cassandra API. The question is what to do when I want to get columns from several ranges. It would be easy if I could fetch columns from the family with a few slices at once, but I can't.
For example, I have a family with an enormous number of columns. Some of them share a common prefix, say "group/xxx", where xxx is an identifier. There are also a couple of columns named, for example, "a", "b", "c". Now I want to get these columns together, so I have to define two slices and call getColumnsFromRow twice.
How can I solve this efficiently? Does Cassandra somehow cache a recently retrieved column family, so that calling getColumnsFromRow a second time does not have to search it again?
Because you have rolled your own compound column names, you basically have to issue multiple get_slice calls.
This is not a terribly big deal efficiency-wise, since these columns are in the same row and, if you chose your comparator correctly, reading them should take a single disk seek. Subsequent queries to this same row should hit this portion of the table in the OS's disk cache (OS level, nothing to do with Cassandra).
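To make the two-slice idea concrete, here is a minimal sketch in Python. It is only a toy model of one wide row whose column names are kept in sorted (comparator) order; it does not use Pelops or the Thrift API, and the helpers slice_range and slice_names, the sample columns, and the "\uffff" upper bound are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for one wide row; real rows live in Cassandra, not in a dict.
row = {
    "a": "1",
    "b": "2",
    "c": "3",
    "counter": "42",      # an unrelated column that sorts between "c" and "group/"
    "group/001": "x",
    "group/002": "y",
    "other": "z",
}
names = sorted(row)  # columns are stored in comparator order


def slice_range(start, finish):
    """Contiguous slice: every column with start <= name <= finish."""
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return {n: row[n] for n in names[lo:hi]}


def slice_names(wanted):
    """Name-list slice: only the explicitly requested columns that exist."""
    return {n: row[n] for n in wanted if n in row}


# Request 1: the whole "group/" prefix as one contiguous range.
# "group/\uffff" is an assumed upper bound that sorts after every identifier here.
group_cols = slice_range("group/", "group/\uffff")

# Request 2: the individually named columns.
named_cols = slice_names(["a", "b", "c"])

# Merge the two results on the client side.
result = {**named_cols, **group_cols}
print(sorted(result))
```

In this model a single range from "a" to the end of the "group/" prefix would also pull in "counter" and anything else that sorts in between, which is why the two requests are kept separate and merged by the client.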
Row caching was designed for small rows where the entire contents are accessed frequently (like a serialized object or similar). It will actually impose a substantial amount of memory pressure for large rows like this. I would recommend leaving the row cache disabled for this CF.
If you find you need to, you can do some additional tuning:
- turn down read_repair_chance
- enable 'result pinning': https://github.com/apache/cassandra/blob/cassandra-0.7.0/conf/cassandra.yaml#L229-236
This will let your OS's file system cache work more efficiently, since the same hosts will be handling the same queries, and the subsequent slices will be operating on sections of the row that are ideally in the same SSTable and thus in the FS cache.
(Shameless plug, but actually quite helpful in these situations.) Also, consider downloading OpsCenter, which is free (http://www.datastax.com/opscenter), and watch the metrics for the column family as you experiment with the different options. This will give you an idea of the most efficient way to structure your queries for your particular data.
Cassandra does have optional row caching, but this is likely to cost a lot of memory if your rows are very large, so it is probably not advisable.
(Row caching is configured per column family using the rows_cached, row_cache_save_period_in_seconds, and preload_row_cache properties in your storage configuration.)
http://wiki.apache.org/cassandra/StorageConfiguration says:
The row cache saves even more time, but must store the whole values of its rows, so it is extremely space-intensive. It's best to only use the row cache if you have hot rows or static rows.