cassandra data model for web logging
Been playing around with Cassandra and I am trying to evaluate what would be the best data model for storing things like views or hits for unique page id's? Would it best to have a single column family per pageid, or 1 Super-column (logs) with columns pageid? Each page has a unique id, then would like to store date and some other metrics on the view.
I am just not sure which solution handles better scalability, lots of column family OR 1 giant super-column?
page-92838 { date:sept 2, brow开发者_开发问答ser:IE } page-22939 { date:sept 2, browser:IE5 }
OR
logs { page-92838 { date:sept 2, browser:IE } page-22939 { date:sept 2, browser:IE5 } }
And secondly, how to handle lots of different date: entries for page-92838?
You don't need a column-family per pageid.
One solution is to have a row for each page, keyed on the pageid.
You could then have a column for each page-view or hit, keyed and sorted on time-UUID (assuming having the views in time-sorted order would be useful) or other unique, always-increasing counter. Note that all Cassandra columns are time-stamped anyway, so you would have a precise timestamp 'for free' regardless of what other time- or date- stamps you use. Using a precise time-UUID as the key also solves the problem of storing many hits on the same date.
The value of each column could then be a textual value or JSON document containing any other metadata you want to store (such as browser).
page-12345 -> {timeuuid1:metadata1}{timeuuid2:metadata2}{timeuuid3:metadata3}...
page-12346 -> ...
With cassandra, it is best to start with what queries you need to do, and model your schema to support those queries.
Assuming you want to query hits on a page, and hits by browser, you can have a counter column for each page like,
stats { #cf
page-id { #key
hits : # counter column for hits
browser-ie : #counts of views with ie
browser-firefox : ....
}
}
If you need to do time based queries, look at how twitters rainbird denormalizes as it writes to cassandra.
精彩评论