What Cassandra schema could serve this query?
Assume a social application that has a few million users and around 200-300 topics. Users can make posts, which can be tagged with up to 5 topics. I have 2 kinds of queries on this data:
- find posts by a certain user
- find all recent posts tagged on a specific topic.
For the first query, I can easily create the schema using super columns in the User column family (in the super column, I can store the post IDs of all of the user's posts as columns).
My question is: how should I design the schema to serve the second query in Cassandra?
Although Justice's answer would work, I don't like it because it requires an OrderPreservingPartitioner to perform the range scan. OPP has a lot of problems associated with it. See the article that I've been linking to constantly for details.
Instead, I would recommend this:
topic|YYMMDDHH: {TimeUUID: postID, TimeUUID: postID, etc... }
where "topic|YYMMDDHH" is the row key, each column name is a TimeUUID, and the column values are postIDs.
To get the latest posts for any topic, you get a slice off the end of the most recent row for that topic. If that row didn't have enough columns, you go to the previous one in time, etc.
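The read path described above can be sketched roughly as follows. This is only an in-memory model, not Cassandra client code: the `TopicIndex` class, the `latest_posts` method, and the `max_hours_back` cutoff are hypothetical names introduced for illustration, and a plain timestamp stands in for the TimeUUID column name.

```python
from datetime import datetime, timedelta


def row_key(topic, when):
    """Build the 'topic|YYMMDDHH' row key for the hour containing `when`."""
    return f"{topic}|{when.strftime('%y%m%d%H')}"


class TopicIndex:
    """In-memory stand-in for the column family (hypothetical, for illustration)."""

    def __init__(self):
        # row key -> list of (timestamp, post_id); the timestamp plays the
        # role of the TimeUUID column name, the post_id is the column value.
        self.rows = {}

    def add_post(self, topic, post_id, when):
        self.rows.setdefault(row_key(topic, when), []).append((when, post_id))

    def latest_posts(self, topic, now, limit, max_hours_back=24):
        """Slice from the end of the newest row, then walk back hour by hour.

        `max_hours_back` is an assumed cutoff so the walk always terminates.
        """
        found = []
        hour = now
        for _ in range(max_hours_back):
            # Newest columns first, like a reversed slice on one row.
            columns = sorted(self.rows.get(row_key(topic, hour), []), reverse=True)
            for _, post_id in columns:
                found.append(post_id)
                if len(found) == limit:
                    return found
            hour -= timedelta(hours=1)
        return found
```

In a real cluster each loop iteration would be one reversed slice query against a single row, so a busy topic is usually answered by the first row alone.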
This has a few nice properties. First, if you don't care about really old posts on a topic, only relatively recent ones, you can purge old rows on a regular basis and save yourself some space; this could even be done with column TTLs so that you don't have to do any extra work. Second, your rows will be bounded in size because they are split every hour. Third, you don't need OPP :)
One downside to this is that if there's a really hot topic, one node may receive higher traffic than the others for an hour at a time.
For the second query, build a secondary-index column family whose keys are #{topic}:#{unix_timestamp}. Rows would have a single column with the post ID. You can then do a range scan.
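As a rough illustration of why this works, here is an in-memory sketch of a range scan over such keys. It is not Cassandra client code: `OrderedIndex`, `index_key`, and `range_scan` are hypothetical names, and the zero-padding of the timestamp is my own assumption so that string order matches numeric order. The scan depends on the keys being stored in sorted order, which is the ordering requirement the other answer objects to.

```python
import bisect


def index_key(topic, unix_ts):
    """Build a 'topic:timestamp' row key.

    Zero-padding to 10 digits is an assumption, added so that comparing
    keys as strings gives the same order as comparing the timestamps.
    """
    return f"{topic}:{unix_ts:010d}"


class OrderedIndex:
    """In-memory stand-in for a key-ordered column family (hypothetical)."""

    def __init__(self):
        self.keys = []    # row keys, kept sorted
        self.values = {}  # row key -> post ID (the single column)

    def put(self, topic, unix_ts, post_id):
        key = index_key(topic, unix_ts)
        if key not in self.values:
            bisect.insort(self.keys, key)
        # Two posts on the same topic in the same second would collide here.
        self.values[key] = post_id

    def range_scan(self, topic, start_ts, end_ts):
        """Return post IDs for `topic` with start_ts <= timestamp <= end_ts."""
        lo = bisect.bisect_left(self.keys, index_key(topic, start_ts))
        hi = bisect.bisect_right(self.keys, index_key(topic, end_ts))
        return [self.values[k] for k in self.keys[lo:hi]]
```

Note that with one row per post, two posts tagged with the same topic in the same second would need an extra tiebreaker in the key.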