Apache Cassandra Data Schema for Twitter Streaming API

I am aware of Twissandra, an example Twitter clone built on Cassandra, but I would like to know whether anyone has shared a Cassandra schema meant not for cloning Twitter but for storing tweets arriving through the Twitter Streaming API.


It very much depends on what sort of queries you want to run against the data after you have ingested it. I see from your previous question, "Dumping Twitter Streaming API tweets...", that you probably just want to do big batch processing on it.

If this is the case, you just need to worry about load balancing: making sure each node in the cluster handles 1/n of the write load and holds 1/n of the data. Using the RandomPartitioner and inserting one row per tweet, with the status id as the row key, will achieve this.
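To see why hashed row keys spread the load evenly, here is a rough Python simulation. It is only a sketch, not Cassandra itself: it hashes each status id with MD5 (the hash the RandomPartitioner is based on) and, as a simplifying assumption, picks a node by taking the hash modulo the node count rather than modelling real token ranges. The function names and the status id range are made up for illustration.

```python
import hashlib
from collections import Counter


def node_for_key(row_key: str, num_nodes: int) -> int:
    """Pick a node for a row key from its MD5 hash.

    Simplification: a real cluster assigns token ranges to nodes;
    here the hash is just reduced modulo the number of nodes.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def simulate_load(status_ids, num_nodes: int) -> list:
    """Count how many rows (one per tweet) land on each node."""
    counts = Counter(node_for_key(str(sid), num_nodes) for sid in status_ids)
    return [counts[i] for i in range(num_nodes)]


# Hypothetical, sequential status ids: even though the keys are ordered,
# hashing scatters them across the nodes.
loads = simulate_load(range(1_000_000, 1_040_000), num_nodes=4)
print(loads)
```

Each entry in `loads` should come out close to a quarter of the 40,000 rows, which is the 1/n behaviour described above.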

However, if you want to run queries like "give me all tweets for a given user", you will need a slightly more complicated schema, as the schema suggested above would require you to scan all the data. You could instead insert multiple tweets per row, with the row key being the user id, the column name being the tweet id, and the column value being the tweet. Then you could use get_slice to answer that query.
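The shape of that per-user layout can be sketched with a small in-memory model in Python. This is not the Cassandra client API: `UserTimeline` and its methods are hypothetical names, and the `get_slice` method here only imitates the idea of reading a contiguous range of columns from one row, with columns kept sorted by tweet id.

```python
from bisect import bisect_left, bisect_right


class UserTimeline:
    """In-memory stand-in for a column family with one wide row per user.

    Row key = user id, column name = tweet id, column value = tweet.
    """

    def __init__(self):
        # user_id -> (sorted list of tweet ids, {tweet_id: tweet})
        self._rows = {}

    def insert(self, user_id, tweet_id, tweet):
        ids, values = self._rows.setdefault(user_id, ([], {}))
        if tweet_id not in values:
            # Keep column names sorted, as columns within a row are.
            ids.insert(bisect_left(ids, tweet_id), tweet_id)
        values[tweet_id] = tweet

    def get_slice(self, user_id, start=None, finish=None, count=100):
        """Return up to `count` (tweet_id, tweet) pairs with
        start <= tweet_id <= finish, in tweet id order."""
        ids, values = self._rows.get(user_id, ([], {}))
        lo = 0 if start is None else bisect_left(ids, start)
        hi = len(ids) if finish is None else bisect_right(ids, finish)
        return [(tid, values[tid]) for tid in ids[lo:hi][:count]]


timeline = UserTimeline()
timeline.insert(42, 1003, "third tweet")
timeline.insert(42, 1001, "first tweet")
timeline.insert(7, 1002, "someone else's tweet")
print(timeline.get_slice(42))              # all tweets for user 42
print(timeline.get_slice(42, start=1002))  # only the newer ones
```

The point of the layout is that answering "all tweets for user 42" touches a single row instead of scanning every tweet.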

A good (somewhat related) blog post: http://blog.insidesystems.net/basic-time-series-with-cassandra
