The difficulty of choosing the right database for analytics

I need some help deciding which database we should choose for our project. We are developing a web application that collects data about users' behavior and analyzes it (a bad explanation, but I can't provide much more detail; web analytics data is one of our core datasets). We estimate that we will insert approximately 200 million rows per week into the database, plus the data calculated from that raw data. The data must be retained for at least six months.
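
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. Only the 200 million rows per week and the six-month retention come from our estimate; treating six months as 26 weeks and the 200-byte average row size (`ASSUMED_BYTES_PER_ROW`) are assumptions made purely for illustration, and the derived data is not counted.

```python
# Back-of-the-envelope sizing for the raw data only (derived data excluded).
ROWS_PER_WEEK = 200_000_000    # estimate stated above
RETENTION_WEEKS = 26           # assumption: "six months" taken as 26 weeks
ASSUMED_BYTES_PER_ROW = 200    # assumption: hypothetical average row size

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def retained_rows(rows_per_week=ROWS_PER_WEEK, weeks=RETENTION_WEEKS):
    """Total raw rows held at the end of the retention window."""
    return rows_per_week * weeks


def average_inserts_per_second(rows_per_week=ROWS_PER_WEEK):
    """Average insert rate if the load were spread evenly over the week."""
    return rows_per_week / SECONDS_PER_WEEK


def raw_size_gb(rows, bytes_per_row=ASSUMED_BYTES_PER_ROW):
    """Uncompressed size in gigabytes (10**9 bytes) for a given row count."""
    return rows * bytes_per_row / 1e9


if __name__ == "__main__":
    rows = retained_rows()
    print(f"rows retained:       {rows:,}")
    print(f"avg inserts/second:  {average_inserts_per_second():.0f}")
    print(f"raw size (assumed):  {raw_size_gb(rows):.0f} GB")
```

Real traffic will be bursty, so the peak insert rate is likely well above the average, and the row size needs to be measured on real data before trusting the storage figure.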

I have spent the last week and a half gathering information about different solutions, but there seem to be so many that I feel lost. The most promising ones I found are Cassandra, HBase and Hive. I also looked at MongoDB, Redis and some others, but they either seemed suited to different needs or their communities weren't that active.

  • The whole app will run on Amazon's EC2. As a startup, the pay-as-you-go pricing model fits us like a glove. The easier the database is to manage in the cloud, the better.
  • Scalability is important. The amount of data we generate varies quite a lot and will grow over time.
  • We can't pay huge licensing fees. Otherwise we would probably use something like http://www.vertica.com/.
  • We need to do all sorts of analysis on the data, and the easier the analyses are to write, the better. I thought about using Map/Reduce for the task; HBase seems to have better support for this than Cassandra, and Hive has its own query language. Real-time analysis isn't needed; we can calculate results once a day and shovel them back into the database for fast retrieval.
  • Compression support would be nice, but is not necessary (disk space is cheap :).
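
To make the once-a-day Map/Reduce idea concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop, HBase or Hive; it only imitates the map, shuffle/sort and reduce phases. The event layout `(user_id, page, day)` and the function names are hypothetical, chosen for illustration rather than taken from our actual schema.

```python
from itertools import groupby
from operator import itemgetter


def map_event(event):
    """Map step: emit one ((day, page), 1) pair per raw event.

    `event` is a hypothetical (user_id, page, day) tuple.
    """
    _user_id, page, day = event
    yield (day, page), 1


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one (day, page) key."""
    return key, sum(values)


def run_daily_job(events):
    """Run map, shuffle/sort and reduce in memory; return {(day, page): views}."""
    mapped = [pair for event in events for pair in map_event(event)]
    # Shuffle/sort: bring identical keys next to each other, as a real
    # Map/Reduce framework would do between the two phases.
    mapped.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        out_key, total = reduce_counts(key, (count for _, count in group))
        results[out_key] = total
    return results


if __name__ == "__main__":
    sample_events = [
        ("u1", "/home", "day-1"),
        ("u2", "/home", "day-1"),
        ("u1", "/pricing", "day-1"),
    ]
    for (day, page), views in sorted(run_daily_job(sample_events).items()):
        print(day, page, views)
```

The resulting dictionary stands in for the precomputed results that would be written back to the database for fast retrieval; in a real cluster the framework would distribute the map and reduce calls across machines.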

I also thought about using MySQL (because we will use it for all the user information etc. anyway), but scaling would be much harder in the future, and I think at some point we would have to move to another database anyway. We are also more than willing to commit time and effort to help push the selected database forward in terms of development.


We have decided to go with Hadoop (plus Hive/HBase) as our primary data store. The main reasons for this are:

  • It is a proven technology, and many big sites are using it (Facebook...).
  • Lots of documentation is available, and even books about Hadoop have been written.
  • Hive provides a nice SQL-like query language and command line, so even guys who don't know Java/Python/etc. can write queries easily.
  • It's free, and the community seems to be helpful :)