Database selection for a web-scale analytics application
I want to build a web-application similar to Google-Analytics, in which I collect sta开发者_运维知识库tistics on my customers' end-users, and show my customers analysis based on that data.
Characteristics:
- High scalability, handle very large volume
- Compartmentalized - Queries always run on a single customer's data
- Support analytical queries (drill-down, slices, etc.)
Due to the analytical need, I'm considering to use an OLAP/BI suite, but I'm not sure it's meant for this scale. NoSQL database? Simple RDBMS would do?
These what I am using at work in a production environnement and it works like a charm.
I copled three things
PostgreSQL + LucidDB + Mondrian (More generally the whole Pentaho BI suite components)
PostgreSQL : I am not going to describe postgresql, really strong open source RDBMS will let you do - certainly - everything you need. I use it to store my operational data.
LucidDB : LucidDB is an Open source column-store database. Highly scalable and will provide a really gain of processing time compare to PostgreSQL for retrieving a large amount of data. It is not optimized for transaction processing but for intensive reads. This is my Datawarehouse database
Mondrian : Mondrian is an Open Source R-OLAP cube. LucidDB made it easy to connect those two programs together.
I would recommend you to look at the whole Pentaho BI Suite, it worth it, you might want to use some of there components.
Hope I could help,
There are two main architectures you could opt for for true web-scale:
1. "BI" architecture
- Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
- Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
- Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
- (Optional) Event journaller or immutable event store feeds
- NoSQL database (e.g. Cassandra, Riak, HBase) feeds
- A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to be batching your analytics events and doing bulk updates to your database (even with something like HDFS) - rather than doing an atomic write for every single page view etc.
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
I'd say that having put in place OLAP analysis is always nice and then has great potential for sophisticated data analysis using MDX.
- What do you mean by large volume ?
- Where are your customer user information?
- What kind of front-end and reporting are you going to use?
Cheers.
Disclaimer : I'll make some publicity for my own solution - have a look to www.icCube.com and contact me for more details
精彩评论