NoSQL or MySQL for Data Analytics
We have a cluster (Hadoop, Pig) that churns through 350 GB of data (growing by a couple of GB a week).
All of this data needs to be made available for analytics.
We have a MySQL solution with a star schema (only part of the data is loaded into it). But my
concern is how far one can stretch this.
Should I be looking at NoSQL, like Hive, for data analytics?
I read this article http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive
How big is big data, and when should I be looking away from MySQL? Will the structural rigidity of MySQL cause problems?
Currently the data in MySQL is only a few GB, but it will certainly grow. How about MySQL clustering?
Should I be going down this path at all?
350Gb (growing couple of GB a week)... All these data need to be made available for Analytics
Do you have MySQL gurus in house? If yes, sure => just create and grow that MySQL cluster. The problem with this solution is not that it is MySQL, and it is not that it is not NoSQL => it is that it requires an expert to set it up and to be on hand whenever it needs to be changed. But guess what => SQL is MUCH better and simpler for analytics than a map/reduce-ish SQL simulation.
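To make the "SQL is simpler for analytics" point concrete, here is a minimal sketch that computes the same per-group total twice: once with a single `GROUP BY`, and once with hand-written map, shuffle and reduce steps. Everything in it is an assumption for illustration: the `sales` table, the `region` and `amount` columns, the sample rows, and the use of Python's built-in `sqlite3` as a stand-in for MySQL.

```python
import sqlite3
from collections import defaultdict

# Hypothetical fact rows (region, amount); names and values are made up.
SALES = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 7.5), ("north", 4.0)]


def totals_with_sql(rows):
    """Aggregate with one declarative GROUP BY statement."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a MySQL table
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return result


def totals_with_map_reduce(rows):
    """Compute the same totals with explicit map, shuffle and reduce steps."""
    # Map: emit one (key, value) pair per input row.
    mapped = [(region, amount) for region, amount in rows]
    # Shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each group to a single total.
    return {key: sum(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(totals_with_sql(SALES))
    print(totals_with_map_reduce(SALES))
```

Both functions return the same dictionary. The difference is that the SQL version states what is wanted, while the map/reduce version spells out how to get it, and that gap widens once joins and multiple aggregates are involved.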
Something that can become a problem later with a MySQL solution is Oracle, which owns MySQL. So make sure you understand which features of MySQL you can use for free, and which features you would have to pay for.
If you do not have a MySQL expert in house, or you would not like to pay for one, you can definitely turn to NoSQL. That does not mean you would not need expertise in the NoSQL product, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.
For example, in Riak, and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all => it really is that simple.
The price you pay with NoSQL is losing SQL (think about nice aggregating features) and consistency, which becomes eventual; if you are strictly doing analytics, consistency may not be a price for you at all.
In return you get very natural Big Data handling, fault tolerance and much more.
If you are in the Hadooooxyz space, and you are okay with paying, take a look at Hadapt, which promises 5 times Hive's performance.
The question is of course now many months old, but I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem: in principle it should drop in and require very little administration and few code changes. Scaling up on one box or out across multiple servers is supported.
You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms
Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).
InfiniDB is not free.
Check out http://code.google.com/p/shard-query
This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.
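The general idea (sharded fact table, replicated dimension tables, per-shard query, merged partial results) can be sketched as follows. This is not Shard-Query's code or API; it is a toy illustration that uses in-memory `sqlite3` databases in place of MySQL servers, and the table names, columns, rows and the modulo sharding rule are all assumptions.

```python
import sqlite3
from collections import Counter

N_SHARDS = 3  # assumed number of nodes

# Hypothetical star schema data: one small dimension, one fact table.
DIM_PRODUCT = [(1, "books"), (2, "games"), (3, "music")]  # (product_id, category)
FACT_SALES = [  # (sale_id, product_id, amount)
    (1, 1, 10), (2, 2, 20), (3, 1, 5), (4, 3, 7), (5, 2, 1), (6, 1, 2),
]

# The same query is sent unchanged to every shard.
QUERY = """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
"""


def make_shards(n):
    """Create n independent databases; each one stands in for a server."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
        conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount INTEGER)")
        # Dimension tables are duplicated in full on every shard.
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", DIM_PRODUCT)
        shards.append(conn)
    # Fact rows are split across shards (here: sale_id modulo n).
    for row in FACT_SALES:
        shards[row[0] % n].execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
    return shards


def sharded_totals(shards):
    """Run the query on each shard and merge the partial sums."""
    totals = Counter()
    for conn in shards:
        for category, partial_sum in conn.execute(QUERY):
            totals[category] += partial_sum
    return dict(totals)


if __name__ == "__main__":
    print(sharded_totals(make_shards(N_SHARDS)))
```

Because every shard holds a full copy of the dimension table, each join is purely local and no rows move between nodes. Merging by addition works here because `SUM` can be combined from partial results; an average would need each shard to return both a sum and a count.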
You can check out this blog post for more info and performance testing results:
http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/
FYI: I'm the author of Shard-Query.