What is the advantage of integrating HBase and Hive?
Recently, I came across a blog where the author mentioned integrating HBase and Hive. Is this possible, and if so, what is the advantage of using both (in terms of performance and scalability)? Kindly correct me if I am wrong.
I think it is possible, but it won't be trivial to set up for a while. Maybe the final CDH3 release will include the integration when it comes out.
Advantages: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
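To make "Hive queries over HBase" a bit more concrete, here is a rough sketch of what the mapping might look like. It is only a Python helper that builds the HiveQL text; it does not talk to Hive or HBase. The storage handler class and the `hbase.columns.mapping` / `hbase.table.name` properties come from the Hive HBase integration; the table names, column names, and the helper function itself are made up for illustration.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, key_column, column_map):
    """Build HiveQL that exposes an existing HBase table to Hive.

    key_column: (hive_column, hive_type) for the HBase row key.
    column_map: ordered list of (hive_column, hive_type, "family:qualifier").
    All names passed in are hypothetical examples.
    """
    key_name, key_type = key_column
    hive_cols = [f"{key_name} {key_type}"]
    mapping = [":key"]  # ":key" maps the first Hive column to the HBase row key
    for name, hive_type, hbase_col in column_map:
        hive_cols.append(f"{name} {hive_type}")
        mapping.append(hbase_col)
    cols_sql = ", ".join(hive_cols)
    mapping_sql = ",".join(mapping)  # no spaces between mapping entries
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({cols_sql})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping_sql}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Hypothetical example: an HBase table "pageviews" with two column families.
ddl = hbase_backed_table_ddl(
    "page_views",
    "pageviews",
    ("url", "string"),
    [("hits", "bigint", "stats:hits"), ("site_id", "string", "meta:site")],
)

# Once the table is mapped, an aggregate is ordinary HiveQL.
aggregate_query = (
    "SELECT site_id, SUM(hits) AS total_hits FROM page_views GROUP BY site_id;"
)
```

Once a table is mapped this way, joins against other Hive tables are written like any other Hive join, which is most of the appeal.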
Why not just use Hive and not bother with HBase? HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the best of both worlds.
There is currently a patch which enables loading data between HBase and Hive. You can find it here:
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
The implementation overhead looks to be pretty high.
It might be easier to run a scan on the HBase table, save the result to an external file, and then import it into Hive for data manipulation. (This is also pretty cumbersome, but it can be scripted if you are doing it on a regular basis.) This is the solution that I am currently working on. I'll let you know how it goes.
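The export step of that approach could be scripted roughly like this. This is a minimal sketch, not what I actually run: it assumes you already have the scan results as (row key, cells) pairs from whatever HBase client you use, that values contain no tabs or newlines, and that the target Hive table is declared with tab-separated fields. The function name and the sample data are hypothetical.

```python
import io

NULL_MARKER = "\\N"  # Hive's default text representation of NULL


def write_rows_for_hive(rows, columns, out, delimiter="\t"):
    """Write scanned rows as delimited text that Hive can load.

    rows: iterable of (row_key, {"family:qualifier": value}) pairs,
          assumed to come from an HBase scan done elsewhere.
    columns: the HBase columns to export, in Hive column order.
    out: any writable text file object.
    Returns the number of rows written.
    """
    count = 0
    for row_key, cells in rows:
        # Missing cells become NULL so every line has the same field count.
        fields = [row_key] + [cells.get(col, NULL_MARKER) for col in columns]
        out.write(delimiter.join(fields) + "\n")
        count += 1
    return count


# Hypothetical scan output; the second row has no "meta:site" cell.
sample_rows = [
    ("example.com/a", {"stats:hits": "12", "meta:site": "s1"}),
    ("example.com/b", {"stats:hits": "7"}),
]
buffer = io.StringIO()
written = write_rows_for_hive(sample_rows, ["stats:hits", "meta:site"], buffer)
```

In practice you would write to a real file instead of a `StringIO` and then pull it into Hive with a `LOAD DATA LOCAL INPATH ... INTO TABLE ...` statement.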
As for why you would choose HBase over Hive, they aren't really interchangeable. HBase is meant as a highly scalable data store built on top of Hadoop, with little support for data analysis. Hive on the other hand isn't used for storing data in a production environment, but rather makes it very easy to run specific queries over large amounts of data.