
Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)

I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.

Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy the output files out of HDFS to a local folder and then upload them to another data warehouse.

This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way...

Enlighten me.


At foursquare, I'm using Hive's Thrift driver to put the data into databases and spreadsheets as needed.

I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it's needed. Using Thrift directly is straightforward and lets you work from any programming language.
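As a rough illustration of that pattern (not foursquare's actual code): Hive's JDBC driver talks Thrift to the Hive server, so a small client can run a query and ship the rows wherever they need to go. This sketch assumes a HiveServer1-era setup with the Thrift service on its default port 10000; the host, table, and query are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExportJob {
    public static void main(String[] args) throws Exception {
        // The HiveServer1-era JDBC driver; it speaks Thrift to the server.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Placeholder host -- point this at your Hive Thrift server.
        Connection con = DriverManager.getConnection(
                "jdbc:hive://hive.example.com:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run the batch query; from here the rows can be written to a CSV,
        // a spreadsheet, or another database.
        ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(sales) FROM nightly_results GROUP BY region");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}
```

Because the driver is just a Thrift client, the same approach works from any language with Thrift bindings, which is what makes it easy to feed databases or spreadsheets from the same job server.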

If you're dealing with Hadoop directly (and can't use this approach), you should check out Sqoop, built by Cloudera.

Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
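Sqoop is normally driven from the command line (sqoop export --connect ... --table ... --export-dir ...), but it can also be invoked programmatically. Here's a minimal sketch, assuming Sqoop's org.apache.sqoop.Sqoop.runTool entry point; every connection detail below is hypothetical:

```java
import org.apache.sqoop.Sqoop;

public class NightlyExport {
    public static void main(String[] args) {
        // Export a comma-delimited Hive output directory from HDFS into a
        // relational warehouse table. All connection details are placeholders.
        String[] exportArgs = new String[] {
            "export",
            "--connect", "jdbc:mysql://warehouse.example.com/reporting",
            "--username", "etl",
            "--password", "secret",
            "--table", "nightly_results",
            "--export-dir", "/user/hive/warehouse/nightly_results",
            "--input-fields-terminated-by", ","
        };

        // runTool parses the arguments exactly as the sqoop CLI would.
        int exitCode = Sqoop.runTool(exportArgs);
        System.exit(exitCode);
    }
}
```

On the classpath this needs the Sqoop jar plus the JDBC driver for the target database; scheduling it nightly would replace the shell-script copy step described in the question.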

Hope that helps.
