Migrating a Java Application to Hadoop: Architecture/Design Roadblocks?
Alright, so here's the situation: I am responsible for architecting the migration of a Java-based ETL (EAI, rather) product to Hadoop (the Apache distribution). Technically this is more of a reboot than a migration, because I have no database to migrate. The point is to leverage Hadoop so that the Transformation phase (the 'T' of ETL) is parallelized. This would make my ETL software:
- Faster - the transformation is parallelized across the cluster.
- Scalable - handling more data/big data is a matter of adding more nodes.
- Reliable - Hadoop's redundancy and reliability add to my product's feature set.
I've tested this configuration out: I converted my transformation algorithms into a MapReduce model, ran it on a high-end Hadoop cluster, and benchmarked the performance. Now I'm trying to understand and document everything that could stand in the way of this application redesign/re-architecture/migration. Here are a few things I could think of:
- The other two phases, Extraction and Load: my ETL tool can handle a variety of data sources. Do I redesign my data adapters to read data from these sources, load it into HDFS, transform it, and then load it into the target data source? Could this staging step become a huge bottleneck for the entire architecture? (A sketch of such an HDFS staging adapter follows this list.)
- Feedback: say my transformation fails on a record - how do I let the end user know that the ETL hit an error on that particular record? In short, how do I keep track of what is actually going on at the application level with all the maps, reduces, merges, and sorts happening? The default Hadoop web interface is not for the end user - it's for admins. So should I build a new web app that scrapes the Hadoop web interface? (I know this is not recommended.) A counter-based mapper sketch for surfacing per-record errors also follows this list.
- Security: how do I handle authorization at the Hadoop level? Who can run jobs and who cannot - how do I support ACLs?
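For the staging question above, here is a minimal sketch (not a definitive implementation, and not part of the existing product) of a data adapter writing extracted records into HDFS via the FileSystem API. The /etl/staging path, the HdfsStagingWriter name, and the assumption that an adapter can expose its source as an InputStream are all illustrative:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical adapter: streams extracted records into an HDFS staging dir.
    public class HdfsStagingWriter {

        public void stage(InputStream source, String fileName) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Blocks are spread across datanodes as the file is written, so the
            // write itself is distributed; the source system and the client's
            // network interface are the more likely bottlenecks.
            Path target = new Path("/etl/staging/" + fileName);
            FSDataOutputStream out = fs.create(target, true);
            try {
                IOUtils.copyBytes(source, out, conf, false);
            } finally {
                out.close();
            }
        }
    }

Whether this step becomes a bottleneck then mostly comes down to how fast the source systems can be read and how many adapter instances run in parallel.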
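For the feedback question, here is a minimal sketch of how a transformation rewritten as a mapper could surface per-record errors through job counters, which the driver can then report to the end user instead of pointing anyone at the Hadoop web UI. The counter names and the transformRecord() placeholder are illustrative, not part of the existing code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingTransformMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // Counters are aggregated across all tasks and visible to the driver.
        enum EtlRecords { TRANSFORMED, FAILED }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            try {
                String transformed = transformRecord(record.toString());
                context.write(NullWritable.get(), new Text(transformed));
                context.getCounter(EtlRecords.TRANSFORMED).increment(1);
            } catch (RuntimeException e) {
                // Count the failure instead of failing the whole job.
                context.getCounter(EtlRecords.FAILED).increment(1);
            }
        }

        // Placeholder for the existing per-record transformation logic.
        private String transformRecord(String input) {
            return input.trim().toUpperCase();
        }
    }

After job.waitForCompletion(true), the driver can read job.getCounters().findCounter(EtlRecords.FAILED).getValue() and feed that number into the product's own UI or logs.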
I look forward to hearing your possible answers to the questions above, plus any further questions or facts I'd need to consider, based on your experience with Hadoop and your analysis of the problem. As always, I appreciate your help and thank you all in advance.
- I do not expect loading into HDFS to be a bottleneck, since the load is distributed among the datanodes - so the network interface will be the only bottleneck. Loading data back into the database might be a bottleneck, but I think it is no worse than it is now. I would design the jobs so that both their input and their output sit in HDFS, and then run some kind of bulk load of the results into the database. A minimal driver sketch along these lines is below.
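As an illustration only, a minimal map-only driver with its input and output in HDFS, using the newer org.apache.hadoop.mapreduce API (on Hadoop 1.x it would be new Job(conf, ...) rather than Job.getInstance). The paths, the job name, and the TransformJobDriver/CountingTransformMapper class names are assumptions carried over from the sketches above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TransformJobDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "etl-transform");
            job.setJarByClass(TransformJobDriver.class);

            // Map-only transformation: read staged extracts from HDFS and
            // write transformed records back to HDFS for a later bulk load.
            job.setMapperClass(CountingTransformMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path("/etl/staging"));
            FileOutputFormat.setOutputPath(job, new Path("/etl/transformed"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The files under /etl/transformed can then be bulk-loaded into the target database, for example with the Sqoop tool mentioned in the next point.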
- Feedback is a problematic point, since MapReduce really has only one result - the transformed data. All other tricks, like writing failed records into HDFS files, lack the "functional" reliability of MapReduce, because they are side effects. One way to mitigate this problem is to design your software to be ready for duplicated failed records; a sketch of this side-output trick appears after this answer. There is also Sqoop - a tool specifically for moving data between SQL databases and Hadoop: http://www.cloudera.com/downloads/sqoop/ At the same time I would consider using Hive - if your SQL transformations are not that complicated, it might be practical to create CSV files and do the initial pre-aggregation with Hive, thereby reducing the data volume before it goes to the (perhaps single-node) database.
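One way the "failed records into HDFS files" trick could look, sketched with MultipleOutputs; the "failed" output name, the class name, and transformRecord() are illustrative. Keep the caveat above in mind: retried tasks can rewrite these side files, so downstream handling must tolerate duplicated failed records:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SideOutputTransformMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            try {
                context.write(NullWritable.get(),
                        new Text(transformRecord(record.toString())));
            } catch (RuntimeException e) {
                // Side effect: the failed record goes to the "failed" named output.
                out.write("failed", NullWritable.get(), record);
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close();
        }

        // Placeholder for the per-record transformation logic.
        private String transformRecord(String input) {
            return input.trim().toUpperCase();
        }
    }

The driver would also need MultipleOutputs.addNamedOutput(job, "failed", TextOutputFormat.class, NullWritable.class, Text.class) before submitting the job.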