HBase bulk load spawns a high number of reducer tasks - any workaround?

HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job would spawn a few hundred reducer tasks. This can get very slow on a small cluster.
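
For reference, the job setup being described looks roughly like the sketch below. This is a minimal illustration, not code from the original post: the table name, column family, and line-parsing mapper are hypothetical placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadJob {

        // Hypothetical mapper: parses "rowkey,value" text lines into KeyValues.
        static class HFileMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
            @Override
            protected void map(LongWritable key, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",", 2);
                byte[] row = Bytes.toBytes(fields[0]);
                context.write(new ImmutableBytesWritable(row),
                    new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("q"),
                        Bytes.toBytes(fields[1])));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "hbase-bulk-load");
            job.setJarByClass(BulkLoadJob.class);
            job.setMapperClass(HFileMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // This is the call in question: it wires up TotalOrderPartitioner
            // and pins the number of reduce tasks to the number of regions
            // currently in the table.
            HTable table = new HTable(conf, "my_table");
            HFileOutputFormat.configureIncrementalLoad(job, table);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }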

Is there any workaround possible by using MultipleOutputFormat or something else?

Thanks


  1. Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online, and you can also determine when a region has been load balanced to another server. I wouldn't be so quick to go to a coarser granularity.
  2. Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.
  3. It's very easy to get network bottlenecked. Make sure you're compressing your HFiles and your intermediate MR data.

      // Compress intermediate map output (old-style Hadoop property names)
      job.getConfiguration().setBoolean("mapred.compress.map.output", true);
      job.getConfiguration().setClass("mapred.map.output.compression.codec",
          org.apache.hadoop.io.compress.GzipCodec.class,
          org.apache.hadoop.io.compress.CompressionCodec.class);
      // Compress the generated HFiles with LZO
      job.getConfiguration().set("hfile.compression",
          Compression.Algorithm.LZO.getName());
    
  4. Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable.Put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job), and the sketch after this list.
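
A sketch of the Put-based route from point 4, assuming a hypothetical text input of "rowkey,value" pairs and a hypothetical table and column family. Passing null as the reducer class and setting zero reduce tasks makes TableOutputFormat write the Puts straight from the map side:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class PutImportJob {

        // Hypothetical mapper: turns "rowkey,value" lines into Puts.
        static class PutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",", 2);
                byte[] row = Bytes.toBytes(fields[0]);
                Put put = new Put(row);
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                        Bytes.toBytes(fields[1]));
                context.write(new ImmutableBytesWritable(row), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "hbase-put-import");
            job.setJarByClass(PutImportJob.class);
            job.setMapperClass(PutMapper.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // null reducer: TableOutputFormat is still set as the output
            // format, so the mapper's Puts go straight to the table.
            TableMapReduceUtil.initTableReducerJob("my_table", null, job);
            job.setNumReduceTasks(0); // map-only: no reduce stage at all

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The trade-off is that every Put goes through the normal regionserver write path (WAL, memstore), so this only makes sense for modest import sizes.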


When we use HFileOutputFormat, it overrides the number of reducers regardless of what you set. The number of reducers is equal to the number of regions in that HBase table. So decrease the number of regions if you want to control the number of reducers.
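
To illustrate: the region count is fixed when the table is created (or when regions split later), so the way to control it up front is to pre-split the table with fewer split keys. A minimal sketch using the old HBaseAdmin API, with a hypothetical table name and split points:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePresplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("my_table");
            desc.addFamily(new HColumnDescriptor("cf"));

            // Three split keys -> four regions -> four reducers when
            // configureIncrementalLoad sets up the bulk-load job.
            byte[][] splits = new byte[][] {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("u")
            };
            admin.createTable(desc, splits);
        }
    }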

You will find sample code here:

Hope this will be useful :)
