HBase MapReduce on multiple Scan objects
I am evaluating HBase for some of the data analysis we are doing.
HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.
The problem with running a MapReduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons we want to scan the data for only those 4-5 event types in a given date range, not all 1000 event types. If we use the method below, I guess we don't have that choice, because it takes only one Scan object.
public static void initTableMapperJob(String table, Scan scan, Class mapper, Class outputKeyClass, Class outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
Is it possible to run MapReduce on a list of Scan objects? Is there any workaround?
Thanks
TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.
It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table. (The easiest way would be to call TableInputFormatBase.scan.setStartRow() each time.) Aggregate the InputSplit instances returned into a single list.
Then configure the job yourself to use your custom MultiSegmentTableInputFormat.
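The row segments that such a getSplits override would loop over can be computed up front. The following is a minimal plain-Java sketch, not HBase API. It assumes a hypothetical key layout of a 4-byte big-endian event type ID followed by an 8-byte big-endian timestamp in milliseconds (the question only says "eventId + time"), and the names EventRowRanges, RowRange, and rangesFor are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class EventRowRanges {

    /** One start/stop row pair; stopRow is meant to be exclusive, as in an HBase Scan. */
    static final class RowRange {
        final byte[] startRow;
        final byte[] stopRow;

        RowRange(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /**
     * ASSUMED key layout (not from the question): 4-byte event type ID
     * followed by 8-byte epoch millis, both big-endian and non-negative,
     * so that byte order matches numeric order.
     */
    static byte[] rowKey(int eventId, long timeMillis) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(eventId)
                .putLong(timeMillis)
                .array();
    }

    /** Builds one [start, stop) row segment per requested event type. */
    static List<RowRange> rangesFor(int[] eventIds, long fromMillis, long toMillis) {
        List<RowRange> ranges = new ArrayList<>();
        for (int id : eventIds) {
            ranges.add(new RowRange(rowKey(id, fromMillis), rowKey(id, toMillis)));
        }
        return ranges;
    }
}
```

Each RowRange would then supply the start and stop row for one call to super.getSplits, with the returned splits collected into one list.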
You are looking for the class:
org/apache/hadoop/hbase/filter/FilterList.java
Each scan can take a filter. A filter can be quite complex. The FilterList allows you to specify multiple single filters and then do an AND or an OR between all of the component filters. You can use this to build up an arbitrary boolean query over the rows.
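To make the boolean structure concrete, here is a small plain-Java model of the query such a filter list would express: an OR across the wanted event types, with an AND of "same event type" and "inside the time window" for each one. This only illustrates the logic and does not use the HBase filter classes. It assumes the same hypothetical key layout of a 4-byte big-endian event type ID followed by an 8-byte big-endian timestamp, and EventRowMatcher, eventInRange, and anyOf are invented names.

```java
import java.nio.ByteBuffer;
import java.util.function.Predicate;

final class EventRowMatcher {

    /** ASSUMED key layout: 4-byte event type ID, then 8-byte epoch millis. */
    static byte[] rowKey(int eventId, long timeMillis) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(eventId)
                .putLong(timeMillis)
                .array();
    }

    /** AND of two conditions: the event type matches and the time is in [from, to). */
    static Predicate<byte[]> eventInRange(int eventId, long from, long to) {
        Predicate<byte[]> sameEvent = row -> ByteBuffer.wrap(row).getInt(0) == eventId;
        Predicate<byte[]> inWindow = row -> {
            long t = ByteBuffer.wrap(row).getLong(Integer.BYTES);
            return t >= from && t < to;
        };
        return sameEvent.and(inWindow);
    }

    /** OR across all requested event types. */
    static Predicate<byte[]> anyOf(int[] eventIds, long from, long to) {
        Predicate<byte[]> any = row -> false;
        for (int id : eventIds) {
            any = any.or(eventInRange(id, from, to));
        }
        return any;
    }
}
```

In FilterList terms, the outer OR corresponds roughly to a list built with FilterList.Operator.MUST_PASS_ONE and each inner AND to one built with FilterList.Operator.MUST_PASS_ALL.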
I've tried Dave L's approach and it works beautifully.
To configure the map job, you can use the function
TableMapReduceUtil.initTableMapperJob(byte[] table, Scan scan,
Class<? extends TableMapper> mapper,
Class<? extends WritableComparable> outputKeyClass,
Class<? extends Writable> outputValueClass, Job job,
boolean addDependencyJars, Class<? extends InputFormat> inputFormatClass)
where inputFormatClass refers to the MultiSegmentTableInputFormat mentioned in Dave L's comments.