How can I know when the Amazon MapReduce task is complete?

I am trying to run a MapReduce task on Amazon EC2. I set all the configuration parameters and then call the runJobFlow method of the AmazonElasticMapReduce service. Is there any way to know whether the job has completed and what its status was? (I need this so that I know when I can pick up the MapReduce results from S3 for further processing.)

Currently the code just keeps executing because the call to runJobFlow is non-blocking.

public void startMapReduceTask(String accessKey, String secretKey
        ,String eC2KeyPairName, String endPointURL, String jobName
        ,int numInstances, String instanceType, String placement
        ,String logDirName, String bucketName, String pigScriptName) {
    log.info("Start running MapReduce");

    ClientConfiguration config = new ClientConfiguration();
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

    AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config);
    service.setEndpoint(endPointURL);

    JobFlowInstancesConfig conf = new JobFlowInstancesConfig();

    conf.setEc2KeyName(eC2KeyPairName);
    conf.setInstanceCount(numInstances);
    conf.setKeepJobFlowAliveWhenNoSteps(true);
    conf.setMasterInstanceType(instanceType);
    conf.setPlacement(new PlacementType(placement));
    conf.setSlaveInstanceType(instanceType);

    StepFactory stepFactory = new StepFactory();

    StepConfig enableDebugging = new StepConfig()
        .withName("Enable Debugging")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    StepConfig installPig = new StepConfig()
        .withName("Install Pig")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newInstallPigStep());

    StepConfig runPigScript = new StepConfig()
        .withName("Run Pig Script")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, ""));

    RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf)
        .withSteps(enableDebugging, installPig, runPigScript)
        .withLogUri("s3n://" + bucketName + "/" + logDirName);

    try {
        // runJobFlow only submits the job flow; it returns immediately and does not wait for completion
        RunJobFlowResult res = service.runJobFlow(request);
        log.info("MapReduce job with id [" + res.getJobFlowId() + "] submitted");
    } catch (Exception e) {
        log.error("Caught Exception: ", e);
    }
    log.info("End running MapReduce");      
}

thanks,

aviad


From the AWS documentation:

Once the job flow completes, the cluster is stopped and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.

It goes on to say:

If the JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.

A maximum of 256 steps are allowed in each job flow.

For long running job flows, we recommend that you periodically store your results.
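
To make those states concrete (this is not from the quoted documentation, just a minimal sketch that assumes the DescribeJobFlows operation exposed by the same AmazonElasticMapReduce client), you could poll the job flow until it leaves its active states:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;

import java.util.Arrays;
import java.util.List;

// Sketch: poll the job flow state until it is no longer starting, bootstrapping,
// running or shutting down. Returns the final state, e.g. WAITING, COMPLETED,
// FAILED or TERMINATED.
public static String waitForJobFlow(AmazonElasticMapReduce service, String jobFlowId)
        throws InterruptedException {
    List<String> activeStates = Arrays.asList("STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN");
    while (true) {
        JobFlowDetail detail = service
                .describeJobFlows(new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId))
                .getJobFlows().get(0);
        String state = detail.getExecutionStatusDetail().getState();
        if (!activeStates.contains(state)) {
            return state;
        }
        Thread.sleep(30 * 1000); // wait 30 seconds between polls
    }
}

Because the question sets KeepJobFlowAliveWhenNoSteps to true, a state of WAITING here means all steps have finished and the cluster is idling; with it set to false the flow would instead end in COMPLETED (or FAILED/TERMINATED).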

So it looks like there is no blocking call or notification for when the job flow is done (short of polling its state, as sketched above). Instead you need to save your data to Amazon S3 as part of the job.
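
Since the results land in S3 anyway, a pragmatic way to know when you can pick them up is to check the bucket itself. The following is only a sketch using the S3 client from the same AWS SDK; outputPrefix is a hypothetical parameter standing in for wherever the Pig script STOREs its output:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

// Sketch: returns true once at least one object exists under the expected output prefix,
// i.e. once the job's final step has written something to pick up.
public static boolean resultsAvailable(String accessKey, String secretKey,
        String bucketName, String outputPrefix) {
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
    AmazonS3 s3 = new AmazonS3Client(credentials);
    return !s3.listObjects(bucketName, outputPrefix).getObjectSummaries().isEmpty();
}

You would call this periodically (or after the waitForJobFlow sketch above reports WAITING or COMPLETED) before starting the further processing mentioned in the question.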
