Hadoop (XI): Combined MapReduce Tasks Overview

Combined task overview

  • Some complex tasks are difficult to complete with a single MR job, so they generally need to be split into multiple simpler MR sub-tasks.

  • For this type of problem, the MapReduce framework provides several ways to control the execution flow of tasks, including the following:

    • Sequential combination of MapReduce tasks

      • The previous job finishes executing before the next one runs

    • Dependency-based combination of MapReduce tasks

      • Several upstream jobs execute first, then the downstream job runs

    • Chained MapReduce tasks

      • Pre-processing Maps are added before, or post-processing Maps after, the Reduce

    • Sequential combination of MapReduce tasks can also be varied into iterative MapReduce tasks.

Sequential combination of MapReduce tasks

  • Multiple MR tasks are executed one after another: the output of the previous MR job is the input of the next, and the execution order is completed automatically.

  • To combine MR jobs sequentially:

    • Each sub-task needs its own separate configuration code (which is in fact the same as for an ordinary MR job)

    • The task execution order is the order in which the jobs are run

    • After all tasks complete, the output directories holding intermediate results can be deleted.

  • Format: MapReduce1 -> MapReduce2 -> MapReduce3 ...

  • Each sub-task must call job.waitForCompletion(true) to wait for the job to finish executing, that is, the previous job must complete before the next one can begin.

    Advantages: simple structure, easy to implement.

    Disadvantages: because a later MR task must wait for the previous MR task to complete before it can run, cluster utilization is not high; multiple dependency relationships cannot be expressed (or are too troublesome to express).

package com.rzp.linemr;

import org.apache.hadoop.mapreduce.Job;
import java.io.IOException;

// Test case for combining MR tasks: the wordcount output, which is sorted by keyword in
// dictionary order, is modified to be sorted by number of occurrences.
public class Demo1 {
    public static void main(String[] args) throws InterruptedException, IOException, ClassNotFoundException {
        // Create job1, job2 ... in order
        Job job1 = createJobByTimes(1);
        Job job2 = createJobByTimes(2);
        // Start execution; the execution order must be the order in which the jobs are combined
        runJob(job1, 1);
        runJob(job2, 2);
        System.out.println("Jobs executed successfully");
    }

    // Execute a Job and throw an exception if it fails
    public static void runJob(Job job, int times) throws InterruptedException, IOException, ClassNotFoundException {
        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("The " + times + "th job execution failed");
        }
    }

    /**
     * Create a job according to the given parameter times
     * times = the position of this job among the combined MR tasks
     */
    public static Job createJobByTimes(int times) {
        // Job creation is the same as for an ordinary MR job: set everything from InputFormat to OutputFormat
        // TODO create Job
        return null;
    }
}
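The createJobByTimes stub above is left as a TODO. A minimal sketch of one possible implementation is shown below; it assumes a word-count-style job, so WordCountMapper, WordCountReducer and the /tmp/combined/... paths are hypothetical placeholders (not part of the original post), and the usual imports (Configuration, Path, Text, IntWritable, FileInputFormat, FileOutputFormat) are assumed in addition to those already in Demo1.

    // Sketch only: WordCountMapper, WordCountReducer and the paths below are hypothetical placeholders.
    public static Job createJobByTimes(int times) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combined-job-" + times);
        job.setJarByClass(Demo1.class);
        // Ordinary MR setup: mapper, reducer and output types
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The output directory of job N becomes the input directory of job N + 1
        FileInputFormat.addInputPath(job, new Path("/tmp/combined/step" + (times - 1)));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/combined/step" + times));
        return job;
    }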

 


Dependency-based combination of MapReduce tasks

  • The Hadoop framework provides a mechanism for executing combinations of MapReduce jobs with complex data dependencies.

  • For example, MR3 may depend on the results of MR1 and MR2, while MR1 and MR2 do not depend on each other and can run at the same time; this cannot be expressed with sequential combination.

  • Hadoop supports this kind of job composition through the Job and JobControl classes.

    • Job maintains not only the configuration information but also the dependencies between sub-tasks

    • The JobControl class is mainly used to control the execution of the whole job flow. JobControl implements Runnable, so it can be run in a thread (started with Thread.start) to drive the execution flow.

  • Fully qualified class name of Job: org.apache.hadoop.mapred.jobcontrol.Job

  • Fully qualified class name of JobControl: org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl

  • Advantages: relatively simple to implement, and cluster utilization is improved.

  • Disadvantages: you need to manage the job execution flow yourself (handling job failures and other conditions in the execution flow).

package com.rzp.linemr;

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.Job;
import java.io.IOException;

public class Demo2 {
    public static void main(String[] args) throws IOException {
        // createControlledJob() converts a mapreduce.Job (ordinary job) into a controllable ControlledJob object
        ControlledJob job1 = createControlledJob(createJobByTimes(1));
        ControlledJob job2 = createControlledJob(createJobByTimes(2));
        ControlledJob job3 = createControlledJob(createJobByTimes(3));
        // Specify the dependencies: job3 depends on job1 and job2
        // addDependingJob returns a boolean, which can be used for verification
        job3.addDependingJob(job1);
        job3.addDependingJob(job2);
        // Create the job execution flow
        JobControl jc = new JobControl("combined dependency test");
        // Add the jobs; the order does not matter
        jc.addJob(job1);
        jc.addJob(job2);
        jc.addJob(job3);
        // Total number of jobs
        int totalSize = jc.getReadyJobsList().size();
        // Start the job flow.
        // Because JobControl implements Runnable, its run method could be called directly:
        // jc.run();
        // but running it in a separate thread is recommended.
        boolean succeeded = false; // flag indicating whether the job flow executed successfully
        try {
            new Thread(jc).start();
            while (!jc.allFinished()) {
                // Not finished yet, keep waiting
                try {
                    Thread.sleep(30000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        } finally {
            // Stop execution
            jc.stop();
            if (jc.allFinished() && jc.getSuccessfulJobList().size() == totalSize) {
                // All jobs finished and the number of successful jobs equals the total, so the job flow succeeded
                succeeded = true;
            }
        }
        System.out.println("Job execution " + (succeeded ? "succeeded" : "failed"));
    }

    // Convert a mapreduce.Job (ordinary job) into a controllable ControlledJob object
    public static ControlledJob createControlledJob(Job job) throws IOException {
        ControlledJob cj = new ControlledJob(job.getConfiguration());
        cj.setJob(job); // set the wrapped job
        return cj;
    }

    // Job creation is the same as for an ordinary MR job: set everything from InputFormat to OutputFormat
    public static Job createJobByTimes(int times) {
        // TODO create Job
        return null;
    }
}
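JobControl leaves failure handling to the caller, which is the disadvantage noted above. As a small sketch, the finally block in Demo2 could additionally report which jobs failed by inspecting JobControl.getFailedJobList(); the jc variable is the one already defined above.

            // Sketch: report failed jobs so the caller can decide how to react (retry, clean up, alert)
            for (ControlledJob failed : jc.getFailedJobList()) {
                System.err.println("Job failed: " + failed.getJobName());
            }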

 

Chain MapReduce

  • The first two methods start and shut down multiple jobs, which consumes resources, and the IO between the Map and Reduce stages makes them inefficient, so a chained MR can be used instead.

  • An MR task may need some pre-processing and post-processing. For example, building an inverted index over documents may require pre-processing to remove "stop words" and post-processing to merge synonyms.

  • Chained MR uses a chained Mapper (ChainMapper) and a chained Reducer (ChainReducer) to accomplish this. The execution flow of such a job is: map1 -> map2 -> ... -> reducer -> map3 -> map4 -> ...

    • A chained MR job can contain only one reduce operation, but may contain multiple map operations.

  • Advantages: for MR tasks that need pre-processing and post-processing, it reduces the number of jobs and improves efficiency.

  • Disadvantages: additional parameter information must be specified (with the first two methods the job itself is written the same way as an ordinary MR job and only the driver program changes, but here it is different).

  • After creating the job, the map-reduce execution sequence is set explicitly through the chaining classes that Hadoop provides, as in the skeleton and the sketch below.

    • Use ChainMapper.addMapper to add mapper classes for the Map stage; they are executed in the order they are added.

    • In the Reduce stage, ChainReducer.setReducer must be used to set the reducer class before ChainReducer.addMapper can be used to add post-processing mapper classes; these are also executed in the order they are added.

package com.rzp.linemr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;

public class Demo3 {
    public static void main(String[] args) throws Exception {
        Configuration conf1 = new Configuration();
        Job job1 = Job.getInstance(conf1, "job1");
        job1.setJarByClass(Demo3.class);
        FileInputFormat.addInputPath(job1, new Path(""));
        /**
         * Setting a mapper:
         * klass: the mapper class
         * K1, V1, K2, V2: the input and output types of klass
         * mapperConf: configuration used to pass information to the mapper
         */
        // Add the first Mapper (signature shown, not an actual call)
        ChainMapper.addMapper(JobConf job,
                Class<? extends Mapper<K1, V1, K2, V2>> klass,
                Class<? extends K1> inputKeyClass,
                Class<? extends V1> inputValueClass,
                Class<? extends K2> outputKeyClass,
                Class<? extends V2> outputValueClass,
                boolean byValue, JobConf mapperConf);

        // Add the second Mapper
        ChainMapper.addMapper(JobConf job,
                Class<? extends Mapper<K1, V1, K2, V2>> klass,
                Class<? extends K1> inputKeyClass,
                Class<? extends V1> inputValueClass,
                Class<? extends K2> outputKeyClass,
                Class<? extends V2> outputValueClass,
                boolean byValue, JobConf mapperConf);

        // Set the reducer; the parameters follow the same pattern as above
        ChainReducer.setReducer(...);
        // Add mappers that run after the reducer; the format is also the same as above
        ChainReducer.addMapper(...);
        // Set the overall input and output paths
        job1.setJarByClass(Demo3.class);
        // TODO add the key and value types of the map output and reduce output
    }
}
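The block above only sketches the method signatures. Below is a more concrete, self-contained sketch of a chained word-count-style job using the newer org.apache.hadoop.mapreduce.lib.chain API (which takes a Job and a Configuration instead of JobConf and has no byValue parameter). The ChainDemo class and its mappers and reducer are hypothetical names chosen for illustration, not part of the original post.

package com.rzp.linemr;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical chained job (sketch): tokenize -> filter stop word -> sum -> uppercase
public class ChainDemo {

    // map1: split each line into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty()) ctx.write(new Text(w), ONE);
            }
        }
    }

    // map2: pre-processing before the reduce, drop a (hard-coded) stop word
    public static class StopWordFilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
        protected void map(Text key, IntWritable value, Context ctx) throws IOException, InterruptedException {
            if (!"the".equals(key.toString())) ctx.write(key, value);
        }
    }

    // reducer: sum the counts per word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    // map3: post-processing after the reduce, normalize the word to upper case
    public static class UpperCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
        protected void map(Text key, IntWritable value, Context ctx) throws IOException, InterruptedException {
            ctx.write(new Text(key.toString().toUpperCase()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chain demo");
        job.setJarByClass(ChainDemo.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map stage: mappers run in the order they are added
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class, new Configuration(false));
        ChainMapper.addMapper(job, StopWordFilterMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        // Reduce stage: exactly one reducer, set before any post-processing mappers
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.addMapper(job, UpperCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}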

 

 

Origin www.cnblogs.com/renzhongpei/p/12635278.html