Map -> Reduce -> Reduce (two reducers to be called sequentially) - how to configure driver program

Aadith Ramia :

I need to write a Map reduce program that calls two reducers in succession. ie, output of first reducer will be the input to the second reducer. How do I achieve this?

What I have found so far suggests that I will need to configure two map reduce jobs in my driver code(code below).
This looks wasteful, for two reasons -

  1. I dont really need a maper in second job
  2. having two jobs looks like an overkill.

Is there a better way to achieve this?

Also, a question on the below approach : Job1's output would be multiple files in the OUTPUT_PATH directory. This directory is passed in as Job2's input, is this okay? Does it not have to be a file? Will Job2 process all files under the given directory?

Configuration conf = getConf();
  FileSystem fs = FileSystem.get(conf);
  Job job = new Job(conf, "Job1");
  job.setJarByClass(ChainJobs.class);

  job.setMapperClass(MyMapper1.class);
  job.setReducerClass(MyReducer1.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  TextInputFormat.addInputPath(job, new Path(args[0]));
  TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

  job.waitForCompletion(true); /*this goes to next command after this job is completed. your second job is dependent on your first job.*/


  /*
   * Job 2
   */
  Configuration conf2 = getConf();
  Job job2 = new Job(conf2, "Job 2");
  job2.setJarByClass(ChainJobs.class);

  job2.setMapperClass(MyMapper2.class);
  job2.setReducerClass(MyReducer2.class);

  job2.setOutputKeyClass(Text.class);
  job2.setOutputValueClass(Text.class);

  job2.setInputFormatClass(TextInputFormat.class);
  job2.setOutputFormatClass(TextOutputFormat.class);

  TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
  TextOutputFormat.setOutputPath(job2, new Path(args[1]));

  return job2.waitForCompletion(true) ? 0 : 1;
cricket_007 :

dont really need a maper in second job

The framework does, though

having two jobs looks like an overkill... Is there a better way to achieve this?

Then don't use MapReduce... Spark, for example would likely be faster and have less code

Will Job2 process all files under the given directory?

Yes

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=195458&siteId=1