MapReduce - Job Submission Source Code Analysis

Author: Yin Zhengjie

Copyright notice: This is an original work. Unauthorized reproduction is prohibited and will be pursued as a legal matter.

一.Environment preparation

1>.A comfortable IDE; pick whichever one you prefer

  I recommend the following two IDEs. You can look up their official sites yourself, or check the notes from my earlier evaluation:

    Eclipse: https://www.cnblogs.com/yinzhengjie/p/8733302.html

    IDEA: https://www.cnblogs.com/yinzhengjie/p/9080387.html (this is the one I recommend; it works well, and many developers at my company use it too)

2>.Write the WordCount code

/*
@author :yinzhengjie
Blog:http://www.cnblogs.com/yinzhengjie/tag/Hadoop%E7%94%9F%E6%80%81%E5%9C%88/
EMAIL:[email protected]
*/
package mapreduce.yinzhengjie.org.cn;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1. Get one line of input
        String line = value.toString();

        // 2. Split the line into words on spaces
        String[] words = line.split(" ");

        // 3. Emit each word with a count of 1
        for (String word : words) {

            k.set(word);
            context.write(k, v);
        }
    }
}
WordcountMapper.java file contents
/*
@author :yinzhengjie
Blog:http://www.cnblogs.com/yinzhengjie/tag/Hadoop%E7%94%9F%E6%80%81%E5%9C%88/
EMAIL:[email protected]
*/
package mapreduce.yinzhengjie.org.cn;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value,
                          Context context) throws IOException, InterruptedException {

        // 1. Sum the counts for this word
        int sum = 0;
        for (IntWritable count : value) {
            sum += count.get();
        }

        // 2. Emit the word and its total count
        context.write(key, new IntWritable(sum));
    }
}
WordcountReducer.java file contents
/*
@author :yinzhengjie
Blog:http://www.cnblogs.com/yinzhengjie/tag/Hadoop%E7%94%9F%E6%80%81%E5%9C%88/
EMAIL:[email protected]
*/
package mapreduce.yinzhengjie.org.cn;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        //Point Hadoop at a local installation. Without this you may see: "ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path". Also note that winutils.exe must exist under the bin directory of your HADOOP_HOME.
        System.setProperty("hadoop.home.dir", "D:/yinzhengjie/softwares/hadoop-2.7.3");

        //Get the configuration information and create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //Set the jar to load by the driver class
        job.setJarByClass(WordcountDriver.class);

        //Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        //Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        //Set the reduce (final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //Submit the job and wait for it to complete
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}
WordcountDriver.java file contents
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
1.txt test data
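
  Because the mapper splits on a single space, punctuation stays attached to the word it follows (for example "apps." and "fast," are counted as-is), and words such as "is", "and" and "in" each occur twice in this sentence. Assuming the default TextOutputFormat (key and count separated by a tab), the first few lines of the resulting part-r-00000 would look roughly like this:

It	1
Kafka	1
and	2
apps.	1
...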

3>.Configure the run parameters
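
  The settings that matter are the two program arguments that WordcountDriver reads as args[0] (input path) and args[1] (output path). A minimal sketch of the IDEA run configuration, using hypothetical local paths (adjust them to your own directories; the output directory must not already exist, otherwise FileOutputFormat will refuse to run the job):

    Main class:        mapreduce.yinzhengjie.org.cn.WordcountDriver
    Program arguments: D:/yinzhengjie/input D:/yinzhengjie/output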

4>.Set a breakpoint and click Debug to start debugging

二.Code debugging walkthrough

1>.Step into waitForCompletion()
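
  Stepping in from job.waitForCompletion(true) in the driver lands in org.apache.hadoop.mapreduce.Job. The sketch below is a simplified paraphrase of the Hadoop 2.7.x logic, not the verbatim source; the non-verbose waiting loop is omitted:

// Simplified paraphrase of Job.waitForCompletion(boolean verbose)
public boolean waitForCompletion(boolean verbose)
        throws IOException, InterruptedException, ClassNotFoundException {
    if (state == JobState.DEFINE) {
        submit();                 // the job has not been submitted yet, so submit it now
    }
    if (verbose) {
        monitorAndPrintJob();     // poll the running job and print progress and counters
    }
    return isSuccessful();        // true if the job finished in the SUCCEEDED state
}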

2>.Step into the submit() method
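
  submit() first establishes the connection to the cluster and then hands the real work to a JobSubmitter. Again a simplified paraphrase of the Hadoop 2.7.x source; the UserGroupInformation.doAs() wrapper around submitJobInternal() is left out here:

// Simplified paraphrase of Job.submit()
public void submit() throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);   // the job must still be in the DEFINE state
    setUseNewAPI();                 // map deprecated mapred.* settings onto the new API
    connect();                      // 1. establish the connection (creates the Cluster proxy)

    final JobSubmitter submitter =
            getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    // 2. submit the job
    status = submitter.submitJobInternal(Job.this, cluster);
    state = JobState.RUNNING;
}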

3>.Step into the connect() method

   For a comparison of the old and new API property names, see the official documentation: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
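
   connect() lazily creates the Cluster object that serves as the proxy for submitting the job, and the Cluster constructor calls initialize(jobTrackAddr, conf), which walks the registered ClientProtocolProviders to decide whether the job runs on the LocalJobRunner or on a YARN cluster. A simplified paraphrase (UGI wrapping and error handling omitted):

// Simplified paraphrase of Job.connect() and the Cluster constructor
private synchronized void connect() throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {
        // 1) Create the proxy used to submit the job
        cluster = new Cluster(getConfiguration());
    }
}

public Cluster(InetSocketAddress jobTrackAddr, Configuration conf) throws IOException {
    this.conf = conf;
    this.ugi = UserGroupInformation.getCurrentUser();
    // (1) Decide whether the client talks to the LocalJobRunner or to a remote YARN cluster
    initialize(jobTrackAddr, conf);
}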

三.Summary

1>.Overview of the job submission source code flow

waitForCompletion()
    submit();
        // 1. Establish the connection
        connect();
            // 1) Create the proxy used to submit the job
            new Cluster(getConfiguration());
                // (1) Decide whether this is the local runner or a remote YARN cluster
                initialize(jobTrackAddr, conf);
        // 2. Submit the job
        submitter.submitJobInternal(Job.this, cluster)
            // 1) Create the staging (Stag) path used for submitting data to the cluster
            Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
            // 2) Get the jobid and create the job path
            JobID jobId = submitClient.getNewJobID();
            // 3) Copy the jar package to the cluster
            copyAndConfigureFiles(job, submitJobDir);
                rUploader.uploadFiles(job, jobSubmitDir);
            // 4) Compute the input splits and generate the split plan files
            writeSplits(job, submitJobDir);
                maps = writeNewSplits(job, jobSubmitDir);
                    input.getSplits(job);
            // 5) Write the job's xml configuration file to the staging path
            writeConf(conf, submitJobFile);
                conf.writeXml(out);
            // 6) Submit the job and return its submission status
            status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
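
For step 4), writeNewSplits() ends up in FileInputFormat.getSplits(job). A simplified sketch of how the split size is chosen in Hadoop 2.x (the real method also handles unsplittable files and applies a 10% slop factor when cutting the last split):

// Simplified sketch of the split-size computation inside FileInputFormat.getSplits()
long minSize = Math.max(1, FileInputFormat.getMinSplitSize(job));  // mapreduce.input.fileinputformat.split.minsize, default 1
long maxSize = FileInputFormat.getMaxSplitSize(job);               // mapreduce.input.fileinputformat.split.maxsize, default Long.MAX_VALUE
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));  // by default this equals the HDFS block size, i.e. one split per block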

2>.A flow chart found online; it lays out the process quite clearly, so I saved it here to help my own understanding later

Reposted from www.cnblogs.com/yinzhengjie/p/10005256.html