How Hadoop MapReduce Works

A MapReduce job runs in six stages:

Input: read the input files
Splitting: divide the input into splits
Mapping
Shuffling
Reducing
Final result
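
The six stages can be imitated in plain Java without Hadoop. The following is a minimal sketch, not how Hadoop is implemented: the class name PipelineSketch and its methods are hypothetical, the splits are supplied by hand, and everything runs in one process.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {

    // Mapping: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffling: group all values by key. A TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducing: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Input and Splitting are assumed to have happened already: each inner list is one split.
    static Map<String, Integer> run(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (List<String> split : splits) {
            for (String line : split) {
                mapped.addAll(map(line));
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result; // Final result
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("good morning", "good afternoon"),
                Arrays.asList("good evening"));
        System.out.println(run(splits)); // prints {afternoon=1, evening=1, good=3, morning=1}
    }
}
```

In a real job each split is handled by its own map task, and the shuffle moves data between machines; the sketch only mirrors the order of the stages.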

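The mapper's input is a stream of key-value (KV) pairs. The map() method is called once for each input pair, and its output is also a set of KV pairs.

The mapper reads its input from the context and writes its results back to the context (context.write(text, iw);). The input types (here LongWritable, Text) and output types (here Text, IntWritable) are chosen by the user: the input types follow from the configured InputFormat, and the output types are declared on the job.

The context obtains the input data through a RecordReader and stores the mapper's output through a RecordWriter.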
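
The relationship between reader, map() and writer can be sketched in plain Java. This is a simplified illustration under stated assumptions: Reader and Writer are hypothetical stand-ins for RecordReader and RecordWriter, not the Hadoop classes, and the records are passed in already parsed.

```java
import java.util.ArrayList;
import java.util.List;

public class MapperLoopSketch {

    // Hypothetical stand-in for a RecordReader over records that are already parsed.
    static class Reader {
        private final long[] keys;
        private final String[] values;
        private int index = -1;

        Reader(long[] keys, String[] values) {
            this.keys = keys;
            this.values = values;
        }

        // Advances to the next record; returns false when the split is exhausted.
        boolean nextKeyValue() {
            index++;
            return index < keys.length;
        }

        long getCurrentKey() { return keys[index]; }

        String getCurrentValue() { return values[index]; }
    }

    // Hypothetical stand-in for a RecordWriter: collects the mapper's output pairs.
    static class Writer {
        final List<String> records = new ArrayList<>();

        void write(String key, int value) { records.add(key + "\t" + value); }
    }

    // User logic, called once per input pair: emit (word, 1) for every word.
    static void map(long offset, String line, Writer out) {
        for (String word : line.split(" ")) {
            out.write(word, 1);
        }
    }

    // Framework side: pull records from the reader and hand each one to map().
    static List<String> run(long[] keys, String[] values) {
        Reader reader = new Reader(keys, values);
        Writer writer = new Writer();
        while (reader.nextKeyValue()) {
            map(reader.getCurrentKey(), reader.getCurrentValue(), writer);
        }
        return writer.records;
    }

    public static void main(String[] args) {
        long[] keys = {0L, 13L};
        String[] values = {"good morning", "lisi female"};
        for (String record : run(keys, values)) {
            System.out.println(record);
        }
    }
}
```

The loop in run() is the part the framework owns; user code only supplies map().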

InputFormat is responsible for handling the input of a MapReduce job.

InputFormat is an abstract class with the following subclasses:

ComposableInputFormat
CompositeInputFormat
DBInputFormat
DelegatingInputFormat
FileInputFormat

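InputFormat has a constructor and two abstract methods:

InputFormat(): the constructor.
createRecordReader(): supplies the RecordReader implementation that reads the records of a split and passes them to the Mapper.
getSplits(): divides the input files into splits.

FileInputFormat, a subclass of InputFormat, is itself an abstract class with the following subclasses: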

CombineFileInputFormat
FixedLengthInputFormat
KeyValueTextInputFormat
NLineInputFormat
SequenceFileInputFormat
TextInputFormat

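Of these, TextInputFormat is the default InputFormat in MapReduce. It reads the input line by line, one record per line.
Key (LongWritable): the byte offset of the start of the line within the whole file.
Value (Text): the content of the line.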
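
To make the key concrete, here is a small plain-Java sketch that computes the (offset, line) pairs a line-oriented reader would produce. It is an illustration only: LineOffsetSketch is a hypothetical name, and the sketch assumes UTF-8 text with "\n" line endings (with "\r\n" endings the offsets would differ).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsetSketch {

    // Returns one "offset<TAB>line" string per line, with the offset counted in bytes.
    static List<String> records(String fileContent) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.add(offset + "\t" + line);
            // Advance past the line's bytes plus the one-byte "\n" terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String record : records("good morning\ngood afternoon\ngood evening\n")) {
            System.out.println(record);
        }
        // prints:
        // 0	good morning
        // 13	good afternoon
        // 28	good evening
    }
}
```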
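TextInputFormat inherits its splitting logic from the getSplits() method of its parent class, FileInputFormat.
Each file is split separately. By default the split size equals the file system block size, which is 128 MB on HDFS in Hadoop 2.x and later.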
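
The split arithmetic can be sketched as follows. This is a simplified model of what FileInputFormat does for a single splittable, non-empty file: the split size is max(minSize, min(maxSize, blockSize)), and a final remainder of up to 1.1 times the split size is kept as one split instead of being cut again. The class SplitSizeSketch and the method splitLengths are hypothetical names; block locations and unsplittable (for example compressed) files are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {

    // A remainder up to 10% larger than the split size is not split further.
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the length in bytes of each split for one file of the given length.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed settings: 128 MB blocks, minimum size 1, no maximum.
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(splitLengths(300 * mb, splitSize).size()); // prints 3 (128 + 128 + 44 MB)
        System.out.println(splitLengths(130 * mb, splitSize).size()); // prints 1 (130 MB is within the 10% slop)
    }
}
```

The second case shows why a file slightly larger than one block still produces a single split.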

NLineInputFormat

Splitting: every N lines of a file form one split. By default N is 1, so each line is its own split.
Key type: LongWritable
Value type: Text
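
The grouping rule is easy to check by hand. The following plain-Java sketch (NLineSplitSketch is a hypothetical name, and it works on an in-memory list rather than on files) groups lines N at a time and lets the last split take whatever remains. For 12 lines and N = 3 it yields 4 splits.

```java
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {

    // Groups the lines into splits of at most n lines each.
    static List<List<String>> split(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            lines.add("line " + i);
        }
        System.out.println(split(lines, 3).size()); // prints 4
        System.out.println(split(lines, 5).size()); // prints 3 (5 + 5 + 2 lines)
    }
}
```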

Example: 12 lines of input with 3 lines per split yield 4 splits:

Modify the Hadoop_WordCount word-count project

Modify MyWordCount.java

package com.blu.mywordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			conf.set("mapreduce.map.output.compress", "true");
			conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
			
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			// Set the number of lines per split
			NLineInputFormat.setNumLinesPerSplit(job, 3);
			// Set the InputFormat class
			job.setInputFormatClass(NLineInputFormat.class);
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
			
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

Write the following 12 lines to testdata.txt in D:\data:

good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male

Run the main method of MyWordCount with the following arguments:

D:\data\testdata.txt D:\data\output

Result:

afternoon	2
evening	2
female	2
good	6
lisi	2
male	4
morning	2
wangwu	2
zhangsan	2

The console log reports 4 splits:

[INFO ] 2020-04-26 17:12:41,643 method:org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:204)
number of splits:4

The key changes:

// Set the number of lines per split
NLineInputFormat.setNumLinesPerSplit(job, 3);
// Set the InputFormat class
job.setInputFormatClass(NLineInputFormat.class);

KeyValueTextInputFormat

Key type: Text, the part of the line before the separator
Value type: Text, the part of the line after the separator
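
The separator rule can be sketched in plain Java: the line is cut at the first occurrence of the separator, and a line without any separator becomes the key with an empty value. KeyValueLineSketch and parse are hypothetical names, and the sketch works on Java strings, whereas the Hadoop reader works on raw bytes.

```java
public class KeyValueLineSketch {

    // Splits a line at the FIRST separator; returns {key, value}.
    static String[] parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("zhangsan\t500 450 jan", '\t');
        System.out.println("key:" + kv[0]);   // prints key:zhangsan
        System.out.println("value:" + kv[1]); // prints value:500 450 jan
    }
}
```

Only the first separator matters, so any further Tab characters stay in the value.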

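Example: use KeyValueTextInputFormat to count how many times each name appears in the following text file.

D:\data\money.txt (note that on each line the name is separated from the rest of the data by a Tab)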

zhangsan	500 450 jan
lisi	200 150 jan
lilei	150 160 jan
zhangsan	500 500 feb
lisi	200 150 feb
lilei	150 160 feb

Create the Kvmapper class

package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Output format:
 * zhangsan 1
 * lisi 1
 * zhangsan 1
 * 
 * @author BLU
 *
 */
public class Kvmapper extends Mapper<Text, Text, Text, IntWritable>{
	
	/**
	 * Input format:
	 * zhangsan 500 450 jan
	 * key:zhangsan
	 * value:500 450 jan
	 */

	private IntWritable iw = new IntWritable(1);
	
	@Override
	protected void map(Text key, Text value, Mapper<Text, Text, Text, IntWritable>.Context context)
			throws IOException, InterruptedException {
		
		context.write(key, iw);
	}
}

The KvReducer class

package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KvReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	
	// Reused output value holding the total count for the current key
	private final IntWritable total = new IntWritable();
	
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		
		int sum = 0;
		// The loop variable no longer shadows the output field
		for (IntWritable count : values) {
			sum += count.get();
		}
		total.set(sum);
		context.write(key, total);
	}

}

The KeyValueDemo driver class

package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class KeyValueDemo {

	public static void main(String[] args) throws Exception {
		
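		Configuration conf = new Configuration();
		// Use Tab as the key/value separator. Set it before creating the Job:
		// Job.getInstance(conf) copies the configuration, so later changes to conf are ignored.
		conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
		Job job = Job.getInstance(conf);
		job.setInputFormatClass(KeyValueTextInputFormat.class);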
		job.setJarByClass(KeyValueDemo.class);
		job.setMapperClass(com.blu.kvdemo.Kvmapper.class);
		job.setReducerClass(com.blu.kvdemo.KvReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean flag = job.waitForCompletion(true);
		System.exit(flag?0:1);
	}
}

Run the main method of KeyValueDemo with the following arguments:

D:\data\money.txt D:\data\output

Result:

lilei	2
lisi	2
zhangsan	2

BLUcoding
