How Hadoop MapReduce Works

A MapReduce job goes through six stages:

  • Input: reading the input files
  • Splitting: dividing the input into splits
  • Mapping
  • Shuffling
  • Reducing
  • Final result

A mapper's input arrives as key-value (KV) pairs; map() is called once for every KV pair, and the output is also emitted as KV pairs.

The mapper obtains its input from the context and writes its results back into the context (context.write(text, iw);). The input types (LongWritable, Text) and output types (Text, IntWritable) are set by the user.

The context obtains the input data through a RecordReader and saves the mapper's output through a RecordWriter, as sketched below.
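A minimal sketch of this contract (the class and field names here are illustrative, not from a real project):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private Text text = new Text();
	private IntWritable iw = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// key and value were read from the current split by a RecordReader:
		// key = byte offset of the line, value = the line itself
		text.set(value.toString());
		// the output KV pair is handed back through the context,
		// which persists it via a RecordWriter
		context.write(text, iw);
	}
}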


InputFormat is responsible for handling the input of a MapReduce job.

InputFormat is an abstract class with the following subclasses:

  • ComposableInputFormat
  • CompositeInputFormat
  • DBInputFormat
  • DelegatingInputFormat
  • FileInputFormat

InputFormat has three methods (signatures sketched after this list):

  • InputFormat(): the constructor
  • createRecordReader(): supplies the RecordReader implementation that reads a split's records into the Mapper for processing
  • getSplits(): splits the input files
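Stripped of annotations, the two abstract methods declared by org.apache.hadoop.mapreduce.InputFormat have roughly this shape:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {

	// compute the logical splits of the input; one map task runs per split
	public abstract List<InputSplit> getSplits(JobContext context)
			throws IOException, InterruptedException;

	// create the RecordReader that turns a split into KV pairs for the Mapper
	public abstract RecordReader<K, V> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException, InterruptedException;
}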

FileInputFormat, a subclass of InputFormat, is itself abstract and has the following subclasses:

  • CombineFileInputFormat
  • FixedLengthInputFormat
  • KeyValueTextInputFormat
  • NLineInputFormat
  • SequenceFileInputFormat
  • TextInputFormat

Among these, TextInputFormat is MapReduce's default InputFormat. It reads one record per line:
Key (LongWritable): the starting byte offset of the line within the file
Value (Text): the content of the line
For splitting, TextInputFormat uses the getSplits() method of its parent class FileInputFormat:
each file is split separately, with a default split size of 128 MB.
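For example, given a file with the two lines below, TextInputFormat emits one record per line, keyed by the line's starting byte offset (the offsets assume ASCII text with Unix line endings, so each line contributes its length plus one newline byte):

hello world
hi hadoop

(0, "hello world")
(12, "hi hadoop")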

NLineInputFormat

Split strategy: every N lines of a file form one split; by default, one line per split (N = 1).
Key type: LongWritable
Value type: Text

Example: with 12 lines of input and 3 lines per split, the input is divided into 4 splits.

Modify the Hadoop_WordCount (word count) project:

  1. Modify MyWordCount.java:
package com.blu.mywordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {
	
	public static void main(String[] args) {
		
		try {
			Configuration conf = new Configuration();
			conf.set("mapreduce.map.output.compress", "true");
			conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
			
			Job job = Job.getInstance(conf);
			job.setJarByClass(MyWordCount.class);
			job.setMapperClass(MyWordCountMapper.class);
			job.setReducerClass(MyWordCountReducer.class);
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			FileInputFormat.addInputPath(job, new Path(args[0]));
			FileOutputFormat.setOutputPath(job, new Path(args[1]));
			// set the number of lines per split
			NLineInputFormat.setNumLinesPerSplit(job, 3);
			// specify the InputFormat implementation to use
			job.setInputFormatClass(NLineInputFormat.class);
			boolean flag = job.waitForCompletion(true);
			System.exit(flag ? 0 : 1);
			
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}
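The driver above references MyWordCountMapper and MyWordCountReducer from the earlier Hadoop_WordCount project, which this post does not repeat. Assuming the standard word-count logic, they would look roughly like this:

// MyWordCountMapper.java
package com.blu.mywordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private Text word = new Text();
	private IntWritable one = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// split each line on spaces and emit (word, 1) for every word
		for (String s : value.toString().split(" ")) {
			word.set(s);
			context.write(word, one);
		}
	}
}

// MyWordCountReducer.java
package com.blu.mywordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	private IntWritable result = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// sum the 1s emitted by the mapper for this word
		int sum = 0;
		for (IntWritable v : values) {
			sum += v.get();
		}
		result.set(sum);
		context.write(key, result);
	}
}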
  2. Write 12 lines of data into the file testdata.txt under D:\data:
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
good morning
good afternoon
good evening
zhangsan male
lisi female
wangwu male
  3. Run the main method of MyWordCount with the following arguments:
D:\data\testdata.txt D:\data\output
  4. The output:
afternoon	2
evening	2
female	2
good	6
lisi	2
male	4
morning	2
wangwu	2
zhangsan	2
  5. The console log shows that the number of splits is 4:
[INFO ] 2020-04-26 17:12:41,643 method:org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:204)
number of splits:4
  6. The key code changes:
// set the number of lines per split
NLineInputFormat.setNumLinesPerSplit(job, 3);
// specify the InputFormat implementation to use
job.setInputFormatClass(NLineInputFormat.class);

KeyValueTextInputFormat
Key type: Text (the data before the separator is used as the key)
Value type: Text (the data after the separator is used as the value)

Example: use KeyValueTextInputFormat to count how many times each name appears in the following txt file.

D:\data\money.txt (note: in this file, the separator between each line's name and the data after it is a Tab)

zhangsan	500 450 jan
lisi	200 150 jan
lilei	150 160 jan
zhangsan	500 500 feb
lisi	200 150 feb
lilei	150 160 feb
  1. Create the Kvmapper class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Output format:
 * zhangsan 1
 * lisi 1
 * zhangsan 1
 * 
 * @author BLU
 *
 */
public class Kvmapper extends Mapper<Text, Text, Text, IntWritable>{
	
	/**
	 * Input format:
	 * zhangsan 500 450 jan
	 * key: zhangsan
	 * value: 500 450 jan
	 */

	private IntWritable iw = new IntWritable(1);
	
	@Override
	protected void map(Text key, Text value, Mapper<Text, Text, Text, IntWritable>.Context context)
			throws IOException, InterruptedException {
		
		context.write(key, iw);
	}
}
  2. The KvReducer class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KvReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
	
	IntWritable iw = new IntWritable();
	
	@Override
	protected void reduce(Text key, Iterable<IntWritable> value,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		
		// sum the counts for this key
		int sum = 0;
		for (IntWritable v : value) {
			sum += v.get();
		}
		iw.set(sum);
		context.write(key, iw);
	}

}
  3. The KeyValueDemo driver class:
package com.blu.kvdemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class KeyValueDemo {

	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		// set Tab as the key/value separator; the Configuration must be
		// populated before the Job is created from it
		conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
		Job job = Job.getInstance(conf);
		job.setInputFormatClass(KeyValueTextInputFormat.class);
		job.setJarByClass(KeyValueDemo.class);
		job.setMapperClass(com.blu.kvdemo.Kvmapper.class);
		job.setReducerClass(com.blu.kvdemo.KvReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean flag = job.waitForCompletion(true);
		System.exit(flag ? 0 : 1);
	}
}
  4. Run the main method of KeyValueDemo with the following arguments:
D:\data\money.txt D:\data\output
  5. The output:
lilei	2
lisi	2
zhangsan	2
Reposted from blog.csdn.net/BLU_111/article/details/105771363