Hadoop MapReduce: WordCount example and serialization

WordCount example

    Requirements: Count and output the number of occurrences of each word in a given text file.

Input                   Map phase          Intermediate result      Reduce phase     Output
Java Java Java Java     <Java,1>...        <Assembly,<1,1>>         <Assembly,2>     Assembly  2
PHP PHP PHP PHP PHP     <PHP,1>...         <Java,<1,1,1,1>>         <Java,4>         Java      4
Python Python Python    <Python,1>...      <PHP,<1,1,1,1,1>>        <PHP,5>          PHP       5
Assembly Assembly       <Assembly,1>...    <Python,<1,1,1>>         <Python,3>       Python    3
SQL SQL SQL SQL         <SQL,1>...         <SQL,<1,1,1,1>>          <SQL,4>          SQL       4

WordCountMapper

  1. A user-defined Mapper must extend the Mapper parent class
  2. The input data of the Mapper is a key-value pair (the KV types can be customized)
  3. The business logic of the Mapper is written in the map() method
  4. The output data of the Mapper is also a key-value pair (the KV types can be customized)
  5. The map() method is called once for each input <K, V> pair
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//KEYIN    key type of the input data
//VALUEIN  value type of the input data
//KEYOUT   key type of the output data
//VALUEOUT value type of the output data
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	Text k = new Text();
	IntWritable v = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
			throws IOException, InterruptedException {
		String string = value.toString();      // read one line of input
		String[] words = string.split(" ");    // split the line into words
		for (String word : words) {
			k.set(word);
			context.write(k, v);               // emit <word, 1>
		}
	}
}

WordCountReducer

  1. A user-defined Reducer must extend the Reducer parent class
  2. The input data types of the Reducer match the output data types of the Mapper
  3. The business logic of the Reducer is written in the reduce() method
  4. The output data of the Reducer is a key-value pair (the KV types can be customized)
  5. The reduce() method is called once for each group of values sharing the same key
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	IntWritable result = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable value : values) {
			sum += value.get();            // accumulate the 1s emitted for this word
		}
		result.set(sum);
		context.write(key, result);        // emit <word, total count>
	}
}

WordCountDriver

The Driver is divided into 7 steps:

  1. Get the Job object
  2. Set the location of the jar package
  3. Associate the Mapper and Reducer classes
  4. Set the key and value types of the Mapper's output data
  5. Set the key and value types of the final output data
  6. Set the input and output paths
  7. Submit the job
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		//1. Get the Job object
		Job job = Job.getInstance(conf);
		//2. Set the location of the jar package
		job.setJarByClass(WordCountDriver.class);
		//3. Associate the Mapper and Reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		//4. Set the key and value types of the Mapper's output data
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		//5. Set the key and value types of the final output data
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		//6. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		//7. Submit the job
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}
}
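
Once the three classes are packaged into a jar, the job is typically submitted with the hadoop jar command; args[0] and args[1] above become the input and output paths given on the command line. A minimal invocation, assuming the jar is named wordcount.jar and /input and /output are placeholder HDFS paths:

    hadoop jar wordcount.jar WordCountDriver /input /output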

Serialization

    Serialization is the conversion of in-memory objects into byte sequences so that they can be persisted to disk or transmitted over the network.
    Deserialization is the reverse process: converting a received byte sequence, or persisted data read from disk, back into in-memory objects.
    Java's built-in serialization ( Serializable ) is a heavyweight framework: a serialized object carries a lot of extra information (checksums, headers, the inheritance hierarchy, etc.), which makes it inefficient to transmit over the network. Hadoop therefore provides its own lightweight serialization mechanism ( Writable ); a small round-trip example is shown after the table below.
    Hadoop serialization types corresponding to commonly used Java data types:

Java type      Hadoop Writable type
boolean        BooleanWritable
byte           ByteWritable
int            IntWritable
float          FloatWritable
long           LongWritable
double         DoubleWritable
String         Text
map            MapWritable
array          ArrayWritable
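
To see the compactness of the Writable mechanism in action, here is a minimal standalone sketch (not part of the original post; the class name WritableRoundTrip is made up for illustration) that serializes an IntWritable to a byte array and reads it back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
	public static void main(String[] args) throws IOException {
		// Serialize: write(DataOutput) emits only the raw field bytes.
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		new IntWritable(42).write(new DataOutputStream(bytes));
		System.out.println("serialized size = " + bytes.size() + " bytes");   // 4 bytes, no class metadata

		// Deserialize: readFields(DataInput) restores the value into an existing object.
		IntWritable copy = new IntWritable();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
		System.out.println("value = " + copy.get());                          // 42
	}
}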

Custom object serialization

  1. Implement the Writable interface
  2. Deserialization uses reflection to call the no-argument constructor, so the class must provide one
  3. Override the serialization method write()
  4. Override the deserialization method readFields(); fields must be read in exactly the same order they were written
  5. To make the results readable in the output file, override toString(); fields can be separated with "\t"
  6. If the custom bean is transmitted as a key, it must also be comparable (implement Comparable, typically via WritableComparable in Hadoop), because the Shuffle phase in MapReduce sorts keys
package beanwritable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Student implements Writable, Comparable<Student> {

	private int chinese;
	private int math;
	private int english;
	private int sum;

	// A no-argument constructor is required for deserialization via reflection
	public Student() {
	}

	public Student(int chinese, int math, int english) {
		this.chinese = chinese;
		this.math = math;
		this.english = english;
		this.sum = chinese + math + english;
	}

	public int getChinese() { return chinese; }
	public void setChinese(int chinese) { this.chinese = chinese; }
	public int getMath() { return math; }
	public void setMath(int math) { this.math = math; }
	public int getEnglish() { return english; }
	public void setEnglish(int english) { this.english = english; }
	public int getSum() { return sum; }
	public void setSum(int sum) { this.sum = sum; }

	public void setGrade(int chinese, int math, int english) {
		this.chinese = chinese;
		this.math = math;
		this.english = english;
		this.sum = chinese + math + english;
	}

	@Override
	public String toString() {
		return "chinese=" + chinese + "\tmath=" + math + "\tenglish=" + english;
	}

	// Serialization method
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(chinese);
		out.writeInt(math);
		out.writeInt(english);
		out.writeInt(sum);
	}

	// Deserialization method: fields must be read in the same order they were written
	@Override
	public void readFields(DataInput in) throws IOException {
		chinese = in.readInt();
		math = in.readInt();
		english = in.readInt();
		sum = in.readInt();
	}

	// Sort by total score (ascending)
	@Override
	public int compareTo(Student o) {
		return this.sum - o.sum;
	}
}
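
As a usage sketch (not from the original post; the class name ScoreMapper and the "name chinese math english" input format are assumptions for illustration), such a bean could be emitted as the map output value:

package beanwritable;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that parses lines such as "Alice 90 85 70"
// and emits <name, Student>; the driver would also need
// job.setMapOutputValueClass(Student.class).
public class ScoreMapper extends Mapper<LongWritable, Text, Text, Student> {

	Text name = new Text();
	Student student = new Student();

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] fields = value.toString().split(" ");
		name.set(fields[0]);
		student.setGrade(Integer.parseInt(fields[1]),
				Integer.parseInt(fields[2]),
				Integer.parseInt(fields[3]));
		context.write(name, student);
	}
}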

Origin blog.csdn.net/H_X_P_/article/details/105967732