Hadoop MapReduce in Practice (WordCount)

WordCount case

  • Requirement 1: Count the words in a set of files

    Count and output the total number of occurrences of each word in a given text file.

    • Data preparation: one or more plain text files whose words are separated by spaces.

    • Analysis

      Write a Mapper, a Reducer, and a Driver according to the MapReduce programming specification, as illustrated below.
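      As a quick illustration of the data flow (the input lines and counts below are a made-up sample, not data from this post): suppose the input contains the two lines "hello world" and "hello hadoop". The map phase emits (hello,1), (world,1), (hello,1), (hadoop,1); after the shuffle, the reduce phase receives hello -> [1,1], world -> [1], hadoop -> [1] and writes:

          hadoop	1
          hello	2
          world	1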

  • Write the mapper class


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	// Reusable output key/value objects: the current word and the constant count 1
	Text k = new Text();
	IntWritable v = new IntWritable(1);

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		// 1 Read one line
		String line = value.toString();

		// 2 Split the line into words
		String[] words = line.split(" ");

		// 3 Emit (word, 1) for each word
		for (String word : words) {
			k.set(word);
			context.write(k, v);
		}
	}
}

  • Write the reducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Context context) throws IOException, InterruptedException {

		// 1 Sum all the counts received for this word
		int sum = 0;
		for (IntWritable count : values) {
			sum += count.get();
		}

		// 2 Write out (word, total count)
		context.write(key, new IntWritable(sum));
	}
}

  • Write the driver class

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// Hard-code the input/output paths for local testing (replace the placeholders),
		// or remove this line and pass the paths as command-line arguments
		args = new String[] { "input path", "output path" };

		// 1 Get the configuration info and the job instance
		Configuration configuration = new Configuration();
		Job job = Job.getInstance(configuration);

		// 2 Set the jar load path
		job.setJarByClass(WordCountDriver.class);

		// 3 Set the Mapper and Reducer classes
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// 4 Set the map output types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// 5 Set the reduce (final) output types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// 6 Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// 7 Submit the job and wait for completion
		boolean result = job.waitForCompletion(true);

		System.exit(result ? 0 : 1);
	}
}
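After packaging the three classes into a jar, the job can be submitted with the standard hadoop jar command; the jar name and the HDFS paths below are placeholders, not values from this post (remove the hard-coded args line in main so the command-line paths take effect):

hadoop jar wordcount.jar WordCountDriver /user/hadoop/wordcount/input /user/hadoop/wordcount/output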

Requirement 2: Partition the words by the parity of the ASCII code of their first character (Partitioner)

  • Analysis
  • Custom Partitioner
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

	@Override
	public int getPartition(Text key, IntWritable value, int numPartitions) {

		// 1 Get the ASCII code of the first character of the word
		int firstChar = key.toString().charAt(0);

		// 2 Partition by parity: even codes go to partition 0, odd codes to partition 1
		if (firstChar % 2 == 0) {
			return 0;
		} else {
			return 1;
		}
	}
}

Register the custom partitioner in the driver and set the number of reduce tasks

job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(2);
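With two reduce tasks, the job's output directory contains one part file per partition (the directory name below is a placeholder):

output/
    _SUCCESS
    part-r-00000    words whose first character has an even ASCII code
    part-r-00001    words whose first character has an odd ASCII code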

Requirement 3: Locally aggregate the output of each map task (Combiner)

During the count, the output of each map task is locally aggregated before being sent over the network, which reduces the volume of data transferred; this is exactly what the Combiner is for.

  • Data preparation

    Option One

    • Add a WordcountCombiner class that extends Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Context context) throws IOException, InterruptedException {

		// 1 Sum the counts emitted for this word by the current map task
		int count = 0;
		for (IntWritable v : values) {
			count += v.get();
		}

		// 2 Write out the partial sum
		context.write(key, new IntWritable(count));
	}
}

Specify the combiner in the WordCountDriver driver class

// 9 Specify that a combiner is used, and which class provides the combiner logic
job.setCombinerClass(WordcountCombiner.class);

Option Two

  • Specify WordCountReducer as the combiner in the WordCountDriver driver class. This works here because the reduce logic (summing counts) is commutative and associative, so it can safely be applied to partial map outputs.
// Specify that a combiner is used, and which class provides the combiner logic
job.setCombinerClass(WordCountReducer.class);

Running the program

Requirement 4: Slicing optimization for a large number of small files (CombineTextInputFormat)

A distributed architecture consists of the distributed file system HDFS and the distributed computing framework MapReduce.

HDFS: large files are not a problem; a large number of small files is.

MapReduce: data skew is the problem.

So how does MapReduce deal with a large number of small files?

MapReduce optimization strategy for a large number of small files:

  • By default, TextInputFormat slices the input file by file: no matter how small a file is, it becomes its own slice and is handed to its own map task. With a large number of small files this produces a large number of map tasks and extremely low processing efficiency.

  • Optimization strategy

    Best approach: merge the small files into large files as early in the pipeline as possible (during preprocessing or data acquisition) and upload the merged files to HDFS for later analysis.

Remedy: if a large number of small files already sit in HDFS, another InputFormat, CombineFileInputFormat, can be used for slicing.

Note: CombineTextInputFormat is a subclass of CombineFileInputFormat.

The difference from TextInputFormat: its slicing logic can logically group multiple small files into a single slice, so that several small files are handed to one map task.

// If no InputFormat is set explicitly, TextInputFormat.class is used by default.
/* CombineTextInputFormat is a component class shipped with the framework.
 * The 2048 in setMinInputSplitSize means that the combined size of n small files must not exceed 2048 bytes.
 * The 4096 in setMaxInputSplitSize means that, once the 2048 condition is satisfied,
 * the combined size of n+1 small files must not exceed 4096 bytes.
 */
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMinInputSplitSize(job, 2048);
CombineTextInputFormat.setMaxInputSplitSize(job, 4096);
  • Input data: prepare 5 small files
  • Implementation process
    • Without any changes, run the WordCount program from Requirement 1 and observe that the number of splits is 5.
  • Split count as a function of file size:
      file size < MinSplit < MaxSplit       -> 1 split
      MinSplit < file size < MaxSplit       -> 1 split
      MaxSplit < file size < 2 * MaxSplit   -> 2 splits
      2 * MaxSplit < file size              -> 3 splits

    Test results:
      File size   Max     File size / Max   Splits
      4.97 MB     3 MB    1.65x             2
      4.1 MB      3 MB    1.36x             1
      6.51 MB     3 MB    2.17x             3
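A minimal driver sketch for reproducing the test above, assuming the 3 MB maximum from the Max column (the byte conversion is mine, not taken from this post):

job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 3 * 1024 * 1024); // assumed: 3 MB maximum split size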

Source: blog.csdn.net/qq_45092505/article/details/105474874