Hadoop learning tutorial (MapReduce) (4)

MapReduce

1. MapReduce overview

1.1. MapReduce definition

MapReduce is a programming framework for distributed computing programs and is the core framework for users to develop data analysis applications based on Hadoop.
The core function of MapReduce is to integrate user-written business logic code and built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.2. Advantages and disadvantages of MapReduce

1.2.1. Advantages of MapReduce

1) MapReduce is easy to program.
Users only need to implement a few interfaces to build a complete distributed program, which can then be distributed across a large number of cheap PC machines. Writing a distributed MapReduce program feels much like writing a simple serial program, which is why MapReduce programming became so popular.
2) Good scalability.
When your computing resources can no longer meet the demand, you can expand computing capacity simply by adding machines.
3) High fault tolerance
MapReduce was designed from the start to run on cheap PC machines, which requires high fault tolerance. For example, if one machine goes down, its computing tasks are transferred to another node so the job does not fail; this happens automatically inside Hadoop without any manual intervention.
4) Suitable for offline processing of massive data at the PB level and above.
It can coordinate thousands of servers in a cluster working concurrently to provide data processing capacity.

1.2.2. Disadvantages of MapReduce

1) Not good at real-time computation.
MapReduce cannot return results within milliseconds or seconds the way MySQL can.
2) Not good at streaming computing.
The input data of streaming computing arrives dynamically, while the input data set of a MapReduce job is static and cannot change while the job runs; MapReduce's own design requires the data source to be static.
3) Not good at DAG (directed acyclic graph) calculations.
When multiple applications have dependencies and the input of one application is the output of the previous one, MapReduce can still be used, but each MapReduce job writes its output to disk, causing a large amount of disk IO and therefore very low performance.

1.3. The core idea of MapReduce

(1) Distributed computing programs often need to be divided into at least two stages.
(2) The concurrent instances of MapTask in the first stage run completely in parallel and are independent of each other.
(3) The concurrent instances of ReduceTask in the second stage are independent of each other, but their data depends on the output of the concurrent instance of MapTask in the previous stage.
(4) The MapReduce programming model can contain only one Map phase and one Reduce phase. If the user's business logic is very complex, the only option is to run multiple MapReduce programs in sequence, with the output of one job becoming the input of the next, as sketched below.
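A minimal sketch (not from the tutorial) of chaining two jobs in sequence: job2 reads the directory that job1 wrote. The Stage1/Stage2 Mapper and Reducer classes and the paths are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job1 = Job.getInstance(conf, "stage-1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(Stage1Mapper.class);      // hypothetical placeholder
        job1.setReducerClass(Stage1Reducer.class);    // hypothetical placeholder
        FileInputFormat.setInputPaths(job1, new Path("/input"));
        FileOutputFormat.setOutputPath(job1, new Path("/tmp/stage1"));
        if (!job1.waitForCompletion(true)) {          // block until job1 finishes
            System.exit(1);
        }

        Job job2 = Job.getInstance(conf, "stage-2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(Stage2Mapper.class);      // hypothetical placeholder
        job2.setReducerClass(Stage2Reducer.class);    // hypothetical placeholder
        FileInputFormat.setInputPaths(job2, new Path("/tmp/stage1"));  // job1's output is job2's input
        FileOutputFormat.setOutputPath(job2, new Path("/output"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}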

1.4. MapReduce process

A complete MapReduce program has three types of instance processes when running in a distributed manner:
(1) MrAppMaster: Responsible for scheduling and status coordination of the entire program process.
(2) MapTask: Responsible for the entire data processing process in the Map stage.
(3) ReduceTask: Responsible for the entire data processing process of the Reduce stage.

1.5. Official WordCount source code

Decompiling the official jar shows that the WordCount example consists of a Mapper class, a Reducer class, and a driver class, and that the data types used are Hadoop's own serialization (Writable) types.

1.6. Common data serialization types

Java type    Hadoop Writable type
Boolean      BooleanWritable
Byte         ByteWritable
Int          IntWritable
Float        FloatWritable
Long         LongWritable
Double       DoubleWritable
String       Text
Map          MapWritable
Array        ArrayWritable
Null         NullWritable

1.7. MapReduce programming specifications

The program written by the user is divided into three parts: Mapper, Reducer, and Driver.
1. Mapper stage
(1) The user-defined Mapper must inherit its own parent class (Mapper)
(2) The input data of the Mapper is in the form of KV pairs (the KV types can be customized)
(3) The business logic of the Mapper is written in the map() method
(4) The output data of the Mapper is in the form of KV pairs (the KV types can be customized)
(5) The map() method (in the MapTask process) is called once for each <K, V>
2. Reducer stage
(1) The user-defined Reducer must inherit its own parent class (Reducer)
(2) The input data type of the Reducer corresponds to the output data type of the Mapper, and is also KV
(3) The business logic of the Reducer is written in the reduce() method
(4) The ReduceTask process calls the reduce() method once for each group of <K, V> with the same K
3. Driver stage
The Driver is equivalent to a client of the YARN cluster. It submits the entire program to the YARN cluster; what is submitted is a Job object that encapsulates the relevant running parameters of the MapReduce program.

1.8. WordCount case operation

1.8.1. Local testing

1) Requirements:
Count and output the total number of occurrences of each word in a text file
(1) Input data
Custom data; this tutorial uses the following (words separated by spaces):
atguigu atguigu
ss ss
cls cls
jiao
banzhang
xue
hadoop
(2) Expected output data
atguigu 2
banzhang 1
cls 2
hadoop 1
jiao 1
ss 2
xue 1

2) Requirements analysis (operation steps)
1. Input data
Mapper
2. Convert the line of text passed in by the MapTask into a String
3. Split the line into words according to the delimiter
4. Output each word as a KV pair <word, 1>
Reducer
1. Sum the count of each key
2. Output the total count of each key
Driver
1. Obtain the configuration information and the Job object instance
2. Specify the local path where the jar package of this program is located
3. Associate the Mapper and Reducer business classes
4. Specify the KV types of the Mapper output data
5. Specify the KV types of the final output data
6. Specify the directory where the job's original input files are located
7. Specify the directory where the job's output results go
8. Submit the job

3) Business practice (environment preparation)
(1) Create maven project, MapReduceDemo
(2) Add the following dependencies in the pom.xml file

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>

(3) Create a new file named "log4j.properties" in the project's /src/main/resources directory and fill it with:

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

(4) Create package name: com.shenbaoyun.mapreduce.wordcount

4) Write a program
(1) Write WordCountMapper


package com.shenbaoyun.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * KEYIN    type of the key input to the Map stage: LongWritable
 * VALUEIN  type of the value input to the Map stage: Text
 * KEYOUT   type of the key output by the Map stage: Text
 * VALUEOUT type of the value output by the Map stage: IntWritable
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text outK = new Text();
    IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Get one line
        String line = value.toString();

        // 2 Split it into words
        String[] words = line.split(" ");

        // 3 Write out each word
        for (String word : words) {
            // Wrap the word as outK
            outK.set(word);

            // Write out
            context.write(outK, outV);
        }
    }
}

(2) Write the Reducer class


package com.shenbaoyun.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * KEYIN    type of the key input to the Reduce stage: Text
 * VALUEIN  type of the value input to the Reduce stage: IntWritable
 * KEYOUT   type of the key output by the Reduce stage: Text
 * VALUEOUT type of the value output by the Reduce stage: IntWritable
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1 Accumulate the sum
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // 2 Write out
        outV.set(sum);
        context.write(key, outV);
    }
}

(3) Write Driver driver class


package com.shenbaoyun.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the configuration information and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar path
        job.setJarByClass(WordCountDriver.class);

        // 3 Associate the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4 Set the kv types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        // For local testing, fixed paths can be used instead of args:
        // FileInputFormat.setInputPaths(job, new Path("input path"));
        // FileOutputFormat.setOutputPath(job, new Path("output path"));
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}

1.8.2. Submit to cluster testing

(1) To package with Maven, the packaging plugins need to be added to pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.atguigu</groupId>
    <artifactId>MapReduce</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.30</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>

(2) Package the program
(3) Rename the jar package without dependencies to wc.jar and copy it to the /opt/module/hadoop-3.1.3 directory on the Hadoop cluster
(4) Start the Hadoop cluster

myhadoop.sh stop
myhadoop.sh start

(5) Execute the WordCount program

hadoop jar wc.jar com.shenbaoyun.mapreduce.wordcount.WordCountDriver /user/shenbaoyun/input /user/shenbaoyun/output

2. Hadoop serialization

2.1. Overview of serialization

1) What is serialization?
Serialization converts objects in memory into a sequence of bytes (or another data transfer protocol) so they can be stored on disk (persisted) and transmitted over the network.
Deserialization converts a received byte sequence (or other data transfer protocol), or data persisted on disk, back into objects in memory.
2) Why is serialization necessary?
Generally speaking, "live" objects exist only in memory and disappear when the power is turned off; moreover, "live" objects can only be used by the local process and cannot be sent to another computer on the network. Serialization, however, makes it possible to store "live" objects and to send them to remote computers.
3) Why not use Java serialization?
Java serialization is a heavyweight serialization framework (Serializable). A serialized object carries a lot of extra information (various check information, headers, the inheritance hierarchy, etc.), which makes it inefficient to transmit over the network, so Hadoop developed its own serialization mechanism (Writable); the sketch after the feature list below makes the size difference concrete.
4) Features of Hadoop serialization
(1) Compact: efficient use of space.
(2) Fast: The additional overhead of reading and writing data is small.
(3) Interoperability: Support multi-language interaction.
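As a rough illustration of the "compact" point (this demo is not part of the tutorial; it only prints the sizes measured at runtime rather than asserting exact numbers), the sketch below serializes the same long value with Hadoop's LongWritable and with plain Java serialization:

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class WritableSizeDemo {
    public static void main(String[] args) throws IOException {
        // Hadoop Writable: writes exactly the 8 bytes of the long
        ByteArrayOutputStream hadoopBytes = new ByteArrayOutputStream();
        new LongWritable(13560436666L).write(new DataOutputStream(hadoopBytes));

        // Java serialization: writes the object plus class metadata
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Long.valueOf(13560436666L));
        }

        System.out.println("Writable bytes        : " + hadoopBytes.size());
        System.out.println("Java serialized bytes : " + javaBytes.size());
    }
}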

2.2. Custom bean object implements serialization interface (Writable)

The basic serialization types commonly used in enterprise development cannot meet all needs. For example, if a bean object is passed inside the Hadoop framework, then the object needs to implement the serialization interface.
There are 7 specific steps to make a bean object serializable:
(1) Implement the Writable interface
(2) The empty parameter constructor needs to be called reflectively during deserialization, so there must be an empty parameter constructor

public FlowBean() {
	super();
}

(3) Rewrite the serialization method

@Override
public void write(DataOutput out) throws IOException {
	out.writeLong(upFlow);
	out.writeLong(downFlow);
	out.writeLong(sumFlow);
}

(4) Rewrite the deserialization method

@Override
public void readFields(DataInput in) throws IOException {
	upFlow = in.readLong();
	downFlow = in.readLong();
	sumFlow = in.readLong();
}

(5) Note that the deserialization order must be exactly the same as the serialization order
(6) If you want the results to be displayed in a file, you also need to override toString(); fields can be separated with "\t" for convenient later use
(7) If the custom bean needs to be transmitted as a key, you also need to implement the Comparable interface, because the Shuffle process in the MapReduce framework requires keys to be sortable.

@Override
public int compareTo(FlowBean o) {
	// Sort in descending order, from largest to smallest
	return this.sumFlow > o.getSumFlow() ? -1 : 1;
}

2.3. Practical operation of serialization cases

1) Requirement: count the total upstream traffic, total downstream traffic, and total traffic consumed by each mobile phone number
(1) Input data
(2) Input data format

7 13560436666 120.196.100.99 1116 954 200
id phone number network ip Upstream traffic Downstream traffic Network status code

(3) Expected output data format

13560436666 1116 954 2070
phone number Upstream traffic Downstream traffic total traffic

2) Requirements analysis
1. Requirement: count the total upstream traffic, total downstream traffic, and total traffic consumed by each mobile phone number
2. Input data format
3. Expected output format
Map stage
4. Read a line of data and split it into fields
5. Extract the mobile phone number, upstream traffic, and downstream traffic
6. Use the mobile phone number as the key and the bean object as the value, i.e. context.write(phone number, bean)
7. To transmit the bean object, it must implement the serialization interface
Reduce stage
8. Accumulate the upstream traffic and downstream traffic to get the total traffic
3) Write a MapReduce program
(1) Write a Bean object for traffic statistics


import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * 1. Define the class to implement the Writable interface
 * 2. Override the serialization and deserialization methods
 * 3. Override the empty-argument constructor
 * 4. Override the toString method
 */
public class FlowBean implements Writable {

    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    // Empty-argument constructor
    public FlowBean() {
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}

(2) Write Mapper class


import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    private Text outK = new Text();
    private FlowBean outV = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Get one line
        String line = value.toString();

        // 2 Split it
        String[] split = line.split("\t");

        // 3 Grab the wanted fields
        String phone = split[1];
        String up = split[split.length - 3];
        String down = split[split.length - 2];

        // 4 Wrap them up
        outK.set(phone);
        outV.setUpFlow(Long.parseLong(up));
        outV.setDownFlow(Long.parseLong(down));
        outV.setSumFlow();

        // 5 Write out
        context.write(outK, outV);
    }
}

(3) Write the Reducer class


import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    private FlowBean outV = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {

        // 1 Iterate over the values and accumulate
        long totalUp = 0;
        long totalDown = 0;
        for (FlowBean value : values) {
            totalUp += value.getUpFlow();
            totalDown += value.getDownFlow();
        }

        // 2 Wrap up outK, outV
        outV.setUpFlow(totalUp);
        outV.setDownFlow(totalDown);
        outV.setSumFlow();

        // 3 Write out
        context.write(key, outV);
    }
}

(4) Write Driver class


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar
        job.setJarByClass(FlowDriver.class);

        // 3 Associate the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4 Set the key and value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // 5 Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

3. MapReduce framework principle

3.1. InputFormat data input

3.1.1. Slicing and MapTask parallelism determination mechanism

1) The parallelism of MapTask determines the concurrency of task processing in the Map stage, which in turn affects the processing speed of the entire job.
Thinking: starting 8 MapTasks for 1G of data improves the cluster's concurrent processing capability. But for 1K of data, would starting 8 MapTasks still improve performance? Is more MapTask parallelism always better? What factors determine MapTask parallelism?
2) MapTask parallelism determination mechanism
Data block: a Block is how HDFS physically splits data into pieces; the data block is the unit in which HDFS stores data.
Data slice: a data slice (split) only divides the input logically and does not split it into pieces on disk. A data slice is the unit in which a MapReduce program computes over the input data; one slice starts one MapTask.
(1) The parallelism of a job's Map phase is determined by the number of slices computed when the client submits the job.
(2) Each split slice is assigned one MapTask instance for processing.
(3) By default, slice size = BlockSize.
(4) When slicing, the whole data set is not considered as one; each file is sliced individually.

3.1.2. Detailed explanation of Job submission process source code and slicing process source code

waitForCompletion()

submit();

// 1 Establish the connection
	connect();	
		// 1) Create the proxy that submits the Job
		new Cluster(getConfiguration());
			// (1) Determine whether it is a local or a YARN cluster running environment
			initialize(jobTrackAddr, conf); 

// 2 Submit the job
submitter.submitJobInternal(Job.this, cluster)

	// 1) Create the staging path used to submit data to the cluster
	Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);

	// 2) Get the jobid and create the Job path
	JobID jobId = submitClient.getNewJobID();

	// 3) Copy the jar package to the cluster
copyAndConfigureFiles(job, submitJobDir);	
	rUploader.uploadFiles(job, jobSubmitDir);

	// 4) Compute the slices and generate the slice planning file
writeSplits(job, submitJobDir);
		maps = writeNewSplits(job, jobSubmitDir);
		input.getSplits(job);

	// 5) Write the XML configuration file to the staging path
writeConf(conf, submitJobFile);
	conf.writeXml(out);

	// 6) Submit the Job and return the submission status
status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());

2) FileInputFormat slicing source code analysis
(1) The program first finds the directory where the data is stored.
(2) It starts traversing each file in the directory to be processed (planning the slices).
(3) It traverses the first file, xxx.txt:
a) Get the file size: fs.sizeOf(xxx.txt)
b) Compute the slice size:

computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M

c) By default, slice size = blocksize
d) Start slicing, forming the first slice xxx.txt 0-128M, the second slice 128-256M, and the third slice 256-300M (before each slice is cut, check whether the remaining part is larger than 1.1 times the block size; if it is not larger than 1.1 times, it becomes a single slice)
e) Write the slice information into a slice planning file
f) The core of the whole slicing process is completed in the getSplits() method
g) InputSplit only records slice metadata, such as the start position, the length, and the list of nodes where it is located

(4) Submit the slicing planning file to YARN, and MrAppMaster on YARN can calculate the number of opened MapTasks based on the slicing planning file.
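As a worked illustration of rules c) and d) above, the standalone sketch below (not Hadoop's actual source, just the same logic) applies the default 128M split size and the 1.1x remainder rule to a hypothetical 300M file, producing the three splits 0-128M, 128-256M and 256-300M:

public class SplitPlanDemo {

    private static final long MB = 1024L * 1024;

    public static void main(String[] args) {
        long splitSize = 128 * MB;   // default: splitSize == blockSize == 128M
        long fileSize  = 300 * MB;   // hypothetical 300M input file

        long offset = 0;
        long bytesRemaining = fileSize;

        // Keep cutting full-size splits while the remainder is more than 1.1x the split size
        while ((double) bytesRemaining / splitSize > 1.1) {
            System.out.println("split: " + offset / MB + "M - " + (offset + splitSize) / MB + "M");
            offset += splitSize;
            bytesRemaining -= splitSize;
        }

        // The remainder (here 300M - 256M = 44M) becomes the last split on its own
        if (bytesRemaining > 0) {
            System.out.println("split: " + offset / MB + "M - " + fileSize / MB + "M");
        }
    }
}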

3.1.3. FileInputFormat slicing mechanism

1. Slicing mechanism
(1) Simply slice according to the content length of the file
(2) The slice size defaults to the block size
(3) The whole data set is not considered when slicing; each file is sliced individually
2. FileInputFormat slice size parameter configuration
(1) Formula used in the source code to compute the slice size

Math.max(minSize, Math.min(maxSize, blocksize));
mapreduce.input.fileinputformat.split.minsize=1               // default value is 1
mapreduce.input.fileinputformat.split.maxsize=Long.MAX_VALUE  // default value is Long.MAX_VALUE

So by default, slice size = blocksize.
(2) Slice size settings

maxsize (maximum slice value): if this parameter is set smaller than blocksize, the slice becomes smaller, equal to the value of this parameter.
minsize (minimum slice value): if this parameter is set larger than blocksize, the slice becomes larger than blocksize.
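As a sketch (the 64M and 256M values are only examples), both parameters can also be set per job in the driver through FileInputFormat; pick one of the two calls depending on the goal:

// Make splits smaller than the block size by lowering maxsize to 64M:
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

// Or make splits larger than the block size by raising minsize to 256M:
// FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);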

(3) Obtain slice information API

// Get the slice information according to the file type
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the name of the file the slice belongs to
String name = inputSplit.getPath().getName();

3.1.4. TextInputFormat

1) FileInputFormat implementation class
Thinking: when running a MapReduce program, the input files can be line-based log files, binary files, database tables, and so on. How does MapReduce read these different kinds of data?
Common FileInputFormat implementation classes include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormats.
2) TextInputFormat
TextInputFormat is the default FileInputFormat implementation class. It reads each record line by line. The key is the byte offset at which the line starts within the file, of type LongWritable; the value is the content of the line, excluding any line terminators (line feed, carriage return), of type Text.
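For example, for a file containing the following three lines (assuming Unix line endings, so each line occupies its character count plus one byte), TextInputFormat produces roughly these (key, value) pairs:

hello world          ->  key = 0,  value = "hello world"
hadoop mapreduce     ->  key = 12, value = "hadoop mapreduce"
hello hadoop         ->  key = 29, value = "hello hadoop"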

3.1.5. CombineTextInputFormat slicing mechanism

The framework's default TextInputFormat slicing mechanism slices tasks per file: no matter how small a file is, it becomes a separate slice and is handed to one MapTask. If there are many small files, many MapTasks are created and processing efficiency is extremely low.
1) Application scenario
CombineTextInputFormat is used when there are too many small files. It can logically plan multiple small files into one slice, so that multiple small files are handed to one MapTask for processing.
2) Setting the maximum virtual storage slice size

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4M

Note: it is best to set the maximum virtual storage slice size according to the actual sizes of the small files.
3) Slicing mechanism
The slice generation process has two parts: the virtual storage process and the slicing process.
(1) Virtual storage process
Compare the size of each file in the input directory with the configured setMaxInputSplitSize value in turn. If a file is not larger than the maximum value, it is logically divided into one block. If the file is larger than the maximum value and more than twice as large, a block of the maximum size is cut off; when the remaining data is larger than the maximum value but not more than twice as large, the remainder is divided into two equal virtual storage blocks.
For example, if setMaxInputSplitSize is 4M and an input file is 8.02M, a 4M block is logically split off first. The remaining 4.02M, if split by the 4M rule, would leave a tiny 0.02M virtual storage file, so the remaining 4.02M is instead divided into two files (2.01M and 2.01M).
(2) Slicing process
(a) Determine whether the size of a virtual storage file is greater than the setMaxInputSplitSize value; if it is, it forms a slice by itself.
(b) If it is not larger, it is merged with the next virtual storage file to form one slice.
(c) Test example: there are 4 small files of sizes 1.7M, 5.1M, 3.4M, and 6.8M. After virtual storage, 6 file blocks are formed, with sizes 1.7M, (2.55M, 2.55M), 3.4M, and (3.4M, 3.4M).
Three slices are finally formed, with sizes (1.7+2.55)M, (2.55+3.4)M, and (3.4+3.4)M.

3.1.6. CombineTextInputFormat case

1) Requirement
(1) Prepare four small files
(2) Expect one slice to process all 4 files
2) Implementation
(a) Add the following code to the driver class (WordCountDriver):

// If InputFormat is not set, the default is TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);

// Set the maximum virtual storage slice size to 4M
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);

(b) Run the program and observe that 3 slices are used:

number of splits:3

(c) Change the code in WordCountDriver as follows, run the program, and observe that the number of slices is 1:

// If InputFormat is not set, the default is TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);

// Set the maximum virtual storage slice size to 20M (larger than the sum of the four files)
CombineTextInputFormat.setMaxInputSplitSize(job, 20971520);

(d) Run the program and observe that 1 slice is used:

number of splits:1

3.2. MapReduce workflow and principle

Map phase
(1) A job is started on the client.
(2) The resource files required to run the job are submitted to YARN, and multiple MapTasks are started based on the slice file information.
(3) Each MapTask uses the RecordReader obtained from the InputFormat to parse the InputSplit into <k, v> pairs, hands them to the user-written map() method for processing, and produces new <k, v> pairs.
(4) When the map() method finishes processing, it calls OutputCollector.collect() to output the results, which are temporarily stored in a circular memory buffer (100M by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of the buffer size, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the data in the buffer is written to it.
(5) Before writing to disk, the thread first divides the data into the same number of partitions as there are reduce tasks, i.e. one reduce task corresponds to the data of one partition. This is done to avoid data skew on the reduce side and the efficiency problems it causes.
When the map task finishes outputting, there may be many spill files, which then need to be merged. During merging, sorting and combiner operations are performed continuously, in order to reduce the amount of data written to disk each time and to minimize the amount of data transmitted over the network in the subsequent copy stage. In the end they are merged into one partitioned and sorted file.
(6) The data of each partition is copied to the corresponding reduce task.
Reduce phase
(7) Reduce receives data from different map tasks, and the data from each map is already sorted. If the amount of data received by the reduce side is small, it is kept directly in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which gives the percentage of heap space used for this purpose). If the data exceeds a certain proportion of the buffer size (determined by mapred.job.shuffle.merge.percent), it is merged and then spilled to disk.
(8) As the number of spill files increases, a background thread merges them into larger sorted files to save time in later merges (in fact, MapReduce performs sorting and merging repeatedly on both the map side and the reduce side; this is why some people say sorting is the soul of Hadoop).
(9) Many intermediate files are produced during merging, but MapReduce writes as little data to disk as possible, and the result of the last merge is not written to disk but fed directly into the reduce function.

This section quotes https://www.cnblogs.com/hadoop-dev/p/5894911.html and makes some changes to the content.

3.3. Shuffle

3.3.1. Shuffle mechanism

1. What is the Shuffle mechanism?
In Hadoop, the process of transferring data from the Map stage to the Reduce stage is called Shuffle. The Shuffle mechanism is the core part of the entire MapReduce framework.
Core mechanisms: partitioning, sorting, caching.
2. Scope of Shuffle
Generally, the process of outputting data from the Map stage to the Reduce stage is called Shuffle, so the scope of Shuffle is the intermediate process from the output of the Map stage to the data input of the Reduce stage.
3. Shuffle process
(1) The MapTask collects the <k, v> pairs output by the map() method and puts them into a memory buffer.
(2) The memory buffer keeps spilling to local disk files; there may be multiple files.
(3) Multiple spill files are merged into one large spill file.
(4) During both the spill and merge processes, the Partitioner is called to partition the data and sort it by key.
(5) Each ReduceTask goes to every MapTask machine to fetch the data of its own partition, according to its partition number.
(6) The ReduceTask fetches result files of the same partition from different MapTasks and merges and sorts these files again.
(7) After merging into one large file, the Shuffle process is over. The subsequent logical processing of the ReduceTask takes <k, v> pairs one by one from the file and passes them into the user-defined reduce() method.
Note:
(1) The buffer size in Shuffle affects the execution efficiency of the MapReduce program. In principle, the larger the buffer, the fewer disk IOs and the faster the execution.
(2) The buffer size can be adjusted with a parameter: mapreduce.task.io.sort.mb, which defaults to 100M.
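As a sketch (200 is only an example value, and the spill threshold shown uses the property name mapreduce.map.sort.spill.percent), the buffer can be tuned per job through the Configuration before the Job object is created:

// Sketch: tuning the Shuffle sort buffer for one job (values are examples only).
Configuration conf = new Configuration();
conf.set("mapreduce.task.io.sort.mb", "200");           // circular buffer size in MB (default 100)
conf.set("mapreduce.map.sort.spill.percent", "0.80");   // spill threshold (default 0.80)
Job job = Job.getInstance(conf);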

3.3.2. Partition partitioning

1. Default Partitioner partitioning
The default partition is obtained by taking the hashCode of the key modulo the number of ReduceTasks. With the default partitioner, users have no control over which key is stored in which partition (the HashPartitioner sketch after this list shows the logic).
2. Steps to customize a Partitioner
(1) The custom class extends Partitioner and overrides the getPartition() method
(2) Set the custom Partitioner in the job driver
(3) After customizing the Partitioner, set the corresponding number of ReduceTasks according to the logic of the custom Partitioner
3. Partition summary
(1) If the number of ReduceTasks is greater than the number of results of getPartition(), several extra empty output files part-r-000xx are produced
(2) If 1 < number of ReduceTasks < number of results of getPartition(), some partitions have nowhere to put their data, which causes an Exception
(3) If the number of ReduceTasks = 1, no matter how many partition files the MapTask side outputs, everything goes to this one ReduceTask and only one result file part-r-000xx is produced
(4) Partition numbers must start from zero and increase one by one
4. Example:
If the number of partitions in the custom partitioner is 5, then
(1) job.setNumReduceTasks(1);  runs normally, but only one output file is produced
(2) job.setNumReduceTasks(2);  the program reports an error
(3) job.setNumReduceTasks(6);  runs normally, and one empty output file is produced (6 > 5)
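For reference, the default HashPartitioner that implements rule 1 above is essentially the following (a sketch consistent with Hadoop's implementation):

import org.apache.hadoop.mapreduce.Partitioner;

// The default partitioner: hashCode of the key modulo the number of ReduceTasks.
// The & Integer.MAX_VALUE masks the sign bit so the result is never negative.
public class HashPartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}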

3.3.3. Partition partitioning case

1) Requirement
Output the statistical results into different files (partitions) according to the province the mobile phone number belongs to.
1. Input data
2. Expected output data
Mobile phone numbers starting with 136, 137, 138, and 139 are each put into a separate file; all others go into one file.

2) Requirements analysis
1. Output the statistical results into different files (partitions) according to the province the mobile phone number belongs to
2. Data input
3. Expected data output
4. Add a ProvincePartitioner partitioner
5. Driver class (specify the custom partitioner and set the corresponding number of ReduceTasks)

3) Based on case 2.3, add a partition class

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {

        // text is the mobile phone number
        String phone = text.toString();

        // Take the first three digits
        String prePhone = phone.substring(0, 3);

        int partition;
        if ("136".equals(prePhone)) {
            partition = 0;
        } else if ("137".equals(prePhone)) {
            partition = 1;
        } else if ("138".equals(prePhone)) {
            partition = 2;
        } else if ("139".equals(prePhone)) {
            partition = 3;
        } else {
            partition = 4;
        }

        // Return the partition number
        return partition;
    }
}

4) Add custom data partition settings and ReduceTask settings in the driver function

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar
        job.setJarByClass(FlowDriver.class);

        // 3 Associate the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4 Set the key and value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // 5 Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 8 Specify the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);

        // 9 Specify the corresponding number of ReduceTasks
        job.setNumReduceTasks(5);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the Job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

3.3.4. WritableComparable sorting

Sorting is one of the most important operations in the MapReduce framework.
Both MapTask and ReduceTask sort data by key. This is Hadoop's default behavior: data in any application gets sorted, whether the logic requires it or not.
The default sort is in dictionary (lexicographic) order and is implemented with quick sort.
For a MapTask, processing results are temporarily put into the circular buffer; when buffer usage reaches a certain threshold, the data in the buffer is quick-sorted and the sorted data is spilled to disk. When all data has been processed, all the spill files on disk are merge-sorted.
For a ReduceTask, the corresponding data files are copied remotely from each MapTask. If a file exceeds a certain size it is spilled to disk, otherwise it is kept in memory. If the number of files on disk reaches a certain threshold, a merge sort is performed to produce a larger file; if the size or number of files in memory exceeds a certain threshold, a merge is performed and the data is spilled to disk. After all data has been copied, the ReduceTask performs one unified merge sort over all data in memory and on disk.

1. Sorting classification
(1) Partial sorting
MapReduce sorts the data set by the key of the input records, guaranteeing that each output file is internally ordered.
(2) Full sorting
The final output is a single file that is internally ordered. The way to achieve this is to configure only one ReduceTask. This is extremely inefficient for large files, because one machine processes all the data and the parallel architecture provided by MapReduce is completely lost.
(3) Grouping (auxiliary) sorting
Group keys on the Reduce side. When the key is a bean object and you want keys with one or several identical fields (while not all fields are identical) to enter the same reduce() call, you can use grouping sort.
(4) Secondary sorting
In a custom sort, if compareTo() uses two comparison conditions, it is a secondary sort (see the sketch after this list).
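A minimal sketch of a two-condition compareTo(): sort by total traffic in descending order, and when the totals are equal, by upstream traffic in ascending order (the tie-break field is just an example):

@Override
public int compareTo(FlowBean o) {
    // First condition: total traffic, descending
    if (this.sumFlow > o.sumFlow) {
        return -1;
    } else if (this.sumFlow < o.sumFlow) {
        return 1;
    } else {
        // Second condition (example tie-break): upstream traffic, ascending
        return Long.compare(this.upFlow, o.upFlow);
    }
}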

2. Custom sorting with WritableComparable: principle analysis
When a bean object is transmitted as the key, it needs to implement the WritableComparable interface and override the compareTo() method to define the sort order.

@Override
public int compareTo(FlowBean bean) {

	int result;

	// Sort by total traffic, in descending order
	if (this.sumFlow > bean.getSumFlow()) {
		result = -1;
	} else if (this.sumFlow < bean.getSumFlow()) {
		result = 1;
	} else {
		result = 0;
	}

	return result;
}

3.3.5. WritableComparable sorting case (full sorting)

1. Requirement:
Based on the results produced by the serialization case in 2.3, sort by total traffic in descending order.
2. Requirements analysis
(1) Sort by the total traffic of each phone number
(2) Input data
(3) Output data
(4) FlowBean implements the WritableComparable interface and overrides the compareTo() method
(5) Mapper class
(6) Reducer class
3. Code
(1) The FlowBean object adds comparison functionality on top of requirement 1.
Create a new FlowBean class; override the write(), readFields(), and compareTo() methods; provide an empty-argument constructor; and define the three attributes: upstream traffic, downstream traffic, and total traffic.

package com.shenbaoyun.mapreduce.writablecompable;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements WritableComparable<FlowBean> {

    // The three attributes
    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    // Empty-argument constructor
    public FlowBean() {
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    // Implement serialization and deserialization; note that the field order must be identical
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(this.upFlow);
        out.writeLong(this.downFlow);
        out.writeLong(this.sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Compare by total traffic, in descending order
        if (this.sumFlow > o.sumFlow) {
            return -1;
        } else if (this.sumFlow < o.sumFlow) {
            return 1;
        } else {
            return 0;
        }
    }
}

(2) Write the Mapper class: define outK and outV and override the map() method. The flow is still to get the data, split it, encapsulate the upstream traffic, downstream traffic, and total traffic, and write out.

package com.shenbaoyun.mapreduce.writablecompable;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {

    private FlowBean outK = new FlowBean();
    private Text outV = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Get one line of data
        String line = value.toString();

        // 2 Split the line
        String[] split = line.split("\t");

        // 3 Wrap up the fields
        outK.setUpFlow(Long.parseLong(split[1]));
        outK.setDownFlow(Long.parseLong(split[2]));
        outK.setSumFlow();
        outV.set(split[0]);

        // 4 Write out
        context.write(outK, outV);
    }
}
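(3) Write the Reducer class. The original post jumps from the Mapper to the Driver, but the Driver below references a FlowReducer; the following is a minimal sketch of what it is assumed to look like, consistent with the Mapper above (key and value are swapped back so the phone number is the output key and the sorted FlowBean is the value):

package com.shenbaoyun.mapreduce.writablecompable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {

    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        // The FlowBean keys arrive already sorted by total traffic in descending order.
        // Iterate over the phone numbers that share the same traffic values and write
        // them out as (phone number, FlowBean) to keep the original output layout.
        for (Text value : values) {
            context.write(value, key);
        }
    }
}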


(4) Write the Driver class: obtain the Job object, associate this Driver class, associate the Mapper and Reducer, set the kv types of the map output, set the kv types of the program's final output, set the input and output paths, and submit the job.

package com.shenbaoyun.mapreduce.writablecompable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get the Job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Associate this Driver class
        job.setJarByClass(FlowDriver.class);

        // 3 Associate the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        // 4 Set the kv types of the Map output data
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        // 5 Set the kv types of the program's final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

3.3.6. WritableComparable sorting case (sorting within partitions)

1) Requirement: within each per-province output file (partitioned by phone number prefix), sort by total traffic
2) Requirements analysis
1. Data input
2. Expected output
3) Case practice
Copy the previous case and create a new class ProvincePartitioner2 that extends Partitioner, override the getPartition() method, get the first three digits of the phone number, and set the partition number based on them.

package com.shenbaoyun.mapreduce.partitionercompable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner2 extends Partitioner<FlowBean, Text> {

    @Override
    public int getPartition(FlowBean flowBean, Text text, int numPartitions) {

        // Get the first three digits of the phone number to determine the province
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);

        // Define a partition number variable and set it according to prePhone
        int partition;
        if ("136".equals(prePhone)) {
            partition = 0;
        } else if ("137".equals(prePhone)) {
            partition = 1;
        } else if ("138".equals(prePhone)) {
            partition = 2;
        } else if ("139".equals(prePhone)) {
            partition = 3;
        } else {
            partition = 4;
        }

        // Finally return the partition number
        return partition;
    }
}


(2) Add the following code to the driver class

// Set the custom partitioner
job.setPartitionerClass(ProvincePartitioner2.class);

// Set the corresponding number of ReduceTasks
job.setNumReduceTasks(5);

3.3.7. Combiner merging

(1) The Combiner is a component of an MR program besides the Mapper and the Reducer.
(2) The parent class of the Combiner component is Reducer.
(3) The difference between the Combiner and the Reducer lies in where they run: the Combiner runs on the node of each MapTask, while the Reducer globally receives the output of all Mappers.
(4) The purpose of the Combiner is to locally aggregate the output of each MapTask in order to reduce network transmission.
(5) A Combiner can only be applied if it does not affect the final business logic, and the output kv of the Combiner must match the input kv of the Reducer.
(6) Custom implementation steps
(a) Define a Combiner that extends Reducer and override the reduce() method

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        outV.set(sum);

        context.write(key, outV);
    }
}

(b) Set in the Job driver class:

job.setCombinerClass(WordCountCombiner.class);

3.3.8. Combiner merging case practice

1) Requirement:
During the word count, the output of each MapTask is locally aggregated to reduce the volume of network transmission, i.e. the Combiner feature is used.
There are two implementation options.
Option 1:
(1) Add a WordCountCombiner class that extends Reducer.
(2) Aggregate the word counts in WordCountCombiner and output the aggregated results.
Option 2:
Specify WordCountReducer as the Combiner in the WordCountDriver driver class.

job.setCombinerClass(WordCountReducer.class);

2) Requirements analysis
1. Data input
2. Expected output

3) Case practice: option 1
(1) Copy the WordCount case, rename the package to combiner, and add a WordCountCombiner class that extends Reducer

package com.shenbaoyun.mapreduce.combiner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Wrap up the sum
        outV.set(sum);

        // Write out
        context.write(key, outV);
    }
}

(2) Specify the Combiner in the WordCountDriver driver class

// Specify that a Combiner is needed, and which class provides the Combiner logic
job.setCombinerClass(WordCountCombiner.class);

4) Case practice: option 2
(1) Specify WordCountReducer as the Combiner in the WordCountDriver driver class

// Specify that a Combiner is needed, and which class provides the Combiner logic
job.setCombinerClass(WordCountReducer.class);

3.4. OutputFormat data output

3.4.1. OutputFormat interface implementation class

OutputFormat is the base class for MapReduce output; all MapReduce output implementations implement the OutputFormat interface.
1. OutputFormat implementation classes
2. Default output format: TextOutputFormat
3. Custom OutputFormat
(1) Application scenario
Output data to storage frameworks such as MySQL, HBase, and Elasticsearch.
(2) Steps to customize an OutputFormat
Define a class that extends FileOutputFormat.
Define a RecordWriter and specifically override write(), the method that writes the output data.

3.4.2. Customized OutputFormat case practice

1) Requirement
Filter the input log: lines whose website contains "shenbaoyun" are output to one file, and lines that do not are output to another file (shenbaoyun.log and other.log in this case).
(1) Input data
A log file
(2) Output data
shenbaoyun.log, other.log
2) Case practice
(1) Write LogMapper class

package com.shenbaoyun.mapreduce.outputformat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Do no processing; write out the log line directly
        context.write(value, NullWritable.get());
    }
}

(2) Write LogReducer class

package com.shenbaoyun.mapreduce.outputformat;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // In case there are duplicate lines, iterate and write each one out
        for (NullWritable value : values) {
            context.write(key, NullWritable.get());
        }
    }
}

(3) Customize a LogOutputFormat class

package com.shenbaoyun.mapreduce.outputformat;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        // Create and return a custom RecordWriter
        LogRecordWriter logRecordWriter = new LogRecordWriter(job);
        return logRecordWriter;
    }
}

(4) Write the LogRecordWriter class

package com.shenbaoyun.mapreduce.outputformat;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import java.io.IOException;

public class LogRecordWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream shenbaoyunOut;
    private FSDataOutputStream otherOut;

    public LogRecordWriter(TaskAttemptContext job) {
        try {
            // Get the file system object
            FileSystem fs = FileSystem.get(job.getConfiguration());
            // Use the file system object to create two output streams for the two target files
            shenbaoyunOut = fs.create(new Path("d:/hadoop/shenbaoyun.log"));
            otherOut = fs.create(new Path("d:/hadoop/other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        String log = key.toString();
        // Decide which output stream to write to based on whether the log line contains "shenbaoyun"
        if (log.contains("shenbaoyun")) {
            shenbaoyunOut.writeBytes(log + "\n");
        } else {
            otherOut.writeBytes(log + "\n");
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        // Close the streams
        IOUtils.closeStream(shenbaoyunOut);
        IOUtils.closeStream(otherOut);
    }
}

(5) Write the LogDriver class

package com.shenbaoyun.mapreduce.outputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class LogDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(LogDriver.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the custom OutputFormat
        job.setOutputFormatClass(LogOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\input"));
        // Although we defined a custom OutputFormat, it extends FileOutputFormat,
        // and FileOutputFormat needs to write a _SUCCESS file, so an output
        // directory still has to be specified here
        FileOutputFormat.setOutputPath(job, new Path("D:\\logoutput"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

3.5. MapReduce kernel source code analysis

3.5.1. MapTask working mechanism

MapTask work is divided into five stages: the Read stage, the Map stage, the Collect stage, the Spill (overflow) stage, and the Merge stage.
(1) Read stage: MapTask uses the RecordReader obtained by InputFormat to parse out the keys and values ​​one by one from the input InputSplit.
(2) Map stage: This node mainly hands over the parsed keys and values ​​to the map() function written by the user, and generates a series of new keys and values.
(3) Collect stage: in the user-written map() function, when data processing is completed, OutputCollector.collect() is generally called to output the results. Inside this function, the key/value pairs are partitioned (by calling the Partitioner) and written into a ring (circular) memory buffer.
(4) Spill stage: that is, the overflow stage. When the ring buffer is full, MapReduce will write the data to the local disk and generate a temporary file. It should be noted that before writing the data to the local disk, the data must be sorted locally, and if necessary, the data must be merged and compressed.
Details of the overflow stage:
Step 1: Use the quick sort algorithm to sort the data in the cache area. The sorting method is to first sort according to the partition number Partition, and then sort according to the key. In this way, after sorting, the data is gathered together in units of partitions, and all data in the same partition is ordered by key.
Step 2: Write the data in each partition to the temporary file output/spillN.out in the task working directory in ascending order according to the partition number (N represents the current number of spills). If the user sets a Combiner, an aggregation operation is performed on the data in each partition before writing to the file.
Step 3: Write the metainformation of the partition data into the memory index data structure SpillRecord. The metainformation of each partition includes the offset in the temporary file, the data size before compression and the data size after compression. If the current memory index size exceeds 1MB, write the memory index to the file output/spillN.out.index.
(5) Merge stage: After all data processing is completed, MapTask merges all temporary files to ensure that only one file will be formed in the end.
When all data processing is completed, MapTask merges all temporary files into one large file and saves it to output/file.out, and generates the corresponding index file output/file.out.index.
During file merging, MapTask merges files in units of partitions. For a given partition, it uses multiple rounds of recursive merging: in each round it merges mapreduce.task.io.sort.factor (default 10) files and adds the resulting file back to the list of files to be merged. This process repeats until a single large file remains.
Let each MapTask ultimately generate only one data file to avoid the overhead of random reads caused by opening and reading a large number of files at the same time.
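To make the Spill and Merge descriptions above concrete, here is a minimal, hedged sketch of where these knobs live in a driver. The property keys (mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent, mapreduce.task.io.sort.factor) are the standard Hadoop ones; the class name MapSideTuning and the values shown (the usual defaults) are illustrative only, not a recommended configuration.

import org.apache.hadoop.conf.Configuration;

public class MapSideTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Size in MB of the circular in-memory buffer used in the Collect stage (default 100)
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Fill ratio at which the buffer spills to disk (default 0.80)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Number of spill files merged per round in the Merge stage (default 10)
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        // Pass conf to Job.getInstance(conf) when building the job, as in the drivers above
    }
}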

3.5.2. ReduceTask working mechanism

ReduceTask work is divided into three stages: the Copy stage, the Sort stage, and the Reduce stage.
(1) Copy stage: ReduceTask remotely copies a piece of data from each MapTask, and for a certain piece of data, if its size exceeds a certain threshold, it is written to the disk. Otherwise, it is placed directly into memory.
(2) Sort phase: While copying data remotely, ReduceTask starts two background threads to merge files in memory and disk to prevent excessive memory usage or too many files on disk. According to MapReduce semantics, the input data of the reduce method written by the user is a set of data aggregated according to the key. In order to group data with the same key together, Hadoop adopts a sorting-based strategy. Since each MapTask has implemented partial sorting of its own processing results, the ReduceTask only needs to merge and sort all the data once.
(3) Reduce stage: the user-written reduce() method processes each group of data and writes the results to HDFS.
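
Likewise, a hedged sketch of the reduce-side shuffle parameters that govern the Copy and Sort stages described above. The property keys are standard Hadoop configuration names; the class name ReduceSideTuning and the values (the usual defaults) are purely illustrative.

import org.apache.hadoop.conf.Configuration;

public class ReduceSideTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Number of parallel copy threads fetching Map output in the Copy stage (default 5)
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
        // Fraction of reducer heap used to buffer fetched Map output before spilling (default 0.70)
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Usage threshold of that buffer at which the in-memory merge starts (default 0.66)
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // Pass conf to Job.getInstance(conf) when building the job
    }
}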

3.5.3. ReduceTask parallelism determination mechanism

Review: MapTask parallelism is determined by slicing, and the number of slices is determined by the input file and slicing rules.
Question: who decides the parallelism of ReduceTask?
1) Setting ReduceTask parallelism
ReduceTask parallelism also affects the concurrency and execution efficiency of the whole job. Unlike MapTask, whose concurrency is determined by the number of slices, the number of ReduceTasks can be set manually:

// The default value is 1; here it is set manually to 4
job.setNumReduceTasks(4);

2) Notes
(1) ReduceTask = 0 means there is no Reduce stage, and the number of output files equals the number of MapTasks.
(2) The default number of ReduceTasks is 1, so by default there is one output file.
(3) If the data distribution is uneven, data skew may occur in the Reduce stage.
(4) The number of ReduceTasks cannot be set arbitrarily; business logic must also be considered. In some cases a global summary result is needed, and then there can be only one ReduceTask.
(5) The exact number of ReduceTasks should be determined according to cluster performance.
(6) If the number of partitions is greater than 1 but there is only one ReduceTask, the partitioning process is not executed, because in the MapTask source code the precondition for partitioning is that the number of ReduceTasks is greater than 1.

3.6. Join application

3.6.1. Reduce Join

The main work on the Map side is to tag the key/value pairs coming from different tables or files so that records from different sources can be distinguished: the join field is used as the key, and the remaining fields plus the newly added flag are used as the value, which are then written out.
The main work on the Reduce side: grouping by the join field (the key) has already been done by the shuffle, so within each group we only need to separate the records that came from different files (marked in the Map stage) and then merge them.

3.6.2. ReduceJoin case

(1) Create a TableBean class that combines products and orders

package com.shenbaoyun.mapreduce.reducejoin;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class TableBean implements Writable {

    private String id;     // order id
    private String pid;    // product id
    private int amount;    // product quantity
    private String pname;  // product name
    private String flag;   // flag indicating whether the record comes from the order table or the pd table

    public TableBean() {
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    @Override
    public String toString() {
        return id + "\t" + pname + "\t" + amount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
        out.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
        this.pname = in.readUTF();
        this.flag = in.readUTF();
    }
}

(2) Write the TableMapper class

package com.shenbaoyun.mapreduce.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {

    private String filename;
    private Text outK = new Text();
    private TableBean outV = new TableBean();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the name of the file this split belongs to
        InputSplit split = context.getInputSplit();
        FileSplit fileSplit = (FileSplit) split;
        filename = fileSplit.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Get one line
        String line = value.toString();

        // Determine which file the line comes from and handle it accordingly
        if (filename.contains("order")) {
            // Handle the order table
            String[] split = line.split("\t");
            // Build outK
            outK.set(split[1]);
            // Build outV
            outV.setId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Integer.parseInt(split[2]));
            outV.setPname("");
            outV.setFlag("order");
        } else {
            // Handle the product table
            String[] split = line.split("\t");
            // Build outK
            outK.set(split[0]);
            // Build outV
            outV.setId("");
            outV.setPid(split[0]);
            outV.setAmount(0);
            outV.setPname(split[1]);
            outV.setFlag("pd");
        }

        // Write out the key/value pair
        context.write(outK, outV);
    }
}

(3) Write the TableReducer class

package com.shenbaoyun.mapreduce.reducejoin;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {

        ArrayList<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();

        for (TableBean value : values) {

            // Determine which table the record comes from
            if ("order".equals(value.getFlag())) {
                // Order table: create a temporary TableBean to receive the value
                TableBean tmpOrderBean = new TableBean();

                try {
                    BeanUtils.copyProperties(tmpOrderBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }

                // Add the temporary TableBean to the orderBeans list
                orderBeans.add(tmpOrderBean);
            } else {
                // Product table
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }

        // Iterate over orderBeans, replace each orderBean's pid with the pname, and write it out
        for (TableBean orderBean : orderBeans) {
            orderBean.setPname(pdBean.getPname());
            // Write out the updated orderBean
            context.write(orderBean, NullWritable.get());
        }
    }
}

(4) Write the TableDriver class

package com.shenbaoyun.mapreduce.reducejoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class TableDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(TableDriver.class);
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);

        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

2) Summary
In this approach, the join is completed in the Reduce stage, so the Reduce side bears heavy processing pressure while the load on the Map side is very low; resource utilization is poor, and data skew is easily produced in the Reduce stage.

3.6.3. Map Join

1) Usage scenario
Map Join is suitable for scenarios where a large table is joined with one or more small tables.

2) What if data is skewed on the Reduce side?
Cache the small table(s) on the Map side and complete the join logic there. This shifts work to the Map side, reduces the data pressure on the Reduce side, and mitigates data skew.

3) The specific method is to use DistributedCache
(1) Read the file into the cache collection during the setUp stage of Mapper.
(2) Load cache in Driver driver class

// Cache an ordinary file to the nodes where the tasks run.
//job.addCacheFile(new URI("file:///e:/cache/XXX"));
// When running on a cluster, an HDFS path must be used
job.addCacheFile(new URI("hdfs://hadoop100:8020/cache/XXX"));

3.6.4. MapJoin case

1) MapJoin is suitable for scenarios where large tables are associated with small tables
2) Operation steps
(1) DistributedCacheDriver cache files

// Load the cached data
job.addCacheFile(new URI("hdfs://hadoop100:8020/cache/XXX"));

// The Map-side join does not need a Reduce stage, so set the number of ReduceTasks to 0
job.setNumReduceTasks(0);

(2) Read the cached file data
setup method:
1. Get the cached file
2. Read the cached file line by line
3. Split each line
4. Store the data in a collection
5. Close the stream

map method:
1. Get a line
2. Split it
3. Get the pid
4. Get the order id and look up the product name
5. Concatenate the fields
6. Write out

(3) Implementation
(1) First add the cache file in the MapJoinDriver driver class

package com.shenbaoyun.mapreduce.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MapJoinDriver {

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {

        // 1. Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar class
        job.setJarByClass(MapJoinDriver.class);
        // 3. Associate the mapper
        job.setMapperClass(MapJoinMapper.class);
        // 4. Set the Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Load the cached data
        job.addCacheFile(new URI("file:///D:/input/tablecache/pd.txt"));
        // The Map-side join does not need a Reduce stage, so set the number of ReduceTasks to 0
        job.setNumReduceTasks(0);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\output"));
        // 7. Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

(2) Read the cache file in the setup method in the MapJoinMapper class

package com.shenbaoyun.mapreduce.mapjoin;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Map<String, String> pdMap = new HashMap<>();
    private Text text = new Text();

    // Before the task starts, cache the pd data into pdMap
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {

        // Get the small-table data pd.txt from the cache file
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0]);

        // Get the file system object and open a stream
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream fis = fs.open(path);

        // Wrap the stream in a reader so it can be read line by line
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

        // Read and process the file line by line
        String line;
        while (StringUtils.isNotEmpty(line = reader.readLine())) {
            // Split one line, e.g. "01\t小米"
            String[] split = line.split("\t");
            pdMap.put(split[0], split[1]);
        }

        // Close the stream
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Read a line of the big table, e.g. "1001\t01\t1"
        String[] fields = value.toString().split("\t");

        // Use the pid of the big-table line to look up the pname in pdMap
        String pname = pdMap.get(fields[1]);

        // Replace the pid in the big-table line with the pname
        text.set(fields[0] + "\t" + pname + "\t" + fields[2]);

        // Write out
        context.write(text, NullWritable.get());
    }
}

3.7 Data cleaning (ETL)

ETL is the abbreviation of Extract-Transform-Load in English, which is used to describe the process of extracting, transforming, and loading data from the source to the destination. ETL is often used in data warehouses, but is not limited to data warehouses.
Before running the core business MapReduce program, the data often needs to be cleaned to remove data that does not meet user requirements. The cleaning process often only requires running the Mapper program and does not need to run the Reduce program.
1) Requirement
Remove log lines whose number of fields is less than or equal to 11.
2) Implementation
(1) Write the WebLogMapper class

package com.shenbaoyun.mapreduce.weblog;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. Get one line of data
        String line = value.toString();

        // 2. Parse the log line
        boolean result = parseLog(line, context);

        // 3. If the line is invalid, skip it
        if (!result) {
            return;
        }

        // 4. If the line is valid, write it out directly
        context.write(value, NullWritable.get());
    }

    // Helper that parses a log line and decides whether it is valid
    private boolean parseLog(String line, Context context) {

        // 1. Split the line
        String[] fields = line.split(" ");

        // 2. A line with more than 11 fields is valid
        return fields.length > 11;
    }
}

(2) Write the WebLogDriver class

package com.shenbaoyun.mapreduce.weblog;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WebLogDriver {

    public static void main(String[] args) throws Exception {

        // The input and output paths need to match the actual paths on your machine
        args = new String[] { "D:/input/inputlog", "D:/output1" };

        // 1. Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar class
        job.setJarByClass(WebLogDriver.class);

        // 3. Associate the mapper
        job.setMapperClass(WebLogMapper.class);

        // 4. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the number of ReduceTasks to 0
        job.setNumReduceTasks(0);

        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6. Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

3.8. Summary of MapReduce development

1) The input data interface is InputFormat
(1) The default implementation class is TextInputFormat
(2) The functional logic of TextInputFormat is to read one line of text at a time, and then use the starting offset of the line as the key and the line content as the value to return.
(3) CombineTextInputFormat can combine multiple small files into one slice for processing, improving work efficiency.
2) Logical processing interface: Mapper
Users implement the map(), setup(), and cleanup() methods according to business requirements.
3) Partitioner partitioning
(1) The default implementation is HashPartitioner; its logic is to derive the partition number from the hash of the key and the number of reduces: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
(2) If there are special business needs, you can customize the partitioner; a hedged sketch of a custom Partitioner follows this summary.
4) Comparable sorting
(1) When we use a custom object as a key to output, we must implement the WritableComparable interface and rewrite the compareTo method.
(2) Partial sorting: Internal sorting is performed on each file in the final output.
(3) Full sorting: Sort all data, usually only one Reduce.
(4) Secondary sorting: There are two conditions for sorting.
5) Combiner merging
Combiner merging can improve program execution efficiency and reduce IO transmission, but it cannot affect the original business processing results when used.
6) Logical processing interface: Reducer
Users implement the reduce(), setup(), and cleanup() methods according to business needs.
7) Output data interface
(1) The default implementation class is TextOutputFormat, and the functional logic is to output one line for each kv pair to the target text file.
(2) Users can also customize OutputFormat.
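
As referenced in point 3) above, here is a minimal sketch of a custom Partitioner. It assumes a hypothetical job whose Map output key is a phone number held in a Text; the class name ProvincePartitioner and the "136" prefix rule are invented for illustration and do not come from the case studies above.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, NullWritable> {

    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
        // Keys starting with "136" go to partition 0, everything else to partition 1
        return key.toString().startsWith("136") ? 0 : 1;
    }
}

In the driver this would be registered with job.setPartitionerClass(ProvincePartitioner.class), and the number of ReduceTasks would be set to match the number of partitions, e.g. job.setNumReduceTasks(2).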

4. Hadoop data compression

4.1. Overview

1) Advantages and disadvantages of compression
Advantages of compression: reduces disk IO and disk storage space.
Disadvantages of compression: increases CPU overhead.
2) Compression principles
(1) For compute-intensive jobs, use less compression.
(2) For IO-intensive jobs, use more compression.

4.2. Compression coding supported by MR

1) Comparative introduction to compression algorithms

Compression format | Built into Hadoop? | Algorithm | File extension | Splittable? | Changes needed after switching to this format
DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | None; handled the same as plain text
Gzip | Yes, usable directly | DEFLATE | .gz | No | None; handled the same as plain text
bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | None; handled the same as plain text
LZO | No, must be installed | LZO | .lzo | Yes | An index must be built and the input format specified
Snappy | Yes, usable directly | Snappy | .snappy | No | None; handled the same as plain text

2) Comparison of compression performance

Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed
gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s
bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s
LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s

http://google.github.io/snappy/

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

4.3 Compression method selection

When selecting a compression method, key considerations include: compression/decompression speed, compression ratio (storage size after compression), and whether slicing can be supported after compression.

4.3.1 Gzip compression

Advantages: The compression rate is relatively high;
Disadvantages: Split is not supported; compression/decompression speed is average;

4.3.2 Bzip2 compression

Advantages: high compression rate; supports Split.
Disadvantages: slow compression/decompression speed.

4.3.3 Lzo compression

Advantages: Fast compression/decompression; supports Split;
Disadvantages: average compression rate; additional indexes are required to support slicing.

4.3.4 Snappy compression

Advantages: Fast compression and decompression;
Disadvantages: Split is not supported; compression rate is average;

4.3.5 Compression position selection

Compression can be enabled at any stage of a MapReduce job: on the input data, on the Map output (the shuffle data), or on the final Reduce output.
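
For illustration, a hedged driver-side sketch of turning compression on for the Map output and for the final output. The property names, FileOutputFormat methods, and codec classes are standard Hadoop APIs; the codec choices (Snappy for the shuffle, BZip2 for the final output) and the class name CompressionConfigSketch are just one reasonable combination chosen here, not a recommendation from the original text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the Map output (shuffle data); Snappy trades compression ratio for speed
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf);

        // Compress the final Reduce output; BZip2 is splittable, which helps if another job reads it
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // ... set mapper/reducer, key/value types, and input/output paths as in the drivers above
    }
}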

5. Common errors and solutions

1) Importing the wrong packages is a common source of errors, especially for Text and CombineTextInputFormat.
2) The first input parameter of a Mapper must be LongWritable or NullWritable, not IntWritable; otherwise a type conversion exception is reported.
3) java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) indicates that the partition number and the number of ReduceTasks do not match; adjust the number of ReduceTasks.
4) If the number of partitions is greater than 1 but there is only one ReduceTask, is the partitioning process executed?
The answer is no: in the MapTask source code, the precondition for partitioning is that the number of ReduceTasks is greater than 1; if it is not, partitioning is skipped.
5) Import the jar package compiled in the Windows environment and run it in the Linux environment.
hadoop jar wc.jar com.shenbaoyun.mapreduce.wordcount.WordCountDriver /user/shenbaoyun/ /user/shenbaoyun/output
reports the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: com/shenbaoyun/mapreduce/wordcount/WordCountDriver: Unsupported major.minor version 52.0
The reason is that the jar was compiled with JDK 1.8 in the Windows environment (class file version 52.0 corresponds to Java 8), while the Linux environment runs JDK 1.7.
Solution: Unify jdk version.
6) In the case of caching the pd.txt small file, it is reported that the pd.txt file cannot be found.
Reason: most of the time the path is written incorrectly. Another common cause is that the file is actually named pd.txt.txt. On some computers the file cannot be found when a relative path is used; changing it to an absolute path solves the problem.
7) Type conversion exception reported.
Usually there are programming errors when setting the Map output and final output in the driver function.
If the keys output by the Map are not sorted, a type conversion exception will also be reported.
8) When running wc.jar in the cluster, the input file cannot be obtained.
Reason: The input file of the WordCount case cannot be placed in the root directory of the HDFS cluster.
