1. Overview of MapReduce
1.1. Definition of MapReduce
MapReduce is a programming framework for distributed computing programs and the core framework for users to develop "Hadoop-based data analysis applications".
The core function of MapReduce is to combine the business logic code written by the user with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
1.2. Advantages and disadvantages of MapReduce
1.2.1 Advantages
- Easy to program
By simply implementing a few interfaces, the user obtains a complete distributed program that can be distributed across a large number of cheap PC machines. Writing a distributed MapReduce program feels much like writing a simple serial program, which is why MapReduce programming became so popular.
- Good scalability
When computing resources run short, computing power can be expanded simply by adding machines.
- High fault tolerance
MapReduce was designed from the start to run on cheap PC machines, so it must tolerate failures. If one machine goes down, its computing tasks are transferred to another node so the job does not fail. This process requires no manual intervention; it is handled entirely by Hadoop.
- Suitable for offline processing of massive data at the PB level and above
It can run concurrently on clusters of thousands of servers, providing large-scale data processing capability.
1.2.2 Disadvantages
- Not good at real-time computing
MapReduce cannot return results in milliseconds or seconds the way MySQL can.
- Not good at streaming computing
The input data of streaming computing is dynamic, while the input data set of MapReduce is static and cannot change while the job runs; the design of MapReduce itself requires a static data source.
- Not good at DAG (directed acyclic graph) computation
In a DAG, multiple jobs depend on one another, and the input of each job is the output of the previous one. It is not that MapReduce cannot do this, but each MapReduce job writes its output to disk, which causes heavy disk I/O and very poor performance.
1.3. The core idea of MapReduce
- Distributed computing programs often need to be divided into at least two stages.
- The concurrent instances of MapTask in the first stage run completely in parallel and are independent of each other.
- The concurrent instances of ReduceTask in the second stage are independent of each other, but their data depends on the output of all concurrent instances of MapTask in the previous stage.
- The MapReduce programming model can contain only one Map stage and one Reduce stage. If the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.
Summary: Analyze the WordCount data flow to deeply understand the core idea of MapReduce.
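To make the WordCount data flow concrete, the Map, shuffle, and Reduce stages can be simulated in plain Java with no Hadoop dependency. The class and method names below are illustrative only, not Hadoop APIs:

```java
import java.util.*;

public class MiniMapReduce {

    // "Map" stage: each line is split into words, emitting a (word, 1) pair per word
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Shuffle" + "Reduce" stage: group pairs by key and sum the values per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("sherry sherry banzhang", "cls cls wly");
        System.out.println(reduce(map(input))); // prints {banzhang=1, cls=2, sherry=2, wly=1}
    }
}
```

In real MapReduce the map() calls run in parallel MapTasks and the grouping-by-key happens in the shuffle between stages, but the per-key sum in reduce() is exactly the logic the WordCount Reducer implements.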
1.4. MapReduce processes
When run in distributed mode, a complete MapReduce program has three types of instance processes:
- MrAppMaster: responsible for the process scheduling and status coordination of the entire program.
- MapTask: Responsible for the entire data processing process in the Map phase.
- ReduceTask: Responsible for the entire data processing process of the Reduce phase.
1.5. Official WordCount source code
Decompiling the official WordCount jar shows that the example consists of a Mapper class, a Reducer class, and a driver class, and that the data types are the serialized types encapsulated by Hadoop itself.
1.6. Commonly used data serialization types
| Java type | Hadoop Writable type |
|---|---|
| Boolean | BooleanWritable |
| Byte | ByteWritable |
| Int | IntWritable |
| Float | FloatWritable |
| Long | LongWritable |
| Double | DoubleWritable |
| String | Text |
| Map | MapWritable |
| Array | ArrayWritable |
| Null | NullWritable |
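Hadoop uses these Writable types because keys and values must be serialized into a compact binary form to be shuffled across the network and spilled to disk. As a rough illustration in plain Java (no Hadoop dependency; the class name and helper method are made up for this sketch, and `writeUTF` is only a stand-in for Text's actual var-length encoding), an IntWritable/Text-style round trip looks like:

```java
import java.io.*;

public class WritableSketch {

    // Serialize an int and a string to compact binary, roughly the way
    // IntWritable.write() and Text.write() serialize their fields
    static byte[] serialize(int count, String word) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(count);  // IntWritable writes a fixed 4-byte int
        out.writeUTF(word);   // stand-in for Text's var-length UTF-8 encoding
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize(2, "sherry");
        // Deserialize: read the fields back in the same order they were written
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        System.out.println(in.readInt() + " " + in.readUTF()); // prints "2 sherry"
    }
}
```

The real Writable interface works the same way: `write(DataOutput)` and `readFields(DataInput)` must serialize and deserialize fields in matching order.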
1.7. MapReduce program specification
The program written by the user is divided into three parts: Mapper, Reducer and Driver.
- Mapper stage
- A user-defined Mapper must extend its own parent class
- The input data of the Mapper is in the form of KV pairs (the KV types can be customized)
- The business logic of the Mapper is written in the map() method
- The output data of the Mapper is in the form of KV pairs (the KV types can be customized)
- The map() method (in the MapTask process) is called once for each <K, V> pair
- Reducer stage
- A user-defined Reducer must extend its own parent class
- The input data types of the Reducer correspond to the output data types of the Mapper, and are also KV pairs
- The business logic of the Reducer is written in the reduce() method
- The ReduceTask process calls the reduce() method once for each group of <K, V> pairs with the same key
- Driver stage
The Driver is equivalent to a client of the YARN cluster. It submits the entire program to the YARN cluster; what is submitted is a Job object that encapsulates the runtime parameters of the MapReduce program.
1.8. WordCount case practice
1.8.1 Local testing
- Requirement
Count the total number of occurrences of each word in a given text. Prepare a data file and upload it to HDFS, for example:
sherry sherry banzhang banzhang cls cls wly wly hadoop xue sss
- Requirements analysis
According to the MapReduce programming specification, write the Mapper, Reducer, and Driver separately.
- Environment preparation
- Create a Maven project => MapReduce
- Add the following dependencies to the pom.xml file
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.30</version>
</dependency>
</dependencies>
- In the src/main/resources directory of the project, create a new file named "log4j.properties" and fill it with:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
- Create the package: com.sherry.MapReduce.wordcount
- Programming
- Write the Mapper class
package com.sherry.MapReduce.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split it into words
        String[] words = line.split(" ");
        // 3. Emit a (word, 1) pair for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
- Write the Reducer class
package com.sherry.MapReduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Emit the total
        v.set(sum);
        context.write(key, v);
    }
}
- Write the Driver class
package com.sherry.MapReduce.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Associate the jar of this Driver program
        job.setJarByClass(WordCountDriver.class);
        // 3. Associate the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. Set the KV types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the KV types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
- Local test
- First configure the HADOOP_HOME environment variable and the Windows runtime dependencies
- Run the program in IDEA/Eclipse
1.8.2 Testing on the cluster
- Package the program into a jar. If the project already has a target folder, clean it first. Rename the jar without dependencies to wc.jar and copy it to the /opt/module/hadoop-3.1.3/myjar path on the Hadoop cluster.
- Execute the wordcount program. Remember to start HDFS and YARN before running it, and adjust the input path to wherever your files were uploaded:
hadoop jar wc.jar com.sherry.MapReduce.wordcount.WordCountDriver /wordcount/input/hello.txt /wordcount/output/wc