MapReduce
1. MapReduce concept
MapReduce is a programming framework for distributed computing programs. Its core function is to combine user-written business logic code with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
MapReduce is easy to program, scales well, and is well suited to batch processing of petabyte-level data; however, it is not suited to real-time computing, streaming computing, or DAG (directed-graph) computing.
2. MapReduce design concept
The MapReduce model is mainly divided into the following modules: Input, Split, Map, Shuffle, Reduce, etc.
Input: reads the data. InputFormat splits the file into multiple InputSplits, and a RecordReader converts each InputSplit into standard <key, value> pairs that serve as the input to map;
Split: in this stage the data is coarsely divided, by line, to produce <Key, Value> records;
Map: fine-grained processing that produces <Key, List> data. Records are sorted and partitioned in the ring buffer, and when the data volume grows large they are spilled to disk; the buffer size (100 MB by default) has a major effect on MR task performance. A Combine task can be configured in this stage to perform initial aggregation over identical keys (when the number of spill files exceeds 3, the Combiner is also applied during their merge);
Combine: preliminary aggregation during the merge, mainly governed by partition number and identical keys; data within the same partition is kept in sorted order;
Shuffle: "shuffling", that is, combining the results of each MapTask and delivering them to Reduce. The data transfer here is a copy process that involves network I/O, making this stage both time-consuming and central to the whole job;
Reduce: merges the copied data fragments; merge sorting is involved;
Partition: custom output partitioning is supported; the default partitioner is HashPartitioner, which uses the formula: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
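The formula above can be demonstrated in plain Java without any Hadoop dependency; this is a minimal sketch (the class and method names here are illustrative, but the arithmetic matches the HashPartitioner formula quoted above):

```java
// Plain-Java sketch of the default hash partitioning formula.
public class HashPartitionDemo {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result
    // is non-negative even when hashCode() returns a negative value.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every key deterministically maps to a partition in [0, numReduceTasks)
        System.out.println(getPartition("hadoop", 3));
        System.out.println(getPartition("spark", 3));
    }
}
```

Because the mapping depends only on the key's hash, all records sharing a key land in the same partition, and therefore reach the same ReduceTask.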
Let's walk through the concrete running process of MapReduce:
- Before the client submits the job, InputFormat divides the data into splits according to the configured strategy (the split size defaults to the block size, 128 MB), and each split is handed to one MapTask (YARN is responsible for submission);
- The MapTask runs, generates <K, V> pairs according to the map function, writes the results to the ring buffer, and then partitions, sorts, and spills them;
- Shuffle, that is, divide the map results into multiple partitions and assign them to multiple reduce tasks. This process is called Shuffle.
- Reduce: copies the partitioned map output (the fetch process, with 5 copy threads by default) and performs the merge operation once all copies complete.
The distributed characteristic is that one Job has many MapTasks, Shuffles, and Reduces. The following figure is a good illustration of this distributed parallel process.
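The map → shuffle → reduce flow in the steps above can also be simulated in a single JVM with plain Java collections; this is only an illustrative sketch (no Hadoop machinery, no parallelism), but it shows exactly what each phase contributes:

```java
import java.util.*;

// Single-JVM sketch of the map -> shuffle -> reduce flow for word count.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a <word, 1> pair for every token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle phase: group values by key; a TreeMap also sorts the keys,
        // mirroring the sorting Hadoop performs before reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: sum the value list of each key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello world", "hello hadoop")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

In real MapReduce the three phases run on different machines, with the shuffle copying data across the network; here they are just three loops, which is why the framework's value lies in distribution rather than in the algorithm itself.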
3. Writing a MapReduce program
MR programming framework:
1) Mapper stage
(1) A user-defined Mapper must extend the framework's Mapper parent class
(2) The input data of Mapper is in the form of KV pairs (the type of KV can be customized)
(3) The business logic in Mapper is written in the map() method
(4) The output data of Mapper is in the form of KV pairs (the type of KV can be customized)
(5) The MapTask process calls the map() method once for each <K, V> pair
2) Reducer stage
(1) A user-defined Reducer must extend the framework's Reducer parent class
(2) The input data type of Reducer corresponds to the output data type of Mapper, which is also KV
(3) The business logic of the Reducer is written in the reduce() method
(4) The ReduceTask process calls the reduce() method once for each group of <k, v> pairs sharing the same k
3) Driver stage
The entire program needs a Driver to submit it; what gets submitted is a Job object describing all the necessary information.
4. MapReduce classic word-frequency statistics case
Now write the first MapReduce program to implement the WordCount case:
Environment preparation:
Create a new Maven project in IDEA, introduce the Hadoop core dependencies for the matching version, and add the configuration files to resource management:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
</dependency>
1. Write a map program
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    // Override the map method to implement the business logic
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split it into words
        String[] words = line.split(" ");
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
2. Write the reduce program
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // The subclass constructor implicitly calls super(); the default
        // super.reduce(key, values, context) is deliberately not called here.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
3. Write the driver class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar load path
        job.setJarByClass(WordCountDriver.class);
        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        // 4. Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final (reduce) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Then use IDEA to package the main classes into a jar: in Project Structure, add an Artifact, choose "From modules with dependencies", select your own main class, and set the jar output directory; then, via the Build menu, choose Build Artifacts -> Build to produce the jar. Finally, send the jar to the cluster environment with an FTP tool and run it with the following command.
hadoop jar <jar file> <fully-qualified main class> <input dir> <output dir>
5. Development skills
So how do you run the MapReduce program? In actual production development, the program needs to be tested locally before being packaged into a jar and released to the cluster. First, we need to set the main class and Configuration, and supply the input and output arguments in advance; then switch the file system to local operation mode:
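A minimal sketch of such a local-mode configuration, assuming Hadoop 2.x property names (adjust paths and values to your own environment); these two properties make the driver read from the local filesystem and run the tasks in-process without YARN:

```java
// Sketch: point the job at the local filesystem and local runner for debugging.
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "file:///");          // local filesystem instead of HDFS
configuration.set("mapreduce.framework.name", "local"); // run tasks in-process, no YARN
Job job = Job.getInstance(configuration);
```

With this in place, the program can be run directly from IDEA's main method with local input and output directories as arguments, and debugged with breakpoints before being packaged for the cluster.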