Original author: Mr. Li
Original article: Basic principles and applications of MapReduce
Table of Contents
1. Introduction to the MapReduce model
  1. Map and Reduce functions
  2. MapReduce Architecture
  3. MapReduce workflow
  4. MapReduce application execution process
2. WordCount running example
  1. Map process of WordCount
  2. Reduce process of WordCount
  3. WordCount source code
1. Introduction to the MapReduce model
MapReduce abstracts the complex parallel computation running on a large-scale cluster into two functions, Map and Reduce, following a "divide and conquer" strategy: a large data set stored in a distributed file system is divided into many independent fragments (splits), which can be processed in parallel by multiple Map tasks.
1. Map and Reduce functions
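Conceptually, a Map function turns one input record into a list of intermediate <key, value> pairs, and a Reduce function folds all values that share a key into a final result. A minimal, framework-free sketch in plain Java (the class and method names here are illustrative, not the Hadoop API):

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceConcept {
    // Map: one input line -> list of intermediate <word, 1> pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w.toLowerCase(), 1))
                .collect(Collectors.toList());
    }

    // Reduce: one key and all of its intermediate values -> final <word, count>
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = values.stream().mapToInt(Integer::intValue).sum();
        return Map.entry(key, sum);
    }

    public static void main(String[] args) {
        System.out.println(map("Hello World Hello"));       // [hello=1, world=1, hello=1]
        System.out.println(reduce("hello", List.of(1, 1))); // hello=2
    }
}
```

In the real framework, Map tasks run these two functions on different machines; here they are ordinary methods so the data flow is easy to see.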
2. MapReduce Architecture
The MapReduce architecture is mainly composed of four parts: Client, JobTracker, TaskTracker, and Task.
1)Client
The MapReduce program written by the user is submitted to the JobTracker through the Client. The user can view the running status of the job through some interfaces provided by the Client.
2) JobTracker
JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs; once a failure is detected, it transfers the affected tasks to other nodes. JobTracker also tracks task execution progress, resource usage, and other information, and passes this information to the task scheduler (TaskScheduler), which selects suitable tasks to use resources when they become idle.
3)TaskTracker
TaskTracker periodically reports the resource usage and task progress on its node to JobTracker through a "heartbeat", and in turn receives commands from JobTracker and performs the corresponding operations (such as starting new tasks or killing tasks). TaskTracker uses "slots" to divide the amount of resources (CPU, memory, etc.) on the node; a Task gets the opportunity to run only after it obtains a slot, and the role of the Hadoop scheduler is to allocate the idle slots on each TaskTracker to Tasks. Slots are divided into Map slots and Reduce slots, which are used by Map Tasks and Reduce Tasks respectively.
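The slot model above can be sketched in a few lines of plain Java; the class, field names, and slot counts here are illustrative, not Hadoop internals:

```java
// Illustrative sketch of slot-based allocation: a TaskTracker advertises a fixed
// number of Map and Reduce slots, and a task may run only if a free slot of its
// kind exists on the node.
public class SlotSketch {
    int freeMapSlots;
    int freeReduceSlots;

    SlotSketch(int mapSlots, int reduceSlots) {
        this.freeMapSlots = mapSlots;
        this.freeReduceSlots = reduceSlots;
    }

    // Returns true if the task obtained a slot and may start running.
    boolean tryAssign(String taskType) {
        if (taskType.equals("MAP") && freeMapSlots > 0) {
            freeMapSlots--;
            return true;
        }
        if (taskType.equals("REDUCE") && freeReduceSlots > 0) {
            freeReduceSlots--;
            return true;
        }
        return false; // no free slot of this kind: the task must wait
    }
}
```

Note that a Map task cannot use an idle Reduce slot and vice versa, which is one of the rigidity problems that YARN later addressed with general-purpose containers.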
4)Task
Task is divided into Map Task and Reduce Task, both of which are started by TaskTracker.
3. MapReduce workflow
1) Overview of workflow
- No communication between different Map tasks
- No information exchange occurs between different Reduce tasks
- Users cannot explicitly send messages from one machine to another
- All data exchange is achieved through the MapReduce framework itself
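These constraints mean the only way intermediate data moves between tasks is the framework's shuffle, which groups all Map outputs by key before handing them to Reduce. A framework-free simulation of that flow for word counting (illustrative, not Hadoop code):

```java
import java.util.*;

public class ShuffleSketch {
    // Simulates map -> shuffle (group by key) -> reduce for word counting.
    static Map<String, Integer> run(List<String> lines) {
        // Map phase: each line is processed independently, emitting <word, 1> pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                intermediate.add(Map.entry(w.toLowerCase(), 1));
            }
        }
        // Shuffle phase: the framework groups values by key; tasks never talk directly.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : intermediate) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Reduce phase: each key's value list is summed independently.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }
}
```

Because the Map and Reduce phases only ever see their own inputs, each phase can be spread across machines without any user-visible messaging.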
2) Each execution stage of MapReduce
4. MapReduce application execution process
2. WordCount running example
The workflow is as follows: Input reads the text from HDFS in parallel, the data passes through the MapReduce model, and the result is finally wrapped by Output and persisted to HDFS.
1. Map process of WordCount
Three Map tasks read the three lines of the file in parallel; each task maps the words it reads, and every word is emitted as a <key, value> pair (the word as key, the count 1 as value).
Map source code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends
        Mapper<Object, Text, Text, IntWritable> {
    // Reusable Writable objects, to avoid allocating one per record
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line on whitespace and emit <word, 1> for each token
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}
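For the line "Hello World Bye World", the mapper above emits <hello,1>, <world,1>, <bye,1>, <world,1>. The tokenize-and-lowercase step can be exercised on its own, outside Hadoop (the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Mirrors the tokenization in WordMapper: split on whitespace, lower-case each token.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello World Bye World"));
        // [hello, world, bye, world]
    }
}
```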
2. Reduce process of WordCount
Before Reduce runs, the framework sorts and merges the Map output by key; the Reduce function then sums the values grouped under each key to obtain the word frequency.
Reduce source code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    // Reusable Writable for the output value
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum all counts that the shuffle grouped under this key
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // reuse result instead of allocating a new IntWritable
    }
}
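For the key "world" with grouped values [1, 1], the reducer above writes <world, 2>. The summation loop can be checked in isolation with plain Java (the class and method names here are illustrative):

```java
import java.util.List;

public class SumDemo {
    // Mirrors the loop in WordReducer: sum all counts grouped under one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(List.of(1, 1)));  // 2
    }
}
```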
3. WordCount source code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class WordMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        // Reusable Writable objects, to avoid allocating one per record
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit <word, 1> for every whitespace-separated token in the line
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, one);
            }
        }
    }

    public static class WordReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // Sum all counts that the shuffle grouped under this key
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        // The reducer can also serve as a combiner, since summing counts is associative
        job.setCombinerClass(WordReducer.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}