MapReduce: Basic Principles

Original author: Mr. Li

Original address: Basic principles and applications of MapReduce (blog.csdn.net/sanmi8276/article/details/113062349)

Table of Contents

1. Introduction to the MapReduce model

1. Map and Reduce functions

2. MapReduce Architecture

3. MapReduce workflow

4. MapReduce application execution process

2. WordCount running example

1. Map process of WordCount

2. Reduce process of WordCount

3. WordCount source code


1. Introduction to the MapReduce model

  MapReduce abstracts the complex parallel computation that runs on a large-scale cluster into two functions: Map and Reduce. It adopts a "divide and conquer" strategy: a large data set stored in the distributed file system is divided into many independent fragments (splits), which are processed in parallel by multiple Map tasks.

1. Map and Reduce functions

  The Map function takes an input key/value pair <k1, v1> (for example, <line number, line of text>) and produces a list of intermediate pairs, List(<k2, v2>). The framework then groups all intermediate values that share the same key and passes <k2, List(v2)> to the Reduce function, which merges those values and produces the final output pairs <k3, v3> (for example, <word, total count>).
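
  To make these two contracts concrete, here is a minimal, framework-free sketch in plain Java (it deliberately does not use the Hadoop API): each input line is mapped to <word, 1> pairs, the pairs are grouped by key in place of the real shuffle, and each group is reduced to a total count.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A minimal, single-process sketch of the Map/Reduce idea (not the Hadoop API)
public class MiniMapReduce {

    // Map: <lineNumber, lineText> -> list of intermediate <word, 1> pairs
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce: <word, [1, 1, ...]> -> total count for that word
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello World", "Hello Hadoop", "Bye Hadoop");

        // "Shuffle": group every intermediate value under its key
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines.get(i))) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce each key independently and print <word, count>
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            System.out.println(group.getKey() + "\t" + reduce(group.getKey(), group.getValue()));
        }
        // Output: bye 1, hadoop 2, hello 2, world 1
    }
}

  In the real framework the Map calls run in parallel on many nodes and the grouping is done by the shuffle phase; the sketch only illustrates the function contracts.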

2. MapReduce Architecture

  The MapReduce architecture is mainly composed of four parts: Client, JobTracker, TaskTracker and Task

  1)Client

  The MapReduce program written by the user is submitted to the JobTracker through the Client, and the user can also view the running status of the job through interfaces provided by the Client.

  2) JobTracker

  JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs; once a failure is detected, it transfers the affected tasks to other nodes. JobTracker also tracks task execution progress, resource usage and other information and reports it to the task scheduler (TaskScheduler), which selects suitable tasks to use resources as they become idle.

  3)TaskTracker

  TaskTracker periodically reports the resource usage and task progress on its node to the JobTracker through a "heartbeat", and at the same time receives commands from the JobTracker and performs the corresponding operations (such as starting new tasks or killing tasks). TaskTracker divides the resources on its node (CPU, memory, etc.) into units called "slots". A Task gets the chance to run only after it has been assigned a slot, and the role of the Hadoop scheduler is to assign the idle slots on each TaskTracker to Tasks. Slots are divided into Map slots and Reduce slots, which are used by Map Tasks and Reduce Tasks respectively.
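
  As an illustration of this resource model (a simplified toy model, not Hadoop source code), the sketch below shows the idea: each TaskTracker advertises its free Map and Reduce slots, and in a heartbeat round pending tasks are handed out only where a matching slot is free.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// A simplified, illustrative model of slot-based scheduling (not actual Hadoop code)
public class SlotSchedulingSketch {

    static class TaskTracker {
        final String name;
        int freeMapSlots;
        int freeReduceSlots;
        TaskTracker(String name, int mapSlots, int reduceSlots) {
            this.name = name;
            this.freeMapSlots = mapSlots;
            this.freeReduceSlots = reduceSlots;
        }
    }

    public static void main(String[] args) {
        List<TaskTracker> trackers = List.of(
                new TaskTracker("tracker-1", 2, 1),
                new TaskTracker("tracker-2", 2, 1));

        Queue<String> pendingMapTasks = new ArrayDeque<>(List.of("m0", "m1", "m2", "m3", "m4"));
        Queue<String> pendingReduceTasks = new ArrayDeque<>(List.of("r0", "r1"));

        // One round of "heartbeats": hand out tasks only to trackers with a matching free slot
        for (TaskTracker t : trackers) {
            while (t.freeMapSlots > 0 && !pendingMapTasks.isEmpty()) {
                t.freeMapSlots--;
                System.out.println(t.name + " starts map task " + pendingMapTasks.poll());
            }
            while (t.freeReduceSlots > 0 && !pendingReduceTasks.isEmpty()) {
                t.freeReduceSlots--;
                System.out.println(t.name + " starts reduce task " + pendingReduceTasks.poll());
            }
        }
        System.out.println("Still pending: " + pendingMapTasks + " " + pendingReduceTasks);
    }
}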

  4)Task

  Tasks are divided into Map Tasks and Reduce Tasks, both of which are started by the TaskTracker.

3. MapReduce workflow

  1) Overview of workflow

 

  • No communication between different Map tasks
  • No information exchange occurs between different Reduce tasks
  • Users cannot explicitly send messages from one machine to another
  • All data exchange is achieved through the MapReduce framework itself

  2) Each execution stage of MapReduce

  A MapReduce job passes through the following stages: the InputFormat reads the input files from the distributed file system and divides them into splits; a RecordReader turns each split into <key, value> records that are fed to a Map task; the intermediate <key, value> pairs produced by the Map tasks are partitioned, sorted and merged during the shuffle and handed to the Reduce tasks; finally, the OutputFormat writes the Reduce output back to the distributed file system.
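
  For orientation, the sketch below maps these stages onto driver-side settings of the Hadoop mapreduce API. It assumes the WordMapper and WordReducer classes from the WordCount example later in this article are on the classpath; TextInputFormat, HashPartitioner and TextOutputFormat are the classes Hadoop uses by default for plain text jobs and are shown here only to make each stage explicit.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class StageConfigurationSketch {

    // Illustrative only: which driver setting controls which execution stage
    static void configureStages(Job job) {
        job.setInputFormatClass(TextInputFormat.class);    // input: split files into <offset, line> records
        job.setMapperClass(WordMapper.class);              // Map stage
        job.setCombinerClass(WordReducer.class);           // optional local aggregation before the shuffle
        job.setPartitionerClass(HashPartitioner.class);    // shuffle: decide which Reduce task receives each key
        job.setReducerClass(WordReducer.class);            // Reduce stage
        job.setOutputKeyClass(Text.class);                 // types of the final <key, value> output
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);  // output: write "key<TAB>value" lines to HDFS
    }
}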

4. MapReduce application execution process

  When an application is executed, the user program is submitted through the Client; the input data is split, and the JobTracker assigns Map tasks and Reduce tasks to the available TaskTrackers. Each Map task reads its split, runs the user-defined Map function and writes its intermediate results to local disk; the Reduce tasks then pull the intermediate data over the network, sort and merge it by key, run the user-defined Reduce function, and write the final results to the distributed file system.

2. WordCount running example

The overall workflow is as follows: Input reads the text content in parallel from HDFS, the data is processed by the MapReduce model, and finally the analysis results are wrapped by Output and persisted back into HDFS.

1. Map process of WordCount

Three Map tasks read the three lines of the input file in parallel, map over the words they read, and emit each word as a <key, value> pair of the form <word, 1>. For example, the line "Hello World" produces <"hello", 1> and <"world", 1>.

  

Map source code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reused Writable objects to avoid allocating a new pair for every token
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input record: key is the offset, value is one line of text
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Emit <word, 1> for every token, lower-cased
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}

 

2. Reduce process of WordCount

The Reduce operation sorts and merges the intermediate results of the Map stage: all counts emitted for the same word are grouped together (for example, <"hello", <1, 1>>) and summed, which yields the final word frequency (<"hello", 2>).

 

Reduce source code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Called once per key: values holds every count emitted for this word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Emit <word, total count>
        context.write(key, result);
    }
}

3. WordCount source code

import java.io.IOException;  
import java.util.StringTokenizer;  
  
import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.IntWritable;  
import org.apache.hadoop.io.Text;  
import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.util.GenericOptionsParser;  
  
public class WordCount {  
  
    public static class WordMapper extends  
            Mapper<Object, Text, Text, IntWritable> {  
  
        private final static IntWritable one = new IntWritable(1);  
        private Text word = new Text();  
  
        public void map(Object key, Text value, Context context)  
                throws IOException, InterruptedException {  
            String line = value.toString();  
            StringTokenizer itr = new StringTokenizer(line);  
            while (itr.hasMoreTokens()) {  
                word.set(itr.nextToken().toLowerCase());  
                context.write(word, one);  
            }  
        }  
    }  
  
    public static class WordReducer extends  
            Reducer<Text, IntWritable, Text, IntWritable> {  
        private IntWritable result = new IntWritable();  
  
        public void reduce(Text key, Iterable<IntWritable> values,  
                Context context) throws IOException, InterruptedException {  
            int sum = 0;  
            for (IntWritable val : values) {  
                sum += val.get();  
            }  
            result.set(sum);  
            context.write(key, result);  
        }  
    }  
  
    public static void main(String[] args) throws Exception {  
        Configuration conf = new Configuration();  
        String[] otherArgs = new GenericOptionsParser(conf, args)  
                .getRemainingArgs();  
        if (otherArgs.length != 2) {  
            System.err.println("Usage: wordcount <in> <out>");  
            System.exit(2);  
        }  
        Job job = new Job(conf, "word count");  
        job.setJarByClass(WordCount.class);  
        job.setMapperClass(WordMapper.class);  
        job.setCombinerClass(WordReducer.class);  
        job.setReducerClass(WordReducer.class);  
        job.setOutputKeyClass(Text.class);  
        job.setOutputValueClass(IntWritable.class);  
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));  
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  
        System.exit(job.waitForCompletion(true) ? 0 : 1);  
    }  
} 
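
Two remarks on the driver configuration above. First, job.setCombinerClass(WordReducer.class) is valid here because summing counts is associative, so the reducer can also run as a combiner on each Map task's local output and shrink the amount of data shuffled across the network; a combiner's input and output types must match the Map output types, as they do in this example. Second, the program is typically packaged into a jar and launched with the hadoop jar command, passing an HDFS input path and a not-yet-existing output path as the two arguments checked in main().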
 
 
