In-depth understanding of Apache Hadoop MapReduce

Author: Zen and the Art of Computer Programming

1. Introduction

Apache Hadoop MapReduce is an open-source distributed computing framework for processing massive data sets on clusters, largely independent of cluster size or machine configuration. A job consists of two kinds of components: a Map stage and a Reduce stage. The Map stage splits the input data into shards and hands each shard to a separate task, which receives input key-value pairs and generates zero or more output key-value pairs. The Map stage is implemented by a Mapper function, which maps input key-value pairs to a series of intermediate key-value pairs. Intermediate key-value pairs are buffered in memory and, once the Mapper finishes, written out to the local disk file system. The Reduce stage summarizes the intermediate key-value pairs produced by all the Mappers: it merges the outputs in a specified sort order, and the number of reduce tasks can be configured to control how many final output files are produced. The Reduce stage is implemented by a Reducer function, which reads the Mappers' output files and aggregates the values that share the same key to build the final result.

By decoupling the Map and Reduce operations, Hadoop MapReduce lets users plug in their own functions. For example, users can write custom Mappers or Reducers to solve specific types of problems, or write a Combiner function that performs local aggregation to further improve performance. To take advantage of Hadoop's parallel nature, Hadoop MapReduce provides the following main features:

Disk-based caching: Because Hadoop is a disk-based system, loading data into memory takes a long time. To speed up processing, Hadoop MapReduce uses a block storage mechanism in which data is divided into fixed-size blocks and cached on the various nodes, and Hadoop manages these caches automatically so that they can reliably serve data-access requests.

Automatic data sharding: Hadoop MapReduce splits the data into chunks suitable for individual nodes. Users do not need to consider the configuration of the underlying physical machines, nor worry about factors such as network bandwidth or disk I/O affecting application performance.

Fault tolerance: Hadoop MapReduce applies a series of design principles, including replication, transaction logs, and checkpoints, to keep the system fault tolerant. If a node fails, its tasks are automatically restarted without losing any state information.

RPC-based communication: Hadoop MapReduce uses a Remote Procedure Call (RPC) protocol to build the connections within a cluster. In addition, Hadoop MapReduce provides many useful command-line and graphical interfaces that make it easy to submit jobs, monitor job execution, and so on.

2. MapReduce Overview

Hadoop MapReduce is a programming model and software framework for distributed computing. It supports three key operations: map, shuffle and sort, and reduce. Users only need to specify the input and output locations and the application logic to process large-scale data quickly with the MapReduce framework.

Map phase: The Map phase is performed by a user-defined mapper function, which accepts a set of input key-value pairs and produces a set of intermediate key-value pairs. The intermediate key-value pairs are buffered in memory, and when a mapper task completes, its intermediate output is written to a temporary file.

Shuffle and Sort: The shuffle and sort step sorts, groups, and redistributes the intermediate key-value pairs, assigning each key to the reducer responsible for it. Because this step delivers keys to every reducer in sorted order, the reducers can aggregate them efficiently.

Reduce phase: The Reduce phase is performed by a user-defined reducer function, which reads the intermediate key-value pair files generated by the mappers and aggregates the values of the same key to produce the final output result.
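
In the Java API these two user-supplied functions are written by subclassing Mapper and Reducer. The skeleton below only illustrates their shape (class names, type parameters, and method bodies are assumptions, not taken from this article); concrete word-count versions appear in the next section.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters: <input key, input value, output key, output value>.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit zero or more intermediate pairs with context.write(k, v).
    }
}

// Receives each key together with all of its values after shuffle and sort.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Aggregate the values for this key and emit the final pair.
    }
}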

Hadoop relies on scalability and fault-tolerance mechanisms to ensure high availability. A Hadoop cluster can automatically detect hardware failures, network problems, crashes, system errors, and other anomalies, and recover quickly, preserving data security, accuracy, completeness, and consistency.

Let's walk through the basic concepts and terminology of Hadoop MapReduce one by one.

3. Hadoop MapReduce basic terminology

Word Count Example: Suppose we have the following text data sample:

apple banana cherry
dog cat elephant
fox grape hippopotamus
ice jupiter kangaroo
lemon mandarin orange
pear plum quince
raspberry sauce tomato
watermelon xenon yak
zebra zigzag zulu

Now, we want to count the number of times each word appears. Probably the simplest way is to use the Word Count model: for every word, count how many times it occurs in the text. For example, if the word "banana" appears once, its count is 1; if the word "tomato" appears twice, its count is 2.
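
Stripped of the distribution, the Word Count model is nothing more than the following single-machine sketch (plain Java, not part of Hadoop; the names are illustrative):

import java.util.HashMap;
import java.util.Map;

class LocalWordCount {
    // Counts how many times each whitespace-separated word occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}

MapReduce distributes exactly this computation: the mappers produce the per-word pairs and the reducers do the summing.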

Word Count can be implemented using MapReduce. MapReduce has two phases: Map phase and Reduce phase.

Map stage:

1) Read each line of each document as input.
2) Split the line on spaces or tabs, take each word as the key and its occurrence count (1) as the value, and write the pair to an intermediate file (Intermediate File); a mapper sketch doing exactly this follows these steps.

<document-id> <word> <count>    # intermediate file
doc1 apple    1
doc1 banana   1
doc1 cherry   1
doc2 dog      1
......

3) Accumulate the values of the same key and output the result; on the map side this local summing is typically done by a Combiner.
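
A minimal sketch of such a mapper, using the TokenizerMapper name that the driver code later in this article refers to (the method body here is an assumption, not taken from the article):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token of every input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split on spaces or tabs and write one intermediate pair per word.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}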

Reduce phase:

1) Read the key-value pairs from the intermediate files and sort them by key.
2) Merge the values of the same key, and output each key together with its merged value as the final result (a reducer sketch follows the sample output below):

apple    3
banana   1
cherry   1
dog      1
elephant 1
fox      1
grape    1
hippopotamus       1
ice     1
jupiter 1
kangaroo 1
lemon    1
mandarin 1
orange  1
pear    1
plum    1
quince  1
raspberry 1
sauce   1
tomato  2
watermelon 1
xenon   1
yak     1
zebra   1
zigzag  1
zulu    1
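
A reducer producing this output could look like the following sketch, using the IntSumReducer name that the driver below refers to (the body is an assumption, not taken from the article); because its input and output types match, the same class can also be registered as the Combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts delivered for each word by the shuffle-and-sort step.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);
    }
}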

4) Hadoop Streaming and the Java API: In actual development, users write their own mapper and reducer programs and submit them as a MapReduce job. The Hadoop Streaming utility lets the mapper and reducer be written as standalone executables that read from standard input and write to standard output, while the Java API lets users package their classes into a jar file and launch the job from a driver class. For example, the word count above can be run with a driver like the following:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: wordcount <input> <output>");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        String inputPath = args[0];
        String outputPath = args[1];

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class); // class from the user-provided jar
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(inputPath));

        // Types of the final (reducer) output key and value.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);

        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
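
Assuming the driver and the TokenizerMapper / IntSumReducer classes are compiled and packaged into a jar (the jar name here is a placeholder), the job is typically launched with: hadoop jar wordcount.jar WordCount <input path> <output path>. Note that the output directory must not already exist. A comparable Hadoop Streaming run would instead pass executable mapper and reducer scripts to the hadoop-streaming jar via its -input, -output, -mapper, and -reducer options.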
