Hadoop MapReduce Explained (Basics)

This article explains MapReduce from three angles: the MapReduce job running process, the shuffle phase, and MapReduce job failure and fault tolerance.

1. The MapReduce job running process

1.1 Introduction to MapReduce

     MapReduce is a programming model for parallel computation over large data sets (greater than 1 TB). Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it easy for programmers who know nothing about distributed or parallel programming to run their own programs on a distributed system. The current software implementation has the user specify a Map function, which transforms a set of key-value pairs into a new set of key-value pairs, and a Reduce function, which processes all the mapped key-value pairs that share the same key. --- from Baidu Encyclopedia

MapReduce is a cluster-based high-performance parallel computing platform (Cluster Infrastructure)
MapReduce is a software framework for running parallel computations (Software Framework)
MapReduce is a parallel programming model and methodology (Programming Model & Methodology)

      Hadoop's MapReduce framework processes a job as a batch computation: the job runs from reading the input data, through processing portions of that data, where the processing consists of map, reduce, combiner, and related operations. A MapReduce job necessarily involves the following components:

1. The client, which submits the MapReduce job
2. The YARN resource manager, which coordinates the allocation of compute resources on the cluster
3. The YARN node managers, which launch and monitor the compute containers on the cluster machines
4. The MapReduce application master, which coordinates the tasks of a running MapReduce job
5. HDFS, the distributed file system, which is used for sharing job files between the other entities

1.2 The job running process

Running a job includes the following steps:

1. Job submission
2. Job initialization
3. Task assignment
4. Task execution
5. Job status updates
6. Job completion

A detailed flow chart of the job execution process is shown below:

[Figure: flow chart of the MapReduce job execution process]

1.2.1 Job submission

For a detailed source-code analysis of job submission, see: hadoop2.7 job submission explained (part 1) and hadoop2.7 job submission explained (part 2).

     The MR code calls the waitForCompletion() method, which wraps the Job.submit() method; Job.submit() in turn creates a JobSubmitter object. When we call waitForCompletion(true), the method polls the job's progress once per second, and whenever it finds that the status has changed since the last query it prints the details to the console. If the job succeeds, the job counters are displayed; otherwise the error that caused the job to fail is logged to the console.

The JobSubmitter roughly does the following (a minimal driver sketch follows the list):
1. Asks the resource manager (ResourceManager) for a new application ID, used as the MapReduce job ID (step 2 in the figure)
2. Checks the output specification of the job, for example whether the output directory already exists
3. Computes the input splits for the job
4. Copies the resources needed to run the job (the job JAR, the configuration file, and the computed input splits) to a temporary HDFS directory named after the job ID; the job JAR is copied with a higher replication factor, 10 by default (controlled by the mapreduce.client.submit.file.replication parameter)
5. Submits the job by calling the resource manager's submitApplication() method
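To make the submission path concrete, here is a minimal driver sketch, assuming a classic word-count job; the WordCountMapper and WordCountReducer classes are hypothetical and not part of this article. waitForCompletion(true) is the call discussed above that wraps Job.submit() and then polls progress.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);        // the job JAR copied to HDFS in step 4 above
        job.setMapperClass(WordCountMapper.class);       // hypothetical mapper class
        job.setCombinerClass(WordCountReducer.class);    // optional combiner (see the shuffle section)
        job.setReducerClass(WordCountReducer.class);     // hypothetical reducer class
        job.setNumReduceTasks(2);                        // sets mapreduce.job.reduces
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist (checked at submission)
        // Submits the job via JobSubmitter and polls progress once per second until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}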

1.2.2 Job initialization

1. When the resource manager receives the submitApplication() call, it passes the request to the YARN scheduler. The scheduler allocates a container (container0) on a node manager and launches the application master process there (its main class is MRAppMaster). Once the process starts, it registers with the resource manager and reports its information; from then on the application master can monitor the map and reduce tasks. The application master initializes the job by creating a number of bookkeeping objects that it uses to keep track of the job's progress.

2. The application master retrieves the resources shared at job submission time in the temporary HDFS directory: the JAR, the split information, the configuration, and so on. It creates one map task object for each input split, and determines the number of reduce tasks from the mapreduce.job.reduces parameter (which is set on the job with the setNumReduceTasks() method).

3. The application master then decides whether to run the job in uber mode (the job runs in the same JVM as the application master, i.e. the map and reduce tasks run on the same node). The conditions for uber mode are: fewer than 10 map tasks, a single reduce task, and input data smaller than one HDFS block.

This is controlled by the following parameters (a configuration sketch follows the list):

mapreduce.job.ubertask.enable      # whether uber mode is enabled
mapreduce.job.ubertask.maxmaps     # maximum number of map tasks for an uber job
mapreduce.job.ubertask.maxreduces  # maximum number of reduce tasks for an uber job
mapreduce.job.ubertask.maxbytes    # maximum input size for an uber job
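A hedged configuration sketch, assuming the parameters above are set programmatically before the job is created (they can equally be set in mapred-site.xml); the values chosen here are illustrative only, not taken from the article:

import org.apache.hadoop.conf.Configuration;

public class UberModeConfig {
    // Apply the uber-mode parameters listed above to a job Configuration.
    public static Configuration withUberMode() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow small jobs to run inside the AM's JVM
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // at most 9 map tasks (illustrative value)
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most 1 reduce task
        // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size when not set explicitly
        return conf;
    }
}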

4. The application master calls the setupJob() method of the OutputCommitter (FileOutputCommitter by default), which creates the final output directory for the job and the temporary working space for the task output.

1.2.3 Task assignment

1. If the application master determines that the job does not qualify for uber mode, it requests container resources for the map and reduce tasks from the resource manager.

2. Resource requests for the map tasks are issued first; only after 5% of the map tasks have completed does the application request the resources needed by the reduce tasks.

3. During allocation, a reduce task can run on any node, but the placement of map tasks has to take data locality into account. By default each map and reduce task is allocated 1 GB of memory, specified by the following parameters (a configuration sketch follows the list):

mapreduce.map.memory.mb
mapreduce.map.cpu.vcores
mapreduce.reduce.memory.mb
mapreduce.reduce.cpu.vcores
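A hedged sketch of overriding these defaults in code; the values are illustrative only and could just as well be placed in mapred-site.xml:

import org.apache.hadoop.conf.Configuration;

public class TaskResourceConfig {
    // Raise the per-container resources requested for map and reduce tasks.
    public static Configuration withTaskResources() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);    // 2 GB per map container instead of the 1 GB default
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.memory.mb", 4096); // reducers often need more memory for the merge
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        return conf;
    }
}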

1.2.4 Task execution

       After the application master submits its requests and the resource manager allocates the resources as needed, the application master communicates with the node manager to start the container. The task is executed by a Java application whose main class is YarnChild. Before running the task, it first localizes the resources the task needs, including the job configuration and JAR file, and then it runs the map or reduce task. YarnChild runs in a dedicated JVM.

1.2.5 Job and task status updates

    Each job and each of its tasks has a status: the state of the job or task (running, successful, failed, and so on), the progress of its maps and reduces, the values of the job's counters, and a status message or description that changes while the job runs. The client can communicate with the application master, polling it every second (configurable via the mapreduce.client.progressmonitor.pollinterval parameter) for the latest job status and progress.

1.2.6 Job completion

When the application master is notified that the last task has completed, it changes the job status to successful.
When the client polls the job status and learns that the job has completed, it prints a message to inform the user and returns from the waitForCompletion() method.
When the job completes, the application master and the task containers clean up their working state, such as temporary intermediate output. The OutputCommitter's commitJob() method is called, and the job information is archived by the job history server for later inspection by the user.

2. Shuffle

   MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map output to the reducers as their input is called the shuffle. The shuffle is the part we focus on when optimizing. A flow chart of the shuffle is shown below:

[Figure: flow chart of the shuffle process]


2.1 The map side

Before the map tasks are created, the sizes of the input splits are computed. For the calculation, see the source-code analysis in: hadoop2.7 job submission explained, file splits.

   The split size then determines the number of map tasks: one map task is created for each split (a remaining chunk smaller than 1.1 times the split size is not split further). Each map task runs the logic defined in the user's map method, and when the computation finishes its output is written to the local disk.

     The output is not written straight to disk. To keep I/O efficient, each map task writes into a circular memory buffer and performs a pre-sort (quick sort). The buffer is 100 MB by default (configurable via mapreduce.task.io.sort.mb). When the buffer fills to a certain proportion, 80% by default (configurable via mapreduce.map.sort.spill.percent), a background thread starts spilling the contents of the buffer to disk. The spill thread is independent and does not block the map thread writing results into the buffer: the map continues to write into the buffer while the spill is in progress, but if the buffer fills up during this period the map blocks until the spill completes. Spills are written to the local directories given by mapreduce.cluster.local.dir, in round-robin fashion. Before a spill is written, the number of reducers is already known, so the data is divided into the corresponding number of partitions, by default according to the hash partitioner. Within each partition, a background thread sorts the records by key, so the spill file written to disk is partitioned and sorted. If a combiner function is defined, it runs on the sorted output, making the map output more compact, so there is less data to write to disk and to transfer to the reducers.
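As noted above, spilled records are partitioned with the default hash partitioner before the in-partition sort. Below is a minimal sketch of a custom Partitioner; the class name FirstCharPartitioner is hypothetical, and the comment shows what the default HashPartitioner essentially computes.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The default HashPartitioner essentially computes:
        //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        // Here keys are routed to partitions by their first character instead.
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;
        }
        return (s.charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstCharPartitioner.class).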

    Each time the circular buffer reaches the spill threshold, a new spill file is written, so by the time a map task finishes there may be several partitioned and sorted spill files on the local disk. Before the map task completes, these files are merged (merge sort) into a single partitioned and sorted output file; the number of files merged at a time is controlled by the mapreduce.task.io.sort.factor parameter.

    While the map output is being spilled to disk, the data can be compressed to speed up the transfer, reduce disk I/O, and save storage. Compression is off by default; it is controlled by the mapreduce.map.output.compress parameter, and the compression codec used is controlled by the mapreduce.map.output.compress.codec parameter.
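A hedged sketch of turning on map-output compression with the two parameters just mentioned; Snappy is assumed here as the codec, which requires the native Snappy library to be available on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompressionConfig {
    // Compress intermediate map output to cut disk I/O and shuffle traffic.
    public static Configuration withMapOutputCompression() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}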

2.2 The reduce side

       After a map task completes, the application master, which monitors the job status, knows that it has finished and can start the reduce tasks. The application master also knows the mapping between map outputs and the hosts that store them, and the reduce tasks poll the application master to learn which hosts to copy their data from.

     The output of a single map task may be fetched by multiple reduce tasks, and each reduce task may need the output of many map tasks as its input. Map tasks can finish at different times, and as soon as one of them completes, the reduce tasks begin copying its output. A reduce task fetches only the data belonging to its own partition from each map output; this copy process is part of the shuffle. The copying is done by a small number of copier threads so that map outputs can be fetched in parallel; the default is 5 threads, controlled by the mapreduce.reduce.shuffle.parallelcopies parameter.

    This copy process is similar to the map side writing to disk: there is likewise a memory buffer with a threshold, the threshold can be set in the configuration file in the same way, and the buffer size is based on the reduce task's own memory rather than a fixed size. While copying, the reduce task also merges and sorts the copied files.

   If a map output is small, it is copied into the memory buffer of the node where the reducer runs; the buffer size can be specified in mapred-site.xml via mapreduce.reduce.shuffle.input.buffer.percent. Once the buffer on the reducer's node reaches its threshold, or the number of map outputs held in the buffer reaches a threshold, they are merged and spilled to disk.
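A hedged sketch of tuning the copy phase with the parameters mentioned in this section; the values are illustrative only.

import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleConfig {
    // Tune the reduce-side copy phase of the shuffle.
    public static Configuration withShuffleTuning() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);           // more copier threads than the default 5
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.7f); // fraction of the reduce heap used for copied map output
        return conf;
    }
}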

   If a map output is large, it is copied directly to the disk of the node where the reducer runs. As the spill files on the reducer's disk accumulate, a background thread merges them into larger, sorted files. When the copying of map outputs is finished, the shuffle enters the sort phase, which gradually merge-sorts the many small map output files into larger ones. Finally, the remaining files are merged into the larger sorted input that is fed to the reduce function.

2.3 Summary

Once the input files for the reducers are determined, the whole shuffle is finished. The reducers then run, and their final results are saved to HDFS.

In a Hadoop cluster, most map tasks and reduce tasks execute on different nodes, so in many cases a reduce task has to pull map outputs across the network from other nodes. If many jobs are running on the cluster, the normal execution of tasks consumes a significant amount of the cluster's internal network resources. This network consumption is normal and cannot be avoided; what we can do is minimize unnecessary consumption. Within a node, disk I/O (compared with memory) also has a considerable impact on job completion time. From these basic requirements, what we want from the shuffle process is:

1. Pull the data from the map side to the reduce side completely and correctly.
2. When pulling data across nodes, reduce unnecessary bandwidth consumption as much as possible.
3. Reduce the impact of disk I/O on task execution.

The MapReduce computing framework mainly uses two sorting algorithms: quick sort and merge sort. Sorting happens twice in a map task and once in a reduce task:

1. The first sort happens in the map output's circular memory buffer and uses quick sort. When the buffer reaches the threshold, before the data is spilled to disk, a background thread divides the buffered data into the appropriate partitions and sorts it by key within each partition.

2. The second sort happens on disk, when the multiple spill files produced by a map task are merged into a single partitioned and sorted output file. Since each spill file was already sorted when it was written, a single merge sort is enough to make the merged output file sorted as a whole.

3. The third sort happens during the shuffle, when the map output files copied to the reducer are merged; again, a merge sort produces a single sorted file.

3. Job failure and fault tolerance

Where there are running jobs, there will inevitably be failed jobs. Leaving aside hardware and platform failures, a job can fail for different reasons, as follows:

3.1 Task failure

   The user code throws an exception (a bug in the code): in this case the task JVM sends an error report to the application master before exiting, the error is recorded in the user logs, the application master marks the task attempt as failed, and the container's resources are freed.

     Another case is the task JVM exiting suddenly. Here the node manager notices that the process has exited and notifies the application master, which marks the task attempt as failed; if the task was killed because of speculative execution, however, it is not marked as failed. Hanging tasks are handled differently: once the application master notices that it has not received a progress update for some time (10 minutes by default, controlled by the mapreduce.task.timeout parameter), the task is marked as failed. When the application master is told that a task attempt has failed, it reschedules the task, trying to run it on a node different from the one where it previously failed. By default a task is retried four times; if all four attempts fail, the whole job is considered failed. The retry counts are controlled by these parameters (a configuration sketch follows):

mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
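A hedged sketch of raising the retry limits with the two parameters above; the values are illustrative only.

import org.apache.hadoop.conf.Configuration;

public class TaskRetryConfig {
    // Allow more attempts per task before the whole job is declared failed.
    public static Configuration withMoreAttempts() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 6);    // default is 4
        conf.setInt("mapreduce.reduce.maxattempts", 6); // default is 4
        return conf;
    }
}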

3.2 Application master failure

The AM itself can also fail for various reasons (for example network problems or hardware failure); YARN will then try to start the same AM again.
The number of AM restart attempts can be configured per job with mapreduce.am.max-attempts; the default is 2.
YARN also enforces a cluster-wide upper limit, yarn.resourcemanager.am.max-attempts, which also defaults to 2; a single application cannot exceed this limit unless both parameters are changed.

The recovery process: the application master sends periodic heartbeats to the resource manager. When the application master fails, the resource manager detects the failure and starts a new instance of the application master in a new container. The new instance uses the job history to recover the state of the tasks that had already run in the failed application, so they do not have to be re-executed; recovery is enabled by default and is controlled by yarn.app.mapreduce.am.job.recovery.enable. The client polls the application master for job status; if the application master fails, the client asks the resource manager for the new application master's address and updates its cached address.

3.3 Node manager failure

    If a node manager crashes or runs very slowly, it stops sending heartbeat messages to the resource manager. If the resource manager receives no heartbeat for 10 minutes (configurable via the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms parameter), it removes that node manager from its pool of nodes. The application master and any failed tasks that were running on that node are then recovered by the two recovery mechanisms described above.

3.4 Resource manager failure

   A resource manager failure is a very serious problem: no tasks can be assigned resources, and no jobs or containers can be started, so the whole cluster, whose resources are controlled by YARN, is paralyzed.

Fault tolerance: for the details of resourcemanager HA, see: hadoop high-availability installation and principles explained

 

For more articles on the Hadoop ecosystem, see: hadoop eco-series

 

Reference:

"Hadoop: The Definitive Guide" (big data storage and analysis), 4th Edition


Origin www.cnblogs.com/zsql/p/11600136.html