Detailed explanation of Hadoop MapReduce framework

1. What we need to learn are the execution rules of this computing model. A MapReduce job runs in two stages, the map stage and the reduce stage, and each stage takes key/value pairs as input and produces key/value pairs as output. All the programmer has to do is define the functions for these two stages: the map function and the reduce function.
2. HDFS, the Hadoop Distributed File System, is the storage foundation of Hadoop: at the data level it provides storage for massive amounts of data. MapReduce is an engine, or programming model, that sits above that data layer; by writing MapReduce programs we can process the massive data stored in HDFS. It is similar to scanning (MapReduce) all the files (HDFS) to find the result we want.
3. MapReduce is a programming model, a programming method, an abstraction.
4. To write a MapReduce program, we implement a map function and a reduce function.
5. Hadoop is an open source framework for writing and running distributed applications that process large-scale data. It is designed for offline, large-scale data analysis.
6. The core of distributed computing is to use distributed algorithms to take a program that runs on a single machine and run it on many machines in parallel, multiplying the data processing capacity. However, this kind of distributed computing traditionally places high demands on both programmers and servers, so the cost becomes very high.
7. Hadoop was born to solve this problem. Hadoop can easily form a distributed cluster out of many cheap Linux PCs, and programmers do not need to know anything about distributed algorithms; they only need to define the interface methods according to the MapReduce rules and leave the rest to Hadoop, which automatically distributes the computation to each node and then collects the results.
8. For example: Hadoop first imports a 1 PB data file into HDFS, then the programmer defines map and reduce, that is, takes the byte offset of each line of the file as the key and the content of the line as the value, performs a regular-expression match, and, if the match succeeds, aggregates the results through reduce. Hadoop distributes the program to N nodes to run in parallel; a computation that might otherwise take several days can finish within a few hours once there are enough nodes. A minimal sketch of such a program follows.
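
A minimal sketch of the distributed-grep style program described in item 8, using the newer org.apache.hadoop.mapreduce API; the class names and the regular expression are illustrative and not taken from any particular codebase.

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: key = byte offset of the line within the file, value = the line itself.
// Emit (line, 1) whenever the line matches the pattern.
public class GrepMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern PATTERN = Pattern.compile("ERROR"); // illustrative pattern
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (PATTERN.matcher(value.toString()).find()) {
            context.write(value, ONE);
        }
    }
}

// Reduce: aggregate the matches for each distinct matching line.
class GrepReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```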


Core components:
1. Client: writes the MapReduce program, configures the job, and submits the job (see the driver sketch after this list).
2. JobTracker (the management service on the Hadoop cluster): assigns the job ID, splits the input data, initializes the job, assigns tasks, communicates with the TaskTrackers, and coordinates the execution of the entire job; it also determines the number of tasks and tracks task status.
3. TaskTracker: keeps a heartbeat connection with the JobTracker and executes map and reduce tasks on the data splits assigned to it. The TaskTracker is the container that runs map and reduce tasks while maintaining communication with, and reporting status to, the JobTracker.
4. Map task: continuously reads its input data and calls the map interface; it is a task run by a TaskTracker.
5. Reduce task: continuously reads its input data and calls the reduce interface; it is a task run by a TaskTracker.
6. The JobTracker is equivalent to a job management server; a TaskTracker is equivalent to a server that can run a set of tasks (with a JVM runtime environment); map and reduce tasks are equivalent to processes or threads running on a TaskTracker.
7. The JobTracker assigns map and reduce tasks to TaskTrackers according to the job status and the running status of each TaskTracker. When a TaskTracker finishes a task, it returns its running status to the JobTracker.
8. After the JobTracker is notified that the last task of a job has completed successfully, it returns the completion status of the job to the client.
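
A sketch of the client side from item 1: configuring a job and submitting it to the cluster. It assumes the GrepMapper and GrepReducer classes from the earlier sketch, takes the input and output paths from the command line, and requires that the output path does not already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Client: configure the job and submit it (classically to the JobTracker).
public class GrepDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed grep");
        job.setJarByClass(GrepDriver.class);

        job.setMapperClass(GrepMapper.class);
        job.setReducerClass(GrepReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        // Blocks until the job finishes; the framework handles the distribution.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```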


Startup process:
1. The client writes the MapReduce program, configures the MapReduce job (the Job), and then submits the job to the JobTracker.
2. The JobTracker builds the job: it assigns an ID to the new job and then performs a series of checks:
(1) it checks whether the output directory already exists; if it does, the job cannot run and the JobTracker throws an error back to the client;
(2) it then checks whether the input directory exists; if it does not, it throws an error. If it does exist, the JobTracker computes the input splits (Input Splits) from the input; if the splits cannot be computed, it also throws an error.
3. Once these checks pass, the JobTracker configures the resources required by the job. After allocating resources, the JobTracker initializes the job; the main work of initialization is to put the job into an internal queue so that the configured job scheduler can schedule it. The default scheduling policy is FIFO.
4. The job scheduler initializes the job. Initialization means creating a running job object (encapsulating its tasks and bookkeeping information) so that the JobTracker can track the status and progress of the job.
5. After initialization, the job scheduler obtains the input split information and creates one map task for each split.
6. The next step is task allocation. The TaskTracker runs a simple loop that periodically sends a heartbeat to the JobTracker; the heartbeat interval is 5 seconds by default and can be configured. The heartbeat is the bridge between the JobTracker and the TaskTracker: through it the JobTracker can tell whether a TaskTracker is alive and obtain the TaskTracker's status and any problems it has hit, while the TaskTracker obtains the operation instructions issued by the JobTracker from the heartbeat's return value.
7. After tasks are assigned, they are executed. During execution the JobTracker monitors the status and progress of each TaskTracker through the heartbeat mechanism and uses this to compute the status and progress of the entire job, while each TaskTracker also monitors its own status and progress locally.
8. When the JobTracker is notified that the last TaskTracker has successfully completed its assigned task, it sets the status of the whole job to success. When the client then queries the job's running status (note: this query is asynchronous), it finds that the job has completed. A polling sketch follows.
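
Item 8 notes that the client's status query is asynchronous. Instead of blocking in waitForCompletion(), a client can submit the job and poll its progress; a minimal sketch, assuming a fully configured org.apache.hadoop.mapreduce.Job and an illustrative polling interval.

```java
import org.apache.hadoop.mapreduce.Job;

// Submit a configured job without blocking, then poll its status and progress
// (the asynchronous status query mentioned in item 8).
public class JobMonitor {
    public static boolean submitAndWatch(Job job) throws Exception {
        job.submit(); // returns as soon as the job has been accepted
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000); // illustrative polling interval
        }
        return job.isSuccessful();
    }
}
```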


MapReduce work process
HDFS (files) --> input split --> map task --> map interface --> circular memory buffer --> Combiner (optional; merges values with the same key) --> sorting --> spill files written to local disk (hash-partitioned, several per map task) --> merge the spill files --> output files partitioned according to the number of reducers --> reduce task --> reduce interface --> results written through the output API to result files (stored on HDFS)


The MapReduce running mechanism, in chronological order, consists of: input split, the map stage, the combiner stage, the shuffle stage, and the reduce stage.
1. Input split:
(1) Before the map computation, MapReduce computes input splits from the input files; each input split corresponds to one map task. An input split does not store the data itself; it only records the length of the split and the location of the data. Input splits are closely tied to HDFS blocks: if the HDFS block size is set to 64 MB, the input is split at roughly that size by default.
(2) That is, the input files (the file set) are divided into multiple splits according to the split size, and one map task is started per split (see the split-tuning sketch after this list).
(3) The map task divides its data by line into <key, value> pairs (the key is the byte offset of the line and the value is the content of the line) and calls the user-defined map method on each pair, producing new <key, value> pairs.
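
A sketch of how the split size, and therefore the number of map tasks, can be influenced from the client using FileInputFormat's split-size setters; the 128 MB and 256 MB values are illustrative. By default the split size is derived from the HDFS block size.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// The number of map tasks equals the number of input splits; the split size is
// derived from the HDFS block size unless min/max split sizes are overridden.
public class SplitTuning {
    public static void tune(Job job) {
        // Force each split (and hence each map task) to cover at least 128 MB,
        // even when the HDFS block size is smaller (e.g. 64 MB).
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        // And cap the split size at 256 MB.
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```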

2. The map stage:
The map function is written by the programmer. It takes each record handed to it by the map task (for example a line of text as a Text value), generates from it the <key, value> pairs needed for the statistics, and then calls the output interface to emit the new <key, value> pairs.

3. Combiner stage:
(1) The combiner stage is optional for the programmer. The combiner is essentially a reduce operation that runs before the map task's intermediate output (the map output) is written out, doing a simple merge of records that share the same key.
(2) The combiner works on the map output grouped by key, collapsing each key's records into a <key, value, value, ...> group; the key-based sort is completed before the output file is written. A snippet for wiring up a combiner follows.
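
Wiring up a combiner is a one-line addition to the driver sketch shown earlier. Reusing the reducer as the combiner is only safe when the reduce operation is associative and commutative (for example, summing counts):

```java
// Runs a reduce pass over the map-side output before it is shuffled to the
// reducers, collapsing repeated keys; here the reducer doubles as the combiner.
job.setCombinerClass(GrepReducer.class);
```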

4. Shuffle stage:
(1) Shuffle is the process of turning the map output into the reduce input; it is the focus of MapReduce optimization.
(2) MapReduce usually computes over massive data, so when the map produces output it is impossible to hold everything in memory, and the process of writing map output to disk is complex and expensive. When the map emits output, it writes into a circular in-memory buffer dedicated to that output. The default buffer size is 100 MB, and a spill threshold for the buffer is set in the configuration file, 0.80 by default. The map also starts a daemon thread for the output: when the buffer reaches 80% of its capacity, the daemon thread writes the buffered content to disk, a process called a spill, while the remaining 20% of the buffer continues to accept new records, so writing to disk and writing to memory do not interfere with each other. If the buffer does fill up completely, the map blocks its writes to memory until the spill finishes and then resumes. The sorting mentioned earlier is performed as part of the spill to disk, not while writing into memory, and if a combiner function is defined, the combiner runs on the sorted data during the spill.
(3) This stage produces the sorted <key, value, value, ...> groups and writes them out to new files, called spill files (each around 80 MB).
(4) When the map output is complete, the map task merges these spill files. A Partitioner step also runs during this process: records are partitioned by hashing the key according to the number of reducers, so that each reducer processes the data of exactly one partition (see the partitioner sketch below).
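The default partitioning behaviour described in item (4) is equivalent to hashing the key modulo the number of reducers; the sketch below mirrors Hadoop's built-in HashPartitioner for the grep example's key/value types.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each map-output record is assigned to one of the reduce partitions by hashing
// its key modulo the number of reduce tasks (the same idea as the default HashPartitioner).
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In the driver this would be wired up with job.setPartitionerClass(KeyHashPartitioner.class) and job.setNumReduceTasks(n). The buffer size and spill threshold mentioned in item (2) are exposed as configuration properties; in Hadoop 2.x they are mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent (older releases used io.sort.mb and io.sort.spill.percent).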

5. Reduce stage:
Like the map function, the reduce function is written by the programmer; its final result is stored on HDFS.



