Hadoop's big data platform foundation (2)

1. The Map/Reduce Working Mechanism: Analysis of the Data Flow

In the core framework of the MapReduce algorithm, the data to be processed is first placed in HDFS; the input is then split and handed to map tasks running on the worker nodes, and each map task outputs a set of intermediate key-value pairs. The question, then, is how this intermediate data is handed over to Reduce, and by what rules it is allocated to the individual worker nodes.
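The paragraph above boils down to the usual Mapper contract. Below is a minimal, word-count-style sketch (class and field names are my own illustrations, not from the original text) showing a map task turning its input split into the intermediate key-value pairs that Shuffle will later move to the Reduce side:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count-style Mapper: reads lines from its input split and emits
// intermediate (word, 1) key-value pairs for the Shuffle phase.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One intermediate key-value pair per token; these pairs are what
        // Shuffle later partitions, sorts, and hands to the Reduce side.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```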

 

Shuffle:

After the map computation is complete, the data is handed over to Reduce through a process called Shuffle. Shuffle is the core of Hadoop's data processing: it takes the intermediate data scattered across the worker nodes by the different map tasks, merges it according to certain rules, and, once it has taken on its new form, distributes it to the worker nodes that run the reduce tasks.

Steps:

1. Map task procedure: input split → map → buffer in memory.

2. Buffer in memory: partition, sort, and spill to disk. Partitioning assigns each intermediate key-value pair to an interval (one per reduce task); by default there is no global ordering across all partitions, but the data within each partition is sorted. When the buffer fills up, each partition's data is processed and the resulting blocks are spilled to disk.

3. The spilled data of each partition is merged and sorted, then handed over to Reduce for output.
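To make the map → shuffle → reduce pipeline concrete, here is a minimal driver sketch (WordCountMapper and WordCountReducer refer to the sketches elsewhere in this article; the input and output paths are placeholders taken from the command line). The number of reduce tasks set here is also the number of partitions the Map side produces:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver wiring map -> shuffle -> reduce together.
public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountJob.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the intermediate and final key-value pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One partition per reduce task: every intermediate pair is routed
        // to exactly one of these reducers during Shuffle.
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```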

 

The Shuffle process is implemented on both the Map side and the Reduce side.

1. Map-side work:

a. Partition: based on the key of each key-value pair, choose the partition (corresponding to a Reduce node) to which the pair belongs (a minimal partitioner sketch follows this list).

b. Sort: sort the key-value pairs in each partition according to the key.

c. Spill: the Map-side result is first stored in a buffer; once the buffer exceeds its threshold, a spill naturally takes place and part of the data is written to the hard disk.

d. Merge: key-value pairs destined for the same Reduce node need to be merged together. (This step mostly works against the hard disk; for massive data processing, overflowing the buffer is the normal case.)
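The partition rule in step a is pluggable. Below is a minimal sketch of a custom partitioner (the class name is my own, and it simply mirrors the behaviour of Hadoop's default HashPartitioner); it would be registered on the job with job.setPartitionerClass(WordPartitioner.class):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which partition (and therefore which Reduce task) an intermediate
// key-value pair belongs to, based only on its key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key into one of the reduce partitions; all pairs with the
        // same key land in the same partition and reach the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```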

2. Reduce side work:

a. Copy: pull data from the relevant Map tasks over HTTP; note that the data is read from the local disk of the Map-side node.

b. Merge: a Reduce node may obtain data from multiple Map nodes; after the data has been copied, it is merged into a single input for the reduce function.

c. Sort: sort the key-value pairs in the partition by key; this is the same operation as on the Map side.
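By the time the reduce function runs, copy, merge, and sort have already grouped all values for a key together. Here is a minimal Reducer sketch for the same word-count example used above (again, the class name is my own illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count-style Reducer: each reduce() call receives one key together with
// all of the values that Shuffle collected for it from the Map side.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the values grouped under this key and emit the final result.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}
```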

 

 

2. Error handling mechanism

For a Hadoop cluster, the failure of an individual node does not stop the job from running: each distributed task is still tracked and allocated through the JobTracker. The fatal error is a failure of the JobTracker itself; once the JobTracker main program fails, the whole Hadoop cluster becomes unusable and can only be restarted.

TaskTracker node error:

The heartbeat mechanism between the JobTracker and the TaskTrackers: each TaskTracker must report the progress of its node to the JobTracker within one minute.

1. If the JobTracker still has not received a report after the timeout, the TaskTracker is removed from the set of queues waiting to be scheduled.

2. If a report is received but indicates failure, the TaskTracker is moved to the end of the waiting queue and re-queued; if it fails four times in a row, it is likewise removed.
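The timeout and retry behaviour described above is configurable. Here is a hedged sketch using Hadoop 1.x-era property names (mapred.tasktracker.expiry.interval, mapred.map.max.attempts, mapred.reduce.max.attempts); property names and default values differ between Hadoop versions, so treat the values below as illustrative only:

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative failure-handling knobs for a JobTracker/TaskTracker cluster.
public class FailureTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();

        // How long the JobTracker waits without a heartbeat before it stops
        // scheduling work on a TaskTracker (milliseconds; here one minute,
        // matching the reporting interval described above).
        conf.setLong("mapred.tasktracker.expiry.interval", 60000L);

        // How many times a failed map or reduce attempt is retried before
        // the task, and with it the job, is declared failed.
        conf.setInt("mapred.map.max.attempts", 4);
        conf.setInt("mapred.reduce.max.attempts", 4);

        return conf;
    }
}
```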

 

 

 

 
