HDFS and MapReduce on the Hadoop platform

1. In your own words, explain how HDFS and MapReduce work on the Hadoop platform: their principles and processes.

HDFS:

1) The first stage: NameNode startup
(1) On the first start, the NameNode is formatted, creating the fsimage and edits files. On subsequent starts, it loads the fsimage and edits log directly into memory.
(2) The client requests a metadata change (create, delete, update).
(3) The NameNode records the operation in the edits log and rolls the log.
(4) The NameNode applies the change to its in-memory metadata.
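Steps (3) and (4) follow a write-ahead pattern: the operation is logged durably before the in-memory state is updated, so the edits log can rebuild the namespace after a crash. A toy Python sketch of this idea (class and method names are made up for illustration, not the Hadoop API):

```python
class NameNodeSketch:
    """Toy model of steps (3)-(4): log the operation first, then update memory."""
    def __init__(self):
        self.edits = []        # stands in for the on-disk edits log
        self.metadata = {}     # in-memory namespace: path -> list of block ids

    def apply(self, op, path, blocks=None):
        self.edits.append((op, path, blocks))   # (3) record the operation in the log
        if op == "create":                      # (4) update the in-memory metadata
            self.metadata[path] = blocks or []
        elif op == "delete":
            self.metadata.pop(path, None)

nn = NameNodeSketch()
nn.apply("create", "/user/a.txt", ["blk_1"])
nn.apply("delete", "/user/a.txt")
```

After the two operations the namespace is empty again, but both operations survive in the log, which is exactly what the checkpoint process below consumes.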

2) The second stage: Secondary NameNode checkpoint
(1) The Secondary NameNode asks the NameNode whether a checkpoint is needed; the NameNode returns the answer directly.
(2) The Secondary NameNode requests that a checkpoint be executed.
(3) The NameNode rolls the edits log that is currently being written.
(4) The fsimage file and the pre-roll edits log are copied to the Secondary NameNode.
(5) The Secondary NameNode loads the fsimage and the edits log into memory and merges them.
(6) A new image file, fsimage.chkpoint, is generated.
(7) fsimage.chkpoint is copied back to the NameNode.
(8) The NameNode renames fsimage.chkpoint to fsimage.
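The heart of the checkpoint (steps (5)-(6)) is replaying the logged operations on top of the last image snapshot to produce a new, up-to-date image. A minimal sketch, assuming the same toy (op, path, blocks) log format as above (not Hadoop's actual on-disk formats):

```python
def checkpoint(fsimage, edits):
    """Replay the edits log on top of the fsimage snapshot (steps (5)-(6))."""
    merged = dict(fsimage)                 # (5) load the old image into memory
    for op, path, blocks in edits:         # ...and replay each logged operation
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged                          # (6) this becomes fsimage.chkpoint

old_image = {"/a": ["blk_1"]}
log = [("create", "/b", ["blk_2"]), ("delete", "/a", None)]
new_image = checkpoint(old_image, log)
```

Because the merge happens on the Secondary NameNode, the NameNode itself never has to pause to compact its own log.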

DataNode:

1) A block is stored on disk as two files: one holding the data itself, and one holding metadata, which includes the block length, the block checksum, and a timestamp.
2) After starting, the DataNode registers with the NameNode, then periodically (every hour) reports all of its block information to the NameNode.
3) A heartbeat is sent every 3 seconds; the heartbeat response can carry commands from the NameNode to the DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes, that node is considered unavailable.
4) Machines can safely join and leave the cluster while it is running.
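The liveness rule in step 3) is simple to state as code: a node is alive if its last heartbeat arrived within the timeout window. A toy sketch with made-up names (the real NameNode logic is more involved, e.g. a "stale" state before "dead"):

```python
import time

HEARTBEAT_INTERVAL = 3     # seconds between heartbeats
TIMEOUT = 10 * 60          # 10 minutes of silence -> node considered unavailable

def live_datanodes(last_heartbeat, now):
    """Return the nodes whose last heartbeat is within the timeout window."""
    return {node for node, ts in last_heartbeat.items() if now - ts <= TIMEOUT}

now = time.time()
beats = {"dn1": now - 5, "dn2": now - 11 * 60}   # dn2 has been silent 11 minutes
alive = live_datanodes(beats, now)
```

With these numbers, dn1 is alive and dn2 is declared unavailable, so the NameNode would schedule re-replication of dn2's blocks.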
MapReduce:

shuffle process:
1) MapTask collects the key/value pairs output by our map() method and puts them into a memory buffer.
2) The buffer repeatedly spills ("overflows") to local files on disk; multiple spill files may be produced.
3) The multiple spill files are merged into one large spill file.
4) During both the spill and the merge, the Partitioner is called to partition the data, and records are sorted by key.
5) Each ReduceTask fetches, from every MapTask's machine, the result data belonging to its own partition number.
6) The ReduceTask gathers the result files for the same partition from the different MapTasks and then merges them (merge sort).
7) Once the files are merged into one larger file, the shuffle process is over, and the ReduceTask logic takes over (reading one group of values per key from the file and calling the user-defined reduce() method).
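The core of steps 1)-4) is routing each record to a partition by hashing its key, then sorting within each partition. A small Python sketch of this idea (the function name is made up; for determinism it uses CRC32 instead of Java's hashCode, but the structure mirrors Hadoop's default hash partitioning):

```python
import zlib

def shuffle_partition(map_outputs, num_reduce_tasks):
    """Route each map-output record to a partition, then sort within partitions."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for records in map_outputs:                           # one record list per MapTask
        for key, value in records:
            p = zlib.crc32(key.encode()) % num_reduce_tasks  # partition by key hash
            partitions[p].append((key, value))
    for part in partitions:
        part.sort()                                       # keys ordered inside a partition
    return partitions

maps = [[("banana", 1), ("apple", 1)], [("apple", 2), ("cherry", 1)]]
parts = shuffle_partition(maps, 2)
```

Because the partition is a pure function of the key, every occurrence of "apple" lands in the same partition no matter which MapTask produced it, which is what lets one ReduceTask see all values for a key.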

 

MapTask:

(1) Read stage: MapTask parses key/value pairs out of the input InputSplit through the user-written RecordReader.
(2) Map stage: each parsed key/value pair is handed to the user's map() function, which processes it and produces a new set of key/value pairs.
(3) Collect stage: inside the user's map() function, when a record has been processed, OutputCollector.collect() is usually called to emit the output. Inside this function, the key/value pair is assigned a partition (the Partitioner is called) and written into a ring buffer in memory.
(4) Spill stage ("overflow write"): when the ring buffer fills up, MapReduce writes its data to local disk, creating a temporary file. Note that before the data is written to disk, it must first be sorted locally, and, if necessary, combined and compressed.
Spill details:
Step 1: Sort the data in the buffer using quicksort. The sort order is first by partition number, then by key. After sorting, the data is grouped together partition by partition, and within each partition all records are ordered by key.
Step 2: Write the data of each partition, in ascending partition order, to the temporary file output/spillN.out in the task's working directory (N is the current spill count). If the user has configured a Combiner, the data in each partition is aggregated once before being written to the file.
Step 3: Write each partition's meta-information into the in-memory index structure SpillRecord; the meta-information for each partition includes its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1 MB, it is written out to the index file output/spillN.out.index.
(5) Merge stage: once all data has been processed, MapTask merges all of its temporary files so that it ultimately produces a single data file.
When all data has been processed, MapTask merges all temporary files into one large file, saved as output/file.out, and generates the corresponding index file output/file.out.index.
During the merge, MapTask merges partition by partition. For a given partition, it merges in multiple recursive rounds: each round merges io.sort.factor (default 10) files and adds the resulting file back to the list of files to be merged; the list is re-sorted and the process repeats until a single large file remains.
Having each MapTask ultimately produce only one data file avoids the overhead of opening and reading many files at the same time, and the large number of small random reads that many small files would cause.
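The multi-round merge described above can be sketched in a few lines: repeatedly take up to io.sort.factor sorted runs, merge them into one sorted run, and put the result back on the list until only one run remains. A toy Python version (the function name is hypothetical; real MapTask merges files on disk, not lists in memory):

```python
import heapq

def multiround_merge(runs, factor=10):
    """Merge sorted runs `factor` (io.sort.factor) at a time until one remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        batch = [runs.pop(0) for _ in range(min(factor, len(runs)))]
        runs.append(list(heapq.merge(*batch)))   # merged round rejoins the list
    return runs[0] if runs else []

spills = [[1, 4], [2, 5], [3, 6], [0, 7]]       # four sorted spill files
merged = multiround_merge(spills, factor=2)
```

With factor=2 this takes three merge rounds; a larger factor trades more simultaneously open files for fewer rounds, which is exactly the knob io.sort.factor exposes.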

ReduceTask:

(1) Copy stage: ReduceTask remotely copies one piece of data from each MapTask; if a piece exceeds a certain size threshold it is written to disk, otherwise it is kept directly in memory.
(2) Merge stage: while the remote copying is still in progress, ReduceTask starts two background threads that merge the files in memory and on disk, to prevent excessive memory use and too many disk files.
(3) Sort stage: according to MapReduce semantics, the input to the user's reduce() function is a set of data grouped by key. Hadoop brings data with the same key together by sorting. Since each MapTask has already locally sorted its own results, ReduceTask only needs to perform one merge sort over all of the data.
(4) Reduce stage: the reduce() function writes its computed results to HDFS.
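Stages (3)-(4) amount to merge-sorting the already-sorted map outputs and calling reduce() once per key group. A compact sketch using Python's standard library (function names are made up; this is the classic word-count reduce, not Hadoop's Java API):

```python
import heapq
from itertools import groupby

def reduce_phase(sorted_map_outputs, reduce_fn):
    """Merge-sort pre-sorted map outputs, group by key, call reduce() per group."""
    merged = heapq.merge(*sorted_map_outputs, key=lambda kv: kv[0])  # one merge sort
    results = {}
    for key, group in groupby(merged, key=lambda kv: kv[0]):         # one group per key
        results[key] = reduce_fn(key, [v for _, v in group])
    return results

# each inner list is one MapTask's output, already sorted by key
runs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
counts = reduce_phase(runs, lambda key, values: sum(values))
```

heapq.merge only works because each input run is already sorted; that is the property the MapTask's local sort guarantees, and why the ReduceTask never needs a full re-sort.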



Origin www.cnblogs.com/destinymingyun/p/10993295.html