Big Data Learning Route: Parsing the Whole MapReduce Process

Sharing a walkthrough of the whole MapReduce process: moving data vs. moving computation.

When studying big data you will encounter two closely related yet very different concepts: moving the data and moving the computation. Moving the computation is also called data-local computing.

Moving the data is the older approach: the data to be processed is transferred over the network to the nodes where the processing logic runs. This is inefficient, especially when the data volume is large: at least gigabytes, often terabytes or even petabytes. Disk I/O and network I/O become the bottleneck, processing takes far too long, and requirements cannot be met. Hence moving the computation emerged.

Moving the computation, also called local computing, takes the opposite approach: the data stays on the nodes where it is stored, and the processing program is shipped to each of those nodes instead. Since the program itself is small, it can be transferred quickly to every node that holds data and then process that data locally, which is highly efficient. Big data processing frameworks adopt this approach.

 

The concise version:

Map stage (a minimal Mapper sketch follows this list):

1. Read: read the data source and parse the input into key/value (K/V) pairs.

2. Map: hand each K/V pair to the map() function, which processes it and produces new K/V pairs.

3. Collect: collect the output and store it in a ring memory buffer.

4. Spill: when the buffer is full, write the data to local disk, producing temporary files.

5. Combine: merge the temporary files to ensure a single data file is produced.
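To make these steps concrete, here is a minimal word-count Mapper sketch using the standard Hadoop API. The class name WordCountMapper and the word-count logic are illustrative additions, not part of the original post:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Read: the framework parses the input split into (offset, line) pairs.
// Map: we turn each line into (word, 1) pairs; Collect and Spill then
// happen inside the framework, not in user code.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // enters the ring buffer via collect()
            }
        }
    }
}
```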

 

Reduce stage (a minimal Reducer sketch follows this list):

1. Shuffle: the copy phase; the ReduceTask remotely copies a piece of data from each MapTask.

2. For each copied piece of data, if its size exceeds a certain threshold it is written to disk; otherwise it is kept in memory.

3. Merge: merge the files in memory and on disk, to prevent excessive memory use or too many files on disk.

4. Sort: each MapTask performed a partial (local) sort; the ReduceTask performs one merge sort over the sorted pieces.

5. Reduce: feed the data to the reduce() function.

6. Write: write the results computed by the reduce() function to HDFS.
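Likewise, a minimal Reducer sketch matching the word-count Mapper above. The framework performs the shuffle, merge, and sort before ever calling reduce():

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Shuffle/Merge/Sort are done by the framework: each reduce() call
// receives one key together with all values copied and merge-sorted for it.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // Write: the output lands on HDFS
    }
}
```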

 

The in-depth version:

MapTask stage

(1) Read stage: the MapTask uses the user-written RecordReader to parse individual key/value pairs out of the input InputSplit.

(2) Map stage: this stage mainly hands each parsed key/value pair to the user-written map() function, which processes it and produces a series of new key/value pairs.

(3) Collect stage: in the user-written map() function, once a piece of data has been processed, OutputCollector.collect() is usually called to output the result. Inside that function, the generated key/value pair is partitioned (by calling the Partitioner) and written into a ring memory buffer (a custom Partitioner sketch appears after this section).

(4) Spill stage: the "spill-to-disk" stage. When the ring buffer is full, MapReduce writes the data to the local disk, creating a temporary file. Note that before the data is written to disk, it must first be sorted locally, and, where necessary, combined and compressed.

 

Spill stage details:

Step 1: sort the data in the buffer with quicksort. The ordering is composite: sort first by partition number, then by key within each partition. After sorting, the data is clustered partition by partition, and within each partition all data is ordered by key (an illustrative comparator sketch follows these steps).

Step 2: write the data of each partition, in ascending order of partition number, to a temporary file output/spillN.out under the task's working directory (N is the current spill count). If the user has set a Combiner, the data in each partition is aggregated once before it is written to the file.

Step 3: write the partition metadata to the in-memory index structure SpillRecord; the metadata for each partition includes its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1 MB, it is written to the index file output/spillN.out.index.
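As an aside, here is a small standalone sketch of the composite ordering in Step 1 (partition number first, then key). It only mimics the ordering for illustration; Hadoop's actual spill code sorts index metadata inside the ring buffer rather than record objects:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: mimics the spill sort order (partition first, then key).
public class SpillOrderDemo {
    static final class Record {
        final int partition;
        final String key;
        Record(int partition, String key) { this.partition = partition; this.key = key; }
        @Override public String toString() { return "p" + partition + ":" + key; }
    }

    public static void main(String[] args) {
        Record[] buffer = {
            new Record(1, "banana"), new Record(0, "cherry"),
            new Record(1, "apple"),  new Record(0, "apricot"),
        };
        Arrays.sort(buffer, Comparator
                .comparingInt((Record r) -> r.partition) // first by partition number
                .thenComparing(r -> r.key));             // then by key within a partition
        // Prints: [p0:apricot, p0:cherry, p1:apple, p1:banana]
        System.out.println(Arrays.toString(buffer));
    }
}
```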

(5) Combine stage: when all data has been processed, the MapTask merges all temporary files into one large file, saved as output/file.out, and generates the corresponding index file output/file.out.index, so that each MapTask ultimately produces only a single data file. During the merge, the MapTask works partition by partition. For a given partition, it merges in multiple recursive rounds: each round merges io.sort.factor files (default 10), adds the resulting file back to the list of files awaiting merging, sorts the list, and repeats until a single large file remains. Having each MapTask generate only one data file avoids the overhead of opening a large number of files simultaneously and the random reads caused by reading many small files at once.
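The Partitioner invoked during the Collect stage and the Combiner applied in Step 2 are both user-pluggable. Below is a hypothetical custom Partitioner reusing the word-count types from the earlier sketches; the default HashPartitioner simply hashes the whole key instead:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative: routes keys to reduce partitions by their first letter.
// The Collect stage calls getPartition() for every map-output pair.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int bucket = s.isEmpty() ? 0 : Character.toLowerCase(s.charAt(0)) - 'a';
        if (bucket < 0 || bucket > 25) {
            bucket = 0; // non-alphabetic keys all land in partition 0
        }
        return bucket % numPartitions;
    }
}
```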

 

Shuffle stage (from map output to reduce input)

1) The MapTask collects the KV pairs output by our map() method and puts them into a memory buffer.

2) The memory buffer continually spills to local disk files; there may be multiple spill files.

3) The multiple spill files are merged into one large spill file.

4) During both the spill and the merge, the Partitioner is called to partition the data, and the data is sorted by key within each partition.

5) Each ReduceTask, according to its own partition number, fetches the corresponding partition of result data from every MapTask's machine.

6) The ReduceTask gathers the result files belonging to the same partition from the different MapTasks, then merges them (merge sort).

7) Once the files are merged into a larger file, the shuffle process is over, and the logical processing of the ReduceTask begins (reading one key group after another from the file and calling the user-defined reduce() method).

Note: the size of the buffer used in shuffle affects the execution efficiency of the MapReduce program. In principle, the larger the buffer, the fewer the disk I/O operations and the faster the execution. The buffer size can be tuned with the parameter io.sort.mb (default 100 MB); a configuration sketch follows.
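A minimal sketch of tuning that knob from a driver, assuming Hadoop 2.x+ where the property was renamed mapreduce.task.io.sort.mb (io.sort.factor likewise became mapreduce.task.io.sort.factor); the values shown are arbitrary examples:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: enlarging the map-side sort buffer and the merge fan-in.
public class ShuffleTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);    // ring buffer size, default 100 (MB)
        conf.setInt("mapreduce.task.io.sort.factor", 50); // files merged per round, default 10
        return conf;
    }
}
```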

 

ReduceTask stage (a complete driver sketch follows this section)

(1) Copy stage: the ReduceTask remotely copies a piece of data from each MapTask; if a given piece exceeds a certain size threshold it is written to disk, otherwise it is placed directly in memory.

(2) Merge stage: while copying data remotely, the ReduceTask starts two background threads that merge the files in memory and on disk, to prevent excessive memory use or too many files accumulating on disk.

(3) Sort stage: by MapReduce semantics, the input to the user-written reduce() function is a set of data grouped by key. To bring data with the same key together, Hadoop uses a sort-based strategy. Since each MapTask has already locally sorted its own output, the ReduceTask only needs to perform one merge sort over all the data.

(4) Reduce stage: the reduce() function writes its computed results to HDFS.
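Finally, a driver sketch that wires the earlier Mapper, Reducer, and (optionally) Combiner together, tracing the whole pipeline described above; the class name and input/output paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Pipeline: InputFormat -> map() -> partition/sort/spill -> (optional combine)
// -> shuffle -> merge sort -> reduce() -> write to HDFS.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // map-side aggregation before spill
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```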
