Good Programmers big data learning route: a full walkthrough of the MapReduce process, moving data and moving computation


  When studying big data you will encounter two closely related but very different concepts: moving data and moving computation. Moving computation is also called local computation.

  Moving data is how data was processed in earlier systems: the data to be processed is transferred from the various nodes where it is stored to the node that runs the processing logic. This is inefficient, especially when the data volume is large: at gigabytes, terabytes, or even petabytes, disk I/O and network I/O throughput are far too low, so processing takes too long to meet requirements. Hence moving computation emerged.

  With moving computation, also called local computation, the data stays on the nodes where it is stored; instead, the processing program is sent to each node that holds data. Since the program is certainly not particularly large, it can be transferred quickly to every node storing data; each node then processes its own local data, which is efficient. Big data processing frameworks all adopt this approach.
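The idea can be illustrated with a toy, single-process simulation (no real cluster; node names and functions here are purely illustrative): ship a small counting function to each "node" and move only the small per-node results over the "network", rather than the raw data.

```python
# Toy simulation of moving computation: each "node" holds raw data;
# we run a small function locally on each node and only transfer the
# compact per-node results for the final combine step.

nodes = {
    "node1": ["apple banana", "apple"],
    "node2": ["banana banana"],
}

def local_word_count(lines):
    # Runs "on the node": processes local data, returns a small summary
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# "Move the computation": apply the function where the data lives,
# then combine only the small partial results.
partials = [local_word_count(data) for data in nodes.values()]
total = {}
for p in partials:
    for w, c in p.items():
        total[w] = total.get(w, 0) + c
print(total)  # {'apple': 2, 'banana': 3}
```

Only the per-node dictionaries cross the simulated network, which is the whole point: the program is tiny compared to the data it processes.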

 

In brief:

Map stage:

1. Read: read the data source and parse the input into key/value (K/V) pairs

2. Map: hand each K/V pair to the map function, which processes it and produces new K/V pairs

3. Collect: emit the output into an in-memory ring buffer

4. Spill: when the buffer is full, write its data to local disk, producing temporary files

5. Combine: merge the temporary files, ensuring the task produces a single data file
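The Map-stage steps above can be sketched as a toy single-process simulation (function names like map_side and partition_of are illustrative, not Hadoop's API): read lines, map them to K/V pairs, collect the partitioned records into a buffer, then "spill" them sorted by partition number and key.

```python
import zlib

def word_count_map(line):
    # Map: parse one line of input into (key, value) pairs
    for word in line.split():
        yield word, 1

def partition_of(key, num_partitions):
    # Deterministic stand-in for Hadoop's default hash partitioner
    return zlib.crc32(key.encode()) % num_partitions

def map_side(lines, num_partitions=2):
    buffer = []  # stand-in for the in-memory ring buffer
    for line in lines:                            # Read
        for k, v in word_count_map(line):         # Map
            p = partition_of(k, num_partitions)   # call the partitioner
            buffer.append((p, k, v))              # Collect
    # Spill: sort by partition number first, then by key within a partition
    buffer.sort(key=lambda rec: (rec[0], rec[1]))
    return buffer

spill = map_side(["b a", "a c"])
print(spill)
```

The returned list models the contents of one spill file: records grouped by partition, ordered by key inside each partition.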

 

Reduce stage:

1. Shuffle: the copy phase; the ReduceTask remotely copies a slice of data from each MapTask

2. For each copied slice, if its size exceeds a threshold it is written to disk; otherwise it is kept in memory

3. Merge: merge the files in memory and on disk, to keep memory usage and the number of disk files from growing too large

4. Sort: MapTasks sort their output locally; the ReduceTask then performs a single merge sort

5. Reduce: hand the data to the reduce function

6. Write: write the results computed by the reduce function to HDFS
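A matching toy simulation of the Reduce stage (again illustrative names, not Hadoop's API), assuming each MapTask's output is already locally sorted as the Sort step guarantees: merge-sort the inputs, group values by key, and apply the reduce function.

```python
import heapq
from itertools import groupby

def word_count_reduce(key, values):
    # Reduce: aggregate all values seen for one key
    return key, sum(values)

def reduce_side(sorted_map_outputs):
    # Merge/Sort: merge-sort the already locally-sorted per-map outputs
    merged = heapq.merge(*sorted_map_outputs)
    # Reduce: group consecutive records by key, apply the reduce function
    results = []
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        results.append(word_count_reduce(key, [v for _, v in group]))
    return results  # in a real job, the Write step would put this on HDFS

out = reduce_side([[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]])
print(out)  # [('a', 2), ('b', 1), ('c', 1)]
```

Note that groupby only works here because heapq.merge yields the records in sorted order, which is exactly why the merge sort must precede the reduce call.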

 

In depth:

MapTask stage

(1) Read stage: the MapTask uses the user-written RecordReader to parse individual key/value pairs out of the InputSplit.

(2) Map stage: each parsed key/value pair is handed to the user-written map() function, which processes it and produces a new set of key/value pairs.

(3) Collect stage: in the user's map() function, once a record has been processed, OutputCollector.collect() is usually called to emit the output. Inside that function, the key/value pair is assigned a partition (by calling the Partitioner) and written into an in-memory ring buffer.

(4) Spill stage: also known as the "spill write". When the ring buffer is full, MapReduce writes its data out to local disk, creating a temporary file. Note that before the data is written to disk, it is first sorted locally and, where necessary, combined and compressed.

 

Details of the spill write:

Step 1: sort the data in the buffer using quicksort. The sort order is: first by partition number, then by key within each partition. After sorting, the data is grouped together by partition, and within each partition it is ordered by key.

Step 2: write the data of each partition, in ascending order of partition number, into the temporary file output/spillN.out under the task's working directory (N is the current spill count). If the user has configured a Combiner, an aggregation pass is run over each partition's data before it is written out.

Step 3: write the partitions' metadata into the in-memory index structure SpillRecord. The metadata for each partition includes its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1MB, it is written out to the index file output/spillN.out.index.
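Step 3 can be illustrated with a small sketch (the field names and the fake "compression" are illustrative only; this is not Hadoop's actual SpillRecord layout): each partition's index entry records where that partition starts in the spill file and how big it is before and after compression.

```python
# Illustrative model of a per-partition spill index: each entry holds
# the partition's offset in the spill file, its raw length, and its
# length after compression (here a fake 2:1 codec stands in for a real one).

def build_spill_index(partition_blocks):
    index, offset = [], 0
    for pid, raw in partition_blocks:
        compressed_len = len(raw) // 2  # stand-in for a real codec
        index.append({"partition": pid, "offset": offset,
                      "raw_len": len(raw), "compressed_len": compressed_len})
        offset += compressed_len  # next partition starts after this one
    return index

idx = build_spill_index([(0, b"aaaa"), (1, b"bbbbbb")])
print(idx)
```

With such an index, a ReduceTask fetching partition 1 can seek straight to its offset instead of scanning the whole spill file.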

(5) Combine stage: when all data has been processed, the MapTask merges all of its temporary files once, ensuring it ultimately produces a single data file, saved as output/file.out, along with the corresponding index file output/file.out.index. During the merge, the MapTask works partition by partition. For each partition, it merges in multiple recursive rounds: each round merges io.sort.factor files (default 10), adds the resulting file back to the list of files awaiting merge, and repeats until a single large file remains. Having each MapTask produce only one data file avoids the cost of opening and reading a large number of small files at once, and the many small random reads that would bring.
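The multi-round merge described above can be modeled in a few lines (a simplified model, assuming each round merges exactly one batch of up to io.sort.factor files and puts the merged result back on the list):

```python
# Model of the recursive multi-round merge: each round takes up to
# `factor` files off the list, merges them into one file, and puts
# that file back; this repeats until a single file remains.

def merge_rounds(num_files, factor=10):  # factor ~ io.sort.factor
    rounds = 0
    while num_files > 1:
        merged = min(factor, num_files)      # one batch of up to `factor` files
        num_files = num_files - merged + 1   # the batch becomes one file
        rounds += 1
    return rounds

print(merge_rounds(5))        # 5 spill files merge in a single round
print(merge_rounds(100, 10))  # 100 files need 11 rounds at factor 10
```

The model makes the trade-off visible: a larger merge factor means fewer rounds (less re-reading of data) at the cost of more files open simultaneously per round.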

 

Shuffle stage (from map output to reduce input)

1) The MapTask collects the kv pairs output by our map() method and puts them into an in-memory buffer

2) The buffer continually spills over to local disk files; there may be multiple spill files

3) The multiple spill files are merged into one large spill file

4) During both the spill and the merge, the partitioner is called to partition the data, and the data is sorted by key

5) Each ReduceTask fetches, according to its own partition number, the corresponding partition of result data from each MapTask's machine

6) A ReduceTask thus collects result files for the same partition from different MapTasks, and merges these files again (merge sort)

7) Once they are merged into one large file, the shuffle is over; what follows is the ReduceTask's logical computation (take the key/value groups out of the file one by one and call the user-defined reduce() method)

Note: the size of the buffer used during shuffle affects the execution efficiency of a MapReduce program. In principle, the larger the buffer, the fewer disk I/O operations, and the faster the execution. The buffer size can be tuned via the parameter io.sort.mb, default 100MB.
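The note above is simple arithmetic: ignoring the spill threshold percentage and serialization overhead, the number of spill files is roughly the total map output divided by the buffer size, so a bigger buffer directly means fewer spills and fewer disk I/Os. A rough back-of-the-envelope sketch:

```python
import math

# Rough estimate (simplification: ignores the spill threshold percentage
# and per-record serialization overhead) of how many spill files a map
# task produces for a given output volume and sort buffer size.

def estimated_spills(total_output_mb, buffer_mb=100):  # io.sort.mb default 100MB
    return max(1, math.ceil(total_output_mb / buffer_mb))

print(estimated_spills(1000))       # 10 spills with the default 100MB buffer
print(estimated_spills(1000, 512))  # 2 spills with a 512MB buffer
```

Fewer spill files also means fewer files for the Combine-stage merge to process later.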

 

ReduceTask stage

1) Copy phase: the ReduceTask remotely copies a slice of data from each MapTask; for any given slice, if its size exceeds a threshold it is written to disk, otherwise it is placed directly in memory.

2) Merge phase: while data is being copied remotely, the ReduceTask starts two background threads that merge the files in memory and on disk, to keep memory usage and the number of on-disk files from growing too large.

3) Sort phase: by MapReduce semantics, the input to the user-written reduce() function is a set of data grouped by key. To bring records with the same key together, Hadoop uses a sort-based strategy. Since each MapTask has already locally sorted its own output, the ReduceTask only needs to perform a single merge sort over all the data.

4) Reduce phase: the reduce() function writes its computed results to HDFS.



Origin blog.51cto.com/14479068/2432987