MapReduce common knowledge summary

One, MapReduce ideas

MapReduce is good at processing large data sets. Where does this power come from? It can be traced back to MapReduce's design idea: "divide and conquer."

  (1) The Mapper is responsible for "dividing," i.e., breaking a complex task into several "simple tasks" to process. "Simple task" carries three meanings: first, the data volume or computation is greatly reduced relative to the original task; second, the principle of computing close to the data, i.e., each task is assigned to the node that stores the data it needs; third, these small tasks can be computed in parallel, with almost no dependencies between them.

  (2) The Reducer is responsible for aggregating the results of the map stage. The number of Reducers needed is set by the user, according to the specific problem, via the parameter mapred.reduce.tasks in the mapred-site.xml configuration file; its default value is 1.

Two, MapReduce programming specifications

1. A user program is written in three parts: the Mapper, the Reducer, and the Driver (the client program that submits the MR job). A minimal sketch of all three parts follows this list.

2. The input data to the Mapper is in the form of KV pairs (the KV types are customizable).

3. The output data of the Mapper is in the form of KV pairs (the KV types are customizable).

4. The Mapper's business logic is written inside the map() method.

5. The map() method (in the MapTask process) is called once for each input <k, v> pair.

6. The input data type of the Reducer corresponds to the output data type of the Mapper, and is also a KV pair.

7. The Reducer's business logic is written inside the reduce() method.

8. The ReduceTask process calls the reduce() method once for each group of <k, v> pairs that share the same k.

9. Both the user-defined Mapper and Reducer classes should inherit from their respective parent classes.

10. The whole program needs a Driver to submit it; the Driver is a Job object that describes all the information the job needs.
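
To make these specifications concrete, here is a minimal sketch of the three parts using the classic word-count example (class names such as WordCountDriver, WcMapper and WcReducer are illustrative, not from the original text):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: input KV is <line offset, line text>, output KV is <word, 1>
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                word.set(w);
                context.write(word, ONE);   // collected into the map-side buffer
            }
        }
    }

    // Reducer: called once per group of identical keys, sums the counts
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: describes the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}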

Three, processes involved when a MapReduce instance runs in distributed mode

1. MRAppMaster: responsible for scheduling the whole program's execution and coordinating its state.

2. YarnChild: responsible for the entire data-processing flow of the map stage.

3. YarnChild: responsible for the entire data-processing flow of the reduce stage. The processes for both the MapTask and ReduceTask stages are YarnChild processes, but that does not mean a MapTask and a ReduceTask run inside the same YarnChild.

Four, the MapReduce run process

1. When an MR program starts, the first process started is the MRAppMaster. After it starts, the MRAppMaster uses the description information of the current job to calculate the number of MapTask instances needed, and then asks the cluster to start that many MapTask processes on the appropriate machines.

2. After the MapTask processes start, each one processes the data in its assigned input split (the offset range within a file). The main flow is:

   A. Use the RecordReader of the customer-specified InputFormat to read the input data and form KV pairs.

  B. Pass the input KV pairs to the customer-defined map() method, perform the logic, and collect the KV pairs output by the map() method into a cache.

  C. The KV pairs in the cache are partitioned by K, sorted, and continuously spilled to disk files.

3. The MRAppMaster monitors all the MapTask processes until they complete (in reality, as soon as some MapTasks finish, ReduceTasks are started and begin fetching data from the MapTasks that have already completed). It then starts the number of ReduceTask processes specified by the customer's parameter and tells each ReduceTask process which range of data (which data partition) to process.

4. After the ReduceTask processes start, the MRAppMaster tells them where the data to be processed is located. Each ReduceTask fetches a number of MapTask output files from the machines where those MapTasks ran, performs a re-merge sort locally, groups the KV pairs that have the same key, calls the customer-defined reduce() method once per group to perform the logic, collects the resulting KV output, and then calls the customer-specified OutputFormat to write the result data to external storage.

Five, MapTask parallelism

Hadoop has a mechanism that determines MapTask parallelism. When running a MapReduce program, more MapTasks is not necessarily better. We need to consider the machine configuration and the amount of data. If the amount of data is small, the time spent starting tasks may far exceed the time spent processing the data, so adding tasks does not help in that case either.

So how is slicing done?

Suppose we have a 300 MB file. It will be cut into three blocks in HDFS: 0-128 MB, 128-256 MB, 256-300 MB, placed on different nodes. In a MapReduce job, these 3 blocks are handed to three MapTasks.

What a MapTask is actually assigned is a split range, and this range is a logical concept that has nothing to do with the physical division into blocks. In practice, however, if a MapTask has to read data that is not on the machine it is running on, the data must be transferred over the network, which hurts performance badly. Therefore, the common splitting policy is to align each split with a storage block, so that each MapTask reads data on its own machine as far as possible.

If the blocks are very small, multiple small blocks can be given to one MapTask.

So how MapTask splits are cut depends on the circumstances. The default implementation is to split according to the block size. Slicing is the responsibility of the client (the main method we write). One split corresponds to one MapTask instance.

Six, the mechanism that determines MapTask parallelism

The parallelism of a job's map stage is decided by the client when it submits the job.

The client's basic logic for planning map-stage parallelism is as follows:

Perform a logical slicing of the data to be processed (i.e., according to a particular split size, logically divide the data into multiple splits), then assign each split to a MapTask instance for parallel processing.

Slicing mechanism:

The default slicing mechanism in FileInputFormat:

1. Slicing is done simply according to the file's content length.

2. The split size is by default equal to the block size.

3. Slicing does not consider the data set as a whole; each file to be processed is sliced individually. For example, with two files:

  File1.txt 200M

  File2.txt 100M

After processing by the getSplits() method, the following split information is formed:

File1.txt-split1 0-128M

File1.txt-split2 128M-200M

File2.txt-split1 0-100M

Split-size parameters in FileInputFormat

Reading the source code, the split-size computation logic in FileInputFormat is: long splitSize = computeSplitSize(blockSize, minSize, maxSize); in plain terms, it computes the middle value of these three values.

Slicing is mainly determined by these values:

blockSize: defaults to 128 MB; can be modified via dfs.blocksize.

minSize: defaults to 1; can be modified via mapreduce.input.fileinputformat.split.minsize.

maxSize: defaults to Long.MAX_VALUE; can be modified via mapreduce.input.fileinputformat.split.maxsize.

Therefore, if maxSize is tuned smaller than the block size, the split will be smaller than the block size; if minSize is tuned larger than the block size, the split will be larger than the block size. However, no matter how the parameters are tuned, multiple small files will not be combined into one split. A sketch of the computation follows.
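
In plain code, the computation picks the middle of the three values; a minimal sketch of that logic (mirroring FileInputFormat's computeSplitSize):

// Sketch of FileInputFormat's split-size logic: the result is the
// "middle" of blockSize, minSize and maxSize.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

// With the defaults (minSize = 1, maxSize = Long.MAX_VALUE, blockSize = 128 MB)
// the split size equals the block size. Lowering maxSize below the block size
// shrinks the split; raising minSize above the block size enlarges it.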


Seven, ReduceTask parallelism

The number of ReduceTasks also affects the concurrency and efficiency of the whole job. Unlike MapTask concurrency, which is determined by the number of splits, the number of ReduceTasks can be set directly by hand: job.setNumReduceTasks(4);

The default value is 1.

Manually setting it to 4 means four ReduceTasks will run.

Setting it to 0 means no ReduceTask runs at all, i.e., there is no reducer stage, only a mapper stage.

If the data is unevenly distributed, data skew may occur in the reduce stage.

Note: the number of ReduceTasks cannot be set arbitrarily; the needs of the business logic must also be considered. In some cases, such as computing a global aggregate result, there can be only one ReduceTask.

Try not to run too many ReduceTasks. For most jobs, the best number of reduces is equal to, or slightly smaller than, the number of reduce slots in the cluster. This is especially important for small clusters.

Eight, the mechanism that determines ReduceTask parallelism

1. job.setNumReduceTasks(number);
2. job.setReducerClass(MyReducer.class);
3. job.setPartitionerClass(MyPTN.class);

Consider the following situations:

1. If number is set to 1 and a custom Reducer has been set via 2, the number of ReduceTasks is 1.
Regardless of whether the user's MR program sets a Partitioner, the partitioner component will not take effect.

2. If number is not set and a custom Reducer has been set via 2, the number of ReduceTasks is 1.
Under the default partitioner component, whatever number the user sets (any value of 1 or more) will execute normally.
If a custom partitioner component is set, pay attention:
the number of reduceTasks you set must be >= the maximum partition number + 1.
Best case: the partition numbers are consecutive.
Then the total number of reduceTasks = the maximum partition number + 1 = the number of partitions.

3. If number is set to >= 2 and a custom Reducer has been set via 2, the number of ReduceTasks is number;
underneath, the default partitioner component will be at work.

4. If number is set but no custom reducer is set, the MapReduce program still has a reducer stage;
the actual reducer logic is the default implementation of the parent Reducer class, which simply writes its input out as-is,
and the number of ReduceTasks is number.

5. If an MR program does not want a reducer stage at all, just do this:
job.setNumReduceTasks(0);
The whole MR program then has only a mapper stage and no reducer stage,
and therefore no shuffle stage either. A short driver fragment illustrating cases 3 and 5 follows.

Nine, the role of Partitioner

During a MapReduce computation, it is sometimes necessary to split the final output into different files. For example, when dividing by province, data from the same province should go into one file; when dividing by gender, data of the same gender should go into one file. We know the final output data comes from the Reducer tasks, so to get multiple output files, an equal number of Reducer tasks must run. The data for the Reducer tasks comes from the Mapper tasks, which means the Mapper tasks must divide the data and assign different data to different Reducer tasks. The process by which a Mapper task divides its data is called partitioning, and the class responsible for implementing the division is called a Partitioner.
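
As a hedged illustration, a hypothetical Partitioner that routes records by province (the class name ProvincePartitioner and the province list are invented for this example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: send each province's records to its own ReduceTask,
// so each province ends up in its own output file.
public class ProvincePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String province = key.toString();
        if ("Beijing".equals(province))   return 0;
        if ("Shanghai".equals(province))  return 1;
        if ("Guangdong".equals(province)) return 2;
        return 3;   // everything else
    }
}

// In the driver:
// job.setPartitionerClass(ProvincePartitioner.class);
// job.setNumReduceTasks(4);   // must cover partition numbers 0..3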


Ten, the role of Combiner

The combiner is in fact an optimization. Because bandwidth is limited, the amount of data transferred between map and reduce should be kept as small as possible. The combiner merges and computes the key-value pairs with the same key on the map side; its computation rules must be consistent with the reduce logic, so the combiner can also be seen as a special Reducer.

The combiner operation must be set up by the developer in the program (using job.setCombinerClass(myCombine.class) to register a custom combiner).

The Combiner component performs a partial, local summary, aggregating inside each MapTask; the Reducer component performs the final, global summary.
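
Because the combiner's rules must match the reducer's, for an aggregation such as word count the reducer class itself is commonly reused as the combiner; a minimal sketch (WcReducer refers to the word-count reducer sketched earlier):

// Partial (map-side) aggregation: reuse the reducer logic as the combiner.
// This is only valid when the operation is commutative and associative (e.g. sums, counts).
job.setCombinerClass(WcReducer.class);   // local summary inside each MapTask
job.setReducerClass(WcReducer.class);    // final global summary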

Eleven, MapReduce Shuffle in detail

1. In MapReduce, how the data processed in the mapper stage is passed to the reducer stage is the most critical process in the MapReduce framework. This process is called Shuffle.

2. Shuffle: data shuffling (core mechanisms: data partitioning, sorting, partial aggregation, caching, fetching, and merge-sorting again).

3. Specifically: the processed result data output by a MapTask is distributed to the ReduceTasks according to the rules established by the Partitioner component, and during this distribution the data is partitioned and sorted by key.

An introduction to the MapReduce Shuffle process:

The original meaning of shuffle is to mix, as in shuffling cards: converting a set of data with some order into an irregular set of data, and the more random the better. Shuffle in MapReduce is more like the reverse of shuffling cards: converting a set of irregular data, as far as possible, into a set of data with a certain order.

Why does the MapReduce computation model need a Shuffle process? As we all know, the MapReduce computation model usually includes two major phases: Map is the mapping phase, responsible for filtering and distributing the data; Reduce is the reduction phase, responsible for merging and computing the data. The data for Reduce comes from Map: Map's output is Reduce's input, and Reduce obtains that data through Shuffle.

The whole process from Map output to Reduce input can broadly be called Shuffle. Shuffle spans both the Map end and the Reduce end: on the Map end it includes the Spill process, and on the Reduce end it includes the copy and sort processes.

Spill procedure:

The Spill process includes steps such as collect (output), sort, spill (overflow write), and merge.

Collect

Each Map task continuously outputs data in the form of <key, value> pairs into a ring-shaped data structure in memory. A ring structure is used so that the memory space can be used more efficiently and as much data as possible can be held in memory.

This data structure is actually a byte array called Kvbuffer which, as the name implies, holds key-value data; but it does not hold only the data, it also holds some index data. The area holding the index data has its own alias, Kvmeta: an IntBuffer (using the platform's native byte order) is wrapped over a region of Kvbuffer. The data area and the index-data area are two adjacent, non-overlapping regions of Kvbuffer, divided by a demarcation point. The demarcation point is not fixed; it is updated after every Spill. The initial demarcation point is 0; the data is stored growing upward, and the index data is stored growing downward.

Kvbuffer's data-storage pointer bufindex keeps plodding upward. For example, bufindex starts at 0; after a key of type Int is written, bufindex grows to 4; after a value of type Int is written, bufindex grows to 8.

The index is an index into the records held in Kvbuffer. It is a quadruple consisting of the start position of the value, the start position of the key, the partition value, and the length of the value, occupying four Int slots. Kvmeta's storage pointer Kvindex jumps down four "slots" at a time, and the quadruple is then filled in slot by slot going back up. For example, Kvindex's initial position is -4. After the first record is written, (Kvindex+0) holds the start position of the value, (Kvindex+1) holds the start position of the key, (Kvindex+2) holds the partition value, and (Kvindex+3) holds the length of the value; then Kvindex jumps to -8. After the second record and its index are written, Kvindex jumps to -12. A small sketch of this layout follows.
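
For illustration, here is a small sketch of the quadruple layout. The constant names mirror those used inside Hadoop's MapOutputBuffer, but treat the snippet as a simplified sketch rather than the actual source:

// Simplified sketch of the per-record index (the "quadruple") kept in Kvmeta.
static final int VALSTART  = 0;   // offset of the value's start position in Kvbuffer
static final int KEYSTART  = 1;   // offset of the key's start position in Kvbuffer
static final int PARTITION = 2;   // offset of the record's partition number
static final int VALLEN    = 3;   // offset of the value's length
static final int NMETA     = 4;   // four ints per record, i.e. 16 bytes of metadata

// After a record's bytes are appended at bufindex, its quadruple is filled in
// at the current kvindex, and kvindex then jumps down another NMETA slots:
//   kvmeta.put(kvindex + PARTITION, partition);
//   kvmeta.put(kvindex + KEYSTART,  keystart);
//   kvmeta.put(kvindex + VALSTART,  valstart);
//   kvmeta.put(kvindex + VALLEN,    vallen);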

Although the size of Kvbuffer can be set with a parameter, it is still only so big. As the data and the indexes keep piling up, there will come a day when Kvbuffer is not big enough. What then? Flush the data from memory to disk, then keep writing data into memory. The process of flushing the data in Kvbuffer to disk is called Spill, a very plain name: when the data in memory is full, it automatically spills over to the disk, which has far more space.

The condition that triggers a Spill, that is, how full Kvbuffer should get before spilling starts, deserves some care. If Kvbuffer is squeezed completely full, with not a single gap left, before the Spill starts, then the Map task has to wait for the Spill to finish and free up space before it can continue writing data. If instead the Spill starts when Kvbuffer is only filled to a certain level, say 80%, then the Map task can keep writing data while the Spill runs; if the Spill is fast enough, the Map may never need to worry about free space. Weighing the two, the latter is generally chosen. The corresponding properties are sketched below.
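
The buffer size and spill threshold mentioned above map to standard MapReduce properties; a small sketch of setting them from the driver (assuming job is the Job being configured; the values shown are the usual defaults):

Configuration conf = job.getConfiguration();
// Size of the in-memory ring buffer (Kvbuffer), default 100 MB.
conf.setInt("mapreduce.task.io.sort.mb", 100);
// Spill starts when the buffer reaches this fill level (default 0.80, i.e. 80%),
// so the Map task can keep writing while the Spill thread drains the buffer.
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);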

This important Spill process is handled by the Spill thread. Once the Spill thread receives the "command" from the Map task, it gets to work; the work it does is called SortAndSpill, so it is not just a Spill: there is also a somewhat debatable Sort before the Spill.

Sort:

The data in Kvbuffer is first sorted in ascending order by two keys: the partition value and the key. Only the index data is moved. The result of the sort is that the entries in Kvmeta are clustered together by partition, and within the same partition the records are ordered by key. A simplified sketch of this ordering follows.
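
Conceptually the ordering used by SortAndSpill can be sketched as a two-level comparison (a simplified illustration, not Hadoop's actual comparator, which works on raw bytes):

// Simplified illustration of the SortAndSpill ordering: ascending by partition
// value first, then by key within the same partition. Only the Kvmeta index
// entries are moved during the sort; the record bytes stay where they are.
static int compareRecords(int partitionA, String keyA, int partitionB, String keyB) {
    if (partitionA != partitionB) {
        return Integer.compare(partitionA, partitionB);
    }
    return keyA.compareTo(keyB);
}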

Spill:

The Spill thread creates a disk file for this Spill: it polls all the local directories looking for one with enough space, and once found, creates a file named something like "spill12.out" in it. Following the sorted Kvmeta, the Spill thread writes the data to this file partition by partition; once one partition's data is written, it moves sequentially on to the next partition, until all partitions have been traversed. The data corresponding to one partition in the file is also called a segment.

All the partitions' data is placed in this one file. Although it is stored sequentially, how do we know directly where a given partition's data starts in the file? The mighty index makes its appearance again. A triple records the index of a partition's data within the file: the start position, the raw data length, and the compressed data length; one triple per partition. These index entries are kept in memory; if they no longer fit in memory, the subsequent index entries are written to a disk file: again, all local directories are polled for one with enough space, and a file named something like "spill12.out.index" is created in it, storing not only the index data but also crc32 checksum data. (spill12.out.index is not necessarily created on disk; if it fits in memory, by default a 1 MB space, it stays in memory. Even when it is created on disk, it is not necessarily in the same directory as the spill12.out file.)

Each Spill generates at least one out file, and sometimes an index file as well; the Spill count is stamped into the file names, and the index files and data files correspond to each other one to one.

While the Spill thread is busily doing its SortAndSpill work, the Map task does not pause; it keeps outputting data as before. The Map still writes data into kvbuffer, so a question arises: if bufindex simply keeps growing upward and Kvindex keeps growing downward, should they keep their starting positions and carry on, or take another route? If the starting positions stay unchanged, bufindex and Kvindex will soon collide, and after a collision either restarting or moving memory around is troublesome, so that option is out. Instead, the Map takes the midpoint of the remaining free space in kvbuffer and sets it as the new demarcation point: the bufindex pointer moves to this demarcation point, Kvindex moves to the -16 position relative to it, and then both can harmoniously place data along their own tracks. When the Spill finishes and space is freed, they simply keep going without any further adjustment.

A Map task always writes its output data to disk; even if the output is so small that it fits entirely in memory, it is still flushed to disk at the end.

Merge

If a Map task produces a large amount of output, it may spill several times, generating many out files and index files spread across different disks. This is where the merge process that combines these files makes its entrance.

How does the merge process know where the generated Spill files are? It scans all the local directories for the Spill files and stores their paths in an array. And how does it know the Spill index information? Right, it also scans all the local directories for the index files and stores the index information in a list. Here we hit another puzzling point: why not simply keep this information in memory during the earlier Spill process, instead of adding this extra scanning step? Especially for the Spill index data, which was written to disk once memory ran out and now has to be read back from disk and loaded into memory again. The reason for this apparent redundancy is that by this time kvbuffer, the big memory consumer, is no longer in use and can be reclaimed, so there is memory available to hold this data. (For setups with plenty of memory, using memory to save these two I/O steps is worth considering.)

Then a file called file.out and a file called file.out.index are created for the merge process, to store the final output and its index.

The merged output is produced partition by partition. For a given partition, all of its index entries are looked up in the index list, and each one becomes a segment inserted into a segment list. In other words, the partition corresponds to a segment list that records, for every Spill file, the file name, start position, length, and so on of that partition's data.

Then all of the segments for this partition are merged, with the goal of producing a single segment. When the partition has many segments, the merge is done in batches: the first batch is taken from the segment list and arranged into a min-heap keyed by key; the smallest element is repeatedly taken from the heap and written to a temporary file, merging that batch into one temporary segment, which is added back to the segment list; then the second batch is taken from the list, merged into another temporary segment and added back; this repeats until only one batch of segments remains, which is merged into the final output file. A simple sketch of this kind of heap-driven merge follows.
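
The batch merge described above is essentially a min-heap driven k-way merge. A plain-Java sketch of the idea (not Hadoop's actual Merger implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the batch merge: each sorted segment contributes its current head
// to a min-heap; the smallest head is repeatedly popped into the merged output.
public class KWayMergeSketch {

    static List<String> merge(List<List<String>> segments) {
        int[] pos = new int[segments.size()];   // current read position in each segment
        // Heap of segment indices, ordered by the current head key of each segment.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                (i, j) -> segments.get(i).get(pos[i]).compareTo(segments.get(j).get(pos[j])));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) heap.add(i);
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int i = heap.poll();
            out.add(segments.get(i).get(pos[i]));   // emit the smallest current key
            pos[i]++;
            if (pos[i] < segments.get(i).size()) heap.add(i);   // segment not exhausted
        }
        return out;
    }
}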

The final index data is still written to the index file.

The Shuffle process on the Map side ends here.

Copy:

A Reduce task pulls the data it needs from each Map task over HTTP. Every node runs a long-lived HTTP server, and one of its services is to answer Reduce requests for Map data. When an HTTP request for MapOutput arrives, the HTTP server reads the portion of the corresponding Map output file that belongs to this Reduce and streams it over the network to the Reduce.

When a Reduce task pulls the data corresponding to one Map, it writes the data straight into memory if it fits. The Reduce pulls data from every Map, and each Map's data occupies one block in memory; when the memory occupied by the stored Map data reaches a certain level, an in-memory merge is started and the in-memory data is merged and written out to a single file on disk.

If a Map's data does not fit in memory, it is written straight to disk: a file is created in a local directory, and the data is read from the HTTP stream and written to disk using a 64 KB buffer. Pulling one Map's data creates one file; when the number of files reaches a certain threshold, a disk-file merge is started and these files are merged into a single file.

Some Maps' data is small enough to be kept in memory, while some is large enough to need the disk, so in the end the data pulled over by the Reduce task lives partly in memory and partly on disk, and a final global merge is performed over all of it. The main properties governing this copy/merge behaviour are sketched below.
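
For reference, the copy and merge behaviour described above is governed by standard MapReduce properties; a hedged sketch of setting them from the driver (assuming job is the Job being configured; the values shown are the usual defaults):

Configuration conf = job.getConfiguration();
// Number of parallel fetch threads the Reduce uses to copy map outputs (default 5).
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
// Fraction of the reduce heap used to buffer fetched map outputs (default 0.70).
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
// When the in-memory buffer is this full, the in-memory merge to disk starts (default 0.66).
conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
// Maximum number of files/segments merged in a single pass (default 10).
conf.setInt("mapreduce.task.io.sort.factor", 10);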

Merge Sort:

The merge used here is the same as the merge process used on the Map side. The Map output data is already sorted, and the merge performs one merge sort; the so-called sort process on the Reduce side is exactly this merging process. In general the Reduce copies and sorts at the same time, i.e., the copy and sort phases overlap rather than being completely separate.

The Shuffle process on the Reduce side ends here.


Origin www.cnblogs.com/zhangfuxiao/p/11374546.html