Sixteen, MapReduce Tuning

[TOC]

First, reasons why MapReduce runs slowly

1) Computer performance
CPU, memory, disk health, and network.
Where possible, configure the file system so that file access does not update atime (for example, mount with noatime).

2) I/O operation issues
(1) Data skew
(2) Unreasonable numbers of map and reduce tasks
(3) Maps running too long, causing reduces to wait too long
(4) Too many small files
(5) Large numbers of oversized files that cannot be split into blocks
(6) Too many spills
(7) Too many merges, and so on.

Second, optimization approaches

MapReduce optimization is mainly considered from several aspects: data input, the Map stage, the Reduce stage, I/O transfer, and the data skew problem.

1, the data input stage

1) Merge small files: merge small files before the MR task runs. A large number of small files produces a large number of map tasks, which increases the number of task loads, and loading tasks is time-consuming, so the MR job runs more slowly.
2) Use CombineTextInputFormat as the input format to handle the scenario of many small files on the input side (a driver sketch follows).
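
A minimal driver sketch of point 2), assuming a standard Hadoop 2.x/3.x MapReduce job; the class name and split sizes are illustrative, not taken from the original post:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(CombineTextInputFormat.class);      // pack many small files into fewer splits
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);  // about 4 MB per split (illustrative)
        CombineTextInputFormat.setMinInputSplitSize(job, 2097152);  // about 2 MB per split (illustrative)
        // Mapper/Reducer classes and input/output paths would be set here as usual.
    }
}
```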

2, Map stage

1) Reduce the number of spills: by tuning the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent parameters, raise the memory ceiling that triggers a spill; fewer spills mean less frequent disk I/O.
2) Reduce the number of merges: by tuning the mapreduce.task.io.sort.factor parameter, increase the number of files opened at the same time during a merge; fewer merge rounds shorten MR processing time. In essence this increases the number of open file handles.
3) After the map, apply a combiner where it does not affect the business logic, to reduce I/O (see the sketch below).
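
A driver sketch of points 1) to 3); the values are illustrative, and WordCountReducer is a hypothetical reducer class reused as a combiner:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapStageTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "200");          // larger ring buffer, fewer spills
        conf.set("mapreduce.map.sort.spill.percent", "0.90");  // spill later than the 0.80 default
        conf.set("mapreduce.task.io.sort.factor", "50");       // merge more streams per round, fewer merges

        Job job = Job.getInstance(conf, "map-stage-tuning");
        job.setCombinerClass(WordCountReducer.class);          // hypothetical Reducer reused as a combiner
    }
}
```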

3, Reduce stage

1) Set the numbers of map and reduce tasks reasonably: they can be neither too low nor too high. Too few tasks lead to waiting and longer processing time; too many lead to map and reduce tasks competing for resources, causing processing timeouts and other errors.
2) Let map and reduce coexist: adjust the mapreduce.job.reduce.slowstart.completedmaps parameter (default 0.05), which means reduce tasks start only after at least 5% of the map tasks have completed. Once the maps have made some progress, the reduces begin running, which shortens the reduce waiting time.
3) Avoid using reduce where possible, because reduce causes a large amount of network traffic when connecting data sets, which costs time.
4) Set the reduce-side buffer reasonably: by default, when the data reaches a threshold, the data in the buffer is written to disk, and reduce then reads all the data back from disk. In other words, the buffer and reduce are not directly connected; there is a write-to-disk then read-from-disk step in between. Since this has drawbacks, a parameter can be configured so that a portion of the data in the buffer is delivered directly to reduce, reducing the I/O overhead: mapreduce.reduce.input.buffer.percent, default 0.0. When the value is greater than 0, the specified proportion of the buffer memory is kept so that reduce reads data from it directly. Note that memory is then needed for the buffer, for reading the data, and for the reduce computation itself, so it should be adjusted according to how the job actually runs (a driver sketch follows).
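
A driver sketch of points 1), 2) and 4); the values are illustrative and must be sized against the job's actual memory needs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceStageTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.reduce.slowstart.completedmaps", "0.8");  // reduces start after 80% of maps finish
        conf.set("mapreduce.reduce.input.buffer.percent", "0.2");         // keep 20% of heap as a direct reduce input buffer

        Job job = Job.getInstance(conf, "reduce-stage-tuning");
        job.setNumReduceTasks(4);                                         // choose the reduce count deliberately
    }
}
```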

4, I/O transfer stage

1) Use data compression to reduce network I/O time. Install the Snappy and LZO compression codecs.
2) Use SequenceFile binary files; they are faster to read and write than plain text (a driver sketch follows).
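
A driver sketch covering both points, assuming the Snappy native libraries are installed on the cluster; the class and job names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IoTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output to shrink shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "io-tuning");
        // Write the final output as a compressed binary SequenceFile.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    }
}
```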

5, data skew problem

1) Data skew phenomena
Data frequency skew: the amount of data in one region is much larger than in other regions.
Data size skew: some records are much larger than the average.

2) Methods for reducing data skew

Data skew in MapReduce usually occurs in the reduce stage: the map output is partitioned and each partition is handled by one reduce task, so if the amounts of data per partition differ greatly, the reduce processing times will certainly differ as well, producing a bucket effect where the slowest partition limits the whole job. Data skew therefore needs to be avoided.
Method 1: Sampling and range partitioning
Partition boundary values can be preset based on the results of sampling the original data. (Hive has this built in.)

Method 2: Custom partitioning
Custom partitioning is based on background knowledge of the output keys. For example, if the map output keys are words from a book, and a few specialized words are far more frequent than the rest, a custom partitioner can send that specialized vocabulary to fixed reduce instances while sending the others to the remaining reduce instances (see the sketch below).
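
A minimal sketch of such a partitioner; the key/value types and the "hot" words are hypothetical, and in practice the hot keys would be identified by sampling or domain knowledge:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route a few known hot keys to a dedicated reducer and spread the rest
// across the remaining reducers by hash.
public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        String word = key.toString();
        if (word.equals("the") || word.equals("and")) {
            return 0;  // fixed reducer for the skewed keys
        }
        // Remaining keys hash over reducers 1 .. numPartitions-1.
        return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
// Registered in the driver with: job.setPartitionerClass(HotKeyPartitioner.class);
```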

Method 3: Combine
Using a combiner can greatly reduce data skew. Where possible, the purpose of the combiner is to aggregate and shrink the data.

Method 4: Use a Map Join and avoid a Reduce Join. Without a reduce phase, reduce-side data skew cannot occur (a mapper sketch follows).
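
A minimal map-side join sketch of Method 4. The file names, the tab-separated record layout, and the join key position are assumptions for illustration: the small table is shipped to every node via the distributed cache, loaded once in setup(), joined in map(), and the job runs with zero reducers.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> smallTable = new HashMap<>();
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the driver called job.addCacheFile(new URI("/dim/small_table.txt")),
        // so the file is available in the task's working directory under its base name.
        try (BufferedReader reader = new BufferedReader(new FileReader("small_table.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                smallTable.put(fields[0], fields[1]);   // join key -> dimension value
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String dim = smallTable.getOrDefault(fields[0], "NULL");
        outKey.set(value.toString() + "\t" + dim);      // append the joined column
        context.write(outKey, NullWritable.get());
    }
}
// Driver side: job.setMapperClass(MapJoinMapper.class); job.setNumReduceTasks(0);
```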

Third, common tuning parameters

1, resource-related parameters

(1) The following parameters can be configured directly in the MR program through the job's Configuration object (a driver sketch follows the table).

| Parameter | Explanation |
| --- | --- |
| mapreduce.map.memory.mb | Upper limit on the resources (in MB) a single Map Task may use; default 1024. If a Map Task actually uses more than this, it is forcibly killed. |
| mapreduce.reduce.memory.mb | Upper limit on the resources (in MB) a single Reduce Task may use; default 1024. If a Reduce Task actually uses more than this, it is forcibly killed. |
| mapreduce.map.cpu.vcores | Maximum number of CPU cores each Map Task may use; default 1. |
| mapreduce.reduce.cpu.vcores | Maximum number of CPU cores each Reduce Task may use; default 1. |
| mapreduce.reduce.shuffle.parallelcopies | Number of parallel fetchers each reduce uses to copy map output; default 5. |
| mapreduce.reduce.shuffle.merge.percent | During the reduce-side merge, the buffer usage percentage at which data starts being written to disk; default 0.66. |
| mapreduce.reduce.shuffle.input.buffer.percent | Proportion of available memory used for the reduce-side shuffle buffer; default 0.7. |
| mapreduce.reduce.input.buffer.percent | Percentage of memory retained so that buffered data is read directly by reduce instead of being written to disk first; default 0.0. |
| mapreduce.map.speculative | Whether speculative execution is enabled for Map Tasks (multiple attempts of the same task may run concurrently). |
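
A driver sketch showing how some of the parameters above can be set programmatically on the job's Configuration object; the values and class name are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTunedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);               // raise the per-map memory ceiling
        conf.setInt("mapreduce.reduce.memory.mb", 4096);            // raise the per-reduce memory ceiling
        conf.setInt("mapreduce.map.cpu.vcores", 2);
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10); // more parallel fetchers per reduce
        conf.setBoolean("mapreduce.map.speculative", true);

        Job job = Job.getInstance(conf, "resource-tuned-job");
        // Mapper/Reducer classes, formats, and paths would be set on the job as usual.
    }
}
```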

(2) YARN parameters (should be configured before YARN starts)

| Parameter | Default | Explanation |
| --- | --- | --- |
| yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory that can be allocated to an application container |
| yarn.scheduler.maximum-allocation-mb | 8192 | Maximum memory that can be allocated to an application container |
| yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of CPU cores a container can request |
| yarn.scheduler.maximum-allocation-vcores | 32 | Maximum number of CPU cores a container can request |
| yarn.nodemanager.resource.memory-mb | 8192 | Maximum physical memory that can be allocated to containers on a NodeManager |

(3) Shuffle parameters, a key factor in performance optimization; they should be set in the configuration files before the cluster starts

| Parameter | Default | Explanation |
| --- | --- | --- |
| mapreduce.task.io.sort.mb | 100 | Size of the shuffle circular buffer, default 100 MB |
| mapreduce.map.sort.spill.percent | 0.8 | Spill threshold of the circular buffer, default 80% |
| mapreduce.task.io.sort.factor | 10 | Number of files opened at the same time during a merge; increasing it reduces the number of merge rounds |

2, fault tolerance parameters

| Parameter | Explanation |
| --- | --- |
| mapreduce.map.maxattempts | Maximum number of retries for each Map Task; once this value is exceeded, the Map Task is considered to have failed. Default: 4. |
| mapreduce.reduce.maxattempts | Maximum number of retries for each Reduce Task; once this value is exceeded, the Reduce Task is considered to have failed. Default: 4. |
| mapreduce.task.timeout | Task timeout, a parameter that often needs to be set. It means: if a task makes no progress within a certain time, i.e. it neither reads new input data nor produces output, the task is considered blocked, possibly stuck forever. To prevent a user program from blocking forever without exiting, a timeout (in milliseconds) is enforced; the default is 600000 (10 minutes). If each round of input processing in your program takes a long time (for example, it accesses a database or pulls data over the network), it is recommended to increase this parameter. When the value is too small, the error that often appears is "AttemptID: attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster." |
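
A minimal sketch of the fault-tolerance parameters above, for a job whose records involve slow external lookups; the values and class name are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 6);        // two extra retries per Map Task
        conf.setInt("mapreduce.reduce.maxattempts", 6);     // two extra retries per Reduce Task
        conf.setLong("mapreduce.task.timeout", 1200000L);   // 20 minutes, in milliseconds
        return conf;
    }
}
```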

Fourth, small file optimization

Every file on HDFS has an index entry on the NameNode, and each index entry is roughly 150 bytes. When there are many small files, many index entries are produced; on the one hand they take up a lot of NameNode memory, and on the other hand, once the index becomes too large, index lookups slow down.

1, Hadoop archive

hadoop archive is an archiving tool that efficiently places small files into HDFS blocks. It can pack multiple small files into a single HAR file, which reduces NameNode memory usage.

2, SequenceFile

A SequenceFile consists of a series of binary key/value pairs. If the key is the file name and the value is the file content, a large batch of small files can be merged into one large file (see the sketch below).
When merging multiple files into one file, an index is needed that records each file's start offset, length, and other information within the combined file.

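A minimal sketch of packing many small files into one SequenceFile, with the file name as key and the file content as value; the HDFS paths and class name are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/packed.seq");           // hypothetical output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Append each small file as one key/value record.
            for (FileStatus status : fs.listStatus(new Path("/user/demo/small_files"))) {
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}
```
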
3, CombineFileInputFormat

CombineFileInputFormat extends FileInputFormat, and its concrete subclass is CombineTextInputFormat. It is a newer InputFormat that merges multiple files into a single split, and it also takes the data's storage locations into account.

4, Enable JVM reuse

For jobs with a large number of small files, enabling JVM reuse can cut the running time by up to 45%.
Understanding JVM reuse: normally each map runs in its own JVM; with reuse, after one map finishes on a JVM, that JVM goes on to run other maps.
Concrete setting: set mapreduce.job.jvm.numtasks to a value between 10 and 20 (a sketch follows).
However, JVM reuse has a drawback: once a JVM has been used by a map or reduce task of a job, other MapReduce jobs cannot use that JVM until the current job has finished entirely. In other words, even when the JVM is idle it cannot serve other jobs, which wastes resources to a certain extent.
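
A minimal sketch of the setting above; the value 10 is at the low end of the 10 to 20 range the text suggests, and whether the property takes effect depends on the MapReduce runtime version:

```java
import org.apache.hadoop.conf.Configuration;

public class JvmReuseSettings {
    public static Configuration withJvmReuse() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.job.jvm.numtasks", 10);  // each JVM may run up to 10 tasks before exiting
        return conf;
    }
}
```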


Origin blog.51cto.com/kinglab/2445682