MapReduce optimization (overall: how to tune configuration; in detail: how to tune the map side and the reduce side)

1. Configuration tuning

The general tuning principle is to give the shuffle process as much memory as possible. On the map side, the best performance comes from avoiding multiple spills to disk (configure the io.sort.* properties, especially io.sort.mb). On the reduce side, the best performance comes when all the intermediate data can reside in memory; by default this does not happen, because the memory is generally reserved entirely for the reduce function (to change this, configure mapred.inmem.merge.threshold and mapred.job.reduce.input.buffer.percent).

Tuning the shuffle process to suit the circumstances of a particular job can noticeably improve MapReduce performance.
The general rule is to allocate as much memory as possible to the shuffle, while still making sure the map and reduce functions have enough memory to run the business logic. Therefore, when implementing a Mapper or Reducer, minimize memory use, for example by avoiding unbounded accumulation in a Map.
The memory of the JVMs that run map and reduce tasks is set by the mapred.child.java.opts property and should be set as large as practical. Container memory is set by mapreduce.map.memory.mb and mapreduce.reduce.memory.mb; the default for both is 1024 MB.
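When raising the JVM heap, it has to stay inside the container limit. A common rule of thumb (an assumption, not an official Hadoop formula) is to keep the heap at roughly 80% of the container size, leaving headroom for off-heap and native memory; a quick sanity check:

```python
def heap_fits(container_mb, xmx_mb, headroom=0.8):
    """Rule-of-thumb check (not an official Hadoop formula): keep the
    JVM heap (-Xmx) at or below ~80% of the YARN container size so
    off-heap/native memory fits; otherwise the container risks being
    killed for exceeding its memory limit."""
    return xmx_mb <= container_mb * headroom

# With the default 1024 MB container, -Xmx800m leaves headroom,
# while -Xmx900m does not.
print(heap_fits(1024, 800))  # True
print(heap_fits(1024, 900))  # False
```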
You can improve the efficiency of sorting and of writing the cache to disk in the following ways:

Adjust the size of mapreduce.task.io.sort.mb to avoid spills, or at least reduce their number. When adjusting this parameter, it is best to also check the JVM heap size of the map task and increase the heap when necessary.
Increase the value of mapreduce.task.io.sort.factor to around 100 (the default is 10), which speeds up the merge phase and reduces disk accesses.
Provide a more efficient custom serializer for the key/value pairs; the less space the serialized data takes, the higher the cache utilization.
Provide a more efficient combiner so that the map task's output is aggregated more effectively.
Provide a more efficient key comparator and value grouping comparator.
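To see how mapreduce.task.io.sort.mb interacts with the spill count, here is a back-of-the-envelope sketch. The 0.80 threshold mirrors the default spill percentage; real spill behavior also depends on record-metadata overhead, so treat this as a rough estimate only:

```python
import math

def estimate_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough estimate of spill files from one map task: a spill is
    triggered when the sort buffer fills to spill_percent of io_sort_mb.
    Ignores record-metadata overhead, so this is a lower bound."""
    threshold_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / threshold_mb))

# 512 MB of map output with the default 100 MB buffer spills ~7 times;
# a 640 MB buffer (512 / 0.8) keeps it to a single spill file.
print(estimate_spills(512))                  # 7
print(estimate_spills(512, io_sort_mb=640))  # 1
```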

A job's output depends on the number of its reduce tasks. Here are some optimization tips for the output:

Compress the output to save storage space and to improve HDFS write throughput.
Avoid writing out-of-band side files as output of reduce tasks.
Depending on how consumers will use the job's output files, a compression format that supports splitting may be appropriate.
Write larger files to HDFS with a larger block size; this helps reduce the number of map tasks in jobs that later read the output.
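The last tip follows from how input splits are derived: by default there is one map task per HDFS block, so a larger block size means fewer map tasks for the same data volume. A small illustration (assuming splits equal blocks, which is the default but can be overridden):

```python
import math

def num_map_tasks(input_size_mb, block_size_mb):
    # Default MapReduce behavior: one map task per HDFS block of input.
    return math.ceil(input_size_mb / block_size_mb)

# 10 GB of input: 128 MB blocks -> 80 map tasks, 256 MB blocks -> 40.
print(num_map_tasks(10240, 128))  # 80
print(num_map_tasks(10240, 256))  # 40
```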

2. Map-side / reduce-side tuning

General Optimization
Hadoop uses a 4 KB I/O buffer by default, which is small; the buffer size can be raised via io.file.buffer.size.

Map-side optimization
Avoiding multiple spill files gives the best performance; a single spill file is ideal. Estimate the size of the map output and set the mapreduce.task.io.sort.* properties so that the number of spill files is as small as possible, for example by making mapreduce.task.io.sort.mb as large as feasible.
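Turning that advice around: given an estimated map output size, the smallest buffer that keeps a task at a single spill can be computed directly (again assuming the default 0.80 spill threshold; actual needs are somewhat higher because of record metadata):

```python
import math

def required_sort_mb(map_output_mb, spill_percent=0.80):
    """Smallest io.sort.mb that keeps a map task at one spill file:
    the whole serialized output must fit under the spill threshold.
    Assumes the default 0.80 threshold and ignores metadata overhead."""
    return math.ceil(map_output_mb / spill_percent)

# A map task emitting ~200 MB needs io.sort.mb >= 250 to stay at one spill.
print(required_sort_mb(200))  # 250
```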

Reduce-side optimization
If all the intermediate data can be held in memory, the reduce side achieves its best performance. Normally this does not happen, since the memory is generally reserved for the reduce function; but if the reduce function has modest memory requirements, setting mapreduce.reduce.merge.inmem.threshold (the number of map output files that triggers a merge) to 0 and mapreduce.reduce.input.buffer.percent (the proportion of the heap used to hold map output) to 1.0 can yield very good performance. In the TB-scale sort benchmarks, Hadoop won precisely by keeping the intermediate reduce data entirely in memory.
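The two properties above can be passed per job as -D overrides on the command line. A tiny helper to assemble them (the helper function itself is hypothetical; the property names and values are the ones from the text):

```python
def in_memory_reduce_opts():
    """Hypothetical helper: build -D overrides for keeping reduce-side
    map outputs in memory, using the two properties discussed above."""
    opts = {
        "mapreduce.reduce.input.buffer.percent": 1.0,  # whole heap may hold map output
        "mapreduce.reduce.merge.inmem.threshold": 0,   # disable file-count merge trigger
    }
    return " ".join(f"-D{k}={v}" for k, v in sorted(opts.items()))

print(in_memory_reduce_opts())
```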




Origin blog.csdn.net/CZXY18ji/article/details/103078745