Big Data Optimization (II): MapReduce Optimization

Reasons why MapReduce runs slowly (☆☆☆☆☆)

I. The efficiency bottlenecks of a MapReduce program fall into two areas:

1) Machine performance
CPU, memory, disk health, network.
2) I/O operations
(1) Data skew
(2) Unreasonable numbers of map and reduce tasks
(3) Reduces waiting too long for maps
(4) Too many small files
(5) Large files that cannot be split into blocks
(6) Spilling too often
(7) Too many merge passes

II. MapReduce optimization methods (☆☆☆☆☆)

1) Data input
(1) Merge small files: merge small files before the MR job runs. A large number of small files produces a large number of map tasks, and since loading each map task is itself time-consuming, the extra task loading makes the MR job run slower.
(2) Use CombineTextInputFormat as the input format to handle the many-small-files scenario on the input side (see the driver sketch below).
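A minimal driver sketch for point (2), assuming the Hadoop 2 mapreduce API; the class name, the input/output paths, and the 128 MB split cap are illustrative, and the mapper/reducer setup is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-file-job");
        job.setJarByClass(SmallFileJobDriver.class);

        // Pack many small files into a few splits instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (value is illustrative).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

        // Mapper/reducer and output key/value classes would be set here as usual (omitted).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}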
2) Map stage
(1) Reduce the number of spills: tune io.sort.mb and sort.spill.percent to raise the memory limit at which a spill is triggered, so that fewer spills occur and disk I/O drops (see the parameter sketch after this list).
(2) Reduce the number of merges: tune io.sort.factor to raise the number of files merged in one pass, so that fewer merge passes are needed and the MR job finishes sooner.
(3) Where the logic allows, run a Combiner after the map to reduce I/O.
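A hedged sketch of how the spill and merge knobs above might be set programmatically, using the Hadoop 2 property names that correspond to io.sort.mb, sort.spill.percent and io.sort.factor; the values are illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapSideTuning {
    public static void apply(Job job) {
        Configuration conf = job.getConfiguration();
        // (1) Bigger sort buffer and higher spill threshold => fewer spills, less disk I/O.
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // sort buffer in MB, default 100
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill threshold, default 0.80
        // (2) Merge more spill files per pass => fewer merge rounds.
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // files merged at once, default 10
        // (3) A Combiner (when the aggregation allows it) shrinks map output:
        // job.setCombinerClass(MyCombiner.class);  // MyCombiner is a placeholder name
    }
}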
3) Reduce stage
(1) Set the numbers of map and reduce tasks reasonably: neither too few nor too many. Too few makes tasks queue up and lengthens processing time; too many makes map and reduce tasks compete for resources, causing errors such as processing timeouts.
(2) Let map and reduce run concurrently: tune slowstart.completedmaps so that reduces start once the maps have progressed to a certain point, cutting the reduces' waiting time.
(3) Avoid using reduce where possible, because the network connections reduce opens to fetch data consume a large amount of resources.
(4) Set the reduce-side buffer reasonably. By default, when the data in the buffer reaches a threshold it is written to disk, and reduce then reads all of its data back from disk; in other words, the buffer is not handed to reduce directly, and there is an extra disk-write / disk-read round trip in between. Because of this drawback, a parameter can be set so that part of the data in the buffer is delivered to reduce directly, cutting the I/O overhead: mapred.job.reduce.input.buffer.percent, default 0.0. When the value is greater than 0, the specified proportion of the buffer memory is kept for data that reduce reads directly. Note that memory is then needed for the buffer, for reading data, and for the reduce computation itself, so the value has to be adjusted to how the job actually runs. (A parameter sketch follows this list.)
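A hedged sketch of points (2)-(4), using the Hadoop 2 property names I believe correspond to slowstart.completedmaps and mapred.job.reduce.input.buffer.percent; the values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceSideTuning {
    public static void apply(Job job) {
        Configuration conf = job.getConfiguration();
        // (2) Start reduces once 50% of the maps have finished (illustrative value; default 0.05).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f);
        // (4) Keep 70% of the reduce-side buffer in memory for reduce to read directly,
        //     instead of spilling everything to disk first (default 0.0).
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.7f);
        // (3) For jobs that do not need a reduce at all, a map-only job avoids the shuffle:
        // job.setNumReduceTasks(0);
    }
}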
4) I/O transfer
(1) Use data compression to cut the time spent on network I/O: install the Snappy and LZO compression codecs.
(2) Use the binary SequenceFile format (see the sketch below).
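A sketch of both points, assuming the Snappy native libraries are installed on the cluster; the property names and calls are from the standard Hadoop 2 API:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionSetup {
    public static void apply(Job job) {
        // (1) Compress intermediate map output to cut shuffle network I/O.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // (2) Write the final output as a block-compressed binary SequenceFile.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
}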
5) Data skew
(1) What data skew looks like
Data frequency skew: the amount of data in one region is far larger than in the others. Data size skew: some records are far larger than the average record size.
(2) How to detect skewed data
Add code to the reduce method that records detailed information about the map output keys, for example:

// Snippet from inside a Reducer that uses the old org.apache.hadoop.mapred API
// (JobConf, OutputCollector, Reporter); the commons-logging Log field is assumed.
public static final String MAX_VALUES = "skew.maxvalues";
private static final Log log = LogFactory.getLog("SkewAnalysis");
private int maxValueThreshold;

@Override
public void configure(JobConf job) {
    // Threshold above which a key is treated as skewed; configurable, default 100.
    maxValueThreshold = job.getInt(MAX_VALUES, 100);
}

@Override
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                   Reporter reporter) throws IOException {
    int i = 0;
    while (values.hasNext()) {
        values.next();
        i++;
    }
    // Log keys whose value count exceeds the threshold -- these are the skewed keys.
    if (i > maxValueThreshold) {
        log.info("Received " + i + " values for key " + key);
    }
}
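With this in place, any key whose value count exceeds the threshold set through the skew.maxvalues property (100 by default in configure above) shows up in the reducer logs, pointing directly at the hot keys behind the skew.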

(3) Methods to reduce data skew
Method 1: Sampling and range partitioning
       Partition boundary values can be preset based on the result set obtained by sampling the original data.
Method 2: Custom partitioning
       An alternative to sampling-based range partitioning is custom partitioning based on background knowledge of the output keys. For example, if the map output keys are the words of a book, most of them are bound to be stopwords. A custom partitioner can send this fixed set of stopwords to one subset of the reduce instances and spread the remaining words over the other reduce instances (see the partitioner sketch after this list).
Method 3: Combine
       Using a Combiner extensively can greatly reduce both data frequency skew and data size skew. Wherever possible, the goal of the Combiner is to aggregate and shrink the data.
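A minimal sketch of Method 2, assuming the Hadoop 2 mapreduce API and a word-count-style job with Text keys; the class name, the hard-coded stopword set, and the choice of reserving partition 0 for stopwords are all illustrative:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StopwordPartitioner extends Partitioner<Text, IntWritable> {

    // Illustrative stopword list; in practice it would be loaded from a file.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in"));

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;
        }
        // Send the heavy, fixed set of stopwords to a dedicated reducer (partition 0)
        // and spread every other word across the remaining reducers.
        if (STOPWORDS.contains(key.toString().toLowerCase())) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

It would be registered on the job with job.setPartitionerClass(StopwordPartitioner.class), with the number of reduce tasks set to at least 2.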

6) common parameter tuning
(1) resource-related parameters
       (a) The following parameters are in the user's own applications mr configuration can take effect (mapred-default.xml)
Here Insert Picture Description       (B) before the yarn starts to be disposed in the server configuration file to take effect (yarn-default.xml)
Here Insert Picture Description       key parameters shuffle performance optimization (c), shall be configured before yarn a good start (mapred-default.xml)
Here Insert Picture Description
(2) Fault-tolerant parameters (mapreduce performance optimization)
Here Insert Picture Description
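Since the parameter tables above survive only as images, here is a hedged sketch of a handful of commonly cited resource and fault-tolerance properties with their standard Hadoop 2 names; the values are illustrative and do not reproduce the original tables:

import org.apache.hadoop.conf.Configuration;

public class ResourceTuningSketch {
    public static void apply(Configuration conf) {
        // (a) Per-container resources, effective in the job's own configuration.
        conf.setInt("mapreduce.map.memory.mb", 2048);     // memory per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);  // memory per reduce container
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        // (b) Cluster-side limits such as yarn.nodemanager.resource.memory-mb and
        //     yarn.scheduler.maximum-allocation-mb live in yarn-site.xml and must be
        //     set before YARN starts; setting them here would have no effect.
        // (2) Fault tolerance.
        conf.setInt("mapreduce.map.maxattempts", 4);      // attempts before the task fails for good
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        conf.setLong("mapreduce.task.timeout", 600000L);  // ms without progress before a task is killed
    }
}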

Source: blog.csdn.net/BeiisBei/article/details/104601853