18. Hadoop optimization

       We have finally arrived at a friendlier topic. As usual, once a series reaches the optimization part it is nearly over, and that is true here: the Hadoop part is about to end. The Hadoop HA part will be explained after Zookeeper, since HA depends on Zookeeper. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" for the related series of articles~


1. Why MapReduce runs slowly

       First, let's look at what causes MapReduce to run slowly. The most basic causes are the machine's hardware: CPU, memory, disk, and network. Beyond that there are job-level causes, such as: (1) data skew; (2) an unreasonable number of Map or Reduce tasks; (3) Maps running so long that Reduces wait too long; (4) too many small files; (5) a large number of large files that cannot be split into blocks; (6) too many spills; (7) too many merges; and so on.

2. MapReduce optimization methods

       MapReduce optimization is mainly considered from six aspects: data input, the Map phase, the Reduce phase, IO transmission, data skew, and commonly used tuning parameters.

2.1 Data input

       1. Merge small files. Merge small files before executing the MapReduce job. A large number of small files produces a large number of Map tasks, and since loading each task takes time, the extra task-loading overhead makes MapReduce run slowly.

       2. Use CombineTextInputFormat as the input format to handle scenarios with large numbers of small files at the input end.
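In the driver, this is enabled with job.setInputFormatClass(CombineTextInputFormat.class) (or CombineTextInputFormat.setMaxInputSplitSize(job, ...)). The maximum combined split size can also be capped in the job configuration; a minimal sketch with illustrative values, capping combined splits at 128 MB:

```xml
<!-- Illustrative value, not a recommendation: cap combined splits at 128 MB -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value>
</property>
```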

2.2 Map stage

       1. Reduce the number of spills. By adjusting the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent parameters (io.sort.mb and io.sort.spill.percent in older releases), raise the memory threshold that triggers a spill, so there are fewer spills and less disk IO.

       2. Reduce the number of merges. By increasing the mapreduce.task.io.sort.factor parameter (io.sort.factor in older releases), more spill files are merged in each pass, so fewer merge passes are needed, shortening the MapReduce processing time.

       3. After the Map phase, when it does not affect the business logic, apply a Combiner first to reduce IO.
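In Hadoop the Combiner is enabled in the driver via job.setCombinerClass(...). To see why it cuts shuffle traffic, here is a minimal plain-Python simulation (illustrative only, not Hadoop code) comparing how many key/value records a word-count job would shuffle with and without map-side aggregation:

```python
from collections import Counter

def map_phase(lines):
    """Emit one (word, 1) pair per token, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Map-side combiner: pre-aggregate counts before the shuffle."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

lines = ["the quick brown fox", "the lazy dog", "the fox"]
raw = map_phase(lines)                # records shuffled without a combiner
combined = combine(map_phase(lines))  # records shuffled with a combiner

print(len(raw), len(combined))  # prints: 9 6
```

Here the combiner shrinks nine shuffled records down to six (one per distinct word on this mapper), and the reducer output is unchanged because word-count addition is associative.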

2.3 Reduce phase

       1. Set the numbers of Map and Reduce tasks reasonably. Neither can be set too low or too high. Too few tasks makes other tasks wait and prolongs processing time; too many causes resource competition between Map and Reduce tasks, leading to errors such as processing timeouts.

       2. Let Map and Reduce coexist. Adjust the mapreduce.job.reduce.slowstart.completedmaps parameter so that once the Maps have progressed far enough, the Reduces also start running, reducing Reduce waiting time.

       3. Avoid using Reduce where possible, because Reduce incurs a lot of network traffic when it is used to join data sets.

       4. Set the Reduce-side Buffer reasonably. By default, when the data in the Buffer reaches a threshold it is written to disk, and Reduce then reads all of its data back from disk; in other words, the Buffer and Reduce are not directly connected, and data passes through multiple disk writes and reads in between. Because of this drawback, a parameter can be configured so that part of the data in the Buffer is delivered to Reduce directly, reducing IO overhead: mapreduce.reduce.input.buffer.percent, which defaults to 0.0. When the value is greater than 0, the specified proportion of memory is reserved to hold buffered data for Reduce to use directly. Note that memory is then needed for the Buffer, for reading data, and for the Reduce computation itself, so the value must be tuned according to how the job actually runs.
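As a sketch, keeping 70% of reducer memory for buffered map outputs might look like the following fragment (the 0.7 here is an illustrative value, not a recommendation):

```xml
<!-- Illustrative: let Reduce read up to 70% of its memory directly
     from the buffer instead of spilling everything to disk first -->
<property>
  <name>mapreduce.reduce.input.buffer.percent</name>
  <value>0.7</value>
</property>
```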

2.4 IO transmission

       1. Use data compression to reduce network IO time, for example by installing the Snappy and LZO compression codecs.

       2. Use SequenceFile binary files.
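For point 1, compressing the intermediate map output is a common first step; a minimal sketch, assuming the Snappy codec is installed on the cluster:

```xml
<!-- Illustrative: compress intermediate map output with Snappy
     (the codec's native library must be installed on every node) -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```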

2.5 Data skew problem

       Data skew falls into two categories: data frequency skew and data size skew. Data frequency skew means the amount of data in some partitions is far larger than in others; data size skew means some records are far larger than average. The following methods can reduce data skew.

       1. Sampling and range partitioning. Partition boundary values can be preset from the result of sampling the original data.

       2. Custom partitioning. Write a custom partitioner based on background knowledge of the output keys.

       3. Combine. Using a Combiner can greatly reduce data skew, since its purpose is to aggregate and condense data wherever possible.

       4. Use Map Join and avoid Reduce Join where possible.
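Hadoop implements item 1 with InputSampler and TotalOrderPartitioner; the following plain-Python sketch (illustrative only, not Hadoop code) shows the mechanism, deriving partition boundary keys from a random sample so each reducer receives a similar share of keys:

```python
import random

def sample_boundaries(keys, num_partitions, sample_size=100, seed=42):
    """Pick partition boundary keys from a sorted random sample of the data."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # One boundary between each pair of adjacent partitions.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, boundaries):
    """Route a key to the first partition whose boundary exceeds it."""
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

keys = list(range(1000))
bounds = sample_boundaries(keys, num_partitions=4)
counts = [0, 0, 0, 0]
for k in keys:
    counts[partition(k, bounds)] += 1
print(counts)  # four roughly equal counts summing to 1000
```

Because the boundaries come from a sample of the real key distribution, skewed key ranges are split more finely than a naive hash or fixed-range scheme would split them.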

2.6 Commonly used parameter tuning

       1. The following parameters can be configured in your own application (mapred-default.xml).

- mapreduce.map.memory.mb: Upper limit of resources a MapTask may use (MB); default 1024. If a MapTask actually uses more than this, it is forcibly killed.
- mapreduce.reduce.memory.mb: Upper limit of resources a ReduceTask may use (MB); default 1024. If a ReduceTask actually uses more than this, it is forcibly killed.
- mapreduce.map.cpu.vcores: Maximum number of CPU cores each MapTask may use; default 1.
- mapreduce.reduce.cpu.vcores: Maximum number of CPU cores each ReduceTask may use; default 1.
- mapreduce.reduce.shuffle.parallelcopies: Number of parallel copiers each Reduce uses to fetch data from the Maps; default 5.
- mapreduce.reduce.shuffle.merge.percent: Percentage of buffer capacity at which buffered data starts being written to disk; default 0.66.
- mapreduce.reduce.shuffle.input.buffer.percent: Ratio of the shuffle Buffer size to available Reduce memory; default 0.7.
- mapreduce.reduce.input.buffer.percent: Percentage of memory used to keep Buffer data for direct use by Reduce; default 0.0.
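For example, a job whose reducers need more headroom than the defaults might override these in its configuration; an illustrative fragment (the values are assumptions for the example, not recommendations):

```xml
<!-- Illustrative overrides: give each ReduceTask 2 GB and 2 vcores -->
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
```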

       2. The following should be configured in the server configuration file before YARN starts (yarn-default.xml):

- yarn.scheduler.minimum-allocation-mb: Minimum memory allocated to an application Container; default 1024.
- yarn.scheduler.maximum-allocation-mb: Maximum memory allocated to an application Container; default 8192.
- yarn.scheduler.minimum-allocation-vcores: Minimum number of CPU cores a Container can request; default 1.
- yarn.scheduler.maximum-allocation-vcores: Maximum number of CPU cores a Container can request; default 32.
- yarn.nodemanager.resource.memory-mb: Maximum physical memory allocated to Containers on a node; default 8192.
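As an illustration, a NodeManager host reserving 16 GB for containers while keeping the per-container cap at the default might use a yarn-site.xml fragment like this (values are assumptions for the example):

```xml
<!-- Illustrative yarn-site.xml: 16 GB of this node is usable by Containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<!-- A single Container may still request at most 8 GB -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```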

       3. Key Shuffle performance parameters should be configured before YARN starts (mapred-default.xml):

- mapreduce.task.io.sort.mb: Size of the Shuffle ring buffer; default 100 MB.
- mapreduce.map.sort.spill.percent: Threshold at which the ring buffer spills to disk; default 80%.
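Enlarging the ring buffer is the usual way to reduce spill counts (Section 2.2, point 1); an illustrative fragment, assuming the MapTask heap has room for the larger buffer:

```xml
<!-- Illustrative: a 200 MB sort buffer means fewer spills per MapTask -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>
```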

       4. Fault-tolerance-related parameters (MapReduce performance optimization):

- mapreduce.map.maxattempts: Maximum number of attempts for each Map Task; once the attempts exceed this value, the Map Task is considered failed. Default 4.
- mapreduce.reduce.maxattempts: Maximum number of attempts for each Reduce Task; once the attempts exceed this value, the Reduce Task is considered failed. Default 4.
- mapreduce.task.timeout: Task timeout, a parameter that often needs to be set. It means: if a Task makes no progress within a certain time, neither reading new input data nor producing output, the Task is considered blocked, possibly stuck forever. To prevent a user program from blocking forever without exiting, a timeout (in milliseconds) is enforced; the default is 600000 (10 minutes). If your program takes a long time to process each input record (for example, it queries a database or pulls data over the network), increase this parameter. When it is too small, a common error message is "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster.".
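For instance, a job whose records each involve a slow external call might raise the timeout; an illustrative fragment (20 minutes here is an assumed example value):

```xml
<!-- Illustrative: raise the task timeout to 20 minutes for slow records -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>1200000</value>
</property>
```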

3. HDFS small file optimization

3.1 Drawbacks of small files

       Every file on HDFS requires an index entry on the NameNode, each roughly 150 bytes in size. When there are many small files, this produces a large number of index entries, which on the one hand consumes a great deal of NameNode memory, and on the other hand makes lookups slower because the index grows too large.

3.2 Small file solutions

       Small file optimization comes down to the following approaches: (1) at data collection time, merge small files or small batches of data into large files before uploading to HDFS; (2) before business processing, run a MapReduce job on HDFS to merge the small files; (3) during MapReduce processing, use CombineTextInputFormat to improve efficiency; (4) enable JVM reuse, setting mapreduce.job.jvm.numtasks to a value between 10 and 20.
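An illustrative fragment for option (4); note that this parameter comes from classic MapReduce, and under YARN (MRv2) it may have no effect in some versions:

```xml
<!-- Illustrative: reuse each task JVM for up to 10 tasks -->
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
</property>
```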

 

       That brings this article to an end. If you ran into any problems along the way, feel free to leave a comment and let me know what you encountered~


Origin: blog.csdn.net/gdkyxy2013/article/details/108363378