Hadoop small file optimization

1. HDFS small file impact

(1) They burden the NameNode, because file metadata is held in the NameNode's memory: as a common rule of thumb, every file, directory, and block costs roughly 150 bytes of heap, so tens of millions of small files consume gigabytes of NameNode memory.

(2) They inflate the task count of the compute engine: by default each small file becomes its own input split, and each split spawns a Map task.

2. Handling small files on the input side:

(1) Combine small files: archive them with Hadoop Archives (HAR), or write a custom InputFormat that packs the small files into SequenceFile files.

(2) Use CombineFileInputFormat as the input format to handle large numbers of small files on the input side.

Hadoop itself ships with CombineFileInputFormat, which merges multiple small files into a single input split processed by one map task, cutting the number of unnecessary map tasks.
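For example, a driver can switch to CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat that ships with Hadoop) and cap the combined split size; the class and API are standard, while the 128 MB cap below is an illustrative value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        // One split (hence one map task) now covers many small files.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (value in bytes).
        CombineTextInputFormat.setMaxInputSplitSize(job, 134217728L);
        // Mapper/reducer classes and input/output paths omitted.
    }
}
```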

(3) For jobs over large numbers of small files, JVM reuse can be enabled.

Many small files mean many map tasks, and enabling JVM reuse saves much of the time otherwise spent creating a fresh JVM for each task.
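A minimal configuration sketch; the mapreduce.job.jvm.numtasks name is assumed to be the MRv2 alias of the old mapred.job.reuse.jvm.num.tasks property, and on YARN clusters uber mode is the closer equivalent:

```java
import org.apache.hadoop.conf.Configuration;

public class JvmReuseConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Run up to 10 tasks per JVM instead of 1 (-1 = unlimited); property
        // name assumed from the deprecated mapred.job.reuse.jvm.num.tasks.
        conf.setInt("mapreduce.job.jvm.numtasks", 10);
        // On YARN, uber mode runs all tasks of a small job inside the AM's JVM.
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
    }
}
```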

3. Map stage

(1) Increase the size of the ring (sort) buffer, e.g. from the default 100 MB to 200 MB.

(2) Raise the buffer's spill threshold, e.g. from the default 80% to 90%.

(1) and (2) improve performance indirectly, by reducing the number of spill files that later have to be merged.

(3) Cut the number of merge passes over spilled files by raising the merge factor, e.g. from the default 10 files per pass to 20.
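A sketch of settings (1)-(3) using the standard MRv2 property names; the values mirror the examples above and are illustrative, not universal recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class MapSideBufferTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // (1) Sort (ring) buffer: default 100 MB -> 200 MB.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // (2) Spill threshold: default 80% -> 90% of the buffer.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        // (3) Merge factor: combine 20 spill files per pass instead of 10,
        //     halving the number of merge rounds.
        conf.setInt("mapreduce.task.io.sort.factor", 20);
    }
}
```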

(4) Where the actual business logic allows it, use a Combiner to pre-merge map output and reduce shuffle I/O.

Detailed explanation: an alternative with the same effect is to run two jobs, where the first job performs a Combiner-like merge and the second job computes over the first job's results.
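Wiring in a Combiner is a one-line Job setting; the sketch below reuses a sum-style reducer, which is safe as a combiner because addition is associative and commutative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSetup {
    // Sums counts per key; because the operation is associative and
    // commutative, it can run both map-side (combiner) and reduce-side.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "with-combiner");
        job.setCombinerClass(IntSumReducer.class); // pre-merge map output locally
        job.setReducerClass(IntSumReducer.class);
        // Mapper, key/value classes, and paths omitted.
    }
}
```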

4. Reduce phase

(1) Set the numbers of Map and Reduce tasks sensibly: neither too few nor too many. Too few makes tasks queue and prolongs processing; too many makes Map and Reduce tasks compete for resources, causing errors such as processing timeouts.

(2) Let Map and Reduce coexist: tune the mapreduce.job.reduce.slowstart.completedmaps parameter so that once the maps have progressed far enough, the reducers start running too, cutting the time Reduce spends waiting.
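A sketch combining (1) and (2); the reducer count and the 50% slow-start threshold are illustrative values to be sized against the actual data volume and cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceStartTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // (2) Start scheduling reducers once 50% of maps have finished
        //     (default 0.05), overlapping the shuffle with the map phase
        //     without tying up reduce slots too early.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f);
        Job job = Job.getInstance(conf, "slowstart-demo");
        // (1) Size the reducer count to the data; too few starves
        //     parallelism, too many adds scheduling and merge overhead.
        job.setNumReduceTasks(8);
    }
}
```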

(3) Avoid using Reduce where possible, because using Reduce to join data sets generates heavy network traffic.

(4) Increase the number of parallel fetchers each Reduce uses to copy data from the Maps.

(5) If cluster resources permit, enlarge the memory on the Reduce side that buffers the fetched data.
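A sketch of (4) and (5) with the standard shuffle properties; the values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // (4) More parallel fetchers pulling map output (default 5).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // (5) Larger share of reducer heap for buffering fetched map
        //     output (default 0.70)...
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.8f);
        // ...and keep some merged data in memory during the reduce
        // instead of spilling everything to disk (default 0.0).
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.5f);
    }
}
```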

5. IO transmission

(1) Use data compression to cut network I/O time, e.g. by installing the Snappy and LZOP codecs.
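For example, map output (the shuffle's payload) can be compressed with Snappy, assuming the native Snappy library is installed on every node:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle network IO.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }
}
```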

(2) Use SequenceFile binary files

The SequenceFile class can pack many small files into a single binary container and store and process them efficiently.
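A minimal writer sketch that packs small files into one SequenceFile keyed by file name; the output path and payload below are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/packed.seq"); // placeholder output path
        // key = original file name, value = raw file bytes.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] content = "example payload".getBytes("UTF-8");
            writer.append(new Text("part-0001.txt"), new BytesWritable(content));
        }
    }
}
```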

6. Overall

(1) The default memory size of a MapTask is 1 GB; it can be raised to 4-5 GB.

(2) The default memory size of a ReduceTask is likewise 1 GB and can be raised to 4-5 GB.

(3) The CPU core counts of MapTask and ReduceTask can both be increased.

(4) Increase the number of CPU cores and the memory size of each Container.

Container: YARN's unit of resource scheduling; giving it more resources improves performance and lets task scheduling complete faster.

(5) Adjust the maximum retry counts of Map Tasks and Reduce Tasks.
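A sketch of the per-job settings for (1), (2), (3), and (5); the values are illustrative, and the Container limits from (4) are cluster-side yarn-site.xml settings noted in the comments:

```java
import org.apache.hadoop.conf.Configuration;

public class TaskResourceTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // (1)(2) Raise task container memory from the 1 GB default to 4 GB,
        // keeping the JVM heap ~80% of the container so it fits inside.
        conf.setInt("mapreduce.map.memory.mb", 4096);
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        // (3) More vcores per task (default 1).
        conf.setInt("mapreduce.map.cpu.vcores", 2);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        // (4) Cluster side, in yarn-site.xml: yarn.nodemanager.resource.memory-mb
        //     and yarn.nodemanager.resource.cpu-vcores bound what each
        //     NodeManager can hand out to Containers.
        // (5) Maximum attempts before a task is marked failed (default 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
    }
}
```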
