Hadoop: Small file optimization method

Disadvantages of small files in Hadoop

Every file stored on HDFS requires a corresponding metadata entry on the NameNode, and each entry takes roughly 150 bytes of heap. When there are many small files, a huge number of metadata entries is generated: on the one hand, they occupy a large amount of NameNode memory; on the other hand, so many entries make addressing and indexing slower. For a rough sense of scale, 10 million small files mean about 10,000,000 × 150 B ≈ 1.4 GB of NameNode memory spent on metadata alone, no matter how little data the files actually contain.
On top of that, too many small files produce too many input splits during MapReduce computation, so too many MapTasks have to be started. Each MapTask then processes only a small amount of data, and its processing time can end up shorter than its startup time, wasting resources for nothing.

Hadoop small file solution

1) Directions for small file optimization:
(1) During data collection, merge small files or small batches of data into larger files before uploading them to HDFS.
(2) Before business processing, use a MapReduce program to merge the small files already on HDFS.
(3) During MapReduce processing, CombineTextInputFormat can be used to improve efficiency.
(4) Turn on uber mode to enable JVM reuse.
2) Hadoop Archive
Hadoop Archive (HAR) is an efficient file archiving tool that stores small files inside HDFS blocks. It can pack many small files into a single HAR file, thereby reducing the NameNode's memory usage.
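As a minimal sketch (the archive path and directory names are assumptions, not values from the original post), files inside a HAR remain accessible through the har:// scheme, so downstream jobs can still read them individually while the NameNode only tracks the few files that make up the archive itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical archive path; a HAR like this is typically created beforehand with:
        //   hadoop archive -archiveName input.har -p /user/hadoop/input /user/hadoop/output
        Path harRoot = new Path("har:///user/hadoop/output/input.har");
        // The har:// scheme lets the archived small files be listed and read
        // as if they were ordinary HDFS files.
        FileSystem harFs = harRoot.getFileSystem(conf);
        for (FileStatus status : harFs.listStatus(harRoot)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
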
3) CombineTextInputFormat
CombineTextInputFormat is used during the input-splitting phase to pack many small files into a single split or a small number of splits, instead of producing one split per file.
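A minimal driver sketch showing where CombineTextInputFormat is plugged in (the class name, output types, and the 4 MB split cap are assumptions to tune for your own job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-combine-demo");
        job.setJarByClass(CombineDriver.class);

        // Hypothetical mapper/reducer classes; substitute your own job logic.
        // job.setMapperClass(WordCountMapper.class);
        // job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Key change: use CombineTextInputFormat instead of the default
        // TextInputFormat so many small files are packed into a few splits.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound for one combined split (4 MB here, an assumed value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
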
4) Turn on uber mode to enable JVM reuse.
By default, every Task starts its own JVM to run in. If the amount of data a Task processes is small, we can let multiple Tasks of the same Job run inside one JVM instead of starting a separate JVM for each Task.
To turn on uber mode, add the following configuration to mapred-site.xml:

<!-- Enable uber mode -->
<property>
	<name>mapreduce.job.ubertask.enable</name>
	<value>true</value>
</property>

<!-- Maximum number of MapTasks in uber mode; may only be adjusted downward -->
<property>
	<name>mapreduce.job.ubertask.maxmaps</name>
	<value>9</value>
</property>
<!-- Maximum number of ReduceTasks in uber mode; may only be adjusted downward -->
<property>
	<name>mapreduce.job.ubertask.maxreduces</name>
	<value>1</value>
</property>
<!-- Maximum amount of input data in uber mode; defaults to the value of dfs.blocksize and may only be adjusted downward -->
<property>
	<name>mapreduce.job.ubertask.maxbytes</name>
	<value></value>
</property>
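
The same switches can also be applied per job from the driver code instead of cluster-wide in mapred-site.xml; a sketch under that assumption (the job name and the rest of the driver are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class UberModeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same settings as the XML above, but scoped to this job only.
        conf.setBoolean(MRJobConfig.JOB_UBERTASK_ENABLE, true);  // mapreduce.job.ubertask.enable
        conf.setInt(MRJobConfig.JOB_UBERTASK_MAXMAPS, 9);        // mapreduce.job.ubertask.maxmaps
        conf.setInt(MRJobConfig.JOB_UBERTASK_MAXREDUCES, 1);     // mapreduce.job.ubertask.maxreduces

        Job job = Job.getInstance(conf, "uber-mode-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}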
