HDFS small file problem and its solution

HDFS small file problem

1. What is the impact?

  • NameNode memory consumption
  • Fragmentation => too many map tasks
  • JVM startup and teardown overhead

① NameNode memory

In the NameNode, every file and every block is represented by a metadata object that occupies roughly 150 bytes of heap memory.
With 100 million small files, that metadata alone consumes tens of gigabytes of NameNode memory, regardless of how little data the files actually contain.
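
A back-of-the-envelope estimate, assuming the common ~150-bytes-per-object rule of thumb (exact sizes vary by Hadoop version):

```java
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long smallFiles = 100_000_000L;   // 100 million small files
        long bytesPerObject = 150L;       // rough heap cost of one file or block object
        // each small file needs at least one file (inode) object and one block object
        long estimatedHeapBytes = smallFiles * 2 * bytesPerObject;
        System.out.printf("Estimated NameNode heap: ~%d GB%n",
                estimatedHeapBytes / (1024L * 1024 * 1024));
    }
}
```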

② Fragmentation => too many map tasks

By default, every file smaller than the 128 MB block size still produces its own input split, so each small file gets its own map task.
If each map task is allocated about 1 GB of memory, a flood of small files will quickly exhaust the cluster's memory.

③ JVM startup and teardown overhead

Each map task normally runs in its own JVM, so millions of tiny tasks also mean millions of JVM starts and stops, and that overhead can dwarf the actual processing time.

2. Solutions

NameNode memory

  1. The first is to use HAR archives (created with the hadoop archive command) to pack many small files into a single archive, so the NameNode only has to track the few archive files instead of every small file.
  2. The second is to use a custom InputFormat to pack the small files into a SequenceFile, with the file name as the key and the file contents as the value; a sketch of this idea follows the list.
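
A minimal sketch of the second approach. Instead of a full custom InputFormat plus a MapReduce job, this standalone program shows the same idea: pack file-name/file-content pairs into one SequenceFile. The paths and directory layout are hypothetical; the SequenceFile.Writer options are standard Hadoop 2.x+ API.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/input/smallfiles");       // hypothetical directory of small files
        Path seqFile  = new Path("/output/smallfiles.seq");  // hypothetical target SequenceFile

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                // read the whole small file into memory
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // key = original file name, value = entire file content
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}
```

The many small files then appear to downstream jobs as one large, splittable SequenceFile, so the NameNode only tracks that one file.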

Fragmentation => too many map tasks

  1. Use CombineTextInputFormat (a concrete subclass of CombineFileInputFormat), which logically groups many small files into a single split before map tasks are created; see the driver sketch below.
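
A minimal driver sketch; the class name, paths, and split-size value are illustrative, but setInputFormatClass and setMaxInputSplitSize are the standard way to wire up CombineTextInputFormat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Group many small files into large splits instead of one split per file
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (tune this to your block size)
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);

        job.setMapperClass(Mapper.class);   // identity mapper, just for illustration
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path("/input/smallfiles"));  // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/output/combined"));  // hypothetical

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the default TextInputFormat, 10,000 small files mean 10,000 map tasks; with the setting above, the files are packed into splits of up to 128 MB each, so the task count drops to roughly (total data size / 128 MB).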

JVM startup and teardown overhead

  1. Enable JVM reuse, so one task JVM runs several tasks in a row instead of a fresh JVM being started per task (only worthwhile when there are many small tasks; if there are no small files, leave JVM reuse off, because a reused JVM stays open waiting for more tasks and ties up its slot). A configuration sketch follows.
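
A hedged configuration sketch. The JVM-reuse property shown is the classic MRv1-era setting (historically mapred.job.reuse.jvm.num.tasks); on YARN / MRv2 it is largely superseded by uber mode, so treat both settings as illustrative rather than definitive:

```java
import org.apache.hadoop.conf.Configuration;

public class JvmReuseConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // MRv1-style JVM reuse: let one task JVM run up to 10 tasks (-1 = unlimited)
        conf.setInt("mapreduce.job.jvm.numtasks", 10);
        // On YARN, small jobs can instead run all their tasks in the AM's single JVM (uber mode)
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        // pass this conf to Job.getInstance(conf) when submitting the job
    }
}
```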

