The impact of small files on Hadoop clusters

1 Definition of small files

Small files are files whose size is much smaller than the HDFS block size (128 MB by default). Hadoop is designed to handle a modest number of large files, not a large number of small files.
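As a rough illustration (not part of the original post), the sketch below uses the standard Hadoop FileSystem API to count files that are well below the block size under a directory. The path /data and the "less than a quarter of a block" threshold are arbitrary placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SmallFileScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size as seen by the client (128 MB unless overridden in dfs.blocksize).
        long blockSize = fs.getDefaultBlockSize(new Path("/"));

        long small = 0, total = 0;
        // Recursively walk a (placeholder) directory and count "small" files.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            total++;
            // Treat a file as small if it is well below one block, e.g. < 25% of the block size.
            if (status.getLen() < blockSize / 4) {
                small++;
            }
        }
        System.out.printf("small files: %d of %d (block size %d bytes)%n",
                small, total, blockSize);
    }
}
```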

2 Problems caused by small files

  •        First, every block, file, and directory in HDFS is represented as an object in the namenode's memory, and each object occupies roughly 150 bytes. With 10 million small files, each occupying its own block, the namenode must track about 20 million objects, or roughly 3 GB of heap; 100 million files would need on the order of 30 GB. The namenode's memory capacity therefore becomes a hard limit on how far the cluster can scale (a worked estimate follows this list).
  •        Second, accessing a large number of small files is far slower than accessing a few large files holding the same data. HDFS was designed for streaming reads of large files; reading many small files forces the client to hop constantly from one datanode to another, which seriously hurts throughput.
  •        Finally, processing a large number of small files is much slower than processing large files of the same total size. Each small file occupies its own task slot, and task startup and teardown (JVM launch and destruction) can consume a large share, sometimes the majority, of the total job time.
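To make the first point concrete, here is a minimal back-of-the-envelope sketch, assuming roughly 150 bytes per namespace object and one block per file as described above. The exact per-object cost varies by Hadoop version, so this is only an order-of-magnitude estimate.

```java
public class NamenodeMemoryEstimate {
    // Approximate heap cost per namespace object (file or block), as cited above.
    static final long BYTES_PER_OBJECT = 150;

    // Each small file contributes one file object and one block object.
    static long estimateHeapBytes(long fileCount) {
        return fileCount * 2 * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        System.out.printf("10M files  -> ~%.1f GB%n", estimateHeapBytes(10_000_000L) / 1e9);
        System.out.printf("100M files -> ~%.1f GB%n", estimateHeapBytes(100_000_000L) / 1e9);
    }
}
```

Running this prints about 3 GB for 10 million files and about 30 GB for 100 million files, which is why namenode heap, not disk capacity, is usually the first limit a small-file-heavy cluster hits.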

 
