Small Files in Hadoop

1. What Is a Small File?

A small file is any file that is significantly smaller than the Hadoop block size. The block size is usually set to 64 MB, 128 MB, or 256 MB, and the trend is toward ever larger values; the rest of this post assumes a 128 MB block size, which is also the default in CDH. A useful rule of thumb is that a file smaller than 75% of the block size counts as a small file. Note that the number of rows never enters the definition: row counts can affect MapReduce performance, but they matter far less than file size when deciding how to write data to HDFS.

The problem is not limited to files that are literally small. If a large number of files in the cluster are only marginally larger than a multiple of the block size, you face the same challenges. For example, with a 128 MB block size, loading files that are all 136 MB leaves you with a large number of 8 MB blocks. The good news is that this "small block" problem can be solved simply by choosing an appropriate (larger) block size; solving the small file problem is considerably more complex.
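To see how widespread the problem is on an existing cluster, it helps to count the files that fall below the 75% threshold. Below is a minimal sketch against the standard Hadoop FileSystem API; the class name SmallFileScan, the single root-path argument, and the hard-coded 0.75 ratio are illustrative assumptions, not anything from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Walks a directory tree in HDFS and counts files smaller than 75% of their block size.
public class SmallFileScan {
    public static void main(String[] args) throws Exception {
        Path root = new Path(args.length > 0 ? args[0] : "/");
        FileSystem fs = FileSystem.get(new Configuration());

        long total = 0, small = 0;
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            total++;
            // Treat a file as "small" if it is below 75% of its own block size.
            if (f.getLen() < 0.75 * f.getBlockSize()) {
                small++;
            }
        }
        System.out.printf("%d of %d files are smaller than 75%% of the block size%n", small, total);
    }
}
```

Packaged into a jar and run with hadoop jar against a data directory, this reports a rough ratio of small files to total files, which is usually enough to tell whether the problem is worth tackling.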
2. Where Do Small Files Come From?

The small file problem is one that Pentaho Consulting sees frequently on Hadoop projects, and it is quite normal for a Hadoop cluster to have one. Common causes include:

1. Hadoop is increasingly used for (near) real-time processing, so ingestion jobs run every hour, day, or week, with perhaps only 10 MB of new data generated per period.

2. The source system produces thousands of small files, which are copied into Hadoop as-is, without any consolidation.

3. MapReduce jobs are configured with more reducers than necessary (or with no cap at all), and each reducer writes its own output file; see the driver sketch after this list. Along the same lines, if the data is skewed so that most of it goes to one reducer, the remaining reducers each process very little data and produce small output files.
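To make the third point concrete, here is a minimal MapReduce driver sketch. The class name OutputFileCountDemo, the identity map/reduce, and the reducer count of 4 are hypothetical; the point is only that job.setNumReduceTasks() directly determines how many part-r-NNNNN output files the job writes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver showing where the reducer count is set. Each reducer writes its
// own part-r-NNNNN file, so an oversized reducer count directly translates into
// many small output files.
public class OutputFileCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-file-count-demo");
        job.setJarByClass(OutputFileCountDemo.class);

        // With the default TextInputFormat and identity Mapper/Reducer,
        // keys are byte offsets (LongWritable) and values are lines (Text).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // One output file per reducer: size this number to the actual data
        // volume instead of accepting an arbitrary default.
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Choosing the reducer count based on the expected output volume, rather than a cluster-wide default, is one of the cheapest ways to avoid producing small files in the first place.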
3. Why Do Small Files Cause Problems in Hadoop?

There are two primary reasons small files are a problem in Hadoop: NameNode memory management and MapReduce performance. Every directory, file, and block in HDFS is represented as an object in the NameNode's memory, and as a rule of thumb each object takes about 150 bytes. If HDFS holds 20 million files, each occupying a single block, the NameNode needs about 6 GB of memory (20,000,000 files × 2 objects × 150 bytes). That is obviously manageable, but as the data and the cluster grow you eventually hit a practical limit on how many files (blocks) a single NameNode can handle. Under the same assumptions, one billion files require about 300 GB of memory, and that is assuming every file sits in the same directory. Consider what a 300 GB NameNode memory requirement means:

1. Whenever the NameNode restarts, it must read the metadata of every file from local disk, which means reading roughly 300 GB of data into memory and inevitably makes startup slow.

2. In normal operation, the NameNode constantly tracks and checks where every block is stored in the cluster. It does this by listening for DataNodes to report on all of their blocks; the more blocks a DataNode has to report, the more network bandwidth (and latency) this consumes. Even with high-speed interconnects between nodes, block reporting at this scale becomes disruptive.

3. A NameNode using 300 GB of memory means a JVM configured with a 300 GB heap, which is a stability risk in its own right, for example through very long GC pauses.

So optimizing away the small file problem is all but mandatory. If you can reduce the number of small files on the cluster, you reduce the NameNode's memory footprint, its startup time, and the network impact.
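The 150-bytes-per-object rule of thumb translates directly into a back-of-the-envelope heap estimate. The snippet below is only that: a rough calculation that assumes one file object plus one block object per file and ignores directory objects; the class and method names are made up for illustration.

```java
// Back-of-the-envelope NameNode heap estimate based on the ~150 bytes/object
// rule of thumb quoted above (one object per file plus one per block).
public class NameNodeHeapEstimate {
    static long estimateBytes(long files, double blocksPerFile) {
        long objects = files + (long) (files * blocksPerFile); // file objects + block objects
        return objects * 150L;                                  // ~150 bytes per in-memory object
    }

    public static void main(String[] args) {
        System.out.printf("20M single-block files: ~%.1f GB%n",
                estimateBytes(20_000_000L, 1.0) / 1e9);
        System.out.printf("1B single-block files:  ~%.1f GB%n",
                estimateBytes(1_000_000_000L, 1.0) / 1e9);
    }
}
```

Under these assumptions, 20 million single-block files work out to roughly 6 GB and one billion to roughly 300 GB, matching the figures above.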


Reprinted from blog.csdn.net/shkstart/article/details/108327633