A detailed explanation of the HDFS replica disk selection strategies


In HDFS, a DataNode stores data blocks in local file system directories, which are configured through the dfs.datanode.data.dir parameter in hdfs-site.xml. A typical installation configures multiple directories, each on a different device, for example on different HDDs (Hard Disk Drives) and SSDs (Solid State Drives).
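
For example, a DataNode with three data disks might be configured like this (the mount points below are made up for illustration):

```xml
<!-- hdfs-site.xml: one directory per physical device (example paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/ssd1/dfs/dn</value>
</property>
```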

When we write a new block to HDFS, the DataNode uses a volume selection policy to decide which disk the block is stored on. The policy is set through the dfs.datanode.fsdataset.volume.choosing.policy parameter, which currently supports two disk selection policies (a sample configuration follows the list):

  • round-robin

  • available space
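
To pick a policy explicitly, set the fully qualified class name in hdfs-site.xml; for example, the following snippet switches a DataNode to the available space policy:

```xml
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```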

The default value of the dfs.datanode.fsdataset.volume.choosing.policy parameter is org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy. Both disk selection strategies implement the org.apache.hadoop.hdfs.server.datanode.fsdataset.VolumeChoosingPolicy interface, which defines just one function, chooseVolume, as follows:
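
A minimal sketch of the interface, paraphrased from the Hadoop 2.x sources (the Javadoc is shortened here, and newer releases may add parameters):

```java
package org.apache.hadoop.hdfs.server.datanode.fsdataset;

import java.io.IOException;
import java.util.List;

public interface VolumeChoosingPolicy<V extends FsVolumeSpi> {

  /**
   * Choose a volume out of the given list to place a new replica of the
   * given size, or throw an IOException (typically DiskOutOfSpaceException)
   * if none of the volumes can hold it.
   */
  V chooseVolume(List<V> volumes, long blockSize) throws IOException;
}
```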

The chooseVolume function selects a disk that meets the requirements from volumes for the specified replica. The two disk selection strategies built into Hadoop are described in detail below.

round-robin disk selection strategy 

As the name suggests, this disk selection strategy stores blocks by polling the disks in turn; the concrete implementation class is org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy. Its implementation is simple:
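
Below is a lightly commented sketch based on the Hadoop 2.x implementation of this class (the comments are added here; details may differ in newer releases):

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.VolumeChoosingPolicy;
import org.apache.hadoop.util.DiskChecker.DiskOutOfSpaceException;

public class RoundRobinVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V> {

  // Index of the volume to try next; shared across calls.
  private int curVolume = 0;

  @Override
  public synchronized V chooseVolume(final List<V> volumes, final long blockSize)
      throws IOException {
    if (volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }
    // Volumes may have been removed after a disk failure, so re-check bounds.
    if (curVolume >= volumes.size()) {
      curVolume = 0;
    }
    int startVolume = curVolume;
    long maxAvailable = 0;

    while (true) {
      final V volume = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      long availableVolumeSize = volume.getAvailable();
      // Enough room for the new replica? Then this volume wins.
      if (availableVolumeSize > blockSize) {
        return volume;
      }
      if (availableVolumeSize > maxAvailable) {
        maxAvailable = availableVolumeSize;
      }
      // Back at the starting index: every volume has been tried and failed.
      if (curVolume == startVolume) {
        throw new DiskOutOfSpaceException("Out of space: the volume with the most "
            + "available space (=" + maxAvailable + " B) is less than the block "
            + "size (=" + blockSize + " B).");
      }
    }
  }
}
```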

The volumes parameter is the list of directories configured through dfs.datanode.data.dir, and blockSize is the size of the replica to be stored. The RoundRobinVolumeChoosingPolicy strategy first polls for the next volume: if that volume's free space is larger than the replica, the volume is returned directly to store the data; if the free space of the current volume is not enough to hold the replica, the next volume is selected in polling order until a usable one is found. If no volume that can hold the replica is found after traversing all volumes, a DiskOutOfSpaceException is thrown.

As this strategy shows, although polling guarantees that all disks are used, a large number of file deletions on HDFS can leave disk usage unevenly distributed: some disks end up very full while others still have plenty of unused space.

available space disk selection policy

The available space disk selection policy was introduced in Hadoop 2.1.0 (see HDFS-1804 for details). This strategy preferentially writes data to the disks with the most free space, and internally it reuses the round-robin selection strategy described above. The concrete implementation is in the org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy class. The core implementation is as follows:
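
A condensed sketch of that logic is shown below. It hard-codes the default values of the two configuration parameters and uses a flat 75%/25% split; the real class reads both values from the configuration and additionally scales the selection probability by the size of each subset:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.VolumeChoosingPolicy;

public class AvailableSpaceSketch<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V> {

  // dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
  private final long balancedSpaceThreshold = 10L * 1024 * 1024 * 1024; // 10 GB

  // dfs.datanode.available-space-volume-choosing-policy
  //   .balanced-space-preference-fraction (default 0.75)
  private final float balancedPreferencePercent = 0.75f;

  // Separate round-robin state per subset, as in the real class.
  private final RoundRobinVolumeChoosingPolicy<V> roundRobinBalanced =
      new RoundRobinVolumeChoosingPolicy<>();
  private final RoundRobinVolumeChoosingPolicy<V> roundRobinHigh =
      new RoundRobinVolumeChoosingPolicy<>();
  private final RoundRobinVolumeChoosingPolicy<V> roundRobinLow =
      new RoundRobinVolumeChoosingPolicy<>();
  private final Random random = new Random();

  @Override
  public synchronized V chooseVolume(List<V> volumes, long blockSize)
      throws IOException {
    // Case 1: all volumes are roughly equally full -- plain round-robin.
    if (areAllVolumesWithinFreeSpaceThreshold(volumes)) {
      return roundRobinBalanced.chooseVolume(volumes, blockSize);
    }

    // Case 2: split the volumes around (min available + threshold).
    long leastAvailable = Long.MAX_VALUE;
    for (V v : volumes) {
      leastAvailable = Math.min(leastAvailable, v.getAvailable());
    }
    List<V> high = new ArrayList<>();
    List<V> low = new ArrayList<>();
    long mostAvailableAmongLow = 0;
    for (V v : volumes) {
      long available = v.getAvailable();
      if (available > leastAvailable + balancedSpaceThreshold) {
        high.add(v);
      } else {
        low.add(v);
        mostAvailableAmongLow = Math.max(mostAvailableAmongLow, available);
      }
    }

    // Case 2.1: no low-space volume can hold the replica at all.
    if (mostAvailableAmongLow < blockSize) {
      return roundRobinHigh.chooseVolume(high, blockSize);
    }
    // Case 2.2: prefer the high-space subset with probability 0.75.
    if (random.nextFloat() < balancedPreferencePercent) {
      return roundRobinHigh.chooseVolume(high, blockSize);
    }
    return roundRobinLow.chooseVolume(low, blockSize);
  }

  // True iff (max available - min available) < balancedSpaceThreshold.
  private boolean areAllVolumesWithinFreeSpaceThreshold(List<V> volumes)
      throws IOException {
    long least = Long.MAX_VALUE;
    long most = 0;
    for (V v : volumes) {
      long available = v.getAvailable();
      least = Math.min(least, available);
      most = Math.max(most, available);
    }
    return most - least < balancedSpaceThreshold;
  }
}
```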

The areAllVolumesWithinFreeSpaceThreshold function first computes the maximum and minimum available space over all volumes, and then compares their difference (maximum minus minimum) with balancedSpaceThreshold (configured through the dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold parameter; the default is 10 GB).

The available space policy then distinguishes three cases:

1. If all volumes have roughly the same amount of free space, the difference between the maximum and minimum available space is small, and the round-robin disk selection strategy is used to place the replica.

2. If the available space differs considerably between volumes, the policy splits the disks configured in volumes into highAvailableVolumes and lowAvailableVolumes. The rule is: take the minimum available space among the configured disks and add balancedSpaceThreshold (10 GB); every volume whose free space is greater than this value goes into highAvailableVolumes, and every volume whose free space is less than or equal to this value goes into lowAvailableVolumes.

For example, suppose volumes consists of five disks whose numbers and available space are 1 (1 GB), 2 (50 GB), 3 (25 GB), 4 (5 GB) and 5 (30 GB). Following the rule above, the minimum available space of these disks is 1 GB; adding balancedSpaceThreshold gives 11 GB, so disks 1 and 4 go into lowAvailableVolumes, while disks 2, 3 and 5 go into highAvailableVolumes (the short demo after this list replays this computation).

At this point every disk in volumes has been assigned to either highAvailableVolumes or lowAvailableVolumes.

2.1. If the current replica is larger than the largest available space in lowAvailableVolumes (mostAvailableAmongLowVolumes; in the example above the largest available space in lowAvailableVolumes is 5 GB), a volume is taken from highAvailableVolumes by polling to store the replica.

2.2. In the remaining cases, with a probability of 75% (configured through the dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction parameter; values between 0.5 and 1.0 are recommended) a volume from highAvailableVolumes is chosen by polling to store the replica, and with a probability of 25% a volume from lowAvailableVolumes is chosen by polling.
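
As a quick check of the five-disk example above, the partition rule can be replayed in a few lines of plain Java (the numbers are taken from the example; GB is a local constant):

```java
public class PartitionExample {
  public static void main(String[] args) {
    final long GB = 1L << 30;
    // Disks 1..5 with the available space from the example above.
    long[] available = {1 * GB, 50 * GB, 25 * GB, 5 * GB, 30 * GB};

    long least = Long.MAX_VALUE;
    for (long a : available) {
      least = Math.min(least, a);
    }
    long cutoff = least + 10 * GB; // 1 GB + balancedSpaceThreshold = 11 GB

    for (int i = 0; i < available.length; i++) {
      String subset = available[i] > cutoff
          ? "highAvailableVolumes" : "lowAvailableVolumes";
      System.out.println("disk " + (i + 1) + " -> " + subset);
    }
    // Output: disks 2, 3 and 5 -> highAvailableVolumes; disks 1 and 4 -> low.
  }
}
```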

However, in a long-running cluster, large-scale file deletions in HDFS or adding new disks to a DataNode can still leave the data stored on different disks of the same DataNode very unbalanced. Even with the available-space-based policy, volume imbalance can lead to less efficient disk I/O: for example, all newly written blocks may go to a newly added disk while the other disks stay idle, making the new disk the bottleneck of the whole system.
