A Brief Look at HDFS Disk Configuration in Hadoop (Including the Read/Write Policy)

I. Introduction:

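Anyone who has done a basic Hadoop setup knows that core-site.xml and hdfs-site.xml contain settings for the tmp data directory and for the actual data directories. So when jobs actually run, how is the disk read/write policy chosen for those directories, particularly in a cluster where every node has several disks?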

II. Key concepts:

1. core-site.xml configures the directory for temporary data, which can also serve as a temporary working directory:

hadoop.tmp.dir                                 /tmp/hadoop-${user.name}

2. hdfs-site.xml configures the directory for the NameNode's data and the directories where the final job data is stored:

dfs.namenode.name.dir                 file://${hadoop.tmp.dir}/dfs/name

dfs.datanode.data.dir                     file://${hadoop.tmp.dir}/dfs/data

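3. hdfs-site.xml also configures the HDFS volume choosing policy. This policy decides which volume the DataNode writes each new block to; reads simply go to whichever volume already holds the block.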

    <property>
    <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
    <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy</value>
    <description>
    The class name of the policy for choosing volumes in the list of
    directories.  Defaults to
    org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy.
    If you would like to take into account available disk space, set the
    value to
    "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy".
   </description>

   </property>

4. Hierarchy:

|BlockPool
    |FsVolumeList
       |FsVolumeImpl
            |BlockPoolSlice

Each BlockPoolSlice manages all the data blocks stored under its directory.


The corresponding datanode.log output is as follows:

-----------------------------------------------------------------------------------------------------------------------------------

INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /home/hadoop-user/ssd2/datanode is not formatted for BP-1908069725-10.64.16.27-1530698640947
INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-1908069725-10.64.16.27-1530698640947
INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /home/hadoop-user/ssd2/datanode/current/BP-1908069725-10.64.16.27-1530698640947
INFO org.apache.hadoop.hdfs.server.common.Storage: Block pool storage directory /home/hadoop-user/ssd2/datanode/current/BP-1908069725-10.64.16.27-1530698640947 is not formatted for BP-1908069725-10.64.16.27-1530698640947
INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting block pool BP-1908069725-10.64.16.27-1530698640947 directory /home/hadoop-user/ssd2/datanode/current/BP-1908069725-10.64.16.27-1530698640947/current
INFO org.apache.hadoop.hdfs.server.common.Storage: Restored 0 block files from trash.
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=843268798;bpid=BP-1908069725-10.64.16.27-1530698640947;lv=-56;nsInfo=lv=-63;cid=CID-20d23311-8321-4761-b9a8-8fb405727dca;nsid=843268798;c=0;bpid=BP-1908069725-10.64.16.27-1530698640947;dnuuid=null
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Generated and persisted new Datanode UUID 3f12792f-ae24-42cb-9b67-425625918424
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added new volume: DS-51cce613-b820-40b6-b429-2ab072c2eecc
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added volume - /home/hadoop-user/data/hdfs/datanode/current, StorageType: DISK
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added new volume: DS-38cc1cfc-bbf0-4b6b-adf7-2f63b97ff206
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added volume - /home/hadoop-user/ssd2/datanode/current, StorageType: DISK
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Registered FSDatasetState MBean
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding block pool BP-1908069725-10.64.16.27-1530698640947
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1908069725-10.64.16.27-1530698640947 on volume /home/hadoop-user/data/hdfs/datanode/current...
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1908069725-10.64.16.27-1530698640947 on volume /home/hadoop-user/ssd2/datanode/current...
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1908069725-10.64.16.27-1530698640947 on /home/hadoop-user/data/hdfs/datanode/current: 22ms
INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1908069725-10.64.16.27-1530698640947 on /home/hadoop-user/ssd2/datanode/current: 22ms

INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to scan all replicas for block pool BP-1908069725-10.64.16.27-1530698640947: 23ms

-----------------------------------------------------------------------------------------------------------------------------------
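The hierarchy in item 4 can be read straight off the paths in the log above. The following is a minimal sketch of that structure in Python, not the actual Java classes: `Volume` and `VolumeList` are hypothetical stand-ins for FsVolumeImpl and FsVolumeList, and the directory layout `<volume>/<bpid>/current` is taken from the "Formatting block pool" log line.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BlockPoolSlice:
    """Blocks of one block pool that live on one volume."""
    bpid: str
    directory: str


@dataclass
class Volume:
    """Hypothetical stand-in for FsVolumeImpl: one configured data directory."""
    current_dir: str  # e.g. /home/hadoop-user/ssd2/datanode/current
    slices: Dict[str, BlockPoolSlice] = field(default_factory=dict)

    def add_block_pool(self, bpid: str) -> BlockPoolSlice:
        # Layout copied from the log: <volume>/<bpid>/current
        new_slice = BlockPoolSlice(bpid, f"{self.current_dir}/{bpid}/current")
        self.slices[bpid] = new_slice
        return new_slice


@dataclass
class VolumeList:
    """Hypothetical stand-in for FsVolumeList: every volume on this DataNode."""
    volumes: List[Volume] = field(default_factory=list)

    def add_block_pool(self, bpid: str) -> List[str]:
        # A block pool gets one slice on each volume.
        return [v.add_block_pool(bpid).directory for v in self.volumes]


BPID = "BP-1908069725-10.64.16.27-1530698640947"
volumes = VolumeList([
    Volume("/home/hadoop-user/data/hdfs/datanode/current"),
    Volume("/home/hadoop-user/ssd2/datanode/current"),
])
slice_dirs = volumes.add_block_pool(BPID)
for directory in slice_dirs:
    print(directory)
```

With the two volumes from the log, one block pool yields two slice directories, one per volume, which matches the two "Scanning block pool ... on volume ..." lines.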

III. Details:

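1. hadoop.tmp.dir is used by the NodeManager, for LocalizedResource, DefaultContainerExecutor, and launchContainer execution; see nodemanager.log.

2. dfs.namenode.name.dir is used by the NameNode, including the FSImage and FileJournalManager classes; see namenode.log.

3. dfs.datanode.data.dir is used by the DataNode. Each directory listed there ends up as one volume in the code (see FsVolumeImpl.java for its operations). HDFS disk reads and writes go through these volumes, and the volume for each new block is picked by the volume policy. All volumes together form the FsVolumeList (see FsVolumeList.java and FsDatasetImpl.java).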
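To make the "one directory, one volume" mapping concrete, here is a rough sketch of how a dfs.datanode.data.dir value could be split into volumes. It is a simplified model rather than Hadoop's own parser. The function name `parse_data_dirs` is hypothetical, the first two paths are borrowed from the log in section II (the author's real setting is not shown), and the `[SSD]` entry under `/mnt/ssd1` is an invented example of the optional storage-type prefix.

```python
import re
from typing import List, NamedTuple


class StorageDir(NamedTuple):
    storage_type: str
    path: str


# Optional storage-type tag in front of a directory, e.g. [SSD]file:///mnt/ssd1/datanode
_TAGGED = re.compile(r"^\[(\w+)\](.+)$")


def parse_data_dirs(value: str) -> List[StorageDir]:
    """Split a comma-separated dfs.datanode.data.dir value into one entry per volume."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _TAGGED.match(entry)
        if match:
            storage_type = match.group(1).upper()
            location = match.group(2).strip()
        else:
            # Untagged directories are treated as DISK, as in the log above.
            storage_type, location = "DISK", entry
        # Drop an optional file:// scheme so only the local path remains.
        if location.startswith("file://"):
            location = location[len("file://"):]
        result.append(StorageDir(storage_type, location))
    return result


dirs = parse_data_dirs(
    "/home/hadoop-user/data/hdfs/datanode,"
    "/home/hadoop-user/ssd2/datanode,"
    "[SSD]file:///mnt/ssd1/datanode"  # hypothetical tagged entry
)
for d in dirs:
    print(d.storage_type, d.path)
```

Note that the ssd2 directory is reported as StorageType: DISK in the log even though its name suggests an SSD, which is consistent with the directory having been configured without a tag.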

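The volume policy is set with the dfs.datanode.fsdataset.volume.choosing.policy parameter. The default, org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy, hands out volumes in turn. org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy instead favors the volumes (that is, the directories) with the most free space, and is governed by two settings. dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold is the difference in free space, in bytes, within which the volumes are considered balanced and plain round-robin is used. dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction is the fraction of new blocks sent to the volumes with more free space while the volumes are imbalanced. When disk usage on a node exceeds its upper limit, the node is listed under Unhealthy Nodes (its status can be checked on the cluster Nodes page of the YARN ResourceManager UI).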
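The two policies can be sketched in a few lines. This is a simplified Python model written for illustration, not a port of the Hadoop classes: the names `RoundRobinChooser` and `AvailableSpaceChooser` are hypothetical, volumes are represented only by their free bytes, and the default values of 10 GB and 0.75 are the documented defaults of the two settings. The real class also weighs the probability by how many volumes fall into each group, which is left out here.

```python
import random
from typing import Optional, Sequence

GB = 1024 ** 3


class RoundRobinChooser:
    """Cycle through the candidate volumes, skipping any that cannot hold the block."""

    def __init__(self) -> None:
        self._next = 0

    def choose(self, candidates: Sequence[int], free: Sequence[int], block_size: int) -> int:
        # candidates: volume indexes to cycle over; free: free bytes per volume index
        for _ in range(len(candidates)):
            index = candidates[self._next % len(candidates)]
            self._next += 1
            if free[index] >= block_size:
                return index
        raise IOError("no volume has room for the block")


class AvailableSpaceChooser:
    """Prefer volumes with more free space once the volumes drift out of balance."""

    def __init__(self, threshold: int = 10 * GB, preference: float = 0.75,
                 rng: Optional[random.Random] = None) -> None:
        self.threshold = threshold      # balanced-space-threshold
        self.preference = preference    # balanced-space-preference-fraction
        self._rng = rng or random.Random()
        self._balanced = RoundRobinChooser()
        self._high = RoundRobinChooser()
        self._low = RoundRobinChooser()

    def choose(self, free: Sequence[int], block_size: int) -> int:
        everything = list(range(len(free)))
        # Within the threshold the volumes count as balanced: plain round-robin.
        if max(free) - min(free) <= self.threshold:
            return self._balanced.choose(everything, free, block_size)
        cutoff = min(free) + self.threshold
        low = [i for i in everything if free[i] <= cutoff]
        high = [i for i in everything if free[i] > cutoff]
        low_fits = any(free[i] >= block_size for i in low)
        # Send roughly `preference` of the blocks to the emptier group.
        if self._rng.random() < self.preference or not low_fits:
            return self._high.choose(high, free, block_size)
        return self._low.choose(low, free, block_size)


BLOCK = 128 * 1024 * 1024
chooser = AvailableSpaceChooser(rng=random.Random(0))
print(chooser.choose([100 * GB, 105 * GB], BLOCK))  # balanced case
print(chooser.choose([20 * GB, 200 * GB], BLOCK))   # imbalanced case
```

In the balanced case the choice is the same as under round-robin. In the imbalanced case the result depends on the random draw, with about three out of four new blocks expected on the volume that has 200 GB free.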

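The figure illustrates the two policies (disk throughput under the RR policy >= disk throughput under the AvailableSpace policy):

[Figure: volumes holding Data, with the placement of each New Block under the two policies]

Data is the data already written; New Block is the new data about to be written.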
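One way to see why RR throughput can be at least as high is to count which disks receive a run of new blocks. The toy simulation below is only an illustration under stated assumptions: free space is measured in whole blocks, the three starting values are made up, and `emptiest_first` is a deliberately crude stand-in for the AvailableSpace policy that always picks the single emptiest disk.

```python
from collections import Counter
from typing import Callable, List


def round_robin(free: List[int], step: int) -> int:
    # Take the disks strictly in turn, ignoring how full they are.
    return step % len(free)


def emptiest_first(free: List[int], step: int) -> int:
    # Crude stand-in for AvailableSpace: always the disk with the most free space.
    return max(range(len(free)), key=lambda i: free[i])


def simulate(pick: Callable[[List[int], int], int],
             initial_free: List[int], blocks: int) -> Counter:
    """Write `blocks` new blocks and count how many land on each disk."""
    free = list(initial_free)
    writes: Counter = Counter()
    for step in range(blocks):
        disk = pick(free, step)
        free[disk] -= 1
        writes[disk] += 1
    return writes


initial_free = [10, 10, 40]  # hypothetical free space per disk, in blocks
rr_writes = simulate(round_robin, initial_free, 12)
es_writes = simulate(emptiest_first, initial_free, 12)
print("round robin:   ", dict(rr_writes))
print("emptiest first:", dict(es_writes))
```

Under these assumptions round-robin places 4 of the 12 blocks on each disk, so all three disks can work in parallel, while emptiest-first sends all 12 to the third disk and is limited by that one disk until the free space evens out.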

IV. References

DataNode initialization process: http://www.itboth.com/d/Y3YFJ3am2um2/hdfs

HDFS configuration parameters: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

Introduction to volume allocation policies: http://blog.cloudera.com/blog/2016/10/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/



Reposted from blog.csdn.net/don_chiang709/article/details/80997741