On the HDFS architecture and design

Author | great respect

HDFS is the Hadoop Distributed File System. This article walks through the more important points of HDFS's design so that readers can get a quick overall picture. It is suitable for readers who have little understanding of HDFS, and for beginners who still find it confusing. The main reference is the Hadoop 3.0 official documentation.

Link: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

When a data set grows beyond the storage capacity of a single physical machine, it has to be partitioned and stored across a number of separate, independent machines. A file system that manages storage across multiple machines like this is called a distributed file system.

Table of Contents

  • HDFS use cases

  • HDFS operating mode

  • The file system namespace

  • Data replication

  • Persistence of file system metadata

  • Communication protocols

  • Robustness

  • Data organization

  • Accessibility

  • Storage space reclamation

1. HDFS use cases

HDFS is suited to storing very large files that are accessed in a streaming fashion: data is written once and read many times, and long-running analyses are repeatedly performed over the data set, each involving most or all of the data. Hadoop currently supports storing data at the PB scale.

HDFS is not suitable for applications that require low-latency access to data, because HDFS is optimized for high data throughput, and this may come at the cost of significant latency.

The total number of files HDFS can store is limited by the namenode's memory capacity. As a rule of thumb, one million files, each occupying one block, require at least 300 MB of namenode memory.

At present an HDFS file may have only a single writer, and writes always append data at the end of the file; modifying a file at arbitrary offsets is not supported.

Like ordinary file systems, HDFS has the concept of a block; the default block size is 128 MB. A file on HDFS is divided into multiple block-sized chunks, each stored as an independent unit, but a file smaller than one block does not occupy the whole block's space. Unless otherwise indicated, "block" in this article refers specifically to the HDFS block.

Why are HDFS blocks so large? The purpose is to minimize addressing overhead. But the block size cannot be set too large either: a MapReduce map task typically processes only one block of data at a time, so if blocks are too big there are too few tasks and the job runs more slowly.
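The block arithmetic above can be sketched in a few lines of Python (an illustration only, not Hadoop code):

```python
# A minimal sketch of how a file of a given size splits into fixed-size
# blocks; only the last block may be smaller, and it occupies only as much
# space as it actually holds.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size, 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block;
# the last block takes only 44 MB of storage, not a full 128 MB.
print(split_into_blocks(300 * 1024 * 1024))
```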

2. HDFS operating mode

HDFS uses a master/slave architecture: a single namenode (the manager) and a number of datanodes (the workers).

The namenode is responsible for managing the file system namespace. It maintains the file system tree and the metadata for all files and directories in the tree; this information is stored in two files, the namespace image file and the edit log. The namenode also records the datanodes on which each block of each file is located. Datanodes are the workhorses of the file system: they store and retrieve blocks (as scheduled by clients or the namenode) and periodically send the namenode the list of blocks they are storing.

Without the namenode the file system cannot be used, because there is no way to reconstruct files from the blocks on the datanodes; namenode fault tolerance is therefore very important. Hadoop provides two mechanisms for this.

The first mechanism is to back up the files that make up the persistent state of the file system metadata. Typically, the persistent state is written to the local disk and to a remotely mounted NFS share at the same time.

The second is to run a secondary namenode, which periodically merges the namespace image with the edit log and keeps a copy of the merged namespace image locally, to be used if the namenode fails. But when the primary node fails, some data loss is almost inevitable; in that case the metadata stored on NFS can be copied to the secondary namenode, which is then run as the new namenode. This involves a failover mechanism, which we will analyze a little later.

3. The file system namespace

HDFS supports a traditional hierarchical file organization. A user or application can create directories and store files inside them.

The namespace hierarchy is similar to most existing file systems: users can create, delete, move, or rename files. HDFS supports user disk quotas and access control; it does not currently support hard links or soft links, but the HDFS architecture does not preclude implementing these features.

The namenode maintains the file system namespace; any change to the namespace or its properties is recorded by the namenode. An application can specify the number of replicas to keep for a file stored in HDFS. The number of copies of a file is called its replication factor, and this information is also stored by the namenode.

4. Data replication

HDFS is designed to reliably store very large files across machines in a large cluster. Each file is stored as a sequence of blocks; all blocks of a file are the same size except the last one.

For fault tolerance, the blocks of a file are replicated. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set at file creation time and changed later.

HDFS files are write-once, and strictly have only one writer at any time.

The namenode makes all decisions regarding block replication. It periodically receives a heartbeat and a block status report (Blockreport) from each datanode in the cluster. When a datanode starts up, it scans its local file system, generates a list of all the HDFS blocks it holds, and sends this report to the namenode: this is the block status report. Receipt of a heartbeat implies that the datanode is working properly; the block status report contains a list of all blocks on that datanode.

To get the block list and check block health: hdfs fsck / -files -blocks, or simply hdfs fsck /


HDFS stores blocks in files whose names are prefixed with blk_. Each block has an associated metadata file with a .meta suffix, consisting of a header and a series of checksums for sections of the block.

When the number of blocks reaches a certain scale, the datanode creates subdirectories to hold new blocks and their metadata. Once the current directory holds 64 blocks (set by the dfs.datanode.numblocks property), a new subdirectory is created; the goal of this design is a tree with high fan-out.
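A simplified Python sketch of the idea (not the real DataNode code; the subdirectory naming is invented for illustration):

```python
# Place block ids into directories, opening a fresh subdirectory each time a
# directory fills up with 64 blocks, yielding a high fan-out layout.
MAX_BLOCKS_PER_DIR = 64  # mirrors the dfs.datanode.numblocks setting

def place_blocks(n_blocks, max_per_dir=MAX_BLOCKS_PER_DIR):
    """Return a mapping of directory name -> list of block ids stored there."""
    layout = {}
    for block_id in range(n_blocks):
        dir_name = "subdir%d" % (block_id // max_per_dir)
        layout.setdefault(dir_name, []).append(block_id)
    return layout

# 130 blocks need three directories: 64 + 64 + 2.
print(sorted((d, len(b)) for d, b in place_blocks(130).items()))
```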

If the dfs.datanode.data.dir property specifies a list of directories on different disks, blocks are written to each directory in a round-robin fashion.

A block scanner runs on each datanode, periodically verifying all the blocks stored on the node so that bad blocks can be detected and repaired before clients read them. By default every block is verified every three weeks, and blocks found to be in a suspect state are scheduled for repair.

Users can obtain a datanode's block verification report at http://datanode:50070/blockScannerReport.

Replica placement

Replica placement is key to HDFS reliability and performance. An optimized replica placement strategy is an important characteristic distinguishing HDFS from most other distributed file systems. The feature requires a lot of tuning and accumulated experience. HDFS employs a strategy called rack awareness (rack-aware) to improve data reliability, availability, and utilization of network bandwidth. The current replica placement implementation is only a first step in this direction.

Through a rack-awareness process, the namenode can determine the rack id each datanode belongs to. A simple but unoptimized strategy is to store replicas on different racks. This effectively prevents losing data when an entire rack fails and allows reads to make full use of the bandwidth of multiple racks. This strategy distributes replicas evenly across the cluster, which helps load balancing when a component fails. However, because a write must transfer the block to multiple racks, it increases the cost of writes.

In most cases the replication factor is 3, and the HDFS placement strategy is to put one replica on the local node, another on a different node in the same rack, and the last on a node in a different rack. This strategy reduces inter-rack data transfer and thus improves write efficiency.
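A toy model of this 3-replica layout (not Hadoop's actual BlockPlacementPolicy, which randomizes its choices; here the first eligible candidate is picked deterministically for clarity):

```python
# Choose targets for 3 replicas per the strategy described above:
# writer's local node, a different node in the same rack, a node in
# another rack.
def place_replicas(writer, topology):
    """writer: (rack, node). topology: dict of rack -> list of nodes."""
    local_rack, local_node = writer
    targets = [local_node]  # first replica: the writer's own node
    # second replica: a different node in the writer's rack
    same_rack = [n for n in topology[local_rack] if n != local_node]
    targets.append(same_rack[0])
    # third replica: any node in a different rack
    other_racks = [r for r in topology if r != local_rack]
    targets.append(topology[other_racks[0]][0])
    return targets

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(("rack1", "n1"), topology))  # ['n1', 'n2', 'n3']
```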

In practice, in Hadoop 2.0, a datanode can use one of two policies to choose which disk stores a data replica:

The first follows the Hadoop 1.0 disk-directory polling (round-robin); implementation class:

RoundRobinVolumeChoosingPolicy.java

The second selects a disk with enough available space; implementation class: AvailableSpaceVolumeChoosingPolicy.java

The configuration item corresponding to the second policy is:

dfs.datanode.fsdataset.volume.choosing.policy, set in hdfs-site.xml to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy

If nothing is configured, the first policy, round-robin disk selection, is used by default. Round-robin guarantees that every disk gets used, but it often leads to uneven storage across disks: some disks fill up while others still have plenty of unused space. In a Hadoop 2.0 cluster it is therefore best to configure the second policy, which chooses a disk by remaining space; this still ensures every disk is used, while also keeping usage balanced across all disks.

Two further parameters come into play when the second policy is used:

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold

The default value is 10737418240, i.e. 10 GB, and the default is usually fine. The official explanation: first compute two values, the maximum available space across all disks and the minimum available space across all disks; if the difference between them is below the threshold specified by this configuration item, the round-robin policy is used to choose the disk that stores the replica.

dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction

The default value is 0.75f, and the default is usually fine. The official explanation: the fraction of replicas that should be stored on disks with plenty of remaining space. The value ranges over 0.0-1.0, typically 0.5-1.0. If it is set too low, disks with plenty of free space end up with too few replicas while disks short on space have to store more, leaving disk usage unbalanced.
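The combined behaviour of these two parameters can be sketched as follows (a simplification of AvailableSpaceVolumeChoosingPolicy.java; the real class partitions volumes by "has enough space" rather than by the midpoint used here):

```python
# If the gap between the most-free and least-free volume is under the
# threshold, fall back to round-robin; otherwise send roughly the
# preference fraction (default 0.75) of replicas to the roomier volumes.
import random

THRESHOLD = 10737418240     # 10 GB, the default balanced-space threshold
PREFERENCE_FRACTION = 0.75  # the default balanced-space preference fraction

def choose_volume(free_space, rr_state=[0]):
    """free_space: dict of volume name -> free bytes. Returns a volume name."""
    volumes = list(free_space)
    gap = max(free_space.values()) - min(free_space.values())
    if gap < THRESHOLD:
        # Volumes are roughly balanced: plain round-robin.
        v = volumes[rr_state[0] % len(volumes)]
        rr_state[0] += 1
        return v
    # Imbalanced: prefer high-free-space volumes most of the time.
    midpoint = (max(free_space.values()) + min(free_space.values())) / 2
    high = [v for v in volumes if free_space[v] >= midpoint]
    low = [v for v in volumes if free_space[v] < midpoint]
    pool = high if random.random() < PREFERENCE_FRACTION else low
    return random.choice(pool)

print(choose_volume({"disk1": 1 << 30, "disk2": 1 << 30}))  # balanced: round-robin
```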

Replica selection

To reduce overall bandwidth consumption and read latency, HDFS tries to have a reader read the replica closest to it. If there is a replica on the same rack as the reader, that replica is read. If an HDFS cluster spans multiple data centers, the client likewise reads a replica in the local data center first.

Safe mode

After startup, the namenode enters a special state called safe mode. A namenode in safe mode does not replicate blocks. It receives heartbeats and block status reports from all the datanodes; a block status report lists all the blocks on a given datanode. Every block has a specified minimum number of replicas.

A block is considered safely replicated when the namenode has confirmed that its replica count has reached this minimum (default 1, set by the dfs.namenode.replication.min property). Once a configurable percentage of blocks (default 0.999f, set by the dfs.safemode.threshold.pct property) has been confirmed safe, plus an additional 30-second wait, the namenode exits safe mode. It then determines which blocks are still below their target replica count and replicates those blocks to other datanodes.
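The exit rule above amounts to a simple fraction check, sketched here in Python (an illustration; the real namenode also handles the 30-second extension and live block tracking):

```python
# Leave safe mode once the fraction of blocks that reached the minimum
# replica count (dfs.namenode.replication.min) meets the threshold
# (dfs.safemode.threshold.pct).
MIN_REPLICAS = 1       # dfs.namenode.replication.min default
THRESHOLD_PCT = 0.999  # dfs.safemode.threshold.pct default

def can_leave_safemode(block_replica_counts):
    """block_replica_counts: one reported replica count per block."""
    if not block_replica_counts:
        return True
    safe = sum(1 for c in block_replica_counts if c >= MIN_REPLICAS)
    return safe / len(block_replica_counts) >= THRESHOLD_PCT

print(can_leave_safemode([1] * 999 + [0]))     # 999/1000 = 0.999 -> True
print(can_leave_safemode([1] * 998 + [0, 0]))  # 998/1000 = 0.998 -> False
```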

If the proportion of blocks missing from datanodes is high enough, the namenode stays in safe mode, i.e. read-only mode, indefinitely.

What should you do when the namenode is stuck in safe mode?

Find and fix the underlying problem (for example, repair the datanodes that went down).

Or force it out of safe mode manually (without actually solving the problem): hdfs dfsadmin -safemode leave.

During a normal cold start of an HDFS cluster, the namenode also stays in safe mode for a considerable time; there is no need to intervene, just wait for it to exit safe mode on its own.

Users can operate safe mode with hdfs dfsadmin -safemode <value>, where value is one of:

enter - enter safe mode

leave - force the NameNode out of safe mode

get - report whether safe mode is on

wait - block until safe mode has been exited before running the following command

5. Persistence of file system metadata

The namenode stores the HDFS namespace. Every operation that modifies file system metadata is recorded by the namenode in a transaction log called the EditLog. For example, creating a file in HDFS causes the namenode to insert a record into the EditLog; likewise, changing a file's replication factor inserts a record. The namenode stores the EditLog in the local operating system's file system.

The entire file system namespace, including the mapping of blocks to files and file properties, is stored in a file called the FsImage, which also resides in the namenode's local file system.

The namenode keeps an image of the entire namespace and the file-to-block map (Blockmap) in memory (i.e. the FsImage). This key metadata structure is designed to be compact, so a namenode with 4 GB of RAM is plenty to support a huge number of files and directories.

When the namenode starts up, or when a checkpoint is triggered by a configured threshold, it reads the EditLog and FsImage from disk, applies all transactions in the EditLog to the in-memory FsImage, saves this new version of the FsImage from memory to local disk, and then discards the old EditLog, since its transactions have now been applied to the FsImage. This process is called a checkpoint.

hdfs dfsadmin -fetchImage fsimage.backup

// manually fetch the latest fsimage file from the namenode and save it as a local file.

Because the edit log grows without bound, replaying it on recovery can take a long time. The solution is to run a secondary namenode, which creates checkpoints of the primary namenode's in-memory file system metadata; in the end the primary holds an up-to-date fsimage file and a smaller edits file.

This also explains why the secondary namenode has memory requirements close to the primary's: the secondary must load the fsimage file into memory as well.

Checkpoint creation is triggered by two configuration parameters:

the dfs.namenode.checkpoint.period property (the secondary namenode creates a checkpoint every so many seconds), and dfs.namenode.checkpoint.txns (a checkpoint is created once the edit log has accumulated that many transactions since the last checkpoint).
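The trigger condition is a simple either/or check, sketched below; the default values shown (3600 seconds, one million transactions) are assumptions based on the usual hdfs-default.xml defaults:

```python
# Checkpoint when either the time period or the transaction count since
# the last checkpoint has been exceeded.
CHECKPOINT_PERIOD = 3600   # dfs.namenode.checkpoint.period, in seconds
CHECKPOINT_TXNS = 1000000  # dfs.namenode.checkpoint.txns

def should_checkpoint(seconds_since_last, txns_since_last):
    """Return True if either trigger condition has been met."""
    return (seconds_since_last >= CHECKPOINT_PERIOD
            or txns_since_last >= CHECKPOINT_TXNS)

print(should_checkpoint(7200, 0))   # True: the period has elapsed
print(should_checkpoint(60, 100))   # False: neither threshold reached
```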

If the primary namenode fails (and there is no backup), data can be recovered from the secondary namenode. There are two ways to do this.

Method one: copy the relevant storage directories to the new namenode.

Method two: start the namenode daemon with the -importCheckpoint option, so that the secondary namenode is used as the new primary; the precondition is that the directory defined by the dfs.namenode.name.dir property contains no metadata.

6. Communication protocols

All HDFS communication protocols are layered on top of TCP/IP. A client connects to the namenode on a configurable TCP port and interacts with it using the ClientProtocol; datanodes interact with the namenode using the DatanodeProtocol.

A remote procedure call (RPC) abstraction wraps both the ClientProtocol and the DatanodeProtocol. By design, the namenode never initiates an RPC; it only responds to RPC requests issued by clients or datanodes.

7. Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures.

The three common failure types are: namenode failure, datanode failure, and network partitions.

Heartbeat detection, disk data errors, and re-replication

Each datanode sends a heartbeat to the namenode periodically. A network partition can cause a subset of datanodes to lose contact with the namenode. The namenode detects this condition by the absence of heartbeats, marks datanodes that have recently stopped sending them as dead, and stops forwarding new IO requests to them. Any data stored on a dead datanode is no longer available.

The death of datanodes may cause the replication factor of some blocks to fall below their specified value. The namenode constantly tracks which blocks need to be re-replicated and initiates replication whenever necessary.

Set a suitable datanode heartbeat timeout to avoid replication storms caused by flapping datanodes.

Re-replication may also be needed in the following cases: a datanode becomes unavailable, a replica is corrupted, a hard disk on a datanode fails, or the replication factor of a file is increased.

Cluster rebalancing (datanodes)

The HDFS architecture is compatible with data rebalancing schemes. If the free space on a datanode falls below a certain threshold, such a scheme would automatically move data from that datanode to other datanodes with free space.

If demand for a particular file suddenly grows, a scheme might also dynamically create additional replicas of that file while rebalancing other data in the cluster. These rebalancing schemes are not yet implemented.

Data integrity (datanodes)

A block fetched from a datanode may arrive corrupted, whether due to faults in the datanode's storage device, network errors, or software bugs. The HDFS client software implements checksum verification of HDFS file contents.

When a client creates an HDFS file, it computes a checksum of each block of the file and stores the checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each datanode matches the checksum stored in the corresponding checksum file; if not, the client can opt to fetch that block's replica from another datanode.
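A toy illustration of the write-time/read-time checksum round trip; here zlib.crc32 stands in for Hadoop's CRC, and the 512-byte chunk size mirrors the classic per-checksum chunk default (an assumption, not taken from this article):

```python
# Compute one checksum per fixed-size chunk at write time, then re-check
# each chunk on read; a mismatch means the replica is corrupt and the
# client should read from another datanode.
import zlib

CHUNK = 512  # bytes covered by each checksum

def checksums(data, chunk=CHUNK):
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify(data, expected, chunk=CHUNK):
    """True if the data matches the checksums recorded at write time."""
    return checksums(data, chunk) == expected

block = b"x" * 1300
stored = checksums(block)               # written alongside the block
print(verify(block, stored))            # True
print(verify(b"x" * 1299 + b"y", stored))  # False: read another replica
```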

Metadata disk failure (namenode failure)

The FsImage and EditLog are central data structures of HDFS; if they are corrupted, the whole HDFS instance becomes unusable. For this reason the namenode can be configured to maintain multiple copies of the FsImage and EditLog; any update to either is synchronously applied to every copy. This synchronous replication may reduce the rate of namespace transactions the namenode can process per second. However, the cost is acceptable, because HDFS applications, however data-intensive, are not metadata-intensive. When the namenode restarts, it selects the most recent consistent FsImage and EditLog to use.

Another option for stronger failure resilience is to run multiple namenodes (HA) using shared storage over NFS or a distributed edit log (also called a journal).

In the HDFS HA implementation, a pair of namenodes is configured as active-standby; when the active namenode fails, the standby takes over its duties and continues serving client requests.

Implementing HA requires the following architectural changes:

The namenodes share the edit log via highly available shared storage. When the standby takes over, it reads the shared edit log through to the end to synchronize its state with the active namenode, and then continues reading new entries as the active namenode writes them.

Datanodes must send block reports to both namenodes, because the block map lives in a namenode's memory, not on disk.

Clients must handle namenode failover via a dedicated mechanism, one that is transparent to users.

The role of the secondary namenode is subsumed by the standby namenode, which takes periodic checkpoints of the active namenode's namespace.

Snapshots

Snapshots support storing a copy of the data as it existed at a particular instant in time. Using snapshots, HDFS can be rolled back to a previously known-good point in time after data corruption.

8. Data organization

Data blocks

HDFS is designed to support very large files; the applications suited to HDFS are those that process large data sets. These applications write their data once but read it one or more times, and require reads to proceed at streaming speed. HDFS supports write-once-read-many semantics on files. A typical block size is 128 MB; thus an HDFS file is chopped into 128 MB chunks, with each chunk residing on a different datanode where possible.

Replication pipelining

When a client writes data to an HDFS file, the data is first written to a local temporary file. Suppose the file's replication factor is 3: when the local temporary file accumulates a full block of data, the client retrieves a list of datanodes from the namenode to hold the replicas. The client then starts transferring the block to the first datanode, which receives the data in small portions (4 KB), writes each portion to its local repository, and simultaneously transfers that portion to the second datanode in the list. The second datanode does the same, receiving small portions, writing them locally, and forwarding them to the third datanode. Finally, the third datanode receives the data and stores it locally.

Thus a datanode can receive data from the previous node in the pipeline while simultaneously forwarding it to the next; the data is replicated from one datanode to the next in pipeline fashion.
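A toy simulation of this pipeline (an illustration only; real datanodes stream packets concurrently over the network, which this sequential loop merely imitates):

```python
# Stream a block in 4 KB packets: each node in the chain appends the
# packet to its local store and forwards it downstream.
PACKET = 4 * 1024  # 4 KB packets

def pipeline_write(data, datanodes):
    """datanodes: list of node names. Returns one replica per node."""
    stores = [bytearray() for _ in datanodes]
    for i in range(0, len(data), PACKET):
        packet = data[i:i + PACKET]
        # each node writes the packet locally and passes it to the next
        for store in stores:
            store.extend(packet)
    return [bytes(s) for s in stores]

block = b"a" * 10000
replicas = pipeline_write(block, ["dn1", "dn2", "dn3"])
print(all(r == block for r in replicas))  # True: all three replicas identical
```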

9. Accessibility

HDFS provides applications with several access methods: a Java API, a C-language wrapper around that API, and a browser interface for viewing files in HDFS. Access via the WebDAV protocol is under development.

DFSShell

HDFS organizes user data as files and directories. It provides a command-line interface (DFSShell) for users to interact with data in HDFS. The command syntax is similar to other shells users already know (e.g. bash, csh). Some example actions/commands:

Create a directory /foodir: bin/hadoop dfs -mkdir /foodir

View the contents of a file /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt

DFSAdmin

The DFSAdmin command set is used to administer an HDFS cluster; these commands can only be used by HDFS administrators. Some example actions/commands:

Put the cluster in safe mode: bin/hdfs dfsadmin -safemode enter

Generate a report on the datanodes: bin/hdfs dfsadmin -report

Recommission or decommission datanodes: bin/hdfs dfsadmin -refreshNodes

Browser interface

A typical HDFS installation runs a web server on a configurable TCP port to expose the HDFS namespace, allowing users to browse the namespace and view file contents with a web browser.

http://ip:50070

10. Storage space reclamation

File deletion and undeletion

When trash is enabled, a file deleted via the fs shell is not immediately removed from HDFS; instead, HDFS renames it into the /user/<username>/.Trash directory. As long as the file remains in .Trash it can be restored quickly. The time a file remains in trash is configurable; once that time expires, the namenode deletes the file from the namespace, which causes the blocks associated with the file to be freed. Note that there can be an appreciable delay between a user deleting a file and the corresponding increase in free space in HDFS.
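The lifecycle can be sketched as a rename followed by a delayed expunge (a simplification; the trash path and the fs.trash.interval value in minutes are assumptions, not taken from this article):

```python
# Deletion renames the file into the user's .Trash directory; only after
# the retention interval does it actually leave the namespace.
TRASH_INTERVAL = 360  # minutes, e.g. fs.trash.interval

def delete(namespace, path, now):
    """Move path into trash instead of removing it immediately."""
    trash_path = "/user/me/.Trash/Current" + path
    namespace[trash_path] = (namespace.pop(path), now)
    return trash_path

def expunge(namespace, now):
    """Permanently drop trash entries older than the retention interval."""
    for p, (_, deleted_at) in list(namespace.items()):
        if p.startswith("/user/me/.Trash/") and now - deleted_at >= TRASH_INTERVAL:
            del namespace[p]

ns = {"/data/file.txt": "contents"}
delete(ns, "/data/file.txt", now=0)
expunge(ns, now=10)    # still within the interval: file recoverable
print(len(ns))         # 1
expunge(ns, now=400)   # past the interval: gone, blocks can be freed
print(len(ns))         # 0
```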

As long as the deleted file is still in the .Trash directory, the user can undelete it: to recover a deleted file, he/she simply browses the .Trash directory and retrieves it.

Decreasing the replication factor

When the replication factor of a file is reduced, the namenode selects excess replicas to delete and passes this information to the datanodes on the next heartbeat. The datanodes then remove the corresponding blocks, and the free space in the cluster increases. Once again, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.

[This article is user-generated content; when reproducing it, the source, article link, author, and other basic information must be retained.]


Origin blog.51cto.com/14463231/2422569