大数据笔记：HDFS读官方文档

标签：大数据

大数据笔记：HDFS读官方文档

介绍

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

主要说的是：HDFS具有高容错、高可靠性、高扩展性的特点，可以部署在低成本硬件上。

HDFS提供对应用程序数据的高吞吐量访问，适用于具有大型数据集的应用程序。

一般来讲，hadoop：

适用于大规模数据吞吐而不是境地数据访问的延时
流式访问数据集（写一次，读多次），不适用于频繁修改文件
HDFS设计更多用于批处理，而不是用户交互式使用。

NameNode and DataNodes

HDFS架构及其内部

HDFS has a master/slave architecture.

首先说HDFS架构是：1个master带n个slaves

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

一个HDFS集群存在：

一个single NameNode，master服务器管理着文件系统的namespace并调节客户端访问文件。
多个DataNode，通常是群集中的每个节点一个DataNode，用于管理连接到它们所运行的节点的存储。

HDFS exposes a file system namespace and allows user data to be stored in files.

HDFS暴露了一个文件系统的namespace，并允许数据存储到namespace的文件中的

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

但是在内部，一个文件是被拆分成多个Block，其大小为128M。这些块分散地存储在一些DataNode中（防止机器挂掉）。

A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

典型的部署是有一台只运行NameNode软件的机器。群集中的每台其他机器运行一个DataNode。
也不排除在同一台机器上运行多个DataNode，但在实际部署中很少出现这种情况。

NameNode和Data的作用

NameNode：

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

NameNode执行的是文件系统的namspace操作，比如打开、不安比、重命名文件和目录，也决定这DataNode中块的映射。
主要负责：
1. 客户端请求的响应
2. 元数据的管理，包括文件的名称、副本系数、Block存放的DataNode的管理

DataNode：

The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

DataNode负责提供来自文件系统客户端的读取和写入请求。
DataNode还根据来自NameNode的指令执行block的创建，删除和副本的一些操作。
主要负责：
1. 存储用户的文件对应的数据块
2. 定期向NameNode发送心跳信息，汇报本身及其所有的block信息，健康状况

命名空间和副本

命名空间

那么我们看看文件系统的命名空间是怎样的。

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.

HDFS支持传统的层级目录结构，用户或者程序可以在一个目录下创建目录或存放文件。

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

NameNode维持一个文件系统命名空间，任何文件系统命名空间或属性的改动，都会被NomeNode记录。一个应用程序可以指定HDFS维持的副本文件的数量。这个数量成为该文件的副本因子。这个信息是存在NameNode上的。

副本机制

我们知道了，副本因子就是HDFS维持的副本文件的数量，那么为什么有副本因子?

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

HDFS被设计成可以在大集群上跨机器可靠存储大量文件。它将每个文件存储为一系列的块。

All blocks in a file except the last block are the same size

文件中除最后一个块之外的所有块都具有相同的大小。

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.

应用程序可以指定文件中副本的数量，副本因子可以在文件创建时被指定，之后也可以被更改。文件在HDFS中只能被写一次，并且严格限定只写一次。

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

NameNode做出关于副本block的所有的决定。定期获取每个DataNode的心跳和Blockreport。回复心跳意味着DataNode在正常工作，Blockreport包含DtaNode中所有块的列表。

副本存放策略

如何为每个block副本选择存放的DataNode呢？

副本存放策略是：
1. 对于第一个副本，放在当前操作的客户端上
2. 第二个副本与第一个副本不再同一个机架上
3. 第三个副本放在与第一个副本相同机架的不同节点上

防止某一个机架挂掉。

Hadoop笔记：HDFS读官方文档