HDFS Overview (1)

1. Background and definition of HDFS

1.1 Background of HDFS

As the volume of data keeps growing, a single operating system can no longer hold all of it, so the data has to be spread across the disks managed by many operating systems. That, however, makes the data inconvenient to manage and maintain, so a system is urgently needed that can manage files across multiple machines: a distributed file management system. HDFS is just one such distributed file management system.

1.2 Definition of HDFS

HDFS (Hadoop Distributed File System) is, first of all, a file system: it stores files and locates them through a directory tree. Secondly, it is distributed: many servers work together to provide its functionality, and each server in the cluster plays its own role.

HDFS usage scenarios: it suits write-once, read-many workloads and does not support modifying files in place. It is a good fit for data analysis, but not for use as a network-disk (cloud drive) style application.
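To make the write-once, read-many model concrete, below is a minimal Java sketch using the Hadoop FileSystem API; the cluster address hdfs://namenode:9000 and the path /demo/events.log are placeholder assumptions. The file is written once in full and can then be read back any number of times; there is no API for overwriting bytes in the middle of it.

// Minimal sketch of HDFS's write-once / read-many usage with the Hadoop Java API.
// The cluster address and path below are placeholders, not values from this article.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/demo/events.log");

        // Write the file once; HDFS offers no call to later rewrite bytes in the middle.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first and only full write of this file\n");
        }

        // Read it back as many times as needed (e.g. by analysis jobs).
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}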

 

 

2. HDFS advantages and disadvantages

2.1 Advantages:

1) High fault tolerance

(1) Multiple replicas of the data are kept automatically; by adding extra copies of the data, fault tolerance is improved (see the sketch after this list for how the replica count is configured).


(2) When a replica is lost, it can be restored automatically.


2) Suitable for handling big data

(1) Data scale: able to handle data at the GB, TB, or even PB level;

3) Can be built on inexpensive commodity machines, with reliability improved through multiple replicas.
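As a small illustration of the replica mechanism behind points 1) and 3), the sketch below uses the standard Hadoop Java API with a placeholder path; it sets the default replica count through dfs.replication and raises the replication factor of one existing file, after which the NameNode schedules the additional copies in the background.

// Sketch: controlling the number of replicas that back HDFS fault tolerance.
// Assumes fs.defaultFS is configured (e.g. via core-site.xml on the classpath);
// the path and the factor of 5 are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default replica count for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Raise the replica count of an existing file to 5; the NameNode
        // will create the extra copies on other DataNodes in the background.
        fs.setReplication(new Path("/demo/important.dat"), (short) 5);
        fs.close();
    }
}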

2.2 Disadvantages:

1) Not suitable for low-latency data access;

2) Cannot store a large number of small files efficiently:

(1) Storing many small files consumes a large amount of NameNode memory, because the NameNode keeps every file's directory entry and block information in memory;

(2) With small files, the seek (addressing) time exceeds the read time, which violates HDFS's design goals.

3) Does not support concurrent writes or random file modification:

(1) A file can have only one writer at a time; multiple threads writing it simultaneously are not allowed;

(2) Only appending data is supported; random modification of files is not.
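A minimal sketch of point 3), assuming a Hadoop 2.x or later cluster where append is enabled, an already existing file, and a placeholder path: adding records at the end of a file is possible, but there is no call for rewriting bytes at an arbitrary offset.

// Sketch: HDFS allows appending to the end of a file but not random in-place edits.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/demo/events.log");   // assumed to exist already

        // Allowed: append new records at the end of the file.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("appended record\n");
        }
        // Not possible: there is no "write at offset X" call to overwrite
        // existing bytes in the middle of the file.
        fs.close();
    }
}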

 

3. HDFS architecture

3.1 The overall architecture is as follows:

[Figure: overall HDFS architecture diagram]

3.2 HDFS architecture in detail

1) NameNode (abbreviated NN): the master; it is the administrator and manager, responsible for managing HDFS:

(1) Manages the HDFS namespace;

(2) Manages the replica (copy) placement policy;

(3) Manages block (Block) mapping information;

(4) Handles read and write requests from clients.

2) DataNode (abbreviated DN): the slave; the NameNode issues commands, and the DataNode carries out the actual operations:

(1) Stores the actual data blocks;

(2) Performs read/write operations on data blocks.

3) Client: the client, i.e. the program that interacts with the NameNode; its responsibilities and functions are as follows:

(1) File splitting: when uploading a file to HDFS, the Client splits it into individual Blocks and uploads them (see the client sketch after this list);

(2) Interacts with the NameNode to obtain file location information (which nodes the blocks are stored on);

(3) The Client can access HDFS through commands, e.g. create, delete, update, and query operations;

(4) The Client can also manage HDFS through commands, e.g. formatting the NameNode.

4) SecondaryNameNode: not a hot standby for the NameNode. When the NameNode goes down, it does not immediately replace it and take over serving requests.

(1) Assists the NameNode and shares part of its workload, e.g. it periodically merges the FsImage and Edits files (covered later; no need to understand them yet) and pushes the merged FsImage.checkPoint to the NameNode;

(2) In an emergency, it can help recover the NameNode.
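To illustrate how the Client in point 3) works with the NameNode and DataNodes, here is a minimal Java sketch with placeholder paths, assuming the standard FileSystem API: the upload writes the file's blocks to DataNodes, and getFileBlockLocations asks the NameNode which DataNodes hold each block.

// Sketch: a client uploading a file and asking the NameNode where its blocks live.
// The local and HDFS paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Upload: the file is split into blocks and written to DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/local.dat"), new Path("/demo/local.dat"));

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(new Path("/demo/local.dat"));
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}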

 

4. HDFS file block size

1) Files in HDFS are physically stored in blocks (Block). The block size can be set with the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.

2) Choosing the block size: a reasonable rule of thumb is that the seek (addressing) time for a file should be about 1% of the block transfer time.

3) Think about it: why should the block size be neither too small nor too large?

(1) If the HDFS block size is set too small, seek time increases and the program spends a long time locating where the blocks are stored;

(2) If it is set too large, the time to transfer the data from disk becomes much longer than the time needed to locate the start of the block, so a great deal of time is wasted on I/O while processing the block's data.

Therefore, the block size should be chosen according to the data volume and the disk's I/O transfer rate, as in the sketch below.
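As a worked example of the 1% rule of thumb in 2), assume a seek time of about 10 ms and a disk transfer rate of about 100 MB/s (both figures are illustrative): keeping the seek at 1% of the transfer time means transferring for about one second, i.e. a block of roughly 100 MB, which is close to the 128 MB default. The sketch below repeats that arithmetic and shows one way to set a per-file block size through a FileSystem.create() overload; the path and replication factor are placeholders, and dfs.blocksize can also be set cluster-wide in hdfs-site.xml.

// Sketch: deriving a block size from the "seek time is ~1% of transfer time" rule
// of thumb, then applying a block size to a new file. Figures below are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        double seekTimeSec = 0.010;      // ~10 ms average seek time (assumed)
        double transferRate = 100e6;     // ~100 MB/s disk transfer rate (assumed)
        // Transfer time should be ~100x the seek time, so the ideal block holds
        // about (seekTime / 0.01) seconds' worth of data.
        double idealBlockBytes = transferRate * (seekTimeSec / 0.01);
        System.out.printf("ideal block size ~ %.0f MB%n", idealBlockBytes / 1e6);  // ~100 MB

        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 128L * 1024 * 1024;   // 128 MB, the Hadoop 2.x default
        // Per-file block size via the create() overload (buffer size and replication shown explicitly).
        try (FSDataOutputStream out =
                 fs.create(new Path("/demo/big.dat"), true, 4096, (short) 3, blockSize)) {
            out.writeBytes("data...\n");
        }
        fs.close();
    }
}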

 

 

 


Origin: www.cnblogs.com/simon-1024/p/11741184.html