9, Hadoop: HDFS Overview

1, HDFS background and definition

Background

As the amount of data grows, a single operating system can no longer hold all of it, so the data has to be spread across disks managed by more operating systems.

That is inconvenient to manage and maintain, so a system that manages files across multiple machines is urgently needed. Such a system is a distributed file management system.

HDFS is just one kind of distributed file management system.

 

HDFS definition

HDFS (Hadoop Distributed File System) is, first, a file system: it stores files and locates them through a directory tree.

Second, it is distributed: many servers work together to provide its functionality, and each server in the cluster has its own role.

 

HDFS usage scenarios:

HDFS suits write-once, read-many scenarios and does not support modifying files. It is a good fit for data analysis and a poor fit for network-drive applications.

 

2, Advantages and disadvantages

Advantages

1, High fault tolerance

 

2, Suitable for large data

Data scale: can handle data at the GB, TB, or even PB level
File scale: can handle more than a million files, which is a considerable number

 

3, Can be built on low-cost machines; reliability is improved through a multi-replica mechanism

 

Disadvantages

1, Not suitable for low-latency data access; for example, millisecond-level data access cannot be achieved

2, Cannot efficiently store a large number of small files

Storing a large number of small files takes up a lot of NameNode memory for the file directory and block information. This is undesirable because NameNode memory is limited.

For small files, the addressing (seek) time exceeds the read time, which violates the design goals of HDFS.

3, Does not support concurrent writes or random file modification
A file can have only one writer at a time; multiple threads cannot write to it simultaneously
Only data append is supported; random modification of files is not
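
To make the small-file limitation concrete, the sketch below estimates NameNode memory usage for the same amount of data stored as small files versus large files. The figure of roughly 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb and an assumption here, not a number from this article, and `estimate_namenode_memory` is a hypothetical helper. Treat the result as an order-of-magnitude estimate only.

```python
BYTES_PER_OBJECT = 150          # assumption: rough rule of thumb, not an exact figure
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def estimate_namenode_memory(total_bytes, file_size):
    """Estimate NameNode memory (bytes) for total_bytes stored as equal-size files."""
    num_files = ceil_div(total_bytes, file_size)
    blocks_per_file = max(1, ceil_div(file_size, BLOCK_SIZE))
    # One namespace object per file plus one per block.
    num_objects = num_files * (1 + blocks_per_file)
    return num_objects * BYTES_PER_OBJECT


one_tb = 1024 ** 4
small = estimate_namenode_memory(one_tb, 1024 ** 2)  # 1 TB stored as 1 MB files
large = estimate_namenode_memory(one_tb, 1024 ** 3)  # 1 TB stored as 1 GB files
print(small, large)
```

Under these assumptions, the same terabyte costs the NameNode a few hundred megabytes of memory as 1 MB files and only about a megabyte as 1 GB files, which is why many small files are a poor fit.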

 

3, HDFS architecture components

 

 

 

1, NameNode (nn): the Master; it acts as the supervisor and manager
(1), Manages the HDFS namespace
(2), Configures the replica policy
(3), Manages the data block (Block) mapping information
(4), Handles client read and write requests

2, DataNode: the Slave. The NameNode issues commands, and the DataNode performs the actual operations
(1), Stores the actual data blocks
(2), Performs read / write operations on data blocks

3, Client: the client
(1), File splitting.
When a file is uploaded to HDFS, the Client splits the file into Blocks and then uploads them
(2), Interacts with the NameNode to obtain the location information of files
(3), Interacts with DataNodes to read or write data
(4), Provides commands to manage HDFS, such as formatting the NameNode
(5), Provides commands to access HDFS, such as create, delete, query, and modify operations

4, SecondaryNameNode: not a hot standby for the NameNode. When the NameNode goes down,
it cannot immediately replace the NameNode and provide service
(1), Assists the NameNode and shares its workload, for example by periodically merging the Fsimage and Edits and pushing the result to the NameNode
(2), In an emergency, can assist in recovering the NameNode
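
The file splitting step listed for the Client can be sketched as follows. This only illustrates how a file length maps onto block boundaries; `split_into_blocks` is a hypothetical helper, not part of the HDFS client API, and the real client does considerably more than compute ranges.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size


def split_into_blocks(file_length, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_length bytes."""
    blocks = []
    offset = 0
    while offset < file_length:
        # The last block holds only the remaining bytes.
        length = min(block_size, file_length - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
for index, (offset, length) in enumerate(blocks):
    print(index, offset, length)
```

With the default block size, a 300 MB file maps to two full 128 MB blocks followed by one 44 MB block.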

 

4, HDFS file block size

Files in HDFS are physically stored in blocks (Block)
The block size can be specified by the configuration parameter (dfs.blocksize)
The default is 128M in Hadoop 2.x versions and 64M in older versions

 

 

 

 

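Why can the block size be set neither too small nor too large?
(1), If the HDFS block size is set too small, addressing time increases, and the program spends its time looking for the start of blocks
(2), If the block size is set too large, the time to transfer the data from disk becomes significantly greater than the time needed to locate the start of the block,
so the program is very slow when processing that block of data

Summary:
The HDFS block size setting depends mainly on the disk transfer rate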
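
The summary above can be illustrated with a back-of-the-envelope calculation. The numbers are assumptions chosen for illustration, not figures from this article: a seek time of about 10 ms, a target where seek time is about 1% of transfer time, and a disk transfer rate of 100 MB/s. Both function names are hypothetical.

```python
def ideal_block_size_mb(seek_time_s, seek_fraction, transfer_rate_mb_s):
    """Block size (MB) at which seek time is seek_fraction of transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_time_s * transfer_rate_mb_s


def next_power_of_two(value):
    """Smallest power of two that is greater than or equal to value."""
    power = 1
    while power < value:
        power *= 2
    return power


size_mb = ideal_block_size_mb(
    seek_time_s=0.010,       # assumption: about 10 ms to locate a block
    seek_fraction=0.01,      # assumption: seek should be about 1% of transfer time
    transfer_rate_mb_s=100,  # assumption: disk transfers about 100 MB/s
)
print(size_mb, next_power_of_two(size_mb))
```

Under these assumptions the balance point is about 100 MB, and rounding up to a power of two gives 128 MB. A faster disk shifts the balance point toward a larger block size.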

 

Origin www.cnblogs.com/Mrchengs/p/11316096.html