1, HDFS background and definition
Background
As data volumes keep growing, a single operating system can no longer hold all of the data, so the data has to be spread across disks managed by more operating systems.
That is inconvenient to manage and maintain, so a system is urgently needed to manage files across multiple machines. This is what a distributed file system does.
HDFS is just one kind of distributed file system.
HDFS definition
HDFS (Hadoop Distributed File System) is, first, a file system: it stores files and locates them through a directory tree.
Second, it is distributed: many servers work together to provide its functionality, and each server in the cluster has its own role.
HDFS usage scenarios:
Suited to write-once, read-many scenarios; modifying files in place is not supported. This makes it a good fit for data analysis and a poor fit for network disk (cloud drive) applications.
2, Advantages and disadvantages
Advantages
1, High fault tolerance
2, Suited to large data
Data scale: can handle data at the GB, TB, or even PB level
File scale: can handle more than a million files, which is a very large number
3, Can be built on low-cost machines, with reliability improved through a multi-replica mechanism
Disadvantages
1, Not suitable for low-latency data access; for example, millisecond-level data access cannot be achieved
2, Cannot efficiently store a large number of small files
Storing a large number of small files consumes a lot of NameNode memory for the file directory and block information. This is undesirable because NameNode memory is limited
For small files, the addressing (seek) time exceeds the read time, which violates the design goals of HDFS
3, Does not support concurrent writes or random modification of files
A file can have only one writer at a time; multiple threads are not allowed to write to it simultaneously
Only appending data (append) is supported; random modification of a file is not
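To make the small-file problem concrete, here is a rough back-of-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (one per file, one per block); that figure is an assumption, not something stated in these notes, and real usage varies by version and configuration. The function and constant names are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150            # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024    # 128 MB, the Hadoop 2.x default block size


def namenode_bytes(num_files, file_size):
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


total = 1024 ** 4  # the same 1 TB of data in both cases
small = namenode_bytes(total // (1024 * 1024), 1024 * 1024)  # stored as 1 MB files
large = namenode_bytes(total // BLOCK_SIZE, BLOCK_SIZE)      # stored as 128 MB files
print(small, large, small // large)
```

Under these assumptions the same amount of data costs the NameNode 128 times more memory when it is stored as 1 MB files instead of 128 MB files.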
3, HDFS architecture
1, NameNode (nn): the Master; it acts as the supervisor and manager
(1), Manages the HDFS namespace
(2), Configures the replica policy
(3), Manages the data block (Block) mapping information
(4), Handles client read and write requests
2, DataNode: the Slave; the NameNode issues commands and the DataNode performs the actual operations
(1), Stores the actual data blocks
(2), Performs read/write operations on data blocks
3, Client: the client
(1), File splitting.
When a file is uploaded to HDFS, the Client splits it into Blocks and then uploads them
(2), Interacts with the NameNode to obtain file location information
(3), Interacts with DataNodes to read or write data
(4), Provides commands to manage HDFS, such as formatting the NameNode
(5), Provides commands to access HDFS, such as create, delete, query, and update operations
4, SecondaryNameNode: not a hot standby for the NameNode
When the NameNode goes down, it cannot immediately replace the NameNode and provide service
(1), Assists the NameNode and shares its workload, for example by periodically merging Fsimage and Edits and pushing the result to the NameNode
(2), In an emergency, can assist in recovering the NameNode
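The idea of merging Fsimage and Edits can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the Fsimage is modeled as a dictionary snapshot of the namespace and the Edits as a list of logged operations. The operation names and the function below are hypothetical and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edits):
    """Replay logged edits on top of a namespace snapshot and return the new snapshot."""
    merged = dict(fsimage)  # copy so the old snapshot stays untouched
    for op, path in edits:
        if op == "create":
            merged[path] = {"blocks": []}
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return merged


fsimage = {"/a.txt": {"blocks": []}}                   # last saved snapshot
edits = [("create", "/b.txt"), ("delete", "/a.txt")]   # changes logged since then
new_fsimage = checkpoint(fsimage, edits)
print(sorted(new_fsimage))
```

In this toy model the merged snapshot already contains every logged change, so a restart would only need to replay edits made after the checkpoint.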
4, HDFS file block size
Files in HDFS are physically stored in blocks (Block)
The block size can be specified with the configuration parameter dfs.blocksize
The default is 128M in Hadoop 2.x; in older versions it is 64M
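As a minimal sketch of what block-based storage means for a single file, the following computes how a file of a given size would be cut into blocks. The helper name is hypothetical, and the code only does the arithmetic; it does not talk to a real cluster.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default for dfs.blocksize


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
for offset, length in blocks:
    print(offset // MB, length // MB)
```

A 300 MB file ends up as two full 128 MB blocks and one 44 MB block; the final block only covers the data that remains.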
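Why can the block size be set neither too small nor too large?
(1), If the HDFS block size is too small, addressing time increases: the program spends its time looking for the start of blocks
(2), If the block size is too large, the time to transfer the data from disk becomes much greater than the time needed to locate the start of the block,
so the program is very slow when processing that block of data
Summary:
The HDFS block size setting depends mainly on the disk transfer rate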
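The link between block size and disk transfer rate can be worked through with a rough calculation. All three input numbers below are illustrative assumptions, not values from these notes: an addressing time of about 10 ms, a target of addressing time being about 1% of transfer time, and a disk that transfers about 100 MB per second. The function name is hypothetical.

```python
def suggested_block_mb(seek_time_ms, seek_percent, transfer_rate_mb_s):
    """Return (ideal size in MB, that size rounded up to a power of two)."""
    # Transfer time for which the seek time is seek_percent of it
    transfer_time_ms = seek_time_ms * 100 // seek_percent
    ideal_mb = transfer_rate_mb_s * transfer_time_ms // 1000
    block_mb = 1
    while block_mb < ideal_mb:  # round up to the next power of two
        block_mb *= 2
    return ideal_mb, block_mb


# Assumed: 10 ms seek, seek is 1% of transfer time, disk moves 100 MB/s
ideal_mb, block_mb = suggested_block_mb(10, 1, 100)
print(ideal_mb, block_mb)
```

With these assumed numbers the ideal block is about 100 MB, which rounds up to 128 MB. A faster disk would push the result higher, which is the sense in which the block size depends on the transfer rate.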