One, HDFS design ideas
1) Ideas
- Split the data into segments (blocks) and store multiple copies of each segment;
2) What goes wrong if a file is stored in multiple copies but is not split into segments?
Shortcomings:
- However large a file is, it is stored whole on a single node, so it is hard to process in parallel and that node can become a network bottleneck; big data is difficult to handle;
- Storage load is hard to balance, and the utilization of each node is low;
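The load-balancing point can be illustrated with a toy comparison. This is only a sketch, not how HDFS actually places data: it assumes three nodes, a simple round-robin placement, and a 128 MB block size, and the names `chunk_sizes` and `node_loads` are invented for the example.

```python
def chunk_sizes(file_size_mb, block_size_mb=128):
    """Sizes (in MB) of the pieces one file is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def node_loads(pieces, node_count):
    """Place pieces round-robin (an assumption) and return the load per node."""
    loads = [0] * node_count
    for i, size in enumerate(pieces):
        loads[i % node_count] += size
    return loads


file_sizes_mb = [1000, 10, 10]  # one large file and two small ones

# Without segmentation: each file lands whole on one node.
whole_file_loads = node_loads(file_sizes_mb, 3)

# With segmentation: every file is cut into blocks first.
blocks = [b for size in file_sizes_mb for b in chunk_sizes(size)]
block_loads = node_loads(blocks, 3)

print(whole_file_loads)  # [1000, 10, 10]
print(block_loads)       # [394, 360, 266]
```

With whole files, one node holds almost all of the data; with blocks, the same 1020 MB is spread far more evenly, and the large file can be read from several nodes at once.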
Two, HDFS design goals
- Hadoop Distributed File System (HDFS): its design derives from Google's GFS paper;
Design goals:
- Distributed storage: scale out horizontally by adding nodes when needed;
- Run on inexpensive commodity hardware;
- Be easy to expand and give users a file storage service with good performance (the failure of inexpensive hardware should not cause serious loss for the user);
Three, HDFS architecture
- An HDFS cluster usually consists of one NameNode (NN) and multiple DataNodes (DN); the NameNode and the DataNodes are generally deployed on different nodes;
NameNode:
- Manages the file system namespace and client access to files;
Features:
- Responds to client requests;
- Manages the metadata (file names, replication factor, and the DataNodes that store each Block);
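The kind of metadata listed above can be pictured as a small in-memory structure. This is a simplified sketch for illustration only, not the NameNode's real data model; the class names, the path, the block IDs, and the DataNode names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    block_id: str
    datanodes: list  # names of the DataNodes holding a replica of this block


@dataclass
class FileMeta:
    name: str
    replication: int  # replication factor of the file
    blocks: list = field(default_factory=list)


# Hypothetical entry: one file stored as two blocks, three replicas each.
namespace = {
    "/logs/app.log": FileMeta(
        name="/logs/app.log",
        replication=3,
        blocks=[
            BlockInfo("blk_0", ["dn1", "dn4", "dn5"]),
            BlockInfo("blk_1", ["dn2", "dn4", "dn6"]),
        ],
    )
}


def locate_blocks(path):
    """Answer a client request: which DataNodes hold each block of `path`?"""
    meta = namespace[path]
    return [(b.block_id, b.datanodes) for b in meta.blocks]


print(locate_blocks("/logs/app.log"))
```

The client only asks the NameNode where the blocks are; in this picture the block contents themselves never pass through it.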
DataNodes:
- Handle Block operations; generally each node runs one DataNode (a cluster with several nodes has correspondingly several DataNodes; a node can run multiple DataNodes, but this is rarely done, so use one); each DataNode manages the storage of the node it runs on;
Features:
- Store the data blocks (Blocks) that make up user files;
- Periodically send the NameNode a report of all their blocks, along with their own health status;
- Files are sliced into Blocks according to the blocksize (with blocksize = 128 MB, a 130 MB file ==> 128 MB + 2 MB);
- File operations on the namespace, such as opening, closing, and renaming files or directories, are executed by the NameNode;
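The slicing rule (130M ==> 128M + 2M) can be written out as a few lines of arithmetic. This is a minimal sketch of the rule only; the function name `block_ranges` is made up here and is not part of any HDFS API.

```python
MB = 1024 * 1024


def block_ranges(file_size, block_size=128 * MB):
    """Return (offset, length) pairs, in bytes, that cover the whole file."""
    ranges = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


ranges = block_ranges(130 * MB)
print([(off // MB, length // MB) for off, length in ranges])  # [(0, 128), (128, 2)]
```

A file that is an exact multiple of the blocksize produces no short trailing block.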
Four, HDFS replication mechanism
- HDFS supports a hierarchical file organization (folders can contain other folders);
- Any change to the file system namespace is recorded by the NameNode;
- All Blocks of a file are the same size (the blocksize), except the last Block;
Five, HDFS replica storage strategy
- By default 3 copies are stored, for fault tolerance and data safety;
- The first copy is stored, by default, on the current node (the node the write is issued from);
- The second copy is stored on a node in a different rack from the one where the current node is located;
- The third copy is stored on a different node in the same rack as the second copy;
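The three-copy rule can be sketched as a small selection function. This is an illustration under assumptions, not HDFS's actual placement code: it reads the second rule as "a different rack from the writer", picks the first eligible node deterministically (a real cluster would not), does no error handling for single-rack clusters, and uses a hypothetical `topology` dictionary with invented node names.

```python
def choose_replica_nodes(writer, topology):
    """Pick three nodes for one block.

    `topology` maps rack name -> list of node names (a hypothetical layout).
    """
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    local_rack = node_rack[writer]

    # First copy: the node the write is issued from.
    first = writer
    # Second copy: a node on a rack other than the writer's rack.
    remote_rack = next(r for r in topology if r != local_rack)
    second = topology[remote_rack][0]
    # Third copy: a different node on the same rack as the second copy.
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(choose_replica_nodes("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```

Placing two copies on one remote rack keeps a full copy available if the writer's rack fails, while limiting cross-rack traffic to a single transfer.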