"Big Data Technology Principles and Applications" Xiamen University woods rain opened the second chapter notes HDFS distributed file system

HDFS (Hadoop Distributed File System) is mainly used for distributed file storage.

HDFS design goals:

  1. Compatibility with inexpensive hardware
  2. Streaming data reads and writes
  3. Support for large data sets
  4. A simple file model (write once, read many)
  5. Strong cross-platform compatibility

HDFS's own limitations:

  1. Not suited to low-latency data access; real-time performance is poor
  2. Cannot efficiently store large numbers of small files, because the index (metadata) structure grows very large
  3. No support for concurrent multi-user writes or arbitrary file modification

Related concepts:

  • Block: the core concept in all of HDFS. The default size is 64 MB (128 MB in Hadoop 2.x); it can be configured larger, but bigger is not always better (see the sketch after this list)
  • Enables large-scale file storage: files are cut into blocks of equal size
  • Simplifies system design and makes metadata easier to manage
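As a hedged illustration, here is a minimal sketch of setting the block size with the HDFS Java client (assuming a Hadoop 2.x client on the classpath; the path /demo/large.dat is made up for this example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for new files (Hadoop 2.x property name).
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB, as in the notes
        FileSystem fs = FileSystem.get(conf);

        // The block size can also be chosen per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/demo/large.dat"), // hypothetical path
                true,                        // overwrite if it exists
                4096,                        // I/O buffer size in bytes
                (short) 3,                   // replication factor
                128L * 1024 * 1024);         // 128 MB blocks for this file only
        out.writeBytes("hello HDFS");
        out.close();
        fs.close();
    }
}
```

A larger block means less metadata for the NameNode to hold per file, but a block that is too large limits how many tasks can process the file in parallel, which is why bigger is not always better.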

Metadata records: which files exist; how each file is divided into blocks; how each block maps back to its file; and which server each block is stored on.
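This mapping is visible through the client API. A minimal sketch, assuming the file /demo/large.dat from the previous example exists, that asks the NameNode which blocks a file has and which DataNodes hold them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMapDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/large.dat"));

        // Ask the NameNode for the file -> block -> DataNode mapping.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```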

For data backup, each block is stored redundantly on several different devices.

The two major components of HDFS:

1. NameNode -> the housekeeper of the entire HDFS cluster, equivalent to the data directory

  NameNode core structures:

    FsImage: saves the file system tree, maintaining for each file and directory:

  •       the replication level of the file
  •       the file's block size and which blocks compose it
  •       modification and access times
  •       access permissions

    EditLog: records operations such as creating, deleting, or renaming files

Each time the NameNode is started via the shell command, the FsImage and EditLog are merged to form the current metadata; a new FsImage is then written out and the EditLog is emptied. But as operations accumulate, the EditLog keeps growing, and at this point a Secondary NameNode is needed to deal with it.
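For illustration only: an administrator can also force such a merge (a checkpoint of the EditLog into a fresh FsImage) through the client API. This is a sketch assuming a Hadoop 2.x client and HDFS superuser privileges; it is not the Secondary NameNode's own mechanism, which is described next:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants;

public class CheckpointDemo {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster and we run as the superuser.
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        // saveNamespace() requires safe mode: enter it, merge the EditLog into
        // a new FsImage on disk, then leave safe mode again.
        dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_ENTER);
        dfs.saveNamespace();
        dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_LEAVE);
        dfs.close();
    }
}
```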

Problems the Secondary NameNode solves:

  • Acts as a cold backup of the NameNode
  • Processes the EditLog so it does not grow without bound

The Secondary NameNode communicates with the NameNode periodically: the NameNode stops using its EditLog file and hands it over, then creates a new edits.new file for subsequent reads and writes. The Secondary NameNode copies the FsImage and EditLog to its own local disk and merges them into a new FsImage, which is then sent back to the NameNode. This both provides a cold backup and solves the problem of the ever-growing EditLog.

2. DataNode -> stores the actual data, saved to disk in the node's local Linux file system


HDFS namespace:

Directories / files / blocks

HDFS 1.0 architectural limitations:

  • Namespace limitation: the NameNode keeps the namespace in memory, so the number of objects it can hold is limited by its memory size.
  • Performance bottleneck: the overall throughput of the distributed file system is limited by the throughput of the single NameNode.
  • Isolation problem: because the cluster has only one NameNode and therefore only one namespace, different applications cannot be isolated from each other.
  • Cluster availability: once the single NameNode fails, the entire cluster becomes unusable.

The Secondary NameNode is a cold backup, not a hot backup. With a hot backup, as soon as a fault occurs the standby takes over immediately and can be used directly; with a cold backup, after the primary fails the service must stop for a period of time and recover slowly before it can serve external requests again.

HDFS 1.0 therefore has a single-point-of-failure problem; HDFS 2.0 provides hot backup by setting up two NameNodes.

 

HDFS storage principles: {redundant data storage, data storage policy, data recovery}

  1. Redundant data storage: because the underlying hardware fails constantly, data is stored redundantly. The default replication factor is 3, i.e. each block is kept as 3 copies, and this can be customized per file (see the sketch after this list). Benefits: faster data transfer, easy detection of data errors, and guaranteed data reliability.
  2. Data storage policy: if the write is initiated from inside the cluster, the first replica is placed on the node where the data originates; if it is initiated from outside the cluster, a node whose disk is not full and whose CPU is not too busy is picked. The second replica is placed on a node in a different rack, and the third on another node in the same rack as the first.
  3. Data reading: an API is called to compute the rack ID the client belongs to, and the closest replica is selected; if no replica is found in the same rack, one is chosen at random to read from.
  4. Data recovery: NameNode failure (recover from the Secondary NameNode's backup); DataNode errors (DataNodes send periodic heartbeat messages; a node that stops responding is marked as down and its blocks are re-replicated to other machines); corruption of the data itself (checksums are used to verify whether a block has gone bad, and bad blocks are re-copied from the redundant replicas).
  5. Commonly used HDFS commands: e.g. hdfs dfs -ls, hdfs dfs -mkdir, hdfs dfs -put, hdfs dfs -get, hdfs dfs -cat, and hdfs dfs -rm; the equivalent Java API calls are sketched below.
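A minimal sketch of the same everyday operations through the Java FileSystem API (the NameNode address hdfs://localhost:9000 and all paths are assumptions for illustration; comments show the roughly equivalent shell command):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsOpsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path dir  = new Path("/demo");
        Path file = new Path("/demo/notes.txt");

        fs.mkdirs(dir);                                    // hdfs dfs -mkdir /demo
        fs.copyFromLocalFile(new Path("notes.txt"), file); // hdfs dfs -put notes.txt /demo
        fs.setReplication(file, (short) 2);                // customize the default factor of 3

        try (FSDataInputStream in = fs.open(file)) {       // hdfs dfs -cat /demo/notes.txt
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.delete(file, false);                            // hdfs dfs -rm /demo/notes.txt
        fs.close();
    }
}
```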

Source: www.cnblogs.com/zxgCoding/p/12638189.html