Hadoop Distributed File System (HDFS): Introduction and Principles

A note before reading

This article is less an original piece than a set of personal study notes; most of the content is taken from an excellent open-source notes project on GitHub. Out of respect for its copyright, the source is linked here: github source address

1. Introduction

HDFS (Hadoop Distributed File System) is the distributed file system used in Hadoop. It provides the functions expected of a file system, and its directory structure is similar to that of a Linux file system. It is designed for storing large volumes of data in a distributed fashion, and it features high fault tolerance and high throughput while being deployable on low-cost hardware.

2. HDFS Design Principles

(Figure: HDFS design schematic)

2.1 HDFS architecture

HDFS follows a master/slave (M/S) architecture and consists of a single NameNode (NN) and multiple DataNodes (DN):

  • NameNode: executes file system namespace operations such as opening, closing, and renaming files and directories. It also stores the cluster metadata, recording the location of every block of every file.
  • DataNode: serves read and write requests from file system clients and performs block creation, deletion, and related operations.

2.2 File system namespace

The HDFS file system namespace has a hierarchical structure similar to that of a Linux file system: it supports creating, moving, deleting, and renaming directories and files, and it supports configuring user access permissions, but it does not support hard links or soft links. The NameNode maintains the file system namespace and records any change to the namespace or its properties.
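
As an illustration of these namespace operations, here is a minimal sketch using the standard Hadoop FileSystem Java API; the NameNode address hdfs://namenode:8020 and the paths are placeholder values for this example.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; all namespace operations below are handled by the NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path dir = new Path("/user/demo/input");
        fs.mkdirs(dir);                                        // create a directory
        fs.setPermission(dir, new FsPermission((short) 0755)); // configure access permissions

        fs.rename(new Path("/user/demo/input/a.txt"),          // move / rename a file
                  new Path("/user/demo/input/b.txt"));
        fs.delete(new Path("/user/demo/input/b.txt"), false);  // delete a file (non-recursive)

        fs.close();
    }
}
```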

2.3 Data Replication

Since Hadoop is designed to run on low-cost machines, the hardware must be assumed to be unreliable. To guarantee fault tolerance, HDFS provides a data replication mechanism: each file in HDFS is stored as a series of blocks, and each block has multiple replicas. Both the block size and the replication factor are configurable (by default the block size is 128 MB and the replication factor is 3).

(Figure: HDFS replication mechanism)
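
The block size and replication factor are ordinarily set cluster-wide in hdfs-site.xml through the dfs.blocksize and dfs.replication properties; they can also be overridden per file from client code. The sketch below shows both, assuming a placeholder NameNode address and file path.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");       // default replication factor for new files
        conf.set("dfs.blocksize", "134217728"); // 128 MB block size, in bytes

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The replication factor and block size can also be chosen per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/user/demo/big.log"),
                true, 4096, (short) 2, 256L * 1024 * 1024);
        out.writeBytes("hello hdfs\n");
        out.close();
        fs.close();
    }
}
```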

2.4 Data replication implementation principle

When the writer runs on a DataNode, the first replica is preferentially placed on that DataNode; otherwise it is placed on a random DataNode. The second replica is then placed on a node in a different (remote) rack, and the last replica on another node in that same remote rack. This strategy reduces write traffic between racks and thereby improves write performance.

(Figure: how data replication works)

If the replication factor is greater than 3, the placement of the fourth and subsequent replicas is determined randomly, while keeping the number of replicas per rack below an upper limit, which is usually (replication factor - 1) / number of racks + 2.
Note: multiple replicas of the same block are never placed on the same DataNode.
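
To make that upper-limit formula concrete, here is a small worked example; the helper method below is purely illustrative and not part of Hadoop (the division is integer division).

```java
public class RackLimit {
    // Upper bound on replicas per rack: (replication factor - 1) / number of racks + 2.
    static int maxReplicasPerRack(int replicationFactor, int numRacks) {
        return (replicationFactor - 1) / numRacks + 2;
    }

    public static void main(String[] args) {
        // Replication factor 5 across 3 racks: (5 - 1) / 3 + 2 = 3 replicas per rack at most.
        System.out.println(maxReplicasPerRack(5, 3)); // 3
        // Replication factor 3 across 2 racks: (3 - 1) / 2 + 2 = 3.
        System.out.println(maxReplicasPerRack(3, 2)); // 3
    }
}
```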

2.5 Replica selection

In one sentence: the proximity principle.
To minimize bandwidth consumption and read latency, HDFS serves a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred; if the HDFS cluster spans multiple data centers, a replica in the local data center is preferred.
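
The following is a purely illustrative sketch of that preference order; the Locality levels and the choose helper are hypothetical and not part of the HDFS API, which performs this selection internally.

```java
import java.util.Comparator;
import java.util.List;

public class ReplicaSelection {
    // Lower ordinal means "closer" to the reader.
    enum Locality { SAME_NODE, SAME_RACK, SAME_DATA_CENTER, REMOTE }

    record Replica(String dataNode, Locality locality) {}

    // Pick the replica with the best (closest) locality.
    static Replica choose(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt(r -> r.locality().ordinal()))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Replica chosen = choose(List.of(
                new Replica("dn-7", Locality.REMOTE),
                new Replica("dn-9", Locality.SAME_DATA_CENTER),
                new Replica("dn-2", Locality.SAME_RACK)));
        System.out.println(chosen.dataNode()); // dn-2, the same-rack replica
    }
}
```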

2.6 Architecture stability

  1. Heartbeat mechanism

    Each DataNode periodically sends a heartbeat message to the NameNode. If no heartbeat is received within a specified time, the NameNode marks that DataNode as dead: it stops forwarding new I/O requests to it, and the data on it is no longer used. Because that data becomes unavailable, some blocks may end up with fewer replicas than their specified replication factor; the NameNode tracks such blocks and re-replicates them when necessary.

  2. Data integrity

    Because of disk failures and similar faults, the data stored on a DataNode can become corrupted. To avoid producing wrong results from reading corrupted data, HDFS provides a data integrity checking mechanism that works as follows:
    When an HDFS client creates a file, it computes a checksum for each block of the file and stores the checksums in a separate hidden file in the same HDFS namespace. When a client retrieves the file contents, it verifies that the data received from each DataNode matches the checksum stored in the associated checksum file. If the check fails, the data is considered corrupted, and the client can fetch another available replica of that block from a different DataNode (a small sketch of this checksum idea follows this list).

  3. Metadata disk failure

    The FsImage and EditLog are the core data of HDFS, and accidentally losing them would make the whole HDFS service unavailable. To avoid this, the NameNode can be configured to maintain multiple synchronized copies of the FsImage and EditLog, so that any change to the FsImage or EditLog is applied to every copy at the same time.

  4. Snapshot support

    Snapshots store a copy of the data at a particular point in time. When data is accidentally damaged, it can be rolled back to a previously healthy state.
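
Here is a minimal sketch of the checksum idea described in point 2 above; it uses CRC32 for illustration, and the block contents and helper are invented for the example (the real HDFS implementation checksums data in small chunks and manages the checksum files itself).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a checksum over one block of data.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes(StandardCharsets.UTF_8);

        // On write: the client computes and stores the checksum alongside the data.
        long stored = checksum(block);

        // Simulate corruption of the copy held by one DataNode.
        byte[] corrupted = block.clone();
        corrupted[0] ^= 0x01;

        // On read: recompute and compare; on mismatch, fetch another replica instead.
        boolean ok = checksum(corrupted) == stored;
        System.out.println(ok ? "block verified" : "corrupt block, read another replica");
    }
}
```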

3. HDFS Characteristics

3.1 High fault tolerance

Because HDFS stores multiple replicas of the data, the failure of part of the hardware does not lead to the loss of all the data.

3.2 High throughput

The key design goal of HDFS is to support high-throughput data access rather than low-latency data access.

3.3 Large file support

HDFS is suited to storing large files, with file sizes at the GB to TB level.

3.4 Simple consistency model

HDFS is best suited to a write-once-read-many access model. Appending content to the end of a file is supported, but random writes are not: data cannot be added or modified at an arbitrary position within a file.
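
As a small illustration of this model, the Hadoop FileSystem API exposes create and append for writing, but no call for writing at an arbitrary offset of an existing file. The NameNode address and path below are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path log = new Path("/user/demo/events.log");

        // Write the file once...
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.writeBytes("first batch of records\n");
        }
        // ...then only append to its end; bytes in the middle cannot be rewritten in place.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("appended records\n");
        }
        fs.close();
    }
}
```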

3.5 Cross-platform portability

Because it is implemented in Java, HDFS has good cross-platform portability, which makes it the preferred data-persistence option for many other big data frameworks.

HDFS storage principles, illustrated

Source: the HDFS cartoon illustration (see reference 3).

  1. How HDFS writes data
    (figures omitted)

  2. How HDFS reads data
    (figures omitted)

  3. HDFS fault types and how they are detected
    (figures omitted)

  • Handling read faults (figure omitted)
  • Handling DataNode faults (figure omitted)
  • Replica placement policy (figure omitted)

Reference material

  1. Apache Hadoop 2.9.2 documentation: HDFS Architecture
  2. Tom White. Hadoop: The Definitive Guide. Tsinghua University Press, 2017.
  3. A translation of the classic cartoon explaining HDFS principles
  4. The open-source GitHub notes this article is based on (linked again at the top of this article)
