The Overall Architecture of HDFS, a Large-Scale Distributed File System

1 Overview

HDFS is a distributed file system that is highly fault-tolerant and runs on inexpensive machines. It provides high-throughput access to application data and is well suited to applications with large datasets.

2 HDFS features

(1) It can store very large files, at the GB, TB, or even PB scale.

(2) It can run on cheap commodity hardware; HDFS keeps data safe through its own fault-tolerance mechanism.

(3) Streaming data access: it is optimized for a write-once, read-many access pattern.

(4) It suits applications that need high data-access throughput, but not applications that require low-latency data access.

(5) To keep data access efficient, HDFS does not support multiple concurrent writers, nor modification at arbitrary positions within a file, although these operations may be supported in the future.

This article is based on the Hadoop 2.x release line.

3 HDFS data blocks

The default HDFS block size is 128 MB. Such a large block size mainly serves to reduce disk seek (addressing) overhead relative to data-transfer time. Files on HDFS are split into blocks of this size. To keep data safe, each block is replicated to multiple server nodes (3 replicas by default); an application can also specify the replication factor per file. If a block replica is found to be unavailable, the system reads another replica from elsewhere.
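The splitting and replication arithmetic above can be sketched as follows. This is an illustrative model, not the HDFS implementation; the function name and constants are assumptions for the example.

```python
# Illustrative sketch (not Hadoop code): how a file is cut into
# fixed-size blocks, and how replication multiplies raw storage.

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs for each block of the file.
    The last block may be smaller than block_size and occupies
    only as much disk space as it actually holds."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

file_size = 300 * 1024 * 1024            # a 300 MB file
blocks = split_into_blocks(file_size)
print(len(blocks))                       # 3 blocks: 128 MB + 128 MB + 44 MB
print(file_size * REPLICATION)           # raw bytes stored across the cluster
```

Note that the last block is only 44 MB: unlike a local file system, HDFS does not pad a short final block out to the full block size.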

4 NameNode and DataNodes

HDFS follows a master/slave architecture, consisting of a name node (NameNode) and multiple data nodes (DataNodes).

(1) NameNode: stores the metadata (file names, replication counts, permissions, block lists, ...) and manages the namespace and client access.

(2) DataNodes: serve read and write requests from file system clients, and periodically send the NameNode a report of the blocks they store.

The NameNode's metadata consists mainly of an image file (fsimage) and an edit log file (edits). Note that this metadata does not record which server node holds each data block; instead, each DataNode automatically reports its block information to the NameNode.
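The point that block locations are reported rather than persisted can be sketched with a toy model. The names below (`receive_block_report`, the block IDs, the DataNode names) are assumptions for illustration, not Hadoop internals.

```python
# Toy model (not Hadoop code): the NameNode does not persist block
# locations; it rebuilds the block -> DataNode map in memory from the
# periodic block reports each DataNode sends.
from collections import defaultdict

block_map = defaultdict(set)   # block id -> set of DataNodes holding a replica

def receive_block_report(datanode: str, blocks: list):
    """Record that this DataNode currently holds these blocks."""
    for block_id in blocks:
        block_map[block_id].add(datanode)

# Three DataNodes report in after a restart:
receive_block_report("dn1", ["blk_1", "blk_2"])
receive_block_report("dn2", ["blk_1"])
receive_block_report("dn3", ["blk_1", "blk_2"])

print(sorted(block_map["blk_1"]))  # blk_1 has 3 replicas
print(len(block_map["blk_2"]))     # blk_2 has 2; under-replicated if the target is 3
```

This is also why losing the NameNode's disks does not lose block locations: they are reconstructed from reports, whereas losing fsimage/edits loses the namespace itself.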

Both files can be inspected with the offline viewer tools that ship with HDFS:

(1) View the fsimage image file: $ hdfs oiv -p XML -i image file -o output file path

(2) View the edits log file: $ hdfs oev -p XML -i edits file -o output file path
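The relationship between the two files can be sketched with a toy model: the namespace is the fsimage checkpoint plus a replay of the edit log. The operation names and paths below are assumptions for illustration, not the actual edits format.

```python
# Toy model (not Hadoop internals): reconstructing the namespace by
# loading the fsimage checkpoint and replaying the edits recorded since.
fsimage = {"/a.txt": {"replication": 3}}            # state at the last checkpoint
edits = [                                           # operations since the checkpoint
    ("OP_ADD", "/b.txt", {"replication": 2}),
    ("OP_DELETE", "/a.txt", None),
]

def replay(image: dict, log: list) -> dict:
    """Apply each logged operation, in order, on top of the checkpoint."""
    namespace = dict(image)
    for op, path, meta in log:
        if op == "OP_ADD":
            namespace[path] = meta
        elif op == "OP_DELETE":
            namespace.pop(path, None)
    return namespace

namespace = replay(fsimage, edits)
print(sorted(namespace))   # only /b.txt survives the replay
```

Checkpointing is then just folding the edits into a new fsimage so the log can be truncated, which is exactly the work the SecondaryNameNode described below offloads from the NameNode.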

5 Fault tolerance mechanism of NameNode

The NameNode is critical: if the NameNode goes down or its metadata is corrupted, the entire file system becomes unusable. NameNode fault tolerance is therefore essential, and Hadoop provides two protection mechanisms.

(1) NameNode HA with QJM, or NameNode HA with NFS

(2) Regularly back up the metadata through a secondary node, the SecondaryNameNode

Because the first mechanism allows a faster switch back to normal NameNode service after a failure, it is generally the one chosen. Its implementation details will be explained in subsequent articles.
