Hadoop - HDFS HA and Federation Principle

Objectives of this section:

1. Understand the limitations and deficiencies of Hadoop 1.0

2. Master the new features of HDFS 2.0

1 The limitations and disadvantages of Hadoop 1.0

The core components of Hadoop 1.0, MapReduce (MR) and HDFS, have several shortcomings:

(1) The level of abstraction is low. Even simple functionality requires writing a large amount of code.

(2) Expressiveness is limited. MR abstracts complex distributed programming into just two functions, Map and Reduce; in real production environments, some computations cannot be completed with only these two simple functions.

(3) Developers must manage complex dependencies between jobs themselves. Practical applications usually require many jobs working together, and there are often complex dependencies among them.

(4) Iterative computation is inefficient. Tasks that require iteration must repeatedly read and write data in HDFS files, which greatly reduces efficiency.

(5) Resources are wasted. Reduce tasks must wait for all Map tasks to complete before they can start.

(6) Real-time performance is poor. MR is only suitable for offline batch processing.

2 Improvements from 1.0 to 2.0


3 New features of HDFS 2.0

HDFS 2.0 introduces two main new features: HDFS HA and HDFS Federation.

3.1 HDFS HA

For the distributed file system HDFS, the NameNode (NN) is the core node of the system: it stores all metadata, manages the file system namespace, and handles client access to files. In HDFS 1.0, however, there is only one NN, so a single point of failure brings down the entire system. Although there is a Secondary NameNode (SNN), it is not a hot backup of the NN. The SNN's main job is to periodically fetch the FsImage and EditLog from the NN, merge them, and send the result back to replace the old FsImage; this keeps the EditLog from growing so large that recovering from an NN failure would take too long. The merged FsImage is also kept on the SNN, so when the NN fails, the FsImage on the SNN can be used to help restore it.
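
As a point of reference, how often the SNN performs this merge is controlled by configuration. Below is a minimal hdfs-site.xml sketch using the Hadoop 2.x property names; the values shown are the usual defaults and are only illustrative:

<!-- Checkpoint settings for the Secondary NameNode (Hadoop 2.x property names) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>  <!-- merge FsImage and EditLog at least once per hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>  <!-- or earlier, once this many uncheckpointed transactions accumulate -->
</property>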

Because the SNN cannot provide a "hot backup", when the NN fails the system cannot immediately switch over to the SNN; it must be taken offline and recovered. HDFS 2.0 therefore adopts an HA (High Availability) architecture. In an HA cluster, two NNs are typically configured: one in the "Active" state and one in the "Standby" state. The Active NN handles all external client requests, while the Standby NN acts as a hot backup node and keeps enough metadata that, when the Active node fails, it can immediately switch to the Active state and continue serving clients.
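
As an illustration, an HA pair is normally described in hdfs-site.xml by declaring a logical nameservice and the two NNs behind it. The sketch below assumes a nameservice called mycluster and NameNodes nn1/nn2 on hypothetical hosts; the property names are the standard Hadoop 2.x HA settings:

<!-- Logical name for the HA cluster; clients address hdfs://mycluster -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- The two NameNodes (Active and Standby) behind the nameservice -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host.example.com:8020</value>
</property>
<!-- Clients use this proxy provider to locate whichever NN is currently Active -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

With a configuration along these lines, clients simply refer to the file system as hdfs://mycluster and are routed to the Active NN without knowing which physical node it is.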

Since the Standby NN is a "hot backup" of the Active NN, the state of the Active NN must be synchronized to the Standby NN in real time. This synchronization is done through a shared storage system such as NFS (Network File System), QJM (Quorum Journal Manager), or ZooKeeper. The Active NN writes its updates to the shared storage system, and the Standby NN continuously monitors it; as soon as new writes appear, the Standby NN reads the data from the shared storage and loads it into its own memory, ensuring that its state stays consistent with that of the Active NN.
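
For example, when QJM is used as the shared storage, both NNs point at the same set of JournalNodes: the Active NN writes every edit to a quorum of them, and the Standby tails the same journal. A minimal sketch, assuming three JournalNodes on hypothetical hosts and the mycluster nameservice from above:

<!-- The Active NN writes edits to this quorum of JournalNodes;
     the Standby NN reads from the same URI to stay in sync -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<!-- Local directory where each JournalNode stores its copy of the edit log -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/hdfs/journalnode</value>
</property>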

In addition, the NN keeps the mapping from data blocks to their actual storage locations, that is, which DataNode (DN) stores each block. When a DN joins the cluster, it sends the list of blocks it holds to the NN and then reports periodically via heartbeats, so the block mapping on the NN stays up to date. Therefore, to achieve fast switching in the event of a failure, the Standby NN must also hold the latest block mapping information. To this end, each DN is configured with the addresses of both the Active and Standby NNs and sends its block reports and heartbeat information to both of them at the same time. To prevent a "split-brain" situation, in which two NNs both act as the active master, it must also be guaranteed that only one NN is in the Active state at any time; this is implemented with the help of ZooKeeper.
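
Automatic failover and the single-Active guarantee are typically handled by ZooKeeper together with a fencing method that cuts off the old Active NN before the Standby takes over. Below is a hedged sketch of the usual properties; the ZooKeeper quorum setting belongs in core-site.xml, and the host names and key path are only illustrative:

<!-- hdfs-site.xml: enable ZKFC-driven automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- Fence the old Active NN over SSH before the Standby becomes Active,
     preventing a split-brain with two writers -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>

<!-- core-site.xml: the ZooKeeper ensemble used for Active-NN election -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>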

3.2 HDFS Federation

Although HDFS HA solves the "single point of failure" problem, there are still problems with system scalability, overall performance, and isolation:

(1) System scalability: all metadata is kept in the NN's memory, so the size of the namespace is limited by the memory of a single node.

(2) Overall performance: the throughput of the entire file system is limited by a single NN.

(3) Isolation: one application may affect other running applications; for example, if one application consumes too many resources, the others can no longer run smoothly. HDFS HA is, in essence, still a single-NameNode architecture.

HDFS federation can solve the above three problems.

In HDFS Federation, multiple independent NNs are deployed so that the HDFS naming service can scale horizontally. Each NN manages its own namespace and its own blocks, and the NNs do not need to coordinate with each other. Each DN, however, must register with every NN in the cluster and periodically send heartbeats and block reports to all of them.

HDFS Federation therefore has multiple independent namespaces, each managing its own set of blocks; the blocks belonging to one namespace form a "block pool". Each DN provides storage for multiple block pools, and the blocks within a block pool are actually distributed across different DNs.
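
For comparison with the HA case, a federated deployment simply declares several independent nameservices, each backed by its own NN; every DN in the cluster then serves block pools for all of them. A minimal sketch for two nameservices ns1 and ns2 on hypothetical hosts (without HA, each nameservice maps directly to a single NN address):

<!-- Two independent namespaces, each with its own NameNode -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2-host.example.com:8020</value>
</property>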








