It's 2022, why is HDFS still so powerful?

This article is shared from the Huawei Cloud Community post "Why does HDFS endure in the field of big data?", by JavaEdge.

1 Overview

1.1 Introduction

  • The Hadoop Distributed File System, referred to as HDFS, is the distributed file system implemented by Hadoop
  • It originates from Google's GFS paper, published in 2003; HDFS is an open-source clone of GFS

In big data, the most valuable and hardest-to-replace asset is the data itself; everything revolves around it.

HDFS is the earliest big data storage system and holds these valuable data assets. For new algorithms and frameworks to gain wide adoption, they must support HDFS in order to access the data stored in it. So the more big data technology develops and the more new technologies appear, the more support HDFS receives and the more indispensable it becomes. HDFS may not be the best big data storage technology, but it is still the most important one.

How does HDFS achieve high-speed and reliable storage and access of big data?

The design goal of the Hadoop distributed file system HDFS is to manage thousands of servers and tens of thousands of disks as a single storage system, providing applications with tens of petabytes of storage capacity and letting them store large-scale file data as if they were using an ordinary file system.

1.2 Design goals

A naive design simply stores each file as several complete copies on different nodes:

file1: node1 node2 node3
file2: node2 node3 node4
file3: node3 node4 node5
file4: node5 node6 node7

Shortcomings:

  • No matter how large a file is, each copy sits entirely on one node, so data is hard to process in parallel, and a single node can become a network bottleneck; big data is difficult to handle this way.
  • Storage load is hard to balance, and the utilization of each node is low.

Advantages HDFS is designed to deliver instead:

  • a huge distributed file system
  • runs on ordinary, cheap hardware
  • easy to scale out, providing users with a file storage service with good performance

2 How to design a distributed file system

How HDFS implements mass storage and high-speed access.

After RAID shards data, reads and writes proceed concurrently across multiple disks, which increases storage capacity, speeds up access, and improves data reliability through redundancy checks: even if a disk is damaged, no data is lost. Extending RAID's design ideas to an entire distributed server cluster produces a distributed file system, and this is the core principle behind the Hadoop distributed file system.

Similar to RAID's approach of striping files across multiple disks for parallel reads and writes, HDFS shards data and performs parallel reads, writes, and redundant storage across a large distributed server cluster. Because HDFS can be deployed on a large cluster, the disks of every server in the cluster are available to it, so the total storage space of HDFS can reach the PB scale.
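To make the sharding idea concrete, here is a minimal sketch of how a file's byte range divides into fixed-size blocks; the 128 MB block size and the 500 MB file are illustrative assumptions, not values read from a real cluster:

```java
// Sketch of the sharding idea: split a file's byte range into fixed-size
// blocks, the way HDFS logically divides a file across DataNodes.
public class BlockSplitSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the common HDFS default

    public static void main(String[] args) {
        long fileSize = 500L * 1024 * 1024; // a hypothetical 500 MB file
        long numBlocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        for (long i = 0; i < numBlocks; i++) {
            long start = i * BLOCK_SIZE;
            long end = Math.min(start + BLOCK_SIZE, fileSize);
            // Each block can live on a different DataNode and be read in parallel.
            System.out.printf("block %d: bytes [%d, %d)%n", i, start, end);
        }
    }
}
```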

HDFS uses a master-slave architecture. An HDFS cluster has one NameNode (NN for short) acting as the master server.

  • The NameNode manages the file system namespace and regulates client access to files
  • There are also multiple DataNodes (DN for short), data nodes that act as slave servers
  • The DataNodes in a cluster are managed by the NameNode and are responsible for storing data

HDFS exposes a file system namespace that lets users store data in files, just as with the file system of an ordinary OS; users do not need to care how the data is stored underneath.
Under the hood, a file is divided into one or more data blocks, and these blocks are stored on a set of DataNodes. In CDH, the default block size is 128 MB.
The NameNode handles file system namespace operations such as opening, closing, and renaming files; it also determines the mapping of data blocks to DataNodes.
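As a concrete illustration, here is a minimal sketch of a client reading a file through Hadoop's FileSystem API; the client never handles blocks directly. The NameNode URI is a hypothetical address, and the path reuses the example file discussed later:

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; the client only asks the NameNode
        // for metadata, then streams block data directly from DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (InputStream in = fs.open(new Path("/users/sameerp/data/part-0"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // print file contents
        }
    }
}
```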

HDFS is designed to run on ordinary, cheap machines, which typically run Linux. A typical HDFS deployment dedicates one machine to running only the NameNode instance, while every other machine in the cluster runs one DataNode instance. Running multiple DataNodes on a single machine is possible but not recommended.

DataNode

  • Stores the data blocks (Block) that make up users' files
  • Periodically sends heartbeats to the NameNode, reporting itself, all of its block information, and its health status

DataNodes are responsible for storing file data and serving reads and writes. HDFS divides file data into data blocks (Block), and each DataNode stores a portion of the blocks, so files are distributed across the entire HDFS server cluster.

Application clients (Client) can access these blocks in parallel, so HDFS achieves parallel data access at server-cluster scale, greatly improving access speed.

An HDFS cluster contains many DataNode servers, generally from several hundred to several thousand. Each server is equipped with several disks, and the storage capacity of the whole cluster ranges from a few PB to several hundred PB.

NameNode

  • Responds to client requests
  • Manages metadata (file names, replication factor, which DataNodes store each Block)

The NameNode manages the metadata (MetaData) of the entire distributed file system, that is, file path names, block IDs, storage locations, and similar information, much like the file allocation table (FAT) in an OS.

To ensure high data availability, HDFS replicates each block into multiple copies (3 by default) and stores the copies of the same block on different servers, and even on different racks. When a disk is damaged, a DataNode server goes down, or even a switch fails so that some blocks become inaccessible, the client simply looks up a backup replica and reads that instead.

3 HDFS replica mechanism

In HDFS, a file is split into one or more data blocks. By default each block has three replicas, each replica stored on a different machine, and each replica has its own unique ID:

(Figure: layout of multi-replica block storage)

In the figure, the replication factor of the file /users/sameerp/data/part-0 is set to 2, and the file's blocks have IDs 1 and 3:

  • The two replicas of Block 1 are stored on DataNode0 and DataNode2
  • The two replicas of Block 3 are stored on DataNode4 and DataNode6

If any one of these servers goes down, at least one replica of each block still exists, so access to the file /users/sameerp/data/part-0 is not affected.

As with RAID, dividing data into blocks stored on different servers achieves large-capacity storage, and because different blocks can be read and written in parallel, data access is fast as well.
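Assuming a reachable cluster at a hypothetical hdfs://namenode:8020 address, a sketch like the following can set a file's replication factor and inspect where its block replicas live, using only public FileSystem APIs:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaInspectSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path file = new Path("/users/sameerp/data/part-0");

        fs.setReplication(file, (short) 2); // ask for 2 replicas per block

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, listing the DataNodes holding a replica.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
    }
}
```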

Replica placement strategy

Replica placement: the NameNode chooses which DataNodes store each block's replicas, and the strategy balances reliability against read and write bandwidth.

The default policy, as described in Hadoop: The Definitive Guide (a code sketch follows the analysis below):

  • The first replica is placed on a randomly selected node, avoiding nodes whose storage is too full
  • The second replica is placed on a randomly selected rack different from the first replica's
  • The third replica is placed on a different node in the same rack as the second
  • Any remaining replicas are placed on completely random nodes

Why this placement is reasonable

  • Reliability: each block is stored on two racks
  • Write bandwidth: writes traverse only one network switch
  • Read performance: a read can choose either of the two racks
  • Blocks are distributed throughout the cluster
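The rules above can be sketched as follows. This is an illustration of the strategy only, not HDFS's actual BlockPlacementPolicy implementation; the Node type and the four-node cluster are made up for the demo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class PlacementSketch {
    record Node(String name, String rack) {}

    static final Random RAND = new Random();

    static Node pick(List<Node> candidates) {
        return candidates.get(RAND.nextInt(candidates.size()));
    }

    static List<Node> chooseTargets(List<Node> cluster, int replicas) {
        List<Node> chosen = new ArrayList<>();
        Node first = pick(cluster); // rule 1: random node (full nodes would be filtered out here)
        chosen.add(first);
        if (replicas > 1) {
            // rule 2: a node on a different rack from the first replica
            Node second = pick(cluster.stream()
                    .filter(n -> !n.rack().equals(first.rack()))
                    .collect(Collectors.toList()));
            chosen.add(second);
            if (replicas > 2) {
                // rule 3: a different node on the same rack as the second
                Node third = pick(cluster.stream()
                        .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                        .collect(Collectors.toList()));
                chosen.add(third);
            }
        }
        // any remaining replicas would go on completely random nodes
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn0", "rack1"), new Node("dn1", "rack1"),
                new Node("dn2", "rack2"), new Node("dn3", "rack2"));
        System.out.println(chooseTargets(cluster, 3));
    }
}
```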

GFS (Google File System) was the first engine of Google's big data "troika" of papers, and HDFS was Hadoop's first product; distributed file storage is the foundation of distributed computing.

Over the years, new computing frameworks, algorithms, and application scenarios have kept emerging, but the king of big data storage is still HDFS.

5 High Availability Design of HDFS

5.1 Data Storage Fault Tolerance

Disk media are affected by the environment and by aging, so the data stored on them can become corrupted over time.

HDFS computes and stores a checksum (CheckSum) for every data block stored on a DataNode. When data is read, the checksum of the data actually read is recomputed; if it does not match, an exception is thrown, and after catching the exception the application reads the replica on another DataNode instead.
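A minimal sketch of this idea, assuming a single CRC32 checksum over the whole block; real HDFS checksums every small chunk (512 bytes by default) rather than the whole block, so treat this purely as an illustration:

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    static long checksum(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long stored = checksum(block, 0, block.length); // computed at write time

        block[3] ^= 1; // simulate a single flipped bit on disk

        long recomputed = checksum(block, 0, block.length); // computed at read time
        if (recomputed != stored) {
            // In HDFS the client would raise a checksum error and fall back
            // to a replica on another DataNode.
            System.out.println("checksum mismatch: read another replica");
        }
    }
}
```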

5.2 Disk Failure Tolerance

When a DataNode detects that one of its local disks is damaged, it reports all the block IDs stored on that disk to the NameNode. The NameNode checks which other DataNodes hold replicas of those blocks and tells them to copy the blocks to additional servers, ensuring that the number of replicas of every block stays at the required level.

5.3 DataNode fault tolerance

DataNodes keep in contact with the NameNode through heartbeats. If a DataNode fails to send heartbeats for too long, the NameNode considers it dead, immediately determines which blocks it stored and which other servers hold replicas of those blocks, and then tells those servers to copy the blocks to additional servers. This keeps the number of block replicas at the level the user configured, so data is not lost even if more servers go down.
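As a toy illustration of this bookkeeping, the sketch below tracks heartbeat times and schedules re-replication for blocks on an expired node. All types and the 10-minute timeout are invented for the demo; the real logic lives inside the NameNode and is far richer:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatSketch {
    static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed 10-minute expiry

    // last heartbeat time per DataNode, and block -> nodes holding a replica
    final Map<String, Long> lastHeartbeat = new HashMap<>();
    final Map<Long, List<String>> blockReplicas = new HashMap<>();

    void onHeartbeat(String dataNode, long now) {
        lastHeartbeat.put(dataNode, now);
    }

    void checkExpired(long now) {
        for (var e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                String dead = e.getKey();
                // For every block on the dead node, ask a surviving replica
                // holder to copy the block to another server.
                blockReplicas.forEach((blockId, holders) -> {
                    if (holders.contains(dead)) {
                        holders.stream().filter(h -> !h.equals(dead)).findFirst()
                                .ifPresent(src -> System.out.printf(
                                        "re-replicate block %d from %s%n", blockId, src));
                    }
                });
            }
        }
    }

    public static void main(String[] args) {
        HeartbeatSketch nn = new HeartbeatSketch();
        nn.onHeartbeat("dn1", 0);
        nn.blockReplicas.put(1L, List.of("dn1", "dn2"));
        nn.checkExpired(TIMEOUT_MS + 1); // dn1 has expired -> schedule re-replication
    }
}
```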

5.4 NameNode Fault Tolerance

The NameNode is the core of the entire HDFS system: it records the HDFS file allocation table, and all file paths and block storage information live on it. If the NameNode fails, the whole HDFS cluster becomes unusable; if the data recorded on the NameNode is lost, the data stored on every DataNode in the cluster becomes useless.

High availability and fault tolerance of the NameNode are therefore critical. The NameNode provides a highly available service in a master-slave hot-standby mode:

The cluster deploys two NameNode servers:

  • one as the active (primary) server
  • one as the standby server, kept as a hot backup

The two servers run an election through ZooKeeper (ZK), competing for a znode lock to decide which one becomes active. DataNodes send heartbeats to both NameNodes, but only the active NameNode may return control commands to them.

During normal operation, file system metadata is synchronized between the active and standby NameNodes through a shared storage system (shared edits). When the active NameNode goes down, the standby is promoted to active via ZooKeeper, ensuring that the cluster's metadata, that is, the file allocation table, remains complete and consistent.
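To illustrate the znode-lock idea, here is a hedged sketch of leader election with an ephemeral znode. The connection string and path are assumptions, the parent znode is assumed to already exist, and production HDFS uses the ZKFailoverController rather than hand-rolled code like this:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});
        try {
            // EPHEMERAL: the znode disappears automatically if this NameNode's
            // session dies, releasing the lock so the standby can take over.
            zk.create("/hdfs-ha/active-lock", "nn1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("acquired lock: acting as active NameNode");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("lock held by peer: acting as standby NameNode");
            // a real implementation would watch the znode and retry on deletion
        }
    }
}
```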

Users may put up with a software system that performs poorly, and even with a poor user experience. Poor availability is another matter: frequent failures and downtime are real trouble, and losing important data is a disaster for the development team.

A distributed system can fail in many places. Memory, CPUs, motherboards, and disks break; servers go down; the network gets interrupted; the machine room loses power. Any of these can make the software system unavailable, or even lose data permanently.

Therefore, when designing a distributed system, software engineers must keep availability front of mind and think through how to keep the whole system usable under every failure condition that might occur.

6 Strategies to Ensure System Availability

Redundant backup

Every program and every piece of data must have at least one backup: a program is deployed on at least two servers, and data is replicated to at least one other server. Beyond that, any reasonably sized Internet company builds multiple data centers that back each other up, with user requests routed to any of them, the so-called multi-site active-active ("remote multi-active") setup. Even in a natural disaster, application availability is preserved.

Failover

When the program or data being accessed becomes unreachable, requests must be transferred to the server holding the backup program or data; this is called failover. In failover, correctly detecting the failure matters. In a scenario like the NameNode's, where active and standby servers manage the same data, if the standby mistakenly decides the active server is down and takes over cluster management, both servers will issue commands to the DataNodes at the same time, throwing the cluster into confusion: the so-called "split brain". This is exactly why ZooKeeper is introduced to arbitrate the election in such scenarios. How ZooKeeper works will be analyzed later.

Downgrade

When a flood of user requests or data-processing requests arrives, limited computing resources may be unable to handle them all, exhausting resources and crashing the system. In that case some requests can be rejected, which is rate limiting; some functions can also be switched off to cut resource consumption, which is downgrading. Rate limiting is a standing capability of Internet applications, because no one can predict when traffic beyond the system's capacity will suddenly arrive, so throttling must be ready to switch on at any moment (a minimal rate limiter is sketched below). Downgrading, by contrast, is usually prepared for predictable events, such as an e-commerce "Double Eleven" promotion: to keep core functions such as ordering running during the event, the system can be downgraded by switching off non-essential functions such as product reviews.
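A minimal sketch of the rate-limiting idea using a counting semaphore: at most a fixed number of requests are processed concurrently and the rest are rejected outright. The capacity of 100 and the handler shape are illustrative assumptions:

```java
import java.util.concurrent.Semaphore;

public class RateLimitSketch {
    private final Semaphore permits = new Semaphore(100); // assumed capacity

    public String handle(Runnable request) {
        if (!permits.tryAcquire()) {
            return "rejected: over capacity"; // the rate-limited path
        }
        try {
            request.run();
            return "ok";
        } finally {
            permits.release();
        }
    }
}
```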

Summary

How does HDFS achieve large-capacity, high-speed, reliable data storage and access on a large-scale distributed server cluster?

1. File data is split into data blocks, and any block can be stored on any DataNode in the cluster, so a file stored in HDFS can be very large; in theory, one file could occupy every disk in the whole cluster, enabling mass storage.

2. HDFS is typically accessed by MapReduce programs during computation, and MapReduce reads its input in splits. Usually one split is one data block, and each block is assigned its own compute process, so many processes can concurrently access the blocks of a single HDFS file, achieving high-speed data access. MapReduce's processing model is discussed in detail later in this column.

3. The data blocks stored by DataNodes are replicated, so every block has multiple backups in the cluster, which guarantees data reliability; together with the fault-tolerance mechanisms above, this makes the main HDFS components highly available and, in turn, the data and the whole system highly available.

