Understanding the architecture design of HDFS in 11 diagrams

Introduction to HDFS

HDFS is a highly fault-tolerant, high-throughput distributed file system designed to run on inexpensive commodity hardware.

HDFS design concepts

Support for very large data sets

Applications running on HDFS work with large data sets. A typical file on HDFS is gigabytes to terabytes in size. HDFS is therefore designed for large-file storage: a cluster can scale out to hundreds of nodes and store massive amounts of data.

For example, would you still use MySQL to store 10 billion rows in a single table? Think about how high that one machine's configuration would have to be, and then think about query efficiency. With distributed storage, the data is spread across N machines, each storing only part of those 10 billion rows; a single machine might hold, say, 5 million of them. Storing huge data sets like this is exactly what HDFS is positioned for.

Hardware failure

An HDFS cluster may consist of hundreds or thousands of servers, each storing part of the file system's data. With that many machines, the probability that some machine fails is quite high: a disk gets damaged and can no longer be read or written, the network fails, and the machine stops working normally.

Therefore, error detection and fast, automatic recovery are core architectural goals of HDFS. Once HDFS detects that a machine in the cluster has a problem, it can recover from the failure.

Streaming data access

Applications running on HDFS differ from the ordinary applications we usually write. HDFS reads and writes data on the file system as streams: files are read and written in large batches to guarantee high-throughput access, not low-latency access.

HDFS is used in offline batch-processing scenarios, especially today's offline data warehouses built for data analysis. A typical offline analysis job extracts data from the source systems into HDFS in the early morning and then processes it batch by batch, instead of computing record by record.

Simplified data consistency model

Supporting concurrent reads and writes of the same file would mean handling a huge number of concurrency conflicts. Think of the read-write locks we add around shared state when writing code: leave them out and countless problems appear. For a system like HDFS that stores large data sets in a distributed way, the model has to be simplified: a file can be written only once and afterwards only appended to; existing data cannot be casually modified.

So its philosophy is write-once, read-many: write a file once, then read it many times. There are then no read-write concurrency conflicts, and no question of how to maintain data consistency.

This write-once, read-many model also greatly improves the throughput of HDFS.
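
To make the write-once, read-many model concrete, here is a minimal Java sketch using the standard Hadoop FileSystem client API. The NameNode address and the file path are placeholders, and append() assumes a Hadoop version and configuration that support appends.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/logs/events.log"); // hypothetical file

        // The first (and only) write creates the file
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first batch of records\n");
        }

        // Later writers may only append; existing bytes cannot be overwritten in place
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended batch of records\n");
        }
        fs.close();
    }
}
```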

Move computation, not data

Think about it: when reading data, is it faster to read from the local disk or from another machine over the network? The closer the data is to the machine doing the work, the faster and more efficient the processing.

Therefore, with data distributed across many machines, distributed computing should move the computation tasks closer to the data, instead of shipping data across the network within the cluster.

Master-slave distributed architecture

NameNode and DataNode

The classic architecture in distributed systems is master-slave, and HDFS uses it: an HDFS cluster is composed of one NameNode and multiple DataNodes.

The NameNode is a process, a JVM process, and also a system in its own right. We can think of the NameNode as the master (understand it as the big steward). An ordinary HDFS cluster has only one NameNode, which is responsible for managing the filesystem namespace and client access to files.

The DataNode is also a process. Each machine in the cluster runs a DataNode process, which is mainly responsible for storing data on that machine.

Filesystem Namespace

How should we understand the namespace? Consider the folders on our own computers: data is stored in a directory -> file fashion, and a directory may contain further levels of directories.

For example, take this directory structure on my computer: the directory /documents/data warehouse/ contains the file "data warehouse construction method xxx.doc" and the directory /12_data, and /12_data in turn contains documents such as "04_big data system technical structure xxx.doc".

So the hierarchical structure of the file system and its mapping to files is the so-called Filesystem Namespace, and it is stored in the NameNode.
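
As a small illustration of the namespace, the Java sketch below (hostname and path are placeholders) lists a directory; the answer comes entirely from the NameNode, since it is the NameNode that holds the Filesystem Namespace.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListNamespace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // This listing is served from the NameNode's namespace metadata
        for (FileStatus status : fs.listStatus(new Path("/documents"))) {
            System.out.println((status.isDirectory() ? "dir  " : "file ") + status.getPath());
        }
        fs.close();
    }
}
```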

Blocks (file blocks)

In an HDFS cluster, a large file is split into N blocks that are stored on different machines. For example, a 1 GB file split with a 128 MB block size yields 8 blocks, which are placed on different machines, that is, on the DataNodes; the DataNodes are responsible for storing these blocks.

The NameNode knows how many blocks each file is split into and which DataNode stores each block, so the NameNode is the master steward of an HDFS cluster.

When you want to read the large file above, you first ask the NameNode steward which blocks the file consists of and where each block lives; the NameNode tells you which DataNodes hold them, and then you go to the corresponding DataNodes to fetch the data.
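
This block lookup is visible through the client API. Below is a hedged Java sketch (hostname and file path are placeholders) that asks the NameNode for a file's block locations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/data/big-file.dat")); // hypothetical file
        // Ask the NameNode which blocks make up the file and which DataNodes hold them
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```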

How file system metadata is persisted

EditLog and FsImage

HDFS metadata is stored on the NameNode. For every modification to the file system metadata, the NameNode records the operation, one by one, in an operation log called the EditLog.

For example, when a file is created in HDFS, the NameNode inserts a record into the EditLog; likewise, changing a file's replication factor inserts a record into the EditLog. The NameNode stores this EditLog in the file system of the local operating system. Creating a directory hierarchy on hdfs, say with hadoop fs -mkdir -p /usr/dir1, is recorded the same way.

The entire file system organization, including the mapping of data blocks to files, file attributes, and so on, is stored in a file called the FsImage, which is also placed on the local file system where the NameNode runs.

Before the NameNode persists metadata into the FsImage file, it first keeps it in memory for a while; writing the file on every metadata change would perform very poorly. So, every so often, the NameNode reads the EditLog file on disk, applies all of it to the in-memory FsImage cache, dumps the FsImage back to disk, and then empties the EditLog that was just applied.

The checkpoint operation

The procedure above, reading the edits log from disk, applying it to the FsImage, saving the FsImage back to disk, and finally clearing the edits log, is called a checkpoint. We can configure the checkpoint interval ourselves. When the NameNode starts up, it likewise first reads the edits log and FsImage from disk to build the cached data in memory.

dfs.namenode.checkpoint.period: run a checkpoint every this many seconds

dfs.namenode.checkpoint.txns: run a checkpoint once the edits log contains this many transactions

SecondaryNameNode and BackupNode in the Hadoop 1.x era

SecondaryNameNode

In the Hadoop 1.x era, when the NameNode started, it read the FsImage into memory, then replayed the contents of the EditLog into the FsImage so that the in-memory metadata was up to date, then wrote the latest FsImage to disk, and finally cleared the EditLog file.

After that, as clients keep operating on the HDFS cluster, the NameNode's metadata is continuously modified. These modifications go directly to the in-memory FsImage cache, while a log of each metadata change is appended to the edits log file.

There is a problem here. Metadata modification logs keep being appended to the edits log file, so it grows larger and larger, and it is only emptied at the next NameNode restart, when the edits log is merged into the FsImage. If the edits log is large, the merge takes a lot of time, which can make NameNode startup slower and slower.

So one line of thinking is to merge the edits log and the FsImage periodically: after each merge, write the new FsImage to disk and clear the edits log at the same time, so that the edits log never grows too big.

For this, HDFS has another role, the Secondary NameNode. It is deployed on a separate machine and is dedicated to merging the edits log and the FsImage. It runs periodically; each time it runs, it tells the NameNode to stop writing the current edits log and open a new one to write to. It then pulls the NameNode's FsImage file and edits log to its own machine, merges them in memory, and pushes the result back to the NameNode when the merge is finished.

This is the so-called checkpoint operation; it keeps the edits log from growing too big.

BackupNode

The BackupNode is another mechanism provided in Hadoop 1.x. Its idea is to optimize away, and replace, the checkpoint approach of downloading the edits log and FsImage in order to merge them.

The idea of the BackupNode is to keep in its memory a copy of the FsImage identical to the NameNode's, while also receiving the edits log data stream sent by the NameNode; whenever it receives edits log data, it writes a copy to its own local edits log.

At the same time, it replays the edits log into its own in-memory FsImage, so that its in-memory FsImage stays identical to the FsImage in the NameNode's memory.

The BackupNode also periodically performs a checkpoint: it writes its current in-memory FsImage to its own disk, replacing the old FsImage file, and then clears the edits log.

A NameNode can be linked to only one BackupNode. The advantage of using a BackupNode is that the NameNode no longer has to maintain an edits log file or write it to disk itself. It just maintains the FsImage in its own memory; every time a metadata-modifying operation arrives, it applies it to its in-memory FsImage and at the same time pushes the edits log data stream to the BackupNode.

Every time the NameNode restarts, it needs to use -importCheckpoint to load an FsImage from elsewhere, and that -importCheckpoint data is taken from the BackupNode.

The above is how metadata was managed in Hadoop 1.x. Hadoop 2.x no longer plays it this way.

The dual-instance HA (high availability) mechanism in the Hadoop 2.x era

Defects of Hadoop 1.x metadata management

In the Hadoop 1.x era, managing metadata with a NameNode linked to a Secondary NameNode brings metadata-loss and high-availability problems.

High availability: once the NameNode fails or goes down, the entire cluster becomes unavailable. Data loss: if the NameNode's disk is damaged, a certain amount of data is lost; the edits log written after the last checkpoint is gone.

Hadoop 2.x solves both the metadata-loss problem and the high-availability problem.

Dual NameNodes

A Hadoop 2.x cluster starts two NameNodes, one in the active state and one in the standby state. We can understand active as the master and standby as the backup. All client operations are sent to the active NameNode, while the standby NameNode serves as a hot standby for the active one, continuously synchronizing metadata.

Journal Node

The standby NameNode does not request metadata directly from the active NameNode. There is a middleman, the JournalNode, which stores the edits log. The JournalNodes generally form a cluster too, starting with three.

Whenever the active NameNode has a metadata change, it writes the edits log to a majority of the nodes in the JournalNode cluster. For example, with a 3-node JournalNode cluster, the majority is 3/2 + 1 = 2: once the NameNode has written the edits log to 2 of the nodes, the write is considered successful.
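
For reference, the shared edits directory that points the NameNodes at the JournalNode quorum looks roughly like the sketch below. These properties normally live in hdfs-site.xml on the NameNodes; the Java form, the nameservice name, and all hostnames are only illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class QjmConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hypothetical nameservice name
        conf.set("dfs.nameservices", "mycluster");
        // The active NameNode writes edits to this 3-node JournalNode quorum;
        // a write succeeds once a majority (3/2 + 1 = 2) of the nodes accept it
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
    }
}
```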

The standby NameNode constantly watches the JournalNodes for edits log changes. Whenever there is a change, it pulls the new edits log to its own machine and applies it to its own memory, staying consistent with the active NameNode.

When the active NameNode dies, the standby NameNode senses it immediately, makes sure it has read all the edits logs from the JournalNodes so that the FsImage in its memory holds the latest data, and then switches itself to be the active NameNode.

This solves the data-loss problem. And since when one machine goes down the other immediately switches to active NameNode and keeps providing service to the outside world, the high-availability problem is solved as well.

ZKFC (ZKFailoverController)

So how do the two NameNodes switch over automatically after one of them fails?

This relies on the ZKFailoverController process, ZKFC for short. Each NameNode has a ZKFC process. The ZKFCs constantly monitor the state of the two NameNodes and maintain that state in ZooKeeper.

If the active NameNode goes down, the HealthMonitor inside its ZKFC detects it; the FailoverController inside the ZKFC is notified that the NameNode is down, and it then asks the ActiveStandbyElector component to start a re-election.

The ActiveStandbyElector then, based on ZooKeeper, re-elects the standby NameNode to become the new active NameNode.

ZooKeeper then notifies the ActiveStandbyElector component in the ZKFC on the standby NameNode; that ZKFC's FailoverController is told to switch the standby state to active, and the FailoverController then instructs the NameNode to switch itself to the active NameNode.

This makes active-standby switching possible, and HA is achieved.
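
Automatic failover is switched on with a couple of configuration properties. A hedged sketch follows; in practice these go into hdfs-site.xml and core-site.xml, and the ZooKeeper hostnames here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;

public class AutoFailoverConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Let ZKFC perform automatic active/standby failover
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        // The ZooKeeper ensemble ZKFC uses to track NameNode state and elect the active
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
    }
}
```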

How the Hadoop 2.x HA dual instances manage metadata

In Hadoop 1.x, the BackupNode periodically performed a checkpoint: it wrote its in-memory FsImage to its own disk, replacing the old FsImage file, and then cleared the edits log.

In Hadoop 2.x, the standby NameNode is actually similar to the 1.x BackupNode, and it performs the checkpoint operation periodically.

The standby NameNode runs a background thread called CheckpointerThread. By default it checkpoints once an hour, or whenever 1 million edits log transactions have accumulated without being merged into the FsImage; either condition triggers a checkpoint.

It writes the latest in-memory FsImage to disk, clears the edits log, and finally pushes the latest FsImage file to the active NameNode, replacing the active NameNode's old FsImage and clearing the active NameNode's old edits log files as well.

Distributed storage mechanism for large file data

When you need to upload a very large file and the hdfs client submits the upload, the NameNode first splits the file into multiple blocks according to the file size; the default block size is 128 MB in 2.x and 64 MB in 1.x.

The NameNode plans which DataNode each block will be stored on, spreading the data as evenly as possible according to each DataNode's storage usage; the hdfs client then uploads the blocks to those DataNodes. When a DataNode receives the data, it stores the blocks under different file directories, building a certain hierarchy rather than putting them all in one folder.

When a DataNode starts, it also scans its local data directories and reports to the NameNode the list of blocks it holds.
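
From the client's point of view, the splitting and placement described above happen behind a single call. A minimal upload sketch in Java, with placeholder addresses and hypothetical paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadLargeFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // One client call; under the hood the file is split into blocks
        // (128 MB by default in 2.x) and each block is streamed to the
        // DataNodes chosen by the NameNode
        fs.copyFromLocalFile(new Path("/tmp/big-file.dat"),   // hypothetical local file
                             new Path("/data/big-file.dat")); // destination in HDFS
        fs.close();
    }
}
```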

Replica-based fault tolerance

Storing a large file as split blocks solves the storage problem, and there is nothing wrong with that. But there is another problem: if a machine goes down, say from excessive resource consumption, a crash, or a restart, the blocks stored on that machine can no longer be accessed. A file originally split into 10 blocks now has one faulty block and only 9 healthy ones, so when a client reads the file, the data is incomplete.

The replica mechanism

So how do we tolerate all these kinds of machine failures within a cluster? The answer is the replica mechanism.

HDFS has a parameter called the replication factor: every block is replicated for you and the copies are placed on different machines. One machine going down then has no impact, because another machine holds an exactly identical replica.

When HDFS writes a file, each block has 3 replicas by default. The NameNode first selects 3 DataNodes, one block replica per DataNode, and returns them to the client. The client transmits the block to the first DataNode; after the first DataNode writes the block locally, it copies the block to the second DataNode; after the second DataNode writes it locally, it copies the block to the third DataNode.
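
The replication factor can be set cluster-wide or adjusted per file. A hedged Java sketch, with a placeholder NameNode address and a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address
        // Cluster-wide default replication factor (3 is already the HDFS default)
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // The replication factor can also be changed per file after the fact
        fs.setReplication(new Path("/data/big-file.dat"), (short) 2);
        fs.close();
    }
}
```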

Rack Awareness

Machines in an HDFS cluster communicate with one another over the network, and they may sit on different racks. Data transfer between machines on the same rack is definitely faster than between machines on different racks.

A block usually has 3 replicas by default; the NameNode places 2 replicas on machines on one rack and 1 replica on a machine on another rack.

Transfers within the same rack are very fast. The other point is that even if all the machines on one rack go down, a machine on the other rack still holds a replica of the block, so service can continue normally.
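
For the NameNode to be rack-aware at all, it must be told which rack each node belongs to, usually via a topology script. A hedged sketch: the script path is hypothetical, and the property normally lives in core-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A script (hypothetical path) that maps a DataNode's IP/hostname to a
        // rack id such as /rack1; the NameNode consults it for replica placement
        conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    }
}
```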

Safe mode

When the NameNode has just started, it enters a mode called safe mode; in this mode, the HDFS cluster does not replicate blocks.

At this point the NameNode waits to receive heartbeats and block reports from each DataNode, then looks at the overall block situation in the cluster and at how many replicas each block has. The default is 3 replicas; if a block has its 3 replicas, it is fine, it is safe.

If a certain percentage (say 80%) of the blocks have their full 3 replicas, the NameNode exits safe mode.

If that ratio has not been reached, for example only 50% of the blocks have 3 replicas, the NameNode remains in safe mode.

If at this point it finds that some block has too few replicas (for example, only 2), the NameNode instructs the DataNodes to replicate it until there are enough copies.
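
The percentage that lets the NameNode leave safe mode is configurable. A hedged sketch: the 0.8 value mirrors the 80% figure used above, while the actual shipped default is higher, and the property normally lives in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class SafeModeConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fraction of blocks that must reach their minimal replication
        // before the NameNode may leave safe mode
        conf.setFloat("dfs.namenode.safemode.threshold-pct", 0.8f);
    }
}
```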

The federation mechanism

All file, directory, and configuration information in HDFS is stored in the NameNode. When an HDFS cluster reaches tens of thousands or hundreds of thousands of machines, the amount of data is enormous and the metadata may be hundreds of gigabytes in size; keeping it all on a single NameNode is clearly not feasible.

To solve this problem, HDFS developed the federation mechanism: a cluster has multiple NameNodes, and each NameNode stores part of the metadata. No matter how much the metadata grows, the NameNode machines can be scaled horizontally to cope. In reality, many companies simply don't use the federation mechanism; they are nowhere near that scale.

Under the federation mechanism, each active NameNode stores part of the metadata, and a standby NameNode is attached to each active NameNode. This solves both the horizontal-scaling problem and the high-availability problem.

Although there are multiple NameNodes, there is still just one set of DataNodes; blocks are stored in a single DataNode cluster. The difference is that each DataNode reports its heartbeat and block report to all of the NameNodes.
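
A federation setup is declared through configuration, roughly as sketched below. The nameservice names, hostnames, and ports are hypothetical, and in practice these keys live in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Two independent namespaces, each served by its own NameNode
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2:8020");
        // Every DataNode in the shared pool sends heartbeats and block
        // reports to both NameNodes; no extra DataNode-side key is needed
    }
}
```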



**May you taste the fireworks and still believe that the world is worth it!**

Original link: https://juejin.cn/post/6915947365648039950
