How Hadoop Ozone uses the Multi-Raft mechanism to optimize DataNode throughput

Background

As a recent star project in the Hadoop community, Hadoop Ozone has attracted widespread attention. Born out of HDFS, it supports both file system and object semantics and natively provides HDFS and S3 access modes, while treating read/write performance and cluster throughput as top priorities. In mid-2019, the Tencent big data team began rolling out Ozone clusters to carry big data storage workloads, and the data lake team invested fully in the open source Hadoop Ozone project. Drawing on years of open source experience and hands-on work connecting business data platforms, and after in-depth cooperation with the Hadoop Ozone community and Cloudera, the data lake team found that Ozone's write performance showed certain fluctuations and bottlenecks. The team therefore developed new features and optimizations to address these issues, and contributed them back to the community. The Multi-Raft feature introduced in this article is the highlight.

 

Ozone's architecture retains the original HDFS DataNode data nodes, but splits the metadata-managing NameNode into the Ozone Manager (which handles user requests and responses) and the Storage Container Manager (which manages data pipelines, containers, and other resources). On the data node, the original block-based data management has been replaced by container-based management, and the lighter-weight Raft protocol is used to keep metadata and data consistent. The metadata of the Ozone Manager and Storage Container Manager is stored in RocksDB, which performs well and greatly improves the cluster's ability to scale out, architecturally breaking through the metadata bottleneck of the original HDFS. Standby metadata management nodes backed by Raft further enhance cluster availability.

Better metadata management also means data nodes can be utilized more fully. As a result, DataNode throughput optimization, a perennial topic on HDFS, has become a focus on Ozone clusters as well. The major Internet companies each have their own bag of tricks for optimizing HDFS DataNode throughput: one large company rewrote HDFS in C++, which not only expanded DataNode throughput but also allowed a single NameNode to manage more than 70,000 DataNodes; other vendors chose to optimize the Linux cache queuing mechanism, improving the efficiency of page allocation when Linux allocates memory at the bottom layer, which increases disk write bandwidth and thereby DataNode throughput.

However, these solutions are highly customized and internal to those companies, which creates new obstacles for later maintenance and for integrating new changes and upgrades from the open source community. Tencent, as the first large company to use Hadoop Ozone at scale, prioritized community-friendly strategies when choosing an optimization approach for Ozone.

Ozone uses the Raft protocol as its cluster consistency protocol. When writing data, the DataNode acting as Raft leader synchronizes the data to its followers to ensure multiple copies are written; when reading, any single copy is enough to return a result, giving strongly consistent read and write semantics. When considering how to tune DataNode performance, we focused our attention on the Raft pipeline. After a series of tests, we found that the way Ozone used data pipelines and Raft did not keep the disks running efficiently, so we designed the Multi-Raft feature, which lets a single data node carry as many Raft pipelines as possible, trading bandwidth for lower latency.

This article focuses on how Ozone uses the Multi-Raft solution to tune data node throughput and optimize write performance. Multi-Raft's design and implementation are also an important feature of the latest Ozone 0.5.0 release. The author also published an English-language technical write-up on the official blog of Cloudera, a major player in the open source community:

https://blog.cloudera.com/multi-raft-boost-up-write-performance-for-apache-hadoop-ozone/

Interested readers can take a look there as well.

 

Raft election mechanism

 

Let's first briefly introduce the Raft mechanism and how Ozone uses it to achieve data consistency for multi-copy writes. Consistency in distributed systems is a much-discussed topic. Because processing and computation are spread across many servers, the service state of each node must be coordinated and kept in agreement to ensure the system's fault tolerance, so that the failure of some nodes does not drag down the entire cluster. A strongly consistent distributed storage system like Ozone in particular requires a widely recognized protocol between nodes to guarantee state consistency. Ozone adopts the Raft protocol to maintain cluster state and replicate data.

There are three roles in the Raft protocol: Leader, Follower, and Candidate. An election selects the leader from among the candidates, and a follower can become a candidate to run for leader; if the vote is split so that no candidate wins a majority, a new election round begins.
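As an illustration (this is a toy model, not Ozone/Ratis code), the role transitions can be sketched as follows; real Raft elections also involve randomized timeouts and log-up-to-date checks:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Toy model of Raft role transitions."""
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.role = Role.FOLLOWER
        self.term = 0

    def election_timeout(self):
        # A follower that hears no heartbeat becomes a candidate,
        # starts a new term, and votes for itself.
        self.role = Role.CANDIDATE
        self.term += 1
        return 1  # one vote: its own

    def tally_votes(self, votes):
        # A candidate that wins a strict majority becomes leader;
        # otherwise the vote was split and another round follows.
        if votes > self.cluster_size // 2:
            self.role = Role.LEADER
        return self.role

node = RaftNode("dn1", cluster_size=3)
node.election_timeout()
print(node.tally_votes(votes=2))  # Role.LEADER: 2 of 3 is a majority
```

With a three-node pipeline, two votes are enough to elect a leader, which is why a single slow or failed node does not block the election.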

Raft Consistency Process

The Raft protocol essentially ensures consistency through the local RaftLog, which synchronizes state between the leader and the followers, while the StateMachine records the state of the cluster. RaftLog follows the write-ahead log (WAL) concept: the leader records each operation in its log, the log updates are synchronized to the followers, and each node then applies the log to its StateMachine to record the new state. Once a quorum of nodes has acknowledged success, the operation as a whole succeeds. This keeps the state machine changes of all nodes in the cluster relatively synchronized. On restart, a node replays its RaftLog to re-establish the StateMachine state it had before the restart, restoring the node to its previous, relatively synchronized state.
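The commit rule above can be sketched in a few lines (illustrative only, not the Ratis implementation): the leader appends an entry to its WAL, replicates it, and the entry is committed once a majority of the replicas have persisted it.

```python
def replicate(entry, reachable_logs):
    """Toy WAL replication: append the entry to every reachable
    replica's log and count acknowledgements (the leader's own
    append counts as one ack)."""
    acks = 0
    for log in reachable_logs:
        log.append(entry)   # in reality this is an RPC plus a disk sync
        acks += 1
    return acks

def is_committed(acks, replica_count):
    # Raft commits an entry once a strict majority has persisted it.
    return acks > replica_count // 2

# Three-replica pipeline: leader plus two followers,
# with one follower temporarily unreachable.
leader_log, follower1_log = [], []
acks = replicate("put key=k1", [leader_log, follower1_log])
print(is_committed(acks, replica_count=3))  # True: 2 of 3 is a majority
```

This is also why replay works after a restart: every committed entry is on disk in at least a majority of logs, so re-applying the log deterministically rebuilds the StateMachine.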

In the container-writing part of the Ozone write path, to provide the current three-replica redundancy, a container must allocate space on three DataNodes and ensure that a majority of them write successfully. The log mechanism of the Raft protocol guarantees that after the leader writes, the entry is replicated to the followers, and once a majority of the three replicas (the leader plus at least one follower) have written it, the data is successfully persisted. The "majority write succeeds" semantics can thus be handled directly by the Raft protocol, while strong read and write consistency is preserved.

 

Multi-Raft for write throughput optimization

 

Ozone uses the Raft mechanism to manage the write path for both data and metadata. Before data is written to disk on a DataNode, the leader node must use RaftLog replication to synchronize the data to the follower nodes. Since the Raft protocol relies on RaftLog as its write-ahead log, writes on the data node are also affected by RaftLog persistence latency. As the amount of data carried by the online environment increased, fluctuations and bottlenecks in Ozone's write performance became more and more obvious.

After a series of benchmark tests, we turned our attention to how RaftLog could be used to make disk writes on the DataNode more efficient. In the original design, each Raft group formed one pipeline in the SCM, controlling a series of data containers. Testing showed that the disk utilization of each DataNode was far from full. Combining this with Ozone's current implementation, we designed and developed the Multi-Raft feature, which allows the SCM to allocate multiple pipelines on each DataNode and use the disks and partitions on the DataNode more efficiently. At the same time, multiple Raft groups naturally improve the write concurrency of a single DataNode, which in turn optimizes Ozone's write throughput and performance.

 

Ozone's data write mode

 

Ozone's data management is mainly handled by pipelines and containers. Ozone's Storage Container Manager (SCM) is responsible for creating data pipelines and assigning containers to different pipelines; each container is responsible for allocating blocks on different DataNodes to store data. When a user writes to Ozone through the client, the key is created through the Ozone Manager, and writable space is obtained from the SCM; this space is allocated via a pipeline and a container.

How pipelines and containers allocate space and write data depends on the data's replica count; Ozone currently focuses on the three-replica mode. When a data pipeline is created, three data nodes are associated with it, and the relevant containers then allocate blocks on those three data nodes. When write space is allocated, the three data nodes each record the pipeline and container information and asynchronously report the pipeline and container status to the SCM. The three-replica data pipeline ensures multi-copy consistency through the Raft protocol, and is therefore also called a Ratis pipeline in Ozone. The three data nodes form a group of one leader and two followers according to the Raft protocol, and RaftLog is used to send data from the leader to the followers when data is written. The performance of Ozone's data writes therefore depends on the write throughput and transmission speed of RaftLog.
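To make the write path concrete, here is a simplified model of the allocation flow. Names like `MiniSCM` and `allocate_block` are illustrative only and do not reflect Ozone's actual API:

```python
import itertools

class Pipeline:
    """A three-replica Ratis pipeline: one leader and two followers."""
    def __init__(self, pipeline_id, datanodes):
        assert len(datanodes) == 3, "a three-replica pipeline needs exactly 3 nodes"
        self.pipeline_id = pipeline_id
        self.leader, *self.followers = datanodes

class MiniSCM:
    """Toy SCM: groups DataNodes into pipelines of three and hands
    out block allocations on a pipeline."""
    def __init__(self, datanodes):
        self.pipelines = []
        # Single-Raft style grouping: each node joins at most one pipeline.
        groups = itertools.zip_longest(*[iter(datanodes)] * 3)
        for i, group in enumerate(groups):
            if None not in group:  # leftover nodes cannot form a pipeline
                self.pipelines.append(Pipeline(i, list(group)))

    def allocate_block(self, key):
        # Pick a pipeline; the container on that pipeline places the
        # block on all three of its member nodes.
        pipe = self.pipelines[hash(key) % len(self.pipelines)]
        return {"key": key, "pipeline": pipe.pipeline_id,
                "replicas": [pipe.leader] + pipe.followers}

scm = MiniSCM(["dn1", "dn2", "dn3", "dn4"])
print(len(scm.pipelines))  # 1: the fourth node cannot form a pipeline alone
print(scm.allocate_block("k1")["replicas"])  # ['dn1', 'dn2', 'dn3']
```

Note how, under this Single-Raft style grouping, `dn4` never receives any writes; this is exactly the standby-node waste that the Multi-Raft section below addresses.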

 

Optimization exploration on throughput

 

In Ozone 0.4.0, pipeline-controlled data writing was based on a Single-Raft implementation, that is, each data node could participate in only one pipeline. To observe and test the write performance of this version, we assembled a 10-node Ozone cluster, with 12 HDD disks per data node and a 10 Gbps network, configured as external storage for big data. For access we chose the S3 protocol to write to Ozone. The object sizes are fairly scattered, consisting of the results of internal SQL queries: mostly small KB-level objects mixed with large GB-level objects.

From the latency distribution in the figure, we can see that 68% of writes complete within 200 milliseconds, but more than 27% of files still take 2-3 seconds, and we observed no direct relationship between latency and file size. In addition, we also observed the disk usage of a data node.

Under continuous concurrent foreground writes, only 5 of the 12 data disks carried any load, concentrated mainly on 3 of them. Through log mining and statistics we also observed IO blocking: RaftLog writes on the data node were likely queuing up for the disk. So we focused on improving the disk utilization of data nodes and alleviating the queuing of Ratis's RaftLog persistence.

On community HDFS, DataNode IO throughput is likewise not high. Earlier solutions changed the way Linux operates the file system, adding cache-like mechanisms so that the operating system helps increase the DataNode's disk write throughput. Ozone's optimization takes a different approach: multiplex data pipelines across nodes to increase the number of pipelines each node serves, and thereby raise disk utilization by increasing the number of concurrent users of the disks.

 

Design Ideas of Multi-Raft

 

If each DataNode is restricted by Single-Raft to participate in only one data pipeline, then the disk usage observations above show that the RaftLog implemented by Ratis cannot spread write IO evenly across every disk path.

After discussions with the Ratis community, we proposed the Multi-Raft solution, which allows data nodes to participate in multiple data pipelines. At the same time, Multi-Raft needs a new pipeline allocation algorithm to ensure data isolation, taking data locality and the rack awareness of nodes into account.

After a DataNode is allowed to participate in multiple data pipelines, pipeline node allocation changes considerably:

  • SCM can allocate more data pipelines without increasing the size of the cluster. Assuming each node can be configured to serve at most M data pipelines, with N data nodes SCM can allocate at most N/3 pipelines under Single-Raft, but up to (M*N)/3 pipelines under Multi-Raft.
  • Because of three-copy backup, a data pipeline needs exactly 3 data nodes to join in order to be created successfully. Multi-Raft can make reasonable use of every data node's capacity to participate in pipelines. As shown in the figure, with 4 data nodes a Single-Raft cluster can create only 1 data pipeline, and the remaining 1 data node (4-3*1) can only sit as standby; with Multi-Raft, all four nodes can participate in data pipeline allocation, which greatly improves the flexibility of cluster configuration.
  • The Single-Raft disk usage shows that a single data pipeline's RaftLog cannot spread writes into a good load balance. Working with the Ratis community, the Multi-Raft solution can effectively utilize the service capacity on each node: instead of 12 disks being fed by one RaftLog, the 12 disks are fed by multiple RaftLogs. With Ratis's batch writing and queuing mechanism, one write may hold a given disk for a while, so later writes queue up because of the limited number of RaftLogs. This is what we saw in the Single-Raft cluster, where more than 27% of writes took 2-3 seconds; having multiple RaftLogs helps reduce queue congestion because there are more "queues".
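The pipeline-count arithmetic from the first two points can be sketched as follows (a toy calculation, not SCM's actual allocator, which also honors rack awareness and data isolation):

```python
def max_pipelines(num_nodes, pipelines_per_node=1, replication=3):
    """Upper bound on three-replica pipelines: each pipeline consumes
    one pipeline slot on each of its `replication` member nodes."""
    total_slots = num_nodes * pipelines_per_node
    return total_slots // replication

# Single-Raft: 4 nodes, 1 pipeline slot each -> only 1 pipeline,
# leaving one node idle as standby.
print(max_pipelines(4, pipelines_per_node=1))    # 1

# Multi-Raft with M = 3 slots per node: the same 4 nodes can serve
# up to (3*4)//3 = 4 pipelines, so every node participates.
print(max_pipelines(4, pipelines_per_node=3))    # 4

# The 10-node test cluster: 3 pipelines under Single-Raft,
# up to (10*12)//3 = 40 with M = 12.
print(max_pipelines(10, pipelines_per_node=1))   # 3
print(max_pipelines(10, pipelines_per_node=12))  # 40
```

More pipelines per node means more independent RaftLogs, i.e. more "queues" draining writes to the disks in parallel.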

 

Multi-Raft optimized write performance

 

First, let's compare the latency distributions of the same big data dataset written to the Multi-Raft cluster and the Single-Raft cluster:

For comparison, the Single-Raft cluster:

It can be seen that in the Multi-Raft-optimized cluster, nearly 98% of writes of the same data set complete within 200ms, and the latency fluctuations caused by IO queuing are almost gone.

Then we select a data node and observe its disk usage as the number of pipelines allocated on the node changes:

When a single data node participates in one data pipeline:

When a single data node participates in 12 data pipelines:

When a single data node participates in 15 data pipelines:

It can clearly be seen that as the number of data pipelines on the data node increases, more and more disks reach the busy level, and the load is no longer unevenly split between busy and idle disks.

 

Summary

 

The disk-write portion of Ozone's data nodes inherits, to some extent, HDFS's single-node write approach, and with it some parts that remain to be optimized. Compared with optimizing at the IO stack level on data nodes, the Multi-Raft solution exploits node multiplexing of data pipelines and the characteristics of RaftLog to increase each data node's participation in concurrent writes. Trading a modest amount of metadata and management overhead for higher write bandwidth is a better choice for the open source community, and leaves more optimization points open for the future. Write performance of the current Ozone 0.5.0 release on HDD clusters is also significantly improved over 0.4.

The Multi-Raft features have been fully contributed back to the Apache Ozone community, where they have been well received. A technical write-up on Multi-Raft has also been published on the Cloudera Technology Blog:

https://blog.cloudera.com/multi-raft-boost-up-write-performance-for-apache-hadoop-ozone/

Compared with other solutions that optimize single-DataNode performance, the Multi-Raft solution is friendlier to community integration and leaves room for further in-depth tuning, making it a better fit for Apache Hadoop Ozone, a project that is still growing rapidly.

Follow the official account "Tencent Big Data" for more content.
