Firestorm 0.2.0 Released: The First Open Source Remote Shuffle Service Supporting Hybrid Storage

1. Background

Since Firestorm released open source version 0.1.0 in November 2021, the project has received extensive attention from the industry. Firestorm is an important component for accelerating the cloud migration of distributed computing engines, and it also solves task failures caused by abnormal shuffle processes in large-shuffle scenarios. (For more detailed background, please refer to this article: [Firestorm - Practice of Tencent's Self-developed Remote Shuffle Service in Spark Cloud Native Scenarios])

Firestorm has now officially released version 0.2.0, making it the first open source Remote Shuffle Service solution that supports hybrid storage. This article focuses on the new features and performance analysis of Firestorm 0.2.0.

2. New feature of the version: support for hybrid storage

What is Hybrid Storage

In the initial version of Firestorm, Shuffle data could only be stored on the Shuffle Server's local disk or in a distributed storage system. Hybrid storage makes full use of the Shuffle Server's memory, combined with local files and distributed storage, so that Shuffle data can be stored across multiple media.

Why you need hybrid storage

In actual production, Shuffle data block sizes are highly inconsistent: the smallest blocks are only a few KB, or even dozens of bytes, while the largest can exceed 256MB. Such a workload is very unfriendly to distributed storage such as HDFS, where writing a large number of small data blocks makes the cluster respond too slowly, seriously affecting the efficiency of computing tasks.

Using the Shuffle Server's local disks can alleviate this problem, but then the Shuffle Server must have a large amount of disk space to hold PB-level Shuffle data, and such a binding is not suitable for today's cloud-native environments. At the same time, during shuffle writes, a task must wait for data to reach storage before proceeding to the next step, so when the storage is busy, task performance suffers greatly. To solve these problems, a hybrid storage solution combining memory, local files and distributed storage naturally emerges.

Hybrid storage implementation principle

Taking Spark as an example, let's first see how the single-storage-based solution reads and writes Shuffle data:

In the write process shown above, Shuffle data is computed and cached in steps 1, 2 and 3, sent to the Shuffle Server in step 4, cached and aggregated in steps 5 and 6, and finally written to the storage medium in step 7. After all tasks complete, a Commit command is sent to the Shuffle Server; the last task must wait for the relevant data to be written to storage before it can finish. This wait for the storage write after the Commit operation has a significant impact on overall task performance.
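The commit semantics described above can be sketched with a small model. This is an illustrative sketch only; the class and method names are assumptions, not Firestorm's actual API:

```python
# Minimal model of the single-storage write path: receive (steps 4-6),
# flush (step 7), and a blocking commit that must wait for all cached
# data to be persisted before the last task can finish.
class ShuffleServer:
    def __init__(self):
        self.cache = {}    # partition -> aggregated records (steps 5-6)
        self.storage = {}  # persisted data (step 7)

    def receive(self, partition, records):
        # Steps 5/6: cache and aggregate incoming blocks per partition.
        self.cache.setdefault(partition, []).extend(records)

    def flush(self, partition):
        # Step 7: write aggregated data to the storage medium.
        self.storage.setdefault(partition, []).extend(self.cache.pop(partition, []))

    def commit(self):
        # Single-storage scheme: the last task blocks here until every
        # cached partition has been written to storage.
        for partition in list(self.cache):
            self.flush(partition)
        return not self.cache

server = ShuffleServer()
server.receive("p0", ["a", "b"])  # step 4: a task sends shuffle data
assert server.commit()            # commit waits until data is on storage
assert server.storage["p0"] == ["a", "b"]
```

In the hybrid storage scheme described later, this blocking commit step is removed entirely.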

After writing completes, the read process is relatively simple: depending on the storage medium, data can be read from the Shuffle Server or directly from distributed storage. With the previous solution in mind, let's look at how hybrid storage is implemented:

Compared to before, there are 3 major changes:

1. First, the Flush scheme in step 5 is optimized. Previously, data was written to the storage medium whenever a single Shuffle Partition or the entire cache space reached a threshold. Now the cache space has high and low watermarks: when usage reaches the high watermark, Flush operations run until it drops below the low watermark, and Partitions with larger amounts of data are flushed first. The high watermark ensures the cache still has enough room to accept new data during the Flush, while the low watermark and the Flush selection let Shuffle data from small partitions stay resident in memory, reducing the probability of writing small files to storage.

2. Second, step 7 is refactored to choose the storage medium based on the size of the written data block; for example, blocks larger than 32MB are written to distributed storage while the rest go to local storage. This strategy better matches the write pattern of distributed storage and achieves better write performance. Observation of actual workloads also shows that although large data blocks are a minority of blocks (e.g. 30%), they account for the majority of the total data volume (e.g. 70%). Such a storage scheme reduces the reliance on local disk capacity, making Firestorm easier to deploy in various environments, even on the cloud.

3. Finally, the Commit operation in step 8 is removed. The purpose of Commit was to guarantee that data is readable when reading begins. Since memory is now part of the hybrid storage, and the Shuffle Server can guarantee (when the storage medium is healthy) that shuffle data is either in memory or on the storage medium, the Commit operation is no longer needed. As shown in the figure below, the BufferManager contains multiple Buffers, each storing the Shuffle data of a single Partition in CachedData. When the BufferManager reaches the high watermark, data in CachedData is moved to InFlushData until the storage write completes, while CachedData continues to accept new Shuffle data. This strategy ensures that Shuffle data is never lost before being written to storage.
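Changes 1 and 2 above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Firestorm's actual implementation; the function names and the example numbers are invented for clarity, and only the 32MB threshold and the 25%/75% watermarks mirror the documented configuration:

```python
# Change 2: block-size-based storage selection.
# Mirrors rss.server.flush.cold.storage.threshold.size (32m by default here).
COLD_STORAGE_THRESHOLD = 32 * 1024 * 1024

def choose_storage(block_size):
    # Large blocks go to distributed storage (sequential-friendly writes);
    # small blocks stay on the Shuffle Server's local disk.
    return "HDFS" if block_size > COLD_STORAGE_THRESHOLD else "LOCALFILE"

# Change 1: watermark-driven Flush that prefers large partitions.
def select_flush(partition_sizes, capacity, high_pct=75.0, low_pct=25.0):
    # Once cache usage crosses the high watermark, flush the largest
    # partitions first until usage drops below the low watermark, so
    # small partitions stay resident in memory.
    used = sum(partition_sizes.values())
    if used < capacity * high_pct / 100:
        return []  # below the high watermark: nothing to flush
    to_flush = []
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        if used <= capacity * low_pct / 100:
            break
        to_flush.append(pid)
        used -= size
    return to_flush

# A 100-unit cache at 80% usage: the two biggest partitions are flushed,
# the small ones stay in memory.
print(select_flush({"p0": 50, "p1": 20, "p2": 6, "p3": 4}, capacity=100))
# -> ['p0', 'p1']
print(choose_storage(64 * 1024 * 1024))  # -> HDFS
```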

The following figure shows how data flows through the Shuffle Server's memory areas. Before writing, memory is requested and occupies the PreAllocation area; once data is received, the usage moves to the CachedData area, then to the InFlushData area during Flush, and finally the memory is released after the data is written to storage.
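The memory-area transitions just described can be modeled as a small accounting sketch. The class and method names below are assumptions for illustration, not Firestorm's actual API:

```python
# Model of the Shuffle Server memory areas:
# PreAllocation -> CachedData -> InFlushData -> released after the storage write.
class MemoryAccounting:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pre_allocated = 0
        self.cached = 0
        self.in_flush = 0

    def pre_allocate(self, size):
        # The client reserves space before sending data; refuse if the
        # total across all three areas would exceed capacity.
        if self.pre_allocated + self.cached + self.in_flush + size > self.capacity:
            return False
        self.pre_allocated += size
        return True

    def on_data_received(self, size):
        # Reserved space becomes cached Shuffle data.
        self.pre_allocated -= size
        self.cached += size

    def on_flush_started(self, size):
        # Data being written moves to the in-flush area; CachedData can
        # keep accepting new blocks in the meantime.
        self.cached -= size
        self.in_flush += size

    def on_flush_finished(self, size):
        # Storage write done: release the memory.
        self.in_flush -= size

m = MemoryAccounting(capacity=100)
assert m.pre_allocate(40)
m.on_data_received(40)
m.on_flush_started(40)
m.on_flush_finished(40)
assert (m.pre_allocated, m.cached, m.in_flush) == (0, 0, 0)
```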

With the write process understood, the changes to the read process are easier to follow. Compared with the previous single-storage read scheme, a hybrid-storage read fetches Shuffle data from Shuffle Server memory, Shuffle Server local storage and distributed storage.

Advantages of Hybrid Storage

The problems solved by hybrid storage and its implementation have been introduced above. To summarize, hybrid storage brings the following benefits:

1. Selecting the storage medium based on the size of the written data block improves DFS write performance.
2. Reduced dependence on Shuffle Server local disk capacity makes deployment in cloud-native environments easier.
3. Less data is written to the Shuffle Server's local disks; when SSDs are used as local storage, this extends SSD lifetime and reduces storage costs.
4. Using memory as a storage tier improves computing task performance.

How Hybrid Storage is Used

Since the memory Flush policy has changed, the Shuffle Server introduces the following configurations:

# Low watermark percentage, relative to rss.server.buffer.capacity
rss.server.memory.shuffle.lowWaterMark.percentage 25.0

# High watermark percentage, relative to rss.server.buffer.capacity
rss.server.memory.shuffle.highWaterMark.percentage 75.0

Currently supported hybrid storage types are:

Shuffle Server side:

Note: for hybrid storage with local files and HDFS, you also need to set rss.server.flush.cold.storage.threshold.size, the threshold on the size of a single write: data larger than this value is written to HDFS, and the rest is written to local files.

rss.storage.type MEMORY_LOCALFILE_HDFS

rss.storage.basePath /path1,/path2

rss.server.hdfs.base.path hdfs://ip:port/path

rss.server.flush.cold.storage.threshold.size 32m

Spark Client side:

spark.rss.storage.type MEMORY_LOCALFILE_HDFS

spark.rss.base.path hdfs://ip:port/path

Support data filtering

When reading shuffle data, all metadata (BlockId, TaskId, Length, etc.) is read first, and then the shuffle data is read based on that metadata. Shuffle data from distributed computing tasks can contain redundancy, for example from Spark's speculative execution. To reduce invalid reads and make more reasonable use of system resources, a filtering step has been added when reading Shuffle data. The optimized scenarios are:

1. Spark AQE, which needs to read data from specified upstream tasks.
2. Redundant data generated by Spark speculative execution.
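The filtering step can be sketched as a pass over the metadata before any data is fetched. This is an illustrative sketch; the tuple layout and task-id format are assumptions, not Firestorm's actual wire format:

```python
def filter_blocks(block_meta, expected_task_ids):
    # block_meta: (block_id, task_id, length) tuples read from the index.
    # Blocks written by unexpected attempts (e.g. losing speculative-execution
    # duplicates) are dropped here, so their data is never read at all.
    return [m for m in block_meta if m[1] in expected_task_ids]

meta = [
    (1, "task-0.0", 1024),
    (2, "task-0.1", 1024),  # speculative duplicate of task-0.0
    (3, "task-1.0", 512),
]
# Only blocks from the committed attempts survive the filter.
print(filter_blocks(meta, {"task-0.0", "task-1.0"}))
# -> [(1, 'task-0.0', 1024), (3, 'task-1.0', 512)]
```

The same mechanism serves Spark AQE: the expected set is simply narrowed to the upstream tasks a reader actually needs.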

Other features

In addition to the main features above, this version includes the following changes:

1. Added support for more Spark versions; Spark 2.3, Spark 2.4, Spark 3.0 and Spark 3.1 are currently supported

2. Optimized the Shuffle data read strategy: the Index file is read first, then the Data file

3. Added gRPC-related metrics

4. Fixed known bugs

3. Version performance test

Since the new version makes major changes to the storage architecture, the relevant performance test details follow.

Test Environment

Hardware Environment

1. Each server has 176 cores, 256GB memory, 12 * 4TB HDDs, and 10GB/s network bandwidth

2. Hadoop Yarn cluster: 1 * ResourceManager + 6 * NodeManager, with 10 * 4TB HDDs to write temporary data

3. Firestorm cluster: 1 * Coordinator + 6 * Shuffle Server, with 10 * 4TB HDDs to write Shuffle data

Software Environment

1. Hadoop version 2.8.5

2. Spark version 2.4

3. Spark related configuration:

spark.executor.instances 100

spark.executor.cores 4

spark.executor.memory 9g

spark.executor.memoryOverhead 1024

spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager

spark.rss.storage.type MEMORY_LOCALFILE

4. Firestorm Shuffle Server related configuration:

rss.storage.type MEMORY_LOCALFILE

rss.server.buffer.capacity 50g

Test scenarios:

Comparative performance tests were run on Spark native Shuffle, Firestorm 0.1.0 and Firestorm 0.2.0, based on TPC-DS with a data volume of 1TB. The results are as follows:

The test results show that Firestorm 0.2.0 improves by about 30% over the previous version, but has no advantage over Spark's native Shuffle. This result is in line with expectations, for the following reasons:

1. Even in the 1TB TPC-DS test, the amount of shuffle data per query is generally small, so the performance degradation caused by random disk reads and writes is negligible.

2. To avoid too many network connections in high-concurrency scenarios, there is only one RPC connection between each Executor and Shuffle Server, and the serial data-sending mode reduces performance.

3. After sending data, the client periodically checks whether the send succeeded; this check interval also adds to the task's performance overhead.

From a performance perspective, Firestorm's main advantage is reducing the performance loss caused by random storage reads and writes. Since the RPC implementation prioritizes stability and high-concurrency scenarios, it carries extra overhead compared to the native Shuffle solution; as a result, Firestorm performs worse than native Shuffle when the disks see no random IO. However, when the disks do experience random IO, Firestorm still has a performance advantage. To verify this conclusion, the 10 HDDs were reduced to 2, and query23a, which has a large amount of Shuffle data, was selected for testing. The results are as follows:

It is clear that when the number of HDDs drops from 10 to 2, the Shuffle Read performance of native Spark is seriously affected, with read time increasing 5 times, while Firestorm, for which random reads and writes are not a prominent problem, loses essentially no Shuffle Read performance.

Test scenario: TeraSort, a performance comparison between native Spark Shuffle and Firestorm on a 1TB dataset. The results are as follows:

Since the shuffle data volume is 500GB, the results show that even with 10 HDDs, the Shuffle Read degradation caused by native Spark's random disk reads is still very obvious, and either version of Firestorm far outperforms native Spark on Shuffle Read. For Firestorm 0.2.0, hybrid storage removes the need for the Commit operation, so there is no longer a wait for Shuffle data to be written to storage after the last task completes.

4. Summary

This article introduced a series of storage-side improvements in Firestorm 0.2.0, the most important being the hybrid storage function, which uses memory, local disk, remote storage and other resources to allocate storage more reasonably. Besides improving performance, it also reduces the dependence on local disks, enabling better deployment and use in cloud-native environments.

The project is open source at https://github.com/Tencent/Firestorm; contributions are welcome.
