Firestorm: Tencent's self-developed Remote Shuffle Service in Spark cloud-native scenarios


Background

Shuffle is the data redistribution process that distributed computing frameworks use to connect upstream and downstream tasks. In distributed computing, any step that passes data from upstream tasks to downstream tasks can be understood as a shuffle. Across different distributed frameworks, shuffle has two main implementation forms:

  1. File-based, pull-based shuffle, as in MapReduce and Spark. This form is mostly used in MR-style frameworks. It is characterized by high fault tolerance and suits large-scale batch jobs: because shuffle data is persisted to files, only the failed tasks and stages need to be rerun on failure rather than the entire job.
  2. Pipeline-based, push-based shuffle, as in streaming frameworks such as Flink and Storm, and in some MPP frameworks such as Presto and Greenplum. It is characterized by low latency and high performance, but it has a significant drawback: because the shuffle data is not persisted, a task failure forces the entire job to be rerun.

Shuffle is one of the most important parts of a distributed framework; its performance and stability directly affect the performance and stability of the whole framework. Improving the shuffle framework is therefore well worth the effort.

Business pain points

Challenges of Spark in cloud-native scenarios

The local-disk-based shuffle mechanism severely limits how Spark can be used in cloud-native, storage-compute-separated, and online/offline co-located environments:

  1. In a cloud-native environment, serverless deployment is a common goal. However, because of elasticity and preemption, it is normal for nodes or containers to be preempted and for executors to be killed. The existing shuffle mechanism prevents computation from being truly serverless: when a node or container is preempted, the shuffle data usually has to be recomputed, which is costly.
  2. Online-service clusters usually have only a few local disks but many CPU cores, so their compute and IO capacities are unbalanced. In such clusters, scheduling jobs purely by compute capacity makes it very easy to fill up the disks.
  3. More and more data-center architectures adopt storage-compute separation. Under this deployment model, the first problem local-disk-based shuffle runs into is that there is no local disk to hold the shuffle data; and although local storage can be emulated with block storage (e.g., RBD), shuffle's IO access pattern makes block storage incur large network overhead and performance problems.

Challenges of Spark in Production

Most batch jobs on our distributed computing platform are Spark jobs, with a minority being MR jobs. Compared with MR jobs, Spark jobs are less stable, and at least half of the stability problems are caused by shuffle failures.

Shuffle failures trap tasks in retries, which seriously slows down jobs. When a shuffle fetch fails, the map tasks are rerun to regenerate the shuffle data, and then the reduce tasks are rerun; if the reduce tasks keep failing, the map tasks have to be rerun again and again. When the cluster is under heavy load, the cost of these reruns is high and seriously affects job completion.

Shao Zheng left a relevant comment on SPARK-1529:

https://issues.apache.org/jira/browse/SPARK-1529

It is very difficult to run jobs with very large shuffle data (TB-level shuffle volume and above) smoothly. The problems are:

  1. Shuffle data can easily fill up the disks. This can only be mitigated by repeatedly tuning and retrying so that the executors are spread over as many nodes as possible (anti-affinity).
  2. A large number of shuffle partitions leads to a huge number of shuffle connections, which makes the shuffle framework extremely prone to timeouts and to problems caused by heavy random IO.

The local-disk-based shuffle method also suffers from serious write amplification and random IO. When the number of tasks reaches 10K or even exceeds 100K, random IO becomes severe and significantly degrades the performance and stability of the cluster.

Therefore, it is particularly important to implement a better shuffle framework that can solve the above-mentioned business pain points.

Industry Trends

The industry has explored shuffle optimizations[1] for many years, with each company building capabilities around its own business scenarios. Below is a summary of the work major companies have done on shuffle.

Baidu DCE shuffle

Baidu DCE shuffle is one of the earliest remote shuffle service solutions to be used at large scale in the industry. It was originally built to solve two problems: co-locating online and offline workloads, and improving the stability and processing scale of MR jobs. Baidu's internal MR jobs were migrated to DCE shuffle years ago, and Spark batch jobs have since been migrated to use DCE shuffle as their shuffle engine as well.

Facebook Cosco Shuffle[2]

Facebook Cosco Shuffle's original motivation is very close to Baidu's. Facebook's data centers are built with storage and compute separated, so the traditional shuffle method based on local files carries a large overhead. In addition, Facebook's largest jobs reach 100 TB, which poses a great challenge for shuffle, so Facebook implemented an HDFS-based remote shuffle service: Cosco Shuffle.

Google Dataflow Shuffle[3]

Google Dataflow Shuffle is Google's shuffle service on Google Cloud. To cope with the elastic and volatile environment on the cloud, Google developed the Dataflow Shuffle service for Google Cloud's big data offerings. Dataflow Shuffle is also a remote shuffle service: it moves shuffle storage outside the VM, giving compute jobs greater elasticity.

Uber Zeus [4]

To solve the shuffle pain points described above, Uber also implemented its own Remote Shuffle Service, Zeus, which has been open sourced. Judging from the design documents and implementation, Uber deploys multiple Shuffle Servers to receive and aggregate shuffle data, and uses SSDs as the storage medium to improve shuffle performance.

Ali ESS [5]

Alibaba's ESS (EMR Remote Shuffle Service) is mainly intended to solve the compute-storage separation problem faced by Spark on Kubernetes, so that Spark can adapt to the cloud-native environment.

Business value

Implementing a Remote Shuffle Service brings several kinds of business value:

  • Support for cloud-native architectures: existing distributed computing frameworks (such as Spark, which relies on local disks to store shuffle data) greatly limit cloud-native deployment models. A Remote Shuffle Service effectively reduces the dependence on local disks, supports multiple cluster deployment modes, improves resource utilization, and eases the move to a cloud-native architecture.
  • Improved shuffle stability for Spark jobs: jobs whose shuffle data reaches the TB or even 10 TB level put enormous pressure on disk space, and the large data volume also stresses the network, which ultimately leads to high failure rates. A Remote Shuffle Service solves these problems well, so the business side can run such jobs smoothly.

Description of Firestorm

Goals

Millions of Spark tasks run at Tencent every day, and the shuffle problems described above are encountered frequently. At the same time, in order to make better use of hardware resources, the storage-compute-separated deployment model is being rolled out gradually. We therefore developed Firestorm, with the following project goals:

  • Support for jobs with large shuffle volumes (e.g., TeraSort 40 TB+)
  • Support for cloud-native deployment modes (e.g., storage-compute-separated deployments)
  • Support for multiple storage systems (LocalFile, HDFS, COS, etc.)
  • Support for data integrity checks
  • Performance close to the compute engine's native shuffle

Architecture Design

The architecture of the Remote Shuffle Service is as follows:

The functions of each component are as follows:

  • The Coordinator manages the Shuffle Servers through a heartbeat mechanism and stores metadata such as each Shuffle Server's resource usage. It is also responsible for task assignment: based on the Shuffle Servers' load, it assigns suitable Shuffle Servers to a Spark application to handle different partitions (a simplified sketch of this assignment logic follows the list).
  • The Shuffle Server is mainly responsible for receiving shuffle data, aggregating it, and writing it to storage. Depending on the storage mode, it can also serve shuffle data reads (for example, in LocalFile storage mode).
  • The Shuffle Client is mainly responsible for communicating with the Coordinator and Shuffle Servers, sending read and write requests for shuffle data, and maintaining the application's heartbeat with the Coordinator.
  • The interaction between the Shuffle Server and storage is decoupled through the Storage Handler component; based on this component, different storage backends can be plugged in flexibly to meet various storage requirements.
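
To make the Coordinator's role concrete, here is a minimal sketch of load-based assignment. The class and method names (ServerNode, assignServers, loadScore) and the load formula are hypothetical illustrations, not Firestorm's actual API: the idea is simply to pick the least-loaded Shuffle Servers reported via heartbeat and spread an application's partitions across them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Coordinator's assignment step: choose the least-loaded
// Shuffle Servers (as reported by heartbeats) and map each partition to one of them.
class ServerNode {
    final String id;
    final long usedMemoryBytes;   // reported in the latest heartbeat
    final int activePartitions;

    ServerNode(String id, long usedMemoryBytes, int activePartitions) {
        this.id = id;
        this.usedMemoryBytes = usedMemoryBytes;
        this.activePartitions = activePartitions;
    }

    // Toy load score; a real implementation would weigh more metrics.
    long loadScore() {
        return usedMemoryBytes + (long) activePartitions * 1024 * 1024;
    }
}

class CoordinatorSketch {
    // Assign each partition of an application to one of the N least-loaded servers.
    static Map<Integer, String> assignServers(List<ServerNode> aliveServers,
                                              int numPartitions,
                                              int serversPerApp) {
        List<ServerNode> sorted = new ArrayList<>(aliveServers);
        sorted.sort(Comparator.comparingLong(ServerNode::loadScore));
        List<ServerNode> chosen = sorted.subList(0, Math.min(serversPerApp, sorted.size()));

        Map<Integer, String> partitionToServer = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            // Round-robin over the chosen servers so partitions are spread evenly.
            partitionToServer.put(p, chosen.get(p % chosen.size()).id);
        }
        return partitionToServer;
    }
}
```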

Architecture design differences

Compared with other solutions in the industry, Firestorm has its unique features:

  • In terms of architecture, the Coordinator component is introduced to manage the Shuffle Servers and to assign shuffle workloads sensibly based on each Shuffle Server's status. The cluster itself also supports flexible horizontal scaling to meet production needs.
  • In terms of technology, the storage module is decoupled, so supporting a new shuffle storage backend only requires implementing the relevant interfaces. For data verification, the most critical part of the whole system, read/write consistency checks are added on top of CRC checks, data deduplication, and other mechanisms, making data transmission safer and more reliable.
  • In terms of operations, Firestorm exposes various operational metrics and is connected to the internal monitoring platform, making it easy to observe the overall status of the cluster, understand performance bottlenecks, and receive timely alerts when something goes wrong.

Overall process

The overall Shuffle process based on Firestorm is as follows (an illustrative interface sketch follows the list):

  1. Driver gets allocation information from Coordinator
  2. Driver registers Shuffle information with Shuffle Server
  3. Based on the allocation information, the Executor sends the Shuffle data to the Shuffle Server in the form of Blocks
  4. Shuffle Server writes data to storage
  5. After the write task ends, the Executor updates the result to the Driver
  6. The read task obtains the successful write task information from the driver side
  7. Read tasks obtain Shuffle metadata (eg, all blockIds) from Shuffle Server
  8. Based on the storage mode, the read task reads Shuffle data from the storage side
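
For illustration, the client-facing parts of this flow (steps 1-3 and 7) can be expressed as a pair of interfaces. These interfaces and method names are hypothetical and only mirror the responsibilities listed above; they are not Firestorm's real client API.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical client-side view of the overall flow; names are illustrative only.
interface CoordinatorClient {
    // Step 1: the Driver asks the Coordinator which Shuffle Server handles each partition.
    Map<Integer, String> getPartitionAssignment(String appId, int shuffleId, int numPartitions);
}

interface ShuffleServerClient {
    // Step 2: the Driver registers the shuffle with the assigned Shuffle Servers.
    void registerShuffle(String appId, int shuffleId, List<Integer> partitionIds);

    // Step 3: Executors push shuffle data to the assigned server in the form of blocks.
    void sendShuffleBlocks(String appId, int shuffleId, int partitionId, byte[] serializedBlocks);

    // Step 7: read tasks fetch the shuffle metadata (e.g., all blockIds) before reading data.
    Set<Long> getShuffleBlockIds(String appId, int shuffleId, int partitionId);
}
```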

Write process

When writing Shuffle data, the design needs to consider efficient memory usage, asynchronous file writing, and merging of Shuffle data. The specific process is as follows (a simplified sketch of the client-side buffering follows the list):

  1. Task sends data to the corresponding Buffer based on PartitionId
  2. When the Buffer reaches the threshold, send the data of the Buffer to the data queue
  3. Continuously obtain data from the data queue and submit it to the sending thread
  4. The sending thread first requests memory space from the Shuffle Server, and then sends the data to the buffer of the Shuffle Server.
  5. After the Shuffle Server buffer reaches the threshold, it sends the Shuffle data to the write queue
  6. Continuously fetch data from the write queue and submit it to the write thread
  7. Obtain the storage path based on the Shuffle data information (ApplicationId, ShuffleId, PartitionId), and write the Shuffle data into the Index file and Data file
  8. After a task finishes writing, it notifies the Shuffle Server that it has completed and obtains the current number of completed tasks. If the number of completed tasks is below the expected value, proceed to the next step directly; if it has reached the expected value, send a command to the Shuffle Server to flush the buffered data for this shuffle to storage, wait for the write result, and proceed to the next step after it succeeds
  9. After the task is completed, the TaskId is recorded in MapStatus and sent to the Driver. This step is used to support the Spark speculative execution function.
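
A minimal sketch of steps 1-3 on the client side is shown below. The class name, threshold, and queue handling are assumptions made for illustration; the point is simply that records accumulate in per-partition buffers, and a buffer that crosses the threshold is handed to a send queue drained by the sending threads.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the write-side buffering (steps 1-3 above); names are hypothetical.
class WriteBufferManagerSketch {
    private final int bufferFlushThreshold;   // e.g., a few MB per partition
    private final Map<Integer, ByteArrayOutputStream> buffers = new HashMap<>();
    private final BlockingQueue<byte[]> sendQueue = new LinkedBlockingQueue<>();

    WriteBufferManagerSketch(int bufferFlushThreshold) {
        this.bufferFlushThreshold = bufferFlushThreshold;
    }

    // Step 1: the task routes each serialized record to the buffer of its partition.
    void addRecord(int partitionId, byte[] serializedRecord) throws IOException {
        ByteArrayOutputStream buffer =
            buffers.computeIfAbsent(partitionId, p -> new ByteArrayOutputStream());
        buffer.write(serializedRecord);

        // Step 2: once the buffer reaches the threshold, hand its data to the send queue.
        if (buffer.size() >= bufferFlushThreshold) {
            sendQueue.offer(buffer.toByteArray());
            buffer.reset();
        }
    }

    // Step 3: sending threads keep taking buffered data from the queue; in the real flow
    // they first request memory from the Shuffle Server and then push the data to it.
    byte[] takeNextForSending() throws InterruptedException {
        return sendQueue.take();
    }
}
```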

Read process

When reading Shuffle data, the main consideration is data integrity. The specific process is as follows (a simplified read sketch follows the list):

  1. Get all successful TaskIds in the Write stage from the Driver side
  2. To read the shuffle data, first read the Index file and check whether the expected BlockIds exist, then read the Data file based on the offset information in the Index file to obtain the shuffle data:
     • If the storage is HDFS, read directly from HDFS
     • If the storage is LocalFile, read the files through the Shuffle Server
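
The Index-then-Data read in step 2 can be sketched as follows. The field order and widths are taken from the "Shuffle file" section below and are assumptions for illustration, not Firestorm's exact on-disk layout; the filtering and CRC handling are likewise simplified.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Set;

// Illustrative read path: scan the Index file, keep only blocks written by successful
// tasks, then seek into the Data file by offset to fetch each block.
class ShuffleFileReaderSketch {
    static void readPartition(String indexPath, String dataPath,
                              Set<Long> expectedBlockIds,
                              Set<Long> successfulTaskIds) throws IOException {
        try (DataInputStream index = new DataInputStream(new FileInputStream(indexPath));
             RandomAccessFile data = new RandomAccessFile(dataPath, "r")) {
            while (index.available() > 0) {
                long blockId = index.readLong();
                long offset = index.readLong();
                long crc = index.readLong();
                int compressLength = index.readInt();
                int uncompressLength = index.readInt();
                long taskId = index.readLong();

                // Skip blocks from failed or redundant (speculative) task attempts.
                if (!expectedBlockIds.contains(blockId) || !successfulTaskIds.contains(taskId)) {
                    continue;
                }

                byte[] block = new byte[compressLength];
                data.seek(offset);
                data.readFully(block);
                // In the real flow the block's CRC would be verified against `crc` here,
                // then the block decompressed to `uncompressLength` bytes for the reduce task.
            }
        }
    }
}
```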

Shuffle file

Shuffle data is stored as an Index file and a Data file. The actual shuffle data is stored in the Data file in the form of blocks, while the Index file stores the metadata of each block. The fields are as follows (a sketch of the BlockId bit packing follows the list):

  • BlockId: the unique identifier of each block, a long in which the first 19 bits are an auto-incrementing sequence number, the middle 20 bits are the PartitionId, and the last 24 bits are the TaskId
  • Offset: The offset of the Block in the Data file
  • Crc: The Crc check value of the block. This value is calculated when the block is generated and finally stored in the index file. It is used to verify the data integrity when reading the block.
  • CompressLength: Block compressed data length
  • UnCompressLength: Block uncompressed data length, used to improve the decompression efficiency when reading
  • TaskId: used to filter invalid block data
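
The BlockId packing described above can be sketched as simple bit arithmetic. The bit widths follow the description in this section (19-bit sequence number, 20-bit PartitionId, 24-bit TaskId); the helper itself is illustrative rather than code taken from Firestorm.

```java
// Sketch of packing a BlockId into a 64-bit long: the high 19 bits hold an auto-incrementing
// sequence number, the middle 20 bits the PartitionId, and the low 24 bits the TaskId.
class BlockIdSketch {
    private static final int PARTITION_BITS = 20;
    private static final int TASK_BITS = 24;

    static long pack(long sequenceNo, long partitionId, long taskId) {
        return (sequenceNo << (PARTITION_BITS + TASK_BITS))
            | (partitionId << TASK_BITS)
            | taskId;
    }

    static long sequenceNo(long blockId) {
        return blockId >>> (PARTITION_BITS + TASK_BITS);
    }

    static long partitionId(long blockId) {
        return (blockId >>> TASK_BITS) & ((1L << PARTITION_BITS) - 1);
    }

    static long taskId(long blockId) {
        return blockId & ((1L << TASK_BITS) - 1);
    }
}
```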

Data validation

Data correctness is the most critical aspect of the shuffle process. Firestorm ensures it in the following ways (a minimal CRC check sketch follows the list):

  1. The write task calculates the CRC check value for each block data, and the read task will check each block based on the CRC to avoid data inconsistency
  2. Each BlockId is stored on the Shuffle Server side. When reading data, it will verify that all BlockIds have been processed to avoid data loss.
  3. Successful task information will be recorded on the driver side, and redundant blocks will be filtered during reading to avoid data inconsistency caused by speculative execution.
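
Check 1 can be illustrated with a minimal CRC32 sketch: the writer computes a checksum over the block bytes and stores it in the Index file, and the reader recomputes and compares it. The method names and the exception used here are illustrative, not Firestorm's actual code.

```java
import java.util.zip.CRC32;

// Minimal sketch of per-block CRC verification.
class BlockCrcSketch {
    static long computeCrc(byte[] blockData) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue();
    }

    static void verify(long blockId, byte[] blockData, long expectedCrc) {
        long actual = computeCrc(blockData);
        if (actual != expectedCrc) {
            throw new IllegalStateException(
                "CRC mismatch for block " + blockId + ": expected " + expectedCrc
                    + " but got " + actual + "; the shuffle data may be corrupted");
        }
    }
}
```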

Support for multiple storage types

Since there are many storage options (LocalFile, HDFS, OZONE, COS, etc.), the storage layer is decoupled in the design and the read/write interfaces are abstracted to make it easy to plug in different storage types. To use a new storage system as the backend for shuffle data, only the relevant interfaces need to be implemented.
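
A possible shape for this abstraction is sketched below. The interface and method names are hypothetical and only convey the idea that each backend (LocalFile, HDFS, COS, ...) implements its own read and write handlers behind a common factory.

```java
import java.io.IOException;
import java.util.List;

// Illustrative storage abstraction: each backend only implements these handlers.
interface ShuffleWriteHandler {
    // Persist a batch of blocks (data plus index metadata) for one partition.
    void write(List<byte[]> blocks) throws IOException;
}

interface ShuffleReadHandler {
    // Return the next batch of block bytes for the partition, or null when exhausted.
    byte[] readNextBlock() throws IOException;
}

interface StorageHandlerFactory {
    ShuffleWriteHandler createWriteHandler(String appId, int shuffleId, int partitionId);
    ShuffleReadHandler createReadHandler(String appId, int shuffleId, int partitionId);
}
```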

Firestorm benefits

Support cloud-native deployment models

Firestorm is currently deployed on a mixed online/offline cluster of nearly 10,000 nodes within Tencent, supporting nearly 50,000 distributed computing jobs per day. The daily shuffle data volume is close to 2 PB, and the task failure rate has dropped from 14% to 9%. This meets the first-stage goal set at the beginning of the project and helps distributed computing move to the cloud.

Improve the stability and performance of the Shuffle phase

Based on the TPC-DS 1TB data volume, we conducted a performance comparison test using native Spark Shuffle and using Firestorm. The test environment is as follows:

  • 3 servers as compute nodes, each with 80 cores + 256 GB RAM + HDD
  • 3 servers as Shuffle Servers, each with 112 cores + 128 GB RAM + HDD, to store the shuffle data

TPC-DS queries vary in complexity. For simple queries, with small amounts of shuffle data, native Spark Shuffle performs better, but its advantage is not significant. For complex queries involving shuffles over a large number of partitions, Firestorm is more stable and performs much better. The two scenarios are described separately below:

Scenario 1: simple SQL, taking query43 as an example. The figure below shows the stage diagram of query43, which consists of 2 stages with a very small amount of shuffle data. Running the entire query with native Spark Shuffle takes about 12 seconds, while with the Remote Shuffle Service it takes about 15 seconds.

Where does the extra time go? The figure below shows the time breakdown of the first stage. In the shuffle write time column, native Spark Shuffle has the advantage, with write times at the millisecond level. With Firestorm, the extra RPC communication in the shuffle write phase increases the time spent; in addition, the tasks have to run in multiple batches, and each batch adds a difference of several hundred milliseconds, which ultimately gives native Spark Shuffle an advantage of about 3 seconds on this query.

As query execution time grows, this advantage gradually shrinks until it is almost negligible. Queries of this type include query1, query3, and so on, which are not listed here.

Scenario 2: complex SQL, taking query17 as an example. The figure below shows the stage diagrams under the different shuffle modes. The query has many stages and a large amount of shuffle data. Execution with native Spark Shuffle takes about 8 minutes, while with the Remote Shuffle Service it takes only about 3 minutes.

Expanding the longest stage shows the detailed time comparison. First, shuffle read time: because native Spark Shuffle has to pull data from every Executor, it incurs heavy network overhead and heavy random disk IO, so reads take a very long time, even reaching 2 minutes. The Remote Shuffle Service reduces the network overhead of reads and reads the shuffle data in whole chunks, so its read time is short and relatively stable.

Next, shuffle write time: native Spark Shuffle again takes a long time and is unstable, mainly because at this point the compute nodes are handling shuffle read and shuffle write at the same time, accessing the local disk frequently with a large data volume, which greatly increases the time spent. The Remote Shuffle Service's read/write mechanism avoids these problems, so its overall performance is much better and more stable.

Queries of this kind also include query25, query29, and others, which are not detailed here.

Beyond the two scenarios above, there are also queries, such as query64 and query67, that cannot run normally with native Spark Shuffle because of their even larger shuffle data volumes, but run smoothly with the Remote Shuffle Service.

In general, with small shuffle data volumes the Remote Shuffle Service has no advantage over native Spark Shuffle: performance is either roughly the same or slightly lower, by 5%-10%. With large shuffle data volumes the Remote Shuffle Service has a clear advantage; on some TPC-DS queries the test results show a 50%-100% performance improvement.

Summary

This article has introduced the problems of the existing Spark shuffle implementation and how the industry has responded to them. Drawing on how Spark jobs actually run within Tencent, it presented the architecture, design, performance, and applications of our self-developed Firestorm. We hope that, in cloud-native scenarios, Firestorm can help distributed computing engines move to the cloud more smoothly.

The open-source version is available at:

https://github.com/Tencent/Firestorm

You are welcome to follow and star the project, and outstanding developers are welcome to join Tencent's big data R&D team.

Appendix

[1]https://issues.apache.org/jira/browse/SPARK-25299

[2]https://www.slideshare.net/databricks/cosco-an-efficient-facebookscale-shuffle-service

[3]https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle

[4]https://github.com/uber/RemoteShuffleService

[5]https://developer.aliyun.com/article/772328

[6]https://www.sohu.com/a/447193430_31
