ByteDance open-sources its self-developed shuffle framework, Cloud Shuffle Service

Today, ByteDance announced that it has officially open-sourced Cloud Shuffle Service.

Cloud Shuffle Service (hereinafter referred to as CSS) is a general-purpose Remote Shuffle Service framework developed by ByteDance. It supports computing engines such as Spark, FlinkBatch, and MapReduce, and provides data shuffle with better stability, higher performance, and more flexibility than the engines' native solutions. It also provides a Remote Shuffle option for scenarios such as storage-compute separation and online-offline co-location.

CSS is now available on GitHub, and anyone interested is welcome to join the project and contribute!

Project address: https://github.com/bytedance/CloudShuffleService

Open source background

In big-data computing engines, pull-based Sort Shuffle is a common shuffle scheme: Spark, MapReduce, and FlinkBatch (1.15 and later) all use Sort Shuffle as the engine's default. However, the way Sort Shuffle is implemented has certain defects, and in large-scale production environments job stability is often hurt by shuffle problems.

Take Spark's Sort Shuffle as an example:

As shown above, Sort Shuffle has the following problems:

  • Merging multiple spill files into a single file consumes extra read and write IO;

  • Assuming there are m MapTasks and n ReduceTasks, m*n network links are generated; when these numbers are particularly large:

    • The large number of network requests easily creates a backlog on the Shuffle Service;

    • The Shuffle Service performs a large number of random reads, which easily becomes an IO bottleneck, especially on HDD clusters;

  • The Shuffle Service cannot isolate resources between applications: one abnormal job may affect all other jobs on the same Shuffle Service node, so problems are easily amplified;

  • The shuffle data files produced by MapTasks are stored on local disk in a single copy; if the disk is damaged the data is lost, which also causes FetchFailed errors;

  • Writing shuffle data files to local disk ties shuffle to the disks of the compute nodes, so storage and compute cannot be separated.

All of these easily lead to slow or timed-out shuffle reads and to FetchFailed errors, seriously affecting the stability of online jobs. A slow ShuffleRead also greatly reduces resource utilization (CPU & memory), and a FetchFailed triggers recomputation of the affected Stage's tasks, wasting a large amount of resources and slowing down the whole cluster. To get a sense of the m*n fan-out above: 5,000 MapTasks and 5,000 ReduceTasks already produce 25 million network links. Finally, an architecture that cannot separate storage and compute struggles in scenarios such as online-offline co-location (where online machines have few or no disks) and Serverless / cloud-native deployments.

ByteDance uses Spark as its main offline big-data processing engine, running millions of jobs per day with an average daily shuffle volume of 300+ PB. In scenarios such as co-location with HDFS and online-offline co-location, the stability of Spark jobs often cannot be guaranteed, which affects business SLAs:

  • Constrained HDD IO capability, disk failures, and similar issues cause many Shuffle FetchFailed problems, which in turn lead to slow jobs, job failures, and Stage recomputation, hurting both stability and resource utilization;

  • External Shuffle Service (hereinafter referred to as ESS) cannot separate storage and compute; on machines with small disk capacity, the disks frequently fill up, affecting running jobs.

Against this background, ByteDance developed CSS to solve the pain points of Spark's native ESS solution. In the year and a half since CSS launched internally, it has grown to 1,500+ online nodes with an average daily shuffle volume of 20+ PB, greatly improving the shuffle stability of Spark jobs and safeguarding business SLAs.

Introduction to Cloud Shuffle Service

CSS is a push-based shuffle service developed by ByteDance. All MapTasks push the shuffle data belonging to the same Partition to the same CSS Worker node for storage, and ReduceTasks then read that Partition's data sequentially from that CSS Worker. Compared with the random reads of ESS, these sequential reads greatly improve IO efficiency.

CSS Architecture

Cloud Shuffle Service (CSS) Architecture Diagram

The CSS cluster is an independently deployed shuffle service. Its main components are:

  • CSS Worker

After a CSS Worker starts, it registers its node information with ZooKeeper. It serves Push and Fetch requests: the Push service accepts push requests from MapTasks and writes the data of the same Partition into the same file; the Fetch service accepts fetch requests from ReduceTasks, reads the corresponding Partition data file, and returns it. The CSS Worker is also responsible for cleaning up shuffle data: when the Driver issues an UnregisterShuffle request and deletes the Znode for the ShuffleId in ZooKeeper, or when the Application ends and the Znode for the ApplicationId is deleted, CSS Workers watch these events and clean up the corresponding shuffle data (see the sketch below).
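As a rough illustration of this watch-based cleanup, here is a minimal Scala sketch using Apache Curator. The znode layout (/css/applications) and the deleteLocalShuffleFiles helper are assumptions for illustration, not the actual CSS implementation:

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.recipes.cache.{PathChildrenCache, PathChildrenCacheEvent, PathChildrenCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object WorkerCleanupWatcher {
  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Hypothetical znode layout: one child znode per running application.
    val appsPath = "/css/applications"
    val cache = new PathChildrenCache(client, appsPath, false)
    cache.getListenable.addListener(new PathChildrenCacheListener {
      override def childEvent(c: CuratorFramework,
                              event: PathChildrenCacheEvent): Unit = {
        // A deleted child znode signals that the application has ended.
        if (event.getType == PathChildrenCacheEvent.Type.CHILD_REMOVED) {
          val appId = event.getData.getPath.stripPrefix(appsPath + "/")
          deleteLocalShuffleFiles(appId)
        }
      }
    })
    cache.start()
  }

  // Placeholder: a real worker would remove this app's partition files.
  private def deleteLocalShuffleFiles(appId: String): Unit =
    println(s"cleaning shuffle data for application $appId")
}
```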

  • CSS Master

After a job starts, a CSS Master is launched inside the Spark Driver. The CSS Master obtains the CSS Worker node list from ZooKeeper, allocates n replicas (2 by default) of CSS Worker nodes for each Partition that the subsequent MapTasks will produce, and manages this metadata so that ReduceTasks can look up which CSS Worker nodes host a given PartitionId and pull from them. During RegisterShuffle/UnregisterShuffle it also creates and deletes the corresponding ApplicationId/ShuffleId Znodes in ZooKeeper, whose delete events the CSS Workers watch to clean up shuffle data.
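A minimal sketch of what such replica allocation could look like, assuming a simple round-robin policy; the real CSS Master's strategy may differ, and WorkerRef/allocate are illustrative names:

```scala
// Toy allocation: spread each partition's replicas across workers
// round-robin, so no two replicas of a partition land on the same node.
case class WorkerRef(host: String, port: Int)

def allocate(
    workers: IndexedSeq[WorkerRef],
    numPartitions: Int,
    replicas: Int = 2): Map[Int, Seq[WorkerRef]] = {
  require(workers.size >= replicas, "need at least `replicas` workers")
  (0 until numPartitions).map { pid =>
    val picked = (0 until replicas).map(r => workers((pid + r) % workers.size))
    pid -> picked
  }.toMap
}
```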

  • ZooKeeper

As described above, ZooKeeper stores CSS Worker node information as well as ApplicationId/ShuffleId Znodes and related metadata.

CSS features

  • Multi-engine support

In addition to supporting Spark (2.x & 3.x), CSS can be integrated with other engines; within ByteDance, CSS is already integrated with the MapReduce and FlinkBatch engines.
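For Spark, plugging in a remote shuffle service is typically a matter of job configuration. The sketch below is hypothetical: the shuffle-manager class name and the spark.css.* keys are assumptions, so consult the project README for the real settings:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("css-demo")
  // Swap Spark's default shuffle manager for a remote one
  // (class name assumed for illustration).
  .config("spark.shuffle.manager",
          "org.apache.spark.shuffle.css.CssShuffleManager")
  // ZooKeeper address for CSS worker discovery (key assumed).
  .config("spark.css.zookeeper.address", "zk1:2181,zk2:2181")
  .getOrCreate()
```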

  • PartitionGroup support

To address the inefficiency of pushing a single Partition that is too small, multiple consecutive Partitions are combined into a larger PartitionGroup, and pushes actually happen at PartitionGroup granularity (see the sketch below).
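The grouping idea can be expressed in one line; the group size here is an arbitrary example value, not the CSS default:

```scala
// Consecutive partition ids map to the same group, so many small
// pushes are batched into one larger push.
val groupSize = 8
def groupOf(partitionId: Int): Int = partitionId / groupSize
// partitions 0..7 -> group 0, 8..15 -> group 1, ...
```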

  • Efficient and unified memory management

Similar to ESS, the CSS buffer in a MapTask stores the data of all Partitions together, sorts it by PartitionId before spilling, and then pushes it at PartitionGroup granularity (see the sketch below). The CSS buffer is also fully integrated into Spark's UnifiedMemoryManager memory-management system, so all memory-related parameters are managed uniformly by Spark.
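A conceptual sketch of the sort-before-push flow, under assumed record and API shapes (Record and pushToWorker are illustrative, not CSS internals):

```scala
// Buffer records from all partitions, sort by partition id, then
// emit one push per PartitionGroup.
case class Record(partitionId: Int, payload: Array[Byte])

def spill(buffer: Seq[Record], groupSize: Int): Unit = {
  buffer
    .sortBy(_.partitionId)                    // cluster records by partition
    .groupBy(r => r.partitionId / groupSize)  // PartitionGroup id
    .foreach { case (groupId, records) =>
      pushToWorker(groupId, records)          // one push per group
    }
}

def pushToWorker(groupId: Int, records: Seq[Record]): Unit = {
  // Placeholder for the network push; per the article, CSS pushes
  // in batches of about 4 MB.
  println(s"push group $groupId: ${records.size} records")
}
```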

  • Fault tolerance

Push failure: when a spill pushes PartitionGroup data, each push is a 4 MB batch. If one push batch fails, the data already pushed successfully is unaffected; CSS only needs to reallocate nodes and continue pushing the failed data and the data not yet pushed, and the subsequent ReduceTask reads the complete Partition data from both the old and new nodes;

Multi-replica storage: a ReduceTask reads Partition data from a CSS Worker at batch granularity. When a CSS Worker is abnormal (for example, a network problem or disk failure) and a batch cannot be fetched, the reader switches to another replica node and continues reading that batch and the subsequent ones (see the failover sketch below);
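A sketch of that replica failover loop, reusing the WorkerRef type from the allocation sketch above; fetchBatch is a hypothetical stand-in for the fetch RPC:

```scala
// Placeholder for the real fetch RPC; assume it throws IOException
// when the worker is unreachable or the batch is unreadable.
def fetchBatch(worker: WorkerRef, batchId: Int): Unit = ()

def readPartition(replicas: Seq[WorkerRef], numBatches: Int): Unit = {
  var current = 0                      // index of the replica in use
  for (batch <- 0 until numBatches) {
    var done = false
    while (!done) {
      try {
        fetchBatch(replicas(current), batch)
        done = true
      } catch {
        case _: java.io.IOException =>
          current += 1                 // fall over to the next replica
          if (current >= replicas.size)
            sys.error(s"all replicas failed at batch $batch")
      }
    }
  }
}
```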

Data deduplication: when speculative execution is enabled, multiple AttemptTasks of the same MapTask may run concurrently, so the data must be deduplicated at read time. Each pushed batch carries a header containing MapId, AttemptId, BatchId, and other fields; when a ReduceTask reads, it deduplicates based on these IDs (a sketch follows).
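A sketch of batch-level deduplication based on those header fields; the committed-attempt filtering policy shown here is an assumption about how a reader might resolve speculative attempts:

```scala
// Keep a batch only if it comes from the committed attempt of its
// map task and has not been seen before.
case class BatchHeader(mapId: Int, attemptId: Int, batchId: Int)

class Deduplicator(committedAttempt: Map[Int, Int]) {
  private val seen = scala.collection.mutable.HashSet[(Int, Int)]()

  def accept(h: BatchHeader): Boolean =
    committedAttempt.get(h.mapId).contains(h.attemptId) &&
      seen.add((h.mapId, h.batchId)) // false if this batch was already read
}
```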

  • Adaptive Query Execution (AQE) support

CSS fully supports AQE-related features, including dynamically adjusting the number of reducers, SkewJoin optimization, and join-strategy optimization. For SkewJoin, CSS adds further adaptation and optimization to avoid the data of a skewed Partition being read repeatedly by multiple ReduceTasks, which greatly improves performance.

CSS performance test

Using dedicated (exclusive-label) compute resources, we compared CSS against open-source ESS on a 1 TB TPC-DS benchmark: overall end-to-end performance improved by about 15%, and some queries improved by more than 30%.

We also ran the same 1 TB TPC-DS comparison on an online co-located resource queue (where ESS stability is poor); there, overall end-to-end performance improved by about 4x.

Queries with more than 30% improvement in the CSS 1 TB test

Future plans

The current open-source release of CSS contains part of its features; further features and optimizations will be released over time:

  • Support for the MapReduce/FlinkBatch engines;

  • Adding a ClusterManager service role to the CSS cluster to manage the status and load information of CSS Workers, and moving the CSS Worker allocation currently performed by the CSS Master into the ClusterManager;

  • A CSS Worker allocation strategy that accounts for dimensions such as heterogeneous machines (e.g., different disk capacities) and load.
