ByteDance's Large-Scale Cloud-Native Evolution of Spark Shuffle in Practice

Spark is a widely used computing engine at ByteDance, applied across large-scale data processing, machine learning, and other big data scenarios. At present, the number of daily tasks in China exceeds 1.5 million, the daily Shuffle read and write volume exceeds 500 PB, and the Shuffle data of some individual jobs reaches hundreds of TB.

Both the job volume and the Shuffle data volume are still growing. Compared with last year, the number of daily tasks has increased by 500,000, and the overall data volume has grown by more than 200 PB, an increase of roughly 50%. Shuffle is triggered frequently in user jobs: operations such as reduceByKey, groupByKey, join, sortByKey, and repartition all rely on it, and its job is to repartition and recombine data across all nodes, which involves frequent disk and network IO. In large-scale Spark clusters, Spark Shuffle therefore often becomes a bottleneck for performance and stability. The following describes in detail ByteDance's large-scale practice in making Spark Shuffle cloud native.

Introduction to Spark Shuffle Principle

In the ESS (External Shuffle Service) mode used by default in the community edition, Shuffle repartitions data: the output of the Map side is reorganized and delivered to all Reducers. If there are M Mappers and R Reducers, the data of the M Mappers is repartitioned into the R Reducer partitions. The Shuffle process is divided into two stages: Shuffle Write and Shuffle Read. During Shuffle Write, each Mapper splits its current partition into R new partitions according to the Reduce partitioning, sorts them, and writes them to local disk. The generated Map Output consists of two files: an index file and a data file sorted by partition. When all Mappers have finished writing their Map Output, the second phase, Shuffle Read, begins. Each Reducer accesses all ESS instances that hold its Reducer partition and reads the data of the corresponding Reduce partition; it may need to request the ESS on every node that holds relevant partitions until it has obtained all of its Reduce partition data.
In the Shuffle Fetch phase, each ESS receives requests from all Reducers and returns the corresponding data. This produces on the order of M × R network connections and random disk reads and writes, involving a large amount of disk access and network transfer, which is why Shuffle puts such frequent pressure on disk and network IO.
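To make the M × R fan-out concrete, the minimal Spark example below (illustrative only, not taken from ByteDance's production setup) triggers exactly this kind of Shuffle: reduceByKey repartitions the map output into R reduce partitions, so the read phase must fetch each of the R partitions from each of the M map outputs.

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustration of a Shuffle boundary. With M map tasks and R reduce
// partitions, the Shuffle Read phase issues on the order of M x R fetch
// requests against the shuffle services holding the map output.
object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-demo").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4) // M = 4 map tasks

    // Shuffle Write: each map task sorts its output by reduce partition and writes
    // one data file plus one index file. Shuffle Read: each reducer fetches its
    // partition from every map output.
    val counts = words.map((_, 1)).reduceByKey(_ + _, numPartitions = 3) // R = 3 reduce partitions

    counts.collect().foreach(println)
    spark.stop()
  }
}
```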
Since Shuffle has very high resource requirements and consumption, CPU, disk, and network overhead can easily cause Fetch Failures or become the bottleneck that makes Shuffle slow. In ByteDance's large-scale Shuffle scenario, a single ESS node may need to serve multiple tenants at the same time, and these clusters do not enforce IO isolation, so Shuffle can become a main cause of user job failures and a real pain point.
Therefore, ByteDance started the cloud-native migration of Spark Shuffle in early 2021, and Spark jobs, together with the rest of the big data ecosystem, began migrating from Yarn to Gödel, ByteDance's self-developed scheduler built on Kubernetes. The migration also came with a solution for moving Hadoop workloads to the cloud, Yodel (Yarn on Gödel), a protocol layer fully compatible with Hadoop Yarn that aims to smoothly migrate all big data applications onto the Kubernetes system.
As part of this migration, ESS was also customized: the Yarn Auxiliary Service previously running inside the Yarn NodeManager was adapted to a Kubernetes DaemonSet deployment, and the migration of Shuffle jobs began. The effort took two years; by 2023, all big data applications, including Spark applications, had been successfully migrated to today's cloud-native ecosystem.

Cloud native challenges

During the cloud-native migration, we encountered many challenges:
  • First, during the migration from NodeManager (NM) to DaemonSet, the CPU of ESS running in a DaemonSet is strictly limited, whereas in the previous NM mode ESS could use essentially all of a node's CPU. As a result, the CPU initially allocated to ESS was often insufficient and required continuous adjustment; in some high-priority clusters the CPU restriction on ESS was eventually relaxed outright.
  • At the same time, DaemonSets and Pods impose stricter limits on the CPU of Spark jobs, which caused many users' jobs to slow down after migrating to the new architecture, because in the previous mode CPU was over-committed to some extent. To address this, we enabled the CPU Shares mode under the Kubernetes and Gödel architecture so that users notice no performance difference during the migration.
  • In addition, Pods have very strict memory limits, so free page cache could not be used during Shuffle Read and the page cache hit rate became very low, incurring more disk IO and hurting overall performance. We mitigated this post-migration impact on Shuffle performance by appropriately relaxing the Pods' use of the page cache.

Cloud native benefits

After completing the migration, we unified all offline resource pools and could apply optimization and scheduling strategies more easily at the scheduling layer, improving overall resource utilization. The ESS DaemonSet also brings many benefits compared with the Yarn Auxiliary Service. First, ESS deployed as a DaemonSet is an independent service, decoupled from the NodeManager, which reduces operation and maintenance costs. In addition, the resource isolation that Kubernetes and Pods provide for ESS improves its stability, because ESS is no longer affected by other jobs or services on the node.

Cloud native environment

Cloud-native Spark jobs currently have two main operating environments:
  • Stable resource clusters. These clusters mainly serve high-priority jobs with SLA guarantees and are equipped with higher-performance SSD disks. For them we mainly use an ESS that is deeply customized on top of the community version; ESS reads and writes go to local high-performance SSDs, and the service is deployed in DaemonSet mode under the Gödel architecture.
  • Co-located (mixed) resource clusters. These clusters mainly serve medium- and low-priority jobs such as ad-hoc queries, debugging, and test tasks. They are mostly deployed on HDD disks; some of the resources are lent from idle online capacity or shared with, or co-deployed alongside, other online services. The cluster resources are therefore not exclusive, and the overall disk performance and storage environment are not particularly good.

Stable resource scenario

Since stable clusters run many high-priority jobs, the first task is to improve the Shuffle stability and runtime duration of these jobs to guarantee their SLAs. To address Shuffle problems, three capabilities were deeply customized in ESS: enhanced ESS monitoring and governance, an ESS Shuffle rate-limiting function, and a Shuffle overflow-splitting function.

ESS deep customization

  1. Enhanced monitoring and governance capabilities of ESS

  • Monitoring capabilities
In terms of monitoring, while using the open source version we found that the existing metrics were not sufficient to troubleshoot the Shuffle problems we encountered or to understand the current ESS status in depth. As a result, we could not quickly locate which nodes were causing Shuffle problems, nor detect the problematic nodes. We therefore enhanced the monitoring capabilities.
First, key indicators for monitoring Shuffle slowness and fetch throughput were added, including Queued Chunks and Chunk Fetch Rate. Queued Chunks monitors the backlog of requests on the requested ESS nodes, while Chunk Fetch Rate monitors the request traffic on those nodes (a minimal sketch of how such indicators could be tracked appears at the end of this subsection). We also connected the ESS metrics to ByteDance's Metrics system, so that the application-level indicators it provides let us quickly locate backlogs on ESS nodes. On the UI side, two features were added to the Stage details page: one shows the slowest node encountered by each task's Shuffle in the current Stage, and the other aggregates, across all tasks of the Stage, the top nodes on which slow Shuffle occurred most often. These improvements not only make it easier for users to investigate, but also let us build dashboards on top of these indicators.
With these monitoring and UI improvements, when users see on the UI that Shuffle is slow, they can open the corresponding Shuffle monitoring directly from the UI. This helps both users and us quickly locate the ESS nodes causing the Shuffle problem, see the actual situation on those nodes, and determine which applications the accumulated requests come from.
The new monitoring also exposes key indicators such as the actual chunk backlog and latency on ESS nodes while jobs run and while Shuffle problems are being investigated, which helps us react more efficiently in real time when Shuffle is slow. Once a Shuffle problem is located, we can analyze the situation and decide on a governance and optimization direction.
  • Governance capabilities
Governance is mainly implemented through the BatchBrain system, an intelligent job-tuning system designed specifically for Spark jobs. It collects job data and performs both offline and real-time analysis. The collected data includes Spark's own Event Log, more detailed internal Timeline events, and various metrics, including the customized Shuffle metrics added to ESS.
Offline analysis mainly targets periodic jobs. Based on each job's historical characteristics and the collected data, the performance of its Shuffle Stages is analyzed, and after several rounds of iterative adjustment a suitable set of Shuffle parameters is produced, so that the job runs with the optimized Shuffle parameters on its next execution and achieves better performance.
In the real-time analysis part, BatchBrain can also scan automatically using the newly added Shuffle indicators. Users can query the Shuffle status of the jobs in their cluster through the BatchBrain API, effectively locate the nodes and jobs with Shuffle backlogs, and notify the relevant owners through alerts. If Shuffle is slow because of other or abnormal jobs, users can also take direct governance actions, such as stopping or evicting those jobs, to free up resources for the Shuffle of higher-priority jobs.
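To make the monitoring idea concrete, the following is a minimal sketch, not ByteDance's actual ESS code, of how indicators in the spirit of Queued Chunks and Chunk Fetch Rate could be tracked with the Dropwizard metrics library that the Spark external shuffle service already uses; the class and metric names here are illustrative.

```scala
import java.util.concurrent.atomic.AtomicLong
import com.codahale.metrics.{Gauge, Meter, MetricRegistry}

// Illustrative only: a "queued chunks" backlog gauge and a "chunk fetch rate"
// meter for an ESS-like shuffle server.
class ShuffleFetchMetrics(registry: MetricRegistry) {
  // Chunk fetch requests accepted but not yet answered (the backlog).
  private val queuedChunks = new AtomicLong(0L)

  registry.register("queuedChunks", new Gauge[Long] {
    override def getValue: Long = queuedChunks.get()
  })

  // Throughput of chunk fetches actually served, in chunks per second.
  private val chunkFetchRate: Meter = registry.meter("chunkFetchRate")

  def onChunkRequested(): Unit = queuedChunks.incrementAndGet()

  def onChunkSent(): Unit = {
    queuedChunks.decrementAndGet()
    chunkFetchRate.mark()
  }
}
```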
  2. Shuffle rate-limiting function

Through Shuffle monitoring and governance, we found that when Shuffle on an ESS node is slow, it is usually because some tasks carry too much data or are configured with inappropriate parameters, making the numbers of Mappers and Reducers of those Shuffle Stages abnormally large. An abnormally large number of Mappers and Reducers can cause a huge backlog of requests on the ESS node, and the chunks being requested may be very small; the average chunk size of some abnormal jobs does not even reach 20 KB. These jobs send a huge number of requests to the ESS, and when the ESS cannot process them in time, the requests pile up, which can delay jobs or even cause them to fail outright.
In response, the solution we adopted is to limit the total number of outstanding requests each application may have on an ESS node. When an application's fetch requests reach the upper limit, the ESS rejects new fetch requests from that application; the application must wait for its existing requests to finish before sending new ones. This prevents a single application from occupying excessive resources on a node and leaving the ESS unable to serve other jobs properly, avoids other jobs failing or their Shuffle slowing down, and mitigates the negative impact of abnormal or very large Shuffle jobs on the cluster's Shuffle.
  • Features of the Shuffle rate-limiting function
  1. When a job runs normally, enabling rate limiting has no impact on it; as long as the node can serve requests normally, no limiting is triggered.
  2. Only when the node's load exceeds the tolerable range and Shuffle IO exceeds the configured threshold is rate limiting activated, reducing the number of requests that abnormal tasks can send to the ESS and relieving the pressure on the ESS service. Since the ESS is already overloaded at this point, it could not answer these requests in time even if it accepted them, so limiting the excessive requests of abnormal tasks can actually improve the performance of those tasks themselves.
  3. Rate limiting also takes job priority into account; high-priority tasks are allowed more traffic.
  4. While rate limiting is in effect, if ESS traffic returns to normal, the limit is lifted quickly and the throttled applications can quickly return to their previous traffic levels.
  • Detailed flow of rate limiting
Rate limiting is performed mainly on the ESS server. The latency indicator on the node is scanned every 5 seconds; when it exceeds the configured threshold, the node is judged to be overloaded, and all applications currently shuffling on that ESS node are evaluated to decide whether to enable rate limiting. Using the previously added indicators, we can count the node's total fetch traffic and IO over the past 5 minutes, and, based on the node's total traffic cap, reasonably allocate and restrict the traffic of each application currently shuffling on the node. Traffic allocation is also adjusted according to application priority. If any application's Shuffle traffic or accumulated Chunk Fetch Rate exceeds its allocated share, it is throttled and its newly sent requests are rejected until part of its backlog has drained.
The limits are also tiered. Allocation first depends on the number of applications currently shuffling on the node: the more applications, the less traffic each one is allocated; when few applications are running on a node, each can be allocated more. The tier is re-evaluated every 30 seconds based on the actual situation on the node.
While throttling is in effect, if the latency on the node does not improve and the total Shuffle traffic does not recover, the limit is escalated and stricter restrictions are applied to all applications. Conversely, if latency improves or the node's traffic recovers, the limit is downgraded or lifted entirely. Finally, the limits are also adjusted appropriately according to the priority of all jobs (a simplified sketch of this control loop is given at the end of this subsection).
  • Effects and benefits
The chunk backlog problem is significantly alleviated. With rate limiting, the chunk backlog caused by abnormal tasks is effectively reduced, which greatly reduces large request pile-ups on some nodes in the cluster.
Latency has also improved. Before rate limiting was enabled, we often saw high latency on cluster nodes; after enabling it, overall latency dropped noticeably. By cutting unnecessary and invalid requests and limiting the number of requests that large or abnormal tasks can send to ESS nodes, we avoid the negative impact of these abnormally large tasks on the ESS service load and reduce their impact on other high-priority tasks.
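As a rough illustration of the control loop described under "Detailed flow of rate limiting", the sketch below shows one way per-application throttling could be structured. The class, thresholds, and priority weighting are hypothetical simplifications rather than the actual ESS implementation, and the 30-second tier escalation is omitted for brevity.

```scala
import scala.collection.mutable

// Hypothetical sketch of per-application fetch throttling on an ESS node.
case class AppShuffleStats(appId: String, priority: Int, recentFetchBytesPerSec: Long)

class EssThrottleController(
    latencyThresholdMs: Long,  // latency above this marks the node as overloaded
    nodeTrafficCapBytes: Long  // total fetch traffic the node is allowed to serve per second
) {
  // Applications currently throttled, with their allocated traffic share.
  private val throttled = mutable.Map[String, Long]()

  // Called roughly every 5 seconds with fresh node-level metrics.
  def evaluate(nodeLatencyMs: Long, apps: Seq[AppShuffleStats]): Unit = {
    if (nodeLatencyMs <= latencyThresholdMs) {
      // Node is healthy again: lift all limits quickly.
      throttled.clear()
      return
    }
    // Tiered allocation: the more applications shuffling on the node, the
    // smaller each share; higher-priority applications get a larger weight.
    val totalWeight = apps.map(_.priority.max(1)).sum.max(1)
    apps.foreach { app =>
      val share = nodeTrafficCapBytes * app.priority.max(1) / totalWeight
      if (app.recentFetchBytesPerSec > share) throttled(app.appId) = share
      else throttled.remove(app.appId)
    }
  }

  // Consulted for each incoming fetch request: reject if the app is over its share.
  def shouldReject(appId: String, currentBytesPerSec: Long): Boolean =
    throttled.get(appId).exists(currentBytesPerSec > _)
}
```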
  3. Shuffle overflow-splitting function

When analyzing some slow Shuffle jobs, we also noticed another phenomenon: the amount of Shuffle data written by each Executor in a job can be very uneven. Because ESS is used together with the Dynamic Allocation mechanism, Executors run for different lengths of time and are assigned different numbers of Map tasks. This concentrates a large amount of Shuffle data on a few Executors while the job runs, and therefore on a few nodes.
For example, in the figure below, the Shuffle write volume of 5 Executors exceeds that of the other Executors by more than 10 times. In such cases, Shuffle requests may concentrate on the corresponding nodes, driving the load on those ESS nodes very high and indirectly increasing the likelihood of Fetch Failures.
Our solution for this situation is to control the total amount of Shuffle data written to disk by each container or each node. This can be done from two angles. First, Spark itself controls the Executor's Shuffle write size, i.e., the maximum amount of data each Executor may write during Shuffle. Each Executor tracks how much Shuffle data it has written and reports it to the Spark Driver, and the Driver can use the Exclude On Failure mechanism to proactively exclude Executors whose written data exceeds the threshold from scheduling and recycle them. Second, we improve the scheduling strategy through the Gödel scheduler and try to schedule new Executors onto other nodes, so that a single container does not write so much Shuffle data that the node's disk fills up or the Shuffle Fetch traffic concentrates on those ESS nodes.
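The driver-side bookkeeping described above can be pictured with the following conceptual sketch; it is not Spark's or ByteDance's actual API. Executors report their cumulative Shuffle write bytes, and any executor exceeding the threshold is excluded from further scheduling and released; the two callbacks stand in for the Exclude On Failure and Dynamic Allocation hooks.

```scala
import scala.collection.mutable

// Conceptual sketch: cap the amount of Shuffle data a single executor may write.
class ShuffleWriteBalancer(
    maxShuffleBytesPerExecutor: Long,
    excludeExecutor: String => Unit,  // stand-in for the Exclude On Failure hook
    releaseExecutor: String => Unit   // stand-in for recycling via Dynamic Allocation
) {
  private val writtenBytes = mutable.Map[String, Long]().withDefaultValue(0L)

  // Invoked when an executor heartbeats its Shuffle write metrics to the driver.
  def reportShuffleWrite(executorId: String, deltaBytes: Long): Unit = synchronized {
    val total = writtenBytes(executorId) + deltaBytes
    writtenBytes(executorId) = total
    if (total > maxShuffleBytesPerExecutor) {
      // Stop assigning new map tasks to this executor; new executors will be
      // scheduled (ideally on less loaded nodes) to take over the remaining work.
      excludeExecutor(executorId)
      releaseExecutor(executorId)
    }
  }
}
```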

Cloud native optimization

Some Executor scheduling and functional optimizations were also made as part of the cloud-native work. Gödel scheduler policies improve Shuffle behavior: when scheduling Executors, nodes with a high Shuffle load can be avoided as much as possible, reducing the likelihood that those nodes run into Shuffle problems. The scheduler also provides more complete capabilities around Executors, evicting Executors on nodes under particularly high disk pressure, or, when remaining disk space is insufficient, evicting containers that have written a large amount of Shuffle data. Combining the Spark Driver's control over Executor Shuffle with these cloud-native scheduling capabilities spreads Shuffle data across more nodes, making the data and request volume in the Shuffle Fetch stage more balanced.

Effects

After these deeply customized Shuffle optimizations were enabled online, we observed significant effects. The following is operating data from three high-priority clusters. Together they run more than 300,000 tasks per day, yet the average number of jobs that ultimately fail due to Shuffle Fetch failures is only about 20 to 30 per day, a failure rate below 1 in 10,000. As shown in the figure above, the stability of these three clusters improved significantly after the optimizations, and the Shuffle problems encountered by users were also greatly reduced.

Mixed resource scenario

A notable point about optimizing co-located clusters is that Fetch Failures there are usually far more severe than in stable resource environments, and the average number of Fetch Failures per day is very high. The main reason is that most of these resources are lent from idle online capacity, so their disk IO capability and disk space are relatively limited; in addition, some resources are co-deployed with HDFS or other services. Because disk IOPS and disk space can be very limited, the Shuffle performance of these clusters suffers and the probability of failure is correspondingly high. The main goals of managing co-located resources are to reduce the job failure rate and ensure job stability, while also improving the Shuffle performance of the whole cluster and reducing wasted resources.
For these mixed-resource clusters, the main solution is the self-developed Cloud Shuffle Service (CSS), which provides a remote Shuffle service and reduces the jobs' dependence on local disks.

Introduction to CSS functions

  • Push-based Shuffle mode. Unlike the ESS mode introduced above, in push-based Shuffle the data of the same Reducer partition coming from different Mappers is pushed to a common remote service, merged there, and finally written as one or more files on a Worker, so that in the Reduce stage the partition data can be read sequentially, reducing random IO overhead.
  • Partition Group support, which assigns multiple partitions to one Reducer Partition Group. During the Map stage, a Mapper can push data in batches, sending the batched data directly to the Worker nodes responsible for the corresponding partition group, which reduces IO overhead and improves performance in batch mode (see the sketch after this list).
  • Fast double-write backup. With push-based shuffle and aggregation, all data of a partition is effectively gathered on one Worker; if that Worker's data is lost, every Mapper would have to recompute the corresponding data, so double-write backup is essential for push aggregation. CSS improves write speed by double-writing to in-memory copies and flushing to disk asynchronously, so Mappers can continue pushing subsequent data without waiting for the flush to finish.
  • Load balancing. CSS manages all service nodes through a Cluster Manager, which periodically collects the load information reported by CSS Worker nodes. When a new application is submitted, it performs balanced resource allocation so that Shuffle Write and Shuffle Read are assigned to less utilized nodes, achieving better Shuffle load balancing across the cluster.
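To illustrate the Partition Group batch-push idea from the list above, here is a hypothetical mapper-side sketch. The class, the grouping rule, and the 4 MB flush threshold are illustrative simplifications, not CSS's actual client code: records whose partitions fall into the same group are buffered together and pushed to that group's Workers as one batch instead of one small request per partition.

```scala
import scala.collection.mutable

// Hypothetical sketch of mapper-side batch push for Reducer Partition Groups.
class BatchPusher(
    groupSize: Int,                                     // partitions per Reducer Partition Group
    pushBatch: (Int, Seq[(Int, Array[Byte])]) => Unit,  // (groupId, (partitionId, record)*) -> group's workers
    batchBytesThreshold: Long = 4L * 1024 * 1024
) {
  private val buffers = mutable.Map[Int, mutable.ArrayBuffer[(Int, Array[Byte])]]()
  private val bufferedBytes = mutable.Map[Int, Long]().withDefaultValue(0L)

  private def groupOf(partitionId: Int): Int = partitionId / groupSize

  def write(partitionId: Int, record: Array[Byte]): Unit = {
    val g = groupOf(partitionId)
    buffers.getOrElseUpdate(g, mutable.ArrayBuffer.empty) += ((partitionId, record))
    bufferedBytes(g) += record.length
    if (bufferedBytes(g) >= batchBytesThreshold) flush(g) // push one batch for the whole group
  }

  def flushAll(): Unit = buffers.keys.toSeq.foreach(flush)

  private def flush(g: Int): Unit = {
    buffers.remove(g).filter(_.nonEmpty).foreach(batch => pushBatch(g, batch.toSeq))
    bufferedBytes.remove(g)
  }
}
```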

Overall structure

  1. The Cluster Manager is responsible for allocating cluster resources and maintaining the state of Workers and Applications. It can persist this information in ZooKeeper or on local disk to provide a highly available service.
  2. Workers support two write modes: disk mode and HDFS mode. Disk mode is currently the common choice, and the data of each partition is written to two different Worker nodes for redundancy.
  3. The CSS Master runs on the Spark Driver side and is mainly responsible for heartbeats with the Cluster Manager and for the application lifecycle. When a job starts it also requests Workers from the Cluster Manager, and during the Shuffle Stage it tracks the Stage's metadata and progress.
  4. The Shuffle Client is the component that plugs into the Spark Shuffle API, so any Spark job can use CSS directly without additional adaptation. Each Executor reads and writes through the Shuffle Client. On the write path the client performs the double write; on the read path it can read the data from either Worker holding a replica, automatically switches to the other Worker if one read fails, and deduplicates data that has been read more than once.

Reading and writing process

On the write path, a Mapper sends its data to two Workers at the same time. A Worker does not wait for the data to be flushed to disk before replying; it acknowledges the Mapper asynchronously. If a failure occurs, the Mapper is notified on its next push request, and it then requests two new Workers and re-pushes the failed data. On the read path, data can be read from either replica node and is deduplicated using the Map ID, Attempt ID, and Batch ID.
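The read-side deduplication can be pictured with the simplified sketch below: any batch whose (Map ID, Attempt ID, Batch ID) triple has already been returned is dropped, so data that was double-written or re-pushed after a failure is only surfaced once. The types and names are placeholders rather than CSS's actual classes.

```scala
import scala.collection.mutable

// Placeholder identity of a pushed batch: which map task, which attempt, which batch.
case class BatchKey(mapId: Int, attemptId: Int, batchId: Int)

// Wraps a stream of batches read from either replica Worker and drops duplicates.
class DedupingPartitionReader(batches: Iterator[(BatchKey, Array[Byte])])
    extends Iterator[Array[Byte]] {
  private val seen = mutable.HashSet[BatchKey]()
  private var nextBatch: Option[Array[Byte]] = None

  private def advance(): Unit = {
    nextBatch = None
    while (nextBatch.isEmpty && batches.hasNext) {
      val (key, data) = batches.next()
      if (seen.add(key)) nextBatch = Some(data) // add() returns false for duplicates
    }
  }

  advance()

  override def hasNext: Boolean = nextBatch.isDefined

  override def next(): Array[Byte] = {
    val data = nextBatch.getOrElse(throw new NoSuchElementException("no more batches"))
    advance()
    data
  }
}
```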

Performance and future evolution

In a 1 TB TPC-DS benchmark, CSS improved query performance by more than 30%.
As a remote Shuffle service, CSS is particularly well suited to cloud-native environments. It has been open sourced; interested readers can visit the CSS repository to learn more, and we hope to contribute subsequent iterations and optimizations back to the community. In the future cloud-native evolution, capabilities such as elastic deployment and remote storage services need to be supported.
 
GitHub: github.com/bytedance/CloudShuffleService
 


Origin my.oschina.net/u/5941630/blog/10323638