ByteDance Spark's Practice of Supporting Model Inference at 10,000-GPU Scale

Abstract: This article is based on the keynote "ByteDance Spark's Practice of Supporting Model Inference at 10,000-GPU Scale" delivered at CommunityOverCode Asia 2023 by ByteDance infrastructure engineer Liu Chang and ByteDance machine learning system engineer Zhang Yongqiang.
As cloud-native adoption has advanced, Kubernetes' strong ecosystem and influence have driven more and more types of workloads, including big data and AI, to migrate onto it. ByteDance has internally explored migrating Spark from Hadoop to Kubernetes so that jobs run cloud-natively. The evolution of ByteDance's big data resource management architecture and Spark deployment can be roughly divided into three stages:
  • The first stage was offline resource management based entirely on YARN. Managing big data clusters with YARN at large scale effectively improved Spark resource utilization while reducing resource operation and maintenance costs.
  • The second stage was colocation of online and offline resources. By building mixed YARN and Kubernetes clusters, the overall utilization of online and offline resources was further improved. Colocation significantly raised both cluster-level and single-machine resource utilization, but higher utilization also demands more complete isolation, so we began to gradually roll out containerized deployment of Spark.
  • The third stage is fully cloud-native deployment. Online and offline workloads are no longer managed with separate architectures, the technology stack and resource pool are truly unified, and Spark's cloud-native capabilities have been gradually built out and refined.
Cloud native is, of course, almost a unanimous trend in the industry, so why go cloud native, and why use Kubernetes as the unified resource management base? There are three main advantages. The first is efficient operation and maintenance: Kubernetes provides agile workload creation and management, and both online and big data workloads can easily achieve continuous development, integration, and deployment. The second is resource pooling: a unified cloud-native base reduces infrastructure overhead and improves resource transfer efficiency, and utilization across the whole data center can be raised more comprehensively, cutting costs while increasing efficiency. The third is a thriving ecosystem: Kubernetes has one of the most active ecosystems, and by providing standardized interface definitions it promotes development at every layer, whether basic operation and maintenance facilities, upper-layer application management, or the underlying network and storage, offering many options that make Spark's cloud-native adoption easier.

ByteDance Spark scale

ByteDance runs Spark at an industry-leading scale: millions of offline jobs every day, occupying millions of cores and tens of thousands of GPU cards, with a total cluster size of tens of thousands of nodes. At such a scale, making Spark fully cloud native is not easy, and these are the questions we considered in practice. Should Spark jobs be deployed statically via Standalone or dynamically via native K8s? Should an Operator be used? How should tenant-level resource management and control of Spark jobs on K8s be implemented: at job submission time or at Pod creation time? How can Spark's scheduling requirements be supported, and will the large number of Pods created when Spark jobs are submitted become a scheduling bottleneck? For an architecture migration at this scale, how do we build the surrounding capabilities and keep the experience smooth before and after the migration?
During Spark's move to cloud native, partner teams also faced many problems. Search workloads include a large number of offline batch tasks with very high GPU requirements, while the online clusters free up large amounts of resources during off-peak periods and some online services cannot fully use their GPUs, so overall utilization is low. Machine learning is an important partner of Spark: together we solved the problems above and strengthened the surrounding ecosystem. Spark made targeted engine enhancements for the business, and the business in turn benefited from Spark's cloud-native resources, scheduling, and management.

Spark cloud native solutions and engine enhancements

The mainstream cloud-native solutions for Spark today are Spark's native Kubernetes support and Google's open-source Spark Operator. Both achieve the same goal and ultimately invoke the spark-submit command-line tool; the difference is that Google's Spark Operator supports richer semantics and, through the Operator and its MutatingWebhook, injects richer capabilities that sit closer to K8s.
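For reference, a minimal sketch of a native Spark-on-Kubernetes submission driven from Python; the spark-submit flags are standard, but the API server address, namespace, and image are placeholders rather than ByteDance's actual configuration:

```python
# Minimal sketch: native Spark-on-K8s submission via spark-submit.
# The master URL, namespace, and image below are placeholders, not real values.
import subprocess

cmd = [
    "spark-submit",
    "--master", "k8s://https://<apiserver-host>:6443",   # placeholder API server
    "--deploy-mode", "cluster",
    "--name", "demo-spark-job",
    "--conf", "spark.kubernetes.namespace=spark-jobs",    # placeholder namespace
    "--conf", "spark.kubernetes.container.image=<registry>/spark:3.4",  # placeholder image
    "--conf", "spark.executor.instances=4",
    "local:///opt/spark/examples/src/main/python/pi.py",  # example app shipped in the image
]
subprocess.run(cmd, check=True)
```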
ByteDance has two Spark cloud-native solutions. The first is smooth migration: the YARN submission method does not need to change, and jobs are handed through Yodel to Kubelet or Gödel for scheduling. The other is Spark native submission, where jobs are submitted to the scheduling system through Arcee. A few concepts need explaining here. Gödel is a distributed resource scheduling system developed by ByteDance that hosts the resource scheduling capabilities of YARN and Kubernetes and unifies the resource pools, quotas, scheduling, and isolation of both. Yodel is a ByteDance-developed operator that supports YARN job types and reworks YARN's RM and NM components. Arcee is a unified cloud-native big data operator independently developed by ByteDance that makes it easier to manage big data workloads such as Spark and Flink. The difference between Yodel and Arcee is that Yodel is a "YARN-protocol-compatible" big data on Gödel solution, while Arcee is a "K8s-protocol-compatible" big data on Gödel solution; under the hood, both reuse the same Gödel Scheduler and Kubelet technologies.
The practice described in this article uses the fully cloud-native deployment, with jobs submitted through the Arcee Operator. Arcee's core capabilities include job lifecycle management, job resource management, and some engine customization features.

Introduction to Arcee

The core design idea of Arcee is two-level job management, which borrows YARN's two-level management model: a central management service, the AM, is responsible for creating and maintaining big data jobs, and the AM then creates and maintains the computing workers. For a Spark job, Arcee creates the Driver, and the Driver creates the required Executors. This management model effectively manages and expresses the state of big data jobs and allows job management strategies to be customized; it also ensures the compute engine keeps full control over running jobs and can adjust resource usage as needed.
The overall architecture is shown in the figure. The Arcee Operator contains six modules. The Arcee CRD module defines two resource types, ArceeApplication and ArceeCommand: ArceeApplication describes a specific job, and ArceeCommand describes an operation applied to a job. The Webhook module is mainly used for configuration injection and validation on Applications/Pods. The Application Manager handles job lifecycle management; the PodSet Manager handles job resource management; the Engine Manager implements engine-specific customization capabilities; and the Scheduler Manager is the scheduler integration layer, connecting big data jobs such as Spark with the batch scheduler.
The complete job submission flow is as follows. Arnold (the machine learning platform) initiates the Spark job submission, calling the Spark Client with the required parameters to submit the job to K8s. In Arcee mode, the Spark Client uses the built-in Arcee Client to create a Spark ArceeApplication, which is preprocessed by the Webhook and submitted to the APIServer. The Arcee Controller then receives the Application creation event; the Arcee Application Manager generates the corresponding job state and creates the Driver according to the description in the Application, and the Driver creates the required Executors on demand. Arcee continuously monitors all Executors and also performs the related configuration injection. All Driver and Executor Pods in the Application are tracked by Arcee's PodSet Manager for resource usage statistics and to provide relevant information to the other modules.

Spark on Arcee

Spark on Arcee can be regarded, to some extent, as an improvement on the Spark native deployment model. The main differences are that the built-in K8s client in the Spark Client is replaced by the Arcee Client; the component responsible for managing the Driver load becomes the Arcee Operator; and the Driver and Executor, instead of being independent of each other, are maintained under a unified Arcee Application. Arcee also provides job lifecycle management, scheduling shielding, and other related functions.

Spark engine optimization

Against the business background described in the previous section, the following enhancements were made on the Spark engine side. Each problem and its solution is described below.
  • Graceful Executor exit to avoid abnormal MPS state
      Some Spark database-flushing jobs that require GPUs currently run on K8s, colocated with online services. These jobs share the GPU device on the host through MPS (NVIDIA's Multi-Process Service, which lets multiple processes space-division multiplex the GPU at the same time instead of the default time-division multiplexing). If one of the sharing processes is killed while executing a kernel, it can easily trigger a fatal exception at the hardware level that brings down the other processes on the GPU, so the graceful exit of each process must be handled.
      Containers running on K8s may be evicted or killed for scheduling reasons or because resources are exhausted. We carefully analyzed the exit scenarios of the various Executors and Workers across the Driver, Executor, Daemon, and Worker relationships. By implementing graceful Executor exit in the container environment (catching the exit signal and automatically calling cudaDeviceSynchronize), we prevent MPS from being left in an undefined state when an offline process exits (an illustrative sketch follows the list below).
  • Solving the problem of large numbers of Pending Pods through Quota
      Spark supports Dynamic Allocation, and in practice users generally set the maximum number of executors to a fairly large value. To prevent large numbers of Pending Pods, Arnold currently performs quota verification against this Max: a job is only actually submitted to K8s when there is enough quota to start Max executors; otherwise it waits in a queue inside the Arnold service. The drawback of checking quota against Max is that it easily wastes resources: if the queue's available quota is smaller than Max, the task could, given its characteristics, be started first to use the currently available resources, but the current quota-check logic leaves those resources unusable and the task keeps queueing at the upper layer. This problem can be addressed by the following means (a configuration sketch follows the list below):
    • Use the spark.kubernetes.allocation.batch.size parameter to control the number of Pods launched in each allocation batch;
    • Limit the maximum number of Pending Pods for a single job through the spark.kubernetes.allocation.maxPendingPods parameter;
    • Parameter tuning alone, however, cannot handle many jobs being submitted to the same queue in the same time window. A Webhook can therefore check quota per queue: if there is no quota, Pod creation fails, Spark handles the exception, and a Pod-creation strategy such as exponentially increasing the creation interval resolves the issue.
  • Robustness optimization for jobs on unstable colocated resources
To give a few examples: while optimizing the stability of scheduled resources, we often found Spark Executor Pods being abnormally rejected (UnexpectedAdmissionError) during repeated stress tests. A focused investigation fixed several race conditions in a chain of Kubelet logic, and the average daily fill rate of the colocated resource limit rose steadily. We also carried out a series of tuning and modifications, adding GPU metric collection points to make resource usage easier to observe, and improving tasks' tolerance of resource instability through parameters such as blacklist and speculation (a configuration sketch follows below).
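As an illustration of the graceful-exit idea in the first item above (not ByteDance's actual implementation; a PyTorch-based GPU worker process is assumed):

```python
# Hypothetical sketch: catch termination signals in a GPU worker, synchronize any
# outstanding CUDA work, then exit cleanly so MPS is not left mid-kernel.
import signal
import sys

import torch

def graceful_exit(signum, frame):
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # analogous to calling cudaDeviceSynchronize()
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_exit)  # eviction / kill path on K8s
signal.signal(signal.SIGINT, graceful_exit)

# ... the normal inference loop would run here ...
```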
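For the Pending Pods item, a hedged sketch of the allocation-related settings named above; the values are arbitrary examples, the keys are normally passed at submit time, and spark.kubernetes.allocation.maxPendingPods requires a recent Spark release:

```python
# Sketch: confs that bound how aggressively executor Pods are created on K8s.
# Illustrative values only; normally supplied via --conf on spark-submit.
allocation_confs = {
    # number of executor Pods requested per allocation round
    "spark.kubernetes.allocation.batch.size": "5",
    # upper bound on Pending executor Pods for a single job (newer Spark versions)
    "spark.kubernetes.allocation.maxPendingPods": "50",
}

submit_args = [arg for key, val in allocation_confs.items() for arg in ("--conf", f"{key}={val}")]
print(submit_args)
```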
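For the robustness item, a sketch of standard Spark fault-tolerance settings of the kind referred to by "blacklist" and "speculation"; the exact knobs and values used internally are not documented here, so these are plausible public equivalents with illustrative values:

```python
# Sketch: public Spark settings for tolerating flaky executors/nodes (illustrative values).
fault_tolerance_confs = {
    "spark.speculation": "true",               # re-launch suspiciously slow tasks elsewhere
    "spark.excludeOnFailure.enabled": "true",  # exclude failing executors/nodes
                                               # (successor of the old blacklist feature)
    "spark.task.maxFailures": "8",             # allow more retries on unstable resources
}
```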

Integration with the surrounding ecosystem

In a Spark on K8s environment, logs and monitoring metrics are also very important: they help us observe the running state of the whole cluster, containers, and tasks, locate problems quickly from logs and metrics, and handle them in time. A trace system was therefore built up gradually during Spark's move to cloud native. The Arcee Operator and the Gödel scheduler provide job metrics, Spark provides business metrics, and the per-node Metrics Collector component provides physical machine and container metrics. For logs, a Log Agent running on each node collects logs from specified paths and automatically uploads them to the log platform for analysis and query. All metrics and logs can be queried in real time from the Arnold machine learning platform, and specific data tables are also provided so users can run higher-level queries as needed, such as building reports, optimizing jobs, or discovering job anomalies. At the same time, Arnold can keep up with updates in a timely manner through image management.

Model inference practice at 10,000-GPU scale

Our clusters are currently divided mainly into offline clusters and online clusters. Offline clusters focus on training tasks and primarily on task throughput, with cards such as V100, A100, and A800. Online clusters serve online inference and focus on latency and throughput, mostly with smaller cards such as T4, A10, and A30. In total there are tens of thousands of GPU cards.

The main contradiction

The main contradiction at present is that the offline cluster's quota is fully allocated: logically the resources have all been handed out, yet the overall utilization of the offline cluster still has plenty of room to improve, and many internal computing needs remain unmet. Our cluster is like a large container and the high-priority tasks are like stones: even when the container is full of stones, there are still many gaps between them, and those gaps can hold a lot of sand. So our problem is to find these gaps and fill them with sand, in other words, to find suitable reusable resources and match them with suitable tasks.

Resources

Offline cluster: low-priority tasks

The first source is low-priority tasks in the offline cluster. These live entirely within the offline cluster and are not latency-sensitive. We use the idle resources at low priority, scheduling low-priority tasks when resources are free; when high-priority tasks arrive, they preempt the low-priority ones. These are whole-card resources, and their supply has no obvious pattern, because offline submission itself has no obvious pattern, and the overall isolation level is relatively low.

Online -> Offline: Tide

The second source is tidal resources lent from online to offline: idle resources of the online cluster are lent to the offline cluster, implemented on top of Virtual-Kubelet. These are also whole-card resources, and their supply follows an obvious pattern, rising and falling with the business tide. When the online business is at its trough, resources are freed up through automatic scale-in and lent to the offline cluster; when the peak arrives, the online services scale out again and the offline Pods are evicted. This is a medium isolation level: online and offline Pods may run on the same machine, but the cards themselves are still isolated.

Online -> Offline: Normal co-location

The third source is normal co-location resources from online to offline: part of the compute capacity of relatively under-utilized GPUs in the online cluster is lent to the offline cluster. The main reason is that some models do not use a whole card, so the idle compute capacity can be reused. The overall implementation is based on Virtual-Kubelet + ByteCUDA + MPS.
ByteCUDA is a self-developed CUDA hook. It handles memory isolation and time-division multiplexing in the upper layer, while MPS underneath improves the overall isolation level through space-division multiplexing. What is lent is actually a partial-card resource: one card serves multiple purposes and may carry both online and offline tasks. The advantage is that the supply is relatively stable, since these services do not auto-scale. Online and offline run on the same card, so the isolation requirements here are the highest.
A big question with normal co-location is how to keep offline work from affecting online work. First, memory isolation and compute isolation are required. In addition, we use a load-adaptive dynamic lending algorithm, or lending strategy: we observe GPU indicators such as power consumption over a window of time and use them to decide whether offline computation should actively back off from online compute requests, so that the online side is affected as little as possible (a hypothetical sketch of the idea follows).
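Purely to illustrate the idea (the actual ByteCUDA strategy is internal and not described here), a hypothetical sketch of a windowed check that makes offline work back off when the GPU looks busy; the pynvml sampling, threshold, and window size are all assumptions:

```python
# Hypothetical sketch of load-adaptive back-off: sample GPU utilization over a sliding
# window and pause offline batches while the online side appears busy.
import time
from collections import deque

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # assumes GPU 0

window = deque(maxlen=30)        # last ~30 utilization samples
BUSY_THRESHOLD = 60.0            # assumed percentage above which offline yields

def online_looks_busy():
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    window.append(util)
    return sum(window) / len(window) > BUSY_THRESHOLD

def run_offline_batches(batches, run_batch):
    for batch in batches:
        while online_looks_busy():
            time.sleep(0.5)      # back off so online requests are served first
        run_batch(batch)
```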
MPS is also notorious for its fault-propagation problem; that issue is handled by the graceful exit described earlier. From the results shown above, online throughput before and after co-location is essentially unchanged, and latency increases by about 0.75 ms, which is acceptable, while GPU utilization rises from roughly 10% to 70%. The impact on the online side is small, but the utilization gain is large.

Tasks

After resources come tasks, the "sand" in the analogy above. First, the demand for such tasks must be large enough, otherwise the effort is not worthwhile. Second, because the resources themselves are fragmented, task sizes must be moderate; they cannot be especially large. The tasks also must not heavily consume non-isolated resources such as disk and network. Finally, the tasks must adapt to automatic scaling of the resources: since the resources are elastic, a task must automatically make use of them when they expand, and must not be disrupted when they shrink.

Spark-based offline inference tasks

Based on the requirements above, we finally settled on Spark-based offline inference tasks. First, there is a large amount of internal offline inference demand, so the demand is big enough. Second, the inference workflow is relatively simple: there is no communication between Executors, so there is no need for RDMA or high-speed NICs. Third, a large amount of data already lives in HDFS and Hive, which fits Spark naturally. We also want Spark's data processing and data distribution capabilities, and its ability to scale executors up and down through Dynamic Allocation.

SDK build

Once the task type was decided, what we had to do was encapsulate best practices as much as possible. Above is a schematic of the SDK: a Tide Box that supports common model inference frameworks such as PyTorch and TensorFlow and also supports partition-level checkpoints, so that when resources are withdrawn there is no need to recompute finished partitions, avoiding wasted compute, while batching support improves overall resource utilization (a rough sketch of this pattern follows).
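A minimal sketch of what such a Spark inference job might look like; this is not the internal Tide Box SDK, and the table name, model path, column names, and Dynamic Allocation values are placeholders:

```python
# Sketch: PySpark offline inference with Dynamic Allocation and per-partition batching.
# Table, model path, columns, and conf values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("offline-inference-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "200")              # illustrative cap
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed without an external shuffle service
    .enableHiveSupport()
    .getOrCreate()
)

def infer_partition(rows, batch_size=64):
    # Runs on each executor: load the model once per partition and score in batches.
    import torch
    model = torch.jit.load("/models/example.pt")   # placeholder model path
    model.eval()

    def flush(batch):
        feats = torch.tensor([r["features"] for r in batch])   # assumes a 'features' column
        with torch.no_grad():
            preds = model(feats).tolist()
        for r, p in zip(batch, preds):
            yield (r["id"], p)                                  # assumes an 'id' column

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from flush(batch)
            batch = []
    if batch:
        yield from flush(batch)

df = spark.table("ods.example_inputs")                          # placeholder Hive table
predictions = df.rdd.mapPartitions(infer_partition).toDF(["id", "prediction"])
predictions.write.mode("overwrite").saveAsTable("ods.example_outputs")  # placeholder output
```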

Platform construction

With the SDK in place, platform construction is also very important. We do not want users to run Spark submit commands directly, which would be hard to govern, so the Arnold machine learning platform serves as the base for managing these resources in a unified way. Development and debugging are Notebook-based: the necessary variables can be preset, users can debug step by step without hand-crafting submit commands, and switching between resources for different scenarios is convenient and flexible.
Tasks can also be started in multiple ways, such as upstream event triggers, scheduled triggers, and API triggers, which makes them convenient to use: for example, users can flexibly assemble pipelines or meet automation needs through these mechanisms. Task operations and maintenance are also important, with one-click viewing of historical tasks and problem backtracking.

Task-resource matching

There are several kinds of Spark inference tasks. The first is sudden, urgent demand: the resource requirements are relatively large, the time is tight, and the work is usually non-routine. For these we use the offline low-priority and tidal resources, which offer more whole cards and stronger compute. The second is batch backfill tasks that need to be rerun periodically, with relatively large resource requirements and average urgency; these use the tidal and normal co-location resources. Routine tasks tend to be daily with medium resource requirements, and we support them with the normal co-location resources, whose supply is relatively more stable.
At peak, about 10,000+ tidal GPUs are lent from online to offline, and normal co-location contributes about 20,000+ more; thanks to this co-location, overall utilization has roughly doubled. There are about 100+ offline inference tasks every day, and a single task is capped at 5K+ GPUs, a limit we impose deliberately, otherwise users would scale up further and occupy all the resources. A typical database-flushing task that would take 9.5 days on stable resources was shortened to 7.5 hours using these elastic resources.

Future outlook

In the future there is more to do with these elastic co-located resources, both to raise overall resource utilization and to make better use of the resources we have. First, we will try to further avoid impact on online jobs, so that offline can use more capacity and reach higher utilization. We also need to onboard more businesses to expand the overall scale and returns, improve overall performance, and at the same time reduce or, as far as possible, avoid the problems caused by resource reclamation.