[SIGMOD 2023] GoldMiner, an elastic data pipeline system for deep learning that greatly improves task and cluster efficiency

Section 1: Overview

Recently, the paper "GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning", co-authored by Alibaba Cloud's machine learning platform PAI and the team of Professor Yang Zhi at Peking University, was accepted by SIGMOD 2023, a top conference in the database field.

GoldMiner observes that the data preprocessing pipeline in deep learning jobs is stateless and therefore has inherent resource elasticity. Based on this observation, GoldMiner separates the execution of the data preprocessing pipeline from model training, identifies the stateless preprocessing computations through automatic computation graph analysis, and parallelizes and elastically scales them, thereby alleviating the data preprocessing bottleneck and improving training performance. Through co-design with the cluster scheduler, GoldMiner further exploits the resource elasticity of preprocessing computations and greatly improves cluster scheduling efficiency. Experiments show that GoldMiner can improve training performance by up to 12.1x and GPU cluster utilization by up to 2.5x.

Section 2: Background

In recent years, with the continuous evolution of GPU accelerators and the emergence of various software optimization techniques, the computational efficiency of deep learning training keeps reaching new levels. At the same time, deep learning training remains essentially a multi-stage, multi-resource workload: it not only requires a large amount of training computation on the GPU, but also typically needs a data preprocessing pipeline on the CPU side (such as data augmentation and feature transformation), and this preprocessing is a necessary step for training a high-quality model. As GPU-side training performance improves, it places ever greater pressure on data preprocessing, making the latter a new performance bottleneck.
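To make this two-stage structure concrete, here is a minimal PyTorch-style sketch of such a job; the model, dataset path, and transforms are illustrative placeholders rather than anything from the paper:

```python
# Minimal sketch of the two-stage structure described above
# (illustrative only; model, dataset path, and transforms are placeholders).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CPU-side data preprocessing pipeline: decode, augment, normalize.
preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("/path/to/train", transform=preprocess)
# num_workers controls how many CPU processes run the preprocessing pipeline;
# with too few workers the GPU sits idle waiting for data.
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

model = torch.nn.Linear(3 * 224 * 224, 1000).cuda()   # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for images, labels in loader:                          # CPU preprocessing happens here
    images = images.flatten(1).cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(images), labels.cuda())
    optimizer.zero_grad()
    loss.backward()                                    # GPU training computation
    optimizer.step()
```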

We observe that data preprocessing bottlenecks profoundly affect both the training performance of individual jobs and the resource utilization efficiency of the cluster. On the one hand, for a single training job, a data preprocessing bottleneck directly translates into lost training performance. On a virtual machine equipped with 8 V100 GPUs and 64 vCPU cores, we ran a performance test using one GPU and varying numbers of vCPUs to see how many vCPUs each model needs to reach its optimal performance. The results (figure below) show that most models need more than 8 vCPUs (the average number of vCPUs available per GPU) to reach optimal performance, and some models even need all 64 vCPUs of the 8-GPU machine. This means such models may not obtain enough CPU resources in a shared cluster, degrading the data preprocessing stage and ultimately hurting training efficiency (the right vertical axis in the figure below shows relative performance).

[Figure: relative training performance of different models as the number of vCPUs per GPU varies]
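The measurement is straightforward to reproduce in spirit. Below is a hypothetical harness (not the authors' benchmark code) that sweeps the number of DataLoader worker processes for a given dataset (for example the one defined in the earlier sketch) and reports preprocessing throughput:

```python
# Hypothetical harness (not the authors' benchmark code): sweep the number of
# DataLoader worker processes and measure preprocessing throughput, to see how
# many CPU workers a given input pipeline needs before it stops scaling.
import time
from torch.utils.data import DataLoader

def pipeline_throughput(dataset, num_workers, num_batches=100, batch_size=256):
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=True)
    it = iter(loader)
    next(it)                                  # warm up the worker processes
    start = time.time()
    for _ in range(num_batches):
        next(it)                              # CPU-side decode + augment only
    return num_batches * batch_size / (time.time() - start)

# for workers in [1, 2, 4, 8, 16, 32, 64]:
#     print(workers, round(pipeline_throughput(dataset, workers)), "samples/s")
```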

On the other hand, these problems become even more serious in cloud scenarios, where they hurt resource allocation efficiency in shared clusters. Enterprises commonly build or purchase shared GPU clusters to run training jobs, and GPU utilization is critical in such clusters. To avoid their jobs being starved of CPU, users may proactively raise the CPU-GPU ratio of their jobs. However, these user-defined CPU-GPU ratios easily lead to cluster resource fragmentation. For example, when jobs with high CPU-GPU ratios run on a machine, its CPUs are exhausted before its GPUs, so the remaining idle GPUs on that machine cannot be allocated; this not only wastes expensive GPU resources but also increases job queuing time. Our observations of Alibaba's internal GPU clusters found that nearly 40% of job waiting time was spent in this "enough GPUs but not enough CPUs" situation.

[Figure: CPU-GPU resource fragmentation in a shared GPU cluster]

One way to solve both problems is to decouple GPU-side training from CPU-side data preprocessing, so that the resource allocation of these two parts of the computation no longer has to be bound to the same machine. When the local CPU resources are insufficient, resources on other machines can be used: a single job can be given more CPU resources for acceleration, and the problem of fragmented GPUs that cannot be allocated is also alleviated. This idea is not new, but a series of technical challenges remain in using it to improve job and cluster efficiency.

Section 3: Challenges

Although some solutions (such as tf.data service and PyTorch DPP) already support executing data preprocessing computations separately, the following challenges remain in existing techniques:

  1. Efficiency of computation partitioning: existing techniques simply rely on the Dataset/DataLoader APIs provided by deep learning frameworks, treating only the computations encapsulated inside these APIs as data preprocessing. However, we find that even outside these APIs there may still be computations that can be executed separately; this simple partitioning misses the opportunity to parallelize and accelerate that part.
  2. Intrusiveness to user code: techniques such as tf.data service [1] and PyTorch DPP [2] require users to refactor part of their code logic in order to execute data preprocessing separately, which is fairly intrusive (see the sketch after this list). We hope to achieve this separation in a way that is transparent to users.
  3. Integration with cluster scheduling: once separated from training, data preprocessing computation has inherent resource elasticity, yet none of the existing techniques exploits this elasticity at the cluster scheduling level to improve the overall resource utilization of the cluster.
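As a concrete illustration of the intrusiveness in point 2, the sketch below shows roughly what offloading preprocessing with tf.data service looks like in TensorFlow 2.x; the dispatcher address and parsing function are placeholders, and exact API details may vary across versions:

```python
# Rough illustration of challenge 2: with tf.data service the user must modify
# the input pipeline code to opt in to remote preprocessing (TensorFlow 2.x-style
# API; dispatcher address and parsing logic are placeholders).
import tensorflow as tf

def parse_and_augment(record):
    # placeholder for record parsing and data augmentation
    return record

def build_dataset(files, dispatcher_address=None):
    ds = tf.data.TFRecordDataset(files)
    ds = ds.map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(256)
    if dispatcher_address is not None:
        # Explicit opt-in: the preprocessing above is shipped to remote tf.data workers.
        ds = ds.apply(tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",
            service=f"grpc://{dispatcher_address}"))
    return ds.prefetch(tf.data.AUTOTUNE)
```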

Section 4: The Solution

GoldMiner is an automatic and elastic data preprocessing service. As shown in the figure below, GoldMiner uses data worker (DW) and training worker (TW) roles to perform data preprocessing and training respectively. GoldMiner automatically identifies the data worker computations from the original user code, including computations not encapsulated in Dataset/DataLoader APIs. It also implements elastic scaling of data preprocessing computations, and further improves cluster efficiency through co-design with the cluster scheduler.

[Figure: GoldMiner architecture, with data workers (DW) performing preprocessing and training workers (TW) performing model training]

The key to achieving this is exploiting the stateless nature of data preprocessing computations. Stateless here means that data preprocessing does not depend on the model parameters, which are updated repeatedly in every training iteration; computations that do not depend on model parameters can therefore be executed asynchronously with the training part. Deep learning computation can be expressed as a data-flow graph (DFG), and GoldMiner automatically finds the stateless subgraphs by analyzing the user's DFG. The figure below shows the DFG of a typical recommendation model. Unlike splitting directly at the Dataset boundary ("Simple partition" in the figure), GoldMiner automatically extends the partition to subsequent feature transformation operations by identifying their dependencies on model parameters ("Expected partition" in the figure). Experiments show that this extension further improves the parallel acceleration achieved by data workers by 1.6x.

[Figure: data-flow graph of a typical recommendation model, comparing the simple Dataset-boundary partition with GoldMiner's expected partition]
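The idea behind this analysis can be sketched as a simple reachability pass over the graph: mark every node that transitively depends on a model parameter as stateful, and treat the remaining nodes as the stateless subgraph. The toy code below is our own illustration of that principle, not GoldMiner's implementation:

```python
# Toy sketch of the graph analysis idea (not GoldMiner's actual implementation):
# any node whose computation does not (transitively) depend on model parameters
# is treated as stateless and becomes a data worker candidate.
from collections import defaultdict

def find_stateless_nodes(edges, parameter_nodes):
    """edges: list of (src, dst) pairs; parameter_nodes: set of parameter/variable nodes."""
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    # Forward-propagate "depends on parameters" from every parameter node.
    stateful, stack = set(parameter_nodes), list(parameter_nodes)
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in stateful:
                stateful.add(nxt)
                stack.append(nxt)

    all_nodes = {n for e in edges for n in e}
    return all_nodes - stateful   # stateless subgraph -> data worker candidates

# Example: read -> decode -> augment -> embed(params) -> loss
edges = [("read", "decode"), ("decode", "augment"),
         ("augment", "embed"), ("params", "embed"), ("embed", "loss")]
print(find_stateless_nodes(edges, {"params"}))   # {'read', 'decode', 'augment'}
```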

Building on automatic graph partitioning, GoldMiner parallelizes the data worker subgraph across multiple data workers and handles data transfer between data workers and training workers. Thanks to the stateless nature of data workers, this distributed execution supports dynamically scaling the number of data workers, making more efficient use of cluster resources as the amount of idle resources changes.
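Conceptually, the decoupled execution resembles a producer-consumer pattern: stateless data workers produce preprocessed batches into a queue, and the training worker consumes them. The following sketch is a single-machine, multiprocessing approximation of that design (an assumed illustration; GoldMiner itself can place data workers on other machines):

```python
# Minimal sketch (assumed design, not GoldMiner's code) of decoupled execution:
# several data worker processes run the stateless preprocessing subgraph and
# push ready batches into a shared queue that the training worker consumes.
import multiprocessing as mp

def data_worker(worker_id, batch_queue, num_batches=100):
    for i in range(num_batches):
        batch = {"worker": worker_id, "index": i}   # stand-in for a preprocessed batch
        batch_queue.put(batch)

def training_worker(batch_queue, total_batches):
    for _ in range(total_batches):
        batch = batch_queue.get()                   # blocks if data workers fall behind
        # ... run the GPU training step on `batch` here ...

if __name__ == "__main__":
    queue = mp.Queue(maxsize=64)                    # bounded queue between DW and TW
    num_data_workers = 4                            # elastic: could grow or shrink at runtime
    workers = [mp.Process(target=data_worker, args=(i, queue))
               for i in range(num_data_workers)]
    for w in workers:
        w.start()
    training_worker(queue, total_batches=num_data_workers * 100)
    for w in workers:
        w.join()
```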

To fully exploit the resource elasticity of data workers, GoldMiner provides a data worker scheduler that dynamically adjusts resources at both the job level and the cluster level. For each job, GoldMiner tunes the sizes of its DWs and TWs to find the configuration with the highest scaling efficiency; at the cluster level, GoldMiner dynamically adjusts the allocation of data workers across jobs to optimize global scheduling objectives (such as minimizing average job completion time). Both levels of adjustment rely on a unified performance indicator: the queue that transfers data between the DWs and the TWs. The state of this queue reflects the relative speed of the DWs and TWs, and hence the potential benefit of adding more DWs. Experiments on a 64-GPU cluster show that the GoldMiner scheduler can reduce average job completion time by 2.5x and increase the GPU allocation rate by 2.1x.

[Figure: experiment results of the GoldMiner data worker scheduler on a 64-GPU cluster]
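As a rough illustration of how such a queue can serve as the scaling signal, the policy below (with made-up thresholds, not the paper's actual algorithm) scales data workers up when the training worker drains the queue faster than it is filled, and down when data workers outpace training:

```python
# Illustrative scaling policy (made-up thresholds, not the paper's algorithm):
# the fill level of the DW->TW queue indicates whether data workers are keeping
# up with training, and thus whether adding data workers would help.
def scaling_decision(queue_size, queue_capacity, low=0.2, high=0.8):
    fill = queue_size / queue_capacity
    if fill < low:
        return "scale_up"     # TW starves for data: more DWs are likely to help
    if fill > high:
        return "scale_down"   # DWs outpace TW: free CPU for other jobs
    return "keep"

# A cluster-level scheduler could use the same signal to rank jobs, e.g. handing
# idle CPU to the jobs whose queues are emptiest first.
print(scaling_decision(queue_size=5, queue_capacity=64))    # -> "scale_up"
```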

Section 5: Applications

We have evaluated GoldMiner on a customer's real recommendation model; the results show that GoldMiner speeds up the customer's model by 1.43x and reduces training cost by 13%. It is currently being deployed.

We have also developed a PyTorch implementation, which will soon be integrated with PAI-DLC to provide users with data preprocessing acceleration.

Section 6: Paper Information

  • Paper title: GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
  • Authors: Zhao Hanyu, Yang Zhi, Cheng Yu, Tian Chao, Ren Shiru, Xiao Wencong, Yuan Man, Chen Langshi, Liu Kaibo, Zhang Yang, Li Yong, Lin Wei
  • Paper PDF: https://dl.acm.org/doi/pdf/10.1145/3589773
  • References:

[1] Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, Chandramohan A. Thekkath. A Case for Disaggregation of ML Data Processing. https://arxiv.org/abs/2210.14826

[2] Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parik Pol. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training. ISCA'22
