Design and Practice of the Dewu AI Platform KubeAI Inference and Training Engines

1. Introduction to KubeAI

KubeAI is Dewu's cloud-native AI platform. It grew out of our containerization effort, during which we gradually collected and distilled the needs of the company's business domains for AI model research and production iteration. Taking the model as the main thread, KubeAI provides solutions for the entire life cycle, from model development, to model training, to inference service management, and on to continuous iteration of model versions.

On the data side, KubeAI provides a CVAT-based labeling tool that connects to the data processing and model training pipelines to speed up the iteration of online models; it provides task/Pipeline orchestration, connects to ODPS/NAS/CPFS/OSS data sources, and gives users a one-stop AI workstation. The platform's self-developed inference engine helps business teams improve model service performance while controlling cost; the self-developed training engine improves the throughput of training tasks, shortens training time, and helps model developers accelerate model iteration.

In addition, with the rapid development of AIGC, we investigated the company's internal needs for AI-assisted production and launched an AI drawing feature, providing basic and general AI drawing capabilities for business scenarios such as Dewu posters, marketing activities, and designer teams.


Previously, we introduced the construction of KubeAI and its implementation in the business in the article "Understanding the Implementation Process of the Dewu Cloud-Native AI Platform KubeAI". In this article, we focus on the practical experience of the KubeAI platform's core engine capabilities in inference, training, and model iteration.

2. AI Inference Engine Design and Implementation

2.1 Inference Service Status Quo and Performance Bottleneck Analysis

Python is widely used for model research and development because of its flexibility and light weight, as well as its rich library support for neural network training and inference. As a result, our model inference services are mainly Python GPU inference services. Model inference generally involves pre-processing, model inference, and post-processing. In single-process mode, the CPU pre-/post-processing and the GPU inference have to run serially or pseudo-concurrently. The general flow is shown in the figure below:

[Figure: single-process inference flow]

The advantage of this architecture is that the code is easy to understand, but its performance is poor and the QPS it can sustain is low. Stress tests on CV models showed that it was difficult to reach even 5 QPS. In-depth analysis identified the following causes:

(1) In single-threaded mode, the CPU logic and the GPU logic wait on each other, and GPU kernel launches are not scheduled densely enough, so GPU utilization stays low and service QPS cannot be raised. In this case the only way to raise QPS is to start more processes, which brings greater GPU memory overhead.

(2) In multi-threaded mode, because of Python's GIL, Python multi-threading is pseudo multi-threading: threads do not truly run concurrently but take turns by competing for the GIL. In some cases the GPU kernel launch thread cannot be scheduled sufficiently, and enabling multi-threading in a Python inference service also causes the kernel launch thread to be frequently interrupted by CPU threads, so GPU utilization stays sluggish and keeps declining.

Because of these problems, the only way for an inference service to handle more traffic is to scale out horizontally with more instances, which drives up cost.

2.2 Self-developed inference service unified framework kubeai-inference-framework

To solve these problems, KubeAI separates the CPU logic and the GPU logic into two different processes: the CPU process is mainly responsible for image pre-processing and post-processing, while the GPU process is mainly responsible for executing CUDA kernels, i.e. model inference.

To help model developers adopt this optimization quickly, we developed ***kubeai-inference-framework***, a unified Python framework that separates the CPU and GPU processes. Existing Flask or KServe services can access the framework with minor modifications, and new services only need to implement the functions specified by the framework. The unified inference framework is shown in the figure below:

[Figure: kubeai-inference-framework architecture]

As mentioned above, the main idea of the unified inference framework is to separate the GPU logic and the CPU logic into two processes. In addition, a Proxy process is started for routing and forwarding.

CPU process

The CPU process is responsible for the CPU-side logic of the inference service, including pre-processing and post-processing. Pre-processing generally covers image decoding and transformation; post-processing generally covers logic such as interpreting inference results. After pre-processing, the CPU process calls the GPU process to run inference and then continues with the post-processing logic. The CPU and GPU processes communicate through shared memory or over the network; shared memory avoids sending images over the network.

GPU process

The GPU process is responsible for the GPU inference logic. When it starts, it loads the models into GPU memory; after receiving an inference request from the CPU process, it launches the CUDA kernels to run the model.

The kubeai-inference-framework provides a Model class interface for model developers. They do not need to care about the underlying call logic; they only need to fill in the pre-processing and post-processing business logic to launch a model service quickly, and the framework automatically starts the processes described above.
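The interface looks roughly like the sketch below. This is illustrative only, not the actual framework API: the Model base class, the method names, and the decode_and_resize helper are hypothetical and are used here just to show how pre-processing, inference, and post-processing are split between the CPU and GPU processes.

import numpy as np
import torch

class MyClassifier(Model):  # Model base class assumed to be provided by the framework
    def load(self):
        # Runs in the GPU process: load weights into GPU memory once at startup.
        self.net = torch.jit.load("model.pt").cuda().eval()

    def preprocess(self, request):
        # Runs in the CPU process: decode and resize the input image.
        img = decode_and_resize(request["image"])  # hypothetical helper
        return np.ascontiguousarray(img, dtype=np.float32)

    def predict(self, batch):
        # Runs in the GPU process: pure kernel launches, no CPU-heavy work.
        with torch.no_grad():
            return self.net(torch.from_numpy(batch).cuda()).cpu().numpy()

    def postprocess(self, output):
        # Runs in the CPU process: interpret the raw model output.
        return {"label": int(output.argmax())}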

Proxy process

The Proxy process is the entry point of the inference service. It exposes the external call interface and is responsible for route distribution and health checks. When the Proxy process receives a request, it distributes it to a CPU process for handling in a round-robin manner.

By separating the CPU logic (image decoding, post-processing, etc.) and the GPU logic (model inference) into two different processes, the self-developed unified inference framework effectively solves the GPU kernel launch scheduling problem caused by the Python GIL, improves GPU utilization, and improves inference service performance. For one online inference service, we used the framework to separate the CPU and GPU processes; the stress-test results are shown in the table below, with QPS increasing by nearly 7 times.

Inference service framework type                              QPS     Latency (s)   GPU utilization (%)
Traditional multi-threaded architecture                       4.5     1.05          2.0
Self-developed framework (6 CPU processes + 1 GPU process)    27.43   0.437         12.0

2.3 Doing Better: Introducing TensorRT Optimization and Acceleration

While helping inference services integrate into the unified kubeai-inference-framework, we also kept trying to optimize the models themselves. After research and verification, we converted existing .pth models into TensorRT format and enabled FP16, which yielded a further QPS improvement in the inference stage, up to 10 times.

TensorRT is a software development kit for high-performance deep learning inference released by NVIDIA. It builds optimized deep learning models into inference engines that can be deployed in production environments and provides hardware-level performance optimization. The most common TensorRT workflow in the industry is to first convert a PyTorch/TensorFlow model to ONNX format, and then convert the ONNX model into a TensorRT (trt) engine for optimization, as shown in the figure below:

[Figure: PyTorch/TensorFlow -> ONNX -> TensorRT conversion flow]
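A minimal sketch of this conversion flow is shown below. The toy model, file names, and input shape are placeholders, and the exact trtexec flags depend on the TensorRT version installed.

import torch

# A stand-in model; in practice this is the trained PyTorch model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
).eval()
dummy = torch.randn(1, 3, 224, 224)

# Step 1: export the PyTorch model to ONNX.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Step 2: build a TensorRT engine from the ONNX file, for example with the
# trtexec CLI (flags vary by TensorRT version):
#   trtexec --onnx=model.onnx --saveEngine=model.trt --fp16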

TensorRT's work happens mainly in two phases: the network build phase and the model runtime phase.

  • Network build phase
  1. Model parsing and building: load the ONNX network model.
  2. Computation graph optimization, including horizontal and vertical operator fusion, etc.
  3. Node elimination: remove useless nodes.
  4. Multi-precision support: FP32/FP16/INT8, etc.
  5. Optimizations specific to the target hardware.
  • Model runtime phase
  1. Deserialize and load the TensorRT engine file.
  2. Provide the runtime environment, including object lifecycle management and host/GPU memory management, etc.
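A minimal sketch of the runtime phase is shown below: deserializing a saved engine and creating an execution context. It follows the TensorRT 8.x Python API; details differ between TensorRT versions, and buffer management is omitted.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.trt", "rb") as f, trt.Runtime(logger) as runtime:
    # Deserialize the engine built during the network build phase.
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
# Input/output device buffers are then allocated (e.g. with pycuda or torch.cuda
# tensors) and inference is run via context.execute_v2(bindings).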

To help model developers use TensorRT more easily, the KubeAI platform provides the ***kubeai-trt-helper*** tool. Users can use it to convert models into TensorRT format, and if problems such as precision loss occur during the conversion, the tool also helps locate and solve them. kubeai-trt-helper assists users in two stages: problem location and model conversion.

Problem location

The problem location stage mainly addresses the precision loss that can appear when a model is converted to TensorRT with FP16 enabled. For general classification models, FP16 should be enabled whenever the accuracy requirements are not extreme: NVIDIA GPUs have dedicated Tensor Cores for FP16 matrix operations, and FP16 throughput is more than double that of FP32. When precision loss does appear after enabling FP16 during conversion, the kubeai-trt-helper tool works roughly as follows in the problem location stage:

[Figure: kubeai-trt-helper problem location workflow]

Step 1: After the accuracy requirements for the conversion are set, mark all operators as outputs, then compare the output accuracy of every operator.

Step 2: Find the earliest operator that does not meet the accuracy requirement and intervene in one of the following ways:

  • Mark this operator as FP32.
  • Mark its parent operator as FP32.
  • Change the optimization strategy of this operator.

Repeat these two steps until model parameters that meet the target accuracy are found. These parameters include, for example, which operators need to additionally fall back to FP32. The relevant parameters are written to a configuration file, as follows:

Configuration item      Meaning
FP32_LAYERS_FOR_FP16    Operators that must additionally run in FP32 when FP16 mode is enabled
TRT_EXCLUDE_TACTIC      TensorRT tactics to be ignored for operators (see the TensorRT documentation on tactics)
atol                    Absolute error tolerance
rtol                    Relative error tolerance
check-error-stat        Error statistic used for comparison: mean, median, or max
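A minimal sketch of applying such a configuration with the TensorRT Python API is shown below: the network is built in FP16 while the operators listed in FP32_LAYERS_FOR_FP16 are pinned to FP32. It follows the TensorRT 8.x API (flag names differ in older versions), and the layer names are placeholders.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Make TensorRT respect the per-layer precisions set below.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

fp32_layers = {"Conv_12", "Add_13"}  # e.g. taken from FP32_LAYERS_FOR_FP16
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.name in fp32_layers:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)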

Model conversion

In the model conversion stage, the parameters obtained in the problem location stage are used directly, and the TensorRT interfaces and tools are called to perform the conversion. In addition, we wrapped this stage to hide TensorRT's rather complex native parameters and APIs behind a more concise interface; for example, the tool automatically parses the ONNX model and determines its input and output shapes, reducing the shape information the user has to provide.
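A minimal sketch of such automatic shape detection is shown below; the file name is a placeholder.

import onnx

model = onnx.load("model.onnx")
for tensor in list(model.graph.input) + list(model.graph.output):
    # dim_param is a symbolic name (e.g. "batch"), dim_value a concrete size.
    dims = [d.dim_param or d.dim_value
            for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)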

2.4 Implementation results

In practice, we help algorithm developers move their inference services onto the self-developed unified inference framework and enable TensorRT optimization at the same time, so the QPS gains from the two optimizations usually stack.

2.4.1 Classification model: CPU/GPU separation plus TensorRT optimization with FP16, 10x QPS improvement

An online ResNet-based classification model could accept a precision loss within 0.001 (error definition: median, atol, rtol). We therefore applied three performance optimizations to its inference service:

    1. Use the unified kubeai-inference-framework to separate the CPU process and the GPU process.
    2. Convert the model to ONNX, then to TensorRT.
    3. Enable FP16 mode, use the self-developed tool to locate the operators with precision loss, and mark those operators as FP32.

After these optimizations, QPS increased 10 times (compared with direct PyTorch inference), and the service cost dropped significantly.

2.4.2 Detection model: CPU/GPU separation plus TensorRT optimization, QPS increased by about 4-5 times

An online Yolo-based detection model could not enable FP16 because of its strict precision requirements, so we applied TensorRT optimization directly in FP32 mode and used the unified kubeai-inference-framework to separate the GPU process from the CPU process, which ultimately yielded a 4-5x QPS improvement.

2.4.3 Running multiple instances of the inference process to make full use of GPU computing resources

In real scenarios, GPU computing power is often sufficient while GPU memory is not. After TensorRT optimization, the GPU memory a model needs at runtime generally drops to 1/3 to 1/2 of the original. Therefore, to make full use of GPU computing power, the kubeai-inference-framework was further extended to support multiple replicas of the GPU process inside one container. This architecture ensures both that the CPU processes can feed enough requests to the GPU and that the GPU computing power is fully used.

For one online model, TensorRT optimization reduced its GPU memory footprint from 2.4 GB to 1.2 GB. Keeping the 5 GB of GPU memory configured for the inference service unchanged, we ran four replicas of the GPU process to make full use of that memory, quadrupling the service throughput.
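The sketch below illustrates how several GPU worker processes could be started inside one container, each holding its own copy of the model. The worker function and queue-based hand-off are hypothetical and only illustrate the idea, not the actual framework internals.

import torch
import torch.multiprocessing as mp

def gpu_worker(task_queue, result_queue):
    model = torch.jit.load("model.pt").cuda().eval()  # placeholder model file
    while True:
        req_id, batch = task_queue.get()
        with torch.no_grad():
            out = model(batch.cuda()).cpu()
        result_queue.put((req_id, out))

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when using CUDA in subprocesses
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=gpu_worker, args=(tasks, results), daemon=True)
               for _ in range(4)]  # four GPU process replicas sharing one GPU
    for w in workers:
        w.start()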

3. AI training engine optimization practice

3.1 Overview of PyTorch framework

PyTorch has been one of the most popular deep learning frameworks in recent years, covering almost every business direction in CV (Computer Vision) and NLP (Natural Language Processing), and our algorithm developers basically all use PyTorch for model training. The figure below shows the basic flow of model-training code based on PyTorch:

[Figure: basic PyTorch training flow]

Step 1: Pull the data needed for this training step from the PyTorch dataloader.

Step 2: Copy the obtained data, such as sample image tensors and label tensors, to GPU memory.

Step 3: Run the actual model training: forward pass, loss computation, backward pass (gradient computation), and parameter update.

The time cost of the whole training process is distributed across these three steps. Step 2 is usually not the bottleneck, because most training sample images are resized to a small size before being copied from host memory to GPU memory. However, because models and training data differ, performance bottlenecks often appear in steps 1 and 3, leading to long training times, low GPU utilization, and slower model iteration.
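The three steps correspond to a loop like the minimal sketch below (a toy model and random data are used for illustration):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(128, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for inputs, labels in loader:                      # step 1: pull data from the dataloader
    inputs, labels = inputs.cuda(), labels.cuda()  # step 2: copy tensors to GPU memory
    optimizer.zero_grad()                          # step 3: forward, loss, backward, update
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()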

3.2 Dataloader bottleneck analysis and optimization

3.2.1 PyTorch Dataset/Dataloader analysis

PyTorch training reads data through Dataset and Dataloader: Dataset is a user-defined class for reading data (inheriting from torch.utils.data.Dataset), and Dataloader is the scheduler PyTorch uses to drive the Dataset during training. Typical usage looks like this:


# MyDataset is a user-defined dataset class inheriting from torch.utils.data.Dataset
train_loader = torch.utils.data.DataLoader(MyDataset,
                                           batch_size=16,
                                           num_workers=4,
                                           shuffle=True,
                                           drop_last=True,
                                           pin_memory=False)

val_loader = torch.utils.data.DataLoader(MyDataset,
                                         batch_size=batch_size,
                                         num_workers=4,
                                         shuffle=False)

The parameters are explained as follows:

  • dataset (Dataset): the user-defined Dataset (which implements the actual data reading).
  • batch_size (int, optional): the number of samples per batch, i.e. how much data each iteration pulls from the dataloader.
  • shuffle (bool, optional): if True, the data is reshuffled at the start of each epoch, so each epoch reads the data in a different combination and order.
  • num_workers (int, optional): the number of background worker processes the dataloader starts to load data. 0 means all data is loaded in the main process; the default is 0.
  • collate_fn (callable, optional): a function that assembles a list of samples into a mini-batch; in typical CV scenarios this is a concatenation.
  • pin_memory (bool, optional): if True, the dataloader copies tensors into pinned (page-locked) CUDA memory before returning the batch, which is useful in some scenarios.
  • drop_last (bool, optional): controls the last incomplete batch. For example, with batch_size 64 and an epoch of 100 samples, setting True drops the remaining 36 samples during training; otherwise the last batch is simply smaller. The default is False.

Among these parameters, num_workers is the most important. When the Dataloader is constructed, it starts num_workers worker processes, and the main process distributes read tasks to them; after a worker reads its data, it puts the data into a queue for the main process to fetch. The multi-process mode uses the torch.multiprocessing interface, which allows memory to be shared between the worker processes and the main process; tensors can be placed in shared memory, so results can be returned to the main process through shared memory, reducing inter-process communication overhead.

When num_workers is 0, the get_data() step and the train_model() step run serially and efficiency is very low, as shown in the figure below:

[Figure: num_workers = 0, data loading and training run serially]

When num_workers is greater than 0, multi-process data loading is enabled. If the time to read a batch of data is shorter than the time of one training step, efficiency is highest and the GPU computing power is fully used, as shown in the figure below:

[Figure: num_workers > 0, data loading faster than one training step]

When num_workers is greater than 0 but the time to read a batch of data is longer than the time of one training step, the GPU has to wait for data, GPU computing power sits idle, and training time increases, as shown in the figure below:

[Figure: num_workers > 0, data loading slower than one training step]

It follows that the Dataset's __getitem__ function is critical. Analyzing its implementation in detail, we found that its time consumption consists mainly of two parts:

  • load_image_time: time spent reading data from local or remote disk.
  • transform_image_time: time spent preprocessing the image or text data.
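A minimal sketch of instrumenting __getitem__ to measure these two parts is shown below; the dataset class, file list, and transform are placeholders.

import time
from PIL import Image
import torch
from torchvision import transforms

class ImageFolderDataset(torch.utils.data.Dataset):
    def __init__(self, paths, labels):
        self.paths, self.labels = paths, labels
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        t0 = time.time()
        img = Image.open(self.paths[idx]).convert("RGB")  # load_image_time
        t1 = time.time()
        tensor = self.transform(img)                      # transform_image_time
        t2 = time.time()
        print(f"load={t1 - t0:.4f}s transform={t2 - t1:.4f}s")
        return tensor, self.labels[idx]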

3.2.2 Solving the problem: setting reasonable parameters is important

From the analysis in the previous section, the choice of these parameters during training matters a great deal. In summary:

  • batch_size: the user can set this reasonably according to the data volume and the expected training time.
  • CPU configuration of the training environment (KubeAI Notebook/task/pipeline node): the recommended CPU configuration is (number of GPU cards) x (number of CPU cores per GPU card).
  • num_workers: set it to at least (CPU cores of the training environment - 1); for example, for a task configured with 12 cores, set it to 11. The value can be raised somewhat beyond that, because part of each dataset iteration is spent on network or disk IO, which does not consume CPU; but it should not be set too large, because data preprocessing is CPU-intensive and too many parallel workers cause CPU contention and reduce preprocessing efficiency.

Optimization Case 1

An online CV model training task based on the MMDetection framework (which calls PyTorch underneath). Before tuning, the single-step time was unstable, about 1.12 s on average, with about 0.3 s spent pulling data:


mmengine - INFO - Epoch(train) [2][3100/6005] time: 1.0193 data_time: 0.0055
mmengine - INFO - Epoch(train) [2][3150/6005] time: 1.0928 data_time: 0.3230
mmengine - INFO - Epoch(train) [2][3200/6005] time: 0.9927 data_time: 0.2304
mmengine - INFO - Epoch(train) [2][3250/6005] time: 1.3224 data_time: 0.5135
mmengine - INFO - Epoch(train) [2][3300/6005] time: 1.1044 data_time: 0.3427
mmengine - INFO - Epoch(train) [2][3350/6005] time: 1.0574 data_time: 0.2842

After tuning the parameters, the single-step time became stable at about 0.78 s on average, with a data-pull time of about 0.004 s, which is basically negligible.


mmengine - INFO - Epoch(train) [1][150/5592] time: 0.7743 data_time: 0.0043
mmengine - INFO - Epoch(train) [1][200/5592] time: 0.7736 data_time: 0.0044
mmengine - INFO - Epoch(train) [1][250/5592] time: 0.7880 data_time: 0.0044
mmengine - INFO - Epoch(train) [1][300/5592] time: 0.7761 data_time: 0.0045

For this training task, the above adjustments reduced the data-pull time to nearly zero and the single-step time from 1.12 s to 0.78 s, cutting the overall training time by about 30% (from 2 days to 33 hours), a significant improvement.

Optimization Case 2

An online multimodal model (its inputs include both images and text) trained on 2 V100 cards; the parameters were adjusted as follows:


CPU = 12         ---> adjusted to 24
num-workers = 4  ---> adjusted to 11

After the adjustment, training 300 steps takes 405 s in total, and the overall training time is reduced by about 45% (from 10 days to about 5 days).

Optimization Case 3

An online YoloX model training task on a single A100 card; the parameters were adjusted as follows:


num-workers = 4  ---> adjusted to 16

After adjustment, the overall training time is reduced by about 80% (from 10 days and 19 hours to 1 day and 16 hours).

3.2.3 Data pull IO bottleneck analysis

Currently, the KubeAI platform provides three storage media for training scenarios:

  • Local disk: small space, best read and write performance, 500G~3T space available for a single disk.
  • NAS network storage: large space, poor read and write performance, moderate cost.
  • CPFS parallel file system storage: large space, good read and write performance, and high cost.

For small datasets, the data can be pulled to the local disk once at the start, and each epoch then reads it from the local disk. This avoids pulling data from the remote NAS repeatedly in every epoch; the whole training run only pulls from the NAS once. For large datasets, there are two options:

  • Resize the large dataset in advance and store the smaller images for training. This avoids resizing in every epoch, and the smaller images are also faster to read (a minimal offline-resize sketch follows this list).
  • Put the dataset on the CPFS parallel file system to improve training throughput. Experiments show that for image workloads CPFS delivers 3 to 6 times the read performance of NAS.
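A minimal one-off resize script along these lines might look as follows; the directory paths and target size are placeholders.

import os
from PIL import Image

SRC_DIR, DST_DIR, TARGET = "raw_images", "resized_images", (640, 640)
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    # Resize once offline so every training epoch reads the smaller image.
    with Image.open(os.path.join(SRC_DIR, name)) as img:
        img.convert("RGB").resize(TARGET).save(os.path.join(DST_DIR, name))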

3.3 Training Model Optimization

Once the data pipeline is optimized, the main time cost of training is the GPU computation itself. There are some relatively mature methods in the industry that can be applied here; we summarize them below.

3.3.1 Mixed Precision Training (AMP)

PyTorch mixed-precision training, and how to enable it, are described in detail in the official PyTorch documentation. Many current CV training frameworks already support AMP training, for example:

  • The AMP option in the MMCV framework enables mixed-precision training.
  • PyTorch Vision (torchvision) also provides related parameters to enable AMP training.

Note that during mixed-precision training not all model parameters are converted to FP16; only part of the computation is. Mixed precision speeds up training because most NVIDIA GPU models have roughly twice the floating-point throughput in FP16 as in FP32, and mixed-precision training also uses less GPU memory.


import torch
from torch.cuda.amp import GradScaler

# model, optimizer, loss_fn, data, epochs, and max_norm are assumed to be defined
# as in a normal training script; GradScaler performs dynamic loss scaling.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with torch.autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)
        # Since the gradients of optimizer's assigned params are unscaled, clip as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)
        # Updates the scale for next iteration.
        scaler.update()

3.3.2 Data Parallel Training with Multiple Cards on a Single Machine

PyTorch natively supports multi-card data-parallel training; see the official documentation for how to enable it. During multi-card training, the backward pass on each card is followed by an all-reduce collective communication across the cards to average the gradients.
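A minimal single-machine multi-card DDP sketch is shown below, launched for example with: torchrun --nproc_per_node=2 train_ddp.py. The toy model and random data are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(100):
        inputs = torch.randn(64, 128).cuda()
        labels = torch.randint(0, 10, (64,)).cuda()
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()  # gradients are all-reduced (averaged) across the cards here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()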

3.4 Self-developed training engine framework kubeai-training-framework

As the analysis above shows, although PyTorch itself is well designed, with rich training modes and parameter options, differences in models and training data, as well as differences in the experience of model developers, mean that the framework's strengths are not always fully exploited in actual model research and production.

Based on the preceding analysis and practice, the KubeAI platform developed the ***kubeai-training-framework*** training engine framework to help model developers tune training script parameters and quickly adopt suitable training methods. kubeai-training-framework includes PyTorch Dataloader optimization, GPU training (AMP) acceleration, and various utility functions. Taking the Dataloader as an example, it can be used as follows:


from kubeai_training_framework.dataloader import Dataloader

def train(train_loader, model, criterion, optimizer, epoch):
    train_dataset = .......
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True)
    model.train()

    my_train_loader = Dataloader(train_loader)
    input, target = my_train_loader.next()
    while input is not None:
        ## model training code using input, target
        ..........
        input, target = my_train_loader.next()

4. AI Pipeline engine facilitates rapid iteration of AI business

Generally, the development of the model can be summarized as the process shown in the following figure:

[Figure: general model development and iteration process]

As shown above, after the demand scenario is determined and the first model version goes live, the model must be iterated repeatedly to achieve better business results. During this iterative construction, the KubeAI platform gradually launched relatively independent functional modules such as Notebook, model management, training task management, and inference service management. As business requirements kept changing, the efficiency of model iteration came to directly affect how quickly the business could go live, so the KubeAI platform built the AI Pipeline capability, focusing on the periodic iteration needs of AI scenarios and on improving production efficiency.

AI Pipeline is built on Argo Workflows and supports scheduled runs for model iteration, inference task management, and data processing, as well as different task trigger modes, general template tasks, and starting from a specified node. Before AI Pipeline went live, an iterative job might be configured as several scattered tasks, which made maintenance heavy and debugging cycles long. The figure below shows the tasks that previously had to be configured separately to complete one such job:

[Figure: scattered tasks configured separately before AI Pipeline]

With AI Pipeline, the entire workflow can be designed as shown below:

[Figure: the same workflow orchestrated as an AI Pipeline]

Pipeline orchestration reduces the time model developers waste on repetitive work and leaves more time for model research; at the same time, properly arranged tasks make fuller use of limited resources.

5. Outlook

Starting from the actual needs of Dewu's AI business scenarios, the KubeAI platform focuses on solving the training and inference performance problems in the AI model development process, as well as the efficiency problems of model version iteration, with the three core engines as its construction goals.

On inference service performance, we will continue to explore model quantization, operator optimization, and graph optimization, starting from kubeai-inference-framework. On model training, we will keep working on image data preprocessing, TensorFlow GPU training framework support, and NLP model training support, using the kubeai-training-framework training engine as the entry point to provide model developers with more efficient, higher-performance training frameworks. On the AI Pipeline engine, we will support richer preset models to meet the needs of general data processing and inference tasks.

Text: Weidong

 

This article is an original work of Dewu Technology. Source: Dewu Technology official website.

Reprinting without the permission of Dewu Technology is strictly prohibited, and legal liability will be pursued according to law. Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
