Meituan Vision GPU Inference Service Deployment Architecture Optimization Practice

Facing the challenge of ever-growing GPU resource consumption by online inference services alongside generally low GPU utilization, the Meituan Vision R&D team decided to optimize by splitting model structures into microservices, and proposed a general and efficient deployment architecture to solve this common performance bottleneck.

Taking the "image detection + classification" service as an example, the GPU utilization rate of the optimized service pressure test performance index has increased from 40% to 100%, and the QPS has also increased by more than 3 times. This article will focus on the engineering practice of inference service deployment architecture optimization, hoping to help or inspire you.

  • 0. Introduction

  • 1. Background

  • 2. Features and challenges of visual model services

    • 2.1 Model Optimization Tools and Deployment Framework

    • 2.2 Visual Model Features

    • 2.3 Problems and challenges faced by visual inference services

  • 3. GPU service optimization practice

    • 3.1 Image classification model service optimization

    • 3.2 Image "detection + classification" model service optimization

  • 4. General and efficient inference service deployment architecture

  • 5. Summary and Outlook

0. Introduction 

Meituan Vision serves local life services and has applied visual AI technologies such as text recognition, image quality assessment, and video understanding in many scenarios. The GPU resources consumed by online inference services kept growing, yet GPU utilization across services was generally low, wasting a lot of computing resources and driving up the cost of visual AI applications. This is an urgent problem that Meituan, like many other companies, needs to solve.

Through experimental analysis, the Visual Intelligence Department of Meituan found that an important cause of the low GPU utilization of visual inference services lies in the model structure: the preprocessing or postprocessing parts of the model run slowly on the CPU, so the backbone network cannot fully exploit the GPU's computing performance. Based on this, the vision R&D team proposed a general and efficient deployment architecture that solves this common performance bottleneck through model structure splitting and microservices.

At present, this solution has been successfully applied to multiple core services. Taking the "image detection + classification" service as an example, the GPU utilization measured in stress tests increased from 40% to 100% after optimization, and QPS increased by more than 3 times. This article focuses on the engineering practice of optimizing the inference service deployment architecture, and we hope it helps or inspires colleagues engaged in related work.

 1. Background 

As more and more AI applications enter the production stage, the GPU resources required by inference services are growing rapidly. Survey data show that inference services already account for more than 55% of AI-related resource usage in China, and this proportion will keep rising. However, many companies face the same practical problem: the GPU utilization of online inference services is generally low, leaving plenty of room for improvement.

One important reason for low service GPU utilization is that the inference service itself has performance bottlenecks, so GPU resources cannot be fully used even under extreme load. In this context, "optimizing inference service performance, improving GPU usage efficiency, and reducing resource cost" is of great significance. This article mainly introduces how to improve inference service performance and GPU utilization through deployment architecture optimization while keeping accuracy, inference latency, and other indicators intact.

 2. Features and challenges of visual model services 

In recent years, deep learning has made remarkable progress in computer vision tasks and has become the mainstream approach. Vision models have some structural particularities, and if they are deployed directly with an existing inference framework, the service may fail to meet requirements for performance and GPU utilization.

2.1 Model Optimization Tools and Deployment Framework

Deep learning models are usually optimized with dedicated tools before deployment. Common optimization tools include TensorRT, TF-TRT, TVM, and OpenVINO; they speed up model execution through operator fusion, dynamic memory allocation, precision calibration, and similar methods. Model deployment is the last step before production: it wraps the model inference process in a service that internally handles model loading, model version management, batching, and interface encapsulation, and exposes an RPC/HTTP interface externally. The mainstream deployment frameworks in the industry are as follows:

  • TensorFlow Serving: TensorFlow Serving (TF-Serving for short) is a high-performance open-source framework released by Google for machine learning model deployment. It integrates the TF-TRT optimization tool internally, but it is not friendly to models in non-TensorFlow formats.

  • TorchServe: TorchServe is a PyTorch model deployment and inference framework jointly launched by AWS and Facebook. It has the advantages of simple deployment, high performance, and light weight.

  • Triton: Triton is a high-performance inference serving framework released by NVIDIA. It supports models from multiple frameworks such as TensorFlow, TensorRT, PyTorch, and ONNX, and is well suited to multi-model joint inference scenarios.

In actual deployment, whichever framework is chosen, multiple factors such as model format, optimization tools, and framework features need to be considered together.
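As a concrete illustration of the model-optimization step mentioned above, the following is a minimal sketch of converting a TensorFlow SavedModel with TF-TRT; the input and output paths are illustrative assumptions, and API details may vary slightly across TensorFlow versions.

```python
# Minimal TF-TRT conversion sketch; the SavedModel paths are hypothetical.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(input_saved_model_dir="resnet50_savedmodel")
converter.convert()                        # fuses supported ops into TensorRT engines
converter.save("resnet50_trt_savedmodel")  # unsupported ops (e.g. preprocessing) stay as-is
```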

2.2 Visual Model Features

Unlike traditional methods, deep learning is end to end: there is no need to design modules such as feature extraction and classifiers separately, and a single model replaces the multi-module pipeline of traditional methods. Deep learning models have shown great advantages in visual tasks such as classification, detection, segmentation, and recognition, reaching accuracy that traditional methods cannot. Commonly used visual classification models (such as ResNet and GoogLeNet) and detection models (such as YOLO and R-FCN) share the following characteristics:

  • Many network layers (well suited to GPU computing): Taking ResNet-50 as an example, the network contains 49 convolutional layers and 1 fully connected layer, with about 25 million parameters and 3.8 billion FLOPs (floating-point operations). Model inference involves a large amount of matrix computation, which is well suited to GPU parallel acceleration.

  • Fixed input image size (preprocessing required): Again taking ResNet-50 as an example, the network input is a floating-point image tensor with a fixed size of 224x224. Therefore, before a binary-encoded image is fed into the network, preprocessing operations such as decoding, scaling, and cropping are required.
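A minimal sketch of such a preprocessing step is shown below, assuming TensorFlow. The 256-pixel intermediate resize and the simple 0-1 normalization are illustrative assumptions; the actual model may scale and normalize differently.

```python
import tensorflow as tf

def preprocess(image_bytes: tf.Tensor) -> tf.Tensor:
    """Turn a binary-encoded image into the fixed-size tensor the backbone expects."""
    img = tf.io.decode_image(image_bytes, channels=3, expand_animations=False)  # decode (CPU op)
    img = tf.image.resize(img, [256, 256])                 # scale
    img = tf.image.resize_with_crop_or_pad(img, 224, 224)  # center crop to 224x224
    img = tf.cast(img, tf.float32) / 255.0                 # to float, simple normalization
    return tf.transpose(img, [2, 0, 1])                    # HWC -> CHW, i.e. 3x224x224
```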

2.3 Problems and challenges faced by visual inference services

Due to the above characteristics of vision models, their deployment and optimization face two problems:

  1. Incomplete model optimization: Tools such as TensorRT and TF-TRT mainly optimize the backbone network and ignore the preprocessing part, so the model as a whole is insufficiently optimized or cannot be optimized. For example, all network layers of the ResNet-50 classification model can be optimized, but the image decoding operation in preprocessing (usually a CPU op such as "tf.image.decode") is ignored and skipped by TF-TRT.

  2. Difficulty deploying multiple models: Visual services often chain multiple models in series to realize their functionality. For example, in a text recognition service, a detection model first locates the text, the image regions at those locations are cropped, and finally the crops are sent to a recognition model to obtain the text result. The models in one service may be trained with different frameworks: TF-Serving or TorchServe only supports a single model format and cannot meet such deployment requirements. Triton supports multiple model formats and can express the combination logic between models through custom Backends and Ensemble scheduling, but the implementation is relatively complex and the overall performance may still be problematic.

These two common deployment problems cause performance bottlenecks in visual inference services and low GPU utilization, and even a high-performance deployment framework such as Triton struggles to solve them.

General deployment frameworks focus on service-level performance concerns such as communication methods, batching, and multiple instances. However, if some part of the model itself (such as image preprocessing or postprocessing) is a bottleneck that optimization tools cannot handle, a "barrel effect" appears and the whole inference pipeline performs poorly. Therefore, how to optimize performance bottlenecks inside the model within an inference service remains important and challenging work.

 3. GPU service optimization practice 

Classification and detection are two of the most basic vision models, widely used in scenarios such as image moderation, image tagging, and face detection. The following uses two typical services as cases, one built on a single classification model and one on a "detection + classification" multi-model combination, to walk through the concrete performance optimization process.

3.1 Image classification model service optimization

Meituan has tens of millions of images to review every day in order to filter out risky content. Manual review is too costly, so image classification is used for automatic machine review. The commonly used classification model structure is shown in Figure 1: the preprocessing part mainly performs decoding, scaling, and cropping, and the backbone network is ResNet-50. The preprocessing part receives the image binary stream and generates the Nx3x224x224 tensor required by the backbone network (the dimensions are the number of images, the number of channels, image height, and image width), and the backbone network then predicts the image classification result.


Figure 1 Schematic diagram of image classification model structure

The actual structure of the model after TF-TRT optimization is shown in Figure 2 below: the backbone network ResNet is optimized into one Engine, but the operators in the preprocessing part are not supported for optimization, so the whole preprocessing part remains in its original state.


Figure 2 Image classification model TF-TRT optimization structure diagram

3.1.1 Performance bottleneck

After being optimized with TF-TRT, the model is deployed with the TF-Serving framework. GPU utilization under stress testing is only 42%, and QPS falls far short of NVIDIA's official figures. After ruling out possible factors such as TF-TRT and the TensorFlow framework, attention finally turned to the preprocessing part: the model NVIDIA uses for performance testing has no preprocessing and takes the Nx3x224x224 tensor as direct input, whereas the online service here includes preprocessing, and the CPU utilization observed in the stress test is relatively high. Checking the device on which each operator in the model runs shows that most preprocessing runs on the CPU while the backbone network runs on the GPU (see Figure 1).
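The sketch below shows one way such a device-placement check can be done for a SavedModel; the model path and signature name are illustrative assumptions, and the placement reported this way is only the graph-level hint.

```python
import tensorflow as tf

# Load the served model and walk its graph to see which device each op targets.
loaded = tf.saved_model.load("resnet50_savedmodel")      # hypothetical path
infer_fn = loaded.signatures["serving_default"]          # hypothetical signature name
for op in infer_fn.graph.get_operations():
    # Decode/resize-style preprocessing ops typically stay on the CPU,
    # while Conv2D/MatMul ops in the backbone are placed on the GPU.
    print(op.type, op.device or "<unspecified>")
```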

As shown in Figure 3 below, inspecting the CPU/GPU execution timeline with the NVIDIA Nsight Systems (nsys) tool reveals obvious gaps between GPU operations: the GPU has to wait for the CPU to prepare data and copy it over before running backbone inference, and the slow CPU processing leaves the GPU starved. Combined with the CPU/GPU utilization data from the service stress test, this shows that the high CPU consumption and slow speed of the preprocessing part are the performance bottleneck of the inference service.


Figure 3 Classification model nsys performance diagnosis diagram

3.1.2 Optimization method

As shown in Figure 4 below, several optimization methods were considered to address the high CPU cost of preprocessing that limits inference service performance:


Figure 4 Comparison of optimization methods

  1. Add CPUs: Increasing the number of CPUs on the machine is the simplest approach, but server hardware configurations are limited and 1 GPU is usually paired with only 8 CPUs. Adding CPUs can therefore only be used for performance comparison in tests and cannot be deployed in practice.

  2. Preprocess in advance: Decoding and scaling large images consumes a lot of CPU, whereas small images are processed much faster. The idea is therefore to preprocess the input image ahead of time and send the resulting small image to the service. It has been verified that the scaling and cropping operations in TensorFlow are composition-invariant, that is, applying them repeatedly gives the same final result as applying them once. The preprocessed small images are re-encoded before being sent to the original classification service; note that a lossless encoding (such as PNG) must be used, otherwise the decoded data will differ and cause prediction errors (see the sketch after this list). The advantage of this method is that the original model service needs no modification, it is easy to operate, and the prediction result is unchanged. The disadvantage is that preprocessing is performed twice, which increases end-to-end latency and wastes CPU resources.

  3. Separate preprocessing: Another idea is to split the preprocessing part of the model from the backbone network, deploying preprocessing on CPU machines and the backbone network on GPU machines. The CPU preprocessing service can then scale out horizontally without limit to keep the GPU supplied with data and make full use of GPU performance. More importantly, decoupling the CPU and GPU work reduces the time spent waiting on CPU-GPU data exchange, which is theoretically more efficient than adding CPUs. The only thing to consider is the communication cost between services: a cropped image is 224x224x3, about 147 KB as unsigned 8-bit data, so transmission time and bandwidth are not a problem below 10,000 QPS.
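As a minimal sketch of the pre-preprocessing idea in item 2 above (using Pillow purely for illustration; the helper name and the 256-pixel target are assumptions), the key point is the lossless re-encoding:

```python
from io import BytesIO
from PIL import Image

def pre_shrink(image_bytes: bytes, short_side: int = 256) -> bytes:
    """Shrink a large image ahead of time, then re-encode it losslessly so the
    downstream classification service decodes exactly the same pixels."""
    img = Image.open(BytesIO(image_bytes)).convert("RGB")
    w, h = img.size
    scale = short_side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))  # shrink before sending
    buf = BytesIO()
    img.save(buf, format="PNG")  # lossless; JPEG here would change pixels and hurt accuracy
    return buf.getvalue()
```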

3.1.3 Optimization results

As shown in Figure 5 below, we used NVIDIA Nsight Systems to compare how the model runs after each of the above optimizations. Both adding CPUs and preprocessing in advance shorten the CPU preprocessing time, reduce the GPU's wait for data, and improve GPU utilization. In comparison, separating preprocessing is the more thorough optimization: its CPU-to-GPU data copy time is the shortest and the GPU is fully utilized.


Figure 5 Comparison chart of optimization method nsys performance diagnosis

The stress test results of the online service after each optimization are shown in Figure 6 below (the CPU preprocessing service used by the pre-preprocessing and separate-preprocessing methods occupies an additional 16 CPUs). The machine's CPU is an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz and the GPU is a Tesla T4. The stress test results show that:


Figure 6 Performance comparison of optimization results

1. After increasing the service CPUs to 32 cores, QPS and GPU utilization (the GPU-Util metric reported by the nvidia-smi command) both more than double, with GPU utilization rising to 88%;

2. With the pre-preprocessing method, service QPS also more than doubles, slightly better than adding CPUs, but the GPU utilization shows there is still considerable room for optimization;

3. With the separate-preprocessing method, QPS increases by 2.7 times and GPU utilization reaches 98%, close to full load.

Adding CPUs cannot completely remove the service's performance bottleneck: although GPU utilization reaches 88%, CPU-GPU data transfer still takes a large share of the time and the QPS gain is limited. The pre-preprocessing method likewise does not fully resolve the preprocessing bottleneck, so its optimization is incomplete. In comparison, the separate-preprocessing method makes full use of the GPU's computing performance and achieves better results in both QPS and GPU utilization.

3.2 Image "detection + classification" model service optimization

In complex task scenarios (such as face detection and recognition, or image text recognition), functionality is usually achieved by combining multiple models for detection, segmentation, and classification. The model introduced in this section chains "detection + classification" in series. Its structure, shown in Figure 7 below, mainly includes the following parts:


Figure 7 Schematic diagram of the original model structure

  • Preprocessing: mainly image decoding, scaling, and padding, outputting an Nx3x512x512 image tensor; most operators run on the CPU.

  • Detection model: the detection network is YOLOv5, and its operators run on the GPU.

  • Detection postprocessing: the NMS (non-maximum suppression) algorithm removes duplicate or false detection boxes to obtain valid boxes, then the target region sub-images are cropped and scaled, outputting an Nx3x224x224 image tensor; most postprocessing operators run on the CPU.

  • Classification model: the classification network is ResNet-50, and its operators run on the GPU.

The detection and classification sub-models are trained separately and merged into a single model for inference. The deployment framework is TF-Serving, and the optimization tool is TF-TRT.

3.2.1 Performance bottleneck

Most of the preprocessing and detection postprocessing in the model are CPU operations, while the backbone networks of the detection and classification models are GPU operations, so a single inference pass requires multiple CPU-GPU data exchanges. As in the previous case, slow CPU processing leads to low GPU utilization and a performance bottleneck in the inference service.

Stress testing of the actual online service shows a GPU utilization of 68%, and QPS also has large room for improvement. The service configuration is 1 GPU + 8 CPUs (the machine's CPU is an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz and the GPU is a Tesla T4).

3.2.2 Optimization method

The same model-splitting idea is applied here, deploying the CPU and GPU computing parts as separate microservices. As shown in Figure 8 below, we split the original model into four parts: preprocessing and detection postprocessing are deployed as CPU microservices (which can scale dynamically with traffic), while the detection model and the classification model are deployed as GPU microservices (on the same GPU card). To keep the calling method simple, a scheduling service on top chains the microservices together, provides a unified calling interface, and hides the calling differences from upstream callers (a sketch of the call chain follows Figure 8). After splitting, the detection and classification models are optimized with TensorRT and deployed with Triton.


Figure 8 Schematic diagram of model optimization deployment structure
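The sketch below illustrates the call chain implemented by the scheduling service; the four client stubs are hypothetical placeholders for the actual RPC clients (e.g. Thrift for the CPU microservices and Triton clients for the GPU microservices), and the tensor shapes follow the model description above.

```python
import numpy as np

def detect_and_classify(image_bytes: bytes,
                        preprocess_cpu, detect_gpu, postprocess_cpu, classify_gpu):
    """Chain the four microservices: the CPU pre/post-processing services scale out
    with traffic, while the two GPU model services stay fully loaded."""
    batch = preprocess_cpu(image_bytes)             # decode/scale/pad -> Nx3x512x512
    raw_dets = detect_gpu(batch)                    # YOLOv5 backbone (TensorRT, served by Triton)
    crops = postprocess_cpu(raw_dets, image_bytes)  # NMS + crop/scale -> Mx3x224x224
    if crops.shape[0] == 0:                         # no valid detection boxes
        return np.empty((0,))
    return classify_gpu(crops)                      # ResNet-50 backbone -> class scores per crop
```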

3.2.3 Optimization results

As shown in Figure 9 below, in addition to the original service and the microservice split, the performance of adding CPUs and of Triton Ensemble is also compared. The Triton Ensemble method deploys all sub-models (including pre- and postprocessing) on one machine to realize a multi-model scheduling pipeline. The comparison shows:


Figure 9 Performance comparison of optimization results

  • After increasing the original service's CPUs to 32 cores, GPU utilization rises to 90%, but QPS increases by only 36%;

  • The performance of the Triton Ensemble method differs little from that of the original TF-Serving service;

  • After optimization with the model-splitting microservice approach, QPS increases by 3.6 times and GPU utilization reaches 100%, i.e., full load.

Adding CPUs greatly improves GPU utilization but brings only a modest QPS gain: although CPU pre- and postprocessing time is shortened, CPU-GPU data transfer still takes a large share of the overall inference time, and GPU computation accounts for relatively little of it.

Deploying the split models with Triton Ensemble to implement a multi-model scheduling pipeline is no more efficient than before the split: because the whole pipeline runs on one machine, the mutual interference between CPU and GPU is not eliminated and GPU throughput is still constrained by the CPU. The model-splitting microservice approach instead implements the pipeline at the service level; its CPU pre- and postprocessing microservices can add resources dynamically according to traffic to meet the throughput requirements of the GPU model microservices, decoupling GPU and CPU processing and preventing the CPU from becoming the bottleneck.

In short, the microservice-splitting approach makes full use of the GPU's computing performance and achieves better results in both QPS and GPU utilization.

 4. General and efficient inference service deployment architecture 

Besides the two representative cases above, many other visual services have adopted this optimization approach. Its core idea is: for inference services where CPU computation is the bottleneck, split the CPU and GPU computing parts of the model and deploy them as separate CPU/GPU microservices.

Based on this idea, we propose a general and efficient inference service deployment architecture. As shown in Figure 10 below, the bottom layer uses a common deployment framework (TF-Serving, Triton, etc.); the CPU computing parts of the model, such as pre- and postprocessing, are split out and deployed as CPU microservices, while the backbone network is deployed as a GPU service; a scheduling service on top chains the model microservices together to implement the overall functional logic (a sketch of a call to the GPU microservice follows Figure 10).


Figure 10 Schematic diagram of general service deployment architecture
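For the GPU backbone microservice in this architecture, a call from the scheduling layer might look like the sketch below, which uses the Triton HTTP client; the server URL, model name, and input/output tensor names are assumptions that depend on the actual model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # hypothetical Triton endpoint

# The CPU preprocessing microservice has already produced the Nx3x224x224 tensor.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)        # placeholder input

inp = httpclient.InferInput("input", list(batch.shape), "FP32")  # tensor name is an assumption
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("output")                  # tensor name is an assumption

resp = client.infer(model_name="resnet50_trt", inputs=[inp], outputs=[out])
scores = resp.as_numpy("output")                                 # per-class scores
```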

An important question remains: why is this deployment architecture efficient?

First, at the macro level, the model is split and deployed as microservices, and the sub-model pipeline is driven by the scheduling service. The split CPU microservices can scale dynamically according to traffic and processing capacity, so insufficient CPU compute for model pre- or postprocessing no longer becomes the performance bottleneck, and the throughput requirements of the GPU model microservices are met.

Second, at the micro level, if a model contains several interleaved CPU/GPU computing parts, there will be gaps between GPU operations. As shown in Figure 11 below, within a single inference pass the GPU must wait between its two operations for the CPU postprocessing to finish, and CPU preprocessing also delays GPU work. After the model is split, the pre- and postprocessing parts are deployed independently as CPU services, so the inference flow inside the GPU service contains only the two sub-model computations. These sub-models are independent of each other with no data dependency, and the CPU-GPU data transfer can overlap with GPU computation, so in theory the GPU can reach 100% operating efficiency.


Figure 11 Schematic diagram of reasoning process comparison

In addition, in terms of latency, the microservice-splitting deployment adds the overhead of RPC communication and data copying. In practice, however, this accounts for only a small share of the total time and has no significant impact on end-to-end latency. For example, for the classification model in Section 3.1 above, the average service latency was 42ms before optimization and 45ms after optimization in microservice form (the RPC protocol is Thrift). This level of latency increase is usually imperceptible in most vision service use cases.

 5. Summary and Outlook 

This article has used two typical visual inference services as examples to describe our optimization practice for model performance bottlenecks and low GPU utilization. After optimization, service inference performance improves by about 3 times and GPU utilization approaches 100%. Based on this practice, the article proposes a general and efficient inference service deployment architecture; it is not limited to visual model services, and GPU services in other fields can also use it for reference.

Of course, this optimization scheme still has shortcomings. For example, how to split the model currently depends on manual experience or experimental testing, and the optimization process is not yet automated or standardized. In the future, we plan to build model performance analysis tools that automatically diagnose model bottlenecks and support automatic, end-to-end splitting and optimization.

 6. Authors of this article 

| Zhang Xu, Zhao Zheng, An Qing, Lin Yuan, Zhiliang, Chu Yi, etc. are from the Visual Intelligence Department of the Basic R&D Platform. 

| Xiaopeng, Yunfei, Songshan, Shuhao, Wang Xin, etc. are from the Data Science and Platform Department of the Basic R&D Platform.

----------  END  ----------
