Demystifying Alibaba Cloud's large-scale deep learning performance optimization practices


Author | You Liang, head of AI acceleration for heterogeneous computing in Alibaba Cloud

Recently, Stanford University released the latest DAWNBench deep learning rankings. DAWNBench is one of the most authoritative benchmark competitions in the field of artificial intelligence, and it serves as a standard for measuring end-to-end solutions that span deep learning optimization strategies, model architectures, software frameworks, clouds, and hardware.

In the Image Classification on ImageNet rankings, Alibaba Cloud took first place in training time, training cost, inference latency, and inference cost.

The official DAWNBench results show that Alibaba Cloud's heterogeneous computing service trains on the 1.28 million images of ImageNet in only 2 minutes and 38 seconds, and that the AI inference service based on the Hanguang 800 chip recognizes an image in only 0.0739 ms. At the same time, Alibaba Cloud also set world records for training cost and inference cost.

These four records are owed to Alibaba Cloud's self-developed acceleration framework AIACC and the Hanguang 800 chip from Pingtouge (T-Head).

AIACC is the Feitian AI acceleration engine independently developed by Alibaba Cloud. It is the first to provide unified acceleration for mainstream deep learning frameworks such as Tensorflow, PyTorch, MXNET, and Caffe. On the same hardware platform, AIACC can significantly improve the performance of AI training and inference.

As the head of AIACC R&D, I will share Alibaba Cloud's large-scale deep learning application architecture and the performance optimization practices built on AIACC.

Large-scale distributed training is the future trend

As deep learning models become more complex, the demand for computing power keeps growing. Training ResNet-50 on the 1.28 million images of ImageNet for 90 epochs reaches about 75% Top-1 accuracy, which takes roughly 5 days on a single P100 GPU. To shorten the training time of this benchmark, the major players have each built their own large-scale distributed training systems.

In July 2017, Facebook used its Big Basin servers, each with 8 P100s connected via NVLink; with 32 such servers (256 P100s in total), Facebook completed the training in 1 hour.

In September 2017, UC Berkeley published a paper that used 2048 KNL (Knights Landing) processors to shorten the training record to 20 minutes.

In November 2017, Japan's Preferred Networks used 1024 P100 GPUs to shorten the training record to 15 minutes.

In November 2017, Google also published a paper in which 256 TPUv2 chips completed the training in 30 minutes.

On November 13, 2018, Sony used 2176 V100s to shorten the training record to 3 minutes and 44 seconds ...

These companies are competing for the strategic high ground of distributed training, and it is clear that large-scale distributed training will be a major trend in the industry's technical development.

Deep learning application infrastructure

Let's first take a look at the basic architecture of deep learning applications. The architecture diagram below divides a deep learning application into four layers.

[Figure: the four-layer architecture of deep learning applications]

The first layer from the bottom is the resource layer, which includes computing, storage, and network resources. Deep learning applications need GPU servers for training and inference, storage resources for training code and data, and network resources for communication between machines.

The second layer is the scheduling layer, which includes container-based scheduling and physical-machine-based scheduling. It schedules the computing, storage, and network resources below it, and dispatches the computing tasks of the framework and application layers above it.

The third layer is the framework layer; the mainstream deep learning computing frameworks today include Tensorflow, PyTorch, MXNET, Caffe, and others.

The fourth layer is the application layer, which includes image recognition, face recognition, object detection, video recognition, CTR estimation, natural language understanding, speech recognition, and other deep learning applications. Applications use the computing frameworks in the framework layer to describe and compute their own models and algorithms.

Architecture and challenges of large-scale deep learning applications

Let's first look at why we need to do large-scale deep learning.

For example, training a model on a single GPU card might take 7 days, which means that after adjusting the model parameters you will not know whether the adjustment was right until 7 days later. With an 8-card GPU server, the training finishes in less than a day, and with 10 such 8-card servers a model can be trained in two or three hours, so you can quickly find out whether the parameters are correct.

From this example, large-scale deep learning has four advantages: first, large-scale distributed training reduces training time; second, it accelerates research on deep learning algorithms; third, large-scale distributed inference improves the concurrency and reliability of deep learning applications; fourth, it ultimately improves the competitiveness and market share of the user's products. Large-scale deep learning is therefore the commanding height for improving productivity.


The basic computing model of large-scale distributed training can be roughly divided into two categories: distributed training in parameter-server (PS) mode, and distributed training in peer-to-peer mode.

[Figure: PS-mode vs. peer-to-peer distributed training]

As shown in the figure, PS-mode training has a parameter server and many workers. The parameter server stores the global model, and each worker holds a local copy of the model. When distributed training starts, each worker reads its own training data and updates its local model to obtain local gradients. Each worker then uploads its local gradients to the parameter server, which aggregates them into global gradients and uses them to update the global model; the updated global model is then pushed back to refresh the local models of all workers, and the next training iteration begins.

In peer-to-peer mode, there is no parameter server. Each worker reads its own training data and updates its local model to obtain local gradients. The local gradients of all workers are then combined with a global all-reduce operation so that every worker obtains the global gradients, uses them to update its local model, and proceeds to the next training iteration.
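To make the two modes concrete, here is a minimal single-process NumPy sketch of one training iteration in each mode (my own illustration with simulated workers, not code from any of the frameworks above):

```python
import numpy as np

NUM_WORKERS, DIM, LR = 4, 8, 0.1
rng = np.random.default_rng(0)
global_model = np.zeros(DIM)
local_models = [global_model.copy() for _ in range(NUM_WORKERS)]

def local_gradient(model):
    # Stand-in for a real backward pass on one worker's shard of the training data.
    return model - rng.normal(size=model.shape)

# PS mode: workers push local gradients to a central parameter server.
grads = [local_gradient(m) for m in local_models]
global_grad = np.mean(grads, axis=0)            # aggregation on the parameter server
global_model -= LR * global_grad                # the server updates the global model
local_models = [global_model.copy() for _ in range(NUM_WORKERS)]   # workers pull it back

# Peer-to-peer mode: workers all-reduce their gradients among themselves.
grads = [local_gradient(m) for m in local_models]
allreduced = np.mean(grads, axis=0)             # every worker ends up with this value
for w in range(NUM_WORKERS):
    local_models[w] -= LR * allreduced          # each worker updates its own copy
```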

PS mode is better suited to models with so many parameters that they cannot fit on a single GPU, since the parameters can be sharded across the parameter servers. Its drawback is centralized communication: as the scale grows, communication efficiency keeps dropping.

Peer-to-peer mode is better suited to models that fit on a single GPU. It is decentralized, and its communication complexity can be reduced with the Ring-Allreduce ring communication algorithm. Different computing frameworks support different modes: Tensorflow supports both PS mode and peer-to-peer mode, PyTorch mainly supports peer-to-peer mode, and MXNET mainly supports PS mode via KVStore and PS-Lite.
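The Ring-Allreduce algorithm mentioned above can be pictured with a small single-process NumPy simulation of the textbook algorithm (a sketch, not AIACC's implementation): a reduce-scatter phase followed by an all-gather phase, where each worker exchanges only one chunk per step with its ring neighbours, so per-node traffic stays roughly constant as the number of workers grows.

```python
import numpy as np

N = 4                                                 # number of workers in the ring
grads = [np.arange(8, dtype=float) * (w + 1) for w in range(N)]
chunks = [list(np.array_split(g.copy(), N)) for g in grads]    # N chunks per worker

# Reduce-scatter: in step t, worker w sends chunk (w - t) % N to its right neighbour
# and adds the chunk received from its left neighbour to its own copy.
for t in range(N - 1):
    sends = [chunks[w][(w - t) % N].copy() for w in range(N)]  # snapshot "on the wire"
    for w in range(N):
        idx = (w - t - 1) % N
        chunks[w][idx] = chunks[w][idx] + sends[(w - 1) % N]

# All-gather: each worker now owns one fully reduced chunk and circulates it around.
for t in range(N - 1):
    sends = [chunks[w][(w + 1 - t) % N].copy() for w in range(N)]
    for w in range(N):
        chunks[w][(w - t) % N] = sends[(w - 1) % N]

# Every worker ends up with the full sum of all gradients.
assert all(np.allclose(np.concatenate(chunks[w]), sum(grads)) for w in range(N))
```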

The divergent distributed modes of the different frameworks are major obstacles for users when writing distributed code, optimizing distributed performance, and doing distributed scheduling.

Large-scale deep learning applications also place new demands on each layer of the infrastructure.


At the resource layer, distributed training needs a large-scale GPU server cluster, a large-capacity parallel file system for large-scale file storage and parallel file access, and a large-scale TCP or RDMA network.

At the scheduling layer, large-scale GPU cluster scheduling and task scheduling are required.

At the framework layer, the distributed computing modes of the different deep learning frameworks must be understood and scheduled.

At the application layer, the training data or the training model itself must be split across different workers.

Therefore, large-scale deep learning applications face many difficulties and challenges.

For the resource layer:

First, building large-scale GPU clusters is very difficult, including solving the problems of large-scale GPU machine rooms, racks, and power supplies, as well as their stability and cost.

Second, building a large-scale parallel file system requires not only large capacity but also very high stability and reliability.

In addition, building a large-scale, high-bandwidth TCP or RDMA network is very difficult: it requires planning and implementing the switch and node topology, the north-south traffic oversubscription ratio, and the network protocols and IP addressing at scale, while guaranteeing the reliability and performance of the network.

For the scheduling layer: whether scheduling containers or physical machines, you need mixed CPU/GPU scheduling, GPU memory-sharing scheduling, and distributed scheduling tailored to the different deep learning frameworks.

For the framework layer: the mainstream deep learning frameworks each have their own distributed computing modes, which require different distributed implementations at the application layer, different distributed scheduling at the scheduling layer, and framework-specific distributed performance optimization.

Every one of these tasks needs technical experts, architects, and engineers with very deep professional knowledge to design, deploy, and implement.

Cloud-based large-scale deep learning application architecture

How can these advanced technologies be made accessible to everyone? Fortunately, there is cloud computing.

[Figure: cloud-based architecture for large-scale deep learning applications]

Taking Alibaba Cloud as an example, the figure shows the cloud-based architecture for large-scale deep learning applications. At the resource layer, the need for large-scale GPU server clusters is met by creating GPU cloud servers directly; the need for a large-capacity parallel file system is met by creating the parallel file system CPFS; and for large-scale TCP and RDMA networks, Alibaba Cloud's network resources can be used directly.

At the scheduling layer, large-scale GPU cluster scheduling needs can be met directly with container scheduling via ACK or virtual machine scheduling via E-HPC. Today's cloud products can essentially solve the large-scale resource and scheduling problems without requiring deep specialist knowledge.

Cloud computing has natural advantages for large-scale deep learning: ease of use, elasticity, stability, and cost.

The first is ease of use. When a large amount of GPU computing, storage, and network resources is needed urgently, Alibaba Cloud can provision large-scale GPU computing, storage, and network resources on demand in under ten minutes. Purchasing and deploying the same hardware yourself requires experts in computing, storage, and networking, and the lead time is measured in months.

The second is elasticity. When business peaks arrive, you can scale out more infrastructure resources to handle the new load, and release the excess resources once the peak passes, achieving the best ratio between business and infrastructure cost.

The third is stability. The stability of the computing, storage, and network services provided by Alibaba Cloud far exceeds that of self-managed physical resources: the reliability of Alibaba Cloud's computing and network services is 99.95%, while the reliability of its storage services is 99.9999999999%.

The fourth is cost. Because cloud computing operates at scale, it benefits from centralized hardware procurement and from centralized management, operation, and maintenance of that hardware. On the business side, elastic scaling keeps the ratio of business to infrastructure cost optimal.

For large-scale deep learning applications, beyond these advantages of cloud computing, our team has made two cloud-based architecture upgrades: the first is the Feitian AI acceleration engine AIACC, and the second is FastGPU for one-click cluster construction.


First, the Feitian AI acceleration engine AIACC. As mentioned earlier, the divergent distributed modes of the different frameworks impose high learning costs on users who write distributed code, optimize distributed performance, or do distributed scheduling. AIACC solves the problems of unified performance acceleration and unified scheduling for large-scale deep learning at the framework layer. It is the industry's first performance acceleration engine that provides unified acceleration for mainstream open-source frameworks such as Tensorflow, PyTorch, MXNET, and Caffe, and it has four major advantages.

The first advantage is unified acceleration.

As mentioned earlier, the different distributed modes of the various frameworks greatly hinder unified scheduling and a unified distributed application layer.

AIACC provides a unified distributed mode and unified distributed performance acceleration, so the scheduling layer can do unified distributed scheduling and the application layer can do unified distributed computing. The underlying distributed communication only has to be optimized once, and every framework enjoys the performance improvement.

The second advantage is deep optimization of network communication and GPU acceleration, which is discussed in detail in the next section.

The third advantage is to combine the cloud with elastic scaling to optimize the cost of user services.

The fourth advantage is open-source compatibility: most code written against the open-source deep learning frameworks needs no modification, and simply using the AIACC library yields a leap in performance.

AIACC's performance optimization centers on communication. As discussed above, distributed training requires exchanging gradient data between GPUs and between machines, and this communication has to be very efficient.

Our distributed communication optimization covers three aspects.

The first aspect is overlapping communication with computation. Gradients are communicated asynchronously, in parallel with the ongoing computation, so that the communication time is hidden behind the computation.
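The idea can be illustrated with a toy Python sketch, in which a thread pool stands in for asynchronous all-reduce (an illustration of the general pattern, not AIACC's code): gradients of layers that have already finished their backward pass are reduced while the backward pass continues for the remaining layers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer):
    time.sleep(0.01)                 # stand-in for computing one layer's gradient
    return f"grad[{layer}]"

def allreduce(grad):
    time.sleep(0.02)                 # stand-in for the network transfer
    return grad

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = []
    for layer in reversed(range(8)):                  # backward pass, last layer first
        grad = backward_layer(layer)
        pending.append(pool.submit(allreduce, grad))  # communicate asynchronously
    reduced = [f.result() for f in pending]           # wait only at the end of the pass

print(len(reduced), "gradients reduced while computation was still running")
```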

The second aspect is latency optimization. Before communicating gradients, the workers must negotiate: every machine needs to know whether the gradients on each GPU are ready before communication starts. The traditional approach negotiates through a centralized node, so latency grows sharply as the scale increases. Our optimization negotiates gradients in a decentralized way, which is more efficient and keeps latency from growing at large scale.
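As a toy illustration of decentralized negotiation (an assumption about the general idea, not AIACC's actual protocol), each worker can keep a readiness bitmap over the model's gradient tensors, and the workers can AND-reduce the bitmaps among themselves; whatever is ready everywhere can be communicated immediately, with no central coordinator.

```python
import numpy as np

NUM_WORKERS, NUM_TENSORS = 4, 6
rng = np.random.default_rng(1)
# ready[w][i] is True once worker w has finished computing gradient tensor i.
ready = rng.random((NUM_WORKERS, NUM_TENSORS)) > 0.3

# Decentralized negotiation: a logical-AND all-reduce over the bitmaps
# (in practice this runs over the same ring as the gradient all-reduce itself).
ready_everywhere = np.logical_and.reduce(ready, axis=0)
print("tensors ready on every worker:", np.flatnonzero(ready_everywhere))
```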

The third aspect is bandwidth optimization, for which there are five methods:

The first method is topology-aware hierarchical communication. The bandwidth between GPUs within a machine is very high, while cross-machine GPU bandwidth is much lower, so we communicate hierarchically: first among the GPUs inside each machine, then across machines.

The second method is mixed-precision transmission. Gradients are originally float32, and computation stays in float32, but for transmission we convert gradients to float16, which directly halves the amount of data on the wire. Scaling is applied so that accuracy does not degrade.
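A minimal sketch of the idea, assuming a simple constant scale factor (the real system may choose and apply the scale differently): compute in float32, ship float16 over the network, and scale so that small gradient values do not underflow.

```python
import numpy as np

def compress(grad_fp32, scale=1024.0):
    # Scale up, then cast: 2 bytes per element go on the wire instead of 4.
    return (grad_fp32 * scale).astype(np.float16), scale

def decompress(grad_fp16, scale):
    # Back to float32 before the optimizer applies the update.
    return grad_fp16.astype(np.float32) / scale

grad = (np.random.randn(1024) * 1e-4).astype(np.float32)
wire, scale = compress(grad)
restored = decompress(wire, scale)
print("bytes on the wire:", wire.nbytes, " max error:", np.abs(grad - restored).max())
```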

The third method is fused multi-gradient communication. A model's distributed training has to communicate the gradients of many layers; if communication is triggered as soon as each gradient is computed, many of the packets are tiny and bandwidth utilization is very low. We therefore fuse a batch of gradients and communicate them across machines together, which keeps bandwidth utilization high.
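Here is a sketch of gradient fusion under a simple size-threshold policy (my own illustration; AIACC's actual fusion granularity is tuned dynamically, as described below): per-layer gradients are packed into a buffer of roughly a fixed size and reduced together.

```python
import numpy as np

FUSION_BYTES = 4 * 1024 * 1024            # hypothetical 4 MB fusion granularity

def fused_allreduce(grads, allreduce):
    """Pack per-layer gradients into ~FUSION_BYTES buckets and reduce each bucket once."""
    out, bucket, bucket_bytes = [], [], 0

    def flush():
        nonlocal bucket, bucket_bytes
        if not bucket:
            return
        flat = allreduce(np.concatenate([g.ravel() for g in bucket]))  # one big transfer
        offset = 0
        for g in bucket:                                 # unpack back to per-layer shapes
            out.append(flat[offset:offset + g.size].reshape(g.shape))
            offset += g.size
        bucket, bucket_bytes = [], 0

    for g in grads:                                      # gradients arrive in backward order
        bucket.append(g)
        bucket_bytes += g.nbytes
        if bucket_bytes >= FUSION_BYTES:
            flush()
    flush()                                              # last, possibly smaller bucket
    return out

# Usage with a fake 2-worker "all-reduce" that simply doubles the values.
grads = [np.ones((256, 256), dtype=np.float32) for _ in range(32)]
reduced = fused_allreduce(grads, allreduce=lambda x: x * 2)
```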

The fourth method is multi-stream communication. On a high-bandwidth TCP network a single communication stream cannot saturate the bandwidth, so we communicate over multiple streams. We also found that the transmission rates of the streams differ, so we load-balance: faster streams automatically take on more gradient traffic, and slower streams take less.

The fifth method is dynamic tuning of the fusion granularity and the number of communication streams. During the first few training batches we adjust these parameters dynamically according to the current network, so the system adapts itself to reach optimal performance under different network conditions.

[Figure: dynamic tuning of fusion granularity and communication streams across training batches]

The figure shows the dynamic tuning process, with computation in green and communication in red. In the first training batch there is only one communication stream and the communication time is long. In the middle section two streams are opened, then four streams with load balancing among them, and by the last batch the best performance is reached. A toy version of this tuning loop is sketched below.
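In the sketch below, the cost model inside `run_one_batch` is invented so the example runs on its own; it stands in for timing real warm-up batches on the current network, after which the fastest (fusion size, stream count) setting is kept.

```python
import itertools, time

def run_one_batch(fusion_mb, num_streams):
    # Toy cost model standing in for one real training batch with a fused,
    # multi-stream all-reduce: smaller buckets add latency, streams help up to a point.
    per_call_latency = 0.002 * (256 / fusion_mb)
    bandwidth_time = 0.02 / min(num_streams, 4)
    time.sleep(per_call_latency + bandwidth_time + 0.001 * num_streams)

def timed(setting):
    start = time.perf_counter()
    run_one_batch(*setting)
    return time.perf_counter() - start

# Try each candidate during the warm-up batches and keep the fastest one.
candidates = list(itertools.product([8, 32, 64, 128], [1, 2, 4, 8]))  # MB x streams
best = min(candidates, key=timed)
print("tuned (fusion_mb, num_streams):", best)
```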


With these optimizations in place, we gave them a first trial on the same ResNet-50 + ImageNet benchmark used by the companies above: on 512 P100s the performance is 462 times that of a single card, a nearly linear speed-up, which shortened the training time from the original 5 days to 16 minutes.

For this DAWNBench submission, we also published a large-scale training result on V100s: reaching 93% Top-5 accuracy takes only 2 minutes and 38 seconds.

Our other architecture upgrade, FastGPU, helps users quickly build large-scale distributed training clusters and optimize their business costs in the cloud. Next, let's look at how to build a large-scale distributed training cluster in the cloud with FastGPU.

Alibaba Cloud's services provide OpenAPI interfaces for creating computing, storage, and network resources directly. FastGPU encapsulates these OpenAPI interfaces, so a large-scale distributed cluster can be created in the cloud and a large-scale distributed training task started with a single command.

[Figure: FastGPU workflow for building a training cluster in the cloud]

As shown in the figure above, green represents the user, blue represents Alibaba Cloud resources, and orange represents FastGPU. In the initial state, the user uploads the training data set to the object storage OSS and opens an ECS instance as a development machine to hold the training code (or keeps it in Cloud Shell). From this development machine, FastGPU creates all the basic resources a deep learning application needs: large-scale GPU computing resources, storage resources (cloud disks and the parallel file system), and interactive resources (tmux and TensorBoard), through which users can watch the training process in real time.

Once the resources required for training are ready, the distributed training task starts automatically. When it finishes, the resources are automatically released, and the trained models and log files are stored on OSS or the development machine for later use.
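The workflow can be summarized with a hypothetical orchestration sketch; the helper names below are placeholders for what FastGPU does through the OpenAPI, not the real FastGPU or Alibaba Cloud SDK interfaces.

```python
# Hypothetical FastGPU-style lifecycle: create, train, release.

def create_gpu_instances(count, instance_type):
    print(f"create {count} x {instance_type} servers via the ECS OpenAPI")
    return [f"instance-{i}" for i in range(count)]

def mount_cpfs(instances, mount_point="/cpfs"):
    print(f"mount the CPFS parallel file system at {mount_point} on {len(instances)} nodes")

def launch_training(instances, cmd):
    print(f"launch '{cmd}' across {len(instances)} nodes (rank 0 = {instances[0]})")

def release(instances):
    print(f"release {len(instances)} instances so billing stops with the job")

nodes = create_gpu_instances(count=4, instance_type="8xGPU")
mount_cpfs(nodes)
launch_training(nodes, cmd="python train.py --distributed")
release(nodes)
```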

FastGPU saves time, saves money, and is easy to use.

The first is time. Configuring a distributed deep learning environment by hand means preparing the GPU, network, and storage resources, then setting up the deep learning environment on every machine (a specific version of the operating system, GPU driver, CUDA, cuDNN, Tensorflow, and so on), then uploading the training data to each machine and opening up the network between them. This can take an engineer a whole day, while FastGPU completes it in about 5 minutes.

The second is money. The life cycle of the GPU resources is kept in sync with the life cycle of the training: GPU resources are brought up only when the training or inference task is ready, and they are automatically released when the task ends, so GPUs never sit idle. FastGPU also supports creating and managing preemptible (low-cost) GPU instances.

The third is ease of use. All the resources created are IaaS (infrastructure) resources, and every resource and task is accessible, adjustable, reproducible, and traceable.

Large-scale deep learning application architecture and performance optimization practices

In large-scale distributed training we want performance to scale linearly with the number of GPUs, but in reality the ideal speed-up ratio is rarely achieved; adding GPU servers often brings no corresponding increase in performance.

There are two main bottlenecks: when many GPU servers read training files simultaneously, the parallel access capability of the file system, including its IOPS and bandwidth, becomes the bottleneck; and the communication between the GPU servers becomes the bottleneck.

On Alibaba Cloud, the high-concurrency parallel file system CPFS can be created with one click to solve the problem of highly concurrent file access, while AIACC solves the performance problem of large-scale distributed communication.

Finally, I will share the application architecture and performance optimization practices of four large-scale deep learning scenarios: image recognition, CTR estimation, face recognition, and natural language understanding.

The first case is one-click construction of a distributed training task for large-scale image recognition.

This scenario trains on the 1.28 million images of ImageNet; the models are ResNet-50 and VGG-16, and the training framework is Tensorflow.

FastGPU pulls up the required architecture with one click: multiple 8-card P100 GPU servers, a 25 Gb network, and the parallel file system CPFS; AIACC-Tensorflow is then used for distributed training.

Workers on multiple GPU servers train in parallel and read training data from CPFS in parallel. CPFS provides aggregated IOPS and aggregated bandwidth for the GPU servers to access data concurrently, while AIACC brings the communication between the GPUs to optimal performance.
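On the input side, a hedged sketch of such a pipeline (assuming the TF 2.x tf.data API and a TFRecord layout under a hypothetical `/cpfs` mount; this is not the code of the actual submission) would shard the file list per worker and read many files from the CPFS mount in parallel:

```python
import tensorflow as tf

def parse_record(serialized):
    # Placeholder parser; a real pipeline would decode image bytes and labels here.
    return serialized

def make_dataset(pattern, rank, world_size, batch_size=256):
    files = tf.data.Dataset.list_files(pattern, shuffle=True, seed=42)
    files = files.shard(num_shards=world_size, index=rank)     # each worker reads its own files
    ds = files.interleave(tf.data.TFRecordDataset,
                          cycle_length=8,                      # several files read in parallel
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.map(parse_record, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# e.g. ds = make_dataset("/cpfs/imagenet/train-*", rank=0, world_size=32)
```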


The following figure shows the results of performance optimization for large-scale image recognition distributed training. Uber's open-source Horovod is also a distributed training optimization framework, based mainly on ring-allreduce communication, and its distributed performance is better than that of native Tensorflow and the other frameworks. In our case, training performance on 32 P100s improved by 65% over Horovod, and on 128 P100s by 80%.

[Figure: image recognition distributed training performance compared with Horovod]

The second case is distributed training for large-scale CTR estimation.

CTR estimation makes personalized recommendations based on each user's behavior on the Internet; for example, based on clicks, dwell time, likes, shares, purchases, and other behaviors, it recommends content, goods, or advertisements that the user may be interested in.

In this case the training data contains 100 billion records, the model is Wide & Deep, and the distributed framework is Tensorflow.

We first use FastGPU to pull up the architecture with one click: multiple 2-card M40 GPU servers, a 10 Gb network, and the HDFS file system, and use AIACC-Tensorflow for distributed training.

[Figure: architecture and scaling results for large-scale CTR estimation]

In the chart on the right, the green bars show the performance of native Tensorflow: as the number of nodes increases there is little acceleration, and training 100 billion records within 1 day is impossible.

We profiled the application and found two main bottlenecks: the IO bottleneck of reading files from HDFS, and the communication bottleneck between machines.

We optimized file IO with multi-threaded parallel reading and multiple buffer queues (sketched below), and optimized cross-machine communication with AIACC's communication optimization techniques.
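A toy sketch of the IO fix, assuming a pool of reader threads that fill a bounded buffer queue while the training loop consumes batches, so slow HDFS reads overlap with computation instead of blocking it (the file names and batch counts here are made up):

```python
import queue, threading

file_list = [f"part-{i:05d}" for i in range(64)]      # hypothetical HDFS shard names
file_q, batch_q = queue.Queue(), queue.Queue(maxsize=32)
for f in file_list:
    file_q.put(f)

def reader():
    while True:
        try:
            path = file_q.get_nowait()
        except queue.Empty:
            return
        # Stand-in for "open the HDFS file and parse it into training batches".
        for batch_id in range(10):
            batch_q.put((path, batch_id))

threads = [threading.Thread(target=reader, daemon=True) for _ in range(8)]
for t in threads:
    t.start()

consumed = 0
while consumed < len(file_list) * 10:                 # training loop: consume batches
    batch = batch_q.get()
    consumed += 1
print("consumed", consumed, "batches")
```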

In the end we achieved a 3.5x performance improvement on 4 GPU cards, 8.5x on 64 GPU cards, and 13.4x on 128 GPU cards, and can now train on the 100 billion records in 5 hours.


The third case is distributed training for large-scale face recognition.

In face recognition, the complexity of distributed training grows with the number of identity classes. In this case the number of face classes reaches the tens of millions; the model is InsightFace and the computing framework is MXNET.

We first use FastGPU to pull up multiple 8-card P100 GPU servers, a 25 Gb network, and the parallel file system CPFS, and use AIACC-MXNET for distributed training.


Pure data parallelism is impossible with tens of millions of face classes; a hybrid of data parallelism and model parallelism is required. We therefore extended the AIACC interface to support MXNET's KVStore interface on the one hand and hybrid data/model parallelism on the other. With AIACC-MXNET the face recognition capacity scales to the tens of millions of classes, and performance improved by 56% on 16 GPU cards, 135% on 32 GPU cards, and 280% on 64 GPU cards. A sketch of the hybrid-parallel classification layer is given below.
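The hybrid-parallel classification layer can be pictured with a small NumPy sketch (toy sizes, my own illustration of the general technique rather than the AIACC-MXNET implementation): features stay data-parallel, the huge classification weight matrix is split by class across workers, and one all-reduce produces the softmax normalizer.

```python
import numpy as np

WORLD, FEAT_DIM, CLASSES, BATCH = 4, 128, 10_000, 32   # toy sizes, not tens of millions
rng = np.random.default_rng(0)
# Each worker owns CLASSES // WORLD output classes of the final classification layer.
w_shards = [rng.normal(size=(FEAT_DIM, CLASSES // WORLD)) for _ in range(WORLD)]
features = rng.normal(size=(BATCH, FEAT_DIM))          # batch features after all-gather

partial_logits = [features @ w for w in w_shards]                       # model parallel
row_max = np.max([p.max(axis=1) for p in partial_logits], axis=0)       # all-reduce(max)
denom = sum(np.exp(p - row_max[:, None]).sum(axis=1)
            for p in partial_logits)                                    # all-reduce(sum)
# Each worker can now form softmax probabilities for its own class shard only.
probs_shard0 = np.exp(partial_logits[0] - row_max[:, None]) / denom[:, None]
```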


The fourth case is large-scale distributed training of natural language understanding.

The models in this case are Transformer and BERT. BERT is a very large model open-sourced by Google that has achieved excellent results on NLP benchmarks; with 110 million parameters, it poses a big challenge for the speed-up ratio of distributed training. We use FastGPU to pull up multiple 8-card P100 GPU servers, a 25 Gb network, and the parallel file system CPFS, and use AIACC-Tensorflow for distributed training.


We extended the AIACC interface to support distributed training of the Transformer and BERT models. In the end, the Transformer model achieved a 7.8x speed-up on 16 GPU cards, and BERT achieved a 7.4x speed-up on 16 GPU cards.


FastGPU one-click deployment and training gesture recognition application source code:
https://github.com/aliyun/alibabacloud-aiacc-demo/tree/master/pytorch/gtc-demo

The large-scale face recognition distributed training source code introduced above:
https://github.com/aliyun/alibabacloud-aiacc-demo/tree/master/mxnet/insightface

The large-scale natural language understanding distributed training source code introduced above:
https://github.com/aliyun/alibabacloud-aiacc-demo/tree/master/tensorflow/bert

More large-scale deep learning source code will be open-sourced in the future, so stay tuned.


Originally published: 2020-04-08
Author: You Liang
Source: "AI Frontline"


