With this little trick, iGear speeds up model training by 300%

A high-accuracy AI model depends on large volumes of high-quality training data, typically annotation result files plus massive numbers of images. As the amount of data grows, the model training cycle grows with it. So how can training be sped up?

The first thing a deep-pocketed boss usually thinks of is adding more computing power and more resources.

If money is no object, you basically don't need to look at any other solution.

In most cases, though, computing resources are expensive and cannot be scaled up indefinitely. Given a limited pool of resources bought at great cost, how do we speed up model training and improve resource utilization?

This article will introduce the iGear high-performance cache acceleration solution. Let's first look at a simple AI workflow service diagram.

The collected data is screened and preprocessed by the iGear data center and labeled on the iGear labeling platform to form high-quality training datasets. These datasets are then transferred to the iGear training platform to train algorithm models. The iGear training platform schedules heterogeneous computing resources on top of Kubernetes clusters. In this architecture, compute and storage are separated: datasets live in a remote object storage cluster, so every training task has to reach across the network to fetch its data. This brings high network I/O overhead and also makes dataset management inconvenient.

The iGear high-performance cache acceleration solution described in this article aims to answer two questions:

1. How can we reduce I/O overhead to improve training efficiency and GPU utilization?

2. How can we manage datasets in a way that improves ease of use for users?

Overview of Caching Scenarios

As mentioned earlier, datasets are stored in remote object storage clusters. To make them easy to use, a common solution is to mount the object store into the training task via FUSE, so that users can browse and use the dataset as an ordinary file directory. This is convenient, but the high I/O overhead forces users to synchronize the dataset to the compute nodes ahead of time, either manually or with scripts, which adds to their mental burden during training.

To address this, we optimized how training datasets are accessed. When a user starts preparing for training, a dataset cache engine built on JuiceFS provides caching and warm-up, which both reduces accesses to the remote object storage and cuts down on manual user operations. The solution makes full use of the compute cluster's local storage to cache datasets: with a two-level cache (the training node's system page cache plus its local disk cache), model training is accelerated and GPU utilization also improves to a certain extent.
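As a rough illustration of the caching and warm-up steps (this is a minimal sketch, not iGear's actual integration), the snippet below mounts a JuiceFS volume with a local disk cache and pre-warms a dataset directory before training starts. The metadata URL, mount point, cache directory, and dataset path are placeholder assumptions.

```python
# Minimal sketch: mount a JuiceFS volume with a local disk cache, then warm up
# a dataset directory so the first epoch hits local cache instead of remote
# object storage. All paths and URLs below are placeholders.
import subprocess

META_URL = "redis://192.168.1.10:6379/1"   # placeholder metadata engine URL
MOUNT_POINT = "/mnt/jfs"                   # placeholder mount point
CACHE_DIR = "/data/jfs-cache"              # local SSD/NVMe cache on the training node
DATASET_DIR = f"{MOUNT_POINT}/imagenet"    # placeholder dataset directory

# Mount in the background with a 100 GiB local cache (--cache-size is in MiB).
subprocess.run([
    "juicefs", "mount", "-d",
    "--cache-dir", CACHE_DIR,
    "--cache-size", "102400",
    META_URL, MOUNT_POINT,
], check=True)

# Pre-warm the dataset into the local cache before the training job starts.
subprocess.run(["juicefs", "warmup", DATASET_DIR], check=True)
```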

JuiceFS is an open-source, high-performance shared file system designed for cloud environments, with extensive optimizations in data organization, management, and access performance. The community edition also has excellent documentation, so we won't go into the details here.
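Because JuiceFS exposes a POSIX file system, training code can consume the cached dataset like any local directory. Below is a minimal sketch using the standard torchvision ImageFolder/DataLoader pipeline; the mount path is a placeholder, not iGear's actual layout.

```python
# Minimal sketch: the cache mount looks like an ordinary local directory, so a
# standard torchvision data pipeline works unchanged.
import torch
from torchvision import datasets, transforms

train_dir = "/mnt/jfs/imagenet/train"  # placeholder: dataset exposed through the cache mount

train_set = datasets.ImageFolder(
    train_dir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
)

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=128,   # matches the batch size used in the experiments below
    num_workers=8,    # matches the worker count used in the experiments below
    shuffle=True,
    pin_memory=True,
)
```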

Testing the cache solution

Test plan

Our previous solution mounted the object storage system directly via FUSE, exposing the S3 bucket locally so that training jobs could access remote object storage data (a sketch of such a plain FUSE mount follows the list below). The optimized high-performance cache acceleration solution uses the same object storage backend, but adds caching and warm-up on top of it to improve storage performance. On this basis, we ran the following two sets of comparative experiments, both against the same object storage and with all other conditions unchanged:

  1. Performance comparison with high-performance cache acceleration enabled vs. disabled

  2. Performance comparison between the high-performance cache acceleration solution and a plain FUSE mount
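For contrast with the cache-accelerated setup sketched earlier, here is a rough sketch of the kind of plain FUSE mount used as the baseline. The article does not name the specific FUSE client, so s3fs-fuse, the bucket name, endpoint, and credential path below are illustrative assumptions only.

```python
# Rough sketch of the baseline: mount an S3 bucket directly via a FUSE client,
# so every training read goes to the remote object storage.
import subprocess

BUCKET = "training-datasets"           # placeholder bucket name
MOUNT_POINT = "/mnt/s3"                # placeholder mount point
ENDPOINT = "http://oss.example.local"  # placeholder S3-compatible endpoint

subprocess.run([
    "s3fs", BUCKET, MOUNT_POINT,
    "-o", f"url={ENDPOINT}",
    "-o", "use_path_request_style",    # path-style addressing for S3-compatible stores
    "-o", "passwd_file=/etc/passwd-s3fs",
], check=True)
```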

Test method

On physical servers, we train the ResNet50 v1.5 model provided in the pytorch/examples repository, reproduce single-machine single-GPU and single-machine multi-GPU runs, and compare execution times.
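The results below come from full ResNet50 training runs, but as a simplified illustration of how data-pipeline throughput under the two mounts could be compared, here is a small timing sketch. It is not the benchmark script itself; `fuse_loader` and `cache_loader` are assumed to be DataLoaders built on the two mount points.

```python
# Simplified throughput measurement: iterate over a DataLoader for a fixed
# number of batches and report wall-clock images/second.
import time

import torch


def measure_throughput(loader, num_batches=200, device="cuda"):
    """Return images/second over the first `num_batches` batches of `loader`."""
    images_seen = 0
    start = time.time()
    for step, (images, _labels) in enumerate(loader):
        images = images.to(device, non_blocking=True)
        images_seen += images.size(0)
        if step + 1 >= num_batches:
            break
    if device.startswith("cuda") and torch.cuda.is_available():
        torch.cuda.synchronize()
    return images_seen / (time.time() - start)


# Hypothetical usage: fuse_loader / cache_loader are built on the plain FUSE
# mount and the cache-accelerated mount, respectively.
# print("fuse mount:  %.1f img/s" % measure_throughput(fuse_loader))
# print("cache mount: %.1f img/s" % measure_throughput(cache_loader))
```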

Test environment

Hardware: Tesla V100-SXM2-32GB
Driver: NVIDIA 450.80.02
Operating system: Ubuntu 20.04.1 LTS
Test tool: the script provided with PyTorch ResNet50 v1.5
Dataset: ImageNet

Experiment 1: high-performance cache acceleration enabled vs. disabled

ResNet50 v1.5, batch_size = 128, workers = 8

As the figure above shows, without the cache the number of images processed by the training task barely changes as computing power increases, indicating that an I/O bottleneck has been reached. With the cache enabled, throughput rises as computing power increases. This shows that cache acceleration greatly reduces I/O overhead and, at the same computing power, greatly improves training speed: on a single machine with 8 GPUs, speed improves by **230%**.

In terms of model training time, the cache cut training from 1381 minutes to 565 minutes, roughly 2.4× faster.

The first set of experiments compares performance before and after dataset caching and verifies the necessity of the high-performance caching solution for speeding up iGear training tasks.

Experiment 2: high-performance cache acceleration vs. FUSE mount

The more common solution today is to mount the remote object storage locally via FUSE and serve dataset access requests from that mount. To compare this plain FUSE mount against the high-performance cache acceleration solution, we designed a second set of experiments:

ResNet50 v1.5, batch_size = 128, workers = 8

In terms of model training time, the FUSE mount scheme takes 1448 minutes, while the high-performance cache acceleration scheme cuts this to 565 minutes, roughly 2.5× faster (about 40% of the time required by the FUSE mount).

Compared with using object storage directly, our high-performance cache solution therefore delivers a significant improvement in training speed and training time.

The second set of experiments compares model training time under the different schemes and verifies the importance of the high-performance caching solution for speeding up iGear training tasks.

Conclusion

Faced with expensive and limited computing resources, the high-performance cache acceleration solution lets us greatly speed up training tasks on the iGear platform, substantially shorten model training time, and improve GPU utilization. With more computing power, the gains would not stop at what we measured in this test environment.

Author of this article: an iGear veteran. Original post: https://mp.weixin.qq.com/s/Lh5UEVw4-gCe6wAVcmznxg

If you found this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)


Reprinted from: my.oschina.net/u/5389802/blog/5433177