Getting Closer to Deep Learning, Getting to Know MoXing: A First Look at MoXing, the Ace Tool of Huawei Cloud ModelArts [Share]

Abstract: This is the first article in the MoXing series. It focuses on what MoXing is, the advantages of the MoXing API, and the basic structure of a MoXing program.

What Is MoXing

MoXing is the network model development API provided by Huawei Cloud's Deep Learning Service. Compared with native APIs such as TensorFlow and MXNet, the MoXing API makes model code much easier to write: users only need to care about the data input (input_fn) and model construction (model_fn) code, and any model can then run with high performance on multiple GPUs and in distributed mode.

MoXing-TensorFlow natively supports the TensorFlow, Keras, and slim APIs, and can be used to build a wide variety of models for image classification, object detection, generative adversarial networks, natural language processing, OCR, and more.

 

Why the Name "MoXing"?

 

First of all, "MoXing" is the pinyin of the Chinese word for "model" (模型). In the era of deep learning, China's science and engineering teams have gradually become industry leaders. Choosing a name derived from Chinese highlights the confidence of the Deep Learning Service (DLS) R&D team, which is committed to making the MoXing API the industry benchmark for model building.

At the same time, "MoXing" also implies "Model Crossing". On the one hand, "Crossing" carries the meaning of a "crossroads": Model Crossing means the MoXing API brings together representative classical models, connecting users with the latest achievements in the modeling field. On the other hand, "Crossing" also carries the meaning of a "leapfrogging voyage": Model Crossing means the MoXing API is designed to achieve leapfrog development of models, offering users significantly better performance than the native APIs along with greater ease of use.

Of course, imaginative users can also read "MoXing" as 魔性 ("magical"). Even if the name carries a bit of a gimmick, that cannot hide how outstanding the API is. Get closer to deep learning, get to know MoXing, and tap its full potential: the magical MoXing API will help you develop even more powerful models!

 

Advantages of the MoXing API

Huawei Cloud's Deep Learning Service fuses technologies such as hybrid parallelism, gradient compression, convolution acceleration, and EASGD, and the MoXing framework can automatically turn single-node code into large-scale distributed training, greatly improving training speed and efficiency.

The following experimental data comes from Huawei Cloud's Deep Learning Service.

[Figures: throughput and speedup of MoXing vs. native TensorFlow at 1, 4, and 8 GPUs]

As the comparison shows, with 1 GPU, MoXing's throughput and speedup do not show a significant advantage; with 4 GPUs, its throughput and speedup fully surpass native TensorFlow; and with 8 GPUs, it makes a qualitative leap relative to the other APIs.

The following two examples illustrate MoXing's performance.

1. Using MoXing to Train ResNet-50 with LARS

LARS makes it possible to train neural networks with a very large batch_size. Its advantage is that batch_size can be increased without hurting convergence accuracy; a larger batch_size means more distributed nodes can be used to train the network, which reduces total training time. (With traditional methods, large-scale training runs into the problem that a large batch_size prevents convergence, so traditional methods cannot be used for training at that scale.)

Using the LARS optimizer in MoXing, ResNet-50 can be trained distributedly with batch_size = 32k.

[Figure: loss curves]

[Figure: accuracy curves]

⊙ The green line is the convergence curve of single-node ResNet-50 trained with 4 GPUs.

⊙ The gray line is the same setup as the green line but trained with FP-16; accuracy is almost unaffected.

⊙ The orange line is the convergence curve of a ResNet-50 model trained distributedly with MoXing.

⊙ The red line is the convergence curve of ResNet-50 with batch_size = 32k, achieved using MoXing's LARS feature.

The core LARS code defines a LARS-based optimizer:

[Code screenshot: defining a LARS-based optimizer]
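The original article shows the code only as a screenshot. As a rough illustration of the idea behind LARS (layer-wise adaptive rate scaling), the sketch below rescales each layer's gradient by a local learning rate proportional to ||w|| / (||g|| + weight_decay·||w||) before handing it to a standard momentum optimizer. This is plain TensorFlow 1.x, not MoXing's actual optimizer implementation; the function name and hyperparameter values are illustrative assumptions. For the real code, see the links below.

```python
import tensorflow as tf

def lars_scale_gradients(grads_and_vars, eta=0.001, weight_decay=1e-4):
    """Illustrative LARS-style rescaling of per-layer gradients.

    For each variable: local_lr = eta * ||w|| / (||g|| + weight_decay * ||w||),
    and the scaled gradient local_lr * (g + weight_decay * w) is returned.
    """
    scaled = []
    for grad, var in grads_and_vars:
        if grad is None:
            scaled.append((grad, var))
            continue
        w_norm = tf.norm(var)
        g_norm = tf.norm(grad)
        # Fall back to 1.0 when either norm is zero (e.g. at initialization).
        local_lr = tf.where(w_norm * g_norm > 0.0,
                            eta * w_norm / (g_norm + weight_decay * w_norm),
                            tf.ones_like(w_norm))
        scaled.append((local_lr * (grad + weight_decay * var), var))
    return scaled

# Usage sketch: wrap a standard momentum optimizer.
# opt = tf.train.MomentumOptimizer(learning_rate=base_lr, momentum=0.9)
# train_op = opt.apply_gradients(lars_scale_gradients(opt.compute_gradients(loss)))
```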

Complete code (based on TensorFlow-1.4): http://code.huawei.com/inforsight-dl/tf-models/blob/v1.x.x-tf-1.4/moxing/moxing/tensorflow/practice/image_classification/train_model_32k.py

Run parameters: https://github.com/huaweiyun7759/backup/tree/master/Using%20MoXing%20to%20train%20resnet-50%20with%20LARS

2. Using MoXing to Train ResNet-50 with DGC

DGC reduces the communication volume of distributed training, effectively easing the bottleneck caused by network bandwidth and increasing the distributed speedup without affecting convergence accuracy.

Comparing traditional resnet_v1_50 training with training using DGC: the traditional convergence accuracy is top-1 = 74.4, top-5 = 91.7, while the DGC convergence accuracy is top-1 = 74.5, top-5 = 91.8. For throughput, as the chart below shows, with 1 Gbps of bandwidth the speedup of native TF is 0.4147 while the speedup of DGC is 0.8670, more than double that of native TF.
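To make the idea behind DGC concrete, the sketch below shows top-k gradient sparsification with local residual accumulation: each step, a worker adds back the values it withheld earlier, sends only the largest fraction of gradient values by magnitude, and keeps the rest locally for the next step. This is an illustration of the general deep-gradient-compression technique, not MoXing's internal implementation; the function name and the fixed sparsity value are assumptions.

```python
import numpy as np

def dgc_sparsify(grad, residual, sparsity=0.999):
    """Illustrative top-k gradient sparsification with residual accumulation.

    grad, residual: 1-D arrays of the same shape (one layer's gradient and
    its locally accumulated, not-yet-sent values).
    Returns (sparse_grad_to_send, new_residual).
    """
    acc = grad + residual                                 # add back what was withheld earlier
    k = max(1, int(round(acc.size * (1.0 - sparsity))))   # number of values to keep
    threshold = np.partition(np.abs(acc), -k)[-k]         # k-th largest magnitude
    mask = np.abs(acc) >= threshold
    to_send = np.where(mask, acc, 0.0)                    # only these values are communicated
    new_residual = np.where(mask, 0.0, acc)               # the rest stays local for next step
    return to_send, new_residual

# Example: with sparsity = 0.999, only ~0.1% of the values are exchanged.
g = np.random.randn(10000)
r = np.zeros_like(g)
sent, r = dgc_sparsify(g, r)
print((sent != 0).mean())   # ~0.001
```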

[Figure: accuracy curves]

[Figure: gradient sparsity over training]

As the figure shows, the gradient sparsity of deep gradient compression rises gradually from 75% to 99.9% during the first 5 epochs, so in those first 5 epochs the distributed speedup is not necessarily higher than that of ordinary distributed training; from the 5th epoch onward, however, the speedup improves significantly while model accuracy does not drop. DGC's distributed speedup after epoch 5:

[Figure: DGC distributed speedup after epoch 5]

Basic usage of DGC: add import moxing.tensorflow as mox to your code, then pass the DGC-related parameters when running the script:

dgc_sparsity_strategy: sparsity strategy (a warm-up sketch follows this list)

dgc_momentum_type: momentum type

dgc_momentum: momentum value

dgc_momentum_factor_masking: whether to apply factor masking

dgc_total_samples: number of samples in the training set
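As a rough illustration of the kind of sparsity warm-up described above (75% rising to 99.9% over the first 5 epochs), the sketch below computes a per-epoch sparsity with an exponential warm-up. The function name and schedule shape are assumptions for illustration only and do not describe how dgc_sparsity_strategy is actually parsed or applied inside MoXing.

```python
def dgc_warmup_sparsity(epoch, warmup_epochs=5,
                        initial_sparsity=0.75, final_sparsity=0.999):
    """Illustrative exponential sparsity warm-up: 75% at epoch 0 rising to
    99.9% by epoch 4, then held constant (a sketch, not MoXing's code)."""
    if epoch >= warmup_epochs - 1:
        return final_sparsity
    keep_start = 1.0 - initial_sparsity   # fraction of gradient values kept at the start
    keep_end = 1.0 - final_sparsity       # fraction kept after warm-up
    keep = keep_start * (keep_end / keep_start) ** (epoch / (warmup_epochs - 1.0))
    return 1.0 - keep

print([round(dgc_warmup_sparsity(e), 4) for e in range(6)])
# [0.75, 0.9371, 0.9842, 0.996, 0.999, 0.999]
```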

Run parameters: https://github.com/huaweiyun7759/backup/tree/master/Using%20MoXing%20to%20train%20resnet-50%20with%20DGC

Code (based on TensorFlow-1.4): http://code.huawei.com/inforsight-dl/tf-models/blob/v1.x.x-tf-1.4/moxing/moxing/tensorflow/practice/image_classification/train_model.py

Basic Structure of a MoXing Program

The MoXing framework is easy to use: code can be run directly on Huawei Cloud's Deep Learning Service, a single set of code works for both stand-alone and distributed training, and data reading is already optimized, so users do not need to make further changes. There are many code examples, all based on TensorFlow-1.4; refer to the code itself for the run parameters. A rough sketch of the typical program structure is shown below.
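As an illustration of the input_fn / model_fn division described earlier, here is a sketch of what a minimal MoXing training program can look like, modeled on publicly available MoXing-TensorFlow examples. Exact function names, signatures, and arguments may differ between MoXing releases, so treat it as a sketch rather than a reference.

```python
import tensorflow as tf
import moxing.tensorflow as mox

slim = tf.contrib.slim

def input_fn(run_mode, **kwargs):
    # Data input: return one batch of (images, labels).
    # A real job would read the dataset configured for the service;
    # random tensors are used here only to keep the sketch self-contained.
    images = tf.random_normal([32, 28, 28, 1])
    labels = tf.random_uniform([32], maxval=10, dtype=tf.int64)
    return images, labels

def model_fn(inputs, run_mode, **kwargs):
    # Model construction: build the network and return the loss.
    images, labels = inputs
    logits = slim.fully_connected(slim.flatten(images), 10, activation_fn=None)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    return mox.ModelSpec(loss=loss, log_info={'loss': loss})

# mox.run wires input_fn and model_fn together; the same code is meant to run
# stand-alone or distributed on the Deep Learning Service.
mox.run(input_fn=input_fn,
        model_fn=model_fn,
        optimizer_fn=mox.get_optimizer_fn('sgd', learning_rate=0.01),
        run_mode=mox.ModeKeys.TRAIN,
        batch_size=32,
        auto_batch=False,
        log_dir='/tmp/moxing_demo',
        max_number_of_steps=100)
```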

Next issue in the MoXing series: a tutorial on run parameters, based on TensorFlow.

Source: original article from the Huawei Cloud community, by: Cloud AI
