StartDT AI Lab | Visual Intelligence Engine: Accelerating Algorithm Models



In earlier articles in the StartDT AI Lab column, we covered how computer vision and AI algorithms support StartDT's AIoT strategy, so readers should have a good picture of the role they play. This business-driven, technology-backed pattern has been well received by the market, and the broad rollout of StartDT's AIoT offerings has in turn brought the algorithm team a new challenge: how to further reduce the compute cost on the algorithm side and thereby improve business margins.

The goal is simple: take existing algorithm models and, without sacrificing accuracy, shrink them to save hardware storage cost and simplify their computation to save compute hardware cost. This demand for smaller, faster models is generally called the model acceleration problem. The problem has been studied in academia for a long time, with plenty of work worth learning from. This article lifts the veil on model acceleration.


Why accelerate models?

Before we start, it helps to understand why deep learning succeeded: why did deep neural networks take off now rather than in the 1980s and 1990s? Compared with that era, the major breakthroughs came from three directions: first, better optimization algorithms such as stochastic gradient descent; second, ever-larger labeled datasets; and third, high-performance GPU hardware that can supply the enormous compute needed for training and inference.

(Image classification performance of different models on CPU and GPU)

But GPUs are expensive, and industrial applications are highly cost-sensitive. That is why vendors such as Google build their own AI chips (TPUs) to cut costs at the source. The first question of model acceleration therefore maps to the efficiency issue the industry cares about most: how to deploy an algorithm on hardware stably and efficiently so that it delivers the greatest value.

The second goal of model acceleration is simply to be faster. Many scenarios have stringent latency requirements; the most obvious is autonomous driving, which relies heavily on deep neural networks for image processing. A braking decision that arrives 0.5 s late can cause a serious accident, so inference speed there is always under pressure.

Another scenario is bringing AI capabilities into mobile apps on mobile devices, the natural extension of AI experiences into the mobile internet. Well-known applications have all launched voice assistants of one kind or another, such as Siri and Xiao Ai.

Another goal of model acceleration is to deploy a model that meets requirements on performance-constrained devices. An accelerated model has fewer parameters and less computation, which cuts compute and storage overhead and makes deployment on constrained devices such as phones feasible. To put mobile performance in perspective: a common mobile ARM A72 big core delivers roughly 30 GFLOPS, while a desktop Intel Core i3 delivers about 1000 GFLOPS. In other words, taking a model that runs inference on the server and moving it to mobile calls for a speedup of at least 30x.

How do we accelerate models?

Model acceleration generally means slimming down an already-trained deep model to obtain a lightweight model with comparable accuracy. There is an important premise here: not all parameters in a deep neural network actually contribute; most of them are redundant, and only a small fraction is critical to the model's performance.

Given this premise, industry today accelerates models mainly in the following ways: weight quantization and knowledge distillation, which leave the network structure unchanged, and compact neural network design and network pruning, which change it. Academia and industry place slightly different emphasis on these directions: the former is more interested in compact network design, since it solves the problem at the source, while the latter focuses more on pruning and quantization, which lean toward engineering and whose acceleration gains are stable and controllable. Below we briefly cover the methods we use most in production: 1) weight quantization, 2) knowledge distillation, 3) network pruning.

01 Weight Quantization

The idea of quantization, in one sentence, is to map nearby values onto a single number. The most common scheme is INT8 quantization: weights and intermediate values originally stored as single-precision floating point (FP32) are represented as integers (INT8). Everything in a computer is stored in binary; FP32 uses 32 bits per value, INT8 uses 8. As the figure below shows, the FP32 format spends 23 bits on the fraction, so switching to INT8 means far sparser values and a much smaller numeric range (-128 to 127): the fractional part and anything outside that range are simply dropped. Quantizing naively like this, the lost values would badly hurt model accuracy.
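To make the loss from naive conversion concrete, here is a minimal NumPy sketch (the values are illustrative, not from the article) showing what happens when FP32 numbers are cast straight to INT8: the fractional parts vanish and anything outside the representable range is clipped.

```python
import numpy as np

# A few FP32 values: small fractions, plus one large outlier.
fp32_values = np.array([0.07, -0.003, 0.5, 150.0], dtype=np.float32)

# Naive conversion: round and clip to the INT8 range [-128, 127].
int8_values = np.clip(np.round(fp32_values), -128, 127).astype(np.int8)

print(int8_values)                 # [  0   0   0 127] -> the small weights all collapse to 0
print(fp32_values - int8_values)   # the information lost by direct casting
```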

(Source: Wikipedia — how FP32 is stored)

(How values change when FP32 is quantized to INT8)

Given that it hurts accuracy, why take the risk of quantizing at all? There are two main reasons. First, modern compute chips handle low-bit arithmetic much faster than high-bit arithmetic; many AI chips now even include dedicated INT8 compute cores, for example Rockchip's RK3399Pro ships with an NPU offering roughly 3 TOPS of compute. Second, loading 8-bit values is faster for both system memory and GPU memory and consumes less of it, so under the same memory budget you can load more and larger networks.

(Source: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/ — RTX 2080 Ti performance for FP32, FP16, and INT8)

So why can the INT8 data type work in deep neural networks at all, given the loss of numeric precision? Two main reasons:

1. Trained deep neural networks are famously robust to noise and perturbation.

2. Most trained weights fall within a very narrow interval.

There is published evidence for this. Song Han's ICLR 2016 paper DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING, a seminal work on network compression, analyzes the weight distribution of AlexNet's convolutional layers. The left figure below shows the weights of one layer, which lie almost entirely between -0.1 and 0.1.

(Left: weight distribution of one AlexNet convolutional layer; right: the 16 values retained after 4-bit quantization)

With 4-bit quantization, 4 bits can represent at most 16 distinct values, so most weights collapse: only 16 values survive, and their distribution, shown on the right, still matches the original shape fairly well. With 8-bit quantization, up to 256 values can be kept, the original weights are preserved much more faithfully, and the numeric loss from quantization is small.
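The collapse onto 16 shared values can be sketched with a simple k-means weight-sharing pass, in the spirit of the trained quantization in Deep Compression (a toy illustration on synthetic weights, not the paper's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=1024).astype(np.float32)  # toy layer, roughly in [-0.1, 0.1]

# 4-bit quantization: at most 2**4 = 16 shared values survive.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
codebook = kmeans.cluster_centers_.flatten()   # the 16 retained values
quantized = codebook[kmeans.labels_]           # every weight snaps to its nearest centroid

print(np.unique(quantized).size)           # 16
print(np.abs(weights - quantized).max())   # per-weight error stays small
```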

Building on this property, the most intuitive and simplest quantization is to multiply by a scale factor so that the FP32 fractional values become integers, perform the computation with these INT8 integers, and then divide the result by the same factor to recover FP32 values. Because the values concentrate in a very small range, there is little concern that scaling will push many of them outside the INT8 range. The mapping between real values and quantized values can therefore generally be written as:

(Mapping between real values and quantized values)

Here r is the real value; q is the number of quantization bits, e.g. 8 for INT8 quantization; and z is the zero point after quantization. In practice the scale factor and the min/max boundaries of the values being scaled all require repeated tuning. With a well-tuned setup, a roughly 4x speedup from quantization can usually be achieved while keeping the model's accuracy loss within 0.5%.
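As a concrete illustration of that mapping, here is a minimal per-tensor sketch assuming a standard affine scheme: the scale is derived from the observed min/max and the bit width q, and z is the zero point. Real quantization toolchains calibrate these boundaries far more carefully; the code below is only a sketch.

```python
import numpy as np

def quantize(x, q=8):
    """Affine-quantize a float tensor to q-bit integers (per-tensor, minimal sketch)."""
    qmin, qmax = -(2 ** (q - 1)), 2 ** (q - 1) - 1       # e.g. -128..127 for INT8
    scale = (x.max() - x.min()) / (qmax - qmin)          # spread the observed range over 2^q levels
    zero_point = int(round(qmin - x.min() / scale))      # integer that represents real value 0
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    """Recover approximate real values from the quantized tensor."""
    return (x_q.astype(np.float32) - zero_point) * scale

w = np.random.normal(0, 0.05, size=1000).astype(np.float32)  # weights concentrated near 0
w_q, s, z = quantize(w)
w_hat = dequantize(w_q, s, z)
print(np.abs(w - w_hat).max())   # quantization error stays around scale / 2
```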

02 Network Pruning

Another important acceleration method for neural networks is model pruning. Pruning is also common in classic machine learning, for example in decision trees and GBM algorithms. For neural networks, the idea is inspired by synaptic pruning in the human brain: the process, seen in many mammals between infancy and adolescence, in which axons and dendrites fully decay and die off and synapses disappear. Synaptic pruning starts at birth and continues into one's twenties.

As noted earlier, a neural network has a huge number of parameters, and after training most of them cluster around 0 and contribute very little to the network as a whole. The goal of pruning is to delete these low-contribution nodes, making the network sparse and shrinking the number of parameters that must be stored. There are side effects: accuracy drops somewhat, and since the redundant parameters may be part of what makes the network robust, a pruned model also loses some robustness.

The classic pruning method works on a pretrained model: set a threshold or a pruning ratio, throw away the weights that fall below it, then fine-tune on the training set to obtain the final pruned model. The procedure itself is very simple; many criteria can drive the cut, such as weight magnitude, weight gradient magnitude, or weight independence, but it usually takes a great deal of time for repeated tuning and fine-tuning. This is today's mainstream structured pruning: the granularity is coarse, whole layers of the network are removed, and the accuracy loss is relatively large, but the advantage is that it does not care which model or hardware you use, so it generalizes very well.
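A minimal PyTorch-style sketch of the threshold-then-fine-tune recipe might look like the following. The toy model and pruning ratio are placeholders, and the fine-tuning loop itself is omitted; this is an illustration of magnitude pruning, not a production implementation.

```python
import torch
import torch.nn as nn

# Toy "pretrained" model standing in for a real trained network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

PRUNE_RATIO = 0.7  # drop the 70% of weights with the smallest magnitude

masks = {}
with torch.no_grad():
    for name, p in model.named_parameters():
        if p.dim() < 2:            # skip biases
            continue
        k = int(PRUNE_RATIO * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()
        p.mul_(masks[name])        # zero out the small weights

# During fine-tuning, re-apply the masks after each optimizer step
# so that pruned weights stay at zero.
def apply_masks():
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
```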

(Source: Song Han, NIPS 2015 — network structure before and after neuron pruning)

Later work proposed finer-grained unstructured pruning methods, which can prune individual neurons within a layer; the accuracy loss is smaller, but the result depends on the specific algorithm and hardware platform and the process is more involved. In addition, as reinforcement learning and generative adversarial networks have spread through deep learning, more and more pruning algorithms use them to produce the pruned model: reinforcement learning can automatically search the model's pruning space and find the best pruned model for the given constraints, and a generative adversarial network can, guided by its discriminator, drive the generator to produce a pruned model that meets the requirements.

03 Knowledge Distillation

If the two acceleration methods above still cannot meet the need, you can try knowledge distillation, proposed in 2015 by Hinton and Google's Jeff Dean. On many tasks, a large, complex network generally outperforms a simple small one. Knowledge distillation trains a lightweight, compact small network on the training set while adding, as extra supervision, the information extracted by a well-converged large network, so that the small network fits the large one and eventually learns a similar function mapping. At deployment time the large network can then be swapped for the faster small network.

(The basic structure of the knowledge distillation process)

At most, knowledge distillation can replace the entire backbone of a deep neural network, which makes it the most general of these methods, so it is very valuable in StartDT's real-world image tasks. As for the acceleration itself, the smaller the compute of the small network, the larger the speedup, though in general the worse the student learns and the larger the accuracy loss. Since the feature extraction networks used by different tasks vary considerably, the distillation procedure usually has to be adapted to each task.
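As a rough sketch of that teacher–student training step, the commonly used distillation objective (softened teacher outputs matched with KL divergence, plus the usual hard-label loss, as in Hinton et al.) can be written as below. The temperature T and weight alpha are illustrative values that would need tuning per task, and the teacher/student networks are assumed to exist.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Combine soft-target loss (teacher supervision) with the usual hard-label loss."""
    # Soften both distributions with temperature T, then match them with KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients, as in Hinton et al.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Inside the training loop: the frozen large network supervises the small one.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, labels)
```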

Conclusion

To sum up, in its model acceleration practice StartDT AI Lab combines weight quantization, knowledge distillation, compact neural network design, and network pruning to keep the small, fast, and accurate models that its various lines of business require, greatly improving R&D efficiency.

Origin www.cnblogs.com/StartDT/p/11911058.html