
This paper proposes DS-Net, a dynamic network that adapts well to hardware acceleration and realizes dynamic routing through the proposed double-headed dynamic gate. Based on the high-performance network design and the IEB and SGS training strategies proposed in the paper, the performance of static SOTA networks can be matched with only 1/2-1/4 of the computation, together with an actual speedup of 1.62x.

Source: Xiaofei's Algorithm Engineering Notes public account

Paper: Dynamic Slimmable Network

Introduction


Model speed matters greatly when deploying models on mobile devices. Methods to improve inference speed include model pruning, weight quantization, knowledge distillation, efficient model design, and dynamic inference. Among them, dynamic inference adjusts the network structure according to the input to reduce the overall computation, along two directions: dynamic depth and dynamic width. As shown in Figure 2, dynamic networks automatically trade off accuracy against computation, which is more flexible than static model design and pruning methods.

However, the paper finds that networks with dynamic width rarely run as fast as expected in practice, mainly because the sparse convolution produced by dynamic pruning does not match the acceleration techniques of current hardware. Most dynamic pruning of convolution kernels is implemented either by zero masking (running the full convolution and then selecting the relevant outputs through a mask) or by path indexing (extracting a new, smaller kernel via indexing such as [:, :] and recomputing). As shown in Table 1, both approaches are computationally inefficient, so the overall inference speed is not actually accelerated.
To solve this problem, the paper proposes DS-Net, a dynamic slimmable network that realizes a dynamic network while matching hardware acceleration well; the sketch below contrasts the three pruning mechanics.
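As a quick illustration of why masking and indexing fail to deliver real speedups while slicing does, here is a minimal PyTorch-style sketch; the tensor shapes and the number of kept channels are made up for illustration.

```python
import torch

weight = torch.randn(64, 32, 3, 3)   # full conv kernel: 64 output, 32 input channels
keep = 48                             # channels the router decides to keep

# Zero masking: the full-size convolution is still computed, pruned filters are merely zeroed.
mask = torch.zeros(64, 1, 1, 1)
mask[:keep] = 1.0
masked_weight = weight * mask         # same FLOPs as the dense kernel

# Path indexing: gathers the selected filters into a new buffer before every convolution.
idx = torch.arange(keep)
indexed_weight = weight[idx]          # advanced indexing copies memory each time

# Contiguous slicing (what DS-Net relies on): a view of the first `keep` filters, no copy.
sliced_weight = weight[:keep]
print(sliced_weight.is_contiguous())  # True -> feeds a plain dense convolution directly
```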
The main contributions of the paper are as follows:

  • A new dynamic network routing mechanism is proposed, realizing dynamic routing of the network structure through the proposed double-headed dynamic gate. In addition, dynamic pruning of the convolution is done by slicing, which keeps the weights contiguous in memory and therefore adapts well to hardware acceleration.
  • A two-stage training method for DS-Net is proposed, including the IEB and SGS methods. IEB stabilizes the training of the slimmable supernet, and SGS increases the diversity of the gate outputs; both help improve the performance of DS-Net.
  • In ImageNet experiments, DS-Net outperforms SOTA dynamic networks by about 5.9% overall and is only slightly below static networks such as ResNet and MobileNet, while saving 2-4x computation and delivering 1.62x actual inference speedup.

Dynamic Slimmable Network


The dynamic slimmable network proposed in this paper dynamically generates a network for each input sample by learning a slimmable supernet and a dynamic gating mechanism. As shown in Figure 3, the supernet of DS-Net is the complete network containing all the full convolutions. The dynamic gates are a series of prediction modules that set the convolution width of each stage according to the input and thereby generate a sub-network, which is also called dynamic routing.
In current dynamic network research, the main network and the dynamic routing are usually trained jointly, similar to jointly optimized network search methods. Following one-shot NAS methods, the paper instead proposes a decoupled two-stage training method to ensure the generalization of every path in DS-Net. In Stage I, the gates are disabled and the supernet is trained with the IEB method; in Stage II, the supernet weights are frozen and the gates are trained alone with the SGS method.

Dynamic Supernet

This section first introduces the hardware-efficient channel slicing scheme and the supernet designed in the paper, and then the IEB method used in Stage I.

  • Supernet and Dynamic Channel Slicing

In dynamic networks such as dynamic pruning and dynamic convolution, the convolution kernel $\mathcal{W}$ is dynamically parameterized by $\mathcal{A}(\theta, \mathcal{X})$ according to the input $\mathcal{X}$; such a convolution can be expressed as:

Dynamic convolution removes unimportant feature channels according to the input and lowers the theoretical computation, but its actual speedup mostly falls short of expectations. Because channel sparsity does not match hardware acceleration techniques, the required weights have to be repeatedly indexed and copied into new contiguous memory before the matrix multiplication can be performed. For better acceleration, the convolution kernel must remain contiguous and relatively static during dynamic weight selection.
Based on the above analysis, the paper designs a structural router $\mathcal{A}(\theta)$ that is biased toward producing dense selection results. For a convolution kernel $W\in\mathbb{R}^{N\times M}$ with $N$ output and $M$ input channels, the router outputs a slimming ratio $\rho\in(0,1]$, and the slicing operation $[:]$ selects the first $\rho\times N$ filters of the kernel to form the sliced dynamic convolution:

The slicing operation $[:]$ followed by dense matrix multiplication is far more efficient than index operations or sparse matrix multiplication, which guarantees the actual runtime speed.
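As a concrete sketch of such a sliced dynamic convolution (module and variable names are mine, and the slimming ratio is passed in explicitly instead of coming from the router), assuming a PyTorch implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedDynamicConv(nn.Module):
    """Convolution whose output width is chosen at run time by slicing one contiguous kernel."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.out_ch = out_ch

    def forward(self, x, rho):
        n = max(1, int(rho * self.out_ch))       # keep the first rho * N filters
        w = self.weight[:n, :x.size(1)]          # slice outputs, and inputs if the previous stage was slimmed
        return F.conv2d(x, w, padding=1)

conv = SlicedDynamicConv(16, 64)
y = conv(torch.randn(2, 16, 32, 32), rho=0.5)    # -> shape (2, 32, 32, 32)
```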

  • Supernet

Stacking multiple such dynamic convolutions builds the supernet, which creates multiple sub-networks through different combinations of feature widths. When the structural router is disabled, the supernet is equivalent to an ordinary slimmable network and can be pre-trained with similar methods. A toy composition is sketched below.
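The following sketch reuses the SlicedDynamicConv sketch above (stage widths and ratios are placeholders of mine); fixing the ratios per forward pass is what reduces the supernet to a plain slimmable network.

```python
class TinySuperNet(nn.Module):
    """Toy supernet: each stage is a sliced dynamic conv; a list of ratios selects one sub-network."""
    def __init__(self, widths=(32, 64, 128)):
        super().__init__()
        chs = (3,) + tuple(widths)
        self.stages = nn.ModuleList(
            SlicedDynamicConv(chs[i], chs[i + 1]) for i in range(len(widths))
        )

    def forward(self, x, ratios):
        for stage, rho in zip(self.stages, ratios):
            x = F.relu(stage(x, rho))
        return x

net = TinySuperNet()
out = net(torch.randn(2, 3, 32, 32), ratios=[0.25, 0.5, 1.0])   # one sampled sub-network
```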

  • In-place Ensemble Bootstrapping

The classic Universally Slimmable Networks improve overall performance effectively through two techniques:

  • sandwich rule: each training step trains a combination of networks that includes the largest sub-network, the smallest sub-network, and several other sub-networks; the largest and smallest sub-networks determine the upper and lower bounds of the slimmable network's performance, respectively.
  • in-place distillation: the output logits of the largest sub-network serve as the training target of the other sub-networks, while the largest sub-network itself is trained against the dataset labels, which greatly helps the slimmable network converge.

Although in-place distillation is effective, drastic fluctuations of the largest sub-network's weights make training hard to converge. According to the BigNAS experiments, training relatively complex networks with in-place distillation is extremely unstable; without residual connections or special weight initialization, gradients may even explode in the early stage of training. To solve the convergence problem of slimmable networks and improve overall performance, the paper proposes the In-place Ensemble Bootstrapping (IEB) method.
First, following self-supervised and semi-supervised methods such as BYOL, which bootstrap from the model's past representations for in-place distillation, the exponential moving average (EMA) of the model is used as the target network that generates the target vectors. Define $\theta$ and $\theta'$ as the online network and the target network, with the target updated as:
$\theta'_t = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_t$

where $\alpha$ is the momentum factor controlling the proportion of historical parameters and $t$ is the training iteration. During training, the EMA of the model is more stable and accurate than the online network, providing high-quality training targets for the slimmed sub-networks.
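A minimal sketch of this EMA update for the target network, assuming PyTorch modules (the momentum value is illustrative):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_ema(online: nn.Module, target: nn.Module, alpha: float = 0.999):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t"""
    for p_tgt, p_onl in zip(target.parameters(), online.parameters()):
        p_tgt.mul_(alpha).add_(p_onl, alpha=1.0 - alpha)
    for b_tgt, b_onl in zip(target.buffers(), online.buffers()):
        b_tgt.copy_(b_onl)                       # e.g. BatchNorm running statistics
```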
Next, following MEAL V2, which uses an ensemble of teacher networks to generate more diverse output vectors for the student to learn from, different sub-networks are used to form an ensemble of teachers during in-place distillation, mainly providing target vectors for the smallest sub-network to learn from.

The overall training process is shown in Figure 4. Combining the sandwich rule with the improved in-place distillation described above, each training step involves the following three kinds of networks:

  • The largest sub-network $L$ uses the dataset labels as its training target.
  • $n$ sub-networks with random widths use the output logits of the target network's largest sub-network as their training target.
  • The smallest sub-network uses the ensemble of the output logits of the corresponding sub-networks of the target network as its training target, i.e. its target is:

In summary, the IEB loss for training the supernet is:
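As a rough sketch of how one IEB training step could assemble these three targets: label loss for the largest sub-network, distillation from the EMA target's largest sub-network for the random-width sub-networks, and distillation from the ensembled target outputs for the smallest sub-network. The distillation form (KL divergence), loss weighting, and helper names below are my assumptions, not the paper's exact formulation.

```python
import random
import torch
import torch.nn.functional as F

def ieb_step(online, target, images, labels, ratio_choices, num_stages, n_random=2):
    widest = [max(ratio_choices)] * num_stages
    narrowest = [min(ratio_choices)] * num_stages
    losses = []

    # 1) Largest sub-network: trained directly against the dataset labels.
    losses.append(F.cross_entropy(online(images, widest), labels))

    with torch.no_grad():
        teacher_probs = [F.softmax(target(images, widest), dim=1)]

    # 2) n random-width sub-networks: distilled from the target network's largest sub-network.
    for _ in range(n_random):
        ratios = [random.choice(ratio_choices) for _ in range(num_stages)]
        logits = online(images, ratios)
        losses.append(F.kl_div(F.log_softmax(logits, dim=1), teacher_probs[0], reduction="batchmean"))
        with torch.no_grad():
            teacher_probs.append(F.softmax(target(images, ratios), dim=1))

    # 3) Smallest sub-network: distilled from the ensemble of the target sub-network outputs above.
    ensemble = torch.stack(teacher_probs).mean(dim=0)
    logits_s = online(images, narrowest)
    losses.append(F.kl_div(F.log_softmax(logits_s, dim=1), ensemble, reduction="batchmean"))

    return sum(losses)
```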

Dynamic Slimming Gate

This section first introduces the structural router $\mathcal{A}(\theta, \mathcal{X})$ that outputs the $\rho$ factor in Equation 2 as well as the double-headed design of the dynamic gate, and then presents the sandwich gate sparsification (SGS) method used in Stage II training.

  • Double-headed Design

There are two ways to convert a feature map into a slimming ratio $\rho$: 1) scalar mode: directly output a scalar between 0 and 1 through a sigmoid as the slimming ratio; 2) one-hot mode: obtain a one-hot vector via argmax/softmax and use it to pick the corresponding slimming ratio from a discrete candidate vector $L_p$.
After comparing the two, the paper chooses the better-performing one-hot mode. To convert a feature map $\mathcal{X}$ into a one-hot vector, $\mathcal{A}(\theta, \mathcal{X})$ is decomposed into the composition of two functions:

where $\mathcal{E}$ downsamples the feature map into a vector and $\mathcal{F}$ converts the vector into a one-hot vector for the subsequent channel slicing. Following networks such as DenseNet, $\mathcal{E}$ is a global pooling layer, and $\mathcal{F}$ is a fully-connected layer $W_1\in\mathbb{R}^{d\times C_n}$ + ReLU + $W_2\in\mathbb{R}^{g\times d}$ + argmax, where $d$ is the intermediate feature dimension and $g$ is the length of $L_p$:

Taking the $n$-th gate in Figure 3 as an example, the feature map $\mathcal{X}$ of size $\rho_{n-1}C_n\times H_n\times W_n$ is converted into a vector $\mathcal{X}_{\mathcal{E}}\in\mathbb{R}^{\rho_{n-1}C_n}$, which is then turned into a one-hot vector by argmax; finally, the predicted slimming ratio is obtained as the dot product of the one-hot vector and $L_p$:

The slimming-ratio generation scheme adopted in the paper is very similar to channel attention. By adding a third fully-connected layer $W_3\in\mathbb{R}^{\rho_{n-1}C_n\times d}$, an attention mechanism can be introduced into the network directly. Based on the structure above, the paper proposes the double-headed dynamic gate, which consists of a hard channel slimming head for channel routing and a soft channel attention head for channel attention, where the soft channel attention head is defined as:

where $\delta(x)=1+\tanh(x)$; the channel attention head also takes part in the Stage I training.
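A sketch of this double-headed gate as just described: global pooling as $\mathcal{E}$, $W_1$ + ReLU as the shared trunk, $W_2$ + argmax + dot product with $L_p$ as the hard slimming head, and $W_3$ with $\delta(x)=1+\tanh(x)$ as the soft attention head. The candidate ratios, hidden width $d$, and class name are placeholders of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleHeadedGate(nn.Module):
    def __init__(self, in_ch, d=16, candidates=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.register_buffer("Lp", torch.tensor(candidates))   # discrete candidate slimming ratios
        self.fc1 = nn.Linear(in_ch, d)                          # W1
        self.fc2 = nn.Linear(d, len(candidates))                # W2: hard channel slimming head
        self.fc3 = nn.Linear(d, in_ch)                          # W3: soft channel attention head

    def forward(self, x):
        v = F.adaptive_avg_pool2d(x, 1).flatten(1)              # E: global pooling to a vector
        h = F.relu(self.fc1(v))
        # Hard head: argmax -> one-hot -> dot product with Lp gives the ratio for the next stage.
        onehot = F.one_hot(self.fc2(h).argmax(dim=1), self.Lp.numel()).float()
        rho = onehot @ self.Lp                                  # per-sample slimming ratio
        # Soft head: channel attention with delta(x) = 1 + tanh(x).
        attn = 1.0 + torch.tanh(self.fc3(h))
        return x * attn[:, :, None, None], rho
```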

  • Sandwich Gate Sparsification

In Stage II training, the paper uses a classification cross-entropy loss $L_{cls}$ and a complexity penalty $L_{cplx}$ to train the gates end-to-end, guiding each gate to choose the most efficient sub-network for every input image. To train the non-differentiable slimming head with $L_{cls}$, the paper tried the classic Gumbel-Softmax technique, but found in experiments that the gates easily collapse to a static choice, and even adding Gumbel noise could not fix the optimization.
To solve this convergence problem and increase the diversity of the gate outputs, the paper proposes the Sandwich Gate Sparsification (SGS) training method, which uses the largest and smallest sub-networks to identify hard and easy input images and to generate ground-truth slimming factors for the slimming head. Based on the trained supernet, inputs are roughly divided into three levels:

  • Easy samples $\mathcal{X}_{easy}$: inputs that can be recognized by the smallest sub-network.
  • Hard samples $\mathcal{X}_{hard}$: inputs that cannot be recognized even by the largest sub-network.
  • Dependent samples $\mathcal{X}_{dep}$: inputs that belong to neither of the above.

To minimize computation, easy samples should all be recognized with the smallest sub-network, i.e. the gate ground truth is $\mathcal{T}(\mathcal{X}_{easy})=[1,0,\cdots,0]$. Dependent and hard samples should instead be encouraged to use the largest sub-network as much as possible, i.e. the gate ground truth is $\mathcal{T}(\mathcal{X}_{hard})=\mathcal{T}(\mathcal{X}_{dep})=[0,0,\cdots,1]$. Based on these generated gate ground truths, the SGS loss is defined as:

where $\mathbb{T}_{sim}(\mathcal{X})\in\{0,1\}$ indicates whether $\mathcal{X}$ should be predicted by the smallest sub-network, and $\mathcal{L}_{CE}(\mathcal{X},\mathcal{T})=-\sum\mathcal{T}\log(\mathcal{X})$ is the cross-entropy loss between the gate output and the generated ground truth.
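A per-gate sketch of how these ground truths and the SGS term could be computed in a batch, following the "Try Best" strategy of pushing both hard and dependent samples toward the largest choice; the function and argument names are mine, and the weighting against $L_{cls}$ and $L_{cplx}$ is not reproduced.

```python
import torch
import torch.nn.functional as F

def sgs_loss(gate_logits, smallest_logits, labels):
    """gate_logits: (B, g) raw slimming-head outputs; smallest_logits: predictions of the smallest sub-net."""
    g = gate_logits.size(1)
    easy = smallest_logits.argmax(dim=1).eq(labels)             # T_sim: sample already solved by the smallest sub-net
    # Easy samples -> first choice [1,0,...,0]; hard and dependent samples -> last choice [0,...,0,1].
    gt = torch.where(easy, torch.zeros_like(labels), torch.full_like(labels, g - 1))
    return F.cross_entropy(gate_logits, gt)
```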

Experiment


ImageNet performance comparison with different types of networks.

 CIFAR-10 performance comparison.

 VOC detection performance comparison.

Ablation study of each component of the IEB training method.

Visualization comparing the SGS loss and the resulting slimming-ratio distribution.

Comparison of different SGS training strategies: Try Best is the strategy used in this paper, while Give Up abandons hard samples and assigns them the smallest sub-network as their gate target.

Comparison of different gate design details.

Conclusion


This paper proposes DS-Net, a dynamic network that adapts well to hardware acceleration and realizes dynamic routing through the proposed double-headed dynamic gate. Based on the high-performance network design and the IEB and SGS training strategies proposed in the paper, the performance of static SOTA networks can be matched with only 1/2-1/4 of the computation, together with an actual speedup of 1.62x.



If this article is helpful to you, please give it a like or a share~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes]


Origin juejin.im/post/7119700044361498631