VanillaNet principle and code interpretation

Paper: VanillaNet: the Power of Minimalism in Deep Learning

Official implementation: GitHub - huawei-noah/VanillaNet

Existing problems

While complex networks perform well, their increasing complexity poses deployment challenges. For example, the shortcut operation in ResNets consumes significant off-chip memory traffic when merging features from different layers. As another example, the axial shift operation in AS-MLP and the shifted-window self-attention in Swin Transformer require complex engineering, including rewriting CUDA code.

The innovation of this article

This paper proposes VanillaNet, a new neural network architecture with a simple and elegant design that still achieves remarkable performance on vision tasks. VanillaNet avoids the complexity problem by discarding excessive depth and complex operations such as shortcuts and self-attention, which makes it well suited to environments with limited resources.

Method introduction

A Vanilla Neural Architecture

The architecture of most SOTA classification networks consists of three parts: a stem block converts the input image from 3 channels to multi-channel and downsamples, a main body extracts features, and a fully connected layer is used to output classification results. The main body usually contains 4 stages, each stage stacks multiple identical blocks, and after each stage, the resolution of the feature map decreases and the number of channels increases. The difference between different networks is mainly in the design of blocks.

The VanillaNet proposed in this paper also follows this popular design architecture. The difference is that each stage contains only one network layer to build an extremely concise network.

The structure of VanillaNet-6 is shown in Figure 1. The stem is a 4x4x3xC convolutional layer with stride 4, which maps the 3-channel input image to a C-channel feature map. In stages 1, 2, and 3, a stride-2 max pooling layer is used for downsampling and the number of channels is doubled. In stage 4 the number of channels is kept constant, because it is followed by an average pooling layer. The final fully connected layer outputs the classification result. To keep the amount of computation as small as possible, all convolutional layers in the body are 1x1, and each convolutional layer is followed by a BN layer and an activation function.
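
To make the layout concrete, here is a minimal sketch of the VanillaNet-6 structure described above. This is my own simplified illustration, not the official models/vanillanet.py (which additionally contains the deep-training pairs and the series activation); the channel width C and the exact ordering of conv and pooling are placeholders.

import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel_size, stride=1):
    # every conv is followed by BN and an activation, as described above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class VanillaNet6Sketch(nn.Module):
    def __init__(self, num_classes=1000, C=512):
        super().__init__()
        self.stem = conv_bn_act(3, C, kernel_size=4, stride=4)   # 4x4, stride 4: 3 -> C channels
        self.stage1 = conv_bn_act(C, 2 * C, kernel_size=1)       # body layers are all 1x1
        self.stage2 = conv_bn_act(2 * C, 4 * C, kernel_size=1)
        self.stage3 = conv_bn_act(4 * C, 8 * C, kernel_size=1)
        self.stage4 = conv_bn_act(8 * C, 8 * C, kernel_size=1)   # channels kept constant
        self.down = nn.MaxPool2d(kernel_size=2, stride=2)        # stride-2 downsampling
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(8 * C, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.down(self.stage1(x))   # stages 1-3: one 1x1 conv layer, then /2 resolution
        x = self.down(self.stage2(x))
        x = self.down(self.stage3(x))
        x = self.stage4(x)              # stage 4: no downsampling here, avgpool follows
        x = self.avgpool(x).flatten(1)
        return self.head(x)

# e.g. VanillaNet6Sketch()(torch.randn(1, 3, 224, 224)).shape == (1, 1000)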

Although VanillaNet has a simple structure and few layers, its weak nonlinearity limits its performance, so the authors propose a series of techniques to address this problem.

Training of Vanilla Networks

Deep Training Strategy

The core idea of the deep training strategy is to train two convolutional layers and one activation function in the early stage of training, instead of training only one convolutional layer. As training progresses, the activation function gradually evolves into an identity mapping. At the end of training, the two convolutional layers can be merged into one through structural reparameterization, reducing inference time.

For an activation function \(A(x)\) (such as the common ReLU or Tanh), it is combined with an identity mapping as follows:

\(A'(x)=(1-\lambda)A(x)+\lambda x \qquad (1)\)

Here \(\lambda\) is a hyperparameter that balances the nonlinearity of the modified activation function \(A'(x)\). Let \(e\) be the current epoch and \(E\) the total number of training epochs; we set \(\lambda=\frac{e}{E}\). At the beginning of training (\(e=0\)), \(A'(x)=A(x)\), which means the network has strong nonlinearity. When training has converged, \(A'(x)=x\), which means there is no activation function between the two convolutional layers, and they can be merged into one convolution through structural reparameterization.
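
A small sketch of what this means in practice (my own illustration, not code from the repository): the blended activation of formula (1), and the fact that once \(\lambda=1\) the two adjacent 1x1 convolutions can be merged exactly into a single one.

import torch
import torch.nn.functional as F

def blended_act(x, lam, act=F.relu):
    # formula (1): A'(x) = (1 - lambda) * A(x) + lambda * x
    return (1 - lam) * act(x) + lam * x

x = torch.randn(1, 8, 14, 14)
assert torch.allclose(blended_act(x, 0.0), F.relu(x))  # start of training: plain activation
assert torch.allclose(blended_act(x, 1.0), x)          # end of training: identity

# With an identity "activation" between them, two 1x1 convs collapse into one:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
w1, b1 = torch.randn(16, 8, 1, 1), torch.randn(16)
w2, b2 = torch.randn(4, 16, 1, 1), torch.randn(4)
w_merged = (w2.flatten(1) @ w1.flatten(1)).view(4, 8, 1, 1)
b_merged = w2.flatten(1) @ b1 + b2

y_two = F.conv2d(F.conv2d(x, w1, b1), w2, b2)
y_one = F.conv2d(x, w_merged, b_merged)
print(torch.allclose(y_two, y_one, atol=1e-4))  # True, up to floating point error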

Series Informed Activation Function

The poor performance of simple, shallow networks is mainly due to their weak nonlinearity. There are two ways to improve a network's nonlinearity: stacking nonlinear activation layers, or increasing the nonlinearity of each activation layer. Most networks choose the former; this paper chooses the latter, which is also realized through stacking. (The paper describes the former as serial stacking and the latter as concurrent stacking. My understanding is that both are forms of stacking, but the former usually stacks convolutional layers together with activation functions to make the network deeper and deeper, while this paper stacks only the activation functions.)

Specifically, the activation function is improved by weighted stacking:

\(A_s(x)=\sum_{i=1}^{n} a_i A(x+b_i) \qquad (5)\)

where \(n\) is the number of stacked activation functions and \(a_i, b_i\) are the scale and bias of each activation function. With this stacking, the nonlinear capability of the activation function can be greatly improved.

Equation (5) can be regarded as a series (a sum of terms) in mathematics. To further improve the approximation ability of the series, the authors let the series-based function learn information from its neighborhood by varying the inputs, similar to BNET. Specifically, for an input feature \(x\in\mathbb{R}^{H\times W\times C}\), the activation function can be expressed as

\(A_s(x_{h,w,c})=\sum_{i,j\in\{-n,\dots,n\}} a_{i,j,c}\,A(x_{i+h,j+w,c}+b_c) \qquad (6)\)

where \(h\in\left\{1,2,\dots,H\right\}\), \(w\in\left\{1,2,\dots,W\right\}\), \(c\in\left\{1,2,\dots,C\right\}\).
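
Because the weights \(a_{i,j,c}\) are shared within each channel and run over a \((2n+1)\times(2n+1)\) neighborhood, formula (6) is exactly a depthwise convolution with kernel size \(2n+1\) applied to the activated feature map (ignoring the bias terms for brevity). A small sketch of this equivalence, assuming \(A=\mathrm{ReLU}\):

import torch
import torch.nn.functional as F

# Eq. (6) without biases: A_s(x)[h, w, c] = sum_{i,j in [-n, n]} a[i, j, c] * ReLU(x)[h+i, w+j, c]
n, C, H, W = 3, 4, 8, 8
x = torch.randn(1, C, H, W)
a = torch.randn(C, 1, 2 * n + 1, 2 * n + 1)    # one (2n+1)x(2n+1) kernel per channel

# depthwise conv on the activated input; padding=n keeps the spatial size
y_conv = F.conv2d(F.relu(x), a, padding=n, groups=C)

# the same thing written as the explicit double sum of Eq. (6)
xp = F.pad(F.relu(x), (n, n, n, n))            # zero-pad so neighbors exist at the borders
y_sum = torch.zeros_like(y_conv)
for i in range(-n, n + 1):
    for j in range(-n, n + 1):
        shift = xp[:, :, n + i : n + i + H, n + j : n + j + W]
        y_sum += a[:, 0, i + n, j + n].view(1, C, 1, 1) * shift
print(torch.allclose(y_conv, y_sum, atol=1e-5))  # True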

Experimental results

The specific structure of VanillaNet is shown in Table 6

On the ImageNet dataset, the comparison with some other SOTA models is shown in Table 4

It can be seen that VanillaNet achieves 80.57% top-1 accuracy with only 10 layers. At each depth, it shows a significant speed advantage over other models of comparable accuracy.

Code interpretation

First, the deep training strategy. In models/vanillanet.py, class VanillaNet() contains the concrete implementation of the network. The flag self.deploy indicates whether the model is in the inference phase; self.deploy=False means the training phase. During training, the stem contains self.stem1 and self.stem2, each block in the main body contains self.conv1 and self.conv2, and the final fully connected layer likewise consists of self.cls1 and self.cls2. Once training is complete and the model enters the inference stage, the activation function between each such 1/2 pair has become an identity mapping (i.e. there is effectively no activation function between operations 1 and 2), so the two operations can be merged into one.
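
As a schematic of how one such pair behaves in the two phases (a simplified sketch of my own, not the official Block class): during training the pair is conv1, blended activation, conv2; at deploy time only a single merged convolution remains.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairSketch(nn.Module):
    # schematic of one "1/2" pair (stem1/stem2, conv1/conv2 or cls1/cls2); names are illustrative
    def __init__(self, in_ch, out_ch, deploy=False):
        super().__init__()
        self.deploy = deploy
        self.act_learn = 0.0                            # lambda of formula (1), updated during training
        if deploy:
            self.conv = nn.Conv2d(in_ch, out_ch, 1)     # the single conv left after merging
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 1)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, x):
        if self.deploy:
            return self.conv(x)
        x = self.conv1(x)
        # formula (1) with A = ReLU; this equals F.leaky_relu(x, negative_slope=act_learn)
        x = (1 - self.act_learn) * F.relu(x) + self.act_learn * x
        return self.conv2(x)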

Here, self.act_learn is the \(\lambda\) of formula (1). In main.py, act_learn is updated as training progresses, increasing with the epoch number until it reaches 1 after args.decay_epochs epochs:

act_learn = epoch / args.decay_epochs * 1.0
model.module.change_act(act_learn)
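
change_act itself just pushes the new \(\lambda\) into the network. A plausible sketch of what it amounts to (the exact attribute names in models/vanillanet.py may differ):

# illustrative only: propagate the new lambda to the stem and to every block
def change_act(self, m):
    self.act_learn = m            # used between stem1 and stem2
    for block in self.stages:
        block.act_learn = m       # used between conv1 and conv2 of each block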

Next is the stacking of activation functions. The author evolves the simple weighted stacking of activation functions, formula (5), into formula (6), which learns from adjacent inputs and can be implemented as a depthwise convolution; the number of stacked terms is the hyperparameter \(n=3\), which is self.act_num in the code.

import torch
import torch.nn as nn
from timm.layers import weight_init  # provides trunc_normal_; older timm versions use timm.models.layers

# Series informed activation function. Implemented by conv.
class activation(nn.ReLU):
    def __init__(self, dim, act_num=3, deploy=False):
        super(activation, self).__init__()
        self.act_num = act_num
        self.deploy = deploy
        self.dim = dim
        self.weight = torch.nn.Parameter(torch.randn(dim, 1, act_num*2 + 1, act_num*2 + 1))
        if deploy:
            self.bias = torch.nn.Parameter(torch.zeros(dim))
        else:
            self.bias = None
            self.bn = nn.BatchNorm2d(dim, eps=1e-6)
        weight_init.trunc_normal_(self.weight, std=.02)

    def forward(self, x):
        if self.deploy:
            # inference: ReLU, then depthwise conv with the BN already fused into weight/bias
            return torch.nn.functional.conv2d(
                super(activation, self).forward(x),
                self.weight, self.bias, padding=self.act_num, groups=self.dim)
        else:
            # training: ReLU, then depthwise conv (kernel size 2*act_num+1), then BN
            return self.bn(torch.nn.functional.conv2d(
                super(activation, self).forward(x),
                self.weight, padding=self.act_num, groups=self.dim))

    def _fuse_bn_tensor(self, weight, bn):
        # fold the BatchNorm (gamma, beta, running stats) into the conv weight and bias
        kernel = weight
        running_mean = bn.running_mean
        running_var = bn.running_var
        gamma = bn.weight
        beta = bn.bias
        eps = bn.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta + (0 - running_mean) * gamma / std

    def switch_to_deploy(self):
        # called once after training: fuse BN into the depthwise conv, then drop BN
        kernel, bias = self._fuse_bn_tensor(self.weight, self.bn)
        self.weight.data = kernel
        self.bias = torch.nn.Parameter(torch.zeros(self.dim))
        self.bias.data = bias
        self.__delattr__('bn')
        self.deploy = True
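
To check that the fusion in switch_to_deploy is lossless, one can compare outputs before and after switching (a small usage sketch, assuming the imports and the activation class above; in eval mode BN uses its running statistics, which is exactly what the fusion folds into the weights):

# sanity check: fusing BN into the depthwise conv does not change the output
act = activation(dim=64, act_num=3)
act.eval()
x = torch.randn(1, 64, 14, 14)
with torch.no_grad():
    y_before = act(x)
    act.switch_to_deploy()
    y_after = act(x)
print(torch.allclose(y_before, y_after, atol=1e-5))  # True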

Doubts

In the authors' official interpretation, the end of the road for convolution is not the Transformer, and the potential of the minimalist architecture is unlimited. There are also comments on Zhihu saying that the original idea of formula (5), adding nonlinearity by stacking activation functions, is very good, but after it evolves into formula (6) it turns back into a convolution: the weights \(a_{i,j,c}\) of the weighted sum of activations are exactly the weights of a convolution kernel. The official implementation also realizes the series informed activation through a depthwise convolution, so a convolutional layer has in effect been moved into the activation function. Can the number of layers really be said to have been reduced?

Origin blog.csdn.net/ooooocj/article/details/131364777