Is Huawei making big moves again? New CV architecture — VanillaNet: the Power of Minimalism in Deep Learning (paper reading notes)


Preface

  At the VALSE conference two days ago, Huawei's paper took me by surprise. To be honest, the results are a bit explosive. Looking at the table on GitHub, the parameter count grows scarily as the number of layers increases, yet the inference speed is far higher than that of previous models.

[Figure omitted: comparison table from the official GitHub repository]

1. Abstract

  The Abstract itself contains little concrete detail. In essence, the VanillaNet proposed in this paper can take on ResNet with a left hook and Swin Transformer with a right hook. The main idea is to avoid extreme depth, shortcuts, and self-attention, and there are no complex activation functions either.

2. Introduction

  The first paragraph briefly covers the development and role of AI. The second paragraph starts with AlexNet and follows up with ResNet, showing that model designs have become more and more complex while performance keeps improving. The third paragraph is Transformer's turn, again emphasizing model depth. The fourth paragraph is the transition, pointing out that deeper and more complex model structures are difficult to deploy. The fifth paragraph lays out the problem: plain networks suffer from vanishing gradients, and since deep networks far outperform the earlier AlexNet and VGG, few people pay attention to designing simple model structures anymore.
  The sixth paragraph gets to the point: this paper proposes VanillaNet, whose structure is simple because extreme depth, shortcut branches, and self-attention operations are all removed. Accordingly, a training strategy is proposed that gradually eliminates the nonlinear layers while maintaining inference speed, and a series-informed activation function is proposed to strengthen the network's nonlinearity, which proves far more effective than the alternatives. And one last thing: VanillaNet is this powerful, so come follow our work.

3. A Vanilla Neural Architecture

  Most SOTA classification models consist of three parts: a stem block that converts the 3-channel input image into many channels while downsampling, a main body that learns useful information, and a fully connected layer that produces the classification output. The body has four stages, each built by stacking multiple identical blocks; after each stage the number of feature channels increases while the width and height decrease.
  The next paragraph complains that ResNet and ViT are simply too deep, and that ViT additionally requires multiple self-attention layers.
  With the current development of AI chips, FLOPs and parameter counts, which used to be the constraints, are no longer the bottleneck, since Jensen Huang's NVIDIA GPUs really have advanced. Complex model designs and deeper stacks of blocks have therefore become the main factors limiting speed. Hence this paper proposes VanillaNet, as shown in the figure below:
[Figure omitted: VanillaNet architecture]
VanillaNet still follows this three-part design; the difference lies in depth — each stage is built from only a single layer.

  Take the 6-layer VanillaNet as an example. Stem: a $4\times4\times3\times C$ convolutional layer with stride 4 maps the 3-channel input image to $C$ channels. In stages 1, 2, and 3, a max-pooling layer with stride 2 shrinks the feature map while the number of channels is doubled. In stage 4 the number of channels is not increased, since it is followed by average pooling. The final fully connected layer outputs the classification result. The kernel size of each convolutional layer (apart from the stem) is $1\times1$, and each convolution is followed by an activation layer and batch normalization. That is the entire structure: no shortcuts and no extra blocks.
  Since VanillaNet is simple and shallow, its limited nonlinearity weakens performance, so a series of techniques is proposed to enhance it.
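  To make the layout concrete, here is a minimal PyTorch sketch of the 6-layer design described above (stem + four single-layer stages + classifier head). The base width `C`, the plain ReLU standing in for the series activation, and the module names are my own illustrative choices, not the official implementation.

```python
import torch
import torch.nn as nn

def conv_act_bn(c_in, c_out, k, stride=1):
    # convolution followed by activation and batch normalization,
    # matching the "conv -> activation -> BN" order described above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=stride),
        nn.ReLU(),  # placeholder for the series-informed activation
        nn.BatchNorm2d(c_out),
    )

class VanillaNet6Sketch(nn.Module):
    """Illustrative 6-layer layout: stem + 4 single-layer stages + FC head."""
    def __init__(self, num_classes=1000, C=512):
        super().__init__()
        self.stem = conv_act_bn(3, C, k=4, stride=4)     # 4x4 conv, stride 4
        self.stage1 = conv_act_bn(C, 2 * C, k=1)         # channels x2
        self.stage2 = conv_act_bn(2 * C, 4 * C, k=1)     # channels x2
        self.stage3 = conv_act_bn(4 * C, 8 * C, k=1)     # channels x2
        self.stage4 = conv_act_bn(8 * C, 8 * C, k=1)     # channels kept
        self.pool = nn.MaxPool2d(2)                      # stride-2 max pooling
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(8 * C, num_classes))

    def forward(self, x):
        x = self.stem(x)
        x = self.pool(self.stage1(x))   # stages 1-3: halve the spatial size
        x = self.pool(self.stage2(x))
        x = self.pool(self.stage3(x))
        x = self.stage4(x)              # stage 4: no channel increase
        return self.head(x)

# y = VanillaNet6Sketch()(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```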

4. Training VanillaNet

4.1 Deep training strategy

  The strategy is to start training with two convolutional layers plus an activation layer in between, rather than a single convolutional layer, and to gradually weaken the activation function as the training epochs progress. At the end of training, the two convolutional layers can be fused into one to reduce inference time.
  For an activation function $A(x)$ such as ReLU or Tanh, combine it with an identity mapping:
$$A'(x) = (1-\lambda)A(x) + \lambda x$$
where $\lambda$ is a hyperparameter balancing the nonlinearity of the modified activation $A'(x)$. Let $e$ and $E$ denote the current epoch and the total number of epochs respectively, and set $\lambda = \frac{e}{E}$. At the beginning of training, $e = 0$ and $A'(x) = A(x)$, so the model has strong nonlinearity. As training converges, we end up with $A'(x) = x$, meaning there is no longer an activation function between the two convolutional layers.
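  The epoch-dependent schedule is easy to sketch in PyTorch. The module below is a minimal illustration of $A'(x)=(1-\lambda)A(x)+\lambda x$ with $\lambda = e/E$ updated once per epoch; the name `DecayingActivation` and the update hook are my own, not taken from the official code.

```python
import torch
import torch.nn as nn

class DecayingActivation(nn.Module):
    """A'(x) = (1 - lambda) * A(x) + lambda * x, with lambda ramping from 0 to 1."""
    def __init__(self, act: nn.Module = None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU()
        self.lam = 0.0  # lambda = e / E, updated externally once per epoch

    def set_epoch(self, epoch: int, total_epochs: int):
        self.lam = epoch / total_epochs

    def forward(self, x):
        # At epoch 0 this is the plain nonlinearity; at the end it is the identity,
        # so the two surrounding convolutions can later be fused into one.
        return (1.0 - self.lam) * self.act(x) + self.lam * x

# Typical use inside a training loop (E = total number of epochs):
# for e in range(E):
#     for m in model.modules():
#         if isinstance(m, DecayingActivation):
#             m.set_epoch(e, E)
#     ...train one epoch...
```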

  Next, each batch normalization layer and the convolution preceding it are fused into a single convolution operation.

Let $W\in\mathbb{R}^{C_{out}\times(C_{in}\times k\times k)}$ and $B\in\mathbb{R}^{C_{out}}$ be the weight and bias of the convolutional layer, with $C_{in}$ input channels, $C_{out}$ output channels, and kernel size $k$. The scale, shift, mean, and standard deviation of batch normalization are denoted $\gamma, \beta, \mu, \sigma \in \mathbb{R}^{C_{out}}$ respectively. The fused weights and biases are then:
$$W_i' = \frac{\gamma_i}{\sigma_i} W_i, \qquad B_i' = \frac{(B_i - \mu_i)\gamma_i}{\sigma_i} + \beta_i$$
where the subscript $i \in \{1, 2, \ldots, C_{out}\}$ denotes the $i$-th output channel.
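The fusion follows directly from the formula above. Below is a small sketch that folds inference-time BN statistics (running mean and variance) into the convolution's weight and bias; the helper name `fuse_conv_bn` is mine, not from the official repository.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm(gamma, beta, mu, sigma) into the preceding convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    sigma = torch.sqrt(bn.running_var + bn.eps)                   # per-channel std
    scale = bn.weight / sigma                                     # gamma_i / sigma_i
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))  # W' = (gamma/sigma) W
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)     # B' = (B-mu)gamma/sigma + beta
    return fused

# Sanity check: the fused conv should match conv -> bn in eval mode.
# conv, bn = nn.Conv2d(8, 16, 1), nn.BatchNorm2d(16)
# bn.eval(); x = torch.randn(1, 8, 14, 14)
# assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-4)
```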

  Next, merge the two $1\times1$ convolutions. Denote the input and output features as $x \in \mathbb{R}^{C_{in}\times H\times W}$ and $y \in \mathbb{R}^{C_{out}\times H'\times W'}$, so the convolution can be written as:
$$y = W * x = W \cdot \mathrm{im2col}(x) = W \cdot X$$
where $*$ denotes convolution, $\cdot$ denotes matrix multiplication, and $X \in \mathbb{R}^{(C_{in}\times1\times1)\times(H'\times W')}$ is produced by the $\mathrm{im2col}$ operation, which rearranges the input into a matrix matching the convolution kernel. For a $1\times1$ convolution, there is no need to slide the kernel over overlapping patches (there are none). Denoting the weight matrices of the two convolutional layers as $W^1$ and $W^2$, the two convolutions with no activation function in between can be expressed as:
$$y = W^1 * (W^2 * x) = W^1 \cdot W^2 \cdot \mathrm{im2col}(x) = (W^1 \cdot W^2) * X$$
Thus the two $1\times1$ convolutions are fused smoothly without hurting inference speed.
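Since a $1\times1$ convolution is just a per-pixel matrix multiplication, the merge amounts to multiplying the two weight matrices. A minimal sketch (ignoring biases for brevity; the function name is mine):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_1x1_convs(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    """Merge conv2(conv1(x)) into a single 1x1 convolution (no bias, no activation)."""
    # Flatten the 1x1 kernels into plain matrices of shape (C_out, C_in)
    w1 = conv1.weight.flatten(1)   # first conv, applied to x
    w2 = conv2.weight.flatten(1)   # second conv, applied to the result
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels, kernel_size=1, bias=False)
    merged.weight.copy_((w2 @ w1).reshape(conv2.out_channels, conv1.in_channels, 1, 1))
    return merged

# x = torch.randn(1, 64, 14, 14)
# c1 = nn.Conv2d(64, 128, 1, bias=False); c2 = nn.Conv2d(128, 256, 1, bias=False)
# assert torch.allclose(c2(c1(x)), merge_1x1_convs(c1, c2)(x), atol=1e-4)
```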

4.2 Series Informed Activation Function

  Current mainstream activation functions include the Rectified Linear Unit (ReLU) and its variants PReLU, GeLU, and Swish. These have mostly been designed with deep networks in mind; how to lift the limited nonlinearity of simple, shallow networks has not been studied systematically.
  There are two ways to increase a network's nonlinearity: stacking nonlinear activation layers, or increasing the nonlinearity of each individual activation layer. Most mainstream networks choose the former, which leaves the latter, more parallel-computation-friendly direction largely untapped.
  A direct idea is therefore to improve the nonlinear capacity of the activation layers themselves. Stacking activation layers serially is the key to deep networks; in contrast, stacking activation layers concurrently (in parallel) is the approach taken here. Denote a single activation function as $A(x)$, where $x$ is the input; it can be ReLU or Tanh. The concurrent stacking of $A(x)$ can be written as:
$$A_s(x) = \sum_{i=1}^{n} a_i A(x + b_i)$$
where $n$ is the number of stacked activation functions and $a_i$, $b_i$ are the scale and bias of each activation. Concurrent stacking greatly enhances the nonlinearity of the activation function.
  To further enhance the approximation ability of the series, the activation is allowed to aggregate over its neighbors so as to learn global information, similar to BNET. Specifically, given an input feature $x \in \mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, the activation function takes the form:
$$A_s(x_{h,w,c}) = \sum_{i,j\in\{-n,n\}} a_{i,j,c}\, A(x_{i+h,\,j+w,\,c} + b_c)$$
where $h \in \{1,2,\ldots,H\}$, $w \in \{1,2,\ldots,W\}$, and $c \in \{1,2,\ldots,C\}$. When $n = 0$, $A_s(x) = A(x)$. This paper uses ReLU as the base activation function to build the series.
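  The neighborhood sum can be implemented as a depthwise convolution applied to the activated feature map, since each output channel only mixes spatially shifted copies of itself with per-channel weights $a_{i,j,c}$. Below is a minimal sketch along those lines; the class name, the neighborhood radius, and the parameter initialization are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class SeriesReLU(nn.Module):
    """A_s(x)_{h,w,c} = sum_{i,j} a_{i,j,c} * ReLU(x_{h+i, w+j, c} + b_c)."""
    def __init__(self, channels: int, n: int = 3):
        super().__init__()
        k = 2 * n + 1  # neighborhood of radius n in each spatial direction
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))  # per-channel b_c
        # Depthwise conv holds one (2n+1) x (2n+1) weight map a_{i,j,c} per channel c
        self.mix = nn.Conv2d(channels, channels, kernel_size=k,
                             padding=n, groups=channels, bias=False)

    def forward(self, x):
        # activate first, then take the weighted sum over the spatial neighborhood
        return self.mix(torch.relu(x + self.bias))

# act = SeriesReLU(channels=256, n=3)
# y = act(torch.randn(1, 256, 28, 28))  # output has the same shape as the input
```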
  Next, analyze its computational complexity. For a convolutional layer with kernel size $k$, $C_{in}$ input channels, and $C_{out}$ output channels, the computational cost is:
$$\mathcal{O}(\mathrm{CONV}) = H \times W \times C_{in} \times C_{out} \times k^2$$
The cost of the series activation layer is:
$$\mathcal{O}(\mathrm{SA}) = H \times W \times C_{in} \times n^2$$
Therefore:
$$\frac{\mathcal{O}(\mathrm{CONV})}{\mathcal{O}(\mathrm{SA})} = \frac{H \times W \times C_{in} \times C_{out} \times k^2}{H \times W \times C_{in} \times n^2} = \frac{C_{out} \times k^2}{n^2}$$
Taking the 4th stage of VanillaNet-B as an example, where $C_{out} = 2048$, $k = 1$, and $n = 7$, the ratio is 84. The computational cost of the proposed activation layer is therefore much lower than that of the convolutional layer.

5. Experiment

  The experiments are conducted on the ImageNet dataset.

5.1 Ablation experiment

The influence of the number of series in the activation function

[Figure omitted]

The impact of training techniques

[Figure omitted]

The impact of shortcut branches

[Figure omitted]

5.2 Visualization of attention

[Figure omitted]

5.3 Comparison with SOTA architectures

[Figure omitted]

5.4 Experiments on COCO dataset

[Figure omitted]

6. Conclusion

  This paper studies the feasibility of simple, shallow neural networks, and proposes a deep training strategy and a series-informed activation function for training VanillaNets that enhance the model's nonlinearity. The experimental results show that VanillaNets are very effective, and I hope everyone gives them a try.

Appendix A: Network structure

[Figure omitted]
Each convolutional layer is followed by an activation layer. For VanillaNet-13-1.5×, the number of channels is multiplied by 1.5. For VanillaNet-13-1.5׆, adaptive pooling is additionally used in stages 2, 3, and 4, with feature-map sizes of $40\times40$, $20\times20$, and $10\times10$ respectively.

Appendix B: Training details

[Figure omitted]
Afterword

  This Huawei paper is short and to the point, and it really shows solid fundamentals. In this era of "deep" learning, daring to challenge it with shallow networks is truly impressive.

Source: blog.csdn.net/qq_38929105/article/details/131245415