VoVNet paper study

An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

When implementing CNN models, both accuracy and speed need to be considered. The common ResNet and DenseNet families focus on accuracy, while the MobileNet and ShuffleNet families prioritize efficiency. Is there a model that offers both accuracy and efficiency? The VoVNet we look at today is such a model. Let's learn it together.

1. Model evaluation indicators

Before introducing the VoVNet network, let us first go over the accuracy and speed metrics used to evaluate models and what they mean (taking object detection as an example), so that we can compare the performance of multiple models later.

Accuracy indicators

  • mAP
    mAP stands for mean Average Precision, where the Average Precision is computed over different recall levels. A detailed introduction is omitted here; see the references at the end of this post.
    The larger the mAP, the higher the model's detection accuracy.

Speed indicators

  • FPS
    Frames Per Second, the number of image frames processed per second. The larger the FPS, the faster the model's inference.
  • Computation Efficiency (GFLOPs/s)
    Computation efficiency, i.e., how many floating-point operations are performed per second. This indicator reflects how much of the GPU's computing potential a model actually exploits. Some models are well designed and use the GPU fully (VoVNet keeps the GPU computing at about 400 GFLOPs/s), while others are poorly designed and cannot (MobileNet only reaches about 37 GFLOPs/s). The strength of the GPU lies in parallel computation: when the tensors involved are large, its computing power is fully utilized, but if a large convolution is split into several small convolutions, the result is the same while the GPU computes inefficiently. Compared with total FLOPs, we should therefore pay more attention to FLOPs per second, i.e., total FLOPs divided by total GPU inference time; the higher this value, the higher the GPU utilization.
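As a quick illustration, computation efficiency is just total FLOPs over measured inference time. The numbers below are illustrative examples, not measurements from the paper:

```python
def flops_per_second(total_gflops: float, inference_time_s: float) -> float:
    """Computation efficiency: GFLOPs of one forward pass divided by GPU time."""
    return total_gflops / inference_time_s

# A model needing 20 GFLOPs per image that runs in 0.05 s
# uses the GPU at 400 GFLOPs/s (illustrative values).
print(flops_per_second(20.0, 0.05))  # 400.0
```

Two models with identical total FLOPs can thus differ greatly in this metric if one runs much longer on the GPU.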

Memory Access Cost (MAC)

Memory access cost. For CNNs, memory access increases energy consumption more than computation does. If the feature maps in the network are large, the memory access cost increases even under the same model size, so the MAC of each CNN layer must be fully considered. The ShuffleNetV2 paper gives the formula for the MAC of a convolutional layer:
MAC = hw(c_i + c_o) + k^2 c_i c_o

Here k, h, w, c_i, c_o are the convolution kernel size, the feature map height and width, and the numbers of input and output channels, respectively. The computation of the convolutional layer is B = k^2 hw c_i c_o. If B is fixed, then:

MAC \geq 2\sqrt{\frac{hwB}{k^2}} + \frac{B}{hw}

By the AM–GM inequality, MAC attains this minimum when the numbers of input and output channels are equal, so that design is the most efficient. Two further design guidelines follow:

  • Prefer 3x3 convolutions over 1x1 convolutions where possible: the computation efficiency of 1x1 convolution is low, and since 1x1 convolution is generally used to change the number of channels, it increases the memory access cost;
  • Prefer computing on large feature maps over small ones, because large feature maps use the GPU more efficiently.
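The channel-balance rule above can be checked numerically. The sketch below evaluates the ShuffleNetV2 MAC formula for three channel splits that all have the same computation B (same product c_i·c_o); the sizes are illustrative:

```python
def conv_mac(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """MAC = hw(c_in + c_out) + k^2 * c_in * c_out  (ShuffleNetV2 formula)."""
    return h * w * (c_in + c_out) + k * k * c_in * c_out

h, w, k = 28, 28, 3
# Three splits with the same c_in * c_out = 16384, hence the same B:
for c_in, c_out in [(256, 64), (128, 128), (64, 256)]:
    print(c_in, c_out, conv_mac(h, w, k, c_in, c_out))
# The balanced split (128, 128) yields the smallest MAC.
```

With fixed B, only the hw(c_in + c_out) term varies, and c_in + c_out is minimized when the two are equal, matching the inequality above.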

Memory metrics

  • FLOPs (G)
    Floating point operations, i.e., the number of floating-point operations, understood as the amount of computation; it can be used to measure the complexity of an algorithm/model. For the calculation process, see the references at the end of this post.
  • Param (M)
    The number of parameters of the model.
  • Memory footprint (MB)
    The memory required. Training the model involves matrix computations, which consume memory.
  • Energy Efficiency (J/img)
    Energy efficiency, reflecting the energy consumed to recognize one image.
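For the Param metric, the parameter count of a single convolutional layer can be computed directly; the layer sizes below are illustrative, not taken from the paper:

```python
def conv_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameters of a conv layer: k*k*c_in weights per output channel,
    plus one bias per output channel if biased."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# e.g. a 3x3 convolution from 64 to 128 channels:
print(conv_params(3, 64, 128))        # 73856
print(conv_params(1, 256, 64, bias=False))  # 16384
```

Summing this over all layers gives the model's Param (M) figure.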

2. Introduction to the DenseNet model

The VoVNet network is essentially an improvement on DenseNet: it surpasses DenseNet in accuracy, speeds up inference, and uses the GPU efficiently. So before introducing VoVNet, we first look at the DenseNet network structure.
To pursue higher accuracy, researchers keep increasing network depth, e.g., ResNet-18, 50, 101, even up to 1000 layers. As the network deepens, the gradient may vanish when it is back-propagated to the input through such a long path. Is there a network that can go deeper without the gradient vanishing? DenseNet proposes a simple way to solve this by reusing features directly: it establishes dense connections between the features of all previous layers and those of each subsequent layer. The following figure shows part of the network structure:
[Figure: DenseNet dense-connection structure]
The following figure is the detailed structure diagram of the DenseNet network:
[Figure: DenseNet detailed network structure]
Looking at the DenseNet structure, these dense connections allow each layer to draw on features of multiple scales, which is why DenseNet achieves high accuracy; the dense connections also facilitate gradient backpropagation.
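A minimal PyTorch-style sketch of the dense connectivity described above (the growth rate, bottleneck width, and input size are illustrative, not DenseNet's exact configuration):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet layer: 1x1 bottleneck to compress the accumulated
    channels, then a 3x3 conv producing `growth` new channels."""
    def __init__(self, c_in: int, growth: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, 4 * growth, 1, bias=False)
        self.conv2 = nn.Conv2d(4 * growth, growth, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.conv2(self.relu(self.conv1(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of ALL previous feature maps,
    so its input width grows by `growth` channels per layer."""
    def __init__(self, c_in: int, growth: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(c_in + i * growth, growth) for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

x = torch.randn(1, 64, 32, 32)
block = DenseBlock(64, growth=32, n_layers=4)
print(block(x).shape)  # torch.Size([1, 192, 32, 32]) = 64 + 4*32 channels
```

Note how every intermediate layer needs its own 1x1 compression because its input channel count keeps growing; this is exactly the inefficiency the next paragraph criticizes.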
However, to limit FLOPs and model parameters, the output channel count of each layer is fixed, which makes input and output channels inconsistent and increases the MAC. In addition, because the channels from earlier layers keep accumulating, a 1x1 convolution must be used to compress the channel count down to the fixed output, which leads to low GPU utilization. Therefore, even though DenseNet's FLOPs and parameter count are not large, its inference is not efficient.

[Figure: normalized L1 norm of the weights between DenseNet layers]
The figure above shows the relationships between the convolutional layers of DenseNet. Entry (s, l) is the normalized L1 norm of the weights of the convolution connecting layer s to layer l, representing how strongly X_s influences X_l. As the figure shows, most pairs of layers are only weakly related, so most of the connections in DenseNet are redundant. Can some of these connections be removed? Will accuracy suffer after removing them?

3. VoVNet

VoVNet is a model proposed on the basis of the DenseNet analysis above; its core contribution is the OSA (One-Shot Aggregation) module.
[Figure: OSA module structure compared with a DenseNet block]
As the figure shows, unlike DenseNet, each layer in an OSA module is aggregated only once, at the last layer; the intermediate aggregations are removed.
[Figure: normalized L1 norm of the weights within the OSA module]
After this removal, the blue region in the figure (layer pairs that are not closely related) is clearly much smaller, indicating that each remaining connection in the OSA module is relatively useful.
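The one-shot aggregation idea can be sketched in PyTorch as follows. Each 3x3 convolution feeds only the next one, keeping input and output channels equal (which minimizes MAC, as derived earlier), and all outputs are concatenated exactly once at the end; the channel counts here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class OSAModule(nn.Module):
    """One-Shot Aggregation: a chain of 3x3 convs with a fixed channel
    width, concatenated (together with the input) only once at the end,
    followed by a single 1x1 conv to produce the module output."""
    def __init__(self, c_in: int, c_mid: int, c_out: int, n_convs: int = 5):
        super().__init__()
        self.convs = nn.ModuleList()
        c = c_in
        for _ in range(n_convs):
            self.convs.append(nn.Sequential(
                nn.Conv2d(c, c_mid, 3, padding=1, bias=False),
                nn.ReLU(inplace=True),
            ))
            c = c_mid  # after the first conv, input width stays constant
        # the single aggregation point of the whole module
        self.concat_conv = nn.Conv2d(c_in + n_convs * c_mid, c_out, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            x = conv(x)
            feats.append(x)
        return self.concat_conv(torch.cat(feats, dim=1))

x = torch.randn(1, 128, 32, 32)
osa = OSAModule(128, c_mid=64, c_out=256)
print(osa(x).shape)  # torch.Size([1, 256, 32, 32])
```

Unlike the dense block, only one 1x1 convolution is needed per module, and the intermediate 3x3 convolutions run with constant, equal input/output widths, which is what makes the OSA module GPU-friendly.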

The specific architecture of the model:
[Table: VoVNet architecture configurations]
At the end of each stage, there is a 3x3, stride-2 max-pooling layer to downsample the feature map.
For detailed comparisons of VoVNet as a backbone against other backbones on various detection models, see the paper.

References

  • https://zhuanlan.zhihu.com/p/393740052
  • https://arxiv.org/pdf/1904.09730v1.pdf
  • https://zhuanlan.zhihu.com/p/141178215

Additional note: this series of articles is for my personal study only.

Origin blog.csdn.net/qq_44846512/article/details/128588327