Summary of PVANet and common CNN structures and tricks

"PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection"

I would like to share an earlier paper for two main reasons. First, to back up my study notes in case my OneNote ever crashes. Second, this paper draws on the tricks and strengths of many previous papers, so reviewing it is also a way to revisit the evolution of CNNs.


Paper address: https://arxiv.org/abs/1608.08021


Background:

    The usual object detection pipeline is CNN feature extraction + region proposal + RoI classification.

The key design principle:

    fewer channels with more layers

Main content:

PVANET mainly builds on several existing techniques: C.ReLU, Inception, HyperNet, residual structures, batch normalization, and plateau detection. The following sections introduce each in turn.

1. C.ReLU:

    Most features learned in the bottom layers appear in pairs: a kernel that detects a table corner, an eye, or an ear tends to have a counterpart of opposite phase. If plain ReLU is used as the activation function, the negative responses are discarded, so there is redundancy among the kernels of the bottom convolutional layers. C.ReLU was proposed to exploit this; note that the number of convolution kernels is halved.



The figure shows that many convolution kernels appear in opposite-phase pairs; the statistical histogram makes this even more intuitive.


    The formula is simple: multiply the original response by -1, pass both the original and the negated response through ReLU, and concatenate the two, so the output has twice as many channels as the input. Since the number of kernels is halved, the overall dimension is unchanged and the computation is roughly halved.

Back in PVANet, a layer is added to adjust the slope and bias of the activation (a per-channel scale and shift). The structure is shown below:
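The C.ReLU operation can be sketched in NumPy as follows. This is a minimal illustration, with the per-channel scale and shift that PVANet adds modeled as scalars `gamma` and `beta` (names assumed, not from the paper):

```python
import numpy as np

def crelu(x, gamma=1.0, beta=0.0):
    """Concatenated ReLU: stack x and -x along the channel axis, then
    apply ReLU, doubling the channel count. gamma/beta stand in for the
    scale/shift layer PVANet adds after the concatenation (assumed names)."""
    concat = np.concatenate([x, -x], axis=1)   # (N, 2C, H, W)
    return gamma * np.maximum(concat, 0.0) + beta

# A layer with C kernels now yields 2C output channels, so the kernel
# count can be halved while the overall output dimension stays the same.
x = np.random.randn(1, 8, 4, 4)
y = crelu(x)
print(y.shape)  # (1, 16, 4, 4)
```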



2. Inception:

Inception V1 Network 

Background: it is generally believed that deeper and wider networks perform better. However, larger networks mean more parameters and are prone to overfitting, especially when labeled training samples are limited, and they consume more computing resources. Multi-scale convolution kernels let the network learn a diversity of features.


Specific structure:

    The specific structure is shown in Figure (a), but the computation is very large. The dimensionality reduction with 1*1 kernels works as follows: suppose there are 28*28*192 feature maps and we want 128 outputs from 3*3 kernels. Direct convolution costs 3*3*192*128 multiplications per output position. If 96 1*1 kernels first integrate the channels, the cost becomes 1*1*192*96 + 3*3*96*128, which is about 0.58 times the original.
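The arithmetic above can be checked directly (per-output-position multiply counts for the example in the text):

```python
# Direct 3*3 convolution on 192 input channels producing 128 outputs,
# versus a 1*1 reduction to 96 channels followed by the 3*3 convolution.
direct = 3 * 3 * 192 * 128                      # 221184 multiplies
reduced = 1 * 1 * 192 * 96 + 3 * 3 * 96 * 128   # 18432 + 110592 = 129024
print(direct, reduced, round(reduced / direct, 2))  # 221184 129024 0.58
```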

InceptionV3 network (after V2):

 Design guidelines (directly adapted from someone else's blog post): https://blog.csdn.net/wspba/article/details/68065564

    (1) Avoid representational bottlenecks: the size of the representation (i.e. the feature map) should not shrink abruptly. If the information flowing through a layer (especially a convolutional layer) is compressed too aggressively, a large amount of it is lost and training becomes difficult.

    (2) Processing high-dimensional representations locally within the network speeds up training.

    (3) Spatial aggregation over lower-dimensional inputs causes little loss of expressive power, because expressions in adjacent regions of a feature map are highly correlated. If the output is to be spatially aggregated anyway, reducing the dimension of the feature map first does not reduce the information expressed; it compresses the information and speeds up training.

    (4) Keep the depth and width of the network in balance: model performance is maximized by allocating the computational budget in a balanced way between depth and width.



The main improvement of V3 over V1 is replacing the 5*5 convolution with stacked 3*3 convolutions, in line with VGG's small-kernel philosophy.
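The saving from this factorization is easy to quantify: two stacked 3*3 kernels cover the same 5*5 receptive field with fewer weights per channel.

```python
# Per-channel weight count: one 5*5 kernel vs two stacked 3*3 kernels
# (both cover a 5*5 receptive field).
w_5x5 = 5 * 5            # 25 weights
w_two_3x3 = 2 * (3 * 3)  # 18 weights
print(w_two_3x3 / w_5x5)  # 0.72
```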

Back in PVANet, the Inception structure is used; the difference is again that 5*5 is replaced with 3*3, as shown below:



3. HyperNet structure

Object detection needs multi-scale feature combination. The core idea is to extract feature maps from different convolutional layers, combine them, apply RoI pooling on the combined map, and then perform detection, similar to a feature pyramid. The structure is as follows:
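The combination step can be sketched in NumPy: pool the shallow (high-resolution) map down, upsample the deep (low-resolution) map, and concatenate all three along the channel axis. This is only an illustration of the HyperNet idea with made-up shapes, not PVANet's exact configuration:

```python
import numpy as np

def max_pool2(x):
    """2x downsampling via 2*2 max pooling (x: C, H, W with even H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2(x):
    """2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def hyper_feature(fine, mid, coarse):
    """Bring feature maps from three depths onto the middle resolution,
    then concatenate their channels."""
    return np.concatenate([max_pool2(fine), mid, upsample2(coarse)], axis=0)

fine = np.random.randn(16, 32, 32)   # shallow layer, high resolution
mid = np.random.randn(32, 16, 16)
coarse = np.random.randn(64, 8, 8)   # deep layer, low resolution
print(hyper_feature(fine, mid, coarse).shape)  # (112, 16, 16)
```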


There is not much to add; PVANet appears to use this idea directly.


4. Residual Structures:

    This structure is also used everywhere now. The deep residual network was developed at Microsoft Research Asia, mainly to address the degradation and gradient-dispersion problems that appear as networks get deeper. The structure is well known; in PVANet, shortcuts are added only around the Inception layers.
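The shortcut itself is a one-liner: the block learns a residual transform and adds the input back, which eases optimization of deep stacks. A minimal sketch (the transform here is a toy placeholder, not PVANet's actual Inception block):

```python
import numpy as np

def residual_block(x, f):
    """Residual shortcut: output = f(x) + x. If f learns to be zero,
    the block reduces to the identity, which is easy to optimize."""
    return f(x) + x

x = np.random.randn(4, 8)
y = residual_block(x, lambda v: np.maximum(0.1 * v, 0))  # toy transform
print(y.shape)  # (4, 8)
```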



5. Add a Batch Normalization layer before the ReLU activation layer:

 BN also comes from the Inception V2 paper. It is a simple formula whose derivation is more involved, so I will not write it out. Personally I feel BN has become standard in deep learning, mainly to alleviate internal covariate shift and gradient dispersion in backpropagation.
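The core formula is short enough to sketch anyway. For a batch of feature vectors, each feature is normalized to zero mean and unit variance over the batch, then rescaled and shifted by learnable parameters (scalars here for simplicity; in practice they are per-channel):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over the batch axis: normalize each feature,
    then apply learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 10) * 3 + 5   # shifted, scaled activations
y = batch_norm(x)
print(y.mean(axis=0).round(3))        # ~0 for every feature
```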


6. Plateau detection:

    This is a very interesting way to control the learning rate dynamically. The main idea is to decide whether to change the learning rate based on the average change in loss over a period. For example, smooth the loss with a moving average; if the loss decreases by less than a threshold over a given period, the model is considered to be on a plateau and the learning rate is decayed in some way. Iteration then continues until a predetermined number of iterations is reached or the learning rate falls below a threshold.
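The idea can be sketched as a small scheduler. This is a generic plateau-detection sketch, not the paper's exact procedure; the window, threshold, and decay parameter names are assumptions:

```python
class PlateauScheduler:
    """Decay the learning rate when an exponential moving average of the
    loss stops improving by more than `threshold` for `patience` steps.
    All hyperparameter names/values are illustrative assumptions."""

    def __init__(self, lr=0.01, decay=0.1, patience=5, threshold=1e-3, beta=0.9):
        self.lr, self.decay = lr, decay
        self.patience, self.threshold, self.beta = patience, threshold, beta
        self.ema = None            # moving-average loss
        self.best = float("inf")   # best smoothed loss seen so far
        self.stale = 0             # steps without sufficient improvement

    def step(self, loss):
        self.ema = loss if self.ema is None else self.beta * self.ema + (1 - self.beta) * loss
        if self.ema < self.best - self.threshold:
            self.best = self.ema
            self.stale = 0
        else:
            self.stale += 1
            if self.stale > self.patience:  # plateau: decay and reset
                self.lr *= self.decay
                self.stale = 0
        return self.lr

sched = PlateauScheduler()
for loss in [1.0] * 20:   # flat loss: plateau should trigger decay
    lr = sched.step(loss)
print(lr < 0.01)  # True
```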


Summary:

First, the full PVANet architecture:


I feel this network is mainly a fusion of existing ideas, and the results are good. The code is available on GitHub. I am planning to take the front part and connect it to YOLO to see how it performs.


References:

1. C.ReLU: Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

2. Going deeper with convolutions

3. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

4. Rethinking the Inception Architecture for Computer Vision

5. PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection
