VGG Network Explained: Even a Beginner Can Understand It

Table of contents

1. Introduction to VGG network
   1. Overview of VGG
   2. Introduction to VGG structure

2. Advantages of VGG

3. The highlights of VGG

Calculations

Receptive field


1. Introduction to VGG network

1. Overview of VGG

VGGNet is a model proposed by the Visual Geometry Group at Oxford University. It achieved excellent results in both the classification and localization tasks of the 2014 ImageNet challenge (ILSVRC-2014). The outstanding contribution of VGGNet is to show that small convolution kernels, combined with increased network depth, can effectively improve performance. VGG carries on the legacy of AlexNet while having its own distinctive character: compared with AlexNet it uses a deeper network structure, demonstrating that increasing depth can, to a certain extent, improve network performance.

To put it simply, VGG is a convolutional neural network built from five convolutional blocks.

 2. Introduction to VGG structure

As mentioned above, VGG consists of five convolutional blocks. Let's look at the configuration table from the paper:

This table shows the six network configurations the authors experimented with at the time. Before going through it, a note on notation: the convolutional layers all use 3*3 kernels and are written as conv3-xxx, where xxx is the number of output channels.

Looking at the table, we can see the following:

The first group (A) is a plain convolutional neural network with no bells and whistles.

The second group (A-LRN) adds LRN (Local Response Normalization) on top of the first group's network; it was borrowed from AlexNet but, as we will see, it does not have a good performance here.

The third group (B) adds two conv3 layers on top of A, that is, two more 3*3 convolution kernels.

The fourth group (C) adds three conv1 layers on top of B, that is, three 1*1 convolution kernels.

The fifth group (D) replaces the three conv1 layers of C with three 3*3 convolution kernels.

The sixth group (E) adds three more conv3 layers on top of D, that is, three more 3*3 convolution kernels.

As we learned back in school, an experiment should control its variables (change only one thing at a time). You can see that principle at work here: each configuration differs from the previous one by a single change.

So, what can we learn from this set of experiments?

1. Comparing the first group with the second, LRN does not perform well here, so it was dropped.

2. Comparing the fourth group with the fifth, conv3 performs better than conv1.

3. Looking at all six experiments together, you will find that as the network gets deeper, the model's performance keeps improving.

To summarize briefly: the authors experimented with six network structures, among which VGG16 and VGG19 (16 and 19 weight layers, respectively) gave the best classification results, confirming that increasing network depth improves the final performance to a certain extent. There is no essential difference between the two; only the depth differs.
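To relate the table to code, here is a rough sketch of columns A, B, D and E written the way such configurations are often expressed (e.g. in the spirit of torchvision's VGG code); this is only an illustration, not the original authors' code, and column C, which mixes in 1*1 convolutions, is omitted:

```python
# Each number is the output channel count of a conv3 (3x3) layer,
# and "M" marks a 2x2 max-pooling layer.
vgg_cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

# Counting the conv layers plus the 3 fully connected layers at the end
# gives the familiar depths: A -> 11, B -> 13, D -> 16 (VGG16), E -> 19 (VGG19).
for name, cfg in vgg_cfgs.items():
    print(name, sum(1 for v in cfg if v != "M") + 3)
```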

Next, take VGG16 as an example to explain in detail:

 

Some beginners may be wondering: what do the "weight layers" listed under each configuration name mean? Don't worry, see below:

Add up these numbers in the figure: do they total 16?

As shown in the figure above, if we skip the yellow maxpool (pooling) layers, the remaining layers add up to 16. The same counting works for VGG19.

So, why aren't the pooling layers counted?

Because pooling layers have no weight coefficients (learnable parameters), while the other layers do, so only the latter count as "weight layers".

 So, let's see how VGG works next:

An image with height 224, width 224 and 3 channels passes through the first block, where the number of channels is increased to 64 and the pooling layer then halves the height and width. In the second block the number of channels is increased to 128 and the pooling layer again halves the height and width, and so on...

After the last pooling layer we get a 7×7×512 feature map, which then passes through three fully connected layers to become the final 1×1×1000 classification result (it does not have to be one thousand; the 1000 here is the number of classes in the ImageNet task).
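To make these shape changes concrete, here is a minimal PyTorch sketch of a VGG16-style forward pass (dropout and weight initialization are omitted, and the 1000-way output assumes the ImageNet setting); it is only meant to trace the shapes described above, not to reproduce the original training setup:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """num_convs 3x3 convolutions (padding=1 keeps H and W), then a 2x2 max pool."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The five convolutional blocks of VGG16 (configuration D): 2, 2, 3, 3, 3 conv layers.
features = nn.Sequential(
    vgg_block(3,   64,  2),   # 224x224x3   -> 112x112x64
    vgg_block(64,  128, 2),   # 112x112x64  -> 56x56x128
    vgg_block(128, 256, 3),   # 56x56x128   -> 28x28x256
    vgg_block(256, 512, 3),   # 28x28x256   -> 14x14x512
    vgg_block(512, 512, 3),   # 14x14x512   -> 7x7x512
)

# The three fully connected layers; 13 conv + 3 FC = 16 weight layers.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),        # 1000 classes in the ImageNet task
)

x = torch.randn(1, 3, 224, 224)   # a dummy 224x224 RGB image
feat = features(x)
print(feat.shape)                 # torch.Size([1, 512, 7, 7])
print(classifier(feat).shape)     # torch.Size([1, 1000])
```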

2. Advantages of VGG

1. Small convolution kernels: the authors replace large convolution kernels by stacking multiple 3*3 kernels (with a few 1*1 ones), which reduces the number of parameters required;

2. Small pooling kernels: compared with the 3*3 pooling kernels used by AlexNet, VGG uses 2*2 pooling kernels throughout;

3. Deeper network, wider feature maps: the convolutional layers focus on increasing the number of channels and the pooling layers on halving the height and width, so the model becomes deeper and wider while the growth in computation slows down;

4. Fully connected layers replaced by convolutions: during the test phase, the authors replace the three fully connected layers with three convolutional layers, so the resulting test-time model can accept input of any height or width (a sketch of this idea follows after this list);

5. Multi-scale: inspired by the observation that multi-scale training can improve performance, the authors use rescaled versions of the whole image at different scales during training and testing to improve the model's performance;

6. LRN removed: the authors found that the LRN (Local Response Normalization) layer has no obvious effect in deep networks.
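Point 4 is easiest to see in code. Below is a rough sketch of the idea, not the authors' actual conversion code: the weights of the trained fully connected layers would be copied into the convolutions (that copying step is omitted here). The first FC layer becomes a 7x7 convolution and the other two become 1x1 convolutions, so the head can slide over feature maps of any size and produce a spatial map of class scores.

```python
import torch
import torch.nn as nn

# Training-time head: three fully connected layers, fixed to a 7x7x512 input.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

# Test-time head: the first FC layer becomes a 7x7 convolution and the other
# two become 1x1 convolutions, so the head works on feature maps of any size
# (>= 7x7) and outputs a spatial map of class scores instead of a single vector.
conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),
)

feat = torch.randn(1, 512, 9, 9)   # features of a larger-than-224 test image
print(conv_head(feat).shape)       # torch.Size([1, 1000, 3, 3]) - a score map
```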

3. The highlights of VGG

In AlexNet, the authors use large 11x11 and 5x5 convolutions, although most of the layers are still 3x3 convolutions. The 11x11 kernel with stride=4 is used at the very beginning because the original image is large and fairly redundant there, so a large kernel can capture the coarse texture features as early as possible; in deeper layers the worry is that feature correlations over larger local regions would be lost, so AlexNet relies mostly on 3x3 kernels, plus one 5x5 convolution, to capture the changes in finer details.

VGGNet, on the other hand, uses 3x3 convolutions throughout. The kernel size matters because it affects both the amount of computation and the receptive field. The former determines whether the model is convenient to deploy on mobile devices, can meet real-time requirements, and is easy to train; the latter relates to how parameters are updated, the size of the feature maps, whether enough features are extracted, the complexity of the model, and the number of parameters.

Calculations

One improvement of VGG16 over AlexNet is the use of several consecutive 3x3 convolution kernels in place of AlexNet's larger kernels (11x11, 7x7, 5x5). For a given receptive field (the region of the input that one output element depends on), stacking small kernels is better than using a single large kernel, because the extra non-linear layers increase the depth of the network and allow it to learn more complex patterns, at a relatively small cost (fewer parameters).

In VGG, three 3x3 convolution kernels are used instead of a 7x7 kernel, and two 3x3 kernels instead of a 5*5 kernel. The main purpose is to increase the depth of the network while keeping the receptive field the same, which improves the performance of the neural network to a certain extent.

For example, three consecutive 3x3 convolutions are equivalent (in receptive field) to one 7x7 convolution. The total number of parameters of three 3x3 convolutions is 3×(3×3×C²) = 27C², while a single 7x7 convolution kernel has 7×7×C² = 49C² parameters, where C is the number of input and output channels. Since 27 < 49, the parameter count is reduced; moreover, 3x3 kernels help preserve the image properties better, and stacking several small kernels also improves accuracy.
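A quick sanity check of this arithmetic, assuming C input channels and C output channels and ignoring bias terms:

```python
# Parameter counts for C input channels and C output channels, ignoring biases.
C = 256  # example channel count; any value shows the same ratio

three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 convolutions: 27 * C^2
one_7x7 = 7 * 7 * C * C           # a single 7x7 convolution:       49 * C^2

print(three_3x3, one_7x7)    # 1769472 3211264
print(three_3x3 / one_7x7)   # ~0.55, i.e. roughly 45% fewer parameters
```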

Receptive field

Simply put, the receptive field of an element on a feature map is the size of the corresponding region on the input layer.

Calculation formula (applied from the deep layers back toward the shallow ones):

F(i) = ( F(i+1) − 1 ) × stride_i + ksize_i

where F(i) is the receptive field of the i-th layer, stride_i is the stride of the i-th layer, and ksize_i is the size of its convolution or pooling kernel.

To put it bluntly, the receptive field tells you how many neurons in the input layer are connected, directly or indirectly, to a single neuron in the output layer, i.e. which inputs can influence it.

Take the following picture as an example:

In the first picture, one neuron is connected to 3 neurons in the previous layer; in the second picture it covers 5 neurons, and in the third it covers 7. Notice that the region covered by one 5*5 convolution is exactly the region covered by two stacked 3*3 convolutions, and the region covered by one 7*7 convolution is exactly the region covered by three stacked 3*3 convolutions.

You can also work through the leftmost diagram: changing one node in the last layer affects 7 nodes in the first layer, and the same is true of the rightmost diagram. In other words, three stacked conv3 layers and a single conv7 layer produce outputs with the same receptive field.
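The same equivalence can be checked numerically with the formula given above; the helper below is a small sketch that walks from the deepest layer back to the input:

```python
def receptive_field(layers):
    """Receptive field on the input, computed from the deepest layer back:
    F(i) = (F(i+1) - 1) * stride_i + ksize_i, starting from F = 1 (one output element)."""
    f = 1
    for ksize, stride in reversed(layers):
        f = (f - 1) * stride + ksize
    return f

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7  -> three 3x3 convs
print(receptive_field([(7, 1)]))                  # 7  -> one 7x7 conv
print(receptive_field([(3, 1), (3, 1)]))          # 5  -> two 3x3 convs
print(receptive_field([(5, 1)]))                  # 5  -> one 5x5 conv
```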

Therefore, in the case of the same receptive field:

Stacking small convolution kernels gives more activation functions, richer features, and stronger discriminative power than using one large kernel. Each convolution is followed by an activation function, which makes the decision function more discriminative. In addition, 3x3 is large enough to capture fine-grained changes: in a 3x3 grid of nine cells, the centre cell is the centre of the receptive field, and the kernel can capture changes up, down, left, right, and along the diagonals. Three stacked 3x3 convolutions approximate one 7x7 convolution, but the network is two layers deeper and has two more non-linear ReLU functions, so the network has more capacity and distinguishes different categories better.


Source: blog.csdn.net/m0_57011726/article/details/129454665