Image Classification Algorithms: Interpretation of VGG Papers

Foreword

There are already many good interpretations of this paper online, but I decided to write my own. My main purpose is to help myself sort out and deeply understand the paper: writing forces me to organize what I have learned and make sure it is clear and correct, which I think is good practice. Of course, if it also helps other readers, so much the better.

Note

If you spot a typo or an error, please point it out (politely). If you have a better take, feel free to share it and I will gladly learn from it.

Original paper address

Click here, or copy the link:

https://arxiv.org/pdf/1409.1556.pdf

1. Brief background:

​ After the advent of AlexNet in 2012, researchers became enthusiastic about exploring new network structures. One improvement idea from 2013 was to use a smaller receptive field (convolution kernel) and a smaller stride in the first convolutional layer, because this extracts more information from the original image. An improvement from 2014 was training at multiple scales.

​ Against this background, and given the author's interest in the factor of network depth, this paper appeared.

Explanation: Multi-scale

​ Multi-scale means making comprehensive use of information at different resolutions. For example, processing a whole picture is one scale, and cropping out a part of it is another. Likewise, using the output feature maps of different convolutional layers is also a form of multi-scale.

2. Overview of the content of the article:

​ The author's main goal is to demonstrate the effect of depth on accuracy. To this end, the author fixed all design choices except depth and implemented the convolutional layers with small 3*3 kernels. A total of six model configurations were introduced, among which the 16-layer and 19-layer architectures achieved good results; they are known as VGG16 and VGG19.

3. Model architecture:

The model architecture table is provided in the original paper:

[Figure: model configuration table from the original paper (configurations A through E)]

​ A little explanation of the above table:

  • conv3-128 means a convolutional layer with 3*3 kernels and 128 output channels
  • There are six models in total: A and A-LRN compare whether LRN helps, while C and D test the role of the 1*1 convolution kernel

​ Of course, if the table feels too abstract, there is also a diagram of VGG16 for you to enjoy (picture from the Internet):

[Figure: VGG16 architecture diagram (from the Internet)]

​ Given such a clear picture, the architecture should not need further elaboration. If you are wondering how to compute the layer sizes, you can refer to my article on AlexNet, which derives the size calculation in detail.

​ In addition, I think one very important point is to be familiar with the model's parameter count. When I implemented AlexNet, I could run it on the GPU without problems, but when I implemented VGG16, the larger number of parameters filled the GPU memory and caused an out-of-memory error.
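As a rough sanity check on the parameter count, it can be computed layer by layer in plain Python. This is a minimal sketch (the layer list is transcribed from the VGG16 configuration; counts include one bias per output channel):

```python
# Parameter count of VGG16: thirteen 3x3 conv layers plus three FC layers.
# A conv layer has 3*3 * in_channels * out_channels weights + out_channels biases.
convs = [(3, 64), (64, 64),                    # block 1
         (64, 128), (128, 128),                # block 2
         (128, 256), (256, 256), (256, 256),   # block 3
         (256, 512), (512, 512), (512, 512),   # block 4
         (512, 512), (512, 512), (512, 512)]   # block 5
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in convs)
fc_params = sum(cin * cout + cout for cin, cout in fcs)
total = conv_params + fc_params
print(total)  # 138357544 parameters, i.e. ~528 MB in float32
```

Note that the three FC layers alone account for roughly 124 million of the ~138 million parameters, which is why the memory footprint jumps so sharply compared with AlexNet.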

[Figure: per-layer parameter counts of the VGG models]

4. The role of continuous small convolution kernels:

​ Looking at the parameter-count figure above, it is not hard to notice that as the number of layers increases, the parameter count does not grow by much. Why is this?

This is where the benefits of small convolution kernels come in. The author observed that a stack of two 3*3 kernels covers the same receptive field as one 5*5 kernel, and a stack of three 3*3 kernels covers the same receptive field as one 7*7 kernel. This means fewer parameters and more nonlinear transformations; see the figure below:

[Figure: receptive field of stacked 3*3 convolutions]

  • The number of parameters is reduced
Assuming one bias per kernel (single input and output channel):
	Two 3*3 kernels: 3*3*2 + 2 = 20 parameters
	One 5*5 kernel: 5*5 + 1 = 26 parameters
So the parameter count is indeed smaller.
  • More nonlinear transformations: helps prevent overfitting and extracts more information
Two 3*3 kernels:
	The same input goes through two convolution operations (each followed by a ReLU)
One 5*5 kernel:
	The same input goes through only one convolution operation
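The arithmetic above can be checked directly; the comment also generalizes it to C input and output channels, where the advantage grows:

```python
# Two stacked 3x3 kernels vs. one 5x5 kernel, single channel, one bias each.
two_3x3 = 2 * (3 * 3) + 2   # 18 weights + 2 biases
one_5x5 = 5 * 5 + 1         # 25 weights + 1 bias
print(two_3x3, one_5x5)     # 20 26

# With C input and C output channels (biases ignored), the weight counts are
# 2 * 9 * C^2 = 18C^2 versus 25C^2, so the saving is about 28%.
```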

Explanation: Two stacked 3*3 kernels are equivalent to one 5*5 kernel

​ Suppose the input is m*m, the stride is 1, and the padding is 0.

​ Then, after two 3*3 convolutions:

First convolution:  (m-3)/1 + 1 = m-2
Second convolution: (m-2-3)/1 + 1 = m-4

However, after one 5*5 convolution kernel:

(m-5)/1 + 1 = m-4

​ Therefore, the two produce outputs of the same spatial size. The actual output values will of course differ, but the dimensions match.
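The size calculation above can be written as a small helper function and verified on a concrete input (m = 224 is just an example value):

```python
def conv_out(m, k, stride=1, pad=0):
    """Spatial output size of a convolution: floor((m - k + 2*pad) / stride) + 1."""
    return (m - k + 2 * pad) // stride + 1

m = 224
after_two_3x3 = conv_out(conv_out(m, 3), 3)  # two stacked 3x3 convs
after_one_5x5 = conv_out(m, 5)               # one 5x5 conv
print(after_two_3x3, after_one_5x5)          # 220 220, both equal m - 4
```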

5. The role of 1*1 convolution kernel:

​ The author introduced the 1*1 convolution kernel in model C. Its functions are:

  • Adding nonlinearity to the decision function
A 1*1 convolution is itself a linear map (convolving with a 1*1 kernel is just a per-pixel linear operation), but do not forget that the 1*1 convolutional layer is followed by a ReLU, which adds the nonlinear factor.
Viewed from another angle, it is also a way to increase depth.
  • Changing the number of channels
This is its core use: it can increase or decrease the channel dimension, e.g. kernels of shape 1*1*64 or 1*1*256 change the channel count accordingly. This trick became widely used in later work.
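The channel-changing behavior can be sketched with NumPy (the weight values here are random, purely to show the shape change): a 1*1 convolution is just a linear map across channels applied independently at every pixel.

```python
import numpy as np

# A 1x1 convolution mixes channels without looking at spatial neighbors,
# so it is equivalent to a matrix multiply at each pixel position.
h, w, c_in, c_out = 14, 14, 256, 64      # example: reduce 256 channels to 64
x = np.random.randn(h, w, c_in)          # input feature map (HWC layout)
weight = np.random.randn(c_in, c_out)    # a 1x1 kernel bank == c_in x c_out matrix
bias = np.zeros(c_out)

y = x @ weight + bias                    # same result as a 1x1 conv, stride 1
print(y.shape)                           # (14, 14, 64): spatial size unchanged
```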

6. Multi-scale:

​ The author's way to achieve multi-scale is image cropping. In the original paper, S denotes the shorter side of the isotropically rescaled training image (that is, first rescale the image so that its shorter side equals S, then crop a 224*224 patch from it), then:

  • Single-scale training: S is fixed; the author tested S=256 and S=384
  • Multi-scale training: S is sampled randomly from [Smin, Smax] (the paper uses Smin=256, Smax=512)

​ It is not hard to guess that multi-scale training works better: the size of objects in images is not fixed, so letting S vary during cropping beats keeping S at a fixed value.
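The sampling step can be sketched as follows (a minimal illustration; `rescale_dims` is a hypothetical helper, and the 640*480 input size is just an example):

```python
import random

def rescale_dims(width, height, s):
    """Isotropically rescale so that the shorter side equals s."""
    scale = s / min(width, height)
    return round(width * scale), round(height * scale)

# Multi-scale training: sample S per image, rescale, then take a 224x224 crop.
s_min, s_max = 256, 512          # the range used in the paper
s = random.randint(s_min, s_max)
w, h = rescale_dims(640, 480, s)
assert min(w, h) >= 224          # since S >= 256, a 224x224 crop always fits
print(s, (w, h))
```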

7. The convolutional layer replaces the fully connected layer:

​ During testing, the author used an FCN (Fully Convolutional Network), that is, the fully connected layers are replaced by convolutional layers.

Speaking of this, we should ask: why do both AlexNet and VGG restrict the input image size, 227*227 for the former and 224*224 for the latter?

​ This is because the number of neurons in a fully connected layer is fixed, so the output of the last convolutional layer must have a fixed size, which in turn requires a fixed input image size.

​ An FCN lifts this restriction: one benefit of replacing full connections with convolutions is that the input image size is no longer constrained.

​ So how did the author realize this idea? We know that the last convolutional layer of VGG16 outputs 7*7*512 (see the VGG16 figure above, or compute it yourself), and the first FC layer maps this 7*7*512 input to 4096 outputs. We can therefore replace it with a convolutional layer that uses 4096 kernels of size 7*7*512 (so the output is still 4096 values, now shaped 1*1*4096), and replace the last two FC layers with 1*1 convolutions that have 4096 and 1000 kernels respectively.

​ See the figure below, where I explain why the first FC layer is replaced by a 7*7 convolution with 4096 kernels (the core idea is to keep the output consistent, i.e. 1*1*4096):

[Figure: replacing the first FC layer with a 7*7 convolution]
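The equivalence can be checked numerically: the conv replacement has exactly the same weights as the FC layer it replaces, and on a larger input it simply yields a spatial map of class scores instead of failing.

```python
# Replacing VGG16's first FC layer (7*7*512 -> 4096) with a convolution:
# 4096 kernels of size 7x7x512, applied to the 7x7x512 feature map, produce
# a 1x1x4096 output, and the weight count is identical to the FC version.
fc_weights = (7 * 7 * 512) * 4096
conv_weights = 4096 * (7 * 7 * 512)   # 4096 kernels, each 7x7x512
assert fc_weights == conv_weights

# On a larger feature map (e.g. 14x14x512 from a bigger test image), the same
# 7x7 conv with stride 1 outputs an 8x8 grid of 4096-dim scores, which the
# paper then spatially averages, instead of breaking like an FC layer would.
out = (14 - 7) // 1 + 1
print(out)  # 8
```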

8. The authors' conclusions:

The author tried six models and different training details, and came to some useful conclusions:

  • LRN did not bring the hoped-for benefit
  • The error rate decreases as depth increases
  • Stacked small convolution kernels and 1*1 kernels are very useful
  • Multi-scale training beats single-scale training
  • Fusing multiple models works even better (somewhat like ensemble methods)

9. Summary:

​ From my perspective as a student, I think an article like VGG was bound to appear, since the importance of depth had already been hinted at in AlexNet. But it also taught me that even with a direction in hand, execution is difficult: without tricks such as stacked small convolution kernels, it would still have been hard to explore the effect of depth.

However, the important contribution of VGG is:

  • Demonstrating the importance of depth
  • Showing that stacked small convolution kernels and 1*1 convolution kernels can play a greater role
  • Demonstrating the importance of multi-scale training

​ Finally, if you are interested in implementing VGG16, you can read my two articles: one simply implements the VGG16 architecture, and the other focuses on using VGG16:

https://blog.csdn.net/weixin_46676835/article/details/128730174
https://blog.csdn.net/weixin_46676835/article/details/129582927

Origin blog.csdn.net/weixin_46676835/article/details/129742174