Scientific knowledge
An important topic in machine learning is a model's generalization ability: a model that generalizes well is a good model. If a trained model performs poorly on the training set, it will also perform poorly on the test set. This is usually caused by underfitting: the model does not fit the data closely, the data points lie far from the fitted curve, and the model fails to capture the characteristics of the data.
# Preface
In the last theoretical article, we shared the AlexNet network, which was somewhat deeper than earlier deep learning networks and also used large convolution kernels, both novel improvements at the time. Today we continue with a new architecture: VGG. Its basic component is still the convolutional layer, but the depth and the way layers are combined are different, and it raised accuracy on the public dataset to a new height. This network pushed deep learning forward another step.
VGG network
The paper shared today is Very Deep Convolutional Networks for Large-Scale Image Recognition. The name tells you what it is about: a deep convolutional neural network for large-scale image recognition. How deep is this network? The most widely recognized versions have 16 and 19 weight layers, giving the final architectures VGG16 and VGG19.
Screenshot of the paper:
1. Network structure diagram
The network configuration diagram in the paper:
Figure 1
Network structure diagram on the Internet:
Figure 2
Paper address: https://arxiv.org/pdf/1409.1556.pdf
2. Network analysis
Today we only cover VGG16, because VGG19 has a similar architecture, just a little deeper. Look carefully at Figure 1: conv3 denotes a convolution with a 3x3 kernel, and the number of channels grows from 3 to 64, 128, 256, and finally 512. From Figure 2 we can see that as the original image passes through the network, its spatial size keeps shrinking while the number of channels of the intermediate feature maps keeps growing. What is the principle? A popular explanation is that the increase in the number of channels compensates for the loss of spatial information as the feature maps get smaller and smaller.
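To make that trade-off concrete, here is a minimal sketch in plain Python. The block plan is an assumption matching the VGG16 configuration described in this article: each 3x3 convolution uses padding and keeps the spatial size, so only the 2x2 max-pool after each block changes it.

```python
# Assumed VGG16 block plan: channel count after each of the 5 conv blocks.
size, channels = 224, 3
block_channels = [64, 128, 256, 512, 512]

for c in block_channels:
    channels = c        # the conv block raises (or keeps) the channel count
    size //= 2          # the max-pool after the block halves height and width
    print(f"{size}x{size} spatial, {channels} channels")
```

The trace ends at 7x7 with 512 channels: the spatial information shrinks by a factor of 32 per side while the channel count grows from 3 to 512.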
VGG16 contains 16 layers in total (13 convolutional layers + 3 fully connected layers). What needs to be remembered here is that the number of network layers usually counts only trainable layers; pooling is not included, because it performs a fixed computation and has no trainable parameters.
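A quick way to see why pooling does not count as a layer is to count trainable parameters. The helper functions below are a sketch: the formulas are the standard ones for convolutional and fully connected layers, and the sizes plugged in are VGG16's.

```python
def conv_params(k, c_in, c_out):
    # one k x k kernel per (input channel, output channel) pair, plus biases
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # full weight matrix plus biases
    return n_in * n_out + n_out

print(conv_params(3, 3, 64))         # first conv layer: 1792 parameters
print(fc_params(512 * 7 * 7, 4096))  # first FC layer: 102764544 parameters
# a 2x2 max-pool has 0 trainable parameters, so it adds nothing to the 16-layer count
```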
Input layer: 224x224x3
64-channel convolution block : 2 layers of 3x3x64 convolutions with padding, so the convolutions keep the feature map size unchanged; output: 64x224x224.
maxpooling1 : halves the feature map size; output: 64x112x112.
128-channel convolution block : 2 layers of 3x3x128 convolutions with padding, keeping the feature map size unchanged; output: 128x112x112.
maxpooling2 : halves the feature map size; output: 128x56x56.
256-channel convolution block : 3 layers of 3x3x256 convolutions with padding, keeping the feature map size unchanged; output: 256x56x56.
maxpooling3 : halves the feature map size; output: 256x28x28.
512-channel convolution block : 3 layers of 3x3x512 convolutions with padding, keeping the feature map size unchanged; output: 512x28x28.
maxpooling4 : halves the feature map size; output: 512x14x14.
512-channel convolution block : 3 layers of 3x3x512 convolutions with padding, keeping the feature map size unchanged; output: 512x14x14.
maxpooling5 : halves the feature map size; output: 512x7x7.
Fully connected layer 1 : input: 512*7*7 = 25088, output: 4096.
Fully connected layer 2 : input: 4096, output: 4096.
Fully connected layer 3 : input: 4096, output: 1000, because ImageNet has 1000 categories.
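The whole layer list above can be replayed with a compact configuration list, in the style commonly used in VGG implementations. The `cfg` list here is an assumption matching the structure above: numbers are conv output channels, "M" is a max-pool.

```python
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

size, channels = 224, 3
for v in cfg:
    if v == "M":
        size //= 2          # 2x2 max-pool halves the spatial size
    else:
        channels = v        # padded 3x3 conv keeps the size, sets channels
print(f"final feature map: {channels}x{size}x{size}")  # 512x7x7
print(512 * 7 * 7)  # 25088, the flattened input of fully connected layer 1
```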
That completes the structural analysis of VGG16. The paper mainly shows that a deeper network can learn more information, which also improves the final classification accuracy. But is deeper always better? Is there a limit to how deep a network can go? We will discuss this question later. Also note that a deeper network consumes more GPU memory, especially the last two 4096-unit fully connected layers, so a graphics card at the GTX 1080 level or above is recommended to run such a network; otherwise it is very slow.
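To back up the memory remark, here is a sketch that sums VGG16's trainable parameters from the layer list above; the fully connected layers dominate the total.

```python
def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

# (input channels, output channels) of the 13 conv layers from the list above
convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(i, o) for i, o in convs)
fc_total = sum(fc_params(i, o) for i, o in fcs)
print(conv_total, fc_total, conv_total + fc_total)
# the 3 FC layers hold roughly 89% of all ~138 million parameters
```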
END
Epilogue
This is the end of today's sharing. Students who want to study seriously can read the original VGG paper to understand the authors' intent in designing this network and how they demonstrated its effectiveness. Next week we will continue with the TensorFlow practice of VGG16.
See you again!
Editor: Layman Yueyi|Review: Layman Xiaoquanquan
Advanced IT Tour
Past review
Deep Learning Theory (14) -- AlexNet's next level
Deep Learning Theory (13) -- LeNet-5 is surging
Deep Learning Theory Part (12) -- Pooling of Dimensionality Reduction
What have we done in the past year:
[Year-end Summary] Saying goodbye to the old and welcoming the new, 2020, let's start again
[Year-end summary] 2021, bid farewell to the old and welcome the new