A Clear Introduction to the VGG Network and the Receptive Field

VGG was proposed in 2014 by the well-known Visual Geometry Group (VGG) at the University of Oxford. That year it won first place in the ImageNet Localization Task and second place in the Classification Task. The following figure is taken from the original VGG paper:
[Figure: VGG configuration table from the original paper]
The paper presents six VGG configurations. Across these configurations, the authors tried different numbers of layers, tested whether using LRN (Local Response Normalization) helps, and compared 1×1 convolution kernels against 3×3 kernels. In practice, configuration D is the one most commonly used: a 16-layer network (13 convolutional layers + 3 fully connected layers), widely known as VGG-16.

Highlights of the network:

Stack multiple 3×3 convolution kernels to replace a single large convolution kernel, with the goal of reducing the number of parameters. The paper points out that two stacked 3×3 convolution kernels can replace one 5×5 kernel, and three stacked 3×3 kernels can replace one 7×7 kernel. In other words, two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution, and three stacked 3×3 convolutions have the same receptive field as one 7×7 convolution.
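The parameter savings can be checked with a small sketch (not from the original post; `conv_params` and the channel count `C = 512` are illustrative assumptions, and biases are ignored):

```python
def conv_params(kernel_size: int, in_ch: int, out_ch: int) -> int:
    """Weight count of a 2D convolution (bias ignored)."""
    return kernel_size * kernel_size * in_ch * out_ch

C = 512  # example: input and output channel counts are both C

# One 7x7 conv vs. three stacked 3x3 convs with the same receptive field:
single_7x7 = conv_params(7, C, C)       # 49 * C^2 parameters
stacked_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2 parameters

print(single_7x7, stacked_3x3)  # 12845056 7077888
```

The stacked version needs 27C² parameters instead of 49C², roughly a 45% reduction, while also inserting extra ReLU non-linearities between the layers.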

So what is the receptive field?
[Figure: illustration of the receptive field]
In a convolutional neural network, the receptive field is the size of the region of the input layer that one element of a given layer's output depends on.
In layman's terms: one unit on the output feature map corresponds to an area of a certain size on the input layer.
[Figure: receptive field example across three layers]
In the figure above, one unit in the layer-3 feature map corresponds to a 2×2 region in layer 2, and to a 5×5 region in the original image.

**How is the receptive field calculated?** The calculation formula is given below.
[Figure: receptive field formula]
F(i) = (F(i+1) − 1) × Stride + Ksize
where F(i) is the receptive field of layer i, Stride is the stride of layer i, and Ksize is the kernel size of layer i.
In the original paper, the authors explain that a 7×7 convolution kernel can be replaced by three stacked 3×3 kernels; a simple calculation shows why. One point to emphasize: in the VGG network, the convolution stride defaults to 1. Start from one unit of the final feature map, whose receptive field with respect to itself is 1, and apply the formula backwards through the three 3×3 convolutional layers: (1 − 1) × 1 + 3 = 3, then (3 − 1) × 1 + 3 = 5, then (5 − 1) × 1 + 3 = 7. So the final output corresponds to a 7×7 region of the original image, the same receptive field as a single 7×7 convolution kernel, while the number of parameters is significantly reduced.
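The backward calculation above can be written as a short helper (a minimal sketch, not from the original post; the function name and the `(kernel, stride)` layer list are illustrative):

```python
def receptive_field(layers):
    """Receptive field on the input of one output unit.

    layers: list of (kernel_size, stride), ordered from first to last layer.
    Applies F(i) = (F(i+1) - 1) * stride + ksize from the last layer back.
    """
    f = 1  # one unit of the final feature map sees itself
    for k, s in reversed(layers):
        f = (f - 1) * s + k
    return f

# Three stacked 3x3 convs with stride 1 (the VGG default):
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

With two stacked 3×3 layers the same function returns 5, matching the 5×5-replacement claim from the paper.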
[Figures: receptive field calculation and parameter comparison for stacked 3×3 convolutions]
Network analysis

For the commonly used configuration D, the input is a 224×224 RGB image. It then passes through: two 3×3 convolutional layers and a max-pooling downsampling layer; two more 3×3 convolutional layers and another max pooling; three 3×3 convolutional layers and a max pooling; three 3×3 convolutional layers and a max pooling; three 3×3 convolutional layers and a final max pooling; then three fully connected layers; and finally a softmax that produces the probability distribution. The original paper's table does not spell out the convolution and pooling parameters, so here is a supplement: every conv layer in the table uses stride=1 and padding=1, and every max-pooling layer uses a 2×2 window with stride=2. As a result, a 3×3 convolution leaves the height and width of its input unchanged, while each max-pooling downsampling halves them.
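The shape bookkeeping through configuration D can be traced with a small sketch (not from the original post; the `cfg` list, where numbers are conv output channels and `'M'` marks a max pooling, follows the common VGG-16 layer listing):

```python
def vgg16_shapes(size: int = 224):
    """Trace (channels, height, width) after each layer of VGG-16's
    convolutional part, assuming 3x3 convs with stride=1, padding=1
    (shape-preserving) and 2x2 max pooling with stride=2 (halves H and W)."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    shapes = []
    ch = 3  # RGB input
    for v in cfg:
        if v == 'M':
            size //= 2  # max pooling halves height and width
        else:
            ch = v      # conv changes channels, keeps height and width
        shapes.append((ch, size, size))
    return shapes

print(vgg16_shapes()[-1])  # (512, 7, 7)
```

The trace ends at 512×7×7, which is exactly the tensor that gets flattened and fed into the first fully connected layer (hence its 7 × 7 × 512 = 25088 input size).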

Note: the last layer in the network has no ReLU activation; its output goes through softmax, which converts the prediction into a probability distribution. The figure below shows the structure of the configuration-D VGG network, with different colors representing different layer types.
[Figure: structure of the VGG-16 (configuration D) network]

Source: blog.csdn.net/qq_42308217/article/details/110288987