In-depth interpretation of GoogLeNet's Inception V1

Purpose of GoogLeNet Design

The original design goal of GoogLeNet was to improve the utilization of computing resources inside the network.

Motivation

The larger a network is, the more parameters it has, and the more prone it is to overfitting, especially when the data set is small. Another disadvantage of larger networks is the dramatic increase in the computing resources they consume. For example, if two convolutional layers are connected in series, uniformly increasing the number of their filters leads to a quadratic increase in computation, much of which may be wasted. The solution to both problems is to replace fully connected structures with sparsely connected ones. Early on, traditional networks used random sparse connections to break the symmetry of the network and improve its learning ability, but computer hardware is very inefficient at non-uniform sparse computation, so AlexNet returned to dense connections to better exploit parallel computation. The Inception structure was therefore proposed: it approximates a sparse network structure while still relying on dense matrix operations for high-performance computation.

Inception structure

(a) The naive Inception module

1. The structure uses convolution kernels of different sizes: smaller kernels extract local features, while larger kernels approach more global features. Since kernels of different sizes have different receptive fields, combining them improves the performance of the network. For robustness, the features from all branches are finally merged by concatenation (see the sketch after this list).

2. The kernel sizes 1x1, 3x3, and 5x5 are chosen to make alignment easy: with a stride of 1, padding of 0, 1, and 2 respectively yields feature maps of the same spatial size after convolution, so the branch outputs can be concatenated directly.

3. Max pooling is also added to the structure. It operates on the output of the previous layer, and its purpose is presumably to provide some translation invariance.

4. In the higher layers of the network the features are more abstract and the receptive field is larger, so the proportion of 3x3 and 5x5 convolutions usually increases, which introduces a large number of parameters. The pooling branch makes this worse: its number of output channels equals the number of filters of the previous stage, so concatenating it inevitably inflates the number of output channels and, with it, the parameters of the following stage.
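Below is a minimal sketch of structure (a), assuming PyTorch; the branch channel counts (64, 128, 32 on a 28x28x192 input) are illustrative and are reused in the parameter calculation later on. Note how the pooling branch keeps all 192 input channels, which is exactly the expansion problem described in point 4.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Structure (a): parallel 1x1, 3x3, 5x5 convolutions and 3x3 max pooling,
    concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        # With stride 1, padding 0/1/2 keeps the spatial size identical across branches.
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # Every branch outputs the same HxW, so the feature maps can be concatenated directly.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)            # the 28x28x192 input used in the example below
y = NaiveInception(192, 64, 128, 32)(x)
print(y.shape)                             # torch.Size([1, 416, 28, 28]): 64+128+32+192 channels
```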

(b) The Inception module with dimensionality reduction

In order to solve the problem of too many parameters, 1x1 convolutions are introduced into Inception. The 1x1 convolution brings two benefits:

(1) Most importantly, the 1x1 convolution performs dimensionality reduction and removes the computational bottleneck. Suppose the input feature map of the original Inception module is 28x28x192, the 1x1 convolution has 64 output channels, the 3x3 convolution has 128, and the 5x5 convolution has 32; the convolution weights then amount to 1x1x192x64 + 3x3x192x128 + 5x5x192x32 parameters. After adding 1x1 reductions with 96 and 16 channels in structure (b), the weights become 1x1x192x64 + (1x1x192x96 + 3x3x96x128) + (1x1x192x16 + 5x5x16x32), roughly 40% of the original.
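The arithmetic can be checked with a few lines of plain Python (weights only, biases ignored; the channel numbers are the ones from the example above):

```python
# Weight count of a k x k convolution: k * k * in_channels * out_channels (biases ignored).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Structure (a): 1x1/64, 3x3/128 and 5x5/32 filters applied directly to the 192-channel input.
naive = conv_params(1, 192, 64) + conv_params(3, 192, 128) + conv_params(5, 192, 32)

# Structure (b): 1x1 reductions to 96 and 16 channels before the 3x3 and 5x5 convolutions.
reduced = (conv_params(1, 192, 64)
           + conv_params(1, 192, 96) + conv_params(3, 96, 128)
           + conv_params(1, 192, 16) + conv_params(5, 16, 32))

print(naive, reduced, reduced / naive)     # 387072 157184 0.406... -> roughly 40%
```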

(2) A nonlinear activation function, namely ReLU, is usually applied after the 1x1 convolution, which introduces additional nonlinear transformations and improves the representational power of the network.
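Putting the two points together, here is a sketch of structure (b), again assuming PyTorch; the channel numbers are those of inception(3a) in the paper, and every convolution, including the 1x1 reductions, is followed by a ReLU:

```python
import torch
import torch.nn as nn

def conv_relu(c_in, c_out, k, pad=0):
    # Every convolution, including the 1x1 reductions, is followed by a ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=k, padding=pad),
                         nn.ReLU(inplace=True))

class InceptionWithReduction(nn.Module):
    """Structure (b): 1x1 convolutions shrink the channel count before the expensive
    3x3 and 5x5 convolutions and project the pooling branch down as well."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = conv_relu(in_ch, c1, 1)
        self.b3 = nn.Sequential(conv_relu(in_ch, c3_red, 1), conv_relu(c3_red, c3, 3, pad=1))
        self.b5 = nn.Sequential(conv_relu(in_ch, c5_red, 1), conv_relu(c5_red, c5, 5, pad=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv_relu(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# inception(3a): 64 + 128 + 32 + 32 = 256 output channels on a 28x28 feature map.
block = InceptionWithReduction(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)    # torch.Size([1, 256, 28, 28])
```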

GoogLeNet

(Figure: the overall GoogLeNet architecture)

As can be seen from the figure, GoogLeNet is built by stacking Inception modules and reaches a depth of 22 layers. At the end of the network, the fully connected layers are replaced with an average pooling layer, which reduces the number of parameters and helps prevent overfitting. To keep the gradient from vanishing, two auxiliary softmax classifiers are attached to intermediate layers so that gradients also reach the lower layers during back-propagation; both branches are removed at test time. As for why the network starts with several plain convolutional and pooling layers instead of stacking Inception modules from the very beginning: the feature maps in the early layers are still spatially large, and plain convolution plus pooling shrinks them first, which reduces the number of parameters and the risk of overfitting.
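As a rough illustration of the auxiliary branches, here is a sketch of one auxiliary classifier head following the paper's description (average pooling, a 1x1 convolution with 128 filters, a 1024-unit fully connected layer with 70% dropout, and a softmax output); the pooling geometry here is a simplification, and the 512-channel, 14x14 input corresponds to an intermediate Inception output:

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """One auxiliary softmax branch, attached to an intermediate layer during training."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)        # paper: 5x5 average pooling, stride 3 -> 4x4 map
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(0.7)

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(self.relu(self.fc1(x)))
        return self.fc2(x)                         # logits for the auxiliary softmax loss

# During training the auxiliary losses are added to the main loss with a small weight
# (0.3 in the paper); at test time these branches are simply discarded.
aux = AuxClassifier(512)
print(aux(torch.randn(1, 512, 14, 14)).shape)      # torch.Size([1, 1000])
```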
