Rethinking the Inception Architecture for Computer Vision Paper Notes

0 Summary

Since 2014, deep convolutional networks have become the mainstream of computer vision solutions, and increases in model size and computational cost have translated directly into quality gains. But for scenarios such as mobile vision and big-data processing, computational efficiency and low parameter counts are still limiting factors. This paper explores ways to scale up networks so that the added computation is used as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

1 Introduction

Starting in 2014, network quality has been improved significantly by using deeper and wider networks. VGGNet and GoogLeNet achieved similarly high performance in the 2014 ILSVRC classification challenge. An interesting finding is that gains in classification performance tend to translate into significant quality gains across a wide variety of application domains. This means that architectural improvements in deep convolutional networks can be used to improve performance on most other computer vision tasks that increasingly rely on high-quality, learned visual features.

VGGNet has the appealing feature of architectural simplicity, but this comes at a high cost: evaluating the network requires a lot of computation. The Inception architecture, in contrast, is designed to perform well even under strict memory and computational budget constraints. GoogLeNet uses only about 5 million parameters, roughly 1/12 of AlexNet; VGGNet uses about 3 times as many parameters as AlexNet.

The computational cost of Inception is also much lower than that of VGGNet or its higher-performing successors. This makes Inception networks feasible in settings where large amounts of data need to be processed at reasonable cost, or where memory and computing power are inherently limited. However, the complexity of the Inception architecture makes it harder to modify the network: if the architecture is scaled up naively, most of the computational advantage is immediately lost, making it difficult to adapt to new use cases while preserving its efficiency. For example, if the capacity of an Inception model is increased by merely doubling the number of filters in every bank, both the parameter count and the computational cost grow by a factor of 4, which is unacceptable in many scenarios (a quick check of this arithmetic is sketched below). This paper therefore describes some general principles and optimization ideas for scaling up convolutional networks efficiently; these principles are not limited to Inception networks.
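A back-of-the-envelope check of the "doubling filters quadruples the cost" claim, using a hypothetical filter-bank width (the value 192 is my own choice for illustration, not taken from the paper):

```python
# Parameters of a convolutional layer (ignoring bias): kh * kw * c_in * c_out.
# Doubling both the input and output filter banks of a 3x3 convolution
# multiplies the parameter count (and the per-position computation) by 4.
def conv_params(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out

k = 192  # hypothetical filter-bank width
print(conv_params(3, 3, k, k))          # 331776
print(conv_params(3, 3, 2 * k, 2 * k))  # 1327104 -> exactly 4x
```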

2 Universal Design Principles

Principle 1

Representational bottlenecks need to be avoided when designing the network.
When information flows between layers, extreme compression should be avoided: the size of the representation should shrink only gradually from input to output, not abruptly. If the information is compressed too aggressively, a large amount of it is lost and training the model becomes difficult.
Theoretically, the information content of a representation cannot be judged merely by its dimensionality, because dimensionality ignores important factors such as correlation structure; it provides only a rough estimate of the information content. (Personal understanding, with a rough example: if the network input has shape 227x227x3 and the output of the first convolutional layer is 200x200x64, the "increase in dimension" is the growth of the channel count from 3 to 64. The information gained by adding channels cannot compensate for the information lost if the spatial size of the feature map is compressed too sharply.)

Principle 2

This one is less easy to understand. The original text reads:
Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.
Combined with the comments below Figure 7, my understanding is that for high-dimensional representations, sparsity (disentangled activations) is what matters most, and such representations are easier to process with local convolutions: use 1x1 or 3x3 kernels rather than overly large ones.

Principle 3

Spatial aggregation can be done on lower-dimensional embeddings without much, or any, loss in representational power. For example, before performing a larger spatial convolution (e.g. 3x3), the input can be reduced in dimension (typically with a 1x1 convolution) without serious adverse effects. Given that these signals should be easily compressible, the dimensionality reduction may even speed up learning. A sketch of this pattern is given below.
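A minimal PyTorch sketch of Principle 3, with channel counts chosen purely for illustration (they are not the paper's exact values): a 1x1 convolution reduces the embedding dimension before the 3x3 convolution does the spatial aggregation.

```python
import torch
import torch.nn as nn

# Reduce channels cheaply with 1x1, then aggregate spatially with 3x3.
reduce_then_aggregate = nn.Sequential(
    nn.Conv2d(288, 96, kernel_size=1),             # dimensionality reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 128, kernel_size=3, padding=1),  # spatial aggregation
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 288, 17, 17)  # hypothetical 17x17 feature map
print(reduce_then_aggregate(x).shape)  # torch.Size([1, 128, 17, 17])
```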

Principle 4

Balance the width and depth of the network. Model performance is maximized by distributing the computational budget in a balanced way between the depth and the width of the network.

3 Decomposing convolutions with large filter size

The excellent performance of GoogLeNet is largely due to its generous use of dimensionality reduction, which can be seen as a special case of decomposing convolutions: for example, a 1x1 convolution followed by a 3x3 convolution. This section explores ways of decomposing convolutions in various settings in order to improve computational efficiency. Because the Inception architecture is fully convolutional, each weight corresponds to one multiplication per activation, so reducing computation also means reducing the number of parameters. With a suitable decomposition we end up with more disentangled parameters, which speeds up training, and the saved computation and memory can be spent on larger filter banks to improve network performance.
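In my own notation (not the paper's): for a fully convolutional layer with kernel size kh x kw, C_in input channels, C_out output channels, and an H x W output grid,

```latex
\text{params} = k_h\, k_w\, C_{\text{in}}\, C_{\text{out}},
\qquad
\text{mults} = H\, W \cdot k_h\, k_w\, C_{\text{in}}\, C_{\text{out}} = H\, W \cdot \text{params}
```

so on a fixed output grid every parameter saved removes H·W multiplications; cutting parameters and cutting computation go hand in hand.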

3.1 Decomposition into smaller convolutions

Large convolutions (5x5 or 7x7) are computationally expensive. With the same number of filters, a 5x5 convolution is 25/9 ≈ 2.78 times more expensive than a 3x3 convolution. On the other hand, a 5x5 kernel has a wider "field of view" than a 3x3 kernel and can capture more context, so simply shrinking the kernel gives up expressiveness. The question is whether the 5x5 convolution can be replaced by a multi-layer network with fewer parameters but the same input size and output depth.

Looking at the computation graph of a 5x5 convolution, each output value is produced by a small fully connected network sliding over a 5x5 patch of the input; this can be replaced by a two-layer network of 3x3 convolutions. The computation is thereby reduced to (9+9)/25 = 72% of the original, a saving of 28% in both computation and parameters (a quick check of this figure appears below).

Does this replacement lead to any loss of expressiveness? And if the main goal is to decompose the linear part of the computation, should the first of the two layers keep a linear activation? Experiments show that using linear activations in the decomposition is always inferior to using rectified linear units throughout; that is, it is better to use ReLU everywhere. This gain is attributed to the extra nonlinearity, which enlarges the space of variations the network can learn and makes it better at learning features.
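A small check of the (9+9)/25 figure, using a hypothetical channel count (64 is my own choice, not the paper's):

```python
import torch.nn as nn

c = 64  # hypothetical number of input and output channels

# One 5x5 convolution vs. two stacked 3x3 convolutions covering the same
# receptive field with the same input/output depth.
conv5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2, bias=False)
two_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),  # ReLU between the two layers worked best in the paper's experiments
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(two_3x3) / count(conv5x5))  # 0.72 = 18/25, i.e. a 28% saving
```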

The following figure shows the structure of the Inception module in GoogLeNet
Inception module structure in GoogLeNet

The following figure shows the improved Inception module structure designed according to Principle 3
Designed according to principle 3, improved Inception module structure

3.2 Spatial decomposition into asymmetric convolutions

The above results suggest that convolution kernels larger than 3x3 may not be very useful, since they can always be reduced to stacks of 3x3 convolutions. Can a 3x3 convolution be decomposed into something even smaller, such as 2x2? It turns out that asymmetric convolutions such as n×1 do even better than 2x2. For example, a 3x1 convolution followed by a 1x3 convolution covers the same receptive field as a single 3x3 convolution, and with equal numbers of input and output filters this decomposition uses 33% fewer parameters. By comparison, decomposing a 3x3 convolution into two 2x2 convolutions saves only 11% of the computation.
In theory this can be taken further: any n×n convolution can be replaced by a 1×n convolution followed by an n×1 convolution, and the savings grow significantly with n. In practice, however, this decomposition does not work well in the early layers of the network. It gives very good results on medium grid sizes (m×m feature maps with m between roughly 12 and 20); at that level, a 1×7 convolution followed by a 7×1 convolution works very well, as sketched below.
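A sketch of the asymmetric decomposition on a 17x17 grid (the channel count of 128 is illustrative, not the paper's exact value):

```python
import torch
import torch.nn as nn

c = 128  # hypothetical channel count on a 17x17 grid

# 7x7 spatial aggregation decomposed into 1x7 followed by 7x1.
asym_7x7 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=(7, 1), padding=(3, 0), bias=False),
    nn.ReLU(inplace=True),
)
full_7x7 = nn.Conv2d(c, c, kernel_size=7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(asym_7x7) / count(full_7x7))  # (7+7)/49, roughly 0.29

x = torch.randn(1, c, 17, 17)
print(asym_7x7(x).shape)  # torch.Size([1, 128, 17, 17])
```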

The following figure shows the structure of the Inception module after asymmetric decomposition
Inception module structure after asymmetric convolution decomposition

4 Using Auxiliary Classifiers

GoogLeNet introduced auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers, making them immediately useful during back-propagation, counteracting the vanishing-gradient problem in very deep networks and thereby improving convergence during training. However, the paper finds that auxiliary classifiers do not improve convergence early in training: the training curves with and without the auxiliary branches look virtually identical until both models reach high accuracy. Only near the end of training does the network with the auxiliary branches begin to overtake the one without them, reaching a slightly higher plateau.
Removing the lower auxiliary classifier from GoogLeNet has no adverse effect on the quality of the final network. GoogLeNet's original hypothesis, that these branches help the evolution of low-level features, is therefore probably misplaced. Instead, the auxiliary classifiers act as regularizers: the main classifier performs better if the side branch is batch-normalized or contains a dropout layer.
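A rough sketch of an auxiliary head in this spirit, attached to an intermediate 17x17 feature map. The pooling size, channel counts, and loss weight below are assumptions chosen for illustration, not values copied from the paper; the point is simply a BN + dropout side branch used as a regularizer.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Side-branch classifier acting mainly as a regularizer (BN + dropout)."""
    def __init__(self, in_channels=768, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.bn = nn.BatchNorm2d(128)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(128 * 5 * 5, num_classes),
        )

    def forward(self, x):
        x = torch.relu(self.bn(self.conv(self.pool(x))))
        return self.head(x)

aux = AuxClassifier()
logits = aux(torch.randn(2, 768, 17, 17))
print(logits.shape)  # torch.Size([2, 1000])
# During training: total_loss = main_loss + aux_weight * aux_loss (e.g. aux_weight = 0.3)
```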

5 Efficient grid size reduction

Traditionally, convolutional networks use pooling operations to reduce the grid size of feature maps, and to avoid a representational bottleneck the number of filters is expanded before max or average pooling is applied. For example, starting from a d×d grid with k filters and aiming for a (d/2)×(d/2) grid with 2k filters, convolving with 2k filters at stride 1 and then pooling costs roughly 2d²k² operations, while pooling first and then convolving costs only about 2(d/2)²k² but squeezes the representation through a (d/2)×(d/2)×k bottleneck, violating Principle 1. The paper's alternative is to run a stride-2 convolution branch and a pooling branch in parallel and concatenate their outputs, which reduces the grid size cheaply without such a bottleneck; a sketch follows below.
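A hedged PyTorch sketch of this parallel reduction (the filter counts and the 35x35 input size are my own illustrative choices): one branch convolves with stride 2, the other pools with stride 2, and the outputs are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Shrink the grid while expanding channels: conv branch + pool branch in parallel."""
    def __init__(self, in_channels=288, conv_channels=288):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_channels, conv_channels, kernel_size=3, stride=2, bias=False),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        # Channels from both branches are concatenated, so the representation
        # widens (288 -> 576 here) while the spatial grid is halved.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 288, 35, 35)
print(GridReduction()(x).shape)  # torch.Size([1, 576, 17, 17])
```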
