Classic classification models (Part 5): Inception_v2_v3 (2015)

Rethinking the Inception Architecture for Computer Vision----2015_inception_v2v3

Abstract

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains in a variety of benchmarks. Although increased model size and computational cost tend to translate into immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and a low parameter count are still enabling factors for various use cases, such as mobile vision and big-data applications. Here we explore ways to scale up networks that aim to use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation, using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of four models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.

1.Introduction

Since Krizhevsky et al. [9] won the 2012 ImageNet competition, their network "AlexNet" has been successfully applied to a wide variety of computer vision tasks, such as object detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and super-resolution [3].

This success has inspired a new line of research focused on finding higher-performing convolutional neural networks. Starting in 2014, the quality of network architectures improved significantly through the use of deeper and wider networks. VGGNet [18] and GoogLeNet [20] achieved similarly high performance in the ILSVRC 2014 classification challenge [16]. An interesting observation is that gains in classification performance tend to translate into significant quality gains across a wide range of application domains. This means that architectural improvements in deep convolutional networks can be used to improve performance in most other computer vision tasks that increasingly rely on high-quality learned visual features. Moreover, improvements in network quality opened up new application domains for convolutional networks in cases where AlexNet features could not compete with hand-engineered solutions, for example proposal generation in detection [4].

Although VGGNet [18] has the compelling feature of architectural simplicity, it comes at a high cost: evaluating the network requires a lot of computation. On the other hand, the Inception architecture of GoogLeNet [20] was designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet uses only 5 million parameters, a 12x reduction compared to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet uses about 3 times more parameters than AlexNet.

The computational cost of Inception is also much lower than that of VGGNet or its higher-performing successors [6]. This makes it feasible to use Inception networks in big-data scenarios [17], [13], where huge amounts of data need to be processed at reasonable cost, or in scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings. It is certainly possible to mitigate these issues with specialized solutions targeting memory usage [2], [15], or by optimizing the execution of certain operations with computational tricks [10]. However, these methods add extra complexity. Furthermore, these methods could also be applied to optimize the Inception architecture, widening the efficiency gap again.

Nevertheless, the complexity of the Inception architecture makes it more difficult to make changes to the network. If the architecture is scaled up naively, a large part of the computational gains can be immediately lost. Also, [20] does not provide a clear description of the contributing factors behind the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt the architecture to new use cases while maintaining its efficiency. For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of doubling the size of all filter banks leads to a 4x increase in both the number of parameters and the computational cost. In many practical cases this may prove prohibitive or unjustified, especially if the associated gains are modest. In this paper, we start by describing a few general principles and optimization ideas that prove useful for scaling up convolutional networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context, because the generic structure of Inception-style building blocks is flexible enough to incorporate these constraints naturally. This is enabled by the generous use of dimensionality reduction and the parallel structure of the Inception modules, which mitigate the impact of structural changes on nearby components. Still, one needs to proceed with caution, as some guiding principles should be observed to maintain high model quality.
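
To make the 4x figure concrete: the weight count of a convolutional layer scales with the product of its input and output channel counts, so doubling every filter bank doubles both factors at once. A minimal sketch in plain Python, with illustrative (assumed) channel numbers:

```python
def conv_params(k, in_ch, out_ch):
    """Weights in a k x k convolution with in_ch inputs and out_ch filters (biases ignored)."""
    return k * k * in_ch * out_ch

base = conv_params(3, 192, 192)      # a 3x3 filter bank, 192 -> 192 channels (illustrative)
doubled = conv_params(3, 384, 384)   # doubling both the input and the output width

print(base, doubled, doubled / base)  # the ratio is 4.0: cost and parameters quadruple
```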

2.General Design Principles

Here we describe a few design principles based on large-scale experimentation with various architectural choices for convolutional networks. At this point, the practical principles below are speculative, and additional empirical evidence will be needed in the future to assess their accuracy and validity. Still, serious deviations from these principles tended to result in a decline in network quality, while fixing the situations where such deviations were detected generally resulted in improved architectures.

1. Avoid representational bottlenecks, especially early in the network. A feed-forward network can be represented by an acyclic graph from the input layer to the classifier or regressor. This defines a clear direction for the flow of information. For any cut separating the inputs from the outputs, one can consider the amount of information passing through the cut. Extreme compression bottlenecks should be avoided. In general, the representation size should decrease gently from the inputs to the outputs before reaching the final representation used for the task at hand. In theory, information content cannot be assessed merely by the dimensionality of the representation, because it discards important factors such as correlation structure; dimensionality provides only a rough estimate of information content.

2. Higher-dimensional representations are easier to process locally within a network. Increasing the number of activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.

3. Spatial aggregation can be done over lower-dimensional embeddings without losing much representational power. For example, before performing a more spread-out (e.g. 3×3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for this is the strong correlation between adjacent units, which leads to much less loss of information during dimension reduction if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction may even promote faster learning.

4. Balance the width and depth of the network. Optimal performance can be reached by balancing the number of filters per stage and the depth of the network. Increasing both the width and the depth of the network can contribute to higher quality. However, the optimal improvement for a constant amount of computation is reached if both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and the width of the network.

Although these principles may be reasonable, using them to improve the quality of existing networks is not straightforward. The idea is to apply them judiciously, and only in ambiguous or borderline situations.

3. Factorizing Convolutions with Large Filter Size

Much of the original gain of the GoogLeNet network [20] arises from its generous use of dimensionality reduction. This can be viewed as a special case of factorizing convolutions in a computationally efficient way. Consider, for example, the case of a 1×1 convolutional layer followed by a 3×3 convolutional layer. In a vision network, it is expected that the outputs of nearby activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation, and that this should result in similarly expressive local representations.
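
A minimal PyTorch sketch of this dimension-reduction pattern (a 1×1 reduction feeding a 3×3 convolution); the channel sizes are illustrative assumptions, not the exact GoogLeNet settings:

```python
import torch
from torch import nn

class ReduceThenConv(nn.Module):
    """1x1 channel reduction followed by a 3x3 convolution, as used inside Inception branches."""
    def __init__(self, in_ch=256, reduce_ch=64, out_ch=96):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, reduce_ch, kernel_size=1)          # cheap channel reduction
        self.conv = nn.Conv2d(reduce_ch, out_ch, kernel_size=3, padding=1)  # spatial aggregation
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.act(self.reduce(x))))

x = torch.randn(1, 256, 35, 35)
print(ReduceThenConv()(x).shape)  # torch.Size([1, 96, 35, 35])
```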

In this section, we explore other ways of factorizing convolutions in various settings, especially in order to improve the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost also results in a reduced number of parameters. This means that, with suitable factorization, we can end up with more disentangled parameters and therefore faster training. We can also use the computational and memory savings to increase the filter-bank sizes of our network while maintaining the ability to train each model replica on a single computer.

3.1. Factorization into smaller convolutions

In terms of computation, convolutions with larger spatial filters (e.g. 5×5 or 7×7) tend to be disproportionally expensive. For example, a 5×5 convolution with n filters over a grid with m filters is 25/9 = 2.78 times more computationally expensive than a 3×3 convolution with the same number of filters. Of course, a 5×5 filter can capture dependencies between activations of units that are farther apart in the earlier layers, so reducing the geometric size of the filters comes at a large cost in expressiveness. However, we can ask whether a 5×5 convolution could be replaced by a multi-layer network with fewer parameters, with the same input size and output depth. If we zoom into the computation of the 5×5 convolution, we see that each output looks like a small fully connected network sliding over 5×5 tiles of its input (see Figure 1). Since we are building a vision network, it seems natural to exploit translation invariance again and replace the fully connected component with a two-layer convolutional structure: the first layer is a 3×3 convolution, and the second is a fully connected layer on top of the 3×3 output grid of the first layer (see Figure 1). Sliding this small network over the input activation grid boils down to replacing the 5×5 convolution with two layers of 3×3 convolution (compare Figures 4 and 5).

This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational savings, we make a few simplifying assumptions that apply to typical situations: we assume that n = αm, i.e. that we want to change the number of activations per unit by a constant factor α. Since the 5×5 convolution is aggregating, α is typically slightly larger than one (around 1.5 in the case of GoogLeNet). With a two-layer replacement for the 5×5 layer, it seems reasonable to reach this expansion in two steps, increasing the number of filters by √α in each step. To simplify the estimate we choose α = 1 (no expansion). If we naively slid this small network over the grid without reusing computation between neighboring tiles, we would increase the computational cost; instead, sliding the network can be represented as two 3×3 convolutional layers that reuse the activations between adjacent tiles. This way, we end up with a net reduction of computation by a factor of (9+9)/25, a relative gain of 28% from this factorization. The exact same saving holds for the parameter count, since each parameter is used exactly once in the computation of each activation. Still, this setup raises two general questions: does this replacement result in any loss of expressiveness? And if our main goal is to factorize the linear part of the computation, would it not be better to keep the activation of the first layer linear? We ran several control experiments (see, for example, Figure 2), and using a linear activation was always inferior to using rectified linear units at all stages of the factorization. We attribute this gain to the enhanced space of variations the network can learn, especially if we batch-normalize [7] the output activations. A similar effect can be seen when using linear activations for the dimension-reduction components.
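
As a concrete illustration of this factorization, the sketch below (PyTorch, with an assumed channel width of 64 and α = 1) builds both versions and compares their weight counts; the two-layer version preserves the spatial output size while costing 18/25 of the 5×5 layer, i.e. the 28% saving mentioned above:

```python
import torch
from torch import nn

m = 64  # illustrative channel width, same number of input and output filters (alpha = 1)

five_by_five = nn.Conv2d(m, m, kernel_size=5, padding=2)
two_three_by_three = nn.Sequential(
    nn.Conv2d(m, m, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),                      # rectified linear unit between the two layers
    nn.Conv2d(m, m, kernel_size=3, padding=1),
)

x = torch.randn(1, m, 17, 17)
assert five_by_five(x).shape == two_three_by_three(x).shape  # same input size and output depth

def weight_count(module):
    return sum(p.numel() for name, p in module.named_parameters() if "weight" in name)

# 5*5 = 25 weights per unit versus 3*3 + 3*3 = 18: an 18/25 ratio, i.e. roughly 28% savings.
print(weight_count(five_by_five), weight_count(two_three_by_three))
```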

3.2. Spatial factorization into asymmetric convolutions

The above results suggest that convolutions with filters larger than 3×3 might not generally be useful, since they can always be reduced to a sequence of 3×3 convolutional layers. We can still ask whether they should be factorized into even smaller convolutions, for example 2×2 convolutions. However, it turns out one can do even better than 2×2 by using asymmetric convolutions, e.g. n×1. For example, using a 3×1 convolution followed by a 1×3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3×3 convolution (see Figure 3). If the numbers of input and output filters are equal, the two-layer solution is still 33% cheaper for the same number of output filters. By comparison, factorizing a 3×3 convolution into two 2×2 convolutions saves only 11% of the computation.

In theory, we could go further and replace any n×n convolution with a 1×n convolution followed by an n×1 convolution, and the computational cost saving increases dramatically as n grows (see Figure 6). In practice, we have found that this factorization does not work well on early layers, but it gives very good results on medium grid sizes (on m×m feature maps, where m ranges between 12 and 20). At that level, very good results can be achieved by using 1×7 convolutions followed by 7×1 convolutions.
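
A minimal PyTorch sketch of this asymmetric factorization, with an assumed channel width and a 17×17 grid chosen to fall in the 12 to 20 range mentioned above:

```python
import torch
from torch import nn

c = 128  # illustrative channel width

# A 7x7 receptive field built from a 1x7 convolution followed by a 7x1 convolution.
asymmetric_7x7 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, c, 17, 17)       # a medium grid size where this factorization works well
print(asymmetric_7x7(x).shape)      # spatial size is preserved: torch.Size([1, 128, 17, 17])

# Per-unit weight cost: 1*7 + 7*1 = 14 versus 7*7 = 49 for a full 7x7 convolution.
```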

4.Utility of Auxiliary Classifiers

[20] introduced the notion of auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers, making them immediately useful, and to improve convergence during training by combating the vanishing gradient problem in very deep networks. Lee et al. [11] also argue that auxiliary classifiers promote more stable learning and better convergence. Interestingly, we found that auxiliary classifiers did not improve convergence early in training: the training progression of networks with and without side heads looks virtually identical before both models reach high accuracy. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.

Also, [20] used two side heads at different stages of the network. Removing the lower auxiliary branch did not have any adverse effect on the final quality of the network. Together with the observation in the previous paragraph, this means that the original hypothesis of [20], that these branches help the evolution of low-level features, is most likely misplaced. Instead, we argue that the auxiliary classifiers act as regularizers. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized [7] or has a dropout layer. It also gives weak supporting evidence for the conjecture that batch normalization acts as a regularizer.
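
The exact structure of the side head is not given in this excerpt, so the sketch below only illustrates the idea of a batch-normalized auxiliary classifier attached to an intermediate feature map; every layer size here is an assumption for illustration:

```python
import torch
from torch import nn

class AuxHead(nn.Module):
    """Sketch of a batch-normalized auxiliary classifier on an intermediate feature map.
    Layer sizes are illustrative assumptions, not the paper's exact side head."""
    def __init__(self, in_ch=768, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=1),
            nn.BatchNorm2d(128),      # batch-normalizing the side branch helps the main classifier
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classify(self.reduce(self.pool(x)))

feat = torch.randn(2, 768, 17, 17)   # an intermediate 17x17 feature map
print(AuxHead()(feat).shape)         # torch.Size([2, 1000])
```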

5.Efficient Grid Size Reduction

Traditionally, convolutional networks use some pooling operation to decrease the grid size of the feature maps. To avoid a representational bottleneck, the activation dimension of the network filters is expanded before applying maximum or average pooling. For example, starting with a d×d grid with k filters, if we want to arrive at a d/2 × d/2 grid with 2k filters, we first need to compute a stride-1 convolution with 2k filters and then apply an additional pooling step. This means the overall computational cost is dominated by the expensive convolution on the larger grid, using 2d²k² operations. One possibility would be to switch to pooling followed by convolution, resulting in 2(d/2)²k² operations and cutting the computational cost to a quarter. However, this creates a representational bottleneck, because the overall dimensionality of the representation drops to (d/2)²k, resulting in a less expressive network (see Figure 9). Instead, we suggest another variant that further reduces the computational cost while removing the representational bottleneck (see Figure 10). We can use two parallel stride-2 blocks, P and C: P is a pooling layer (average or maximum pooling) over the activations, both blocks have stride 2, and their filter banks are concatenated as shown in Figure 10.
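
A minimal PyTorch sketch of the parallel stride-2 reduction described above; the branch widths are illustrative assumptions and do not reproduce the exact filter-bank sizes of Table 1:

```python
import torch
from torch import nn

class GridReduction(nn.Module):
    """Sketch of the parallel stride-2 reduction: a convolution branch C and a pooling
    branch P, both with stride 2, whose outputs are concatenated."""
    def __init__(self, in_ch=288, conv_ch=384):
        super().__init__()
        self.conv_branch = nn.Sequential(                        # C: stride-2 convolution
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)  # P: stride-2 pooling

    def forward(self, x):
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 288, 35, 35)
print(GridReduction()(x).shape)  # torch.Size([1, 672, 17, 17]): halved grid, expanded channels
```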

6.Inception-v2
Here Insert Picture Description
Here we connect the dots from above and propose a new architecture with improved performance on the ILSVRC 2012 classification benchmark. The layout of our network is given in Table 1. Note that we have factorized the traditional 7×7 convolution into three 3×3 convolutions, based on the same ideas described in Section 3.1. For the Inception part of the network, we have 3 traditional Inception modules at the 35×35 level, each with 288 filters. Using the grid-reduction technique described in Section 5, this is reduced to a 17×17 grid with 768 filters. This is followed by 5 instances of the factorized Inception modules as depicted in Figure 5. This is reduced to an 8×8×1280 grid using the grid-reduction technique shown in Figure 10. At the coarsest 8×8 level, we have two Inception modules as depicted in Figure 6, with a concatenated output filter-bank size of 2048 for each tile. The detailed structure of the network, including the sizes of the filter banks inside the Inception modules, is given in the supplementary material, in the model.txt file included in the tar file of this submission.
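
As a small illustration of the factored stem mentioned above (the traditional 7×7 convolution replaced by three stacked 3×3 convolutions), here is a PyTorch sketch; the channel widths and the 299×299 input size are assumptions for illustration:

```python
import torch
from torch import nn

# Sketch: the 7x7 stride-2 stem convolution replaced by three 3x3 convolutions.
factored_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2),    # the first 3x3 carries the stride
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 299, 299)                   # assumed input resolution
print(factored_stem(x).shape)                     # torch.Size([1, 64, 147, 147])
```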

However, we have observed that the quality of the network is relatively stable with respect to such variations, as long as the principles from Section 2 are observed. Although our network is 42 layers deep, its computational cost is only about 2.5 times higher than that of GoogLeNet, and it is still much more efficient than VGGNet.

11.Conclusions

We have provided several design principles for scaling up convolutional networks and studied them in the context of the Inception architecture. This guidance can lead to high-performance vision networks with a relatively modest computational cost compared to simpler, more monolithic architectures. Our highest-quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark, setting a new state of the art. This is achieved with a relatively modest (2.5x) increase in computational cost compared to the network described by Ioffe et al. [7]. Our solution still uses much less computation than the best published results based on denser networks: our model outperforms the results of He et al. [6], cutting the top-5 (top-1) error by 25% (14%) relative, respectively, while being six times cheaper computationally and using at least five times fewer parameters (estimated). Our ensemble of four Inception-v3 models reaches 3.5% top-5 error with multi-crop evaluation, which is a reduction of more than 25% over the best published result and almost half the error of the GoogLeNet ensemble that won ILSVRC 2014.

We also demonstrated that high-quality results can be reached with receptive fields as small as 79×79, which may be helpful for systems detecting relatively small objects. We have studied how factorizing convolutions and aggressive dimension reduction inside a neural network can result in networks with relatively low computational cost while maintaining high quality. The combination of a lower parameter count and additional regularization from batch-normalized auxiliary classifiers and label smoothing allows high-quality networks to be trained on training sets of relatively modest size.
