Classic classification models (2): VGGNet (2014)

Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet, 2014)


Abstract

In this work we investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

1. Introduction

Convolutional networks (ConvNets) have recently enjoyed great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014), made possible by large public image repositories, such as ImageNet, and by high-performance computing systems, such as GPUs and large-scale distributed clusters (Dean et al., 2012). In particular, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014) has played an important role in the advance of deep visual recognition architectures, serving as a testbed for several generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010), the ILSVRC-2011 winner, to deep ConvNets (Krizhevsky et al., 2012), the ILSVRC-2012 winner.

With ConvNets becoming more of a commodity in computer vision, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window sizes and a smaller stride in the first convolutional layer. Another line of improvements deals with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design, namely its depth. To this end, we fix the other parameters of the architecture and steadily increase the depth of the network by adding more convolutional layers, which is feasible because we use very small (3 × 3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the best accuracy on the ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.

The rest of the paper is organised as follows. In Section 2, we describe our ConvNet configurations. The details of image classification training and evaluation are presented in Section 3, and the configurations are compared on the ILSVRC classification task in Section 4. Section 5 concludes the paper. For completeness, we describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

2. ConvNet Configurations

To measure the improvement brought by increased ConvNet depth in a fair setting, all of our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011) and Krizhevsky et al. (2012). In this section, we first describe the generic layout of our ConvNet configurations (Section 2.1) and then detail the specific configurations used in the evaluation (Section 2.2). Our design choices are then discussed and compared to the prior art in Section 2.3.

2.1 Architecture

During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (the smallest size that still captures the notions of left/right, up/down, and centre). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of the conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
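To make the layer arithmetic concrete, here is a minimal sketch of one such conv-pool block in PyTorch (our choice for illustration; the paper's own implementation was based on Caffe): 3 × 3 convolutions with stride 1 and 1-pixel padding preserve the spatial resolution, while 2 × 2 max-pooling with stride 2 halves it.

```python
import torch
import torch.nn as nn

# A minimal sketch of one VGG-style block, assuming PyTorch
# (the paper's own implementation was based on Caffe).
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),  # 3x3 conv, resolution preserved
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                 # halves spatial resolution
)

x = torch.randn(1, 3, 224, 224)  # fixed-size RGB input
print(block(x).shape)            # torch.Size([1, 64, 112, 112])
```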

The stack of convolutional layers (which has a different depth in different architectures) is followed by three fully-connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one per class). The final layer is the soft-max layer. The configuration of the fully-connected layers is the same in all networks.
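The shared classifier head can be sketched the same way (a hypothetical rendering; the 512 × 7 × 7 input size follows from applying five stride-2 poolings to a 224 × 224 image):

```python
import torch.nn as nn

# The shared classifier head, as a sketch: after five 2x2/stride-2 poolings,
# a 224x224 input yields a 7x7x512 feature map (512*7*7 = 25088 inputs).
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),        # one channel per ILSVRC class
    nn.Softmax(dim=1),            # the final soft-max layer
)
```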

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Section 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters of the LRN layer are those of (Krizhevsky et al., 2012).

2.2 Configurations

The ConvNet configurations evaluated in this paper are outlined in Table 1, one per column. In what follows, we refer to the networks by their names (A-E). All configurations follow the generic design presented in Section 2.1 and differ only in depth: from 11 weight layers in network A (8 conv. and 3 FC layers) to 19 weight layers in network E (16 conv. and 3 FC layers). The width of the conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then doubling after each max-pooling layer, until it reaches 512.

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from left (A) to right (E) as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The ReLU activation function is not shown for brevity.
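Since Table 1 appears only as an image in the original post, the configurations can be written out in code. The following notation is reconstructed from the published paper: numbers denote output channels of 3 × 3 conv. layers, 'M' marks a max-pooling layer, and the tuples in configuration C mark its additional 1 × 1 conv. layers. The builder function is a sketch, not the authors' code.

```python
import torch.nn as nn

# Table 1 reconstructed from the published paper: numbers are output channels
# of 3x3 conv layers, 'M' is a 2x2/stride-2 max-pool, ('1x1', n) is a 1x1 conv.
# (The paper also lists A-LRN: configuration A with one LRN layer added.)
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'C': [64, 64, 'M', 128, 128, 'M', 256, 256, ('1x1', 256), 'M',
          512, 512, ('1x1', 512), 'M', 512, 512, ('1x1', 512), 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
          512, 512, 512, 'M', 512, 512, 512, 'M'],              # the well-known VGG-16
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
          512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],    # VGG-19
}

def make_features(cfg):
    """Build the convolutional part of a configuration (a sketch)."""
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            k, out_ch = (1, v[1]) if isinstance(v, tuple) else (3, v)
            layers += [nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.ReLU(inplace=True)]
            in_ch = out_ch
    return nn.Sequential(*layers)
```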
Table 2: Number of parameters (in millions): A and A-LRN, 133; B, 133; C, 134; D, 138; E, 144.
In Table 2, we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

2.3 Discussion

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack have C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
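The parameter arithmetic is easy to check numerically; a small sketch for an arbitrary channel count C:

```python
# Parameter count of a stack of three 3x3 conv layers vs. a single 7x7 layer,
# both mapping C channels to C channels (biases ignored, as in the paper's count).
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)       # 27 C^2
single_7x7 = 7 * 7 * C * C              # 49 C^2
print(stacked_3x3, single_7x7)          # 1769472 3211264
print(single_7x7 / stacked_3x3 - 1)     # ~0.81, i.e. 81% more parameters
```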

The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto a space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1 × 1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), the top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Section 4.5, our model outperforms that of Szegedy et al. (2014) in terms of single-network classification accuracy.

3. Classification Framework

In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

3.1 Training

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent with momentum (based on back-propagation (LeCun et al., 1989)). The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5·10⁻⁴) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10⁻², and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
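In present-day terms the recipe corresponds to roughly the following PyTorch sketch (the hyper-parameter values are the ones reported above; the stand-in model is hypothetical):

```python
import torch.nn as nn
import torch.optim as optim

# Training recipe sketch with the paper's reported hyper-parameters; the
# stand-in model is hypothetical (any configuration above would be used).
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
optimizer = optim.SGD(model.parameters(), lr=1e-2,      # initial learning rate 10^-2
                      momentum=0.9, weight_decay=5e-4)  # L2 penalty multiplier
# Learning rate decreased by a factor of 10 when validation accuracy plateaus:
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)
criterion = nn.CrossEntropyLoss()  # multinomial logistic regression objective
# Batch size 256 belongs in the DataLoader; dropout (ratio 0.5) would follow
# the first two FC layers of the classifier head.
```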

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradients in deep nets. To circumvent this problem, we began with training configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and 10⁻² variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
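As a sketch, the paper's random initialisation scheme (and the Glorot & Bengio alternative mentioned above) might look as follows; the stand-in network is hypothetical:

```python
import torch.nn as nn

def init_weights(module):
    # Paper's scheme: weights ~ N(0, variance 1e-2), i.e. std 0.1; biases zero.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    # Post-submission alternative (Glorot & Bengio, 2010), which removes the
    # need for pre-training: nn.init.xavier_uniform_(module.weight)

net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Linear(64, 10))  # stand-in
net.apply(init_weights)
```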

To obtain the fixed-size 224 × 224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.

Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take any value not less than 224: for S = 224 the crop captures whole-image statistics, completely spanning the smallest side of the training image; for S much greater than 224 the crop corresponds to a small part of the image, containing a small object or an object part.

We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10⁻³.

The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [S_min, S_max] (we used S_min = 256 and S_max = 512). Since objects in images can be of different sizes, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
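A hedged torchvision rendering of this scale jittering (resize the shortest side to a random S in [256, 512], then take one random 224 × 224 crop with flipping; the ScaleJitter helper is our own illustration, not the authors' code):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class ScaleJitter:
    """Rescale so the shortest image side is a random S in [smin, smax] (a sketch)."""
    def __init__(self, smin=256, smax=512):
        self.smin, self.smax = smin, smax
    def __call__(self, img):
        S = random.randint(self.smin, self.smax)  # sampled anew for every image
        return TF.resize(img, S)                  # isotropic rescale: shortest side = S

train_tf = T.Compose([
    ScaleJitter(256, 512),       # multi-scale training: S in [S_min, S_max]
    T.RandomCrop(224),           # one 224x224 crop per SGD iteration
    T.RandomHorizontalFlip(),    # random horizontal flipping
    T.ToTensor(),                # (the paper also applies a random RGB colour shift)
])
```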

3.2 Testing

At test time, given a trained ConvNet and an input image, it is classified in the following way. First, the image is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we show in Section 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
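The FC-to-conv conversion can be sketched as follows; the weights are copied by reshaping, so the convolutional form computes exactly the same function as the original head (a hypothetical rendering):

```python
import torch
import torch.nn as nn

# Dense-evaluation sketch: convert the FC head to conv layers so the net can
# be applied to whole images of arbitrary size.
fc1, fc2, fc3 = nn.Linear(512 * 7 * 7, 4096), nn.Linear(4096, 4096), nn.Linear(4096, 1000)

conv1 = nn.Conv2d(512, 4096, kernel_size=7)   # first FC  -> 7x7 conv
conv2 = nn.Conv2d(4096, 4096, kernel_size=1)  # second FC -> 1x1 conv
conv3 = nn.Conv2d(4096, 1000, kernel_size=1)  # third FC  -> 1x1 conv
conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7); conv1.bias.data = fc1.bias.data
conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1); conv2.bias.data = fc2.bias.data
conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1); conv3.bias.data = fc3.bias.data

# On a larger input the output is a class score map; spatial averaging then
# yields one fixed-size score vector per image.
feat = torch.randn(1, 512, 9, 9)       # feature map of an uncropped, larger image
scores = conv3(conv2(conv1(feat).relu()).relu())
print(scores.shape)                    # torch.Size([1, 1000, 3, 3])
print(scores.mean(dim=(2, 3)).shape)   # torch.Size([1, 1000])
```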

Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of the image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (a 5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to the 144 crops over 4 scales used by Szegedy et al. (2014).

3.3 Implementation Details

Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as training and evaluation on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
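Conceptually, the scheme corresponds to what nn.DataParallel does in present-day PyTorch (a rough analogy; the authors modified Caffe directly):

```python
import torch
import torch.nn as nn

# Synchronous data parallelism, sketched: each GPU processes a slice of the
# batch, gradients are averaged, so results match single-GPU training exactly.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in net
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits every input batch across the GPUs
```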

While more sophisticated methods of speeding up ConvNet training have recently been proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we found that our conceptually much simpler scheme already provided a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2-3 weeks depending on the architecture.

4. Classification Experiments

Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for the ILSVRC 2012-2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
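Both measures are straightforward to compute from predicted class scores; a minimal sketch:

```python
import torch

def topk_error(scores, labels, k):
    """Fraction of images whose ground-truth class is outside the top-k predictions."""
    topk = scores.topk(k, dim=1).indices            # (N, k) predicted classes
    hit = (topk == labels.unsqueeze(1)).any(dim=1)  # ground truth among top k?
    return 1.0 - hit.float().mean().item()

scores = torch.randn(4, 1000)            # toy scores for 4 images, 1000 classes
labels = torch.tensor([3, 17, 512, 999])
print(topk_error(scores, labels, 1))     # top-1 error
print(topk_error(scores, labels, 5))     # top-5 error
```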

For most of the experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a "VGG" team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).

5. Conclusion

In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results once again confirm the importance of depth in visual representations.



Origin: blog.csdn.net/qq_18315295/article/details/103567957