Paper Notes: OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks


ICLR 2014
New York University, Yann LeCun's group
Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun
Overview (What)

OverFeat won the localization task of the 2013 ImageNet challenge (ILSVRC 2013) and also achieved strong results on the classification and detection tasks.

  1. It uses a single shared CNN to handle three tasks at once: image classification, localization, and detection; sharing features improves performance on all three.
  2. It efficiently implements a multi-scale, sliding-window approach within a CNN.
  3. It proposes a method of producing bounding boxes by accumulating predictions, rather than relying on traditional non-maximum suppression.

Motivation (Why)

Although in much of the ImageNet data the object roughly fills the image, the size and position of the target still vary considerably from image to image. There are several approaches to this problem.

  1. Slide fixed-size windows over many positions and run the CNN on each window. The drawback is that a window may contain only part of the object (say, a dog's head) rather than the whole object or even its center; this is acceptable for classification but poor for localization and detection.
  2. Train a convolutional network that produces not only a class distribution but also a predicted bounding box (the target's size and position).
  3. Accumulate the confidence of each category at every location and scale.

AlexNet showed that a CNN can achieve excellent performance on image classification and localization tasks, but its authors did not publicly describe their localization method.
This paper is the first to explain clearly how a CNN can be used for localization and detection.

Visual tasks (How)

The paper covers the three main tasks of image understanding, in order of increasing difficulty:

  1. Classification: assign each image a label saying what object it contains. A prediction counts as correct if the true label is among the five highest-scoring classes (top-5).
  2. Localization: in addition to the label, predict the size and position of the target; the predicted bounding box must overlap the ground-truth box by at least a threshold (for example, an intersection-over-union of at least 0.5). There is likewise a top-5 metric: among the five predictions, at least one must have both the correct class and a box satisfying the overlap condition (see the metric sketch after this list).
  3. Detection: an image may contain many target objects, and all of them must be found (classification plus localization). Classification and localization use the same dataset; detection uses an additional dataset, in which objects can be smaller.
  4. The rest of this note is organized by task (classification, localization, detection). The main focus is the classification task, followed by localization; since the paper says little about the specifics of detection, only a few remarks are made about it.
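
A minimal sketch of the two evaluation checks described above, assuming raw class scores and boxes in (x1, y1, x2, y2) form (the helper names are mine, not the paper's):

```python
import numpy as np

def top5_correct(class_scores, true_label):
    """Top-5 check: correct if the true label is among the five
    highest-scoring classes."""
    return true_label in np.argsort(class_scores)[-5:]

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2);
    localization counts as correct only when the class matches and
    iou(pred, truth) >= 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```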

Classification task

  1. The network architecture is very similar to AlexNet; the paper improves on AlexNet's design and its test-time procedure. The paper presents two versions of the network, a fast one and an accurate one; the figure below shows the architecture of the accurate version.
    [Figure: network architecture of the accurate model]
  2. The network resembles AlexNet, with a few differences: it does not use contrast normalization; it does not use overlapping pooling; and the first layer uses a stride of 2 instead of AlexNet's 4 (a larger stride increases speed but reduces accuracy).
  3. The biggest difference from AlexNet lies in the prediction method used at test time.
  4. At test time, AlexNet crops a 256x256 image into five 227x227 crops (the four corners and the center) plus their horizontal flips, giving 10 images of 227x227; these are fed through the network, and the 10 results are averaged into the prediction (see the sketch after this list). This approach has two problems: cropping may ignore some regions of the image, and the 10 crops overlap heavily, causing redundant computation.
  5. At test time, OverFeat instead uses a multi-scale, sliding-window method (the experiments use up to six different input scales). This is the paper's biggest innovation.
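
For reference, a sketch of the AlexNet-style 10-crop averaging described in item 4, assuming a PyTorch model that maps a (N, 3, 227, 227) batch to class scores and an input tensor of shape (3, 256, 256) (both are my assumptions, not code from the paper):

```python
import torch

def ten_crop_predict(model, image_256):
    """Average predictions over 5 crops (4 corners + center) and their
    horizontal flips, i.e. 10 views of a 256x256 image."""
    c, (H, W) = 227, image_256.shape[1:]
    offsets = [(0, 0), (0, W - c), (H - c, 0), (H - c, W - c),
               ((H - c) // 2, (W - c) // 2)]
    crops = [image_256[:, y:y + c, x:x + c] for y, x in offsets]
    crops += [torch.flip(v, dims=[2]) for v in crops]   # horizontal flips
    with torch.no_grad():
        scores = model(torch.stack(crops))              # (10, num_classes)
    return scores.mean(dim=0)                           # averaged prediction
```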

Multi-scale classification - fully convolutional ("fully convolutional" here means the fully connected layers are all replaced by convolutional layers)

  1. The image size shown in the architecture above is the training size; at test time, six different input sizes are used, so the test-time sizes certainly differ.
  2. What "fully convolutional" means: the three fully connected layers in the figure above are actually implemented as convolutions, and any fully connected layer can be converted into a convolution. For example, a fully connected layer whose input is a 5x5x1024 feature map and whose output is 4096 values has 5x5x1024x4096 parameters; converted into a convolutional layer, the kernel size is 5x5x1024 and there are 4096 kernels, so the parameter count is exactly the same (see the conversion sketch after the figure below).
  3. What full convolution achieves: as shown below, convolving a 14x14 image yields a 5x5 feature map. If a fully connected layer were used next, the feature map would be flattened, destroying the spatial relationships and collapsing the map directly into a feature vector. With full convolution instead, the result is a 1x1xC feature map, where C is the number of channels and also the number of classes. If a 16x16 image comes in, the full convolution produces a 2x2xC feature map; the four values in each channel can be reduced to one by taking the maximum or the average. By the same token, larger images yield 3x3xC, 4x4xC, 5x5xC feature maps, and so on: the spatial size depends on the input size, but the output map can always be pooled (for example, by taking the maximum) into a single score per class.
  4. The benefit of full convolution: in the figure, the upper row trains on a 14x14 image and produces a single output, while the lower row tests on a 16x16 image and produces 2x2 outputs. By extension, at test time we can feed larger images (multiple scales) to produce more outputs and take the maximum as the prediction. Compared with a conventional sliding window (a 14x14 window with stride 2 would require four separate passes to classify a 16x16 image), the convolution runs only once, which guarantees efficiency; at the same time the model can use images at many different scales rather than being restricted to a fixed crop-and-flip scheme (in contrast to AlexNet's test-time procedure), and a great deal of redundant computation is eliminated, improving the model's robustness while preserving efficiency.

[Figure: training on a 14x14 input versus testing on a 16x16 input with a fully convolutional network]
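
A minimal PyTorch sketch of the conversion in item 2, using the 5x5x1024 -> 4096 sizes from the text (variable names are illustrative): the fully connected layer's weights are reshaped into convolution kernels, the two layers agree exactly on a 5x5 input, and the convolution additionally accepts larger inputs, yielding a spatial map of outputs.

```python
import torch
import torch.nn as nn

fc = nn.Linear(5 * 5 * 1024, 4096)           # 5*5*1024*4096 weights
conv = nn.Conv2d(1024, 4096, kernel_size=5)  # same number of weights
conv.weight.data = fc.weight.data.view(4096, 1024, 5, 5)
conv.bias.data = fc.bias.data

# Identical outputs on the training-size input...
x = torch.randn(1, 1024, 5, 5)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

# ...but the convolution also handles a larger input, producing one
# output per 5x5 window: a 6x6 input gives a 2x2 map of class scores.
print(conv(torch.randn(1, 1024, 6, 6)).shape)  # torch.Size([1, 4096, 2, 2])
```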

Multi-scale classification - offset pooling

  1. To explain how the final output is computed with offset pooling, consider the figure below, which uses one dimension as the example. (a) is the un-pooled layer-5 map; for a map of size 20x23, the drawing shows the dimension of length 20. (20x23 is the size of the layer-5 map obtained at scale 2; six scales are used later, and one dimension at scale 2 serves as the example here.)
  2. Traditionally, max-pooling a sequence of length 20 with a window of 3 and stride 3 yields a sequence of length 6; this is the Δ = 0 pooling in (b).
  3. Offset pooling shifts the starting position by a given amount before pooling. In (b), Δ = 0, 1, 2 denote three such poolings, giving three results. Since a real image is two-dimensional, there are ultimately 3x3 = 9 pooling combinations, so the final result contains nine values per class, which can be combined for the prediction. (The figure considers only one dimension, so it draws three results, shown in red, green, and blue for the three offsets.)
  4. (c) shows the three pooled maps of length 6 (6 squares each). (d) shows the result after the size-5 fully convolutional classifier: each length-6 map yields 2 outputs (2 squares). (e) shows the final output obtained by interleaving the position information (length 2) with the offsets (three kinds).

[Figure: one-dimensional illustration of offset pooling, panels (a)-(e)]
5. The operation above is repeated 6x2 = 12 times, where 6 is the number of scales (the figure shows six different scales) and 2 accounts for the horizontal flip of each image.
6. Within each of the 12 runs, the maximum over spatial positions is taken. Taking scale 2 as the example, the final map has size 6x9xC, and the maximum is taken over the 6x9 positions.
7. This yields 12 vectors of length C; the 12 vectors are averaged into a single length-C vector, from which the top-1 or top-5 result is read off (a sketch follows the figure below).
[Figure: combining the outputs across the six scales and horizontal flips]
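
A one-dimensional sketch of offset pooling and the final aggregation, under stated assumptions: the stand-in classify function below just takes the max over each 5-long window, as a placeholder for the fully convolutional classifier, not the paper's network.

```python
import numpy as np

def offset_max_pool_1d(seq, size=3, stride=3):
    """Offset pooling in one dimension: start pooling at offsets
    delta = 0, 1, 2 before max-pooling with window 3 and stride 3,
    producing three pooled sequences instead of one (panels b/c)."""
    results = []
    for delta in range(size):
        shifted = seq[delta:]
        n = (len(shifted) - size) // stride + 1
        results.append([max(shifted[i*stride:i*stride + size]) for i in range(n)])
    return results

seq = list(np.random.rand(20))      # one dimension of the 20x23 layer-5 map
pooled = offset_max_pool_1d(seq)    # three sequences of length 6

# Each length-6 sequence passes through the 5-wide classifier and yields
# 6 - 5 + 1 = 2 outputs (panel d); interleaving the 2 positions with the
# 3 offsets reconstructs a finer, length-6 output (panel e).
classify = max                      # placeholder for the classifier
outputs = [[classify(p[i:i + 5]) for i in range(len(p) - 4)] for p in pooled]
interleaved = [outputs[d][pos] for pos in range(2) for d in range(3)]

# Full pipeline: repeat for 6 scales x 2 flips = 12 settings, take the
# spatial max of each class map (e.g. over the 6x9 positions at scale 2)
# to get one length-C vector per setting, then average the 12 vectors
# and read off the top-1 / top-5 classes.
```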

Classification results on the validation set

[Table: classification results on the validation set]
Here "coarse stride" means Δ = 0 only, and "fine stride" means Δ = 0, 1, 2.

  1. Using the fine stride improves performance, but not by much, which suggests that offset pooling actually plays a minor role here.
  2. Using multiple scales, and increasing the number of scales, improves performance.
  3. Finally, combining multiple models (ensembling) improves performance further.

Localization task

  1. In the classification task above, layers 1-5 act as the feature-extraction network and the layers from layer 6 onward act as the classification head; for localization, it suffices to attach a regression network behind the (pooled) layer-5 features.
  2. During training, the feature-extraction network is frozen, and the regressor is trained with an L2 loss between the predicted box and the ground-truth box.
  3. As shown below, again taking scale 2 as the example, the layer-5 output in the figure is 6x7; the regression network's series of convolutions turns it into 2x3 position outputs (2x3 boxes), with 4 channels representing the four boundary coordinates of each box.
  4. Finally, there are 1000 class-specific versions of the regression layers; the figure shows only one class (a sketch of such a regression head follows the figures).
    [Figures: the regression network applied to the layer-5 features at scale 2, and the resulting per-class box predictions]
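
A minimal PyTorch sketch of a per-class regression head of this shape, under assumptions: the 1024 layer-5 channels and the 4096/1024 intermediate widths follow the text and figure, but the exact layer configuration is illustrative, not the paper's verbatim architecture.

```python
import torch
import torch.nn as nn

class BoxRegressor(nn.Module):
    """Regression head on top of the frozen, pooled layer-5 features:
    a 5x5 convolution followed by 1x1 convolutions, ending in 4 channels
    (the four boundary coordinates of a box at each spatial position)."""
    def __init__(self, in_channels=1024):
        super().__init__()
        self.reg = nn.Sequential(
            nn.Conv2d(in_channels, 4096, kernel_size=5), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 4, kernel_size=1),
        )

    def forward(self, feat5):
        return self.reg(feat5)        # (N, 4, H', W'): one box per position

feat5 = torch.randn(1, 1024, 6, 7)    # scale-2 layer-5 map from the text
print(BoxRegressor()(feat5).shape)    # torch.Size([1, 4, 2, 3]): 2x3 boxes

# Training pairs this head with the frozen feature extractor and an L2
# loss between predicted and ground-truth boxes, e.g. nn.MSELoss().
```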


Source: https://blog.csdn.net/czp_374/article/details/94734192