Understanding Neural Networks (8): SPP-Net

SPP-Net: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Why is R-CNN detection so slow, at about 47 s per image? A closer look at the R-CNN pipeline shows that after the region proposals are generated (about 2000 per image), each proposal is treated as a separate image in the following steps (CNN feature extraction + SVM classification). In effect, a single image goes through feature extraction and classification 2000 times! Yet these 2000 region proposals are all parts of the same image. We can therefore run the convolutional layers over the whole image once and then simply map each region proposal from its position in the original image onto the convolutional feature map. For each image the convolutional features are computed only once, and the convolutional features of each region proposal are fed into the fully connected layers for the subsequent operations. (In a CNN most of the computation is spent in the convolutions, so this saves a lot of time.)
The problem now is that the region proposals differ in size, so their features cannot be fed directly into the fully connected layers, which require a fixed-length input. SPP-Net solves exactly this problem.
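The fixed-length constraint can be illustrated with a minimal sketch. The tiny weight matrix and the two region sizes below are hypothetical and chosen only for illustration; the point is that a fully connected layer is a matrix-vector product, so a flattened region of a different size simply does not fit.

```python
def fully_connected(weights, x):
    """Compute weights @ x for a weight matrix stored as a list of rows."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError(f"expected input of length {len(weights[0])}, got {len(x)}")
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def flatten(fmap):
    """Flatten a 2-D feature map (list of rows) into one vector."""
    return [v for row in fmap for v in row]


weights = [[1] * 6, [2] * 6]           # hypothetical layer built for 2 x 3 inputs
small = [[1, 2, 3], [4, 5, 6]]         # 2 x 3 region: length 6, accepted
large = [[1, 2, 3, 4], [5, 6, 7, 8]]   # 2 x 4 region: length 8, rejected

print(fully_connected(weights, flatten(small)))  # [21, 42]
try:
    fully_connected(weights, flatten(large))
except ValueError as err:
    print(err)  # the length mismatch is reported here
```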

The traditional image normalization process works as follows.
Because a conventional CNN requires a fixed-size input (for example, 224x224 for AlexNet), the original image usually has to be cropped or warped in practice:

  1. crop: take a fixed-size patch from the original image
  2. warp: scale the ROI of the original image to a fixed-size patch

Neither crop nor warp can guarantee that the image reaches the CNN undistorted:

  1. crop: the object may be truncated, especially in images with a large aspect ratio.
  2. warp: the object is stretched and loses its original shape, especially in images with a large aspect ratio.
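The two failure modes can be put into numbers with a small sketch. The helper names and the 448 x 224 example image are assumptions made for illustration; a centered crop is assumed, although the crop could be taken anywhere.

```python
def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered size x size crop, clipped to the image."""
    left = max((width - size) // 2, 0)
    top = max((height - size) // 2, 0)
    return left, top, min(left + size, width), min(top + size, height)


def visible_fraction(width, height, size):
    """Fraction of the original image area that survives the crop."""
    left, top, right, bottom = center_crop_box(width, height, size)
    return (right - left) * (bottom - top) / (width * height)


def warp_scales(width, height, size):
    """Return the horizontal and vertical scale factors of a warp to size x size."""
    return size / width, size / height


# Hypothetical 448 x 224 image (aspect ratio 2:1) fed to a 224 x 224 network.
print(visible_fraction(448, 224, 224))  # 0.5: half of the image is cut away
print(warp_scales(448, 224, 224))       # (0.5, 1.0): x is squeezed, y is unchanged
```

The further the two warp scale factors are from each other, the more the object is stretched; the smaller the visible fraction, the more likely the crop truncates the object.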

SPP is designed to solve exactly these problems: whatever the size of the input image, it can be fed into the network correctly.
SPP-Net makes a substantial improvement over R-CNN (feature extraction is done once for the whole image, and the regions are cut out only just before classification):

  1. It removes the crop / warp image normalization step, which resolves the information loss caused by image distortion as well as the storage problem;
  2. It replaces the last pooling layer before the fully connected layers with spatial pyramid pooling (Spatial Pyramid Pooling).

The specific idea: the convolutional layers of a CNN can process input of any size, and the size restriction comes only from the fully connected layers. In other words, if we can find a way to restrict the input of the fully connected layers to a fixed length, the problem is solved.
If the input image is 224x224, the output of conv5 is 13x13x256, which can be understood as 256 filters, each producing a 13x13 activation map. If each activation map is max-pooled into three grids of 4x4, 2x2 and 1x1 bins, the resulting feature has a fixed length of (16 + 4 + 1) x 256 dimensions. If the input is not 224x224, the resulting feature still has (16 + 4 + 1) x 256 dimensions. Intuitively, the original fixed-size (3x3) pool5 window is replaced by an adaptive window whose size is proportional to the activation map, which guarantees that the pooled feature always has the same length.
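The pooling described above can be sketched in plain Python. This is a minimal illustration rather than the paper's implementation: the bin boundaries (floor for the start, ceiling for the end) are one reasonable choice assumed here, and `make_maps` is a hypothetical helper that only produces dummy activation maps.

```python
def spp_channel(fmap, levels=(4, 2, 1)):
    """Max-pool one 2-D activation map into n x n bins for each pyramid level."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # Bin boundaries scale with the map size: floor start, ceil end.
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                out.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


def spp(feature_maps, levels=(4, 2, 1)):
    """Concatenate the pooled bins of every channel into one fixed-length vector."""
    vec = []
    for fmap in feature_maps:
        vec.extend(spp_channel(fmap, levels))
    return vec


def make_maps(channels, h, w):
    """Build dummy activation maps filled with a simple deterministic pattern."""
    return [[[(k + r * w + c) % 7 for c in range(w)] for r in range(h)]
            for k in range(channels)]


print(len(spp(make_maps(256, 13, 13))))  # 5376 = (16 + 4 + 1) x 256
print(len(spp(make_maps(256, 10, 7))))   # 5376 again, for a different map size
```

Because the window grows and shrinks with the activation map, the output length depends only on the pyramid levels and the number of channels, never on the input resolution.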
To handle feature maps of different resolutions, a scalable pooling layer is defined: whatever the input resolution, the feature map is divided into m*n bins. This is the first notable feature of SPP-Net: its input is the conv5 feature map together with the candidate box on that feature map (obtained by mapping the candidate box in the original image through the stride), and its output is a feature of fixed size (m * n);
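The mapping from an image-space candidate box to the feature map, followed by pooling into m*n bins, might look like the sketch below. Several things here are assumptions rather than details from the text: the total stride of 16, the simplified rounding (floor the top-left corner, ceil the bottom-right corner), and the function names.

```python
def image_box_to_feature_box(box, stride, fmap_w, fmap_h):
    """Project (x1, y1, x2, y2) in image pixels onto feature-map cells.

    Returns a half-open window (fx1, fy1, fx2, fy2) that covers at least one cell.
    """
    x1, y1, x2, y2 = box
    fx1 = min(max(x1 // stride, 0), fmap_w - 1)
    fy1 = min(max(y1 // stride, 0), fmap_h - 1)
    fx2 = min(max((x2 + stride - 1) // stride, fx1 + 1), fmap_w)
    fy2 = min(max((y2 + stride - 1) // stride, fy1 + 1), fmap_h)
    return fx1, fy1, fx2, fy2


def pool_roi(fmap, fbox, m, n):
    """Max-pool the window fbox of one 2-D feature map into m rows x n columns."""
    fx1, fy1, fx2, fy2 = fbox
    h, w = fy2 - fy1, fx2 - fx1
    out = []
    for i in range(m):
        r0 = fy1 + (i * h) // m
        r1 = fy1 + ((i + 1) * h + m - 1) // m
        row = []
        for j in range(n):
            c0 = fx1 + (j * w) // n
            c1 = fx1 + ((j + 1) * w + n - 1) // n
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


# Dummy 13 x 13 feature map whose value encodes the cell position.
fmap = [[r * 13 + c for c in range(13)] for r in range(13)]
fbox = image_box_to_feature_box((32, 48, 176, 200), stride=16, fmap_w=13, fmap_h=13)
print(fbox)                        # (2, 3, 11, 13)
print(pool_roi(fmap, fbox, 2, 2))  # [[97, 101], [162, 166]]
```

Whatever the size of the candidate box, `pool_roi` returns exactly m rows of n values, so every region yields a feature of the same length.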

The key is the position of the SPP layer: it is placed after all the convolutional layers. Extracting multi-scale features to increase robustness is not the critical part, and this feature was dropped in the later Fast R-CNN improvement. What the pyramid pooling layer effectively solves is the repeated computation of the convolutional layers (test speed improves by 24 to 102 times), which is the core contribution of the paper.

Compared with R-CNN, SPP-Net greatly speeds up object detection, but many problems remain:

  1. As with R-CNN, the training stages are still separate: candidate box extraction | CNN feature computation | SVM classification | bounding box regression are trained independently, a large amount of intermediate results has to be dumped to disk, and the parameters cannot be trained as a whole. Training is split into several complicated stages: fine-tuning the network + training the SVMs + training the bounding box regressor.
  2. SPP-Net cannot fine-tune the convolutional layers during fine-tuning; it only fine-tunes the fully connected layers, whereas a new task needs the convolutional layers fine-tuned as well. (Features extracted by a classification model focus on semantics, while the detection task needs the location of the target in addition to semantic information.) Because SPP-Net cannot tune the convolutional layers and the fully connected layers on both sides of the SPP layer at the same time, the benefit of a deep CNN is limited to a large extent;
  3. Region proposal generation is still very time-consuming throughout the process. To solve these problems, RBG ( http://www.rossgirshick.info ) went on to propose Fast R-CNN, a streamlined and fast object detection framework.

Origin blog.csdn.net/u010095372/article/details/91294611