Understanding Neural Networks (XI): R-FCN

As the name suggests, the network is fully convolutional: every layer is a convolution layer, with no fully connected (fc) layers.
https://blog.csdn.net/shenxiaolu1984/article/details/51348149
R & lt-the FCN (based on the detection region) of the method is: shared over the entire image is calculated, implementation (i.e., by deleting the last layer is removed fc All sub-networks). Use "position sensitive score map" to solve the contradiction between the image classification and translation invariant object detecting horizontal and vertical movement.
The conflict is this: image classification favors as much translation invariance as possible (shifting an object inside the image should not change its class), while object detection requires translation variance. Thus, the leading ImageNet classification results suggest that fully convolutional structures with as much translation invariance as possible are favored. On the other hand, the detection task needs localization responses that vary with translation. For example, translating an object inside a candidate box should produce meaningful changes in the network's responses, since those responses describe how well the candidate box covers the real object. We assume that the deeper the convolutional layers of an image-classification network, the less sensitive the network becomes to translation.
As a CNN gets deeper, it becomes less and less sensitive to position (the so-called translation invariance), but detection requires strong sensitivity to location. So how can a backbone like ResNet-101 be used for detection?
Before R-FCN, the fix was simple: insert the RoI pooling layer between the convolutional layers, so that the convolutional layers after RoI pooling are computed per region rather than shared. This avoids losing too much location information, because the later convolutional layers can still learn position.
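As a rough sketch of this pre-R-FCN arrangement (assuming a torchvision ResNet-101 and its roi_pool op, not the original implementation), the backbone up through conv4 is shared across the whole image, while conv5 runs separately for each RoI after RoI pooling:

```python
# Sketch of the pre-R-FCN setup: shared conv4 features, per-RoI conv5 head.
# Assumes torchvision >= 0.13 (weights=None API); values are illustrative.
import torch
import torchvision

resnet = torchvision.models.resnet101(weights=None)
backbone = torch.nn.Sequential(          # shared layers, run once on the whole image
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3)   # up to conv4, stride 16
per_roi_head = resnet.layer4                        # conv5, run per RoI (not shared)

image = torch.randn(1, 3, 800, 800)
rois = torch.tensor([[0, 100., 100., 400., 400.]])  # (batch_idx, x1, y1, x2, y2)

feat = backbone(image)
pooled = torchvision.ops.roi_pool(feat, rois, output_size=(14, 14),
                                  spatial_scale=1.0 / 16)
per_roi_feat = per_roi_head(pooled)                 # per-RoI computation
print(per_roi_feat.shape)                           # (1, 2048, 7, 7)
```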
R-FCN uses a fully convolutional (FCN) structure and introduces translation variance into the FCN by constructing position-sensitive score maps with a dedicated convolutional layer. Each of these score maps encodes relative spatial position information about the regions of interest. On top of the FCN, a position-sensitive RoI pooling layer is added to aggregate information from these score maps.
The idea of R-FCN is to make the last layer of the network produce position-sensitive feature maps in a fully convolutional way. Specifically, the position information of each proposal is encoded: the proposal is divided into a k*k grid, and each grid cell is encoded separately. After the last feature map, an extra convolution produces k*k*(C+1) score maps (k*k is the total number of grid cells, C is the number of classes, and +1 is for the background class).
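A minimal sketch of that extra convolution, with assumed channel counts (1024 backbone channels, k = 3, C = 20):

```python
# Sketch: the 1x1 convolution that turns the backbone feature map
# into k*k*(C+1) position-sensitive score maps. Shapes are assumptions.
import torch
import torch.nn as nn

k, C = 3, 20                        # 3x3 grid, 20 object classes (+1 background)
backbone_channels = 1024            # e.g. ResNet-101 conv4 output

score_conv = nn.Conv2d(backbone_channels, k * k * (C + 1), kernel_size=1)

feat = torch.randn(1, backbone_channels, 50, 50)    # whole-image feature map
score_maps = score_conv(feat)                       # (1, 189, 50, 50) for k=3, C=20
print(score_maps.shape)
```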
The RPN proposes regions of interest, and R-FCN classifies them. R-FCN adds one more convolutional layer after the convolutional layers it shares with the RPN. Like the RPN, R-FCN therefore takes the whole image as input, but the output of its final convolutional layer is used to crop out, for each region of interest, the corresponding part of the whole-image convolutional response maps.
The last convolutional layer of R-FCN generates k*k position-sensitive score maps per class over the entire image. With C object classes plus background, this layer has k*k*(C+1) output channels. The k*k score maps correspond to a k*k spatial grid describing relative positions. For example, with k × k = 3 × 3, the 9 score maps of a single class encode the positions {top-left, top-center, top-right, ..., bottom-right}.
Finally, R-FCN's position-sensitive RoI pooling layer produces a score for each RoI. Selective pooling, as illustrated in the paper's figure: take the orange response map (top-left), crop out the top-left cell of the RoI, and pool that region to get one small orange square (a score); the response maps of the other colors are handled the same way with their own cells. Voting over (pooling) all the colored squares gives the result for class 1. Concretely, if k = 3 and C = 20, then each of the 21 classes has 3*3 score maps, i.e. 9 grid cells, and each cell records spatial information for one relative position. Each grid cell of a proposal corresponds to one channel of the 3*3*21-channel map. The score of a cell is obtained by averaging the values of its corresponding score map over the cell's region. Finally, the cell scores are voted on (average pooled) to form a 21-dimensional vector for classification.
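The following is a simplified, illustrative implementation of position-sensitive RoI pooling plus voting for a single RoI. The channel ordering and the helper name ps_roi_pool_and_vote are assumptions made for this sketch; libraries such as torchvision also ship a fused ps_roi_pool op.

```python
# Simplified position-sensitive RoI pooling + voting for one RoI.
# Channel layout assumed: cell (i, j) owns channels [(i*k + j)*num_classes, ...).
import torch

def ps_roi_pool_and_vote(score_maps, roi, k, num_classes, spatial_scale=1.0 / 16):
    """score_maps: (k*k*num_classes, H, W); roi: (x1, y1, x2, y2) in image coords."""
    x1, y1, x2, y2 = [v * spatial_scale for v in roi]
    bin_w, bin_h = (x2 - x1) / k, (y2 - y1) / k
    cell_scores = torch.zeros(num_classes, k, k)
    for i in range(k):                 # grid row
        for j in range(k):             # grid column
            ys = int(y1 + i * bin_h)
            ye = max(int(y1 + (i + 1) * bin_h), ys + 1)
            xs = int(x1 + j * bin_w)
            xe = max(int(x1 + (j + 1) * bin_w), xs + 1)
            ch = (i * k + j) * num_classes          # channels for cell (i, j)
            region = score_maps[ch:ch + num_classes, ys:ye, xs:xe]
            cell_scores[:, i, j] = region.mean(dim=(1, 2))   # average inside the cell
    return cell_scores.mean(dim=(1, 2))                      # vote -> (num_classes,)

scores = ps_roi_pool_and_vote(torch.randn(3 * 3 * 21, 50, 50),
                              roi=(100, 100, 400, 400), k=3, num_classes=21)
print(scores.shape)                                          # torch.Size([21])
```

After voting, the 21 scores are passed through a softmax to give the class probabilities of the RoI.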
When an RoI is correctly classified, most of its k*k grid cells on the corresponding class's position-sensitive score maps (the orange solid boxes in the paper's figure) are strongly activated within the RoI.
For bounding-box regression, the same scheme is used: simply replace the C+1 with 4, giving a sibling convolutional layer with 4*k*k output channels.
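As a hedged sketch (assumed channel count), the regression head is just a sibling convolution with 4*k*k output channels producing class-agnostic box offsets, pooled and voted exactly like the class scores:

```python
# Sibling bbox regression head: same idea as the class head, with C+1 replaced by 4.
import torch.nn as nn

k = 3
bbox_conv = nn.Conv2d(1024, 4 * k * k, kernel_size=1)  # voted down to 4 offsets per RoI
```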
With a few additional techniques, R-FCN improves on the Faster R-CNN baseline by about 3 points, and it is faster than the original Faster R-CNN (since all computation is shared). Compared with the improved Faster R-CNN (the version that moves RoI pooling earlier), the gain is about 0.2 points, but it is roughly 2.5 times faster. At the time, this arguably made R-FCN the best combination of speed and performance among all the methods.
