SPPNet spatial pyramid pooling: the principle

First, compare the RCNN and SPPNet pipelines:

In the figure, the upper pipeline is RCNN and the lower pipeline is SPPNet.

What they have in common: both use selective search to propose candidate regions.

The difference: RCNN runs the convolutional network on each selected region separately to extract its features, while SPPNet runs one shared convolution over the whole input image and then maps each selected region's coordinates onto the resulting feature map.


This is SPPNet's improvement over RCNN: RCNN convolves every candidate box separately to extract features, while SPPNet convolves the image only once, avoiding redundant computation. With a single convolution, the coordinate mismatch must be addressed, that is, coordinates in the original image must be mapped to coordinates on the feature map. This is actually easy to work out: in a convolutional network, only the strides change the spatial size. The upper-left and lower-right corner coordinates are computed as follows:

upper-left corner:  x' = floor(x / S) + 1
lower-right corner: x' = ceil(x / S) - 1

Here S is the product of all the strides from the input image to the feature map.
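As a minimal sketch, the mapping above can be written directly in code. The function name and the example box are mine; S = 16 is just a typical value for a conv5 feature map:

```python
import math

def map_box_to_feature_map(x1, y1, x2, y2, S):
    """Map a box from image coordinates to feature-map coordinates.

    S is the product of all strides from the input image to the
    feature map (e.g. 16 for a VGG-style conv5 feature map).
    """
    # Upper-left corner: x' = floor(x / S) + 1
    fx1 = math.floor(x1 / S) + 1
    fy1 = math.floor(y1 / S) + 1
    # Lower-right corner: x' = ceil(x / S) - 1
    fx2 = math.ceil(x2 / S) - 1
    fy2 = math.ceil(y2 / S) - 1
    return fx1, fy1, fx2, fy2

# Example: a 33,48 - 290,302 box on the image, with total stride 16,
# lands on the (3,4) - (18,18) patch of the feature map.
print(map_box_to_feature_map(33, 48, 290, 302, 16))  # (3, 4, 18, 18)
```

The floor-plus-one / ceil-minus-one pair shrinks the mapped box slightly inward, so the feature-map patch only covers locations whose receptive field lies mostly inside the original box.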

So after one convolution of the picture, the corresponding location on the feature map can be found for every region. But the regions differ in size, while the network's fully connected layers require a fixed-size input. RCNN solves this by cropping and warping each region, which costs precision: take a photo and stretch it, and the two versions clearly do not look the same. So the authors proposed the spatial pyramid pooling layer to fix this.

The idea behind this spatial pooling layer is actually simple: each region is pooled at three levels. That is, the region is divided into a 4x4 grid of small cells, with one value pooled from each cell; into a 2x2 grid, pooled the same way; and pooled globally over the whole region. The results are then concatenated, so the output is fixed-size, as the figure below shows.

To make this easier to follow, I redrew the SPP processing pipeline:

As shown, selected regions of different sizes correspond to feature-map patches of different sizes after convolution, each 256 channels deep. Each region (256 channels) is pooled in three ways:

(1) Pool the entire region directly, obtaining one value per channel, 256 values in total: a single 1x256 vector.

(2) Divide the region into a 2x2 grid and pool each cell, obtaining a 1x256 vector per cell; with 2x2 = 4 cells, this yields four 1x256 vectors.

(3) Divide the region into a 4x4 grid and pool each cell, obtaining a 1x256 vector per cell; with 4x4 = 16 cells, this yields sixteen 1x256 vectors.

Concatenating the pooled results of the three partitions gives a feature of (1 + 4 + 16) x 256 = 21 x 256.
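The three-level pooling above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code; it uses max pooling and assumes the region's feature map is at least 4x4 so no grid cell is empty:

```python
import numpy as np

def spp_pool(feature, levels=(1, 2, 4)):
    """Spatial pyramid pooling over one region's feature map.

    feature: array of shape (C, H, W). Returns a fixed-length vector of
    (1 + 4 + 16) * C values for the default levels, regardless of H and W.
    Assumes H and W are at least as large as the largest grid level.
    """
    C, H, W = feature.shape
    outputs = []
    for n in levels:
        # Split the region into an n x n grid of (possibly uneven) cells.
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature[:, h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]]
                # Max-pool the cell down to one value per channel (1 x C).
                outputs.append(cell.max(axis=(1, 2)))
    return np.concatenate(outputs)

# Two regions of different sizes both produce 21 * 256 values:
a = spp_pool(np.random.rand(256, 13, 9))
b = spp_pool(np.random.rand(256, 7, 20))
print(a.shape, b.shape)  # (5376,) (5376,)
```

Whatever the patch size, the output length depends only on the channel count and the grid levels, which is exactly what lets the fully connected layers accept arbitrary proposals.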

As the figure shows, the whole process is completely independent of the input size, so candidate boxes of any size can be handled.

The spatial pooling layer is really just an adaptive layer: whatever the input size, the output is fixed (21 x channels). SPPNet changed the order of convolution and region extraction and introduced this adaptive pooling layer, avoiding the problems caused by inconsistently sized proposal boxes. As a piece of architecture design it is quite elegant, unlike RCNN's brute-force approach.


Origin blog.csdn.net/sinat_33486980/article/details/81902746