Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (SPP-Net)

1. Relevant theories

This blog post mainly explains Kaiming He's 2014 paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". The paper's main innovation is the spatial pyramid pooling layer. Paper home page: http://research.microsoft.com/en-us/um/people/kahe/eccv14sppnet/index.html. According to the paper, the resulting detection pipeline is 24-102x faster than R-CNN at test time.

We know that in existing CNNs, once the network structure is fixed, a fixed-size input image is required, such as 224×224, 32×32, or 96×96. So when we want to run detection on pictures of various sizes, we first have to crop or warp them, operations that often reduce recognition and detection accuracy. The paper therefore proposes the "spatial pyramid pooling" method. Its advantage is that the network can accept pictures of any size, with no cropping or warping needed. Not only that: using this method, accuracy also improves.

Spatial pyramid pooling is also known as "SPP-Net". Remember this name, because you will often encounter it in the literature, especially in papers on object detection, just like OverFeat, GoogLeNet, R-CNN, AlexNet... After reading this paper, it is worth remembering what SPP-Net is; I have also run into this algorithm several times in the literature on feature learning and feature representation.

Since earlier CNNs required a fixed-size input image, we first need to understand why. A CNN generally consists of three parts: convolution, pooling, and fully connected layers.

First, convolution. Does the convolution operation place any requirement on the input image size? Take a 5×5 convolution kernel: if I input a picture of size 30×81, I get an output of size 26×77, and nothing about the convolution breaks. If I input 600×500, convolution still works. That is, convolution imposes no requirement on the input size: any size of image can be convolved.

Pooling: does pooling place any requirement on the image size? If my pooling window is (2, 2) and I input a 30×40 picture, pooling gives me a 15×20 picture; if I input a 53×22 picture, pooling gives me a 26×11 picture. So the pooling step also imposes no requirement on the input size: a picture of any size can be pooled.

Fully connected layer: since neither pooling nor convolution constrains the input size, the constraint comes only from the fully connected layer. The weight matrix W of a fully connected layer has a fixed size after training. For example, if the transition from the convolutional part to the fully connected layer goes from 50 inputs to 30 output neurons, the weight matrix has size (50, 30). Therefore, what spatial pyramid pooling needs to solve is the transition from the convolutional layers to the fully connected layer.

That is to say, in later literature, the spatial pyramid pooling layer is generally a network layer placed between the last convolutional layer and the fully connected layer.
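As a quick self-contained sketch of this constraint (a toy example of our own, not from the paper; the layer sizes are arbitrary), the following PyTorch snippet shows that convolution and pooling accept any input size while a fully connected layer does not:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=5)  # convolution: no constraint on input size
pool = nn.MaxPool2d(2)                 # pooling: no constraint on input size
fc = nn.Linear(8 * 13 * 13, 30)        # fully connected: input length fixed at build time

for h, w in [(30, 30), (60, 81)]:
    x = torch.randn(1, 3, h, w)
    feat = pool(conv(x))               # works for any (h, w)
    print(feat.shape)
    # fc(feat.flatten(1)) succeeds only when feat has exactly 8*13*13 values,
    # i.e. for the 30x30 input; the 60x81 input raises a shape-mismatch error.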

2. Algorithm overview

OK, now we can explain what spatial pyramid pooling is. Let's start with spatial pyramid feature extraction (leaving the "pooling" aside for now). The spatial pyramid is a long-standing feature extraction method, closely related to features such as SIFT and HOG. For simplicity, assume a very simple two-layer network:

Input layer: a picture of any size, assuming its size is (w, h).

Output layer: 21 neurons.

That is, when we input a feature map of any size, we hope to extract 21 features. The process of spatial pyramid feature extraction is as follows:


(Figure: dividing an input image at three pyramid scales)

As shown in the figure above, when we input a picture, we divide it at several different scales. In the schematic, three scales are used, yielding a total of 16+4+1 = 21 blocks. From each of these 21 blocks we extract one feature, which gives exactly the 21-dimensional feature vector we want.

In the first picture, we divide a complete picture into 16 blocks, that is, the size of each block is (w/4, h/4);

The second picture is divided into 4 blocks, and the size of each block is (w/2,h/2);

In the third picture, the whole picture is treated as one block, so the block size is (w, h).

The max pooling step of the spatial pyramid then simply computes the maximum value within each of the 21 blocks, each maximum producing one output neuron. In this way an image of any size is converted into a fixed 21-dimensional feature (you can of course design other output dimensions by adding pyramid levels or changing the grid sizes). Each of the three partitions above is called a level of the pyramid, and the size of each block is called the window size. If you want a level to output n×n features, its window size must be (w/n, h/n); see the sketch below.
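As a sketch of this idea (not the paper's code), PyTorch's adaptive max pooling implements exactly this "one maximum per grid cell" computation; the helper below reproduces the 21-dimensional example for a single-channel feature map:

import torch
import torch.nn.functional as F

def pyramid_pool_21(feat):
    # Max-pool a (N, C, H, W) feature map into 16 + 4 + 1 = 21 bins per channel
    outs = []
    for n in (4, 2, 1):                          # grids: 4x4, 2x2, 1x1
        pooled = F.adaptive_max_pool2d(feat, n)  # window size is roughly (H/n, W/n)
        outs.append(pooled.flatten(1))           # shape (N, C*n*n)
    return torch.cat(outs, dim=1)                # shape (N, C*21)

print(pyramid_pool_21(torch.randn(1, 1, 30, 40)).shape)  # torch.Size([1, 21])
print(pyramid_pool_21(torch.randn(1, 1, 53, 22)).shape)  # torch.Size([1, 21])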

With a deep network, whatever size the input picture is, we can run convolution and pooling as usual up to the last few layers; right before connecting to the fully connected layer, we apply pyramid pooling, which converts feature maps of any size into a fixed-size feature vector. This is the point of spatial pyramid pooling (multi-scale pooling extracts a fixed-size feature vector). The paper's flow chart shows exactly this pipeline: convolutional layers → SPP layer → fully connected layers.
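To make the placement concrete, here is a toy network in the same spirit (a sketch of our own; the layer sizes and names are arbitrary, not the paper's architecture). The fully connected layer always sees 32 channels × 21 bins, regardless of input size:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySPPNet(nn.Module):
    # Toy pipeline: convolutional layers -> spatial pyramid pooling -> fully connected layer
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 21, num_classes)  # 21 bins (4x4 + 2x2 + 1x1) per channel

    def forward(self, x):
        feat = self.conv(x)
        spp = torch.cat([F.adaptive_max_pool2d(feat, n).flatten(1) for n in (4, 2, 1)], dim=1)
        return self.fc(spp)

net = TinySPPNet()
print(net(torch.randn(2, 3, 50, 70)).shape)    # torch.Size([2, 10])
print(net(torch.randn(2, 3, 224, 160)).shape)  # torch.Size([2, 10])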


Spatial Pyramid Pooling (SPP): principle and code implementation (PyTorch)


If you only want the formula, you can skip ahead to Section 3, "Formula correction".

1. Why SPP is needed


First of all, you need to know why SPP is needed.

We all know that a convolutional neural network (CNN) consists of convolutional layers and fully connected layers. The convolutional layers place no requirement on the size of the input data; the only size requirement comes from the first fully connected layer. Even so, basically all CNNs demand a fixed input size. For example, the famous VGG model requires an input size of 224×224.

There are two problems with fixed input data size:

1. The data obtained in many scenes is not of a fixed size. For example, the height-to-width ratio of street-view text is basically not fixed, as with the text in the red box in the figure below.


2. You may say the picture could simply be cropped, but cropping may lose important information.

In summary, SPP was proposed to solve the problem that the input image size of a CNN must be fixed, so that the aspect ratio and size of the input image can be arbitrary.

2. Principle of SPP


More specific principles can be found in the original paper: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition


The picture above is the schematic diagram given in the original paper, which should be read from bottom to top:
  • First is the input layer (input image); its size can be arbitrary.
  • Convolution then runs up to the last convolutional layer (conv5 in the figure), producing feature maps whose spatial size depends on the input.
  • The SPP layer pools these feature maps over several fixed grids and concatenates the results into a fixed-length vector (see the sketch after this list).
  • That fixed-length vector then feeds the fully connected layers.
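The paper sizes each level's pooling window as win = ceil(a/n) with stride = floor(a/n), where the conv5 feature map is a×a and the level outputs an n×n grid. Below is a quick numerical check of that formula (a sketch of our own, using the 13×13 conv5 example from the paper):

import math
import torch
import torch.nn.functional as F

a = 13                       # conv5 feature map size, the example used in the paper
feat = torch.randn(1, 256, a, a)
for n in (4, 2, 1):          # pyramid levels: 4x4, 2x2, 1x1 grids
    win = math.ceil(a / n)   # pooling window, per the paper
    stride = math.floor(a / n)
    out = F.max_pool2d(feat, kernel_size=win, stride=stride)
    print(n, out.shape)      # spatial size comes out as n x n for each level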

4. Algorithm source code implementation (PyTorch):

#coding=utf-8

import math
import torch
import torch.nn.functional as F

# Build the SPP layer (spatial pyramid pooling layer)
class SPPLayer(torch.nn.Module):
    def __init__(self, num_levels, pool_type='max_pool'):
        super(SPPLayer, self).__init__()

        self.num_levels = num_levels
        self.pool_type = pool_type

    def forward(self, x):
        num, c, h, w = x.size()  # num: batch size, c: channels, h: height, w: width
        for i in range(self.num_levels):
            level = i + 1
            # Window and stride chosen so this level yields a level x level grid
            kernel_size = (math.ceil(h / level), math.ceil(w / level))
            stride = (math.ceil(h / level), math.ceil(w / level))
            # Padding that tops the input up to a multiple of the window size.
            # Note: for some input sizes this padding can exceed kernel_size/2,
            # which F.max_pool2d/F.avg_pool2d reject; that is what the formula
            # correction in Section 3 addresses.
            padding = (math.floor((kernel_size[0] * level - h + 1) / 2),
                       math.floor((kernel_size[1] * level - w + 1) / 2))

            # Choose the pooling type
            if self.pool_type == 'max_pool':
                tensor = F.max_pool2d(x, kernel_size=kernel_size, stride=stride,
                                      padding=padding).view(num, -1)
            else:
                tensor = F.avg_pool2d(x, kernel_size=kernel_size, stride=stride,
                                      padding=padding).view(num, -1)

            # Flatten and concatenate the levels
            if i == 0:
                x_flatten = tensor.view(num, -1)
            else:
                x_flatten = torch.cat((x_flatten, tensor.view(num, -1)), 1)
        return x_flatten

def test():
    # The same SPP layer should produce the same output length for different input sizes
    spp1 = SPPLayer(1)
    spp2 = SPPLayer(2)
    spp3 = SPPLayer(3)
    x1 = torch.ones((1, 3, 32, 32), dtype=torch.float)
    x2 = torch.ones((1, 3, 64, 64), dtype=torch.float)
    out11 = spp1(x1)
    out12 = spp1(x2)
    out21 = spp2(x1)
    out22 = spp2(x2)
    out31 = spp3(x1)
    out32 = spp3(x2)
    print(out11.shape, out12.shape)
    print(out21.shape, out22.shape)
    print(out31.shape, out32.shape)


if __name__ == '__main__':
    test()
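For reference, with these inputs the printed shapes should be identical per layer for both input sizes: spp1 prints torch.Size([1, 3]) twice (3 channels × 1 bin), spp2 prints torch.Size([1, 15]) twice (3 × (1+4) bins), and spp3 prints torch.Size([1, 42]) twice (3 × (1+4+9) bins).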

Origin: blog.csdn.net/leiduifan6944/article/details/106521023