[youcans hands-on model] SPPNet model for object detection

Welcome to the "youcans hands-on model" series
The content and resources of this column are synchronized to GitHub/youcans



This article implements the SPPNet network model with PyTorch.


1. SPPNet convolutional neural network model

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun published the paper [Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition] at ECCV 2014, proposing the SPPNet model.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

[Download address]: https://arxiv.org/abs/1406.4729
[GitHub address]: [OverFeat_pytorch] https://github.com/BeopGyu/OverFeat_pytorch

http://research.microsoft.com/en-us/um/people/kahe/



1.1 Abstract of the paper

Existing deep convolutional neural networks (CNNs) require fixed-size input images. This artificial requirement may degrade the recognition accuracy for images or sub-images of arbitrary size/scale.

We propose the SPP Net network structure, which uses a different pooling strategy, "spatial pyramid pooling", to generate fixed-length representations for input images of any size/scale. Pyramid pooling is also robust to object deformation. Based on these advantages, SPP Net can generally improve all image classification methods based on convolutional neural networks.

Using SPP Net improves the accuracy of various convolutional neural network architectures on the ImageNet 2012 dataset. On the Pascal VOC 2007 and Caltech101 datasets, SPP Net achieves state-of-the-art classification results without fine-tuning using a single full image representation.

For the object detection task, SPP Net only needs to compute the feature map of the entire image once, and can take any region (sub-image) as input to generate a fixed-size feature representation for training the detector. Since SPP Net avoids repeated computation of convolutional features, it is 24~102 times faster than the R-CNN method and achieves better or comparable accuracy on Pascal VOC 2007.

In the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2014), SPP Net took second place in the object detection task and third place in the image classification task.


1.2 Technical Background

Recently, deep-network-based methods have made great progress in image classification, object detection, many other recognition tasks, and even non-recognition tasks. However, there is a problem in the training and testing of convolutional neural networks: they require a fixed input image size (such as 224*224), which limits the aspect ratio and scale of the input image.

The input image with a fixed size brings two problems: (1) the images obtained in many scenes are not fixed in size, and (2) if the image is cut, important information may be lost.

When applied to images of arbitrary size, current methods mostly fit the input image to a fixed size by cropping or scaling. But the cropped region may not contain the whole object, and scaling may introduce geometric distortion, which hurts recognition accuracy. Also, a predefined scale may not be suitable when the object scale varies. In short, requiring a fixed-size input image ignores the scale of the actual image.

Why do convolutional neural networks need a fixed input size? A convolutional neural network consists of two parts: a convolutional layer and a fully connected layer. Convolutional layers operate in a sliding window manner, do not require input images to have a fixed size, and can generate output feature maps of any size. However, fully connected layers require input vectors of fixed size. Therefore, the constraint of the fixed size of the convolutional neural network comes from the fully connected layers, which exist at the end of the network model.
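A minimal sketch of this constraint (the layer sizes here are illustrative): the convolutional layer below accepts any input size, while the fully connected layer only accepts a fixed-length vector.

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # works for any input size
fc = nn.Linear(16 * 8 * 8, 10)                      # demands exactly 16*8*8 = 1024 inputs

for size in (8, 16, 32):
    feat = conv(torch.randn(1, 3, size, size))
    print(feat.shape)          # (1, 16, size, size): the output size follows the input
    # fc(feat.flatten(1)) only succeeds for size == 8; any other input size
    # breaks the fully connected layer -- hence the fixed-size requirement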

This paper introduces a spatial pyramid pooling layer (Spatial Pyramid Pooling, SPP) to deal with the fixed size constraints of convolutional neural networks. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer produces a fixed-size output, which is then fed to a fully-connected layer (or other classifier). In other words, we perform information "aggregation" via SPP between convolutional and fully-connected layers, rather than cropping or scaling the image at the beginning.


Spatial pyramid pooling (often called spatial pyramid matching or SPM), an extension of bag-of-words (BoW), is one of the most successful methods in computer vision. It divides an image into fine-to-coarse levels and aggregates local features within them. SPM was a key component of classification and detection systems before the rise of convolutional neural networks, but it had not been applied within convolutional neural networks.

SPP Net has several salient properties for deep convolutional neural networks:

(1) Regardless of the size of the input image, SPP-Net is able to generate a fixed-length output, which is not possible with previous convolutional and pooling layers.

(2) SPP uses multi-level spatial containers, while sliding window pooling only uses a single window size. Multi-level pooling is more robust to object deformation.

(3) Due to the flexibility of the input scale, SPP Net can pool features extracted at variable scales.

Not only does SPP Net generate representations for testing from images/windows of arbitrary size, it also allows us to feed images of different sizes or scales during training. Training with variable-sized images increases scale invariance and reduces overfitting.

We develop a simple multi-scale training method. In each epoch, we train the network with a given input size, and switch to another input size in the next epoch. Experiments show that this multi-scale training converges and yields better test accuracy.
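A minimal sketch of this epoch-wise size switching (the 224/180 sizes follow the paper; the toy model and random data are illustrative stand-ins, with adaptive pooling playing the role of SPP):

import torch
import torch.nn as nn

# toy SPP-style classifier: global pooling makes the output size-independent
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

sizes = [224, 180]                          # the two scales used in the paper
for epoch in range(4):
    s = sizes[epoch % len(sizes)]           # switch the input size every epoch
    images = torch.randn(8, 3, s, s)        # stands in for a real data loader
    labels = torch.randint(0, 10, (8,))
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()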

SPP networks also show strong capabilities in object detection. R-CNN extracts the features of candidate windows through deep convolutional networks and leads the performance in object detection tasks. However, R-CNN repeatedly computes the convolutional features of thousands of candidate regions for each image, which is very time-consuming.

Many studies have tried to train/run detectors on feature maps (instead of image regions). In this paper, we only perform one convolution process on the entire image (regardless of the number of windows), and then extract features on the feature map via SPP. SPP Net achieves superior accuracy and efficiency through the flexibility of SPP at arbitrary window sizes. Based on the processing flow of R-CNN, SPP Net calculates features 24~102 times faster than R-CNN, and has better accuracy.

In summary:

Compared with convolutional neural networks without SPP, SPP Net improves various deeper and larger networks.

Multi-view testing on feature maps, using windows of different sizes and positions, further improves classification accuracy.


1.3 Spatial Pyramid Pooling

Convolutional layers can accept inputs of arbitrary size. Convolutional layers use sliding filters whose output has approximately the same aspect ratio as the input. These outputs are called feature maps, and they relate not only to the strength of the response, but also to its spatial location.

Convolutional layers can accept arbitrary input sizes and produce output feature maps of variable size. Whereas classifiers (SVM/softmax) or fully connected layers require fixed-length input vectors.

The spatial pyramid pooling method generates a fixed-length output vector through local spatial window pooling, and can retain spatial information. As long as the size of the pooled spatial window is proportional to the image size, an output vector of fixed size or length can be obtained regardless of the image size.
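A quick way to see this in PyTorch (a sketch using the library's AdaptiveMaxPool2d, which chooses the window proportionally to the input; the paper's own ceil/floor window computation is given in section 2.1):

import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d((4, 4))          # one pyramid level with a 4*4 grid
for h, w in [(13, 13), (24, 32)]:
    feat = torch.randn(1, 256, h, w)         # conv feature map of arbitrary size
    out = pool(feat).flatten(1)
    print(out.shape)                         # always (1, 4*4*256) = (1, 4096)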


Therefore, in order to use convolutional neural networks for images of any size, we replace the last pooling layer in the original network with a spatial pyramid pooling layer.

(1) Through spatial pyramid pooling, the input image can be of any size. This allows not only arbitrary aspect ratios, but also arbitrary scales.

By selecting pooling windows of different sizes (w, h) and different strides, an input image of any size and scale can be mapped to an output of any scale, or to a fixed-length vector to feed a classifier.

(2) Scale plays an important role in traditional image processing methods, and it is also important for the accuracy of deep networks. With pooling windows of different scales, the network can extract features at different scales from the image.

Pooling windows of different sizes (kw, kh) and different strides are selected to construct a 3-level (3*3, 2*2, 1*1) pyramid pooling layer.

As shown in the figure, the output feature map of the last convolutional layer of the network has dimensions (h, w, ch) and is used as the input of the SPP layer. At the bottom of the pyramid, the pooling window is set equal to the height and width of the feature map, so the pooled output is 1*1; this is in effect a "global pooling" operation. At the next level, the pooling window is 1/2 of the feature-map height and width, and the pooled output is 2*2. At the top level, the pooling window is 1/3 or 1/4 of the feature-map height and width, and the pooled output is 3*3 or 4*4. Flattening the 3-level pooling outputs and concatenating them gives, per channel, a one-dimensional vector of length 1 + 2*2 + 4*4 = 21. For ch=256, the total length of the concatenated one-dimensional vector is 21*256 = 5376. (Note: the output of the third level is 4*4 in the paper and 3*3 in the example; both are valid, user-determined choices.)
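A short check of these numbers (adaptive pooling stands in for the SPP layer defined in section 2.2):

import torch
import torch.nn as nn

feat = torch.randn(1, 256, 13, 13)           # output of the last conv layer, ch=256
levels = [4, 2, 1]                           # 4*4, 2*2 and 1*1 pyramid levels
parts = [nn.AdaptiveMaxPool2d(n)(feat).flatten(1) for n in levels]
print(torch.cat(parts, dim=1).shape)         # (1, (16+4+1)*256) = (1, 5376)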


1.4 Object Detection

In the object detection task, R-CNN first uses selective search to select about 2000 candidate windows from the image, then warps the image region in each window to a fixed size of 227×227, uses a convolutional neural network to extract the features of each window, and finally applies an SVM classifier to these features. R-CNN runs the convolutional network repeatedly on the 2000 windows of each image, which is time-consuming; feature extraction is the main bottleneck in the test phase.

We run the convolutional network on the entire image to obtain a feature map, and then for each candidate window in the image, apply spatial pyramid pooling on the corresponding window of the feature map to form a fixed-length representation of this window. Because the convolutional network is only applied once, our method is much faster.



For example, in each candidate window, we use a 4-level spatial pyramid (1*1, 2*2, 3*3, 6*6) to pool the features, flatten the 4-level outputs, and concatenate them into a one-dimensional vector of length 1 + 2*2 + 3*3 + 6*6 = 50 per channel. For ch=256, the total length of the one-dimensional vector is 50*256 = 12800.
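A sketch of this window pooling (the window-to-feature-map coordinate mapping is simplified to a plain crop here, and the coordinates are illustrative; a real implementation must account for the network's total stride):

import torch
import torch.nn as nn

feat = torch.randn(1, 256, 40, 60)           # conv feature map of the whole image
x0, y0, x1, y1 = 10, 5, 34, 29               # one candidate window, feature-map coords
window = feat[:, :, y0:y1, x0:x1]            # reuse the once-computed features

levels = [6, 3, 2, 1]                        # the 4-level pyramid: 6*6, 3*3, 2*2, 1*1
parts = [nn.AdaptiveMaxPool2d(n)(window).flatten(1) for n in levels]
print(torch.cat(parts, dim=1).shape)         # (1, 50*256) = (1, 12800)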


1.5 Summary

(1) SPP-Net allows input images of different sizes to obtain feature vectors of the same size.

(2) In the object detection task, SPP-Net runs the convolutional network only once on the entire image, which is much faster than R-CNN.

(3) Selective search (SS) is still used to generate candidate regions, and the process is relatively slow.


2. Define the SPP model class in PyTorch

2.1 Parameter Calculation of SPP Layer

$$
\begin{aligned}
k_h &= \lceil h_{in}/n \rceil = \mathrm{ceil}(h_{in}/n) \\
s_h &= \lceil h_{in}/n \rceil = \mathrm{ceil}(h_{in}/n) \\
p_h &= \lfloor (k_h \cdot n - h_{in} + 1)/2 \rfloor = \mathrm{floor}((k_h \cdot n - h_{in} + 1)/2) \\
h_{new} &= h_{in} + 2 \cdot p_h \\
\\
k_w &= \lceil w_{in}/n \rceil = \mathrm{ceil}(w_{in}/n) \\
s_w &= \lceil w_{in}/n \rceil = \mathrm{ceil}(w_{in}/n) \\
p_w &= \lfloor (k_w \cdot n - w_{in} + 1)/2 \rfloor = \mathrm{floor}((k_w \cdot n - w_{in} + 1)/2) \\
w_{new} &= w_{in} + 2 \cdot p_w
\end{aligned}
$$

where $k_h, k_w$ are the height and width of the pooling window, $s_h, s_w$ are the strides in the height and width directions, $p_h, p_w$ are the border padding values (applied on both sides), and $n$ is the number of output bins per side at the given pyramid level.

Note that the pooling window size and stride are computed with ceil() (rounding up), while the padding is computed with floor() (rounding down).
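A worked example of these formulas (illustrative values: pooling a 13-wide feature map into n = 3 bins):

import math

def spp_params(size_in, n):
    # kernel and stride round up; padding rounds down
    k = math.ceil(size_in / n)
    s = math.ceil(size_in / n)
    p = math.floor((k * n - size_in + 1) / 2)
    size_out = math.floor((size_in + 2 * p - k) / s) + 1
    return k, s, p, size_out

print(spp_params(13, 3))   # (5, 5, 1, 3): exactly 3 output bins, as required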


2.2 Define SPP Layer

The routine for constructing the SPP layer (Spatial Pyramid Pooling Layer) is as follows.

import math
import torch
import torch.nn as nn

# Build the SPP layer (Spatial Pyramid Pooling layer)
class SPPlayer(nn.Module):
    def __init__(self, num_levels, pool_type='max_pool'):
        super(SPPlayer, self).__init__()
        self.num_levels = num_levels
        self.pool_type = pool_type

    def forward(self, x):
        b, c, h, w = x.size()
        xSPP = torch.zeros([b, 0], device=x.device)  # empty tensor on the same device as x
        for i in range(self.num_levels):
            level = i + 1
            # window, stride and padding that pool (h, w) into a (level, level) grid
            kh, kw = math.ceil(h/level), math.ceil(w/level)
            sh, sw = math.ceil(h/level), math.ceil(w/level)
            ph, pw = math.floor((kh*level-h+1)/2), math.floor((kw*level-w+1)/2)

            if self.pool_type == 'max_pool':
                # max pooling
                tensor = nn.functional.max_pool2d(x, kernel_size=(kh, kw), stride=(sh, sw), padding=(ph, pw))
            else:
                # average pooling
                tensor = nn.functional.avg_pool2d(x, kernel_size=(kh, kw), stride=(sh, sw), padding=(ph, pw))
            # flatten each level to 1-D and concatenate
            xSPP = torch.cat((xSPP, tensor.view(b, -1)), 1)  # e.g. (1+2*2+3*3)*c = 14*c
        return xSPP
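A quick check, run after defining the class above, that the layer really produces a fixed-length vector for inputs of different sizes:

spp = SPPlayer(num_levels=3)
for h, w in [(13, 13), (20, 31)]:
    x = torch.randn(2, 128, h, w)
    print(spp(x).shape)   # always torch.Size([2, 1792]) = (1+2*2+3*3)*128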

2.3 SPPnet model class

Referring to the AlexNet model, use the SPP Layer to construct the SPPnet model as follows.

from collections import OrderedDict

class SPPnet(nn.Module):
    # Expected input size is 64x64
    def __init__(self, spp_level=3, num_classes=10):
        super(SPPnet, self).__init__()
        self.spp_level = spp_level
        self.num_grids = 0
        for i in range(spp_level):
            self.num_grids += (i+1)**2  # 1*1, 2*2, 3*3

        self.features = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(3, 128, 3)),
            ('relu1', nn.ReLU()),
            ('pool1', nn.MaxPool2d(2)),
            ('conv2', nn.Conv2d(128, 128, 3)),
            ('relu2', nn.ReLU()),
            ('pool2', nn.MaxPool2d(2)),
            ('conv3', nn.Conv2d(128, 128, 3)),
            ('relu3', nn.ReLU()),
            ('pool3', nn.MaxPool2d(2)),
            ('conv4', nn.Conv2d(128, 128, 3)),
            ('relu4', nn.ReLU())
        ]))

        self.spp_layer = SPPlayer(spp_level, 'max_pool')

        self.classifier = nn.Sequential(OrderedDict([
            # ('dropout', nn.Dropout(p=0.2, inplace=True)),
            ('fc1', nn.Linear(self.num_grids*128, 1024)),
            ('fc1_relu', nn.ReLU()),
            ('fc2', nn.Linear(1024, num_classes)),
        ]))

    def forward(self, x):
        x = self.features(x)
        x = self.spp_layer(x)
        x = self.classifier(x)
        return x

The structure of the model is as follows.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1          [-1, 128, 62, 62]           3,584
              ReLU-2          [-1, 128, 62, 62]               0
         MaxPool2d-3          [-1, 128, 31, 31]               0
            Conv2d-4          [-1, 128, 29, 29]         147,584
              ReLU-5          [-1, 128, 29, 29]               0
         MaxPool2d-6          [-1, 128, 14, 14]               0
            Conv2d-7          [-1, 128, 12, 12]         147,584
              ReLU-8          [-1, 128, 12, 12]               0
         MaxPool2d-9            [-1, 128, 6, 6]               0
           Conv2d-10            [-1, 128, 4, 4]         147,584
             ReLU-11            [-1, 128, 4, 4]               0
         SPPlayer-12                 [-1, 1792]               0
           Linear-13                 [-1, 1024]       1,836,032
             ReLU-14                 [-1, 1024]               0
           Linear-15                   [-1, 10]          10,250
================================================================
Total params: 2,292,618
Trainable params: 2,292,618
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.05
Forward/backward pass size (MB): 10.66
Params size (MB): 8.75
Estimated Total Size (MB): 19.45
----------------------------------------------------------------
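The table above is in the format of the torchsummary package; presumably it was generated along these lines (the package and call are an assumption, not shown in the original; device='cpu' works because the SPP layer above allocates on the input's device):

from torchsummary import summary

model = SPPnet(spp_level=3, num_classes=10)
summary(model, input_size=(3, 64, 64), device='cpu')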

3. Training of SPPnet model

[SPP-net_pytorch / GitHub]
https://github.com/zjZSTU/SPP-net/tree/master/py

The referenced repository builds on a ZFNet backbone as the feature extractor; its definition is as follows.

import torch
import torch.nn as nn


class ZFNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(ZFNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
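A quick smoke test of this backbone (a sketch; the 224×224 input size is the usual choice, and the adaptive average pooling guarantees the 256*6*6 classifier input):

model = ZFNet(num_classes=1000)
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)   # torch.Size([1, 1000])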


【End of this section】


Copyright statement:
Welcome to the "youcans hands-on model" series.
For forwarding, please indicate the original link:
[youcans hands-on model] SPPNet model for object detection
https://blog.csdn.net/youcans/article/details/131910101
Copyright 2023 youcans, XUPT
Created: 2023-07-25

