No more strided convolution or pooling: new CNN building blocks for low-resolution images and small objects

Summary

https://arxiv.org/pdf/2208.03641.pdf
Convolutional neural networks (CNNs) have achieved great success in many computer vision tasks, such as image classification and object detection. However, their performance degrades rapidly on harder tasks where images are of low resolution or objects are small. In this paper, we point out that this stems from a flawed yet common design in existing CNN architectures, namely the use of strided convolution and/or pooling layers, which causes a loss of fine-grained information and the learning of less effective feature representations. To this end, we propose a new CNN building block, called SPD-Conv, to replace each strided convolution layer and each pooling layer (thereby eliminating them altogether). SPD-Conv consists of a space-to-depth (SPD) layer followed by a non-strided convolution (Conv) layer, and can be applied to most if not all CNN architectures. We explain this new design under two of the most representative computer vision tasks: object detection and image classification. We then create new CNN architectures by applying SPD-Conv to YOLOv5 and ResNet. Empirically, our approach significantly outperforms state-of-the-art deep learning models, especially on harder tasks with low-resolution images and small objects. We have open-sourced our code at https://github.com/LabSAINT/SPD-Conv.

1 Introduction

Since AlexNet [18], convolutional neural networks (CNNs) have excelled in many computer vision tasks. For example, well-known CNN models for image classification include AlexNet, VGGNet [30], ResNet [13], etc.; for object detection, they include the R-CNN series [9,28], the YOLO series [26,4], SSD [24], EfficientDet [34], etc. However, all these CNN models need "good quality" input (clear images, medium to large objects) during training and inference. For example, AlexNet was originally trained and evaluated on 227 × 227 clear images, but after reducing the image resolution to 1/4 and 1/8, its classification accuracy drops by 14% and 30%, respectively [16]. Similar observations were made for VGGNet and ResNet [16]. In the case of object detection, SSD suffers a significant mAP loss on 1/4-resolution images, which equivalently means objects of 1/4 the original size, as shown in [11]. In fact, small object detection is a very challenging task, because smaller objects inherently have lower resolution, and the contextual information a model can learn from them is also limited. Moreover, they (unfortunately) often coexist with large objects in the same image, and the large objects tend to dominate the feature learning process, leaving the small objects undetected.

In this paper, we argue that this performance degradation is rooted in a flawed yet common design in existing CNNs, namely the use of strided convolution and/or pooling, especially in the earlier layers of a CNN architecture. The adverse effect of this design usually does not manifest itself, because most of the scenarios studied are "friendly", where images have good resolution and objects are of moderate size; there is thus plenty of redundant pixel information that strided convolution and pooling can conveniently skip, and the model can still learn features quite well. However, in harder scenarios where images are blurry or objects are small, the assumption of abundant redundant information no longer holds, and the current design starts to suffer from the loss of fine-grained information and poorly learned features.

To solve this problem, we propose a new CNN building block called SPD-Conv to replace (and thereby eliminate) strided convolution and pooling layers. SPD-Conv is a space-to-depth (SPD) layer followed by a non-strided (i.e., ordinary) convolutional layer. The SPD layer downsamples a feature map X but retains all the information in the channel dimension, so there is no information loss. We were inspired by an image transformation technique [29] that rescales a raw image before feeding it into a neural network, but we substantially generalize it to downsampling feature maps inside and throughout the network; in addition, we add a non-strided convolution after each SPD to reduce the (increased) number of channels using learnable parameters in the added convolutional layer. Our proposed approach is both general and unified, in that SPD-Conv (i) can be applied to most (if not all) CNN architectures and (ii) replaces strided convolution and pooling in the same uniform way. In summary, this paper makes the following contributions:

1. We discovered a flawed but common design in existing CNN architectures and proposed a new building block, called SPD-Conv, to replace the old design. SPD-Conv downsamples feature maps without losing learnable information, completely abandoning the now widely used strided convolution and pooling operations.
2. SPD-Conv represents a general and unified approach that can be easily applied to most (if not all) deep learning-based computer vision tasks.
3. We use two of the most representative computer vision tasks, object detection and image classification, to evaluate SPD-Conv. Specifically, we build YOLOv5-SPD, ResNet18-SPD, and ResNet50-SPD and compare them with several state-of-the-art deep learning models on the COCO-2017, Tiny ImageNet, and CIFAR-10 datasets. The results show that SPD-Conv achieves significant improvements in AP and top-1 accuracy, especially on small objects and low-resolution images. See Figure 1 for a preview.

4. SPD-Conv can be easily integrated into popular deep learning libraries such as PyTorch and TensorFlow, potentially having a greater impact. Our source code is available at https://github.com/LabSAINT/SPD-Conv.

The remainder of this article is organized as follows. Section 2 provides background and reviews related work. Section 3 describes our proposed method, and Section 4 presents two case studies on object detection and image classification. Section 5 provides the performance evaluation, and Section 6 concludes the paper.

2. Background and related work

We first provide an overview of the field, focusing on object detection as it encompasses image classification.

Current state-of-the-art object detection models are CNN-based and can be divided into one-stage and two-stage detectors, or anchor-based and anchor-free detectors. A two-stage detector first generates coarse region proposals and then uses (fully connected) heads to classify and refine each proposal. In contrast, one-stage detectors skip the region proposal step and perform detection directly over a dense sampling of locations. Anchor-based methods use anchor boxes, a predefined set of boxes matching the widths and heights of objects in the training data, to improve loss convergence during training. Table 1 categorizes some well-known models.

Generally, one-stage detectors are faster than two-stage detectors, and anchor-based models are more accurate than anchor-free models. Therefore, in our case studies and experiments, we focus more on the one-stage and anchor-based models in the first column of Table 1.

Figure 2 shows a typical one-stage object detection model. It consists of a CNN-based backbone for visual feature extraction and a detection head that predicts the class and bounding box of each contained object. In between, a neck is added to combine features at multiple scales, producing semantically strong features for detecting objects of different sizes.

2.1 Small object detection

Traditionally, detecting both small and large objects is treated as a multi-scale object detection problem. A classic method is the image pyramid [3], which resizes the input image to multiple scales and trains a dedicated detector for each scale. To improve accuracy, SNIP [31] proposed selective backpropagation based on object size in each detector. SNIPER [32] improves the efficiency of SNIP by processing only the context regions around each object instance rather than every pixel in the image pyramid, thereby reducing training time. Taking a different approach to efficiency, Feature Pyramid Network (FPN) exploits the multi-scale features inherent in convolutional layers, combining them via lateral connections in a top-down structure. Later, PANet [22] and BiFPN [34] improved FPN's feature information flow by using shorter paths. In addition, SAN [15] was introduced to map multi-scale features onto a scale-invariant subspace, making the detector more robust to scale variations. All these models consistently use strided convolution and max pooling, whereas we get rid of them completely.

2.2 Low-resolution image classification

One of the early attempts to address this challenge is [6], which proposes an end-to-end CNN model that adds a super-resolution step before classification. Afterwards, [25] proposed transferring fine-grained knowledge obtained from high-resolution training images to low-resolution test images. However, this approach requires high-resolution training images that correspond to the specific application (e.g., its classes), which are not always available.

There are several other studies such as [37] that also require high-resolution training images. Recently, [33] proposed a loss function that incorporates attribute-level separability (where attributes refer to fine, hierarchical class labels) so that the model can learn class-specific discriminative features. However, fine (hierarchical) class labels are difficult to obtain, thus limiting the use of this approach.

3. A new building block: SPD-Conv

SPD-Conv consists of a space-to-depth (SPD) layer followed by a non-strided convolutional layer. This section describes it in detail.

3.1 Space to Depth (SPD)

Our SPD component generalizes a (raw) image transformation technique [29] to downsampling feature maps inside and throughout a CNN, as follows.

Consider any intermediate feature map $X$ of size $S \times S \times C_1$. We slice out a sequence of sub-feature maps as

$$
\begin{array}{l}
f_{0,0}=X[0: S: \text{scale},\ 0: S: \text{scale}],\ f_{1,0}=X[1: S: \text{scale},\ 0: S: \text{scale}],\ \ldots,\\
\quad f_{\text{scale}-1,0}=X[\text{scale}-1: S: \text{scale},\ 0: S: \text{scale}];\\
f_{0,1}=X[0: S: \text{scale},\ 1: S: \text{scale}],\ f_{1,1},\ \ldots,\\
\quad f_{\text{scale}-1,1}=X[\text{scale}-1: S: \text{scale},\ 1: S: \text{scale}];\\
\qquad \vdots\\
f_{0,\text{scale}-1}=X[0: S: \text{scale},\ \text{scale}-1: S: \text{scale}],\ f_{1,\text{scale}-1},\ \ldots,\\
\quad f_{\text{scale}-1,\text{scale}-1}=X[\text{scale}-1: S: \text{scale},\ \text{scale}-1: S: \text{scale}].
\end{array}
$$

In general, given any (original) feature map $X$, a sub-map $f_{x,y}$ consists of all the entries $X(i, j)$ for which $i + x$ and $j + y$ are divisible by scale. Therefore, each sub-map downsamples $X$ by a factor of scale. Figure 3(a)(b)(c) gives an example with scale = 2: we obtain four sub-maps $f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}$, each of shape $(\frac{S}{2}, \frac{S}{2}, C_1)$, and each downsampling $X$ by a factor of 2.

Next, we concatenate these sub-feature maps along the channel dimension to obtain a feature map $X'$, whose spatial dimensions are reduced by a factor of scale and whose channel dimension is increased by a factor of $\text{scale}^2$. In other words, SPD transforms the feature map $X(S, S, C_1)$ into an intermediate feature map $X'(\frac{S}{\text{scale}}, \frac{S}{\text{scale}}, \text{scale}^2 C_1)$ (Figure 3(d) gives an example with scale = 2).
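
To make the transformation concrete, here is a minimal PyTorch sketch of the SPD slicing-and-concatenation step. This is our own illustration rather than the authors' released code; it assumes NCHW tensors, and the ordering of the concatenated sub-maps is an arbitrary but fixed choice (functionally equivalent to `torch.nn.functional.pixel_unshuffle` up to channel ordering).

```python
import torch

def space_to_depth(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Slice a (N, C, S, S) feature map into scale**2 sub-maps and
    concatenate them along the channel dimension (no information is lost)."""
    n, c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0, "spatial size must be divisible by scale"
    sub_maps = [
        x[..., i::scale, j::scale]      # sub-map f_{i,j}: rows/cols taken with offset (i, j)
        for j in range(scale)
        for i in range(scale)
    ]
    return torch.cat(sub_maps, dim=1)   # shape: (N, scale**2 * C, S/scale, S/scale)

x = torch.randn(1, 3, 8, 8)
print(space_to_depth(x, 2).shape)       # torch.Size([1, 12, 4, 4])
```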

3.2 Non-strided convolution

After the SPD feature transformation layer, we add a non-strided (i.e., stride-1) convolutional layer with $C_2$ filters, where $C_2 < \text{scale}^2 C_1$, which further transforms $X'(\frac{S}{\text{scale}}, \frac{S}{\text{scale}}, \text{scale}^2 C_1)$ into $X''(\frac{S}{\text{scale}}, \frac{S}{\text{scale}}, C_2)$. We use non-strided convolution so as to retain as much discriminative feature information as possible. Otherwise, with a 3 × 3 filter of stride 3, for instance, the feature map would "shrink" but each pixel would be sampled only once; with stride 2, asymmetric sampling would occur, where even and odd rows/columns are sampled different numbers of times. In general, striding with a step size greater than 1 causes a non-discriminative loss of information, even though on the surface it also appears to convert $X(S, S, C_1)$ into $X''(\frac{S}{\text{scale}}, \frac{S}{\text{scale}}, C_2)$ (but without $X'$).
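
The full building block then pairs this space-to-depth step with a stride-1 convolution. The sketch below is a hedged illustration under our own assumptions (a 3 × 3 kernel, BatchNorm, and SiLU activation in the style of YOLOv5 blocks); consult the authors' repository for the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided (stride-1) convolution.

    A drop-in replacement for a stride-2 convolution: the spatial size shrinks
    by `scale`, but pixels are first folded into channels rather than skipped.
    """
    def __init__(self, c1: int, c2: int, scale: int = 2, k: int = 3):
        super().__init__()
        self.scale = scale
        # After SPD the tensor has scale**2 * c1 channels; reduce them to c2
        # with learnable parameters instead of discarding information by striding.
        self.conv = nn.Conv2d(scale ** 2 * c1, c2, k, stride=1, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pixel_unshuffle(x, self.scale)   # built-in space-to-depth operation
        return self.act(self.bn(self.conv(x)))

# One 2x downsampling step: 64 -> 128 channels, 80x80 -> 40x40 spatial.
y = SPDConv(64, 128)(torch.randn(1, 64, 80, 80))
print(y.shape)   # torch.Size([1, 128, 40, 40])
```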

4. How to use SPD-Conv: case study

To explain how our proposed method can be applied to redesigning CNN architectures, we use the two most representative categories of computer vision models: object detection and image classification. This is without loss of generality, as almost all CNN architectures use strided convolution and/or pooling operations for downsampling.

4.1 Object detection

YOLO is a series of very popular object detection models, and we choose the latest YOLOv5 [14] for our demonstration. YOLOv5 uses CSPDarknet53 [4] with an SPP [12] module as its backbone, PANet [23] as its neck, and the YOLOv3 head [26] as its detection head. In addition, it uses various data augmentation methods and some modules from YOLOv4 [4] for performance optimization. It uses cross-entropy loss with a sigmoid layer to compute the objectness and classification losses, and the CIoU loss [38] for the localization loss. CIoU takes more details into account than IoU, such as edge overlap, center distance, and aspect ratio.

YOLOv5-SPD. We apply the method described in Section 3 to YOLOv5 and obtain YOLOv5-SPD (Figure 4) simply by replacing YOLOv5's stride-2 convolutions with our SPD-Conv building block. There are seven instances of this replacement, because YOLOv5 uses five stride-2 convolutional layers in the backbone (to downsample the feature map by a factor of 2^5) and two stride-2 convolutional layers in the neck. In YOLOv5's neck, each strided convolution is followed by a concatenation layer; this does not affect our approach, and we simply keep the concatenation between our SPD and Conv layers.

Scalability. Like YOLOv5, YOLOv5-SPD can easily be scaled up and down to suit different application or hardware needs. Specifically, we simply adjust (1) the number of filters in each non-strided convolutional layer and/or (2) the number of repetitions of the C3 module (as shown in Figure 4) to obtain different versions of YOLOv5-SPD. The first is called width scaling, which changes the original width $n_w$ (number of channels) to $\lceil n_w \times \text{width\_factor} \rceil_8$ (rounded up to the nearest multiple of 8). The second is called depth scaling, which changes the original depth $n_d$ (the number of times the C3 module is repeated; e.g., 9 C3 modules in Figure 4) to $\lceil n_d \times \text{depth\_factor} \rceil$. By choosing different width/depth factors, we obtain the nano, small, medium, and large versions of YOLOv5-SPD, as shown in Table 2; the factor values are the same as those of YOLOv5 so that they can be compared in later experiments.
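
As a concrete illustration of these scaling rules, a small helper could look like the sketch below. This is our own approximation of the formulas above; YOLOv5's actual rounding helpers (e.g., its `make_divisible` utility) may differ in minor details, and the example factors 0.33/0.25 are the usual YOLOv5 nano multipliers, given here only for illustration.

```python
import math

def scale_width(n_w: int, width_factor: float, divisor: int = 8) -> int:
    # Scale the channel count and round up to the nearest multiple of `divisor` (8 here).
    return int(math.ceil(n_w * width_factor / divisor) * divisor)

def scale_depth(n_d: int, depth_factor: float) -> int:
    # Scale the number of C3 repeats, keeping at least one repeat.
    return max(1, math.ceil(n_d * depth_factor))

# Example: with depth_factor = 0.33 and width_factor = 0.25,
# 9 C3 repeats become 3 and 256 channels become 64.
print(scale_depth(9, 0.33), scale_width(256, 0.25))   # 3 64
```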

4.2 Image classification

A typical classification CNN starts with a stem unit consisting of a convolution with stride 2 and a pooling layer to reduce the image resolution by a factor of 4. A popular model is ResNet [13], which won the ILSVRC 2015 challenge. ResNet introduces residual connections, allowing the training of networks with a depth of up to 152 layers. It also significantly reduces the total number of parameters by using only a single fully connected layer. A softmax layer is used at the end to normalize the class predictions.

ResNet18-SPD and ResNet50-SPD. Both ResNet-18 and ResNet-50 use four stride-2 convolutions plus a stride-2 max-pooling layer to downsample each input image by a factor of 2^5. Applying our proposed building block, we replace the four strided convolutions with SPD-Conv; the max-pooling layer, on the other hand, we simply remove, because our main target is low-resolution images and the images in the datasets used in our experiments are already quite small (64 × 64 in Tiny ImageNet and 32 × 32 in CIFAR-10), so pooling is unnecessary. For larger images, this max-pooling layer could likewise be replaced by SPD-Conv in the same way. The two new architectures are shown in Table 3.
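
As one minimal illustration of the stem change, consider the sketch below. It is written under our own assumptions and is not the authors' code: it converts only the stride-2 stem convolution of a torchvision ResNet-18 and removes max pooling, whereas the full ResNet18-SPD in Table 3 also replaces the stride-2 convolutions inside the downsampling residual blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class SPDStem(nn.Module):
    """Space-to-depth + stride-1 conv, replacing ResNet's 7x7 stride-2 stem conv."""
    def __init__(self, c1: int = 3, c2: int = 64, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(scale ** 2 * c1, c2, kernel_size=3,
                              stride=1, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pixel_unshuffle(x, self.scale))

model = resnet18(num_classes=200)      # Tiny ImageNet has 200 classes
model.conv1 = SPDStem(3, 64, scale=2)  # SPD-Conv in place of the stride-2 stem conv
model.maxpool = nn.Identity()          # drop max pooling for small (64x64) inputs

out = model(torch.randn(2, 3, 64, 64))
print(out.shape)                       # torch.Size([2, 200])
```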

5 Experiments

This section uses two representative computer vision tasks, namely object detection and image classification, to evaluate our proposed SPD-Conv method.

5.1 Object detection

Datasets and settings. We use the COCO-2017 dataset [1], which is split into train2017 (118,287 images) for training, val2017 (5,000 images; also called minival) for validation, and test2017 (40,670 images) for testing. We compare against the state-of-the-art baseline models listed in Tables 4 and 5. We report the standard average precision (AP) metrics on val2017 under different IoU thresholds [0.5:0.95] and object sizes (small, medium, large). We also report AP metrics on test-dev2017 (20,288 images), the subset of test2017 with accessible annotations. These annotations are not publicly released; instead, a JSON file containing all predicted labels must be submitted to the CodaLab COCO Detection Challenge [2] to retrieve the evaluation metrics, which we did.

Training. We train the different versions (nano, small, medium, and large) of YOLOv5-SPD and all the baseline models on train2017. Unlike most other studies, we do not use transfer learning but instead train from scratch. This is because we want to examine the true learning capability of each model without it being masked by the rich feature representations inherited via transfer learning from an ideal (high-quality) dataset such as ImageNet. This applies to our own models (*-SPD-n/s/m/l) and all existing YOLO-series models (YOLOv5, YOLOX, YOLOv4, and their scaled versions such as nano, small, large, etc.). The other baseline models still use transfer learning, because we lack the resources to train them from scratch (which consumes a large amount of GPU time). Note, however, that this only puts those baselines at an advantage over our own models, since they benefit from a high-quality dataset.

We use the SGD optimizer with momentum 0.937 and weight decay 0.0005. During three warm-up epochs, the learning rate increases linearly from 0.0033 to 0.01, and then decreases to a final value of 0.001 following a cosine decay schedule. The nano and small models are trained on four V100 32 GB GPUs with a batch size of 128, while the medium and large models are trained with a batch size of 32. For objectness and classification, we adopt the CIoU loss [38] and cross-entropy loss. We also employ several data augmentation techniques to mitigate overfitting and improve performance for all models; these include (i) photometric distortions of hue, saturation, and value, (ii) geometric distortions such as translation, scaling, shearing, and left-right/up-down flipping (fliplr and flipud), and (iii) multi-image augmentation techniques such as Mosaic and CutMix. Note that no augmentation is used during inference. The hyperparameters are adopted directly from YOLOv5 without re-tuning.
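
The optimization setup above can be sketched roughly as follows. This is a simplified approximation rather than YOLOv5's exact scheduler (which also warms up per-parameter-group learning rates and momentum), and the total epoch count here is a placeholder assumption.

```python
import math
import torch

def lr_lambda(epoch: int, warmup_epochs: int = 3, total_epochs: int = 300,
              lr0: float = 0.01, lr_warmup: float = 0.0033, lr_final: float = 0.001) -> float:
    """Multiplicative LR factor: linear warm-up to lr0, then cosine decay to lr_final."""
    if epoch < warmup_epochs:
        lr = lr_warmup + (lr0 - lr_warmup) * epoch / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        lr = lr_final + (lr0 - lr_final) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr / lr0  # LambdaLR multiplies the base learning rate by this factor

model = torch.nn.Linear(10, 10)  # stand-in module for YOLOv5-SPD
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(300):
    # ... one training epoch over train2017 would run here ...
    scheduler.step()
```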

Results
Table 4 reports the results on val2017, and Table 5 reports the results on test-dev2017. In both tables, $\mathrm{AP}_S$, $\mathrm{AP}_M$, and $\mathrm{AP}_L$ denote the AP for small, medium, and large objects, respectively, and should not be confused with the model scales (nano, small, medium, large). Also note that an image resolution of 640 × 640 is not high for object detection (as opposed to image classification), because the resolution of the actual objects is much lower, especially when the objects are small.

Results on val2017. Table 4 is organized by model scale, with groups separated by horizontal lines (the last group contains the large-scale models). In the first group of nano models, our YOLOv5-SPD-n performs best on both AP and $\mathrm{AP}_S$: its $\mathrm{AP}_S$ is 13.15% higher than that of the runner-up, YOLOv5n, and its overall AP is 10.7% higher than YOLOv5n as well.

In the second, small model category, our YOLOv5-SPD-s again performs best on both AP and $\mathrm{AP}_S$, while this time YOLOX-S is the second best on AP.

In the third, medium model category, our YOLOv5-SPD-m still outperforms the other models, although the AP margin is now fairly small. On the other hand, our $\mathrm{AP}_S$ is 8.6% higher than the second best, a sizable lead, which is a good sign since SPD-Conv is particularly geared toward smaller objects and lower resolutions.

Finally, for the large models, YOLOX-L achieves the best AP, with our YOLOv5-SPD-l only slightly (3%) lower (yet much better than the other baselines shown in the last group). On the other hand, our $\mathrm{AP}_S$ is still the highest, again reflecting the advantage of SPD-Conv described above.

Results on test-dev2017. As shown in Table 5, the $\mathrm{AP}_S$ of our YOLOv5-SPD-n in the nano model category is once again the clear winner, by a good margin (19% higher than the runner-up, YOLOv5n). For average AP, EfficientDet-D0 appears to perform better than ours, but this is because EfficientDet has almost twice as many parameters and was trained using high-resolution images (via transfer learning, in the cells marked "Trf"), and AP is highly dependent on resolution. This training benefit is also reflected in the small model category.

Although the other baselines receive this benefit, our method regains the top ranking in the next, medium model category, performing best on both AP and $\mathrm{AP}_S$. Finally, in the large model category, our YOLOv5-SPD-l again performs best on $\mathrm{AP}_S$, with an AP very close to that of YOLOX-L.

Summary. Clearly, by simply replacing strided convolution and pooling layers with our proposed SPD-Conv building block, a neural network can improve its accuracy significantly while maintaining the same level of parameter count. The improvement is more pronounced on smaller objects, which matches our goal well. While we do not always take first place in every case, SPD-Conv is the only approach that performs consistently well: when it does not win, it is occasionally a (very close) runner-up, and on $\mathrm{AP}_S$ (our main target metric) it always wins.

Finally, please note that we adopted untuned YOLOv5 hyperparameters, which means that our model may perform better after specialized hyperparameter tuning. Note also that all non-YOLO baselines (and PP-YOLO) were trained using transfer learning and therefore benefited from high-quality images, whereas our model did not.

Visual comparison. For an intuitive understanding, we provide two real examples using two randomly selected images, as shown in Figure 5. We compare YOLOv5-SPD-m and YOLOv5m, since the latter performs best among all baselines in the corresponding (medium) category. Figure 5(a)(b) shows that YOLOv5-SPD-m detects an occluded giraffe that YOLOv5m misses, and Figure 5(c)(d) shows that YOLOv5-SPD-m detects very small objects (a face and two benches) that YOLOv5m cannot.

5.2 Image classification

Datasets and settings. For the image classification task, we use the Tiny ImageNet [19] and CIFAR-10 [17] datasets. Tiny ImageNet is a subset of the ILSVRC-2012 classification dataset and contains 200 classes; each class has 500 training images, 50 validation images, and 50 test images, and each image has a resolution of 64 × 64 × 3 pixels. CIFAR-10 consists of 60,000 images of resolution 32 × 32 × 3, split into 50,000 training images and 10,000 test images; there are 10 classes with 6,000 images each. We use top-1 accuracy as the classification performance metric.

Training. We train our ResNet18-SPD model on Tiny ImageNet. We perform a random grid search to tune the hyperparameters, including the learning rate, batch size, momentum, optimizer, and weight decay. Figure 6 shows a sample hyperparameter sweep generated using the wandb MLOps platform. The resulting configuration uses the SGD optimizer with a learning rate of 0.01793 and momentum of 0.9447, a mini-batch size of 256, weight decay regularization of 0.002113, and 200 training epochs. Next, we train our ResNet50-SPD model on CIFAR-10. Its hyperparameters follow the original ResNet paper: the SGD optimizer with an initial learning rate of 0.1 and momentum of 0.9, a batch size of 128, weight decay regularization of 0.0001, and 200 training epochs. For both ResNet18-SPD and ResNet50-SPD, we use the same decay function as ResNet to reduce the learning rate as the number of epochs increases.
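
A random hyperparameter sweep of the kind described can be set up with Weights & Biases roughly as in the sketch below; the search ranges, project name, and body of train() are placeholder assumptions for illustration, not the authors' actual sweep configuration.

```python
import wandb

sweep_config = {
    "method": "random",  # random search over the hyperparameter space
    "metric": {"name": "val_top1", "goal": "maximize"},
    "parameters": {
        "lr":           {"min": 0.001, "max": 0.1},
        "momentum":     {"min": 0.8, "max": 0.99},
        "batch_size":   {"values": [64, 128, 256]},
        "weight_decay": {"min": 1e-4, "max": 1e-2},
        "optimizer":    {"values": ["sgd", "adam"]},
    },
}

def train():
    run = wandb.init()
    cfg = wandb.config
    # ... build ResNet18-SPD and train it on Tiny ImageNet using
    # cfg.lr, cfg.momentum, cfg.batch_size, cfg.weight_decay, cfg.optimizer ...
    wandb.log({"val_top1": 0.0})  # placeholder: log the real validation accuracy here
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="resnet18-spd-tinyimagenet")
wandb.agent(sweep_id, function=train, count=50)
```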

Testing. Accuracy on Tiny ImageNet is evaluated on the validation set, since ground-truth labels are not available for its test set. Accuracy on CIFAR-10 is computed on the test set.

Results. Table 6 summarizes the top-1 accuracy results. It shows that our models, ResNet18-SPD and ResNet50-SPD, significantly outperform all the other baseline models.

Finally, Figure 7 gives a visual illustration on Tiny ImageNet: eight examples that ResNet-18 misclassifies but ResNet18-SPD classifies correctly. The common characteristic of these images is low resolution, which challenges a standard ResNet that loses fine-grained information in its strided convolution and pooling operations.

6 Conclusion

This paper identifies a common but flawed design in existing CNN architectures, namely the use of strided convolution and/or pooling layers, which causes the loss of fine-grained feature information, especially on low-resolution images and small objects. We then propose a new CNN building block, called SPD-Conv, that eliminates strided convolution and pooling by replacing them with a space-to-depth layer followed by a non-strided convolution. This new design has the great advantage of downsampling feature maps while retaining discriminative feature information. It also represents a general and unified approach that can easily be applied to any CNN architecture, in the same way as strided convolution and pooling. We provide two of the most representative use cases, object detection and image classification, and demonstrate through extensive evaluation that SPD-Conv brings significant improvements in detection and classification accuracy. We expect it to be easily integrated into existing deep learning frameworks such as PyTorch and TensorFlow, broadly benefiting the research community.


Origin blog.csdn.net/m0_47867638/article/details/132548822