Evolution of the YOLO Algorithm: YOLOCS Effectively Reduces the Spatial Complexity of Feature Maps (Paper Attached)


Institute of Computer Vision


Paper address: YOLOCS: Object Detection based on Dense Channel Compression for Feature Spatial Solidification (arxiv.org)

Computer Vision Research Institute column

By densely compressing feature channels to solidify spatial features, YOLOCS improves both the accuracy and speed of object detection. The main contribution of this paper is a new feature spatial solidification method that effectively reduces the spatial complexity of feature maps while improving detection efficiency and accuracy.


01

Overview

In today's sharing, the researchers examine the correlation between channel features and convolution kernels during feature purification and gradient backpropagation, focusing on both the forward and backward passes within the network. Based on this analysis, they propose a feature spatial solidification method called dense channel compression. Following the core concept of this method, two innovative modules are introduced for the backbone and head networks: the Dense Channel Compression for Feature Spatial Solidification Structure (DCFS) and the Asymmetric Multi-level Channel Compression Decoupled Head (ADH). When integrated into the YOLOv5 model, these two modules showed remarkable performance, resulting in an improved model called YOLOCS.


Evaluated on the MS-COCO dataset, the APs of the large, medium, and small YOLOCS models are 50.1%, 47.6%, and 42.5%, respectively. While maintaining an inference speed comparable to the corresponding YOLOv5 models, the large, medium, and small YOLOCS models surpass YOLOv5's AP by 1.1%, 2.3%, and 5.2%, respectively.

02

Background

In recent years, object detection has received extensive attention in the field of computer vision. Single-shot methods, such as the Single Shot MultiBox Detector (SSD), and detectors built on heavier convolutional neural network (CNN) pipelines are the two most commonly used families of techniques. However, because single-shot methods tend to have lower accuracy while CNN-based multi-stage detectors have high computational complexity, finding an efficient, high-accuracy object detection technique has become a current research hotspot.


Dense channel compression (DCC) is a compression technique for convolutional neural networks that spatially solidifies feature maps in order to compress network parameters and accelerate inference. However, the application of DCC to object detection has not been fully studied.


Therefore, an object detection technique based on dense channel compression is proposed, named YOLOCS (YOLO with Dense Channel Compression). YOLOCS combines DCC with the YOLO (You Only Look Once) algorithm to achieve efficient, high-accuracy detection. Specifically, YOLOCS uses DCC to spatially solidify feature maps for precise localization of objects, while exploiting YOLO's single-shot design for fast classification of object categories.

03

New Framework

  • Dense Channel Compression for Feature Spatial Solidification Structure (DCFS)

[Figure: comparison of structures; (c) is the proposed DCFS]

In the proposed method ((c) in the figure above), the researchers not only address the balance between network width and depth, but also compress features from layers of different depths with 3×3 convolutions, halving the number of channels before the features are output and fused. This approach allows feature outputs from different layers to be refined to a greater extent, enhancing feature diversity and effectiveness in the fusion stage.

In addition, the compressed features from each layer carry larger convolution kernel weights (3×3), which effectively expands the receptive field of the output features. We refer to this approach as dense channel compression with feature space solidification. The rationale behind dense channel compression for feature space solidification relies on utilizing larger convolution kernels to facilitate channel compression. This technique has two key advantages: First, it expands the receptive field of feature perception during forward propagation, thus ensuring that region-related feature details are incorporated to minimize feature loss throughout the compression stage. Second, the enhancement of error details during error backpropagation allows for more accurate weight adjustments.

To further illustrate these two advantages, convolutions with two different kernel types (1×1 and 3×3) are used to compress two channels, as shown in the following figure:

[Figure: compressing two channels with 1×1 vs. 3×3 convolution kernels]
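The contrast between the two kernel types can be made concrete with a small PyTorch sketch (channel counts and input size here are illustrative assumptions, not values from the paper): both kernels produce the same compressed output shape, but the 3×3 kernel carries nine times the weights per output channel and aggregates a spatial neighborhood, which is the receptive-field and error-detail argument made above.

```python
import torch
import torch.nn as nn

c_in, c_out = 64, 32                      # compress 64 channels down to 32
x = torch.randn(1, c_in, 80, 80)          # dummy feature map

compress_1x1 = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
compress_3x3 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)

# Both compressions produce the same output shape...
assert compress_1x1(x).shape == compress_3x3(x).shape == (1, c_out, 80, 80)

# ...but the 3x3 kernel has 9x the weights: each compressed channel is a
# function of a 3x3 spatial neighborhood rather than a single position.
n1 = sum(p.numel() for p in compress_1x1.parameters())  # 64*32   = 2048
n9 = sum(p.numel() for p in compress_3x3.parameters())  # 64*32*9 = 18432
print(n1, n9)
```

During backpropagation the same asymmetry holds: the 3×3 compression routes gradient signal through nine spatially distinct weights per channel pair, which is what the text means by "enhancement of error details".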

The network structure of DCFS is shown in the figure below. A three-layer bottleneck structure is adopted to gradually compress the channels during the forward pass. Half-channel 3×3 convolutions are applied on all branches, each followed by batch normalization (BN) and an activation layer. Subsequently, a 1×1 convolutional layer compresses the output feature channels to match the input feature channels.

[Figure: DCFS network structure]
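The structure described above can be sketched as a PyTorch module. This is a minimal interpretation of the text, not the official YOLOCS implementation: the SiLU activation, the exact branch wiring, and the channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k):
    """3x3 (or 1x1) convolution followed by BN and an activation layer."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class DCFSSketch(nn.Module):
    """Sketch of DCFS: a three-layer bottleneck of half-channel 3x3 convs
    whose branch outputs are concatenated, then fused back to the input
    channel count by a 1x1 convolution."""
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.stage1 = conv_bn_act(c, half, 3)     # dense channel compression
        self.stage2 = conv_bn_act(half, half, 3)
        self.stage3 = conv_bn_act(half, half, 3)
        # 1x1 conv restores the output channels to match the input.
        self.fuse = nn.Conv2d(3 * half, c, kernel_size=1, bias=False)

    def forward(self, x):
        y1 = self.stage1(x)
        y2 = self.stage2(y1)
        y3 = self.stage3(y2)
        return self.fuse(torch.cat([y1, y2, y3], dim=1))

block = DCFSSketch(64)
out = block(torch.randn(1, 64, 40, 40))
assert out.shape == (1, 64, 40, 40)   # channel count preserved end-to-end
```

The key property the sketch preserves is that every compression step uses a 3×3 kernel, so each halved-channel feature retains neighborhood context before fusion.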

  • Asymmetric Multi-level Channel Compression Decoupled Head (ADH)

To address the decoupled head problem in the YOLOX model, the researchers conducted a series of studies and experiments. The findings reveal a logical correlation between decoupled head structures and their associated loss functions: for different tasks, the structure of the decoupled head should be adjusted according to the complexity of the loss computation. In addition, when applying a decoupled head to various tasks, directly compressing the feature channels of the previous layer into the task channels (as shown in the figure below) may cause significant feature loss due to the difference in final output dimensions, which in turn can adversely affect overall model performance.

[Figure: direct channel compression into task channels in the decoupled head]

Furthermore, when considering the proposed dense channel compression method for feature spatial solidification, directly reducing the number of channels in the final layer to match the output channels may cause feature loss during forward propagation, thereby degrading network performance. In the context of backpropagation, this structure may also lead to suboptimal error propagation, hindering gradient stability. To address these challenges, a new decoupled head is introduced, called the Asymmetric Multi-level Channel Compression Decoupled Head ((b) in the figure below).

[Figure: decoupled head structures; (b) is the proposed ADH]

Specifically, the network path dedicated to the objectness scoring task is deepened, using 3×3 convolutions to expand the receptive field and parameter count for this task, while the features of each convolutional layer are compressed along the channel dimension. This not only effectively eases the training difficulty of the objectness scoring task and improves model performance, but also greatly reduces the parameters and GFLOPs of the decoupled head module, significantly improving inference speed. Furthermore, a single 1×1 convolutional layer is used to separate the classification and bounding-box tasks, because for matched positive samples the losses associated with these two tasks are relatively small, which avoids over-expanding the head.
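The asymmetry described above can be sketched as follows. This is a hedged illustration of the idea, not the exact YOLOCS head: the intermediate channel counts, the activation, and the compression schedule of the objectness branch are assumptions.

```python
import torch
import torch.nn as nn

class ADHSketch(nn.Module):
    """Sketch of the ADH idea: a deep objectness branch with multi-level
    channel compression via 3x3 convs, and shallow 1x1 branches for
    classification and box regression."""
    def __init__(self, c_in=256, num_classes=80):
        super().__init__()
        # Asymmetric deep path: progressively compress channels
        # (c -> c/2 -> c/4 -> 1) with 3x3 kernels.
        self.obj = nn.Sequential(
            nn.Conv2d(c_in, c_in // 2, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in // 2, c_in // 4, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in // 4, 1, 3, padding=1),
        )
        # Shallow paths: classification and box losses are easier to fit
        # for matched positives, so a single 1x1 conv suffices.
        self.cls = nn.Conv2d(c_in, num_classes, 1)
        self.box = nn.Conv2d(c_in, 4, 1)

    def forward(self, x):
        return self.obj(x), self.cls(x), self.box(x)

head = ADHSketch()
obj, cls, box = head(torch.randn(1, 256, 20, 20))
assert obj.shape[1] == 1 and cls.shape[1] == 80 and box.shape[1] == 4
```

The design choice mirrored here is that the branch with the hardest loss (objectness) gets depth and receptive field, while the cheaper branches stay shallow, which is where the parameter and GFLOP savings come from.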

04

Experiments and Visualization

Ablation Experiment on MS-COCO val2017


Comparison of YOLOCS, YOLOX, and YOLOv5-r6.1 [7] in terms of AP on MS-COCO 2017 test-dev


© THE END 

For reprinting, please contact this official account for authorization


ABOUT

Institute of Computer Vision

The Institute of Computer Vision works mainly in the field of deep learning, focusing on research directions such as object detection, object tracking, and image segmentation. The institute regularly shares the algorithms and frameworks of the latest papers, with an emphasis on both research and practice. Going forward, we will share hands-on walkthroughs for these fields, so that everyone can move beyond theory to real-world scenarios and cultivate the habit of enjoying programming and thinking things through.


Origin: blog.csdn.net/gzq0723/article/details/131160179