YOLO-ReT: a lightweight, edge-GPU-friendly improvement of YOLO

In this paper, the authors propose a new edge-GPU-friendly module for multi-scale feature interaction, motivated by the combinational connections between feature scales that existing state-of-the-art methods leave out. In addition, they propose a new transfer-learning-based backbone truncation, inspired by how the information worth transferring changes across tasks, which complements the feature interaction module and improves both accuracy and inference speed on various edge GPU devices.

Table of contents

1 Introduction

Main contributions of this article

2 Research background

2.1 Single-stage object detection

2.2 Building Blocks

2.3 Multi-Scale Feature Fusion

3 Methods of this article

3.1 Raw feature collection and redistribution (RFCR)

3.2 Backbone Truncation

4 Experiments

4.1 Ablation experiments

1. Truncated feature extraction backbone

2. Raw feature collection and redistribution

3. SOTA comparison

5 References


1 Introduction

Object detection models have advanced rapidly along two main axes: accuracy and efficiency. However, deploying a deep neural network (DNN) based detector on edge devices usually requires compressing the model substantially, which also reduces its accuracy.

To address the combinational connections between feature scales that existing state-of-the-art methods miss, the authors propose a new edge-GPU-friendly multi-scale feature interaction module. They further propose a transfer-learning-based backbone truncation, motivated by how the information worth transferring differs across tasks, which complements the feature interaction module and improves both accuracy and inference speed on various edge GPU devices.

For example, YOLO-ReT with a MobileNetV2-0.75 backbone runs in real time on a Jetson Nano and achieves 68.75 mAP at 33.19 FPS on Pascal VOC (compared with 68.67 mAP at 28.16 FPS for the corresponding MobileNetV2 baseline) and 34.91 mAP at 33.19 FPS on COCO.

In addition, introducing the paper's multi-scale feature interaction module into YOLOv4-tiny and YOLOv4-tiny-3l improves their COCO performance to 41.5 and 48.1 mAP respectively, 1.3 mAP and 0.9 mAP higher than the original versions.

Main contributions of this article

  1. An RFCR module is proposed that effectively combines multi-scale features and is compatible with a variety of backbones and detection heads. In addition, the feature collection in the RFCR module is independent of the number of output scales of the detection head, which allows richer feature interaction;
  2. Extensive experimental analysis of the importance of individual transfer-learning layers, and a truncation method that improves model efficiency. Truncation and the RFCR module complement each other, allowing faster and more accurate detection models;
  3. In-depth ablation studies with latency measured directly on the target edge GPU devices rather than through indirect metrics such as MFLOPs or model size, providing accurate comparisons of the competing designs.

2 Research background

2.1 Single-stage object detection

A single-stage object detection model consists of two parts:

  • Feature extractor pretrained on ImageNet
  • Object detection head responsible for final output

Although CNNs are the first choice for feature extraction, some works explore other forms of feature extractors, such as extreme learning machines (ELMs) and motion probability maps. Single-stage object detection models can be further divided into anchor-based and anchor-free models according to the detection head they use. Heatmap-based detectors, such as CornerNet and CenterNet, are common examples of anchor-free models. However, these models require computationally heavy backbones because they rely on maintaining the integrity of high-resolution information from the input image. Anchor-based detectors, on the other hand, are a lighter option. For example, the YOLOv3 detection head is one of the most commonly used detection heads on edge devices and can be easily integrated with lightweight backbones.
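
As a rough illustration of this two-part structure, the following Keras sketch wires an ImageNet-pretrained backbone to a YOLO-style convolutional head; the anchor count, class count, input size and head width are made-up values for illustration, not taken from the paper.

import tensorflow as tf

# A single-stage detector = pretrained feature extractor + detection head.
num_anchors, num_classes = 3, 20  # hypothetical values

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(416, 416, 3), include_top=False, weights='imagenet')

features = backbone.output  # coarse feature map (13x13 for a 416x416 input)
head = tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu')(features)
head = tf.keras.layers.Conv2D(num_anchors * (num_classes + 5), 1)(head)  # boxes + objectness + classes

detector = tf.keras.Model(backbone.input, head)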

2.2 Building Blocks

A large amount of research on real-time object detection has been devoted to improving the basic building blocks of CNNs. Traditional convolutional layers contain a large number of parameters and computations, which forces most real-time detection models to be relatively shallow networks. Decoupling a standard 2D convolution into a depthwise convolution followed by a pointwise (1×1) convolution is a common technique for making networks lighter. Using a 1×1 convolution to reduce the number of channels before applying the main convolution gave rise to the fire module, which has been applied in various lightweight detection models.
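
The parameter saving from this decomposition is easy to see in code. The sketch below, with arbitrary channel sizes, contrasts a standard 3×3 convolution with its depthwise-plus-pointwise equivalent; it is illustrative only and not taken from the paper.

import tensorflow as tf

x = tf.keras.Input(shape=(56, 56, 64))

# Standard 3x3 convolution: 64 * 128 * 3 * 3 = 73,728 weights.
standard = tf.keras.layers.Conv2D(128, 3, padding='same', use_bias=False)(x)

# Depthwise 3x3 (64 * 3 * 3 = 576 weights) followed by pointwise 1x1
# (64 * 128 = 8,192 weights): roughly an 8x reduction for the same output shape.
dw = tf.keras.layers.DepthwiseConv2D(3, padding='same', use_bias=False)(x)
separable = tf.keras.layers.Conv2D(128, 1, padding='same', use_bias=False)(dw)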

However, using multiple consecutive pointwise convolutions to reduce the computational cost of the information flow runs into a basic pitfall of fast deep learning model design: network fragmentation. Network fragmentation occurs when a heavier operation is split into many lightweight operations; it can severely affect execution speed because it reduces the degree of parallelism within the model. For example, MobileDet found that grouped pointwise convolutions perform poorly on GPU devices, while ShuffleNetV2 found that pointwise convolutions are fastest when the numbers of input and output channels are equal.
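
The fragmentation effect can be pictured with a toy example: the two branches below have the same multiply-add budget, but the second splits one pointwise convolution into four parallel kernels, which tends to lower GPU utilisation. This is only an illustration of the concept, not code from the paper.

import tensorflow as tf

x = tf.keras.Input(shape=(32, 32, 128))

# One fused pointwise convolution: a single large, GPU-friendly kernel launch.
fused = tf.keras.layers.Conv2D(128, kernel_size=1, use_bias=False)(x)

# The same channel budget fragmented into four parallel branches: similar FLOPs,
# but more small operations and less parallelism per operation.
branches = [tf.keras.layers.Conv2D(32, kernel_size=1, use_bias=False)(x) for _ in range(4)]
fragmented = tf.keras.layers.Concatenate()(branches)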

The final feature extraction backbone is formed by combining one or more of the building blocks above. Many studies even use Neural Architecture Search (NAS) to build their own backbones and detection models. However, such models forgo the transfer-learning information present in existing pre-trained backbones. On the other hand, a backbone pre-trained on a classification dataset may contain classification-specific features, which add unnecessary computation. Therefore, effectively adapting a pre-trained backbone from classification to object detection also plays an important role in the final performance of the model.

2.3 Multi-Scale Feature Fusion

Whether in a single-stage or a two-stage object detection model, multi-scale feature interaction is an important part of the detection head. Existing feature interaction methods use combinations of top-down and bottom-up paths to handle the information flow across feature scales. The Feature Pyramid Network (FPN) was the first to create a top-down path from high-level to low-level feature scales, with the goal of using well-processed deep features to improve the accuracy of the detection layers that use shallower features. PANet goes a step further and shows that an additional bottom-up path further improves the detection accuracy of high-level features.
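
For reference, a minimal FPN-style top-down merge of two adjacent scales looks roughly like the following; the channel and spatial sizes are arbitrary and the snippet is only meant to make the top-down path concrete.

import tensorflow as tf

c4 = tf.keras.Input(shape=(20, 20, 256))   # deeper, lower-resolution backbone feature
c3 = tf.keras.Input(shape=(40, 40, 128))   # shallower, higher-resolution backbone feature

# Project both scales to a common channel width.
p4 = tf.keras.layers.Conv2D(64, 1, padding='same')(c4)
p3 = tf.keras.layers.Conv2D(64, 1, padding='same')(c3)

# Top-down path: upsample the deep feature and fuse it into the shallow one.
p3 = tf.keras.layers.Add()([p3, tf.keras.layers.UpSampling2D()(p4)])
p3 = tf.keras.layers.Conv2D(64, 3, padding='same')(p3)  # smooth the fused feature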

Building on the success of FPN and PANet, NAS-FPN attempts to search for the optimal information-flow paths between the multi-scale features. Since such architecture-search-based models are designed for specific datasets and backbones, it is difficult to generalize them to wider applications. However, these searches revealed interesting trends that help in understanding the inherent needs of such models. The NAS-FPN design contains direct connections between non-adjacent feature scales, indicating that routing information only through adjacent scales can become convoluted, hence the need for such shortcuts. Similarly, NAS-FPN also revealed the importance of iteratively repeating the top-down and bottom-up paths, which BiFPN later adopted to further improve accuracy.

Beyond the choice of paths for combining multi-scale features, a lot of work has also gone into how the features are combined. While most existing works simply concatenate feature maps of different scales, weighted or attention-based feature fusion has also been proposed to better highlight the more important scales. Another aspect of fusing features is bringing them to a common scale. Simpler solutions upsample or downsample one feature scale to match another. However, this can lead to local position mismatches between scales, so various methods have been explored to process features before and after fusion to promote better information flow between scales.

3 Methods of this article

3.1 Raw feature collection and redistribution (RFCR)

The authors aim to use an improved feature interaction network to enhance the backbone's feature extraction and thereby improve detection accuracy without any significant impact on inference speed. Although the focus here is detection, the RFCR module generalizes to providing interacted features for similar tasks.

Existing multi-scale feature interaction methods can be decomposed into combinations of top-down and bottom-up paths, each of which looks at only two adjacent feature scales at a time. This misses a large number of possible combination pairs and makes the propagation of information between distant feature scales inefficient. Furthermore, detection accuracy begins to saturate when the top-down and bottom-up paths are repeated (e.g., going from BiFPN×2 to BiFPN×3).

Inspired by the non-adjacent feature-scale connections in NAS-FPN, the paper proposes a lightweight feature collection and redistribution module that first fuses the raw multi-scale features from the backbone and then redistributes the fused features back to every feature scale. As a result, the feature map at each scale contains features from all other scales. This layer involves no heavy computation or parameters, yet creates a direct connection between every pair of feature scales, as shown in Figure 1.

It should be noted that the RFCR module is not a replacement for the fine-grained processing provided by other feature aggregation methods. Instead, the goal is an extremely lightweight feature processing step that delivers an accuracy improvement before the features are passed on to other multi-scale feature fusion methods.

Furthermore, the RFCR design allows the number of collected features to be independent of the number of detection-head output scales, since there is no constraint tying the module's inputs to its outputs. For example, although the YOLOv3 detection head has 3 output scales, 4 different backbone features can be used in the feature collection stage (3 features at the output scales plus a 4th, shallower feature), so that finer-grained low-level features improve model performance. Similarly, even for a detection head with only 2 output scales, such as YOLOv4-tiny, the detection features can be enriched with multiple low-level features through the RFCR module.

As discussed in Section 2.3, the way features are fused is as important as the aggregation path. To keep the additional latency overhead to a minimum, the raw features are passed through a single 1×1 convolution during collection and fused with a simple weighted sum. The fused feature map is then passed through a MobileNet convolution block (MBConv) before being redistributed back to the different scales.

This design keeps network fragmentation to a minimum, since the RFCR module uses only 4 kinds of layers:

  • 1x1 convolution
  • Weighted sum
  • MBConv
  • Upsampling and downsampling layers.

The parallel collection and redistribution of features is also easy to optimize, which further increases execution speed.

When fusing features of different scales, plain upsampling and downsampling can lead to semantic inconsistency and local position mismatch. The authors therefore use a 5×5 convolution kernel to increase the receptive field of the feature fusion layer, instead of the traditional 3×3 or 1×1 kernel, which improves detection accuracy while having a negligible impact on inference latency. They also found that increasing the kernel size to 7×7 did not improve performance further.

import tensorflow as tf

def RFCR_Module(inp_arr):
    # inp_arr holds 4 raw feature maps collected from different depths of the backbone.
    b1c, b2c, b3c, b4c = inp_arr[0], inp_arr[1], inp_arr[2], inp_arr[3]

    # Collection: project every scale to the same channel width with 1x1 convolutions.
    b1c = tf.keras.layers.Conv2D(48, kernel_size=1, padding='same', use_bias=False)(b1c)
    b2c = tf.keras.layers.Conv2D(48, kernel_size=1, padding='same', use_bias=False)(b2c)
    b3c = tf.keras.layers.Conv2D(48, kernel_size=1, padding='same', use_bias=False)(b3c)
    b4c = tf.keras.layers.Conv2D(48, kernel_size=1, padding='same', use_bias=False)(b4c)

    # Fusion: bring the features to a common resolution and combine them with a weighted sum.
    bc = WeightedSum()([tf.keras.layers.UpSampling2D()(b1c), b2c, downsample_layer(b3c), b4c])

    # A single MobileNet separable convolution block with a 5x5 kernel enlarges the
    # receptive field of the fusion layer (see the discussion above).
    bc = MobilenetSeparableConv2D(96,
                                  kernel_size=(5, 5),
                                  use_bias=False,
                                  padding='same')(bc)

    # Redistribution: concatenate the fused features back onto each output scale.
    b1 = tf.keras.layers.Concatenate()([inp_arr[0], downsample_layer(bc)])
    b2 = tf.keras.layers.Concatenate()([inp_arr[1], bc])
    b3 = tf.keras.layers.Concatenate()([inp_arr[2], tf.keras.layers.UpSampling2D()(bc)])

    return b1, b2, b3
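
The snippet above uses three helpers that are not defined in the post: WeightedSum, downsample_layer and MobilenetSeparableConv2D. A minimal sketch of what they could look like follows; the exact implementations in the authors' code may differ (the BiFPN-style weight normalization, the pooling-based downsampling and the BN/ReLU6 placement are assumptions).

import tensorflow as tf

class WeightedSum(tf.keras.layers.Layer):
    # Learnable weighted sum of equally-shaped feature maps (BiFPN-style fast fusion).
    def build(self, input_shape):
        self.w = self.add_weight(name='w', shape=(len(input_shape),),
                                 initializer='ones', trainable=True)

    def call(self, inputs):
        w = tf.nn.relu(self.w)
        w = w / (tf.reduce_sum(w) + 1e-4)
        return tf.add_n([w[i] * t for i, t in enumerate(inputs)])

def downsample_layer(x):
    # Halve the spatial resolution; parameter-free pooling keeps latency low.
    return tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='same')(x)

def MobilenetSeparableConv2D(filters, kernel_size, use_bias, padding):
    # Depthwise + pointwise block in the spirit of MobileNet.
    def block(x):
        x = tf.keras.layers.DepthwiseConv2D(kernel_size, padding=padding, use_bias=use_bias)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU(max_value=6.0)(x)
        x = tf.keras.layers.Conv2D(filters, 1, padding=padding, use_bias=use_bias)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU(max_value=6.0)(x)
        return x
    return block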

3.2 Backbone Truncation

Most state-of-the-art lightweight image classification models try to keep the number of channels low, gradually increasing it every few convolutional blocks. Towards the end of the network, however, even these models rapidly expand the number of channels after every block, in an attempt to represent features more explicitly before the final fully connected layer.

The importance of transfer learning from classification models has been questioned before, and some papers have even designed specialized detection backbones. The intuition is that the tasks differ: classification models, for example, do not need to retain spatial information and may collapse it into spatially coarse features.

Detection models, on the other hand, try to maintain the integrity of spatial information, which is required for fine-grained detection outputs. The authors found that the transfer-learning ability of the backbone's initial layers is very important, while its last layers do not provide key information for detection.

To test the importance of individual backbone convolutional layers, the authors analyzed the transfer-learning contribution of each backbone block in detail, combined with a PANet feature aggregation path and a YOLOv3 detection head.

The experiments use three common backbones, namely MobileNetV2-0.75, MobileNetV2-1.4 and EfficientNet-B3, each divided into blocks, in this case the MBConv blocks of MobileNetV2 and the MBConvSE blocks of EfficientNet. The number of blocks initialized with ImageNet-pretrained weights is then increased gradually from shallow to deep, while the remaining blocks are randomly initialized like the detection head, and each model is trained to convergence. The collected results are shown in Figure 2.
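
A hedged sketch of how such a partially pretrained backbone can be built in Keras: instantiate the backbone twice, with and without ImageNet weights, and copy the pretrained weights only for the stem and the first k MBConv blocks. The block-boundary logic relies on the layer naming of Keras' MobileNetV2 implementation and is an assumption, not the authors' code; verify the names against model.summary().

import tensorflow as tf

def partially_pretrained_mobilenetv2(num_pretrained_blocks, alpha=0.75,
                                     input_shape=(224, 224, 3)):
    # Backbone whose stem and first `num_pretrained_blocks` MBConv blocks use
    # ImageNet weights, while all deeper layers keep their random initialization.
    pretrained = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, alpha=alpha, include_top=False, weights='imagenet')
    model = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, alpha=alpha, include_top=False, weights=None)

    for src, dst in zip(pretrained.layers, model.layers):
        name = src.name
        if name.startswith('block_'):
            # Keras names MBConv sub-layers 'block_<i>_...'
            use_pretrained = int(name.split('_')[1]) < num_pretrained_blocks
        elif name.startswith(('Conv1', 'bn_Conv1', 'expanded_conv')):
            use_pretrained = True   # stem and the first (un-numbered) block
        else:
            use_pretrained = False  # final channel-expanding layers stay random
        if use_pretrained and src.get_weights():
            dst.set_weights(src.get_weights())
    return model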

As the figure shows, increasing the fraction of the feature extraction backbone initialized with pretrained weights improves model performance, which underlines the importance of transfer learning. However, at around 60%, performance begins to deteriorate and fluctuate. This suggests that initializing the last layers of the feature extractor with ImageNet transfer-learning weights actually hurts performance compared with random initialization, possibly because the task-specific nature of these layers causes them to get stuck in local minima.

Since these last layers do not benefit from transfer learning, they can be analyzed purely from an architectural perspective. As shown in Figure 2, the last 2 or 3 blocks contain more than 40% of the weights, due to the extreme scaling of channel counts that is irrelevant for object detection. The authors therefore propose to use truncated versions of the various backbones in the final detection model, using the results of Figure 2 to pick the truncation point: the last two blocks are truncated from the MobileNetV2 variants, and the last three blocks from EfficientNet.
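
Truncation itself is straightforward once a cut point is chosen: keep the backbone only up to an intermediate block output and discard everything after it, including the final channel-expanding layer. The sketch below assumes Keras' MobileNetV2 layer naming, and the cut-point name is purely illustrative rather than the authors' exact choice.

import tensorflow as tf

full = tf.keras.applications.MobileNetV2(
    input_shape=(416, 416, 3), alpha=0.75, include_top=False, weights='imagenet')

# Cut the backbone at an intermediate block output ('block_14_add' is only an
# example name; choose the boundary that drops the last two MBConv blocks and
# the final 1x1 expansion layer).
truncated = tf.keras.Model(full.input, full.get_layer('block_14_add').output)

# The truncated backbone sheds a large share of the parameters that mainly
# served the classification task.
print(full.count_params(), truncated.count_params())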

4 Experiments

4.1 Ablation experiments

1. Truncated feature extraction backbone

The authors compared two ways of compressing the MobileNetV2 and EfficientNet backbones, namely reducing the scaling factor (the width multiplier for MobileNetV2) and truncating the last parameter-heavy layers; the results are in Table 2. Notably, the truncated version of EfficientNet outperforms the other versions in both accuracy and FPS, underlining the negative impact of classification-specific backbone features.

For MobileNetV2, when comparing models with similar frame rates, models with a truncated backbone perform better than models with a smaller scaling factor. For example, a truncated MobileNetV2×1.4 backbone and a full MobileNetV2×1.0 backbone provide similar FPS, while the former is better by 0.27 mAP. This is because reducing the width multiplier reduces the number of channels in all layers, whereas truncating the backbone only removes features from the last layers. The difference is more noticeable for lighter models on lower-power devices: a truncated MobileNetV2 backbone with width 0.75 delivers FPS similar to a full backbone with width 0.5 (34.02 vs. 35.18 on the Jetson Nano) but a 2.64-point improvement in mAP.

Clearly, a truncated backbone is the better choice for inference at the edge.

2. Raw feature collection and redistribution

Looking at Table 3, the effect of feature redistribution is much more pronounced when no other feature aggregation method is present. This can be attributed to the fact that, in the absence of any interaction between multi-scale features other than through the backbone itself, the redistribution provides the much-needed feature interaction. However, even with BiFPN×3, the method still achieves a significant improvement, which shows the importance of shortcut connections between non-adjacent scales.

Finally, all the methods discussed above are combined in a joint component ablation study, with results collected in Table 4. The starting points are a MobileNetV2 (0.75) backbone on the Jetson Nano, a MobileNetV2 (1.4) backbone on the Jetson Xavier NX, and an EfficientNet-B3 backbone on the Jetson AGX Xavier, each with a YOLOv3 detection head, a PANet feature aggregation path and lightweight detection layers. The RFCR module is then tested with and without backbone truncation. While the RFCR module performs well in both cases, the model with a full backbone suffers a larger FPS drop than the model with a truncated backbone. This is because the full backbone has heavier final layers, which also make the subsequent aggregation layers heavier.

As discussed in Section 3.1, the authors also introduce an extra shortcut into the RFCR module. This additional shortcut from shallower backbone layers further improves accuracy, emphasizing the importance of low-level features for detection and the benefit of the design's freedom to use more backbone input features than the number of output scales. In summary, combining backbone truncation with the RFCR module accelerates inference while improving accuracy.

3. SOTA comparison

5 References

[1] YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs
