Reading Classic Papers: EdgeYOLO (YOLO object detection for edge devices)

0. Introduction

The YOLO family has iterated many times from v1 to v8, yet there is still demand for higher-accuracy, faster algorithms on lower-compute hardware. The recent paper "EdgeYOLO: An Edge-Real-Time Object Detector" proposes an efficient, low-complexity, anchor-free object detector based on the state-of-the-art YOLO framework that can run in real time on edge computing platforms. The authors also develop an enhanced data augmentation method to effectively suppress overfitting during training, and design a hybrid random loss function to improve detection accuracy for small objects. The reported results show that the baseline model reaches 50.6% AP50:95 and 69.8% AP50 on the MS COCO2017 dataset, and 26.4% AP50:95 and 44.8% AP50 on the VisDrone2019-DET dataset, while meeting the real-time requirement (FPS ≥ 30) on the Nvidia Jetson AGX Xavier edge device. The model is also available on GitHub.

1. Main contributions

The contributions of this article are summarized as follows:

  1. An anchor-free object detector is designed that runs in real time on edge devices and achieves 50.6% AP on the MS COCO2017 dataset.
  2. A more powerful data augmentation method is proposed to further ensure the quantity and validity of training data.
  3. Re-parameterizable structures are used in the models to reduce inference time.
  4. A loss function is devised to improve the detection accuracy for small objects.

2. Enhanced-Mosaic & Mixup

Many real-time object detectors use the Mosaic + Mixup strategy for data augmentation during training, which can effectively alleviate overfitting. As shown in Figure 3(a) and (b), there are two common combination methods that perform well when a single image in the dataset carries a relatively sufficient number of labels. However, due to the stochastic cropping in data augmentation, the data loader may produce images without any valid objects while responses still remain in the label space, as shown in Figure 3(a), and the probability of this situation rises as the average number of labels per original image decreases.

We designed the data augmentation structure in Figure 3(c). First, Mosaic is applied to several groups of images, where the number of groups can be set according to the average number of labels per image in the dataset. Then, a last, simply processed image is mixed with these Mosaic-processed images via Mixup. During these steps, the original bounds of the last image remain within the bounds of the final output image. This data augmentation method effectively increases image richness to alleviate overfitting, while guaranteeing that the output image contains enough valid information.


Figure 3: Different solutions for data augmentation (FA: full augmentation, SA: simple augmentation without large-scale transformations). As shown in (a) [23] and (b) [22], data augmentation using a fixed number of images is not suitable for all types of datasets. By using the method shown in Figure (c), we can provide a flexible solution to the overfitting problem.
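The Figure 3(c) pipeline can be sketched in plain Python. This is a toy illustration on tiny grayscale "images" stored as nested lists; `mosaic4`, `mixup`, and `enhanced_mosaic_mixup` are hypothetical names, label/bounding-box handling and geometric transforms are omitted, and blending multiple mosaic groups in sequence is an assumption of this sketch:

```python
def mosaic4(tiles):
    """Stitch four equal-size H x W tiles into one 2H x 2W mosaic image."""
    h = len(tiles[0])
    top = [tiles[0][r] + tiles[1][r] for r in range(h)]
    bottom = [tiles[2][r] + tiles[3][r] for r in range(h)]
    return top + bottom

def mixup(img_a, img_b, alpha=0.5):
    """Pixel-wise blend of two same-size images: alpha * a + (1 - alpha) * b."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

def enhanced_mosaic_mixup(groups, simple_img, alpha=0.5):
    """Sketch of Figure 3(c): mosaic each group of four images, blend the
    mosaics, then Mixup with a simply-augmented image whose labels are
    guaranteed to stay within the output bounds."""
    out = mosaic4(groups[0])
    for group in groups[1:]:
        out = mixup(out, mosaic4(group), 0.5)
    return mixup(out, simple_img, alpha)
```

Because the simply-augmented image always contributes valid, in-bounds labels, the output is guaranteed to contain usable supervision even when the mosaic half happens to crop away all objects.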

3. Lightweight decoupled head (more important)

The decoupled head in Figure 4 was first proposed in FCOS [15] and then used in other anchor-free object detectors such as YOLOX [23]. It has been confirmed that using a decoupled structure for the last few network layers can speed up network convergence and improve regression performance.

Since the decoupled head adopts a branch structure, it incurs additional inference cost. The Efficient Decoupled Head [20] was therefore proposed for faster inference: it reduces the intermediate 3×3 convolutional layers to a single layer while keeping the same large channel count as the input feature map. However, in our experimental tests, this additional inference cost becomes more apparent as the number of channels and the input size increase. Therefore, we designed an even lighter decoupled head with fewer channels and convolutional layers. Furthermore, we add implicit representation layers [24] to all the last convolutional layers to obtain better regression performance. Through a reparameterization method, the implicit representation layers are merged into the convolutional layers to reduce inference cost. The final convolutional layers for box and confidence regression are also merged so that the model can run inference with a high degree of parallelism.


Figure 4: As shown in the figure, we designed a more lightweight yet more efficient decoupled head. Through reparameterization techniques, our model achieves faster inference speed with minimal accuracy loss.
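The reparameterization idea described above can be illustrated with a toy sketch in plain Python. Here a 1×1 convolution at a single spatial location is just a matrix-vector product, and the implicit layers are modeled (as an assumption, following YOLOR-style implicit knowledge) as a per-channel additive term applied after the convolution, followed by a per-channel multiplicative term; `conv1x1` and `fold_implicit` are hypothetical names, not EdgeYOLO's actual API:

```python
def conv1x1(x, W, b):
    """1x1 convolution at one spatial location: x is the input channel
    vector, W is [out_ch][in_ch], b is the per-channel bias."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def fold_implicit(W, b, add, mul):
    """Fold implicit layers modeled as y = mul * (conv(x) + add) into the
    conv itself:  W' = diag(mul) @ W,  b' = mul * (b + add).
    After folding, inference runs a single fused convolution."""
    W_fused = [[m * w for w in row] for row, m in zip(W, mul)]
    b_fused = [m * (bi + ai) for bi, ai, m in zip(b, add, mul)]
    return W_fused, b_fused

# Sanity check: the fused layer matches the three-step computation.
W = [[1.0, 2.0], [3.0, -1.0]]
b = [0.5, -0.5]
add, mul = [0.1, 0.2], [2.0, 0.5]
x = [1.0, 1.0]
slow = [m * (y + a) for y, a, m in zip(conv1x1(x, W, b), add, mul)]
W_f, b_f = fold_implicit(W, b, add, mul)
fast = conv1x1(x, W_f, b_f)
```

Because the folding is exact (no approximation), the fused model produces bit-identical outputs up to floating-point rounding, which is why this kind of reparameterization reduces inference cost "for free".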

4. Staged loss function (more important)

For object detection, the loss function can usually be written in the following general form:

L = λ_cls · L_cls + λ_obj · L_obj + λ_box · L_box

i.e., a weighted sum of the classification, confidence (objectness), and box regression terms.
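A generic weighted-sum detection loss of this kind can be sketched as follows. This is only an illustration: the weights and the 1 − IoU box term are assumptions for the sketch, not EdgeYOLO's actual hybrid random loss, whose details are in the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def detection_loss(l_cls, l_obj, pred_box, gt_box,
                   w_cls=1.0, w_obj=1.0, w_box=5.0):
    """Generic weighted-sum detection loss: classification + objectness
    (confidence) + box regression, with 1 - IoU as the box term.
    The weights here are illustrative placeholders."""
    l_box = 1.0 - iou(pred_box, gt_box)
    return w_cls * l_cls + w_obj * l_obj + w_box * l_box
```

In practice each term is averaged over the assigned predictions in a batch; small objects are especially sensitive to the box term, which motivates the paper's dedicated loss design.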

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/130226820