YOLO-Z: Improved YOLOv5 for small object detection

Table of contents

1. Introduction

2. Background

3. New ideas

4. Experimental analysis


Paper: 2112.11798.pdf (arxiv.org)

1. Introduction

As self-driving cars and self-driving racing become more popular, the need for faster and more accurate detectors increases.

While the human eye can extract contextual information almost instantly, even from great distances, limits on image resolution and computational resources make it hard for machines to detect smaller objects (i.e., objects occupying small pixel areas in the input image). Small object detection remains a genuinely difficult task and a vast field of research.

This study explores how the popular YOLOv5 object detector can be modified to improve its performance on smaller objects, specifically for applications in autonomous racing. To achieve this, we studied how replacing certain structural elements of the model (as well as their connections and other parameters) affects performance and inference time. Based on the results, we propose a series of models at different scales, named "YOLO-Z". When detecting small objects at 50% IoU, the mAP of these models increases by up to 6.9% over the original YOLOv5, at the cost of a 3 ms increase in inference time.
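To make the "50% IoU" criterion concrete, here is a minimal Python sketch of how a predicted box could be matched against a ground-truth box. The function names and the (x1, y1, x2, y2) box format are assumptions chosen for illustration; this is not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction counts as correct when its IoU with the ground truth reaches the threshold."""
    return iou(pred, gt) >= threshold


# The same 4-pixel horizontal offset on a small and a large object
small_gt, small_pred = (0, 0, 10, 10), (4, 0, 14, 10)
large_gt, large_pred = (0, 0, 100, 100), (4, 0, 104, 100)
print(round(iou(small_gt, small_pred), 3))  # 0.429, below the 0.5 threshold
print(round(iou(large_gt, large_pred), 3))  # 0.923, comfortably above it
```

The example shows one reason small objects are hard: a localization error of a few pixels that is negligible for a large box is enough to push a 10x10 box below the 50% IoU threshold.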

Our goal is to provide future research with information about the potential of adapting popular detectors such as YOLOv5 to solve specific tasks, and to provide insights into how specific changes affect small object detection. Application of these findings to the broader autonomous vehicle environment could increase the amount of environmental information available to such systems.

2. Background

Detecting small objects in images is challenging, mainly due to the limited resolution and contextual information available to the model. Many systems that implement object detection do so at real-time speed, placing specific demands on computing resources, especially if the processing is to take place on the same device that captured the image. This is the case with many autonomous vehicle systems, where the vehicle itself captures and processes images in real time, often to inform its next actions. In this case, detecting smaller objects means detecting objects further away from the car, allowing them to be detected earlier, effectively extending the vehicle's detection range. Improvements in this specific area will better inform the system, allowing it to make more robust and actionable decisions. Due to the nature of object detectors, the details of smaller objects lose meaning as they are processed by each layer of their convolutional backbone. In this study, “small objects” refer to objects occupying small pixel areas in the input image.

Many researchers are currently working to improve the detection of smaller objects [such as An Evaluation of Deep Learning Methods for Small Object Detection], but much of this work either directs processing to specific areas of the image or focuses on two-stage detectors. These detectors achieve better performance at the cost of inference time, making them less suitable for real-time applications, which is why so many single-stage detectors have been developed for this type of application. Increasing the input image resolution is another obvious way around the problem, but it results in a significant increase in processing time.

3. New ideas

Some effort has been invested in developing systems that direct processing to certain areas of the input image, which allows the resolution to be adjusted and thereby bypasses the limitation of having fewer pixels to define an object. However, this approach is better suited to systems that are not time-sensitive, because it requires multiple passes through the network at different sizes. The idea of paying more attention to specific scales can still inspire the way we process certain feature maps. Additionally, a lot can be learned from how feature maps are processed, rather than only from modifying the backbone. Different types of feature pyramid networks (FPNs) aggregate feature maps differently and enhance the backbone in different ways, and this technique has proven to be quite effective.

YOLOv5 framework

YOLOv5 provides four different scales for its model, S, M, L and X, which represent Small, Medium, Large and Xlarge respectively. Each of these scales applies a different multiplier to the depth and width of the model, meaning that the overall structure of the model remains the same, but the size and complexity of each model is scaled.
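As a rough illustration of how a single structure can yield four model sizes, the sketch below applies a depth multiplier to the number of times a block is repeated and a width multiplier to its channel count. The multiplier values are assumed to match the public YOLOv5 configuration files, and the rounding rules (repeats rounded with a minimum of 1, channels rounded up to a multiple of 8) are a simplified reading of how the YOLOv5 parser behaves; the helper names are hypothetical.

```python
import math

# (depth_multiple, width_multiple) per scale; assumed to match the public YOLOv5 configs
SCALES = {
    "s": (0.33, 0.50),
    "m": (0.67, 0.75),
    "l": (1.00, 1.00),
    "x": (1.33, 1.25),
}


def scale_depth(repeats, depth_multiple):
    """Scale how many times a block is repeated; single blocks are left alone."""
    return max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats


def scale_width(channels, width_multiple, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


# One nominal block (9 repeats, 1024 output channels) at each scale
for name, (depth, width) in SCALES.items():
    print(name, scale_depth(9, depth), scale_width(1024, width))
```

With these assumptions, the nominal block becomes 3 repeats with 512 channels at scale S and 12 repeats with 1280 channels at scale X, while the layer ordering and connections stay identical.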

In the experiments, we change the model structure at each scale separately and treat each resulting model as a distinct model when evaluating the effect. To set the baseline, we trained and tested the four unmodified versions of YOLOv5. Each change to these networks was then tested in isolation to observe its impact against the baseline results. Techniques and structures that did not appear to improve accuracy or inference time were filtered out before moving to the next stage, in which combinations of the selected techniques were tried. This process was repeated, observing whether certain techniques complement or undermine each other and gradually adding more complex combinations.

Proposed architectural changes

YOLOv5 uses YAML files to instruct the parser how to build the model. We use this setup to write our own high-level instructions on how to build the different building blocks of the model and with which parameters, thus modifying its structure. To implement a new structure, we arrange and provide parameters for each building block or layer and, if necessary, instruct the parser how to build it. In other words, we took advantage of the base and experimental network blocks provided by YOLOv5, while implementing additional blocks where needed to reproduce the required structure.
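To give a feel for what such high-level instructions look like, the sketch below mirrors the row layout used in YOLOv5 model files, where each layer is written as [from, number, module, args]. It is expressed as a Python list so that it runs without a YAML library. The rows themselves are illustrative and are not copied from the YOLO-Z configurations, and `describe` is a hypothetical helper that only prints the wiring rather than building real layers.

```python
# Each row: [from, number, module, args]
#   from   - index of the input layer(s); -1 means "the previous layer"
#   number - how many times the block is repeated (before depth scaling)
#   module - name of the building block the parser should construct
#   args   - arguments passed to that block (e.g. output channels)
# Illustrative neck fragment, NOT the actual YOLO-Z configuration.
NECK_ROWS = [
    [-1, 1, "Conv", [512, 1, 1]],
    [-1, 1, "Upsample", [None, 2, "nearest"]],
    [[-1, 6], 1, "Concat", [1]],  # merge with backbone layer 6
    [-1, 3, "C3", [512, False]],
]


def describe(rows, start_index=0):
    """Return one human-readable line per layer, showing its inputs."""
    lines = []
    for i, (src, number, module, args) in enumerate(rows, start=start_index):
        sources = src if isinstance(src, list) else [src]
        lines.append(f"layer {i}: {number} x {module}{tuple(args)} <- inputs {sources}")
    return lines


for line in describe(NECK_ROWS, start_index=10):
    print(line)
```

Changing the neck then amounts to editing rows like these: adding or removing blocks, or pointing the `from` field at different layers to rewire the connections.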

One of these changes concerns the neck:

In this work, the current PANet [Path aggregation network for instance segmentation] is either simplified to an FPN or replaced by a BiFPN [EfficientDet: Scalable and Efficient Object Detection]. In both cases, the neck retains similar functionality but differs in complexity, and therefore in the number of layers and connections required to implement it.
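One way a BiFPN differs from a plain FPN is that it weights the feature maps it merges instead of simply adding or concatenating them. The sketch below illustrates the "fast normalized fusion" described in the EfficientDet paper, using flat Python lists in place of tensors. Whether YOLO-Z uses exactly this fusion rule is not stated here, so treat it as an illustration of the BiFPN idea rather than the paper's implementation; the function name is hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature maps as sum(w_i * f_i) / (eps + sum(w_i)).

    features: list of feature maps, each a flat list of floats of equal length
    weights:  one learnable scalar per feature map
    """
    # ReLU keeps every weight non-negative so the result is a bounded blend
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps
    length = len(features[0])
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]


# Two feature maps (e.g. a top-down and a lateral input) fused with equal weights
top_down = [1.0, 1.0]
lateral = [3.0, 3.0]
print(fast_normalized_fusion([top_down, lateral], [1.0, 1.0]))
```

With equal weights the output is close to the plain average of the inputs; during training the weights would let the network favor whichever scale carries more useful information.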

Other modifications can be found in the paper.

4. Experimental analysis

