[Reading notes on object detection papers] Small-object detection based on YOLOv5 in autonomous driving systems

Abstract

        With the rapid development of autonomous driving, the need for faster and more accurate object detection frameworks has become pressing. Many recent deep learning-based object detectors have shown convincing performance on large objects in various real-time driving applications. However, detecting small objects such as traffic signs and traffic lights remains challenging due to the complexity of these objects. Moreover, the complexity of such road scenes further hinders accurate small-object detection because of foreground/background imbalance and view distortion caused by harsh weather and low-light conditions. In this letter, we investigate how existing object detectors can be tuned for specific tasks and how these modifications affect the detection of small objects. To this end, we explore and introduce architectural changes to the popular YOLOv5 model to improve its small-object detection performance without sacrificing large-object detection accuracy, especially in autonomous driving. We demonstrate that our modifications add little to no computational complexity while significantly improving detection accuracy and speed. Compared with the original YOLOv5, the proposed iS-YOLOv5 model improves mean average precision (mAP) by 3.35% on the BDD100K dataset. Moreover, our proposed model improves detection speed by 2.57 frames per second (FPS) over the YOLOv5 model.


1. Introduction

        Object detection is a fundamental and well-studied task in the computer vision community. Its purpose is to classify and localize target objects in an image. With recent advances in deep learning techniques, several state-of-the-art object detection methods have emerged [1]. Object detection is widely used in many real-world applications, including autonomous driving, robot vision, intelligent transportation, remote sensing, military operations, and surveillance [2-4].

        Object detectors generally perform well on large objects but poorly on small ones. When an object occupies a small pixel area or field of view in the input image, we refer to it as a small object. In general object detectors, the features of small objects fade as they pass through the many layers of the backbone. Accurate detection of small objects is essential yet challenging due to poor visual appearance, limited context information, noisy representation, indistinguishable features, complex backgrounds, limited resolution, severe occlusion, and so on [5,6]. Moreover, modern systems that aim at real-time object detection mainly focus on speed at the expense of computational resources, yet their poor detection accuracy limits practical feasibility. Improvements in this specific domain will therefore benefit the practical application of autonomous driving systems.

        Detecting objects of interest on the road is a fundamental task for autonomous driving. Most existing road object detectors detect small objects with less than half the accuracy they achieve on large objects. This is because small objects usually cover fewer pixels and their features are hard to extract at low resolution, so the model easily confuses them with the background, resulting in missed or false detections [5]. Furthermore, one of the most critical challenges for object detectors is that detection accuracy is not well balanced across object scales. In the context of autonomous driving, traffic signs and traffic lights can be regarded as small objects. Although many studies [7,8] propose increasing the representational power of the network in depth and width for accurate detection, this raises the complexity and cost of the model. Such models are therefore less suitable for autonomous driving systems, given their real-time resource constraints.

        In general, deep learning-based object detection models are divided into (1) two-stage and (2) single-stage detection algorithms [1,3]. Two-stage models achieve higher accuracy than single-stage models at the cost of speed and complexity, which may not directly benefit real driving scenarios. Recently, efforts have been made to match or even surpass their performance with single-stage models [2, 4]. Consequently, many new single-stage detectors have been developed for such applications. In this letter, we focus on a popular single-stage detector, the You Only Look Once version 5 (YOLOv5) model [9]. It is the latest release in the YOLO series, with a clean and flexible structure designed for high performance and speed on accessible platforms. However, current systems applying this model rely on traditional training methods, regularization/normalization techniques, or tuning of specific parameters to improve performance, with little or no architectural modification considered. Although YOLOv5 is a general-purpose object detector, it is not optimized for small-object detection and thus cannot be adapted to specific use cases out of the box.

        This letter proposes architectural improvements to the original YOLOv5 model for better small-object detection. To this end, we consider the actual road environment of autonomous driving systems and target small road objects such as traffic signs and traffic lights. Furthermore, we discuss how our modifications affect the accuracy of this task while maintaining real-time speed and only slightly increasing the computational complexity of the system. The highlights of our contributions are:

• We optimize the existing YOLOv5 model and design an improved YOLOv5 architecture, named iS-YOLOv5, to better detect small objects in autonomous driving scenarios.

• We investigate the applicability of our model to different weather scenarios to highlight its importance in the context of more robust and efficient object detection.

• Extensive experiments on the BDD100K dataset demonstrate the effectiveness of the proposed model. Furthermore, we present empirical results on traffic sign and traffic light detection on the TT100K and DTLD datasets, respectively.


2. Related work

        Over the years, many researchers have shown keen interest in developing and using deep learning-based models to improve object detection performance. With the advent of the YOLO family [6,10], various applications have leveraged YOLO and its architectural successors, valued for their real-time detection speed more than their detection accuracy. Accordingly, many studies propose applying YOLO models to detect road objects in autonomous driving scenarios. For example, works [11-13] implement YOLO v1-v3 for real-time object detection for self-driving cars in clear environments. Likewise, [14,15] exploit the strengths of YOLOv4 to detect specific road objects in ideal scenes. All of these approaches achieved promising results but made no effort to modify the architecture. Furthermore, [16,17] explored the advantages of structural changes in YOLOv4 to improve detection accuracy in limited driving scenarios. However, these methods are not universally applicable because they do not account for the increased complexity and inference time. Although the overall structure of YOLOv4 is similar to YOLOv5, the latter focuses on ease of use and is mainly designed for low-compute platforms [9,18]. Consequently, systems using YOLOv5 benefit from lower complexity and balanced performance in various real-world applications.

        Recently, there have been some efforts to tune and implement YOLO models for small-object detection in autonomous driving scenarios. For example, [19-23] exploit the strengths of YOLO v3-v4 to detect road objects such as traffic signs and/or traffic lights. Some works [24-26] use YOLOv5 for traffic sign or traffic light detection. Similarly, [27, 28] explore the advantages of YOLO v4-v5 in traffic cone detection. All of these works attempt to optimize the YOLO model, but only to a limited extent, making few substantial changes to its original structure. Furthermore, [29-32] combined additional modules into the YOLOv5 feature extractor to improve the detection of small-scale objects. However, these methods mainly focus on inference speed and sacrifice detection accuracy or consume more resources where necessary. Notably, structural modification has been shown to improve small-object detection performance more effectively than other techniques [27, 33]. In addition, the above works only demonstrate detection results for small objects under common weather conditions; given the variety of road environments autonomous vehicles encounter, daytime analysis alone is insufficient for challenging weather conditions. More importantly, all current YOLOv5-based systems fail to report computational cost, an important metric for autonomous driving. Finally, in most cases the performance improvements on small-object detection tasks are driven by regularization and normalization schemes [22, 26, 30]. Therefore, in this letter, we investigate how to improve detection performance, especially for small objects, at minimal cost through architectural changes alone, without relying on any other techniques.


3. Methodology

        We first discuss the motivation for our work (Section 3.1). We then briefly outline the YOLOv5 architecture and discuss its shortcomings (Section 3.2). Finally, we introduce a series of novel architectural changes to optimize and improve small-object detection performance (Section 3.3).


3.1. Motivation

        Although some techniques [13, 15, 22] have been developed to improve small-object detection performance, only a few researchers have focused on architectural modifications [24, 33] to achieve the same goal. In most cases, performance improvements are driven mainly by additional regularization/normalization methods [26, 32] or by increasing the number of parameters in the framework [8, 16]. It is therefore unclear how architectural improvements can contribute to detection performance on specific tasks. In autonomous driving, accurate detection of small objects provides valuable environmental context, which supports better decision-making strategies. Detecting small objects is more challenging due to foreground/background imbalance, fewer appearance cues, and lower image coverage [3, 6]. In a typical road traffic environment, small-object detection is considered difficult because distant objects become smaller due to viewing-angle distortion [34]. It is worth noting that even within the same class, there are significant differences in localizing small objects of different sizes. Also, many driving systems prioritize inference time over accuracy, although both can be optimized at low cost. Therefore, there is a need for a simple and efficient road object detection model that can handle both large and small objects of various sizes.

        Inspired by the above observations, we analyze different elements of the YOLOv5 architecture to improve detection performance on specific tasks. To the best of our knowledge, no prior work optimizes the YOLOv5 structure for autonomous driving through architectural changes that detect small objects accurately and at faster inference speed, without compromising large-object detection accuracy and with minimal increase in model complexity.


3.2. YOLOv5

        Like other recent single-stage detection models, YOLOv5 offers strong feature extraction, fast detection speed, and high accuracy. The YOLOv5 series provides four model scales: YOLOv5-S, YOLOv5-M, YOLOv5-L, and YOLOv5-X, where S, M, L, and X stand for small, medium, large, and extra-large [2]. The network structure of these models is identical, but the modules and convolution kernels are scaled, which changes the complexity and size of each model. In this letter, we analyze and focus on the YOLOv5-S variant. The basic architecture of YOLOv5 is shown in Figure 1 and consists of the input, backbone, neck, and head.

        At the input stage, YOLOv5 uses Mosaic data augmentation, which benefits small-scale detection. The first part of the architecture is the backbone network, which consists of the Focus layer [9], BottleNeckCSP [18], and the SPP module [35]. The main function of the Focus layer is image slicing, which reduces information loss during downsampling, simplifies numerical computation, and improves training speed. BottleNeckCSP not only reduces the overall computational burden but also extracts deep feature information more efficiently. The SPP module increases the receptive field of the network by converting variable-sized feature maps into fixed-size feature vectors. The second component is the neck network, which applies PANet [36] and FPN [37] operations. FPN delivers strong semantic features to the shallow layers through upsampling, while the PANet structure propagates precise localization features up to the deep layers. The multi-scale features from both paths are then fused to enhance detection ability. The last element is the head network, which performs the final detection at different scales. It uses anchor boxes to output a vector containing class probabilities, confidence scores, and bounding-box information.
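To make the slicing operation concrete, below is a minimal PyTorch sketch of a Focus-style layer following the publicly documented YOLOv5 pattern; the channel counts, kernel size, and LeakyReLU activation here are illustrative choices, not details taken from this letter.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    # Slice the input into four pixel-interleaved patches and stack them along
    # the channel axis: (B, C, H, W) -> (B, 4C, H/2, W/2), then fuse with a
    # convolution. Unlike strided convolution or pooling, this 2x downsampling
    # discards no pixels.
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * c_in, c_out, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        # Four phase-offset sub-images, each at half the original resolution.
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))
```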

        The detection performance of YOLOv5 is good, but we observe some major limitations. First, it mainly targets the COCO dataset for general object detection tasks and is not necessarily suited to the specific tasks and datasets discussed in this work. Second, the PANet structure pays little attention to the information flow between non-adjacent layers, so information is continuously lost in each aggregation step. Third, the max-pooling operation in the SPP module causes considerable information loss, so local and global information cannot be fully exploited for localization. Fourth, the information paths connecting different components of the YOLOv5 architecture limit computational efficiency and are suboptimal for extracting relevant features of small-scale objects.


3.3. Proposed Architecture Changes 

        The original YOLOv5 model needs improvement for small-object detection in autonomous driving. We introduce several modifications that further improve detection speed and accuracy while barely adding complexity to the model.

3.3.1. Improved SPP module

        As the depth of a CNN increases, the size of its receptive field grows. Since the input image size is limited, feature extraction over ever-larger receptive fields becomes redundant [34, 38]. The SPP module mitigates this problem by fusing feature maps from receptive fields of different sizes. It combines local and global features to maximize the expressive power of the feature maps, expand the receptive range of the backbone network, and isolate the most important contextual features. To integrate features from receptive fields of different sizes, SPP applies multiple max-pooling operations in parallel. This approach has been shown to improve overall detection accuracy. However, max pooling fails to capture spatial detail and incurs information loss, which leads to inaccurate localization of objects, especially small ones.
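For reference, a minimal sketch of the original SPP pattern follows, assuming the commonly used kernel sizes 5, 9, and 13; with stride 1 and matching padding, the parallel max-pooling branches keep the spatial size, so their outputs can be concatenated with the identity branch.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Parallel max-pooling branches with growing kernels approximate receptive
    # fields of different sizes; concatenation fuses local and global context.
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels]
        )
        self.fuse = nn.Conv2d(c_mid * (len(kernels) + 1), c_out, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```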

        To solve this problem, we propose an improved SPP module that replaces the pooling function with dilated convolutions [39], as shown in Figure 2. Although both operations expand the receptive field of the network, pooling reduces spatial resolution and thereby loses feature information. In contrast, dilated convolutions with different dilation rates enrich the extracted features by capturing the multi-scale information required to detect small objects, without any loss of resolution. The output features are then combined at the same level to strengthen the feature representation. This module therefore improves the network's ability to accurately localize objects, especially small ones, while maintaining fast detection speed with minimal increase in computational cost.
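A sketch of the dilated-convolution replacement is shown below. The text does not list the exact dilation rates, so the rates (1, 3, 5) and the channel layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DilatedSPP(nn.Module):
    # Each 3x3 branch uses dilation r with padding r, so the output keeps the
    # input's spatial size while the receptive field grows with r. No pooling
    # means no loss of spatial resolution.
    def __init__(self, c_in, c_out, rates=(1, 3, 5)):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_mid, c_mid, 3, padding=r, dilation=r, bias=False)
             for r in rates]
        )
        self.fuse = nn.Conv2d(c_mid * (len(rates) + 1), c_out, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [b(x) for b in self.branches], dim=1))
```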


3.3.2. Improved PANet structure

        The purpose of the neck network is to aggregate the features produced by the backbone to improve the accuracy of subsequent predictions. This structure plays a crucial role in preventing small-object information from being lost at higher abstraction levels. To achieve this, feature maps are upsampled and aggregated with backbone layers so that they influence detection again [36, 40]. FPN propagates deep semantic information to shallow layers to enhance detection capability but ignores localization information when aggregating shallow and deep features, whereas the PANet structure fuses low-level and high-level information. However, both methods rely on aggregating adjacent features and pay little attention to information exchange between non-adjacent layers. Consequently, each aggregation step continuously dilutes the spatial information from non-adjacent layers.

        Therefore, to address the above limitations, we design an improved PANet structure based on the original one, as shown in Figure 3. We add two cross-layer connections (B1 and B2), one in the top-down path of FPN and the other in the bottom-up path of PANet, to integrate non-adjacent and multi-level features. During aggregation, this allows more efficient use of semantic and shallow positional information, enhancing the important features of small objects without increasing computational complexity.
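The following is a hypothetical sketch of what such a cross-layer connection could look like in PyTorch: a feature map from a non-adjacent stage is resized to the target resolution and concatenated into the aggregation node. The module name, interface, and 1×1 fusion convolution are our own illustrative choices; Figure 3 of the paper defines the actual B1/B2 placement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    # Inject a skip feature from a non-adjacent layer into an aggregation
    # node, so shallow localization cues survive repeated fusion steps.
    def __init__(self, c_skip, c_main, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_skip + c_main, c_out, 1, bias=False)

    def forward(self, skip_feat, main_feat):
        # Match spatial sizes before concatenating along channels.
        skip_feat = F.interpolate(skip_feat, size=main_feat.shape[-2:],
                                  mode="nearest")
        return self.fuse(torch.cat([skip_feat, main_feat], dim=1))
```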


3.3.3. Improved Information Path

        The architectural design of YOLOv5 is simple, but its internal component arrangement leaves room for optimizing computational efficiency and real-time applicability. Therefore, we redirect some connections to focus on detecting multi-resolution feature maps. Feature maps are extracted as the input passes layer by layer through convolutions: earlier convolutional layers produce feature maps that capture small-scale objects, while later layers capture large-scale objects. BottleNeckCSP is the most basic building block of YOLOv5 and extracts most of the contextual features, but its current form is inefficient for deep feature extraction, resulting in poor detection of small objects. Another important aspect is choosing an appropriate activation function, since a poor choice can limit performance even when multiple convolutional and normalization layers are added. Furthermore, the head lacks the capacity to extract enough shallow features to localize small objects.

        In response to the above problems, we propose an improved Scaled YOLOv5 (iS-YOLOv5), a robust and efficient architecture. Its detailed structure is shown in Figure 5 (note: Figures 4 and 5 appear swapped in the original paper). To overcome the limitations of BottleNeckCSP, a new functional block named N-CSP is introduced by modifying its information path. We reduce the number of N-CSP blocks in the backbone network to trim network parameters and improve computation speed. Furthermore, we apply Hard Swish activations instead of Leaky ReLU at specific layers of the network, selecting among activations depending on the input size [41] to avoid information loss and reduce computational cost. On the detection side, we add a detection head for small-scale objects fed by high-resolution feature maps. In the neck, we tune the N-CSP blocks to focus on detecting multi-scale features. Together, these changes improve detection of objects at different scales, especially small objects. Note that our collective modifications barely change the computational complexity while significantly improving detection performance and meeting real-time requirements.
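As a rough illustration of these two changes, the sketch below shows a CSP-style block with the activation swapped to Hard Swish (nn.Hardswish, present in the PyTorch version used later in this letter). The internal layout of the paper's N-CSP block is not specified in this text, so this simply follows the standard BottleNeckCSP pattern and should be read as an assumption.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish(inplace=True)  # piecewise, cheaper Swish variant

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPBlock(nn.Module):
    # Half the channels pass through stacked 1x1/3x3 convolutions, half bypass
    # them; re-joining the two paths keeps gradients flowing along both.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.main = ConvBNAct(c_in, c_mid, 1)
        self.bypass = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.blocks = nn.Sequential(*(
            nn.Sequential(ConvBNAct(c_mid, c_mid, 1), ConvBNAct(c_mid, c_mid, 3))
            for _ in range(n)
        ))
        self.fuse = ConvBNAct(2 * c_mid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.blocks(self.main(x)), self.bypass(x)], dim=1))
```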


4. Experimental results

        In this section, we describe the autonomous driving dataset, training environment, and performance evaluation metrics. Subsequently, we verify the superiority of the proposed method through several experiments.

4.1. Dataset description

        We choose BDD100K [42] as our main dataset to verify the performance of the proposed method. Furthermore, we analyze traffic sign and traffic light detection performance using the TT100K [43] and DTLD [44] datasets, respectively. Both are very challenging traffic object detection datasets: TT100K provides 30K annotated traffic signs and DTLD 200K annotated traffic lights. The BDD100K dataset includes 100K autonomous driving images of diverse environments under various daytime and nighttime weather scenarios. For some categories, we consider an object large if its footprint exceeds 112 × 112 pixels and small if it is below 48 × 48 pixels; objects between these two thresholds are considered medium. In this evaluation, we mainly focus on small objects such as traffic lights and traffic signs; more than 80% of these objects are smaller than 48 × 48 pixels.
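One plausible reading of these thresholds, comparing both box sides against them, is sketched below; the function name and interface are our own.

```python
def size_category(w_px: int, h_px: int) -> str:
    # Thresholds from the text: >112x112 px is large, <48x48 px is small,
    # everything in between is medium.
    if w_px > 112 and h_px > 112:
        return "large"
    if w_px < 48 and h_px < 48:
        return "small"
    return "medium"

print(size_category(30, 40))    # small  (e.g., a distant traffic light)
print(size_category(200, 150))  # large
```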


4.2. Data Augmentation

        To improve the quality of the training data, we apply a series of data augmentation techniques. These let the model learn object features at different scales, illuminations, and angles, thereby improving its generalization to unseen data. Among the many data augmentation methods [3], we employ image displacement, linear scaling, horizontal flipping, motion blur, uniform cropping, and noise addition. Furthermore, we use Mosaic data augmentation [18], which combines four images into one training sample; a practical benefit is that training remains feasible on a single GPU. Figure 4 depicts the working process of Mosaic data augmentation (note: Figures 4 and 5 appear swapped in the original paper). For a fair comparison of experimental results, we apply the same data augmentation techniques to all object detection models.
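A minimal sketch of the Mosaic idea is given below: four images are resized into the quadrants around a random center point of one canvas. Bounding boxes would need to be remapped accordingly, which is omitted here; the fill value 114 and output size 640 are conventional YOLOv5 choices, assumed rather than taken from the paper.

```python
import random
import cv2
import numpy as np

def mosaic(images, out_size=640):
    # images: list of four HxWx3 uint8 arrays.
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    # Random mosaic center, kept away from the borders.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # Each image is squeezed into its quadrant of the canvas.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```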


4.3. Training settings 

        The hardware environment includes an Intel i9-9900K CPU and an NVIDIA Quadro RTX 5000 GPU; the software environment is PyTorch v1.8.0 on Ubuntu 18.04.5. We use the SGD optimizer to learn and update parameters during training [9]. The hyperparameters are set as follows: initial learning rate 0.01, training batch size 4, momentum 0.948, weight decay 0.0001, and 100 epochs. We keep the rest of the configuration at the defaults of the original YOLOv5 model.
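A minimal sketch of this optimizer setup is shown below; the tiny placeholder module merely stands in for the iS-YOLOv5 network, and the training-loop body is elided.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the iS-YOLOv5 network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.948, weight_decay=0.0001)
epochs, batch_size = 100, 4

for epoch in range(epochs):
    ...  # iterate over the training split in batches of 4 and step the optimizer
```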


4.4. Evaluation metrics

        Mean average precision (mAP) is the most commonly used metric for evaluating detection performance [2, 45]. mAP builds on precision (P) and recall (R), defined in Equations 1 and 2, respectively. AP is the average precision over different recall values, obtained from the precision-recall (PR) curve at a given threshold, as defined in Equation 3. Intersection over Union (IoU) measures the overlap between the predicted box and the ground-truth box; here a detection is counted as correct when IoU ≥ 0.5. Finally, mAP is the average of the APs over all classes, as defined in Equation 4. Additionally, we track computational complexity in terms of floating-point operations (FLOPs) and measure per-image inference time to validate model efficiency.

$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$

where TP (true positives) is the number of detections that match a ground-truth object, FP (false positives) is the number of detections with no matching ground truth, and FN (false negatives) is the number of ground-truth objects with no matching detection.

$$AP = \int_0^1 P(\bar{R}) \, d\bar{R} \tag{3}$$

where $P(\bar{R})$ is the measured precision at recall $\bar{R}$.

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{4}$$

where $n$ is the number of categories and $AP_i$ is the AP of category $i$.
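To make the matching criterion concrete, here is a small Python sketch of the IoU computation for axis-aligned boxes in (x1, y1, x2, y2) form; the function name and interface are our own.

```python
def iou(box_a, box_b):
    # Intersection rectangle; width/height clamp to zero for disjoint boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143, below the 0.5 threshold
```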


4.5. Effectiveness of the proposed model

        We present empirical results for traffic light and traffic sign detection in Table 1. In all experiments, mAP was calculated over all classes in the dataset. As shown, the proposed iS-YOLOv5 model improves AP to 65.93% for traffic signs on TT100K and to 72.03% for traffic lights on DTLD. Furthermore, our model improves mAP by 4.47% and 3.85% on the TT100K and DTLD datasets, respectively.

        The performance results of the proposed iS-YOLOv5 model are shown in Table 2. From the results, we can see that our iS-YOLOv5 improves the detection accuracy on small road objects, reaching 53.61% for traffic signs and 57.08% for traffic lights. Furthermore, the proposed model improves mAP by 3.35% and inference speed by 2.57 FPS, while increasing computation (FLOPs) by only 0.36.

        We report results for targets of different scales separately in Table 3. As shown, the proposed iS-YOLOv5 model improves the mAP of large, medium, and small objects by 6.41%, 4.61%, and 1.89%, respectively. It is worth emphasizing that the accuracy gap between large and small objects remains considerable. Overall, we conclude that the detection accuracy of small objects can be improved without compromising that of medium and large objects.

        The validation loss curves are shown in Figure 6; in particular, bounding-box, classification, and objectness loss curves are obtained during training. Furthermore, Figure 7 shows a comparison in terms of PR curves. As shown in the figure, the PR curve of our iS-YOLOv5 completely encloses that of YOLOv5.

        We tested the applicability of our model under various weather conditions, including sunny, cloudy, overcast, snowy, foggy, and rainy. Table 4 lists the generalization ability of the proposed iS-YOLOv5 model in terms of AP values. From the results, it is clear that our model retains strong generalization performance even in very complex environments. This further demonstrates that our modifications improve performance in complex weather scenes, helping to extend current visual perception capabilities.

        The detection performance on small objects in different traffic environments is shown in Figure 8. As observed, when traffic density increases (from top to bottom), YOLOv5's prediction confidence decreases and it starts to miss objects. In contrast, the proposed iS-YOLOv5 model detects traffic signs and traffic lights with high confidence even in high-traffic scenarios.


4.6. Ablation Study of the Proposed Model 

        We analyze the contribution of the different components of the iS-YOLOv5 model. To this end, we conducted ablation experiments, with results shown in Table 5. When the improved SPP module, PANet structure, and information path are applied individually, detection accuracy increases by 1.83%, 1.05%, and 0.47%, inference speed increases by 0.03 FPS, 0.11 FPS, and 2.43 FPS, and computation (FLOPs) increases by 0.32, 0.01, and 0.03, respectively. Note that our modules do not conflict when used in combination. This clearly shows the impact of the proposed structures integrated into our iS-YOLOv5 model.


4.7. Comparison with other detection models

        To verify the superiority of the proposed iS-YOLOv5 model, we compare our method with several widely used object detectors. Table 6 compares the accuracy, speed, and complexity of the different frameworks under the same settings. Clearly, our iS-YOLOv5 model substantially outperforms the benchmarks at relatively low computational cost. This demonstrates that our model achieves satisfactory results and is thus suitable for real-time detection in autonomous driving applications.


5. Conclusion

        In this letter, we study and analyze the effect of different architectural modifications applied to the popular YOLOv5 structure to improve small-object detection performance without sacrificing large-object detection accuracy. To achieve this goal, we made improvements that optimize the flow of information through the different network layers. The result is the proposed iS-YOLOv5 model, which improves detection accuracy and speed without significantly increasing model complexity. We verify the superiority of our iS-YOLOv5 model through extensive experiments on challenging datasets, and we test its generalization to complex road weather conditions. Using these insights, current driving systems can be updated to detect small objects such as traffic signs and traffic lights in situations where existing models fail to detect them at all. In this way, the detection and perception robustness of autonomous vehicles can be further extended for efficient planning and decision-making.

Origin blog.csdn.net/YoooooL_/article/details/130117137