YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception (2022)

YOLOPv2 is here | a YOLOv7-based multi-task version of YOLOP that surpasses both YOLOP and HybridNets

Compared with YOLOP, the changes in this paper are as follows: the backbone network is changed to ELAN; the traffic object detection loss uses CIoU and Focal Loss; the drivable area segmentation branch is connected to the B3 layer of the neck network and trained with CE loss; the lane line segmentation branch now takes its input from the last layer of the FPN, its loss function is changed to Focal Loss + Dice Loss, and transposed convolution (deconvolution) replaces nearest-neighbor upsampling in that branch.

Summary

In the past decade, multi-task learning methods have achieved promising results in solving panoptic driving perception problems, providing both high accuracy and efficiency. Multi-task learning has become a popular paradigm when designing networks for practical autonomous driving systems with limited computational resources. This paper presents an efficient multi-task learning network that can simultaneously perform traffic object detection, drivable area segmentation, and lane line detection. Our model achieves new state-of-the-art (SOTA) performance in terms of accuracy and speed on the challenging BDD100K dataset. In particular, the inference time is halved compared to the previous SOTA model. Code will be released in the near future.

1 Introduction

Although computer vision and deep learning have made tremendous progress, it is still challenging to deploy visual perception models (such as object detection, segmentation, and lane detection) in low-cost autonomous driving applications. Recent efforts have been made to build robust panoptic driving perception systems, one of the key components of autonomous driving. A panoptic driving perception system helps the vehicle fully understand its surroundings through commonly used sensors. Because of their lower cost, camera-based object detection and segmentation are more popular in practical applications. Object detection plays an important role in providing the location and size of traffic obstacles, helping self-driving cars make accurate and timely decisions while driving. In addition, drivable area segmentation and lane line detection provide rich information for route planning and for improving driving safety.

Object detection and semantic segmentation are two long-standing research topics in computer vision. There is a series of excellent works on object detection, such as CenterNet [3], Faster R-CNN [18], and the YOLO series. Commonly used segmentation networks are usually applied to the drivable area segmentation problem, e.g., UNet [10], SegNet [1], and PSPNet [28]. For lane line detection/segmentation, a more powerful network is needed to better fuse high-level and low-level features and to exploit global structural context for sharper segmentation details [14, 9, 24]. However, it is usually not practical to run separate models for each task in a real-time autonomous driving system. In this case, multi-task learning networks [19, 15, 25, 20] offer a potential solution to save computational cost: the network is designed in an encoder-decoder pattern, and the encoder is effectively shared across the different tasks.

In this paper, after an in-depth study of the above methods, we propose an efficient network for multi-task learning. We conduct experiments on the challenging BDD100K dataset [26]. Our model achieves the best performance on all three tasks: 0.83 mAP on the object detection task, 0.93 mIoU on the drivable area segmentation task, and 87.3% accuracy on lane detection. Compared with the baseline, these metrics are greatly improved. Furthermore, we raise the frame rate to 91 FPS on an NVIDIA TESLA V100, far exceeding YOLOP under the same experimental conditions. This further demonstrates that our model reduces computational cost and guarantees real-time prediction, while leaving room for improvement in other experimental studies.

The main contributions of this work are summarized as follows:

Better: We propose a more efficient model structure and adopt more sophisticated bag-of-freebies techniques, e.g., Mosaic and Mixup data augmentation, and apply a new hybrid loss.

Faster: We implement a more efficient network structure and memory allocation strategy for the model.

Stronger: Our model is trained under a powerful network architecture, so it can generalize well to various scenarios while ensuring speed.

2. Related work

In this section, we review related work on all tasks in panoptic driving perception. We also discuss efficient model ensemble techniques.

2.1. Real-time traffic object detection

Current object detection networks can be divided into single-stage and two-stage networks. A two-stage network consists of a region proposal network and a location refinement network; such methods usually achieve high accuracy and robust results. Single-stage networks, by contrast, usually run faster and are therefore often preferred in real-time practical applications. The YOLO series adopts an advanced single-stage object detection design, maintains active iteration, and provided inspiration for our experiments, including YOLOv4, Scaled-YOLOv4, YOLOP, and YOLOv7. In this paper, we use a simple yet powerful network structure, together with effective bag-of-freebies (BoF) techniques, to improve object detection performance.

2.2. Drivable area and lane line segmentation

Remarkable progress has been made in semantic segmentation by using fully convolutional neural networks [2] instead of traditional segmentation algorithms. With extensive research in this field, higher-performance models have been designed, such as the classic encoder-decoder structure of UNet, and PSPNet's pyramid pooling, which extracts features at different levels and helps segment the drivable area effectively. Lane line segmentation, due to specific characteristics such as the thin shape and fragmented distribution of lane lines, requires more meticulous detection capability. SCNN [14] proposed slice-by-slice convolution to transfer information between channels within each layer. ENet-SAD adopts self-attention distillation, which lets low-level feature maps learn from high-level ones, improving performance while keeping the model lightweight.

2.3. Multi-task approaches

2.4. A series of techniques

To improve the accuracy of object detection without increasing inference cost, researchers usually exploit the fact that the training phase and the testing phase are separate. Data augmentation is applied to increase the diversity of input images, so that the object detection model generalizes across different domains. For example, YOLOP applies conventional image mirroring as well as adjustments to brightness, contrast, hue, saturation, and noise. These augmentations are pixel-level adjustments that preserve all original pixel information within the adjusted area. In addition, the authors of the YOLO series also proposed augmentations that operate on multiple images at once: for example, Mosaic augmentation [6] stitches four images together, which effectively increases the batch size and the diversity of the data.
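As an illustration only, here is a minimal Python sketch of the Mosaic idea under simplifying assumptions (fixed gray padding, no bounding-box remapping, which a real detection pipeline must also handle); it is not YOLOv4's or YOLOP's actual implementation:

```python
import random
import numpy as np

def mosaic(images, out_size=640):
    """Minimal Mosaic sketch: stitch four images around a random center.

    `images` is a list of four HxWx3 uint8 arrays; box/label remapping,
    which a real detector pipeline also needs, is omitted here.
    """
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray padding
    # Random mosaic center, kept away from the borders.
    xc = random.randint(out_size // 4, 3 * out_size // 4)
    yc = random.randint(out_size // 4, 3 * out_size // 4)
    # Target regions: top-left, top-right, bottom-left, bottom-right.
    regions = [(0, 0, xc, yc), (xc, 0, out_size, yc),
               (0, yc, xc, out_size), (xc, yc, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Crop a region of the needed size from each source image.
        src = img[:h, :w]
        canvas[y1:y1 + src.shape[0], x1:x1 + src.shape[1]] = src
    return canvas
```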

3. Methodology

In this section, we detail the proposed network architecture for multi-task learning. We discuss how to implement an efficient feed-forward network to jointly accomplish tasks such as traffic object detection, drivable area segmentation, and road lane line detection. In addition, an optimization strategy for the model is proposed.

3.1. General introduction

We design a more efficient network architecture based on existing work such as YOLOP and HybridNets. Our model is inspired by these works: we keep the core design concept but use a stronger backbone for feature extraction. Furthermore, unlike existing work, we use three decoder-head branches to perform the specific tasks, instead of running drivable area segmentation and lane line detection in the same branch. The main reason for this change is that the two tasks differ greatly in difficulty, which means they have different requirements at the feature level and should therefore have different network structures. Experiments in Section 4 show that the newly designed structure effectively improves overall segmentation performance while introducing negligible computational overhead. Figure 2 shows the overall flow of our design.

3.2. Network structure

The proposed network architecture is shown in Fig. 1. It consists of a shared encoder for feature extraction from input images and three decoders for the corresponding tasks. This section demonstrates the network structure of the model.

3.2.1 Shared Encoder

Different from YOLOP, which uses CSPDarknet as the backbone, we adopt the ELAN design and make use of group convolution, so that the weights of different layers can learn more diverse features. Figure 2 shows the configuration of the group convolution.
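The following is a hedged, simplified sketch of an ELAN-style block in PyTorch. It only illustrates the idea of keeping every intermediate feature for a final concatenation and using group convolution in the chain; the channel widths, depth, groups, and activation are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU; `groups` enables group convolution."""
    def __init__(self, c_in, c_out, k=3, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2,
                              groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ELANBlock(nn.Module):
    """ELAN-style block: two 1x1 stems, a chain of grouped 3x3 convs whose
    intermediate outputs are all kept, then a 1x1 fusion. `c_mid` must be
    divisible by `groups`."""
    def __init__(self, c_in, c_mid, c_out, depth=2, groups=2):
        super().__init__()
        self.stem1 = ConvBNAct(c_in, c_mid, k=1)
        self.stem2 = ConvBNAct(c_in, c_mid, k=1)
        self.blocks = nn.ModuleList(
            ConvBNAct(c_mid, c_mid, groups=groups) for _ in range(depth))
        self.fuse = ConvBNAct(c_mid * (depth + 2), c_out, k=1)

    def forward(self, x):
        outs = [self.stem1(x), self.stem2(x)]
        y = outs[-1]
        for block in self.blocks:
            y = block(y)
            outs.append(y)  # keep every intermediate feature for the concat
        return self.fuse(torch.cat(outs, dim=1))
```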

In the neck part, the features generated at different stages are collected and fused by concatenation. Similar to YOLOP, we apply a Spatial Pyramid Pooling (SPP) module [7] to fuse features at different scales, and a Feature Pyramid Network (FPN) module [11] to fuse features at different semantic levels.
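Below is a small sketch of an SPP-style module following the general idea of [7]; the kernel sizes and channel reduction are common choices, not necessarily those of this paper:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP sketch: parallel max-pools with different kernel sizes over the
    same feature map, concatenated and fused by a 1x1 convolution."""
    def __init__(self, c_in, c_out, kernel_sizes=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, 1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes)
        self.fuse = nn.Conv2d(c_hidden * (len(kernel_sizes) + 1), c_out, 1)

    def forward(self, x):
        x = self.reduce(x)
        # stride=1 with matching padding keeps every branch the same size.
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```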

3.2.2 Task header

As mentioned above, we design three independent decoder heads, one per task. Similar to YOLOv7, we adopt an anchor-based multi-scale detection scheme. First, we use the Path Aggregation Network (PAN) [12], a bottom-up structure, to better extract localization features. By combining the features of PAN and FPN, we fuse semantic information with these positional features and then run detection directly on the multi-scale fused feature maps of PAN. Each point in the multi-scale feature map is assigned three anchor boxes with different aspect ratios, and the detection head predicts the position offset, the scaled height and width, as well as the class probability and the foreground confidence.
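To make the head's outputs concrete, here is a hedged sketch of YOLO-style anchor decoding; the tensor layout and the sigmoid/exp conventions are assumptions based on the YOLO family, not taken from the released YOLOPv2 code:

```python
import torch

def decode_anchors(pred, anchors, stride):
    """Sketch of YOLO-style anchor decoding. `pred` is
    (B, A, H, W, 5 + num_classes) holding raw outputs
    (tx, ty, tw, th, obj, cls...); `anchors` is (A, 2) in pixels."""
    b, a, h, w, _ = pred.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).float()             # (H, W, 2) cell indices
    xy = (pred[..., :2].sigmoid() + grid) * stride           # center offset within cell
    wh = pred[..., 2:4].exp() * anchors.view(1, a, 1, 1, 2)  # scale the anchor box
    obj = pred[..., 4].sigmoid()                             # foreground confidence
    cls = pred[..., 5:].sigmoid()                            # per-class probability
    return xy, wh, obj, cls
```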

The drivable area segmentation and lane line segmentation are performed by separate heads with different network structures. Unlike YOLOP, where the features for both tasks come from the last layer of the neck, we use features from different semantic levels. We find that drivable area segmentation does not require the deep features needed by the other two tasks; these deeper features do not improve prediction performance but increase the difficulty of model convergence during training. Therefore, the drivable area segmentation branch is connected before the FPN module. Furthermore, to compensate for the possible loss caused by this change, an additional upsampling layer is applied, i.e., a total of four nearest-neighbor interpolation upsamplings are used in the decoder stage.
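A minimal sketch of such a drivable-area head, with illustrative channel widths (the real head's configuration is not specified in this post):

```python
import torch.nn as nn

def drivable_head(c_in, num_classes=2):
    """Drivable-area decoder sketch: the branch taps the neck before the
    FPN, so four 2x nearest-neighbor upsamplings (16x total) restore the
    resolution. Channel widths here are assumptions."""
    layers, c = [], c_in
    for _ in range(4):  # four 2x nearest-neighbor upsampling stages
        c_next = max(c // 2, 16)
        layers += [
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(c, c_next, 3, padding=1),
            nn.BatchNorm2d(c_next),
            nn.SiLU(),
        ]
        c = c_next
    layers.append(nn.Conv2d(c, num_classes, 1))  # per-pixel class logits
    return nn.Sequential(*layers)
```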

For lane line segmentation, the input of the branch is the last layer of the FPN, so that deeper semantic features can be exploited; this structure improves the detection accuracy of thin, hard-to-detect lane lines. In addition, transposed convolutions are used in the lane line detection branch to further improve detection performance.
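For contrast with the drivable-area head above, here is a sketch of a lane-line head built from learnable transposed convolutions, which the text credits with better recovery of thin structures; widths and the number of stages are again assumptions:

```python
import torch.nn as nn

def lane_head(c_in, num_classes=2, stages=3):
    """Lane-line decoder sketch: transposed convolutions replace the fixed
    nearest-neighbor upsampling of the drivable-area head."""
    layers, c = [], c_in
    for _ in range(stages):  # each ConvTranspose2d doubles the resolution
        c_next = max(c // 2, 16)
        layers += [
            nn.ConvTranspose2d(c, c_next, kernel_size=2, stride=2),
            nn.BatchNorm2d(c_next),
            nn.SiLU(),
        ]
        c = c_next
    layers.append(nn.Conv2d(c, num_classes, 1))  # per-pixel class logits
    return nn.Sequential(*layers)
```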

3.2.3 Design of BOF

Based on the design of YOLOP, we retain the loss function settings for the detection part. Ldet is the detection loss, a weighted sum of the classification loss, the objectness loss, and the bounding-box loss, as shown in Equation 1.
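The equation itself did not survive in this post. A plausible reconstruction, following the description above and YOLOP's formulation, where the weights α1–α3 are tunable hyperparameters:

```latex
L_{det} = \alpha_1 L_{class} + \alpha_2 L_{obj} + \alpha_3 L_{box} \tag{1}
```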

Both Lclass and Lobj use Focal Loss to address sample imbalance: Lclass penalizes classification errors, while Lobj predicts foreground confidence. Lbox reflects the overlap (IoU), aspect ratio, and scale similarity between the prediction and the ground truth. Appropriate loss weights effectively guarantee good multi-task detection results. Drivable area segmentation uses cross-entropy loss, which aims to minimize the classification error between the network output and the ground truth. For lane line segmentation, we use Focal Loss instead of cross-entropy loss, because for hard classification tasks such as lane line detection, Focal Loss effectively focuses the model on difficult samples and thereby improves detection accuracy. Furthermore, we adopt a hybrid loss consisting of Dice loss and Focal loss [29] in our experiments: Dice loss learns the class distribution and alleviates pixel imbalance, while Focal loss forces the model to learn difficult samples. The final loss can be calculated according to Equation 2 as follows.
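Equation 2 is likewise missing from the post. A plausible form, consistent with the symbol description that follows (summing per-class Dice and Focal terms, with γ as the trade-off), is:

```latex
L_{hybrid} = \sum_{c=1}^{C} \left( L_{dice}^{(c)} + \gamma \, L_{focal}^{(c)} \right) \tag{2}
```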

where γ is the trade-off between Dice loss and Focal loss, and C is the total number of categories; C is set to 2 because both drivable area segmentation and lane line segmentation have only two categories (foreground and background).

It is worth mentioning that we introduce the Mosaic and Mixup [27] augmentation strategies into our multi-task learning method, which improves the performance of the model on all three tasks.
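Mosaic was sketched in Section 2.4; for completeness, here is a minimal sketch of image-level Mixup. Label handling for detection (boxes from both images are kept, each weighted by its mixing coefficient in the loss) is omitted, and α is an assumed hyperparameter:

```python
import numpy as np

def mixup(img1, img2, alpha=8.0):
    """Blend two equally sized uint8 images with a Beta-sampled weight."""
    lam = np.random.beta(alpha, alpha)  # lam stays near 0.5 for large alpha
    mixed = (lam * img1.astype(np.float32)
             + (1.0 - lam) * img2.astype(np.float32)).astype(np.uint8)
    return mixed, lam
```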

4. Experiments

This section describes the dataset setup and parameter configuration of our experiments. All experiments in this article are run on an NVIDIA TESLA V100 GPU with PyTorch 1.10.

4.1 Dataset

Compared with the Cityscapes and CamVid datasets, BDD100K contains more samples and scenes, annotated with weather conditions, scene locations, and image clarity. Like other studies, we split the dataset into training, validation, and test sets at a ratio of 7:1:2.

4.2. Training protocol

The “cosine annealing” strategy is used to adjust the learning rate during training: the initial learning rate is set to 0.01, and warm-up training is used for the first 3 epochs [13]. Momentum and weight decay are set to 0.937 and 0.005, respectively, and the total number of training epochs is 300. We resize images in the BDD100K dataset from 1280×720×3 to 640×640×3 in the training phase and from 1280×720×3 to 640×384×3 in the testing phase.
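A sketch of this learning-rate schedule in PyTorch; the floor that the cosine decays to is an assumption, since the text only specifies the initial rate, the warm-up length, momentum, and weight decay:

```python
import math
import torch

def make_scheduler(optimizer, epochs=300, warmup_epochs=3, lr_floor_ratio=0.01):
    """Linear warm-up for `warmup_epochs`, then cosine annealing toward
    lr0 * lr_floor_ratio (the floor value is an assumption)."""
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs  # linear warm-up
        progress = (epoch - warmup_epochs) / max(epochs - warmup_epochs, 1)
        cos = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
        return lr_floor_ratio + (1.0 - lr_floor_ratio) * cos
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage, matching the hyperparameters in the text (stand-in model):
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.937, weight_decay=0.005)
sched = make_scheduler(opt)
```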

4.3 Results

In this section, we quantitatively and qualitatively compare the performance of the proposed algorithm with other existing models.

4.3.1 Model parameters and inference speed

Table 1 shows the comparison between two SOTA multi-task models and ours. The results show that our model has a stronger network structure and more parameters, yet runs faster. This benefits from the proposed efficient network design and sophisticated memory allocation strategy. All tests are conducted with the same experimental settings and evaluation metrics.

4.3.2 Traffic Object Detection

As in YOLOP, mAP50 and Recall are used as evaluation metrics. Our model achieves higher mAP50 and Recall, as shown in Table 2.

4.3.3 Drivable area segmentation

Table 3 illustrates the experimental results on drivable area segmentation, using mIoU to evaluate the segmentation performance of different models. Our model achieves the best performance with 0.93 mIoU.

4.3.4 Lane line detection

Lanes in the BDD100K dataset are annotated with two lines, so preprocessing is required. First, we compute the centerline from the two annotation lines; then we draw a lane mask 8 pixels wide for training, while keeping the test-set lane lines 2 pixels wide. We use accuracy and lane IoU as evaluation metrics. As shown in Table 4, our model obtains the highest accuracy.
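A hedged sketch of this preprocessing with OpenCV, assuming the two annotation polylines have already been resampled to the same number of points (the helper name and this simplification are ours, not from the paper):

```python
import cv2
import numpy as np

def lane_mask(left_line, right_line, hw=(720, 1280), width=8):
    """Average two annotation polylines into a centerline and rasterize it
    `width` pixels thick (8 for training, 2 for testing).

    `left_line`/`right_line` are (N, 2) arrays of matching x, y points;
    real BDD100K annotations need resampling to equal length first.
    """
    center = (np.asarray(left_line) + np.asarray(right_line)) / 2.0
    mask = np.zeros(hw, dtype=np.uint8)
    pts = center.round().astype(np.int32).reshape(-1, 1, 2)
    cv2.polylines(mask, [pts], isClosed=False, color=1, thickness=width)
    return mask
```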

4.3.5 Visual discussion

Figures 3 and 4 show a visual comparison of YOLOP, HybridNets, and our YOLOPv2 on the BDD100K dataset. Figure 3 shows daytime results. The left column lists three cases for YOLOP: the first scene has errors and missing drivable-area regions; the second has redundant bounding boxes on small objects and missing drivable-area segmentation; in the third, a lane detection is missed. The middle column shows three scenarios for HybridNets: the first has discontinuous lane predictions; the second shows repeated detections of small vehicles and missed lane detections; the third has false detections of vehicles and lanes. The right column shows our YOLOPv2 results, illustrating that our model performs better across these scenarios.

4.3.6 Ablation experiment

We proposed several improvements and carried out corresponding ablation experiments; the results are shown in Table 5.

5 Conclusion

In this paper, we propose an efficient, end-to-end multi-task deep learning network that can simultaneously perform traffic object detection, drivable area segmentation, and lane line segmentation. YOLOPv2 achieves new state-of-the-art performance on the challenging BDD100K dataset and substantially outperforms existing models in both speed and accuracy.


Source: blog.csdn.net/Jad_Goh/article/details/127704342