LPCG: LiDAR Point Cloud Guided Monocular 3D Object Detection

This article presents a research result from a Zhejiang University team: LPCG: Lidar Point Cloud Guided Monocular 3D Object Detection, which was accepted at ECCV 2022.

01 Brief

Monocular 3D object detection is an extremely challenging task in autonomous driving and computer vision. Most previous work relies on manually annotated 3D boxes, and the annotation cost is very high.

The authors made an interesting, counterintuitive discovery: in monocular 3D detection, precise, carefully annotated labels may not be necessary. Detectors trained with perturbed coarse labels achieve accuracy very close to that of detectors trained with ground-truth labels. Investigating this phenomenon in depth, the authors found empirically that the 3D location part of the label is far more critical than all the remaining parts.

Inspired by these findings, and considering the accuracy of LiDAR 3D measurements, the authors propose a simple and effective framework called LiDAR point cloud guided monocular 3D object detection (LPCG). The framework can either reduce annotation cost or significantly improve detection accuracy without introducing extra annotation cost.

Specifically, LPCG generates pseudo-labels from unlabeled LiDAR point clouds. Because the 3D location information in these point clouds is accurate, such pseudo-labels can replace manually annotated labels when training monocular 3D detectors. LPCG can be applied to any monocular 3D detector to make full use of the large amounts of unlabeled data in autonomous driving systems.

On the KITTI benchmark, LPCG ranks first in both monocular 3D and BEV (bird's eye view) detection by a significant margin. On the Waymo benchmark, the method reaches accuracy comparable to a baseline detector trained on 100% of the labeled data while using only 10%.

Figure 1. Overall framework. The authors generate 3D box pseudo-labels from unlabeled LiDAR point clouds to train monocular 3D detectors. These 3D boxes are predicted by trained LiDAR 3D detectors (high-accuracy mode) or obtained directly from the point clouds without training (low-cost mode). "Unsup." and "Sup." in the figure denote unsupervised and supervised, respectively.

02 Model introduction

First, as shown in Figure 2, perfect manual labels are not necessary for monocular 3D detection. The accuracy with perturbed labels (5% noise) is comparable to that with perfect labels. Under larger perturbations (10% and 20%), location dominates performance: AP drops significantly only when the location is perturbed. This shows that coarse pseudo 3D box labels with precise locations can replace perfectly annotated 3D box labels.

Figure 2. The authors perturb perfect hand-annotated labels by randomly shifting the corresponding values within a percentage range. Two observations: 1) noisy labels (5%) and perfect labels lead to close accuracy; 2) location determines the overall accuracy (10%, 20%, 40%).
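To make the perturbation protocol concrete, here is a minimal sketch (with a hypothetical label layout, not the authors' released code): each selected label component is shifted uniformly within ±ratio of its magnitude.

```python
import numpy as np

def perturb_label(box, ratio, parts=("location", "dimension", "orientation")):
    """Randomly shift selected parts of a 3D box label within +/- ratio.

    box: dict with 'location' (x, y, z), 'dimension' (h, w, l),
         and 'orientation' (yaw); ratio: e.g. 0.05 for 5% noise.
    """
    noisy = {k: np.asarray(v, dtype=float).copy() for k, v in box.items()}
    for part in parts:
        v = noisy[part]
        # Shift each value uniformly within +/- ratio of its magnitude.
        v += np.random.uniform(-ratio, ratio, size=v.shape) * np.abs(v)
    return noisy

# 5% noise on all parts vs. 20% noise on the location only.
gt = {"location": [1.5, 1.6, 20.0], "dimension": [1.5, 1.7, 4.2], "orientation": [0.3]}
print(perturb_label(gt, 0.05))
print(perturb_label(gt, 0.20, parts=("location",)))
```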

Note that LiDAR point clouds can provide exactly this valuable 3D location information. More specifically, a LiDAR point cloud gives accurate depth measurements of the scene, and accurate depth yields accurate object positions, which is crucial for 3D object detection. Furthermore, LiDAR point clouds are easily acquired by LiDAR devices, so massive point clouds can be collected offline without labor cost.

Based on this analysis, the authors use LiDAR point clouds to generate 3D box pseudo-labels, which are then used to train a monocular 3D detector. This simple and effective approach lets monocular 3D detectors learn the desired objects from unlabeled data while reducing annotation cost. The overall framework is shown in Fig. 1, and the method works in two modes depending on the reliance on 3D box annotations. If a small number of 3D box annotations are used, as in previous work, the authors call it the high-accuracy mode, because it leads to high performance. In contrast, if no 3D box annotations are used at all, the authors call it the low-cost mode.

2.1 High Accuracy Mode

To exploit the available 3D box annotations, as shown in Figure 1, the authors first train a LiDAR-based 3D detector from scratch using LiDAR point clouds and the associated 3D box annotations. The pre-trained LiDAR-based 3D detector is then used to infer 3D boxes on the remaining unlabeled LiDAR point clouds, and these predictions serve as pseudo-labels to train a monocular 3D detector. In Section 5.5 of the original paper, the authors compare pseudo-labels with perfect manual labels: thanks to the precise 3D position measurements, the pseudo-labels predicted by LiDAR-based 3D detectors are quite accurate and can be used directly to train monocular 3D detectors.

Algorithm 1 in the paper outlines this pipeline.
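The algorithm itself is given as a figure; the following Python-style sketch (train_lidar_detector and train_monocular_detector are hypothetical placeholders, not the paper's code) summarizes the same three steps:

```python
def lpcg_high_accuracy_mode(labeled_data, unlabeled_data):
    """High-accuracy mode: a small labeled set drives pseudo-labeling at scale.

    labeled_data:   (LiDAR point clouds, 3D box annotations)
    unlabeled_data: LiDAR point clouds paired with RGB images, no labels
    """
    # 1. Train a LiDAR-based 3D detector from scratch on the labeled set.
    lidar_detector = train_lidar_detector(labeled_data)

    # 2. Infer 3D boxes on unlabeled point clouds as pseudo-labels.
    pseudo_labels = [lidar_detector.infer(pc) for pc, _ in unlabeled_data]

    # 3. Train the monocular 3D detector on RGB images + pseudo-labels.
    images = [img for _, img in unlabeled_data]
    return train_monocular_detector(images, pseudo_labels)
```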

Interestingly, when using different training settings for the LiDAR-based 3D detector, the authors empirically found that monocular 3D detectors trained on the resulting pseudo-labels show close performance. This indicates that monocular methods can indeed benefit from the guidance of LiDAR point clouds, and that a small number of 3D box annotations is enough to drive monocular methods to high performance. In this way, the manual annotation cost of the high-accuracy mode is much lower than before; see Section 5.6 of the original paper for detailed experiments. Note that the observations on labeling requirements and 3D location are the core motivation of LPCG: the prerequisite for LPCG to work is that LiDAR points provide rich and precise 3D measurements, and thus precise 3D positions.

2.2 Low Cost Mode

In this section, the authors present a method that uses LiDAR point clouds to remove the reliance on manual 3D box labels.

First, an off-the-shelf 2D instance segmentation model segments the RGB image to obtain 2D box and mask estimates. These estimates are used to build camera frustums that select the LiDAR RoI points for each object; boxes without any LiDAR points inside are ignored.
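A minimal sketch of this frustum selection, assuming a KITTI-style 3x4 projection matrix P and LiDAR points already transformed into the camera frame (helper name is illustrative):

```python
import numpy as np

def select_roi_points(points, P, mask):
    """Keep LiDAR points whose projection falls inside one instance mask.

    points: (N, 3) points in the camera coordinate system
    P:      (3, 4) camera projection matrix
    mask:   (H, W) boolean instance mask from the 2D segmentation model
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    uvw = pts_h @ P.T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)

    H, W = mask.shape
    # In front of the camera and within the image bounds.
    keep = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.where(keep)[0]
    idx = idx[mask[v[idx], u[idx]]]  # retain only points landing on the mask
    return points[idx]
```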

However, the LiDAR points lying in the same frustum are a mixture of object points and background or occluded points. To eliminate the irrelevant points, the authors use DBSCAN to divide the RoI point cloud into groups according to density, so that points close in 3D space are gathered into one cluster. The cluster containing the most points is then taken as the one corresponding to the object. Finally, the authors look for the smallest 3D bounding box covering all object points.
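A minimal sketch of this filtering step using scikit-learn's DBSCAN (the eps and min_samples values are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def keep_largest_cluster(roi_points, eps=0.5, min_samples=5):
    """Cluster frustum points by density and keep the largest cluster.

    roi_points: (N, 3) LiDAR points inside one object's frustum.
    Returns None when DBSCAN marks everything as noise.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(roi_points)
    valid = labels >= 0                            # label -1 marks DBSCAN noise
    if not valid.any():
        return None
    biggest = np.bincount(labels[valid]).argmax()  # most populated cluster
    return roi_points[labels == biggest]
```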

To simplify solving the 3D bounding box, the authors project the points onto the bird's eye view, which reduces the number of parameters, since the object's height h and center coordinate y (in the camera coordinate system) can be obtained easily. The 3D box therefore decomposes as

B_{3d} = (B_{bev}, y, h), with B_{bev} = (x, z, w, l, θ),

where B_{bev} refers to the Bird's Eye View (BEV) box. The authors solve for B_{bev} by taking the convex hull of the object points and then applying rotating calipers to obtain the minimum-area rectangle. The height h is given by the maximum spatial offset of the points along the y-axis, and the center coordinate y is computed by averaging the points' y-coordinates. A simple rule restricting object dimensions is used to remove outliers.
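A minimal sketch of the box-fitting step. The paper solves the BEV box via the convex hull and rotating calipers; OpenCV's cv2.minAreaRect is a standard off-the-shelf implementation of exactly that minimum-area rectangle and is used here for brevity:

```python
import cv2
import numpy as np

def fit_3d_box(object_points):
    """Fit a 3D box to one object's points (camera frame: y points down).

    object_points: (N, 3) points belonging to a single object cluster.
    Returns center (x, y, z), size (w, h, l), and the BEV rotation in radians.
    """
    # BEV box: minimum-area rotated rectangle on the x-z plane.
    bev = object_points[:, [0, 2]].astype(np.float32)
    (cx, cz), (w, l), angle_deg = cv2.minAreaRect(bev)

    # Height from the spread along y; center y from the mean of y-coordinates.
    ys = object_points[:, 1]
    h = ys.max() - ys.min()
    y = ys.mean()
    return (cx, y, cz), (w, h, l), np.deg2rad(angle_deg)
```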

Algorithm 2 shows the overall training pipeline of the monocular method.
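As with Algorithm 1, the pipeline appears as a figure in the paper; a condensed sketch reusing the helpers above (segmenter and train_monocular_detector are placeholder names) reads:

```python
def lpcg_low_cost_mode(unlabeled_data, P, segmenter):
    """Low-cost mode: build 3D pseudo-labels without any 3D box annotation."""
    images, pseudo_labels = [], []
    for points, image in unlabeled_data:
        boxes = []
        for mask in segmenter(image):            # 2D instance masks
            roi = select_roi_points(points, P, mask)
            if len(roi) == 0:                    # ignore empty frustums
                continue
            cluster = keep_largest_cluster(roi)
            if cluster is not None:
                boxes.append(fit_3d_box(cluster))
        images.append(image)
        pseudo_labels.append(boxes)
    # Train the monocular 3D detector on the generated pseudo-labels.
    return train_monocular_detector(images, pseudo_labels)
```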

03 Application in real-world autonomous driving systems

In this section, the authors describe the application of LPCG to real-world autonomous driving systems.

Figure 3. Data collection strategies in real-world systems.

First, the data collection strategy is illustrated in Figure 3. Most autonomous driving systems can easily collect large amounts of unlabeled LiDAR point clouds and RGB images simultaneously. The data consists of multiple sequences, where each sequence typically corresponds to a specific scene and contains many consecutive frames. Due to limited time and resources, only some sequences are selected for annotation to train the network (e.g., Waymo), and, to further reduce the high annotation cost, only some keyframes within the selected sequences are annotated (e.g., KITTI). Therefore, a large amount of unlabeled data remains in practical applications.

Since LPCG can take full advantage of unlabeled data, it fits naturally into real-world autonomous driving systems. Specifically, the high-accuracy mode requires only a small amount of labeled data; high-quality training data can then be generated from the remaining unlabeled data to improve the accuracy of the monocular 3D detector. The experiments show quantitatively and qualitatively that the generated 3D box pseudo-labels benefit monocular 3D detectors. Furthermore, the low-cost mode requires no 3D box annotations at all and still provides accurate 3D box pseudo-labels. In terms of data requirements, Table 1 of the paper compares LPCG with previous methods.

04 Summary

In this paper, the labeling requirements of monocular 3D detection are analyzed first. Experiments show that perturbed labels and perfect labels lead to very close performance for monocular 3D detectors. Exploring further, the authors found empirically that the 3D position is the most important part of the 3D box label. In addition, autonomous driving systems can collect large amounts of unlabeled LiDAR point clouds with precise 3D measurements. The authors therefore propose a framework (LPCG) that generates pseudo 3D box labels on unlabeled LiDAR point clouds to expand the training set of monocular 3D detectors. Extensive experiments on multiple datasets verify the effectiveness of LPCG. The main limitation of LPCG is the longer training time caused by the increased number of training samples.
