CVPR2020 | Exploration of object visibility information in 3D detection

Author: Jiang Tianyuan
Date: 2020-04-21
Foreword
This paper is joint work from Carnegie Mellon University and Argo AI, accepted to CVPR 2020 as an oral. Its topic is point-cloud-based 3D object detection. Unlike previous work, it starts from the observation that free space and unknown space cannot be distinguished in a BEV view. The two red boxes in figure (a) below both look like empty free space containing no points. But if the LiDAR rays are redrawn as in figure (b), where green marks the regions actually swept by the laser and white marks unknown regions occluded by foreground objects, we can see that the left red box is actually unknown space while the right one is genuine free space. The authors therefore exploit this free-space information to improve detection accuracy.
Article address: https://arxiv.org/pdf/1912.04986.pdf
Main content overview
The key observation is that free-space and unknown information is usable feature information that can be fed into a deep network, because a network operating only on raw points cannot tell unknown regions from free space. Based on this observation the authors build a visibility map, take PointPillars as the baseline, and explore several fusion strategies and data augmentation schemes. The final experiments on nuScenes show a considerable improvement, confirming the value of the observation. A visualization of the results follows.
1. Abstract
1. A large part of current 3D detection research is about finding a suitable representation of 3D sensor data. There are two mainstream families: raw point-based representations and voxel-based representations. Point-based representations retain the original geometric structure without information loss, but point-based methods suffer from the high time cost of the SA and FP modules; this year's CVPR 2020 paper 3DSSD drops the FP layer and redesigns the SA module, reaching accuracy comparable to current two-stage methods at about 25 FPS. Voxel-based methods developed rapidly after the introduction of sparse convolution; PV-RCNN at CVPR 2020, for example, is both efficient and accurate. The intuitive drawback is that voxelization during preprocessing inevitably loses information, especially fine details. Addressing this, CVPR 2020's SA-SSD maps voxel features back onto the original point cloud structure, fusing fine-grained geometry with the voxel features and improving the voxel backbone's ability to perceive geometric structure. There is also a GNN-based detector this year that represents the point cloud as a graph, which requires constructing the graph during preprocessing and therefore carries a large time cost. The authors of this paper likewise look for a representation, one that can encode the free-space information observed above, and in the end adopt a voxel representation.
2. The authors point out that many previous point cloud representations were actually designed for true 3D data, whereas the point clouds in autonomous driving scenes are LiDAR sweeps captured in real time and are really only 2.5D. This is a fair point: the original PointNet series was evaluated on the ModelNet40 dataset, which consists of complete 3D shapes. A real-time LiDAR sweep only contains surface points; the occluded parts would have to be reconstructed, and only a reconstructed, complete point cloud map could count as full 3D information.
3. Because a LiDAR sweep is only 2.5D, using only (x, y, z) coordinates discards the implicit free-space information. This paper recovers that information through 3D ray casting: the authors add a voxelized visibility map as an extra input that encodes free space. They also study how this visibility input interacts with the two data augmentation methods used in this paper: the virtual-object insertion introduced in SECOND, and multi-frame (multi-sweep) fusion.
4. On nuScenes, adding the visibility input proposed in this paper significantly improves the detection accuracy of the current state-of-the-art method.
2. Introduction
2.1 What visibility means in this paper
As mentioned in the introduction, a real-time LiDAR sweep only captures the closest visible surface points, and everything behind those points is occluded. The paper phrases this as: "once a particular scene element is measured at a particular depth, visibility ensures that all other scene elements behind it along its line-of-sight are occluded". This is also why data from a 3D sensor can be laid out in a two-dimensional structure; it is more accurate to call it 2.5D data.
2.2 The importance of Visibility
The paper points out that in many tasks, such as map building and navigation for autonomous driving, visibility is a central concern; in object detection, however, no prior work has mined this information as a guidance signal to improve accuracy. The authors show that with a simple modification of the deep learning architecture and the data augmentation strategy, free-space information can be exploited for 3D detection. They therefore add visibility information to the current state-of-the-art voxel-based method.
2.3 Current representations of visibility
(1) Occupancy map: a representation commonly used in map building for mobile robots
(2) OctoMap: a visibility representation used in general 3D mapping
2.4 Current Lidar-based methods
(1) One-stage and two-stage methods; a follow-up survey post will compare several recent methods
(2)Object augmentation
In this part, the authors specifically discuss the data augmentation method from SECOND: the ground-truth objects in the training scenes are collected into a database, and several of them are randomly inserted into each training scene. Almost all current state-of-the-art methods adopt this augmentation precisely because it is so effective, but the authors point out that it violates the occlusion relationships of real scenes. In this paper the augmentation is therefore modified to respect real-world occlusion. A rough sketch of the basic sampling procedure is shown below.
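To make the procedure concrete, here is a minimal, hypothetical sketch of this kind of ground-truth sampling; the function and variable names, the box layout [x, y, z, w, l, h, yaw], and the axis-aligned collision test are assumptions made for illustration, not SECOND's actual implementation.

```python
import numpy as np

def bev_overlap(box_a, box_b):
    """Axis-aligned BEV overlap test (ignores yaw) - good enough for a sketch."""
    ax, ay, aw, al = box_a[0], box_a[1], box_a[3], box_a[4]
    bx, by, bw, bl = box_b[0], box_b[1], box_b[3], box_b[4]
    return abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (al + bl) / 2

def sample_gt_objects(gt_database, scene_points, scene_boxes, num_samples=10, rng=None):
    """SECOND-style GT sampling, sketched: randomly pick stored (points, box)
    pairs from a database and paste them into the scene, rejecting samples
    whose box collides with an existing box in BEV."""
    rng = rng or np.random.default_rng()
    picks = rng.choice(len(gt_database), size=num_samples, replace=False)
    for idx in picks:
        obj_points, obj_box = gt_database[idx]      # (N, 3) points, (7,) box [x, y, z, w, l, h, yaw]
        if any(bev_overlap(obj_box, b) for b in scene_boxes):
            continue                                # skip samples that collide with existing objects
        scene_points = np.vstack([scene_points, obj_points])
        scene_boxes.append(obj_box)
    return scene_points, scene_boxes
```

Note that nothing in this plain paste respects occlusion; that is exactly the gap the visibility-aware modification later in the paper addresses.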
(3) Multi-frame sweep fusion. The first work to exploit temporal information across frames was an RNN-based 3D detection network published at CVPR 2018 [1]. SECOND later aggregates points from different frames into one sweep while keeping each point's timestamp relative to the current frame. Also at CVPR 2020, 3D-VID from Baidu Research reduces false positives through a spatial feature extraction module and a spatiotemporal fusion module; it was introduced in an earlier blog post by this author. It is worth mentioning that both of these works build on PointPillars.
2.5 Contributions of this paper
1. The authors first introduce a raycasting algorithm that efficiently computes the visibility of a voxel grid, and show that the resulting information can be integrated into batch-based gradient learning.
2. A simple extension of the voxel-based method: the voxelized visibility map is added as additional input information.
3. It is shown that the visibility map can be combined with two current data augmentation methods: virtual-object insertion and the fusion of multiple LiDAR sweeps.
3. Visibility for 3D Object Detection
Before introducing the architecture, the author reviews several current methods and points out two common ingredients: the data augmentation that inserts objects into training scenes, and the fusion of multi-frame features. The paper compares its approach against both. In the blogger's view, multi-frame fusion can approximate the effect of 3D reconstruction, which would also help distinguish free space from unknown space as this paper does. Regarding the innovations of this paper:
1. The author first introduces a method to efficiently compute visibility, the raycasting algorithm.
2. The visibility obtained above is then integrated into the current deep learning network structure.
3.1 Structure of this article
Overview of the network structure
As shown below, the network structure and design here are the same as previous voxel-based methods. The pipeline can be described in two parts: the predefined 3D anchors and the network structure.
1. The picture on the left shows the anchor-based scheme commonly used by voxel-based methods: for each object category, anchor boxes are placed at fixed intervals on the BEV plane, so the number of anchors grows linearly with the number of categories. For this reason the 2019 paper OHS and this year's CVPR 2020 3DSSD adopt anchor-free designs to reduce memory consumption (see the sketch after this list).
2. The picture on the right shows the standard current voxel-based pipeline: 3D sparse convolution is first applied to the point sweep to reduce the height dimension to 1, and 2D convolutions then regress and classify the 3D anchor boxes.
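As referenced in item 1 above, here is a rough, illustrative sketch of dense BEV anchor generation, showing why the anchor count grows linearly with the number of classes; the ranges, sizes, and the two fixed yaw angles are assumptions, not the exact settings of any particular detector.

```python
import numpy as np

def make_bev_anchors(x_range, y_range, stride, sizes, z_centers):
    """Place one anchor per class size (w, l, h) and per yaw at every BEV grid
    location; total anchors = grid cells x orientations x number of classes."""
    xs = np.arange(x_range[0], x_range[1], stride)
    ys = np.arange(y_range[0], y_range[1], stride)
    cx, cy = np.meshgrid(xs, ys, indexing="ij")
    anchors = []
    for (w, l, h), z in zip(sizes, z_centers):
        for yaw in (0.0, np.pi / 2):                        # two orientations per location
            a = np.stack([cx.ravel(), cy.ravel(),
                          np.full(cx.size, z), np.full(cx.size, w),
                          np.full(cx.size, l), np.full(cx.size, h),
                          np.full(cx.size, yaw)], axis=-1)
            anchors.append(a)
    return np.concatenate(anchors, axis=0)                  # (num_anchors, 7)

# e.g. car-sized anchors over a 100 m x 100 m area at 0.5 m stride (illustrative values)
anchors = make_bev_anchors((-50, 50), (-50, 50), 0.5,
                           sizes=[(1.6, 3.9, 1.56)], z_centers=[-1.0])
```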
Data augmentation and multi-frame fusion
As mentioned earlier, the data augmentation studied in this paper is mainly the scheme proposed by SECOND, which extracts objects from a ground-truth database and inserts them into the training scene; the later ablation experiments show it brings a gain of 9.1%. For multi-frame fusion, this paper likewise registers and aggregates multiple sweeps, so each input point gains one more dimension and is expressed as (x, y, z, t); the experiments show this improves the final result by 8.6%. A rough sketch of such multi-sweep aggregation is given below.
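To make the (x, y, z, t) input concrete, here is a small illustrative sketch of multi-sweep aggregation, assuming each past sweep comes with a 4x4 pose that maps it into the current frame; this mirrors the general idea rather than any specific codebase.

```python
import numpy as np

def aggregate_sweeps(sweeps, poses, timestamps, t_ref):
    """Transform each past sweep into the reference (current) frame and append
    its time offset as a 4th channel, yielding points of the form (x, y, z, t)."""
    merged = []
    for pts, pose, ts in zip(sweeps, poses, timestamps):
        homo = np.concatenate([pts[:, :3], np.ones((len(pts), 1))], axis=1)
        xyz_ref = (homo @ pose.T)[:, :3]            # pose: 4x4 sweep-to-reference transform
        dt = np.full((len(pts), 1), t_ref - ts)     # time offset relative to the current frame
        merged.append(np.concatenate([xyz_ref, dt], axis=1))
    return np.concatenate(merged, axis=0)           # (N_total, 4)
```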
3.2 How to calculate Visibility
As mentioned above, this paper improves detection accuracy by adding a visibility map as input. Here we focus on how that visibility is computed:
1. After the LiDAR emits a laser pulse in a given direction, the pulse reflects off an object surface and is received back; the position of the return, i.e., the surface point of the object, can be computed from the laser time of flight (TOF).
2. The author's visibility computation is very intuitive: given the positions of a point and of the LiDAR, connect the two with a line segment in space; every voxel the segment passes through is marked as free space, the voxel containing the point is marked as occupied, and all remaining voxels default to unknown. In the implementation, traversal starts from the voxel containing the origin; at each step the algorithm determines through which face the ray exits the current voxel, and the next voxel visited is the one sharing that face, until the voxel containing the end point is reached. This can be expressed as the following pseudo-code process:
Computing the visibility of a single-frame point cloud
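As a concrete illustration of the traversal just described (the pseudo-code figure itself is not reproduced here), the following is a minimal Python sketch; the names, the scalar voxel size, and the regular-grid layout are assumptions made for the example, not the paper's exact implementation.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def compute_visibility(points, origin, voxel_size, grid_min, grid_shape):
    """Single-sweep visibility: for every LiDAR return, walk the voxels along
    the ray from the sensor origin to the point, stepping through the face the
    ray exits; traversed voxels become FREE, the endpoint voxel OCCUPIED."""
    vis = np.full(grid_shape, UNKNOWN, dtype=np.uint8)
    origin = np.asarray(origin, dtype=float)
    grid_min = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)

    def to_voxel(p):
        return np.floor((p - grid_min) / voxel_size).astype(int)

    def in_grid(v):
        return bool(np.all(v >= 0) and np.all(v < shape))

    for p in np.asarray(points, dtype=float):
        cur, end = to_voxel(origin), to_voxel(p)
        direction = p - origin
        step = np.sign(direction).astype(int)
        next_boundary = grid_min + (cur + (step > 0)) * voxel_size
        with np.errstate(divide="ignore", invalid="ignore"):
            t_max = np.where(direction != 0, (next_boundary - origin) / direction, np.inf)
            t_delta = np.where(direction != 0, voxel_size / np.abs(direction), np.inf)

        remaining = int(np.abs(end - cur).sum())    # voxel transitions needed to reach the end
        while remaining > 0 and not np.array_equal(cur, end):
            if in_grid(cur):
                vis[tuple(cur)] = FREE              # the ray passes through -> free space
            axis = int(np.argmin(t_max))            # face through which the ray leaves this voxel
            cur[axis] += step[axis]                 # move to the neighbour sharing that face
            t_max[axis] += t_delta[axis]
            remaining -= 1

        if in_grid(end):
            vis[tuple(end)] = OCCUPIED              # voxel containing the LiDAR return
    return vis

# e.g. a 0.2 m grid covering [-50, 50] x [-50, 50] x [-3, 3] metres (illustrative values):
# vis = compute_visibility(sweep_xyz, sensor_origin, 0.2, (-50, -50, -3), (500, 500, 30))
```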
For augmented point clouds, the procedure is consistent with the algorithm above; only the termination condition changes from "reaching the end point" to "encountering a BLOCKED voxel", meaning voxels occupied by inserted (augmented) objects are treated as BLOCKED. As shown in the figure below, (a) shows the original scene, and (b) shows an inserted object without any further processing: the object clearly ends up behind a wall, which contradicts common sense. In (c) a common-sense fix is applied by deleting the occluded object ("culling"), but this risks removing the very objects we inserted; the better approach is to delete the occluding points of the wall in front ("drilling"), as shown in (d). This is what terminating on BLOCKED, as mentioned above, enables.
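Under the same assumptions as the sketch above, the change for augmented sweeps can be expressed as a different stopping rule; the helper below is purely illustrative of the BLOCKED termination, and the culling/drilling choice is summarized in the comments.

```python
def first_blocked_voxel(ray_voxels, blocked_mask):
    """ray_voxels: ordered voxel indices a ray visits (as produced by the
    traversal above); blocked_mask: boolean grid marking voxels occupied by
    inserted (augmented) objects. The ray now terminates at the first BLOCKED
    voxel instead of running to its original endpoint."""
    for v in ray_voxels:
        if blocked_mask[tuple(v)]:
            return v            # everything behind this voxel is occluded
    return None

# "culling"  (figure (c)): drop inserted objects whose voxels end up occluded
# "drilling" (figure (d)): instead drop the original points in front that block
#                          the inserted object, keeping the insertion plausible
```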
Computing the visibility of multiple frames
The computation above only gives the visibility of a single sweep. For consecutive sweeps, a simple idea is that, since the initial sensor pose is known, all frames could be merged and treated as one big sweep, but this would incur a large time cost. The authors instead use Bayesian filtering to aggregate the visibility maps across consecutive frames. In the figure below, the left image shows the top view of a single sweep and the corresponding visibility map, where red means occupied, blue means free space, and gray means unknown; figure (b) shows a multi-frame point cloud top view and the corresponding visibility aggregated by Bayesian filtering, where, for each voxel, the redder the color the higher the probability of being occupied.
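A minimal sketch of such a Bayesian aggregation, in the standard log-odds form used for occupancy grids, is shown below; the evidence constants and the visibility codes are illustrative assumptions, not the paper's exact filter.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2   # same codes as the single-sweep sketch above

def fuse_visibility_logodds(vis_maps, l_occ=0.85, l_free=-0.4):
    """Accumulate per-sweep visibility maps in log-odds: OCCUPIED voxels add
    positive evidence, FREE voxels add negative evidence, UNKNOWN voxels leave
    the belief unchanged. Returns an occupancy probability per voxel."""
    log_odds = np.zeros(vis_maps[0].shape, dtype=np.float32)
    for vis in vis_maps:
        log_odds += np.where(vis == OCCUPIED, l_occ, 0.0)
        log_odds += np.where(vis == FREE, l_free, 0.0)
    return 1.0 / (1.0 + np.exp(-log_odds))          # 0.5 means "still unknown"
```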
3.3 Integrating the visibility map into the backbone
As shown in the figure below, the author considers two fusion points, early fusion and late fusion, which essentially differ in whether the visibility map is fused with the raw input features or with the semantic features produced by the backbone. A toy sketch of the two options follows.
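The following toy PyTorch sketch only illustrates where the concatenation happens in the two variants; the channel counts, the stand-in backbone, and the stand-in head are invented for the example and are not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisibilityFusion(nn.Module):
    """'early': concatenate the visibility volume with the pillar/BEV features
    before the 2D backbone; 'late': concatenate it with the backbone output
    just before the detection head."""
    def __init__(self, bev_channels, vis_channels, mode="early"):
        super().__init__()
        self.mode = mode
        in_ch = bev_channels + (vis_channels if mode == "early" else 0)
        self.backbone = nn.Sequential(               # stand-in for the 2D CNN backbone
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU())
        head_in = 128 + (vis_channels if mode == "late" else 0)
        self.head = nn.Conv2d(head_in, 2, 1)         # stand-in detection head

    def forward(self, bev_feat, vis_map):
        # bev_feat: (B, C, H, W) pillar features; vis_map: (B, Z, H, W) visibility volume
        if self.mode == "early":
            x = self.backbone(torch.cat([bev_feat, vis_map], dim=1))
        else:
            x = self.backbone(bev_feat)
            x = torch.cat([x, vis_map], dim=1)
        return self.head(x)
```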
The backbone used in this paper is PointPillars (CVPR 2019); its network structure is shown below. PointPillars builds on VoxelNet: the voxels of VoxelNet are collapsed into vertical pillars, which makes it possible to drop the 3D CNN part entirely while keeping accuracy.
4. Experiments
4.1 nuScenes
As shown in the figure below, the author evaluates on the nuScenes benchmark. In most cases the method improves considerably over the baseline.
The results on the validation set are shown below:
4.2 Ablation experiment
As shown in the figure below, ablation experiments cover the fusion strategy, the treatment of augmented objects, and multi-frame fusion; the combination "early fusion + drilling + multi-frame" performs best.
5. The blogger's thoughts
This paper starts from the observation that current SOTA methods do not use free-space information, and accordingly adds the corresponding visibility map to the baseline network; the experiments are conducted on nuScenes, and the ablation study is thorough. Unlike many other papers, the starting point here is not to fix a problem in the network structure but to notice information that is ignored in practice. Compared with other 3D detection papers this year, this work feels closer to engineering and to the lower level of the data, exploiting information that is normally lost. One could similarly argue that pillars themselves lose height information, and that LiDAR scans also carry angular information; could that be exploited as well? In addition, this free space is expressed with a voxel representation, and the key point is that free-space information can be expressed this way at all; for a point-based method, it is not obvious how free-space information could be attached to individual points.

Source: www.cnblogs.com/YongQiVisionIMAX/p/12742156.html