Multi-View 3D Object Detection Network for Autonomous Driving

1. Summary

MV3D takes a LIDAR point cloud and an RGB image as input and predicts oriented 3D bounding boxes. The network consists of two parts: 1. a 3D object proposal generation network; 2. a multi-view feature fusion network. The proposal network takes the bird's-eye view of the 3D point cloud as input and generates 3D candidate boxes. A deep fusion scheme then combines region-wise features from the multiple views and performs information exchange between the intermediate layers of the different paths.

2. Introduction

(Figure: the MV3D network pipeline.)

3. Related work

3D object detection in point cloud

Most methods encode the 3D point cloud with a voxel grid representation; Sliding Shapes and Vote3D apply an SVM classifier to such grids for 3D object detection. VeloFCN projects the point cloud onto the front view and applies a fully convolutional network.

3D object detection in images

3DVP uses 3D voxel patterns and an ACF detector for 2D detection and 3D pose estimation. 3DOP reconstructs depth from stereo (binocular) images and uses an energy-minimization approach to generate 3D proposals, which are then fed to an R-CNN for object recognition. Mono3D shares the same pipeline as 3DOP but generates 3D proposals from monocular images only. To exploit temporal information, some work combines 2D and 3D object detection with structure from motion and ground estimation.

Multimodal Fusion

[10] combines images, depth, and optical flow in a hybrid framework for 2D pedestrian detection. [7] fuses RGB and depth images at an early stage and trains a pose-based 2D classifier. The deep fusion method of this paper is inspired by [14, 26].

3D Object proposals

......

4. MV3D network architecture

The network takes the multi-view representation of the point cloud and RGB images as input. First, 3D proposals are generated from the bird's eye view; multi-view features are then fused based on region-wise feature representations. The fused features are used for classification and oriented 3D box regression.

4.1 3D point cloud representation

4.1.1 Representation of bird's eye view

The bird's-eye-view representation encodes height, intensity, and density. The point cloud is projected and discretized into a 2D grid with a resolution of 0.1 m. For each cell, the height feature is the maximum height of the points in the cell. To capture more detailed height information, the point cloud is divided into M equal slices and a height map is computed for each slice, giving M height maps. The intensity feature is the reflectance value of the point with the maximum height in each cell. The density feature is the number of points in each cell. Intensity and density are computed from the whole point cloud, while the height maps are computed per slice, so the bird's eye view has (M + 2) feature channels.
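The following is a minimal numpy sketch of how such a bird's-eye-view input could be built. The crop ranges, the choice M = 4, and the log-based density normalization are assumptions of this sketch, and the function name birds_eye_view_maps is hypothetical.

```python
import numpy as np

def birds_eye_view_maps(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                        z_range=(-2.0, 0.4), resolution=0.1, num_slices=4):
    """points: (N, 4) array of [x, y, z, reflectance] LIDAR returns.
    Builds the (M + 2)-channel bird's-eye-view input described above:
    M height maps, one intensity map, and one density map."""
    x, y, z, refl = points[:, 0], points[:, 1], points[:, 2], points[:, 3]

    # Keep only points inside the cropped 3D region of interest.
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, refl = x[keep], y[keep], z[keep], refl[keep]

    # Discretize to a 2D grid with 0.1 m resolution (rows ~ x, cols ~ y).
    H = int(round((x_range[1] - x_range[0]) / resolution))
    W = int(round((y_range[1] - y_range[0]) / resolution))
    rows = ((x - x_range[0]) / resolution).astype(np.int64)
    cols = ((y - y_range[0]) / resolution).astype(np.int64)

    # Assign each point to one of the M vertical slices.
    slice_height = (z_range[1] - z_range[0]) / num_slices
    slices = np.minimum(((z - z_range[0]) / slice_height).astype(np.int64),
                        num_slices - 1)

    height_maps = np.zeros((num_slices, H, W), dtype=np.float32)
    intensity_map = np.zeros((H, W), dtype=np.float32)
    density_map = np.zeros((H, W), dtype=np.float32)
    top_z = np.full((H, W), -np.inf, dtype=np.float32)

    for i in range(len(x)):
        r, c, s = rows[i], cols[i], slices[i]
        # Height feature: maximum height of the points in the cell (stored
        # relative to the crop floor so that empty cells remain 0).
        height_maps[s, r, c] = max(height_maps[s, r, c], z[i] - z_range[0])
        # Intensity feature: reflectance of the highest point in the cell.
        if z[i] > top_z[r, c]:
            top_z[r, c] = z[i]
            intensity_map[r, c] = refl[i]
        # Density feature: number of points per cell (normalized below).
        density_map[r, c] += 1.0

    # One common normalization choice for the density (an assumption here).
    density_map = np.minimum(1.0, np.log(density_map + 1.0) / np.log(64.0))

    return np.concatenate([height_maps, intensity_map[None], density_map[None]],
                          axis=0)          # shape: (M + 2, H, W)
```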

4.1.2 Front view representation

The front view provides complementary information to the bird's eye view. Since the LIDAR point cloud is very sparse, projecting it onto the image plane results in a sparse 2D point map. This paper instead projects it onto a cylindrical surface to generate a dense front view. Given a 3D point p = (x, y, z), its front view coordinates are:

p_fv = (r, c), with

c = ⌊ atan2(y, x) / Δθ ⌋
r = ⌊ atan2(z, √(x² + y²)) / Δφ ⌋

where Δθ and Δφ are the horizontal and vertical resolution of the laser beams, respectively.
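A small sketch of this cylindrical projection; the default beam resolutions are illustrative values, not taken from the paper.

```python
import numpy as np

def front_view_coords(points, delta_theta_deg=0.08, delta_phi_deg=0.4):
    """Project LIDAR points (N, 3) onto a cylindrical front-view grid.
    Returns the (row, column) indices per point; a real implementation would
    then shift these indices so they start at zero."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d_theta = np.deg2rad(delta_theta_deg)   # horizontal beam resolution
    d_phi = np.deg2rad(delta_phi_deg)       # vertical beam resolution

    # Column from the azimuth angle, row from the elevation angle.
    c = np.floor(np.arctan2(y, x) / d_theta).astype(np.int64)
    r = np.floor(np.arctan2(z, np.sqrt(x**2 + y**2)) / d_phi).astype(np.int64)
    return r, c
```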

4.2 3D proposal network

The bird's eye view is used as the input of the proposal network. Compared with the front view and the image plane, it has the following advantages: 1. objects keep their physical size when projected onto the bird's eye view; 2. objects occupy different space in the bird's eye view, which avoids the occlusion problem; 3. in road scenes, objects lie on the ground plane with small variance in vertical position, so the bird's eye view allows more accurate 3D bounding boxes.

Given the bird's eye view, the network generates 3D box proposals from a set of 3D prior boxes (anchors). Each 3D box is parameterized as (x, y, z, l, w, h), representing the center position and size of the box in the point cloud coordinate system. For each 3D prior box, the corresponding bird's-eye-view anchor is obtained by discretizing (x, y, l, w). The N 3D prior boxes are designed by clustering the ground-truth boxes of the training set. For cars, (l, w) takes the values {(3.9, 1.6), (1.0, 0.6)}, and the height is 1.56 m.

Because the LIDAR point cloud is sparse, many anchors are empty. These empty anchors are removed to reduce computation. For each non-empty anchor, the network generates a 3D box; non-maximum suppression (NMS) is then applied to reduce redundancy. A sketch of anchor generation and empty-anchor removal is given below.
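A rough sketch of how the bird's-eye-view anchors could be generated and how empty anchors could be filtered out with the density map. The grid stride, the 0/90-degree orientation handling, and the function names are assumptions of this sketch.

```python
import numpy as np

def make_bev_anchors(bev_shape, resolution=0.1, stride=4,
                     sizes=((3.9, 1.6), (1.0, 0.6)), height=1.56):
    """Tile the prior boxes over the bird's-eye-view grid (rows ~ x, cols ~ y).
    Each (l, w) cluster is used at 0 and 90 degree orientations."""
    H, W = bev_shape
    rows, cols = np.meshgrid(np.arange(0, H, stride), np.arange(0, W, stride),
                             indexing="ij")
    xs = rows.reshape(-1) * resolution          # anchor centers in metres
    ys = cols.reshape(-1) * resolution
    centers = np.stack([xs, ys], axis=-1)

    anchors = []
    for l, w in sizes:
        for length, width in ((l, w), (w, l)):  # 0 and 90 degree orientations
            dims = np.tile([length, width, height], (len(centers), 1))
            anchors.append(np.hstack([centers, dims]))
    return np.concatenate(anchors, axis=0)      # (num_anchors, 5): x, y, l, w, h

def remove_empty_anchors(anchors, density_map, resolution=0.1):
    """Discard anchors whose bird's-eye-view footprint contains no LIDAR
    points; this greatly reduces the number of boxes the network must score."""
    keep = []
    for i, (x, y, l, w, _) in enumerate(anchors):
        r0 = max(int((x - l / 2) / resolution), 0)
        r1 = int((x + l / 2) / resolution) + 1
        c0 = max(int((y - w / 2) / resolution), 0)
        c1 = int((y + w / 2) / resolution) + 1
        patch = density_map[r0:r1, c0:c1]
        if patch.size > 0 and patch.sum() > 0:
            keep.append(i)
    return anchors[keep]
```

In practice, the per-anchor emptiness test can be made much cheaper than the loop above by precomputing an integral image over the point-occupancy map.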

 

4.3 Region-based fusion network

4.3.1 Multi-view ROI Pooling

Features from different views and modalities have different resolutions, so ROI pooling is used to obtain feature vectors of the same length for each view. The generated 3D proposals are projected onto three views: the bird's eye view (BV), the front view (FV), and the image plane (RGB). Given a 3D proposal p_3D, the ROI in each view is obtained by:

ROI_v = T_{3D→v}(p_3D),  v ∈ {BV, FV, RGB}

where T_{3D→v} denotes the transformation from the point cloud coordinate system to the bird's eye view, the front view, and the image plane, respectively.

Given an input feature map x from the front-end network of each view, ROI pooling produces fixed-length features:

f_v = R(x, ROI_v),  v ∈ {BV, FV, RGB}

where R denotes the ROI pooling operation.
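A minimal sketch of the multi-view ROI pooling step using torchvision's roi_pool. The 7×7 output size, the per-view feature strides, and the dictionary-based interface are assumptions of this sketch, not details from the paper.

```python
import torch
from torchvision.ops import roi_pool

def multiview_roi_pool(feature_maps, rois, output_size=(7, 7), spatial_scales=None):
    """Pool a fixed-length feature from each view for every proposal.
    feature_maps: {"bv"|"fv"|"rgb": (1, C, H, W) tensor} from each front-end network.
    rois: {"bv"|"fv"|"rgb": (K, 5) tensor [batch_idx, x1, y1, x2, y2]}, i.e. the
    3D proposals already projected into each view by T_{3D->v}."""
    if spatial_scales is None:
        spatial_scales = {"bv": 1 / 8, "fv": 1 / 8, "rgb": 1 / 16}
    pooled = {}
    for view, feat in feature_maps.items():
        # R(x, ROI_v): crop the view's feature map with the projected ROI and
        # max-pool it to a fixed spatial size.
        pooled[view] = roi_pool(feat, rois[view], output_size=output_size,
                                spatial_scale=spatial_scales[view])
    return pooled   # each entry: (K, C, 7, 7), ready for the fusion network
```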

4.3.2 Deep fusion

Instead of fusing the multi-view features only at the input (early fusion) or only at the output (late fusion), the deep fusion scheme interleaves per-view feature transformations with fusion operations:

f_0 = f_BV ⊕ f_FV ⊕ f_RGB
f_l = D_l^BV(f_{l-1}) ⊕ D_l^FV(f_{l-1}) ⊕ D_l^RGB(f_{l-1}),  l = 1, ..., L

where D_l denotes a feature transformation (e.g. a fully connected layer) and ⊕ is the join operation, for which element-wise mean is used in deep fusion.
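A compact PyTorch sketch of such a deep fusion block, assuming flattened per-view ROI features and fully connected transformation layers; the class name, layer sizes, and number of levels are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Hierarchical (deep) fusion over three views: each level transforms the
    shared feature along a per-view path, then re-joins the paths with an
    element-wise mean, so information is exchanged at intermediate layers."""
    def __init__(self, in_dim=2048, hidden_dim=512, num_levels=3):
        super().__init__()
        self.views = ("bv", "fv", "rgb")
        self.levels = nn.ModuleList([
            nn.ModuleDict({
                v: nn.Sequential(
                    nn.Linear(in_dim if l == 0 else hidden_dim, hidden_dim),
                    nn.ReLU())
                for v in self.views})
            for l in range(num_levels)
        ])

    def forward(self, f_bv, f_fv, f_rgb):
        # f_0: element-wise mean of the flattened per-view ROI features.
        f = (f_bv + f_fv + f_rgb) / 3.0
        for level in self.levels:
            # Each view transforms the fused feature along its own path,
            # then the three paths are joined again.
            f = torch.stack([level[v](f) for v in self.views], dim=0).mean(dim=0)
        return f
```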

4.3.3 3D box regression with direction

The network regresses the 8 corners of the 3D box, i.e. a 24-dimensional vector of corner offsets (see the sketch below).
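As a sketch, the 24-D regression target could be encoded as follows; normalizing the corner offsets by the proposal's diagonal length is an assumption here, and the helper is hypothetical.

```python
import numpy as np

def encode_corner_targets(gt_corners, proposal_corners, proposal_diag):
    """Offsets from the proposal's 8 corners to the ground-truth corners,
    normalized by the proposal's diagonal length.
    gt_corners, proposal_corners: (8, 3) arrays; proposal_diag: scalar."""
    return ((gt_corners - proposal_corners) / proposal_diag).reshape(-1)  # (24,)
```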

A multi-task loss is used to predict the category and the oriented 3D box: cross-entropy for the classification loss and smooth L1 for the 3D box regression loss. A 3D proposal is considered a positive sample if its IoU with a ground-truth box is greater than 0.5, and a negative sample otherwise. At inference time, NMS with a threshold of 0.05 is applied to the 3D boxes.
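A minimal sketch of the resulting multi-task objective (cross-entropy plus smooth L1 applied only to positive proposals); the loss weighting `lam` and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
    """cls_logits: (N, num_classes); labels: (N,) with 0 = negative sample;
    box_pred, box_target: (N, 24) corner regression vectors."""
    cls_loss = F.cross_entropy(cls_logits, labels)

    # Only positive proposals (label > 0) contribute to the box loss.
    pos = labels > 0
    if pos.any():
        reg_loss = F.smooth_l1_loss(box_pred[pos], box_target[pos])
    else:
        reg_loss = box_pred.sum() * 0.0   # keep the graph when no positives
    return cls_loss + lam * reg_loss
```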

4.3.4 Regularization of the network

At each iteration, global drop-path or local drop-path is chosen, each with a probability of 50%. For global drop-path, a single view is randomly selected from the three views; for local drop-path, each input path to a join node is dropped with 50% probability, with the constraint that at least one input path is always kept. A sketch of this scheme is shown below.
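A small sketch of this drop-path scheme applied at a join node, assuming the join is an element-wise mean over the surviving paths; the function is illustrative, not the paper's implementation.

```python
import random
import torch

def drop_path_join(features, training=True):
    """features: list of the three per-view path outputs feeding a join node."""
    if not training:
        return torch.stack(features).mean(dim=0)

    if random.random() < 0.5:
        # Global drop-path: keep a single randomly chosen view.
        kept = [random.choice(features)]
    else:
        # Local drop-path: drop each input path with 50% probability.
        kept = [f for f in features if random.random() < 0.5]
        if not kept:                      # make sure at least one input remains
            kept = [random.choice(features)]
    return torch.stack(kept).mean(dim=0)
```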

In addition, auxiliary paths and losses are added to the network during training: each view has its own auxiliary path supervised by the same classification and 3D box regression losses, which regularizes the per-view features; the auxiliary paths are removed at inference.

 

 

5. Experiment

 

Origin www.cnblogs.com/ahuzcl/p/12691286.html