MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps

1. Summary

MotionNet takes a LiDAR point cloud sequence as input and outputs a bird's eye view (BEV) map that contains, for each cell, the target category and motion information. MotionNet's backbone is a spatio-temporal pyramid network (STPN). Spatio-temporal consistency losses regularize training and enforce smooth predictions over space and time. Open-source code: https://github.com/pxiangwu/MotionNet

2. Introduction

Environmental state estimation consists of two parts: 1. Perception: separating foreground targets from the background; 2. Prediction: forecasting each target's future trajectory [5,22]. Existing pipelines rely on camera-based 2D target detection [20,27,41,63], point-cloud-based 3D target detection [19,46,64], or fusion-based target detection [6,23,24]. The detected bounding boxes are passed to a target tracker, and some methods associate the bounding boxes with trajectories [4,31,59]. This bounding-box-based state estimation strategy is prone to failure in real open scenarios.

The occupancy grid map (OGM) is a common way to represent 3D environmental information: it evenly discretizes the 3D point cloud into 2D grid cells. An OGM can be used to determine the drivable area, but it is difficult to keep consistent across consecutive time steps, and it does not provide the category of each target. To deal with this problem, a BEV map is used to represent the environment. Similar to an OGM, the BEV map extends the grid with three layers of information: occupancy, motion, and category. In this way, the drivable area can be determined and the motion behavior of each target can be described. The contributions are: 1. MotionNet, a bounding-box-free network for joint perception and motion prediction based on bird's eye view maps; 2. a spatio-temporal pyramid network (STPN); 3. spatio-temporal consistency losses that constrain the training of the network.

 

3. Method

The pipeline contains 3 parts: 1. convert the raw 3D point cloud sequence into a BEV representation (Figure 2); 2. extract features with the backbone, a spatio-temporal pyramid network; 3. output heads for classification and motion prediction.

3.1 Ego-motion compensation

The input of the network is a 3D point cloud sequence. Each frame's point cloud is expressed in its own sensor coordinate system, so the past frames must be synchronized to the current frame, representing all point coordinates in the current coordinate system.
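A minimal sketch of this compensation step, assuming each past frame comes with a 4x4 homogeneous transform (here called T_past_to_current, a hypothetical name) derived from the ego-poses at the two timestamps:

```python
import numpy as np

def transform_to_current_frame(points: np.ndarray, T_past_to_current: np.ndarray) -> np.ndarray:
    """Map an (N, 3) past-frame point cloud into the current frame.

    T_past_to_current is a 4x4 homogeneous transform (rotation + translation)
    from the past frame's coordinate system to the current one.
    """
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4) homogeneous coords
    return (homo @ T_past_to_current.T)[:, :3]
```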

3.2 Representation based on bird's eye view

Unlike 2D images, 3D point clouds are sparse and irregularly distributed, so they cannot be processed with standard convolutions. To deal with this problem, they are converted into bird's eye views: first quantize the points into a regular voxel grid, using a simple binary state per voxel that indicates whether it contains at least one point; then treat the 3D voxel lattice as a 2D pseudo-image, with the height dimension as the image channels, so that 2D convolutions can be applied.
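A minimal sketch of this binary voxelization, with the grid extents and voxel size as assumed hyperparameters (not the paper's exact values):

```python
import numpy as np

def points_to_bev(points, x_range=(-32.0, 32.0), y_range=(-32.0, 32.0),
                  z_range=(-3.0, 2.0), voxel_size=(0.25, 0.25, 0.5)):
    """Quantize an (N, 3) point cloud into a binary voxel grid, returned as a
    2D pseudo-image of shape (H, W, D); D (height bins) acts as the channel
    dimension for 2D convolutions."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]
    origin = np.array([x_range[0], y_range[0], z_range[0]])
    idx = np.floor((pts - origin) / np.array(voxel_size)).astype(int)
    H = int(round((x_range[1] - x_range[0]) / voxel_size[0]))
    W = int(round((y_range[1] - y_range[0]) / voxel_size[1]))
    D = int(round((z_range[1] - z_range[0]) / voxel_size[2]))
    grid = np.zeros((H, W, D), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # binary occupancy: voxel contains >= 1 point
    return grid
```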

3.3 Spatio-temporal pyramid network (STPN)

3.4 Output Heads

After the STPN there are 3 heads: 1. a cell-classification head, whose output is H × W × C, where C is the number of cell categories; 2. a motion-prediction head, whose output shape is N × H × W × 2, where N is the number of future frames; 3. a state-estimation head (static or moving), whose output is H × W.
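A minimal PyTorch sketch of such heads on top of the BEV feature map produced by the backbone; the channel counts and names (feat_channels, num_classes, num_future_frames) are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Three 1x1-conv heads on top of a backbone feature map of shape (B, F, H, W)."""
    def __init__(self, feat_channels=32, num_classes=5, num_future_frames=20):
        super().__init__()
        self.num_future_frames = num_future_frames
        self.cls_head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)          # -> H x W x C
        self.motion_head = nn.Conv2d(feat_channels, num_future_frames * 2, kernel_size=1)  # -> N x H x W x 2
        self.state_head = nn.Conv2d(feat_channels, 1, kernel_size=1)                  # -> H x W

    def forward(self, feat):
        b, _, h, w = feat.shape
        cls_logits = self.cls_head(feat)                                   # (B, C, H, W)
        motion = self.motion_head(feat).view(b, self.num_future_frames, 2, h, w)
        state_logits = self.state_head(feat).squeeze(1)                    # (B, H, W)
        return cls_logits, motion, state_logits
```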

 

3.5 Loss function

Classification and state estimation use cross-entropy losses; motion prediction uses an L1 loss.
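A minimal sketch of these per-task losses (the per-cell weighting and masking used in practice are omitted; tensor shapes follow the head sketch above):

```python
import torch
import torch.nn.functional as F

def task_losses(cls_logits, state_logits, motion_pred,
                cls_target, state_target, motion_target):
    """cls_logits: (B, C, H, W); state_logits: (B, H, W);
    motion_pred / motion_target: (B, N, 2, H, W)."""
    cls_loss = F.cross_entropy(cls_logits, cls_target)                           # per-cell category
    state_loss = F.binary_cross_entropy_with_logits(state_logits, state_target)  # static vs. moving
    motion_loss = F.l1_loss(motion_pred, motion_target)                          # displacement regression
    return cls_loss, state_loss, motion_loss
```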

3.5.1 Consistency losses

Spatial consistency loss: for cells belonging to the same rigid object, the predicted motions should be very close, without much divergence (a sketch of this term follows this list).

Foreground temporal consistency loss: assumes there is no sharp change of motion between two consecutive frames.

Background temporal consistency loss.
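A minimal sketch of the spatial consistency idea: within each object mask, penalize the deviation of every cell's predicted motion from the object's mean motion. This illustrates the principle, not the paper's exact formulation; instance_mask is an assumed per-cell object-id map.

```python
import torch

def spatial_consistency_loss(motion_pred, instance_mask):
    """motion_pred: (B, N, 2, H, W) predicted displacements;
    instance_mask: (B, H, W) integer object ids, 0 = background.

    Penalizes divergence of predicted motion within each rigid object."""
    loss = motion_pred.new_zeros(())
    count = 0
    for b in range(instance_mask.shape[0]):
        for obj_id in instance_mask[b].unique():
            if obj_id == 0:  # skip background cells
                continue
            cells = motion_pred[b, :, :, instance_mask[b] == obj_id]   # (N, 2, K) cells of one object
            mean = cells.mean(dim=-1, keepdim=True)
            loss = loss + (cells - mean).abs().mean()
            count += 1
    return loss / max(count, 1)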

 

Total loss: the classification, motion-prediction, and state-estimation losses are combined, with the spatio-temporal consistency losses added as regularizers.
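Written out, under the assumption that the consistency terms enter as weighted regularizers (the weights λ are hyperparameters not given in this summary), the total loss takes a form like:

L_total = L_cls + L_motion + L_state + λ_s · L_spatial + λ_fg · L_fg-temporal + λ_bg · L_bg-temporal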

 

4. Experiment
