Paper speed reading – CenterPoint (Center-based 3D Object Detection and Tracking)

References:
1. CenterPoint paper
2. CenterPoint code
3. CenterPoint paper notes
4. CVPR 2021



1. Summary

Three-dimensional objects in point clouds are usually represented as 3D boxes. This representation mimics image-based 2D bounding-box detection, but objects in the 3D world do not follow any particular orientation, so box-based detectors struggle to enumerate all orientations or to fit axis-aligned bounding boxes to rotated objects. The CenterPoint framework first detects object centers with a keypoint detector and regresses the remaining attributes: 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. With this representation, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective, achieving state-of-the-art results on the nuScenes and Waymo Open datasets.

2. Introduction

CenterPoint uses a lidar-based backbone network such as VoxelNet or PointPillars to encode the input point cloud. The center-based representation has several advantages. Unlike a bounding box, a point has no notion of rotation; it simplifies the downstream tracking task; and it makes it easier to design an effective second-stage refinement module, which extracts a point feature at the 3D center of each face of the estimated 3D bounding box. This recovers local geometric information lost to striding and the limited receptive field, and is more effective than previous schemes.

Related work:
2D object detection: predicts bounding boxes aligned with the image axes. The R-CNN family finds category-agnostic box candidates, while YOLO, SSD, and RetinaNet find category-specific box candidates. CenterNet and CenterTrack detect the object center point directly, without candidate boxes.
3D object detection: predicts 3D bounding boxes. Vote3Deep, VoxelNet, SECOND, PIXOR, PointPillars, MVF, Pillar-od, VoteNet.
Two-stage 3D object detection: RoIPool, RoIAlign.
3D object tracking: 3D Kalman filters, CenterTrack.

Preliminary work:
2D CenterNet produces a w×h×k heatmap, where k is the number of categories; a local maximum over the 8-neighborhood marks the center of a detected object. CenterNet first predicts the object centers of each category, then regresses the object size (width and height) as a two-channel w×h×2 map, and additionally regresses a local offset (local_offset) to recover the sub-pixel center position.
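To make the peak-extraction step concrete, here is a minimal PyTorch sketch of CenterNet-style decoding, where a cell counts as a center only if it is the maximum of its 8-neighborhood (implemented with a 3×3 max-pool); the shapes and the top-k cutoff are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap: torch.Tensor, k: int = 100):
    """heatmap: (K, H, W) per-class center scores in [0, 1] (assumed layout)."""
    # A 3x3 max-pool keeps a score only where it equals its local maximum,
    # i.e. where it dominates its 8-neighborhood.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (pooled == heatmap).float()
    scores, idx = peaks.flatten().topk(k)
    hw = heatmap.shape[1] * heatmap.shape[2]
    cls, ys, xs = idx // hw, (idx % hw) // heatmap.shape[2], idx % heatmap.shape[2]
    return scores, cls, ys, xs

scores, cls, ys, xs = extract_peaks(torch.rand(3, 128, 128))
```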
**3D detection.** The usual approach: a backbone network such as VoxelNet or PointPillars processes one frame of point cloud into a map-view feature map (W×L×F), to which a detection head is attached.
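A rough shape walk-through of that pipeline, with dummy tensors standing in for the encoder and backbone (all names and sizes here are illustrative assumptions):

```python
import torch

points = torch.rand(20000, 4)              # one lidar sweep: (x, y, z, intensity)
# A VoxelNet/PointPillars-style encoder plus a 2D CNN backbone would turn
# the sweep into a map-view (bird's-eye-view) feature map of shape (F, W, L);
# a random tensor stands in for that output here.
bev = torch.rand(64, 512, 512)
# A detection head is then a small conv net over this 2D map, e.g. one
# producing a k-class center heatmap.
head = torch.nn.Conv2d(64, 3, kernel_size=3, padding=1)
heatmap = head(bev.unsqueeze(0)).sigmoid()  # (1, k=3, W, L)
```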

3. Network and method

The overall structure of the network is shown in the figure below. CenterPoint's first stage predicts a class-specific heatmap, object size, sub-voxel location refinement, rotation, and velocity; all outputs are dense predictions.
[Figure: overall two-stage CenterPoint architecture]

3.1 First stage: centers and 3D boxes
1. Center heatmap head.
The goal is to produce a heatmap peak at the center of every detected object. During training, the target peaks are generated by projecting the 3D centers of the annotated bounding boxes into the map view and rendering them with a 2D Gaussian; the head is trained with a focal loss. Directly reusing the standard CenterNet targets would yield a very sparse supervision signal, with most locations treated as background, so the authors amplify the supervision signal by enlarging (dilating) the Gaussian peak rendered at each ground-truth (gt) object center.
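A minimal NumPy sketch of rendering one ground-truth center as a Gaussian peak, in the style of the CenterNet target-drawing code; the sigma = radius/3 choice and the radius floor below are assumptions standing in for the paper's exact dilation rule.

```python
import numpy as np

def draw_gaussian(heatmap: np.ndarray, cx: int, cy: int, radius: int):
    """heatmap: (H, W) target map for one class; (cx, cy): center cell."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(x * x + y * y) / (2 * (radius / 3) ** 2))  # sigma = radius/3
    h, w = heatmap.shape
    l, r = min(cx, radius), min(w - cx, radius + 1)   # clip the patch at borders
    t, b = min(cy, radius), min(h - cy, radius + 1)
    patch = heatmap[cy - t:cy + b, cx - l:cx + r]
    # Take the element-wise max so overlapping objects keep the larger peak.
    np.maximum(patch, g[radius - t:radius + b, radius - l:radius + r], out=patch)

hm = np.zeros((128, 128), dtype=np.float32)
draw_gaussian(hm, cx=40, cy=60, radius=max(2, 4))     # radius floor = dilation
```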

2. Regression heads.
Several other properties also need to be regressed (a decoding sketch follows the list):

  • Sub-voxel location refinement: reduces the quantization error introduced by voxelization and by striding in the backbone network; 2 channels.
  • Height above ground: adds the height information needed to localize the object in 3D; 1 channel.
  • 3D size: width, length, and height; 3 channels.
  • Yaw angle: regressed as (sine, cosine); 2 channels.
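A hedged sketch of how one box could be decoded from these dense maps at a detected peak; the head names ("reg", "height", "dim", "rot"), the log-scale size encoding, and the voxel grid parameters are assumptions, not CenterPoint's exact configuration.

```python
import torch

def decode_box(heads: dict, xs: int, ys: int,
               voxel_size: float = 0.2, pc_min: float = -51.2):
    """heads: dense maps, each (C, H, W); (xs, ys): one heatmap peak cell."""
    off = heads["reg"][:, ys, xs]                 # (2,) sub-voxel refinement
    x = (xs + off[0]) * voxel_size + pc_min       # cell index -> meters
    y = (ys + off[1]) * voxel_size + pc_min
    z = heads["height"][0, ys, xs]                # height above ground
    w, l, h = heads["dim"][:, ys, xs].exp()       # 3D size, log-scale assumed
    yaw = torch.atan2(heads["rot"][0, ys, xs],    # rotation from (sin, cos)
                      heads["rot"][1, ys, xs])
    return torch.stack([x, y, z, w, l, h, yaw])

heads = {k: torch.rand(c, 128, 128)
         for k, c in [("reg", 2), ("height", 1), ("dim", 3), ("rot", 2)]}
box = decode_box(heads, xs=40, ys=60)             # (x, y, z, w, l, h, yaw)
```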

3. Velocity head and tracking.
Velocity estimation needs two frames of data, the current and the previous one; the head predicts the difference in object position between the two frames and is supervised with an L1 loss. At inference time this offset links detections across consecutive frames: each object center in the current frame is projected back to the previous frame by applying its negative velocity estimate, and the projected centers are matched to existing tracks by greedy closest-distance matching.
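A minimal sketch of this greedy closest-distance matching under assumed array layouts; the real implementation visits detections in confidence order and manages track lifecycles, which this omits.

```python
import numpy as np

def greedy_match(curr_xy, curr_vel, prev_xy, dt, max_dist=2.0):
    """curr_xy: (N, 2) current centers; curr_vel: (N, 2) m/s; prev_xy: (M, 2)."""
    matches = -np.ones(len(curr_xy), dtype=int)   # -1 = start a new track
    if len(prev_xy) == 0:
        return matches
    proj = curr_xy - curr_vel * dt                # center one frame back in time
    used = set()
    for i, p in enumerate(proj):                  # real code: confidence order
        d = np.linalg.norm(prev_xy - p, axis=1)
        d[list(used)] = np.inf                    # each track matched at most once
        j = int(d.argmin())
        if d[j] < max_dist:
            matches[i] = j
            used.add(j)
    return matches

m = greedy_match(np.random.rand(5, 2) * 50, np.random.rand(5, 2),
                 np.random.rand(4, 2) * 50, dt=0.1)
```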

3.2 Second stage.
Additional point features are extracted from the output of the backbone network: one point feature at the center of each face of the predicted 3D bounding box. Since the centers of the top and bottom faces both project to the object center in the map view, this leaves the four outward face centers plus the object center, five points in total. The feature at each point is obtained by bilinear interpolation from the map-view feature map.
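A sketch of that feature-gathering step, assuming a (1, C, H, W) map-view feature map centered on the ego vehicle and using torch's grid_sample for the bilinear interpolation; the coordinate normalization is an assumption about the BEV layout.

```python
import math
import torch
import torch.nn.functional as F

def face_center_features(feat: torch.Tensor, box, bev_half_extent: float):
    """feat: (1, C, H, W) map-view features; box: (x, y, z, w, l, h, yaw)."""
    x, y, _, w, l, _, yaw = box
    c, s = math.cos(yaw), math.sin(yaw)
    # Object center plus the four side-face centers, in the box's own frame.
    pts = torch.tensor([[0.0, 0.0], [l / 2, 0.0], [-l / 2, 0.0],
                        [0.0, w / 2], [0.0, -w / 2]])
    rot = torch.tensor([[c, -s], [s, c]])
    pts = pts @ rot.T + torch.tensor([x, y])      # rotate/translate to world, (5, 2)
    # Assumes the BEV map spans [-bev_half_extent, bev_half_extent] in x and y,
    # so dividing maps the points into grid_sample's [-1, 1] range.
    grid = (pts / bev_half_extent).view(1, 1, 5, 2)
    return F.grid_sample(feat, grid, align_corners=False).view(feat.shape[1], 5)

feats = face_center_features(torch.rand(1, 64, 128, 128),
                             (3.0, -7.5, 0.0, 1.8, 4.5, 1.6, 0.3), 51.2)
```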

4. Experiments and Results

Evaluation metrics: mAP, mAPH, NDS, PKL, MOTA, MOTP
Waymo
[Figure: detection and tracking results on the Waymo Open Dataset]
nuScenes
[Figure: detection and tracking results on nuScenes]
