CenterPoint: A 3D Object Detection Algorithm for LiDAR Point Clouds
This article was first published on the WeChat public account [DeepDriving]; you are welcome to follow it.
Foreword
CenterPoint is a 3D LiDAR point-cloud object detection and tracking framework proposed in the CVPR 2021 paper "Center-based 3D Object Detection and Tracking". Unlike previous algorithms, it does not rely on 3D bounding boxes but instead uses key points to represent, detect, and track targets. In the first detection stage, a key-point detector locates the center of each target, and the center-point features are used to regress target attributes such as 3D size, orientation, and velocity; in the second stage, these attributes are refined using additional point features of the target. Object tracking is reduced to a simple closest-point matching process. The resulting detection and tracking algorithm is simple and efficient, and achieves SOTA performance on both the nuScenes and Waymo datasets.
Paper link: https://arxiv.org/pdf/2006.11275.pdf
Code link: https://github.com/tianweiy/CenterPoint
This article will briefly explain the CenterPoint algorithm.
Preliminary knowledge
3D object detection
Suppose an unordered point cloud is represented as $\mathcal{P} = \left\{ (x,y,z,r)_{i} \right\}$, where $(x,y,z)$ denotes the 3D position and $r$ the reflection intensity. The goal of 3D object detection is to predict from the point cloud a set of 3D bounding boxes $\mathcal{B} = \left\{ b_{k} \right\}$ representing the targets. Each bounding box can be written as $b = (u,v,d,w,l,h,\alpha)$, where $(u,v,d)$ is the position of the target center relative to the ground plane, $(w,l,h)$ is the 3D size of the target, and $\alpha$ is its orientation (yaw) angle.
Current mainstream 3D object detection algorithms generally divide the unordered point cloud into regular cells (voxels or pillars) with a feature encoder, use a point-based network (PointNet/PointNet++) to extract the features of all points in each cell, and then retain the most important features through a pooling operation. These features are fed into a backbone network (VoxelNet or PointPillars) to generate a feature map $\mathbf{M} \in \mathbb{R}^{W\times L\times F}$, where $W$, $L$, and $F$ are the width, length, and number of channels respectively. From this feature map, a single-stage or two-stage detection head generates the detection results. Previous anchor-based algorithms (for example PointPillars) regress target positions relative to predefined anchors. However, 3D objects come in many sizes and orientations, so a large number of anchors must be defined for training and inference, which adds considerable computational burden; in addition, anchor-based methods cannot accurately regress the size and orientation of 3D targets.
If you are not familiar with the 3D object detection pipeline, you can first take a look at PointPillars. The structure of that algorithm is shown in the figure below:
For an interpretation of the algorithm, you can refer to this earlier article of mine:
Laser Point Cloud 3D Object Detection Algorithm: PointPillars
The CenterNet 2D object detection algorithm
CenterPoint continues CenterNet's idea of center-point-based detection. CenterNet treats 2D object detection as a standard key-point estimation problem: a target is represented as a single point at the center of its bounding box, and other attributes such as size, depth, orientation, and pose are regressed directly from the image features at that center point. The model feeds the image through a fully convolutional network to generate a heat map; the peaks of the heat map are the target centers, and the image features at each peak position are used to predict the width and height of the target's bounding box.
If you are not familiar with the CenterNet detection algorithm, you can take a look at this earlier article of mine; I will not introduce it in detail here:
Classic Object Detection Algorithms: CenterNet
CenterPoint model
The figure below shows the framework of the CenterPoint algorithm. First, a bird's-eye-view feature map is extracted from the point cloud by a standard 3D backbone network such as VoxelNet or PointPillars; then a 2D convolutional detection head finds the target centers and uses the center features to regress the 3D bounding-box attributes.
Center point heat map
CenterPoint generates the center-point heat map in basically the same way as CenterNet: the regression branch produces a $K$-channel heat map $\hat{Y}$, one channel per class. During training, a 2D Gaussian is placed at the projection of each annotated target's 3D center onto the bird's-eye view, and the loss function is a Focal Loss. Compared with images, targets in the bird's-eye view are much sparser and each target occupies only a very small area; there is also no near-large/far-small perspective effect (in an image, a nearby object may occupy more than half the frame). To address this, the authors enlarge the Gaussian peak rendered at each target center to strengthen the positive supervision of the heat map $Y$, setting the radius of the Gaussian to

$$\delta = \max(f(wl), \tau)$$

where $\tau = 2$ is the minimum allowed radius and $f$ is the radius calculation function defined in CornerNet.
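To make the heat-map construction concrete, here is a minimal NumPy sketch of rendering an enlarged Gaussian peak at a target center. Note the radius function here is a simplified stand-in (a fraction of the BEV box diagonal), not the exact CornerNet formula the paper uses; the minimum radius τ = 2 matches the text.

```python
import numpy as np

def gaussian_radius(w, l, min_radius=2):
    """Radius of the Gaussian peak for a target of BEV width w and length l.
    Simplified stand-in for CornerNet's radius function: a fraction of the
    box diagonal, clipped below by the minimum allowed radius tau = 2."""
    return max(int(0.1 * np.sqrt(w * w + l * l)), min_radius)

def draw_gaussian(heatmap, cx, cy, radius):
    """Splat a 2D Gaussian peak centered at cell (cx, cy) onto the heat map,
    keeping the element-wise maximum so overlapping targets do not erase
    each other's peaks."""
    sigma = radius / 3.0
    H, W = heatmap.shape
    for y in range(max(0, cy - radius), min(H, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(W, cx + radius + 1)):
            g = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            heatmap[y, x] = max(heatmap[y, x], g)
    return heatmap

heatmap = np.zeros((128, 128), dtype=np.float32)
draw_gaussian(heatmap, cx=64, cy=64, radius=gaussian_radius(w=20, l=40))
print(heatmap[64, 64])  # peak value is 1.0 at the target center
```

One heat map like this is built per class channel; the Focal Loss then compares the predicted $\hat{Y}$ against it.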
Target attribute regression
In addition to the center point, information such as size and orientation is also required to form a complete 3D bounding box for an object. CenterPoint regresses the following target attributes from the center features:

- Position refinement $o \in \mathbb{R}^2$: reduces the quantization error caused by voxelization and by the stride of the backbone network.
- Height above ground $h_{g} \in \mathbb{R}$: helps localize objects in 3D space and restores the height information lost by projecting to the bird's-eye view.
- 3D size $s \in \mathbb{R}^3$: the length, width, and height of the target, expressed with a logarithmic function during regression because real objects vary widely in size.
- Orientation angle $(\sin(\alpha), \cos(\alpha)) \in \mathbb{R}^2$: the sine and cosine of the yaw angle are used as a continuous regression target.
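As a rough illustration of how these regression outputs combine into a 3D box, here is a minimal sketch of decoding the attributes sampled at one heat-map peak. The grid-to-metric scaling (`voxel_size`, `stride`) is an assumed illustrative configuration, not the paper's exact settings.

```python
import numpy as np

def decode_box(peak_xy, offset, height, log_size, sin_cos,
               voxel_size=0.2, stride=4):
    """Assemble a 3D box (x, y, z, w, l, h, yaw) from the per-attribute
    regression outputs at one heat-map peak. voxel_size and stride are
    illustrative values for mapping grid cells to metric coordinates."""
    px, py = peak_xy
    # position refinement o recovers the sub-cell center lost to quantization
    x = (px + offset[0]) * voxel_size * stride
    y = (py + offset[1]) * voxel_size * stride
    z = height                        # height above ground, h_g
    w, l, h = np.exp(log_size)        # size is regressed in log scale
    yaw = np.arctan2(sin_cos[0], sin_cos[1])  # alpha from (sin, cos)
    return x, y, z, w, l, h, yaw

box = decode_box((40, 25), offset=(0.3, -0.1), height=0.9,
                 log_size=np.log([1.8, 4.5, 1.6]), sin_cos=(0.0, 1.0))
print(box)  # -> (x, y, z, w, l, h, yaw) in metric coordinates
```

Regressing $(\sin\alpha, \cos\alpha)$ rather than $\alpha$ itself avoids the discontinuity at $\pm\pi$, which is why `arctan2` recovers the angle here.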
Velocity Estimation and Object Tracking
To track targets, CenterPoint adds an extra regression branch that predicts a two-dimensional velocity $\mathbf{v} \in \mathbb{R}^2$ for each detected object. Unlike the other attributes, velocity estimation takes the feature maps of both the previous frame and the current frame as input; its goal is to predict the position offset of the target between the two frames. Like the other target attributes, velocity is supervised with an $L_{1}$ loss.
At inference time, the center of each target in the current frame is mapped back to the previous frame using the negative of its estimated velocity, and then matched to the tracked targets by closest-distance matching. As in the SORT tracking algorithm, a target that fails to match for 3 consecutive frames is deleted. The figure below shows the pseudocode of the tracking algorithm; the whole process is quite simple.
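A minimal sketch of one frame of this tracking step is shown below. It uses greedy nearest-neighbor matching as a simplification of the paper's pseudocode; the distance threshold is an assumed value, while the 3-frame deletion rule matches the text.

```python
import numpy as np

def track_step(tracks, detections, dist_thresh=2.0, max_age=3):
    """One frame of closest-point tracking (simplified sketch).
    Each detection carries a BEV center and a predicted 2D velocity;
    projecting the center back by -velocity and greedily matching to the
    nearest unmatched track mimics CenterPoint's tracking pseudocode."""
    for det in detections:
        # map the current center back to the previous frame via -velocity
        prev_pos = np.asarray(det["center"]) - np.asarray(det["velocity"])
        best, best_d = None, dist_thresh
        for tr in tracks:
            if tr.get("matched"):
                continue
            d = np.linalg.norm(prev_pos - np.asarray(tr["center"]))
            if d < best_d:
                best, best_d = tr, d
        if best is not None:
            best.update(center=det["center"], age=0, matched=True)
        else:  # unmatched detection starts a new track
            tracks.append({"center": det["center"], "age": 0, "matched": True})
    # age unmatched tracks and delete those unmatched for max_age frames
    alive = []
    for tr in tracks:
        if not tr.pop("matched", False):
            tr["age"] += 1
        if tr["age"] < max_age:
            alive.append(tr)
    return alive

tracks = [{"center": (10.0, 5.0), "age": 0}]
dets = [{"center": (11.0, 5.0), "velocity": (1.0, 0.0)}]
tracks = track_step(tracks, dets)
print(len(tracks))  # the detection matched the existing track -> 1
```

Because the velocity head already provides the frame-to-frame offset, no Kalman filter or motion model is needed, which is what makes the tracker so lightweight compared with SORT.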
For the principle of the SORT tracking algorithm, you can refer to this earlier article of mine:
SORT: A Multi-Object Tracking Algorithm
Two-stage CenterPoint
The one-stage CenterPoint introduced above detects targets with a center-based head and regresses their attribute information; this head is very simple and performs better than anchor-based detection. However, since all of a target's attributes are inferred from the features at its center alone, there may not be enough feature information to localize the target accurately. The authors therefore designed a lightweight point-feature extraction network that refines the target attributes in a second stage.
In this stage, point features are extracted from the center of each face of the 3D bounding box predicted in the first stage. Since the box center and the centers of the top and bottom faces project to the same point in the bird's-eye view, only the box center and the centers of the four side faces are used. For each of these points, a feature is extracted from the backbone's output feature map $\mathbf{M}$ by bilinear interpolation; the extracted features are then concatenated and fed into an MLP to refine the bounding box predicted in the first stage.
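The feature extraction step can be sketched as follows: a plain bilinear interpolation on the BEV feature map, sampled at five illustrative (hypothetical) point locations and concatenated into the vector that would feed the refinement MLP.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Bilinear interpolation of an (H, W, F) BEV feature map at a
    continuous location (x, y) in grid coordinates."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feature_map[y0, x0] +
            wx * (1 - wy) * feature_map[y0, x1] +
            (1 - wx) * wy * feature_map[y1, x0] +
            wx * wy * feature_map[y1, x1])

# stack the features of the box center and the four side-face centers
# (the point coordinates here are made up for illustration)
M = np.random.rand(128, 128, 64).astype(np.float32)
points = [(64.2, 64.7), (60.1, 64.7), (68.3, 64.7), (64.2, 62.5), (64.2, 66.9)]
point_features = np.concatenate([bilinear_sample(M, x, y) for x, y in points])
print(point_features.shape)  # (320,) -> fed into the refinement MLP
```

In a real implementation this sampling is done in a batched, vectorized way (e.g. with a grid-sampling op), but the arithmetic per point is exactly this.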
This stage also predicts a confidence score, whose training target is computed as

$$I = \min(1, \max(0, 2 \times IoU_{t} - 0.5))$$

where $IoU_{t}$ is the 3D IoU between the $t$-th candidate box and the ground truth. The loss function is a binary cross-entropy:

$$L_{score} = -I_{t}\log(\hat{I}_{t}) - (1-I_{t})\log(1-\hat{I}_{t})$$

where $\hat{I}_{t}$ is the predicted confidence score. During inference, the final confidence score is computed as

$$\hat{Q}_{t}=\sqrt{\hat{Y}_{t}*\hat{I}_{t}}$$

where $\hat{Y}_{t}=\max_{0\le k \le K}\hat{Y}_{p,k}$ and $\hat{I}_{t}$ are the confidence scores predicted for target $t$ by the first and second stages respectively.
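The three formulas above translate directly into a few lines of arithmetic; this small sketch shows the training target and the geometric-mean score fusion:

```python
import numpy as np

def iou_target(iou):
    """Training target I = min(1, max(0, 2*IoU - 0.5)) for the
    second-stage confidence branch: boxes with IoU <= 0.25 get 0,
    boxes with IoU >= 0.75 get 1, with a linear ramp in between."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

def final_score(heatmap_score, iou_score):
    """Fuse the first-stage heat-map score Y_hat and the second-stage
    confidence I_hat into Q_hat via their geometric mean."""
    return np.sqrt(heatmap_score * iou_score)

print(iou_target(0.25))         # 0.0 -> low-IoU boxes get a zero target
print(iou_target(0.75))         # 1.0
print(final_score(0.81, 0.49))  # sqrt(0.81 * 0.49) = 0.63
```

The geometric mean penalizes boxes that score well in only one of the two stages, so a detection must both sit on a strong heat-map peak and be well-aligned geometrically to keep a high final score.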
Summary
CenterPoint can be seen as an extension of CenterNet to the 3D object detection task, continuing to solve detection in a simple and elegant way. The proposed center-based detection head can directly replace the detection heads of previous anchor-based algorithms such as VoxelNet and PointPillars, greatly simplifying training and inference while achieving better detection results.
References
- Center-based 3D Object Detection and Tracking