CenterPoint: A LiDAR Point Cloud 3D Object Detection Algorithm


This article was first published on the WeChat official account [DeepDriving]; welcome to follow.

Foreword

CenterPoint is a 3D object detection and tracking framework for LiDAR point clouds, proposed in the CVPR 2021 paper "Center-based 3D Object Detection and Tracking". Unlike previous algorithms, it does not rely on 3D bounding boxes directly but instead represents, detects, and tracks 3D objects as key points. In the first stage of detection, a key-point detector finds the center point of each object, and the center-point features are used to regress attributes such as 3D size, orientation, and velocity; in the second stage, these attributes are refined using additional point features on the object. Object tracking is reduced to a simple closest-point matching process. CenterPoint is simple and efficient, and achieves state-of-the-art (SOTA) performance on both the nuScenes and Waymo datasets.

Paper link: https://arxiv.org/pdf/2006.11275.pdf

Code link: https://github.com/tianweiy/CenterPoint

This article will briefly explain the CenterPoint algorithm.

Preliminary knowledge

3D object detection

Suppose an unordered point cloud is represented as $\mathcal{P} = \{(x,y,z,r)_i\}$, where $(x,y,z)$ denotes the 3D position and $r$ the reflection intensity. The goal of 3D object detection is to predict from the point cloud a set of 3D bounding boxes $\mathcal{B} = \{b_k\}$, where each box is expressed as $b = (u,v,d,w,l,h,\alpha)$: $(u,v,d)$ is the position of the object's center relative to the ground plane, $(w,l,h)$ is the 3D size, and $\alpha$ is the orientation (yaw) angle.

Current mainstream 3D object detection algorithms generally divide the unordered point cloud into regular cells (voxels or pillars) with a feature encoder, use a point-based network (PointNet/PointNet++) to extract features from the points in each cell, and keep the most important features through a pooling operation. These features are then fed into a backbone network (VoxelNet or PointPillars) to generate a feature map $\mathbf{M} \in \mathbb{R}^{W \times L \times F}$, where $W$, $L$, and $F$ denote the width, length, and number of channels respectively. Based on this feature map, a single-stage or two-stage detection head generates the detection results. Previous anchor-based algorithms (for example PointPillars) regress object positions from predefined anchors; since 3D objects come in many sizes and orientations, a large number of anchors must be defined for training and inference, which adds considerable computational burden. In addition, anchor-based methods cannot accurately regress the size and orientation of 3D objects.
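To make the cell-division step concrete, here is a minimal sketch (not the official implementation) of grouping an unordered point cloud into BEV pillars; the grid ranges and pillar size are illustrative values:

```python
# A minimal sketch of grouping an unordered point cloud into regular BEV
# "pillars" before feature extraction. Ranges and pillar size are made-up
# example values, not any paper's actual configuration.
import numpy as np

def group_points_into_pillars(points, x_range=(0.0, 69.12),
                              y_range=(-39.68, 39.68), pillar_size=0.16):
    """points: (N, 4) array of (x, y, z, r). Returns dict: pillar index -> point list."""
    pillars = {}
    for p in points:
        x, y = p[0], p[1]
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # drop points outside the detection range
        # integer cell coordinates in the BEV grid
        ix = int((x - x_range[0]) / pillar_size)
        iy = int((y - y_range[0]) / pillar_size)
        pillars.setdefault((ix, iy), []).append(p)
    return pillars

points = np.random.rand(1000, 4) * [60, 70, 3, 1] + [0, -35, -1, 0]
pillars = group_points_into_pillars(points)
print(f"{len(pillars)} non-empty pillars")
```

A point-based network would then encode each non-empty pillar's points into a fixed-length feature, which is scattered back onto the BEV grid for the backbone.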

If you are not familiar with the 3D object detection pipeline, you can first take a look at PointPillars. The structure of that algorithm is shown in the figure below:

[Figure: PointPillars algorithm overview]

For an interpretation of that algorithm, you can refer to this article I wrote before:

LiDAR Point Cloud 3D Object Detection Algorithm: PointPillars

Center-based 2D object detection: CenterNet

CenterPoint continues CenterNet's idea of center-based object detection. CenterNet treats 2D object detection as a standard key-point estimation problem: an object is represented as a single point at the center of its bounding box, and other attributes such as size, orientation, and pose are regressed directly from the image features at that center point. The model feeds an image into a fully convolutional network to generate a heat map; peaks in the heat map correspond to object centers, and the image features at each peak position are used to predict the width and height of the object's bounding box.
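As a sketch of how peaks are read out of such a heat map, the standard CenterNet-style trick uses a 3x3 max pooling as a cheap non-maximum suppression; the snippet below is a minimal illustration, with `k` and tensor shapes as assumptions:

```python
# A minimal sketch of CenterNet-style peak extraction: a 3x3 max pooling
# keeps only local maxima of the heat map, acting as NMS.
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """heatmap: (B, C, H, W) with values in [0, 1]. Returns top-k peaks."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()  # suppress non-maxima
    b, c, h, w = peaks.shape
    scores, indices = torch.topk(peaks.view(b, -1), k)
    classes = indices // (h * w)          # which of the C class channels
    within = indices % (h * w)
    ys, xs = within // w, within % w      # peak coordinates on the map
    return scores, classes, ys, xs
```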

If you are not familiar with CenterNet, you can take a look at this article I wrote before; I won't go into more detail here:

Classic Object Detection Algorithm: CenterNet

CenterPoint model

The figure below shows the framework of the CenterPoint algorithm. First, a bird's-eye-view feature map is extracted from the 3D point cloud by a standard backbone network such as VoxelNet or PointPillars; then a 2D CNN-based detection head finds the centers of objects and uses the center features to regress the full 3D bounding box attributes.

[Figure: CenterPoint model framework]

Center point heat map

CenterPoint generates the center-point heat map in basically the same way as CenterNet: this branch produces a $K$-channel heat map $\hat{Y}$, with one channel per class. During training, the 3D centers of the annotated ground-truth objects are projected into the bird's-eye view and rendered with a 2D Gaussian, and the loss function is the focal loss. Compared with images, objects in a bird's-eye view are much sparser and each object occupies only a small area; there is also no "near large, far small" effect caused by perspective projection (in an image, a nearby object may occupy more than half of the frame). To address this, the authors enlarge the Gaussian peak rendered at each object center to strengthen the positive supervision of the target heat map $Y$, setting the radius of the Gaussian to

$\delta = \max(f(wl), \tau)$

where $\tau = 2$ is the smallest allowed radius and $f$ is the radius function defined in CornerNet.
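Below is a minimal sketch of rendering such an enlarged Gaussian peak onto a BEV heat map. The `draw_gaussian` helper and the sigma-from-radius mapping are illustrative assumptions, not the paper's exact code:

```python
# A minimal sketch of splatting a 2D Gaussian at an object's BEV center.
# The radius would come from max(f(wl), tau) with tau = 2, as in the paper;
# here a pretend value is used.
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Splat exp(-(dx^2 + dy^2) / (2 * sigma^2)) around `center` onto `heatmap`."""
    x0, y0 = int(center[0]), int(center[1])
    sigma = (2 * radius + 1) / 6.0  # illustrative radius-to-sigma mapping
    h, w = heatmap.shape
    for y in range(max(0, y0 - radius), min(h, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(w, x0 + radius + 1)):
            g = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
            heatmap[y, x] = max(heatmap[y, x], g)  # keep the stronger peak on overlap
    return heatmap

bev = np.zeros((128, 128), dtype=np.float32)
radius = max(3, 2)  # pretend f(wl) returned 3 for some object; tau = 2
draw_gaussian(bev, center=(64, 40), radius=radius)
```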

Target attribute regression

In addition to the center point, information such as size and orientation is also required to form a complete 3D bounding box. CenterPoint regresses the following attributes from the center features (a sketch of such a multi-head design follows the list):

  • Position refinement $o \in \mathbb{R}^2$, used to reduce the quantization error caused by voxelization and by the stride of the backbone network.
  • Height above the ground $h_g \in \mathbb{R}$, which helps localize the object in 3D space and restores the height information lost by projecting to the bird's-eye view.
  • 3D size $s \in \mathbb{R}^3$: the length, width, and height of the object, regressed in logarithmic form because real objects vary widely in size.
  • Orientation angle $(\sin(\alpha), \cos(\alpha)) \in \mathbb{R}^2$: the sine and cosine of the yaw angle serve as a continuous regression target.
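A minimal PyTorch sketch of such a center-based head, with one small convolutional branch per attribute, might look like the following. The channel counts follow the list above, while the layer sizes are illustrative assumptions rather than CenterPoint's actual configuration:

```python
# A minimal sketch of a center-based detection head: one small conv branch
# per regressed attribute, all operating on the BEV feature map M.
import torch
import torch.nn as nn

class CenterHead(nn.Module):
    def __init__(self, in_channels=64, num_classes=3):
        super().__init__()
        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_channels, 1))
        self.heatmap = branch(num_classes)  # K-channel class heat map
        self.offset = branch(2)             # position refinement o
        self.height = branch(1)             # height above ground h_g
        self.size = branch(3)               # log(l), log(w), log(h)
        self.rot = branch(2)                # (sin(alpha), cos(alpha))

    def forward(self, bev_features):  # (B, C, W, L) BEV feature map M
        return {
            "heatmap": self.heatmap(bev_features).sigmoid(),
            "offset": self.offset(bev_features),
            "height": self.height(bev_features),
            "size": self.size(bev_features),
            "rot": self.rot(bev_features),
        }
```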

Velocity Estimation and Object Tracking

To track objects, CenterPoint adds an additional regression branch that predicts a two-dimensional velocity $\mathbf{v} \in \mathbb{R}^2$ for each detected object. Unlike the other attributes, velocity estimation requires the feature maps of both the previous frame and the current frame as input, and its goal is to predict the positional offset of each object between the two frames. Like the other attribute heads, the velocity branch is supervised with an $L_1$ loss.

At inference time, the center of each object in the current frame is mapped back to the previous frame using the negative of its estimated velocity, and the projected centers are matched to the tracked objects by closest-distance matching. As in the SORT tracking algorithm, a track is deleted if it fails to match for 3 consecutive frames. The figure below shows the pseudocode of the tracking algorithm; the whole process is quite simple.
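The pseudocode figure is not reproduced here, but a minimal Python sketch of this greedy closest-distance matching, under the assumption that detections are processed in descending confidence order and with a made-up distance threshold, could look like this:

```python
# A minimal sketch of velocity-based greedy matching: each current detection
# is projected back with its negative velocity and matched to the nearest
# unmatched track from the previous frame. Threshold and dt are assumptions.
import numpy as np

def match_tracks(detections, velocities, confidences, tracks, dt=0.1, max_dist=1.0):
    """detections: (N, 2) BEV centers; velocities: (N, 2); confidences: (N,);
    tracks: (M, 2) previous-frame centers. Returns list of (det_idx, track_idx)."""
    projected = detections - velocities * dt  # map centers back to previous frame
    matches, used = [], set()
    for i in np.argsort(-confidences):  # greedy: most confident detections first
        if len(tracks) == 0:
            break
        dists = np.linalg.norm(tracks - projected[i], axis=1)
        dists[list(used)] = np.inf  # skip already-matched tracks
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((int(i), j))
            used.add(j)
    return matches
```

A separate bookkeeping step would then spawn new tracks for unmatched detections and, as noted above, delete tracks that stay unmatched for 3 consecutive frames.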

For the principle of the SORT tracking algorithm, you can refer to this article I wrote before:

Multi-Object Tracking Algorithm: SORT

Two-stage CenterPoint

The one-stage CenterPoint algorithm introduced above detects objects with a center-based head and regresses their attribute information; this head is very simple and performs better than anchor-based detection. However, since all attributes are inferred from the features at the object's center, those features may not carry enough information to localize the object accurately. The authors therefore designed a lightweight point-feature extraction network that refines the predicted attributes in a second stage.

In this stage, point features are extracted at the center of each face of the 3D bounding box predicted in the first stage. Since the box center and the centers of the top and bottom faces all project to the same point in the bird's-eye view, only the box center and the centers of the four side faces are used. The feature of each point is extracted from the backbone feature map $\mathbf{M}$ by bilinear interpolation, and the extracted features are concatenated and fed into an MLP to refine the bounding box predicted in the first stage.
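A minimal sketch of this sampling step, assuming the five face centers are already projected to pixel coordinates on the BEV map and using `torch.nn.functional.grid_sample` for the bilinear interpolation (shapes and MLP sizes are illustrative):

```python
# A minimal sketch of second-stage feature extraction: bilinearly sample the
# BEV feature map M at the box center and four side-face centers, then
# concatenate and refine with a small MLP.
import torch
import torch.nn.functional as F

def sample_bev_features(feature_map, points):
    """feature_map: (1, C, H, W); points: (N, 2) pixel coords (x, y).
    Returns (N, C) bilinearly interpolated features."""
    _, c, h, w = feature_map.shape
    grid = points.clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0  # grid_sample expects [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = grid.view(1, -1, 1, 2)
    sampled = F.grid_sample(feature_map, grid, mode="bilinear", align_corners=True)
    return sampled.view(c, -1).t()  # (N, C)

feature_map = torch.randn(1, 64, 128, 128)
five_points = torch.tensor([[64.0, 64.0], [60.0, 64.0], [68.0, 64.0],
                            [64.0, 60.0], [64.0, 68.0]])  # center + 4 side faces
features = sample_bev_features(feature_map, five_points)  # (5, 64)
refine_mlp = torch.nn.Sequential(torch.nn.Linear(5 * 64, 128), torch.nn.ReLU(),
                                 torch.nn.Linear(128, 8))  # box residuals + score
refined = refine_mlp(features.reshape(1, -1))
```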

This stage also predicts a confidence score, which is calculated as follows during training:

$I = \min(1, \max(0, 2 \times IoU_t - 0.5))$

where $IoU_t$ is the 3D IoU between the $t$-th candidate box and the ground truth. The loss function is the binary cross-entropy loss:

$L_{score} = -I_t \log(\hat{I}_t) - (1 - I_t)\log(1 - \hat{I}_t)$

where $\hat{I}_t$ is the predicted confidence score. During inference, the final confidence score is calculated as:

$\hat{Q}_t = \sqrt{\hat{Y}_t * \hat{I}_t}$

where $\hat{Y}_t = \max_{0 \le k \le K} \hat{Y}_{p,k}$ and $\hat{I}_t$ are the confidence scores predicted for object $t$ by the first and second stages respectively.
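These two score computations are simple enough to check numerically; a minimal sketch:

```python
# A minimal sketch of the two score formulas above: the IoU-based training
# target for the second stage, and the fused confidence used at inference.
import math

def training_score_target(iou_3d):
    """I = min(1, max(0, 2 * IoU - 0.5)), clamped to [0, 1]."""
    return min(1.0, max(0.0, 2.0 * iou_3d - 0.5))

def fused_confidence(stage1_score, stage2_score):
    """Q = sqrt(Y * I), the geometric mean of the two stage scores."""
    return math.sqrt(stage1_score * stage2_score)

print(training_score_target(0.6))   # 0.7
print(fused_confidence(0.9, 0.7))   # ~0.794
```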

Summary

CenterPoint can be seen as a continuation of CenterNet into the 3D detection task, solving object detection in the same simple and elegant way. The proposed center-based detection head can directly replace the anchor-based heads in previous algorithms such as VoxelNet and PointPillars, greatly simplifying training and inference while delivering better detection results.

References

  • Center-based 3D Object Detection and Tracking
