PointPillars Paper Learning Summary

PointPillars Paper

  • PointPillars proposes a new point cloud encoding scheme and a 3D-to-2D conversion method. It performs object detection with 2D convolutions only, avoiding time-consuming 3D convolutions, and achieves a good balance between speed and accuracy. Its high speed, high precision, and ease of deployment have made it widely used in industry.

  • The core processing idea is to convert the 3D point cloud into a 2D pseudo-image and perform object detection on that pseudo-image.

  • The main steps are divided into three stages:
    (figure: overview of the three-stage PointPillars network)

    1. A feature encoder network that converts point clouds into sparse pseudo-images;
      • First, a grid of size H x W is laid out on the top-view (bird's-eye-view) plane; each point falling in the pillar above a grid cell is then encoded with 9 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p). The first three are the real coordinates of the point, r is the reflectance, the c subscripts denote the offset of the point from the arithmetic mean (center) of the points in its pillar, and the p subscripts denote the offset of the point from the center of its grid cell. Pillars with more than N points are randomly sampled down to N, and pillars with fewer than N points are zero-padded. This yields a tensor of shape (D, N, P), where D = 9, N is the (fixed) number of points per pillar, and P is the number of pillars.
      • Then the features are learned: a simplified PointNet maps the D dimensions of each point to C channels, giving (C, N, P); a max operation over the N points reduces this to (C, P). Finally, the C-dimensional pillar features are scattered back to their pillar locations on the H x W grid, producing a pseudo-image with height H, width W, and C channels (see the pillar-encoding sketch after these lists).
    2. 2D convolutional backbone network that processes the pseudo-image into a high-dimensional feature representation;
      • It contains two sub-networks (1. a top-down network, 2. a second upsampling network). To capture feature information at different scales, the top-down network is composed mainly of convolution, normalization, and non-linearity layers; the second network fuses the feature maps from the different scales, mainly via transposed convolutions (upsampling). Together they form a 2D convolutional backbone that extracts high-dimensional features from the pseudo-image produced by the first stage (a rough backbone sketch also follows these lists).
    3. Detection head (SSD): predict the object category and regress the 3D bounding box
      • An SSD-style detection head performs the 3D object detection. As in SSD, PointPillars matches prior boxes on a 2D grid, while the z coordinate and the box height are obtained by additional regression.
  • Data augmentation (a small augmentation sketch follows the lists below)

    1. A lookup table of ground-truth 3D boxes, together with the points falling inside those boxes, is created for all classes. Then, for each training sample, 15, 0, and 8 ground-truth samples of cars, pedestrians, and cyclists respectively are randomly selected and placed into the current point cloud (here "ground truth" simply means the annotated boxes and their associated points).
    2. Next, every ground-truth box is individually augmented: each box is rotated (with the angle drawn uniformly from [-\pi/20, \pi/20]) and translated, further enriching the training set.
    3. A random mirror flip along the x-axis is applied, followed by a global rotation and scaling. Finally, a global translation with x, y, z drawn from N(0, 0.2) is applied to simulate localization noise.
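
To make the stage-1 pillar encoding concrete, here is a minimal PyTorch sketch under some simplifying assumptions: the 9-dimensional point decorations and the pillar grid coordinates are assumed to be precomputed, the simplified PointNet is modeled as a single Linear + BatchNorm + ReLU layer, and the tensor names (`pillar_points`, `pillar_coords`) are illustrative rather than taken from any particular implementation.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet: decorated points (P, N, D) -> pillar features (P, C)."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, pillar_points):
        # pillar_points: (P, N, D) decorated points, zero-padded to N points per pillar
        x = self.linear(pillar_points)       # (P, N, C)
        x = self.bn(x.permute(0, 2, 1))      # BatchNorm over the channel dim -> (P, C, N)
        x = torch.relu(x)
        return x.max(dim=2).values           # max over the N points -> (P, C)

def scatter_to_pseudo_image(pillar_feats, pillar_coords, H, W):
    """Scatter (P, C) pillar features back to a (C, H, W) pseudo-image."""
    P, C = pillar_feats.shape
    canvas = pillar_feats.new_zeros(C, H * W)
    flat_idx = pillar_coords[:, 0] * W + pillar_coords[:, 1]   # row * W + col per pillar
    canvas[:, flat_idx] = pillar_feats.t()
    return canvas.view(C, H, W)

# Toy usage: 100 non-empty pillars, 32 points each, 9 features per point.
pfn = PillarFeatureNet(in_dim=9, out_dim=64)
points = torch.randn(100, 32, 9)
coords = torch.randint(0, 200, (100, 2))     # (row, col) of each pillar on a 200 x 200 grid
pseudo_image = scatter_to_pseudo_image(pfn(points), coords, H=200, W=200)
print(pseudo_image.shape)                    # torch.Size([64, 200, 200])
```

This mirrors the (D, N, P) -> (C, P) -> (C, H, W) data flow described above; the paper notes that the linear layer can also be formulated as a 1x1 convolution for efficiency.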
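
And a rough sketch of the stage-2 backbone idea: a top-down convolutional branch whose multi-scale outputs are upsampled with transposed convolutions and concatenated. The channel counts, strides, and number of layers here are illustrative and do not reproduce the exact configuration from the paper.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    # downsampling convolution followed by BatchNorm + ReLU (extra convs omitted for brevity)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def up_block(cin, cout, stride):
    # transposed convolution that brings a feature map back to a common resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, stride, stride=stride, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Backbone(nn.Module):
    def __init__(self, cin=64):
        super().__init__()
        self.down1 = conv_block(cin, 64, stride=2)    # 1/2 resolution
        self.down2 = conv_block(64, 128, stride=2)    # 1/4 resolution
        self.down3 = conv_block(128, 256, stride=2)   # 1/8 resolution
        self.up1 = up_block(64, 128, stride=1)
        self.up2 = up_block(128, 128, stride=2)
        self.up3 = up_block(256, 128, stride=4)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        # fuse the three scales at the 1/2 resolution by concatenating channels
        return torch.cat([self.up1(d1), self.up2(d2), self.up3(d3)], dim=1)

feats = Backbone()(torch.randn(1, 64, 200, 200))
print(feats.shape)   # torch.Size([1, 384, 100, 100])
```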
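
Finally, a small NumPy sketch of the global augmentation step (mirror flip, rotation, scaling, translation). The rotation and scaling ranges are placeholders; only the N(0, 0.2) translation follows the text above, and boxes are assumed to be rows of (x, y, z, w, l, h, theta).

```python
import numpy as np

def global_augment(points, boxes, rng=np.random.default_rng()):
    """Global augmentation: random x-axis mirror flip, rotation, scaling, translation."""
    # random mirror flip across the x-axis (y coordinates and heading change sign)
    if rng.random() < 0.5:
        points[:, 1] *= -1.0
        boxes[:, 1] *= -1.0
        boxes[:, 6] *= -1.0
    # global rotation about the z-axis (illustrative angle range)
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle
    # global scaling (illustrative range) and translation drawn from N(0, 0.2)
    scale = rng.uniform(0.95, 1.05)
    points[:, :3] *= scale
    boxes[:, :6] *= scale
    shift = rng.normal(0.0, 0.2, size=3)
    points[:, :3] += shift
    boxes[:, :3] += shift
    return points, boxes

pts = np.random.randn(1000, 4).astype(np.float32)   # x, y, z, reflectance
bxs = np.array([[10.0, 3.0, -1.0, 1.6, 3.9, 1.56, 0.0]], dtype=np.float32)
pts, bxs = global_augment(pts, bxs)
```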

Input and output

In the paper, each sample in the dataset consists of a lidar point cloud and camera images. PointPillars is trained only on the lidar point clouds, and is compared against fusion methods that use both lidar and images.
PointPillars is a novel encoder that uses PointNet to learn features from point clouds organized into vertical columns (pillars); the encoded features can then be used with any standard 2D convolutional detection architecture.

Input

LAS is a point cloud data format: a binary file organized according to a published specification, commonly used in autonomous driving and high-precision map making. A .las file contains lidar point cloud data records.
Its data record format is as follows:
(figure: LAS point data record format)
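
As a quick illustration of what reading such records looks like in practice, here is a minimal sketch using the third-party laspy package; the file name is a placeholder, and the exact attributes available depend on the point data record format of the particular file.

```python
import laspy
import numpy as np

# Read a LAS file (placeholder path) and stack the fields PointPillars cares about:
# the x, y, z coordinates plus intensity (used as the reflectance channel r).
las = laspy.read("sample_points.las")
points = np.vstack([las.x, las.y, las.z, las.intensity]).T.astype(np.float32)
print(points.shape)   # (num_points, 4)
```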

output

The output is point cloud data carrying the detection and regression information, still in LAS format.

Loss function (refer to the website for details)

The 3D box of each target is represented by a 7-dimensional vector (x, y, z, w, l, h, θ), where x, y, z are the coordinates of the center of the 3D box; w, l, h are its width, length, and height; and θ is its orientation (yaw) angle. The residuals between the ground truth and the anchors are defined as:
Δx = (x_gt - x_a) / d_a,  Δy = (y_gt - y_a) / d_a,  Δz = (z_gt - z_a) / h_a
Δw = log(w_gt / w_a),  Δl = log(l_gt / l_a),  Δh = log(h_gt / h_a)
Δθ = sin(θ_gt - θ_a)

Here the gt and a subscripts denote the ground-truth box and the anchor box respectively, and d_a is the square root of the sum of the squared width and squared length of the anchor box, i.e. d_a = sqrt(w_a^2 + l_a^2).
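
A small NumPy sketch of how these regression targets could be computed, with each box given as a 7-vector (x, y, z, w, l, h, θ); the sample values are made up for illustration.

```python
import numpy as np

def encode_box_residuals(gt, anchor):
    """Residuals between a ground-truth box and an anchor, both (x, y, z, w, l, h, theta)."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(wa ** 2 + la ** 2)   # diagonal of the anchor box in the ground plane
    return np.array([
        (xg - xa) / da,
        (yg - ya) / da,
        (zg - za) / ha,
        np.log(wg / wa),
        np.log(lg / la),
        np.log(hg / ha),
        np.sin(tg - ta),              # sine-encoded heading difference
    ])

residuals = encode_box_residuals(
    gt=np.array([10.2, 3.1, -1.0, 1.7, 4.1, 1.6, 0.3]),
    anchor=np.array([10.0, 3.0, -1.0, 1.6, 3.9, 1.56, 0.0]),
)
```
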
The localization loss uses the Smooth L1 function:
L_loc = Σ_{b ∈ (x, y, z, w, l, h, θ)} SmoothL1(Δb)

The Smooth L1 loss function is:
SmoothL1(x) = 0.5 x^2 if |x| < 1,  |x| - 0.5 otherwise

The Smooth L1 loss curve is shown in the figure below. Its purpose is to make the loss more robust: compared with the L2 loss, it is less sensitive to outliers (points far from the target), and the magnitude of its gradient is bounded, so training is less likely to diverge.
(figure: Smooth L1 loss curve)
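
A direct NumPy translation of the piecewise definition above (without the σ scaling factor that some implementations add); the residual values are placeholders.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

# Localization loss: Smooth L1 applied to each of the 7 box residuals, then summed.
residuals = np.array([0.02, -0.01, 0.0, 0.06, 0.05, 0.03, 0.3])
loc_loss = smooth_l1(residuals).sum()
```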

Like SECOND, PointPillars uses a softmax classification loss on the discretized heading to learn the object orientation; this loss is denoted L_dir.
For the object classification task, PointPillars uses the focal loss:
L_cls = -α_a (1 - p_a)^γ log(p_a)

where p_a is the class probability of the anchor box, α = 0.25, and γ = 2.
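
A minimal sketch of this focal loss for a single anchor, with the α = 0.25 and γ = 2 values quoted above; p_a is the predicted probability of the anchor's ground-truth class.

```python
import numpy as np

def focal_loss(p_a, alpha=0.25, gamma=2.0):
    """Focal loss for the anchor's ground-truth class probability p_a."""
    p_a = np.clip(p_a, 1e-7, 1.0)   # avoid log(0)
    return -alpha * (1.0 - p_a) ** gamma * np.log(p_a)

print(focal_loss(0.9), focal_loss(0.1))   # confident anchors are strongly down-weighted
```
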
The total loss function is as follows:
L = (1 / N_pos) (β_loc L_loc + β_cls L_cls + β_dir L_dir)

where N_pos is the number of positive anchor boxes, β_loc = 2, β_cls = 1, and β_dir = 0.2.
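
Putting the pieces together, a toy sketch of the weighted total loss with the constants above; the individual loss values and anchor count are placeholders rather than outputs of a real network.

```python
# Hypothetical per-batch loss values; in practice these come from the network outputs.
loc_loss, cls_loss, dir_loss = 1.8, 0.6, 0.4
n_pos = 24   # number of positive anchors in the batch

beta_loc, beta_cls, beta_dir = 2.0, 1.0, 0.2
total_loss = (beta_loc * loc_loss + beta_cls * cls_loss + beta_dir * dir_loss) / n_pos
```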

Result analysis

  • Note: in the KITTI benchmark, BEV denotes detection evaluated in the bird's-eye view, and 3D denotes detection evaluated with full 3D bounding boxes.
    (tables: PointPillars results on the KITTI test BEV and 3D detection benchmarks)

PointPillars has the highest inference speed (frame rate) of all the methods in the table, and it outperforms all published methods in terms of mean average precision (mAP). Compared with lidar-only methods, PointPillars achieves better results on all classes and difficulty strata except the easy car stratum. It also outperforms fusion-based methods on cars and cyclists.
(table: Average Orientation Similarity (AOS) results on the KITTI test set)

The table shows the Average Orientation Similarity (AOS) detection benchmark on the KITTI test set. SubCNN is the best performing image-only method, while AVOD-FPN, SECOND, and PointPillars are 3D object detectors that predict orientation. PointPillars predicts 3D oriented boxes, but orientation is not taken into account in the BEV and 3D metrics, so AOS is used: the predicted 3D boxes are projected into the image, matched against 2D detections, and the orientation of the matched detections is then evaluated. Compared with the only two other 3D detection methods that predict oriented boxes, PointPillars significantly outperforms them on AOS in all strata.

AOS explanation: Average Orientation Similarity (AOS). The metric is defined as:
AOS = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} max_{r̃ : r̃ ≥ r} s(r̃)

Here r is the recall of the object detector. At recall r, the orientation similarity s ∈ [0, 1] is defined as a normalized cosine similarity between all predicted detections and the ground truth:
s(r) = (1 / |D(r)|) Σ_{i ∈ D(r)} ((1 + cos Δθ_i) / 2) · δ_i

where D(r) is the set of all detections counted as positive at recall rate r, and Δθ_i is the difference between the predicted orientation of detection i and the ground truth. To penalize multiple detections matching the same ground-truth box, δ_i is set to 1 if detection i has been matched to a ground-truth box (with IoU of at least 50%), and 0 otherwise.
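
A rough NumPy sketch of this computation, assuming detections have already been sorted by decreasing confidence and matched to ground truth; the heading differences, matched flags, and ground-truth count in the example are placeholders.

```python
import numpy as np

def aos(delta_theta, matched, num_gt):
    """Average orientation similarity over the 11 KITTI recall points.

    delta_theta: heading difference (rad) between each detection and its matched GT
    matched:     1 if the detection is matched to a GT box (IoU >= 0.5), else 0
    num_gt:      total number of ground-truth boxes
    Detections are assumed to be sorted by decreasing confidence.
    """
    delta_theta, matched = np.asarray(delta_theta), np.asarray(matched, dtype=float)
    sim = (1.0 + np.cos(delta_theta)) / 2.0 * matched    # per-detection similarity, 0 if unmatched
    recall = np.cumsum(matched) / num_gt                  # recall reached after each detection
    s = np.cumsum(sim) / np.arange(1, len(sim) + 1)       # s(r): mean similarity over D(r)
    total = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        valid = s[recall >= r]
        total += valid.max() if valid.size else 0.0       # max_{r~ >= r} s(r~)
    return total / 11.0

print(aos(delta_theta=[0.05, 0.4, 0.0, 1.2], matched=[1, 1, 0, 1], num_gt=4))
```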

Possible improvements (existing problems)

  • It relies heavily on the upstream feature extraction. SSD is a one-stage method with only a single regression step (a two-stage method such as PointRCNN could continue to aggregate local features in a subsequent RCNN stage).
  • Given the slightly weaker pedestrian detection results, the suspicion is that pedestrians with complex spatial poses are not handled accurately when converting to the high-dimensional pseudo-image. A personal idea: pre-label pedestrians in several typical poses, cluster pedestrians with similar characteristics, and use those clusters to guide the conversion to the pseudo-image. Another option is to apply AVOD-FPN, which predicts pedestrians better, on the pseudo-image (how exactly to integrate the backbone with AVOD-FPN would need further thought).
    • Correction: in the point cloud, because of the particular physical characteristics of people, they may actually appear simpler and easier to find than objects such as cars, so the weakness may not be caused by spatial pose at all.

Running the code

  1. Source code download
  2. Dataset download

Origin blog.csdn.net/weixin_44077556/article/details/128974059