Comparison of VoxelNet and PointPillars object detection algorithms

Algorithm comparison

A Brief History of 3D Object Detection Development

Point cloud object detection has developed through VoxelNet, SECOND, PointPillars, and PV-RCNN.

In 2017, Apple proposed VoxelNet, the first paper to convert point clouds into voxels for 3D object detection.

Then in 2018, Yan Yan, a graduate student at Chongqing University, refined the VoxelNet code while interning at an autonomous driving company, proposed an efficient spconv implementation, added data augmentation, and so on; the results improved considerably, and the paper was named SECOND.

PointPillars, seeing how well SECOND worked, took a shortcut: it replaced the voxels in the SECOND code with vertical pillars so that the code runs faster, at the cost of some accuracy.

The follow-up is PV-RCNN, which refactors the SECOND code, combines the advantages of point-based methods, and turns it into a two-stage method that further improves on SECOND; it was published at CVPR 2020. At this point, voxel-based methods have reached a very high level of accuracy and are unlikely to improve by a large margin. Yan Yan of Chongqing University has played a crucial role in the development of this field.

Since PointPillars has no 3D convolution, its structure is simple and its accuracy is good; it is a practical 3D point cloud detection algorithm.

Using and understanding Netron

  1. Environment installation

pip install netron

  2. Run the .py file

import netron

if __name__ == '__main__':
    # assumes alexnet.onnx is in the current working directory
    address = ("localhost", 8081)
    netron.start("alexnet.onnx", address=address)
  3. Understanding AlexNet

AlexNet consists of 5 convolutional layers and 3 fully connected layers, 8 layers in total. The network structure is as follows:
insert image description here

The meaning of each layer is shown in the figure below:
insert image description here

In the Netron visualization you can see the 5 convolutional and 3 fully connected layers in detail.

Supplement:
In the Netron display, you can configure what information is shown via the menu. In the illustration I enabled all information, so many operations that would normally be hidden are displayed.
insert image description here

PointPillars model summary

insert image description here

  1. A feature encoder network that converts the point cloud into a sparse pseudo-image;
    • First, a grid of size H x W is laid out on the top-view (bird's-eye-view) plane. Each point falling in the pillar above a grid cell is encoded with 9 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p): the first three are the point's real coordinates, r is the reflectance, the c subscripts denote the offsets from the point to the arithmetic mean of the points in the pillar, and the p subscripts denote the offsets of the point from the pillar (grid-cell) center. Pillars with more than N points are sampled down to N, and pillars with fewer than N points are zero-padded, giving a tensor of shape (D, N, P) with D = 9, N the fixed number of points per pillar, and P = H * W.
    • A simplified PointNet then learns C channels from the D input dimensions, giving (C, N, P); a max operation over N yields (C, P), and since P = H * W this is reshaped into a pseudo-image with height H, width W, and C channels (a minimal sketch of this encoder follows the list).
  2. A 2D convolutional backbone that processes the pseudo-image into a high-level feature representation;
    • It contains two sub-networks: a top-down network and a second, upsampling network. To capture features at different scales, the top-down network consists mainly of convolution, normalization, and nonlinearity layers, while the second network fuses features from different scales, mainly via deconvolution (upsampling). Together they form a 2D CNN that extracts high-level features from the pseudo-image produced by the first stage.
  3. A detection head (SSD) that predicts the class and regresses the 3D bounding box
    • An SSD-style head performs the 3D detection. As in SSD, PointPillars matches targets on a 2D grid, while the Z coordinate and the height are obtained by regression.
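A minimal PyTorch sketch of the pillar feature encoder and the scatter step described above (layer sizes, grid size, and variable names are illustrative assumptions, not the reference implementation):

import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    # Simplified PointNet: per-point linear + BN + ReLU, then max over the points of each pillar.
    def __init__(self, in_dim=9, out_channels=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_channels, bias=False)
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, pillars):                  # pillars: (P, N, D) with D = 9
        x = self.linear(pillars)                 # (P, N, C)
        x = torch.relu(self.bn(x.transpose(1, 2)).transpose(1, 2))
        return x.max(dim=1).values               # (P, C): one feature vector per pillar

def scatter_to_pseudo_image(features, pillar_xy, H=496, W=432):
    # features: (P, C); pillar_xy: (P, 2) integer (x, y) grid coordinates of each pillar
    canvas = features.new_zeros(features.shape[1], H, W)
    canvas[:, pillar_xy[:, 1], pillar_xy[:, 0]] = features.t()
    return canvas                                # (C, H, W) pseudo-image

Here P indexes the pillars and pillar_xy records where each pillar sits in the H x W grid; the 2D backbone then runs on the resulting (C, H, W) pseudo-image.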

VoxelNet model visualization summary

The visualized VoxelNet model consists of 28 convolutions, 3 transposed convolutions, 26 BN layers, 26 ReLU layers, and a number of transpose/reshape operations (BN mainly mitigates the vanishing-gradient problem during propagation).

The input consists of voxel_coords of shape 40000 x 4 and features of shape 40000 x 32 x 10.

insert image description here

The output consists of hm of shape 1 x 200 x 380 x 8, rot of shape 1 x 200 x 380 x 2, dim of shape 1 x 200 x 380 x 3, height of shape 1 x 200 x 380 x 1, and reg of shape 1 x 200 x 380 x 2.
insert image description here

The main work of VoxelNet:

  1. Solved the problem of extracting features from the unordered point cloud data structure.
    insert image description here

Voxels are used: the 3D space is divided into small grids of fixed size, a PointNet network extracts a feature for each small grid, the extracted feature represents that grid, and the grid is placed back into 3D space. In this way the unordered point cloud is turned into high-dimensional feature data that is ordered in 3D space.
The figure below shows the relationship between voxels and the point cloud (picture from GitHub: KITTI_VIZ_3D).
insert image description here

The green cubes in the picture above can be regarded as voxels, and the black points are the point cloud produced by the lidar. Each point that falls inside a voxel is assigned to that voxel.

The overall model is roughly divided into three modules:

insert image description here

They are the Feature Learning Network, the Convolutional Middle Layers, and the Region Proposal Network (RPN).

1. Feature Learning Network:

1) Voxel division (partitioning 3D space into voxels):
For a 3D space with extents X, Y, and Z and a voxel size of Vx x Vy x Vz, the space is divided into X/Vx voxels along X, Y/Vy along Y, and Z/Vz along Z (assuming X, Y, and Z are multiples of Vx, Vy, and Vz).
Note: the ranges of X, Y, Z are 0 to 70.4, -40 to 40, and -3 to 1, and Vx, Vy, Vz are 0.2, 0.2, 0.4, all in meters. Because distant objects produce very sparse points, points outside the X, Y, Z ranges are discarded.

2) Grouping (assigning each point to its voxel) and Sampling (sampling the points inside each voxel)
After voxelizing the 3D space, every point in the point cloud must be assigned to the voxel it falls in. Because of the characteristics of the lidar itself and the way reflected beams are captured, the data is affected by distance, occlusion, relative pose between objects, and uneven sampling, so the resulting point cloud is very sparse and unevenly distributed over the 3D space (some voxels contain no points at all).
Moreover, a high-resolution lidar sweep contains more than 100,000 points. Processing all of them directly would cost a huge amount of computation and memory, and the large differences in point density would bias the detector.
Therefore, after Grouping, T points are randomly sampled in each non-empty voxel (voxels with fewer than T points are padded with zeros; voxels with more than T points are subsampled to T).
After Grouping, the data can be expressed as (N, T, C), where N is the number of non-empty voxels, T is the number of points per voxel, and C is the per-point feature dimension.
Note: in the paper T = 35, and Z' = Z/Vz = 10, X' = X/Vx = 352, Y' = Y/Vy = 400; Z', X', Y' denote how many voxels lie along each axis.
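A minimal NumPy sketch of the range filtering, voxel-index computation, and per-voxel sampling described above (the value T = 35 and the ranges follow the text; everything else is an illustrative assumption):

import numpy as np

def voxelize(points, T=35,
             pc_range=(0, -40, -3, 70.4, 40, 1),     # (x_min, y_min, z_min, x_max, y_max, z_max)
             voxel_size=(0.2, 0.2, 0.4)):             # (Vx, Vy, Vz)
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # Drop points outside the detection range.
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
            (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    points = points[mask]
    # Integer voxel index (ix, iy, iz) of every remaining point.
    coords = ((points[:, :3] - np.array([x_min, y_min, z_min])) /
              np.array(voxel_size)).astype(np.int32)
    voxels, voxel_coords = [], []
    unique_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    for i, c in enumerate(unique_coords):              # one group per non-empty voxel
        pts = points[inverse == i]
        if len(pts) > T:                               # subsample to T points
            pts = pts[np.random.choice(len(pts), T, replace=False)]
        buf = np.zeros((T, points.shape[1]), dtype=np.float32)
        buf[:len(pts)] = pts                           # zero-pad voxels with fewer than T points
        voxels.append(buf)
        voxel_coords.append(c)
    return np.stack(voxels), np.stack(voxel_coords)    # (N, T, C), (N, 3)

With the ranges and voxel sizes above this gives at most 352 x 400 x 10 voxels, of which only the non-empty ones appear in the output.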

3) Stacked Voxel Feature Encoding (VFE)
Earlier work encoded each non-empty voxel as a 6-dimensional statistical vector computed from the points inside it, either combining those statistics with local voxel information or using a binary occupancy encoding of the voxel; none of these achieved good results.
VoxelNet addresses this, as shown in the figure:
insert image description here

Each non-empty voxel is a point set V = {p_i = [x_i, y_i, z_i, r_i]^T ∈ R^4}, i = 1…t, where t ≤ 35 and p_i is the i-th point in the voxel: x_i, y_i, z_i are the point's X, Y, Z coordinates in 3D space, and r_i is the intensity of the light reflected back to the lidar (related to distance, incidence angle, object material, etc.; its value lies between 0 and 1).
First, the point features inside each voxel are augmented: the mean of all points in the voxel is computed, denoted (V_Cx, V_Cy, V_Cz), and then this mean is subtracted from each point's x_i, y_i, z_i to obtain the offset (xi_offset, yi_offset, zi_offset) of each point from the voxel centroid. The offsets are concatenated to the original data, giving the network input V = {p_i = [x_i, y_i, z_i, r_i, xi_offset, yi_offset, zi_offset]^T ∈ R^7}, i = 1…t.
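A small sketch of this feature augmentation, continuing from the (N, T, C) voxel tensor above (the handling of padded rows is an assumption):

import numpy as np

def augment_voxel_features(voxels):
    # voxels: (N, T, 4) with columns (x, y, z, r); all-zero rows are padding.
    mask = voxels[:, :, :3].any(axis=2, keepdims=True)          # (N, T, 1) real-point mask
    n_pts = np.clip(mask.sum(axis=1, keepdims=True), 1, None)   # points per voxel
    centroid = (voxels[:, :, :3] * mask).sum(axis=1, keepdims=True) / n_pts
    offsets = (voxels[:, :, :3] - centroid) * mask               # (x, y, z) minus the voxel centroid
    return np.concatenate([voxels, offsets], axis=2)             # (N, T, 7)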

Then, following PointNet, the points in each voxel are mapped to a higher-dimensional space by a fully connected layer (each fully connected layer consists of a linear layer, BN, and ReLU). The dimension changes from (N, 35, 7) to (N, 35, C1). An element-wise max-pool over the points then gives a locally aggregated feature for each voxel, which encodes the surface shape contained in the voxel (this is also mentioned in PointNet). The aggregated feature of each voxel is then used to strengthen the high-dimensional per-point features after the FC: it is concatenated to every point feature (point-wise concatenation), yielding (N, 35, 2*C1).

The authors call this feature-extraction module VFE (Voxel Feature Encoding); each VFE module therefore contains only one (C_in, C_out/2) weight matrix. The features a voxel outputs from a VFE module contain both the high-dimensional features of every point in the voxel and the aggregated local feature, so simply stacking VFE modules lets the per-point information interact with the locally aggregated information, and the final feature can describe the shape of the voxel.
insert image description here

Note: the FC parameters within each VFE module are shared. The original implementation stacks two VFE modules: the first raises the dimension from the input dimension of 7 to 32, and the second raises it from 32 to 128.

After the Stacked Voxel Feature Encoding, a (N, 35, 128) feature is obtained. To get the final feature of each voxel, one more FC is applied to fuse the point features and the aggregated features (its input and output dimensions are the same, so the tensor is still (N, 35, 128)), and then an element-wise max-pool extracts the most representative feature in each voxel, which is used to represent the voxel: (N, 35, 128) --> (N, 1, 128). A minimal sketch of a VFE module follows.
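A minimal PyTorch sketch of one VFE module and the stacked encoder described above (layer sizes follow the text; masking of the zero-padded points is omitted for brevity and would be needed in practice):

import torch
import torch.nn as nn

class VFE(nn.Module):
    # VFE(c_in, c_out): one (c_in, c_out/2) linear layer, then max-pool + concat.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.linear = nn.Linear(c_in, c_out // 2, bias=False)
        self.bn = nn.BatchNorm1d(c_out // 2)

    def forward(self, x):                          # x: (N, T, c_in)
        pw = torch.relu(self.bn(self.linear(x).transpose(1, 2)).transpose(1, 2))  # point-wise: (N, T, c_out/2)
        agg = pw.max(dim=1, keepdim=True).values   # locally aggregated feature: (N, 1, c_out/2)
        return torch.cat([pw, agg.expand_as(pw)], dim=2)                          # (N, T, c_out)

class StackedVFE(nn.Module):
    def __init__(self):
        super().__init__()
        self.vfe1 = VFE(7, 32)
        self.vfe2 = VFE(32, 128)
        self.fc = nn.Linear(128, 128)

    def forward(self, voxels):                     # voxels: (N, 35, 7)
        x = self.fc(self.vfe2(self.vfe1(voxels)))  # (N, 35, 128)
        return x.max(dim=1).values                 # (N, 128): one feature per voxel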

4) Sparse Tensor Representation (representing the sparse features after feature extraction)
The Stacked Voxel Feature Encoding only processes non-empty voxels, which correspond to a small fraction of the 3D space. The N non-empty voxel features therefore have to be mapped back to their positions in 3D space, represented as a sparse 4D tensor of shape (C, Z', Y', X') = (128, 10, 400, 352). This sparse representation greatly reduces memory consumption and the computation in back-propagation, and is an important part of VoxelNet's efficiency.
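A minimal sketch of this scatter-back step, assuming voxel_coords stores each non-empty voxel's integer index in (z, y, x) order as produced during grouping:

import torch

def to_dense(voxel_features, voxel_coords, shape=(128, 10, 400, 352)):
    # voxel_features: (N, 128); voxel_coords: (N, 3) integer (z, y, x) indices
    dense = voxel_features.new_zeros(shape)                       # (C, Z', Y', X'); empty voxels stay 0
    dense[:, voxel_coords[:, 0], voxel_coords[:, 1], voxel_coords[:, 2]] = voxel_features.t()
    return dense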

5) Efficient implementation (zero-padding of voxels)

As described above, if a voxel has fewer than T points it is padded with zeros up to 35 points, and if it has more than 35 points, 35 of them are randomly sampled. The original paper implements the Stacked Voxel Feature Encoding processing efficiently as shown in the figure below.
insert image description here

Because the number of points per voxel varies, the authors convert the point cloud into a dense data structure so that the subsequent Stacked Voxel Feature Encoding can be computed in parallel over all points and voxel features.

1. First create a K x T x 7 tensor (the Voxel Input Feature Buffer) to store the point / intermediate voxel feature data, where K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the encoded feature dimension of each point. The points are processed in random order.

2. Traverse the point cloud. If the voxel a point belongs to is already in the Voxel Coordinate Buffer and the corresponding entry of the Voxel Input Feature Buffer holds fewer than T points, insert the point into the Voxel Input Feature Buffer; otherwise discard the point. If the voxel a point belongs to is not yet in the Voxel Coordinate Buffer, initialize the voxel by recording its coordinates in the Voxel Coordinate Buffer and store the point in the Voxel Input Feature Buffer. The lookup is done with a hash table, so it takes O(1); building the whole Voxel Input Feature Buffer and Voxel Coordinate Buffer needs only one pass over the point cloud, i.e. O(N). To further save memory and computation, voxels containing fewer than m points can simply be skipped.

3. Once the Voxel Input Feature Buffer and Voxel Coordinate Buffer are built, the Stacked Voxel Feature Encoding can run in parallel over points / voxels. After the concat operation of the VFE module, the features of the padded (empty) points are reset to 0, which keeps the voxel features consistent with the point features. Finally, the contents of the Voxel Coordinate Buffer are used to restore the sparse 4D tensor and to carry out the subsequent middle feature extraction and the RPN layer. A small sketch of the buffer construction follows.
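A small Python sketch of the buffer construction in steps 1-2, using a dict as the hash table (K and T are illustrative; the optional skipping of voxels with fewer than m points is omitted):

import numpy as np

def build_buffers(points_7d, voxel_index_of, K=12000, T=35):
    # points_7d: (M, 7) encoded points; voxel_index_of(p) -> hashable (z, y, x) tuple for point p
    feature_buffer = np.zeros((K, T, 7), dtype=np.float32)   # Voxel Input Feature Buffer
    coordinate_buffer = np.zeros((K, 3), dtype=np.int32)     # Voxel Coordinate Buffer
    counts = np.zeros(K, dtype=np.int32)
    lookup = {}                                               # hash table: voxel index -> buffer row
    for p in np.random.permutation(points_7d):                # points processed in random order
        key = voxel_index_of(p)
        row = lookup.get(key)
        if row is None:
            if len(lookup) == K:                              # buffer full: ignore further voxels
                continue
            row = len(lookup)
            lookup[key] = row
            coordinate_buffer[row] = key                      # initialize the voxel's coordinates
        if counts[row] < T:                                   # keep at most T points per voxel
            feature_buffer[row, counts[row]] = p
            counts[row] += 1
    return feature_buffer, coordinate_buffer, counts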

2. Intermediate convolutional layers

After the Stacked Voxel Feature Encoding and the sparse tensor representation, 3D convolution can be used to extract features over the whole scene: the VFE stage captured the information inside each voxel, while the 3D convolutions here aggregate the local relationships between voxels, enlarging the receptive field to obtain richer shape information for the subsequent RPN predictions.
A 3D convolution is written ConvMD(c_in, c_out, k, s, p), where c_in and c_out are the input and output channel counts, k is the kernel size, s the stride, and p the padding. Each 3D convolution is followed by a BN layer and a ReLU activation.

Note: the original paper uses three 3D convolutions in the Convolutional middle layers, configured as

Conv3D(128, 64, 3, (2,1,1), (1,1,1)),
Conv3D(64, 64, 3, (1,1,1), (0,1,1)),
Conv3D(64, 64, 3, (2,1,1), (1,1,1)).

The resulting tensor has shape (64, 2, 400, 352), where 64 is the number of channels.
After the Convolutional middle layers, the data has to be reorganized into the form the RPN expects: the tensor from the Convolutional middle layers is reshaped along the height dimension into (64 * 2, 400, 352), so the dimensions become C, Y, X. This works because in detection datasets such as KITTI objects are not stacked along the height direction in 3D space (one car is never on top of another), and it also greatly reduces the difficulty of designing the RPN and the number of anchors needed later.
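A PyTorch sketch of these middle layers and the reshape, using the settings quoted above (the BN/ReLU placement follows the text; the input tensor here is a small dummy used only to check shapes):

import torch
import torch.nn as nn

def conv3d_block(c_in, c_out, k, s, p):
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, stride=s, padding=p),
                         nn.BatchNorm3d(c_out), nn.ReLU())

middle_layers = nn.Sequential(
    conv3d_block(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv3d_block(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv3d_block(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)

x = torch.zeros(1, 128, 10, 40, 36)                  # small smoke test; the real shape is (1, 128, 10, 400, 352)
y = middle_layers(x)                                 # depth 10 -> 2, so the real output would be (1, 64, 2, 400, 352)
bev = y.reshape(1, 64 * 2, y.shape[3], y.shape[4])   # collapse channels and depth into a 2D BEV feature map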

3. The RPN layer (the same structure as in PointPillars; or rather, PointPillars borrowed this structure)

The RPN of VoxelNet takes the reorganized feature map from the Convolutional middle layers, downsamples it several times, then applies deconvolutions to the differently downsampled features to bring them back to the same size. These feature maps from different scales are then concatenated for the final detection.

The following figure shows the detailed RPN structure of VoxelNet (figure from the original paper).
insert image description here

This is a bit like the PAN neck used in image object detection, except that only one final feature map is produced here, combining information from different scales. Every convolution in this stage is a 2D convolution followed by BN and ReLU; the detailed kernel parameters are given in the figure.

The final outputs are a classification prediction and an anchor regression prediction.
Anchor parameter design
In VoxelNet only one anchor scale is used, unlike the 9 anchors in Faster R-CNN. The anchor's length, width, and height are 3.9 m, 1.6 m, and 1.56 m. Also, unlike Faster R-CNN, every object in the real 3D world has an orientation, so VoxelNet assigns each anchor two orientations: 0 degrees and 90 degrees (in the lidar coordinate system).

Note: in the original paper the authors designed different anchor scales for cars, pedestrians, and cyclists, and pedestrians and cyclists have their own network variants (differing only in the settings of the Convolutional middle layers). For simplicity, only the car network is used as the reference here. A small anchor-generation sketch follows.
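A small NumPy sketch of generating the car anchors on the BEV grid described above (the feature-map size and the anchor z-center are assumptions for illustration):

import numpy as np

def make_car_anchors(feat_h=200, feat_w=176, x_range=(0, 70.4), y_range=(-40, 40),
                     size=(3.9, 1.6, 1.56), z_center=-1.0, rotations=(0.0, np.pi / 2)):
    # One anchor per BEV feature-map cell and per rotation: (x, y, z, l, w, h, theta).
    xs = np.linspace(x_range[0], x_range[1], feat_w, endpoint=False) + (x_range[1] - x_range[0]) / feat_w / 2
    ys = np.linspace(y_range[0], y_range[1], feat_h, endpoint=False) + (y_range[1] - y_range[0]) / feat_h / 2
    gx, gy = np.meshgrid(xs, ys)                                   # (feat_h, feat_w) cell centers
    anchors = []
    for rot in rotations:
        a = np.stack([gx, gy,
                      np.full_like(gx, z_center),
                      np.full_like(gx, size[0]),
                      np.full_like(gx, size[1]),
                      np.full_like(gx, size[2]),
                      np.full_like(gx, rot)], axis=-1)             # (feat_h, feat_w, 7)
        anchors.append(a)
    return np.stack(anchors, axis=2).reshape(-1, 7)                # (feat_h * feat_w * 2, 7)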

4. Loss function

A 3D annotation contains 7 parameters (x, y, z, l, w, h, θ): x, y, z give the position of the object's center in the lidar coordinate system, l, w, h are its length, width, and height, and θ is its rotation around the Z axis (yaw angle). Each generated anchor likewise has 7 parameters (xa, ya, za, la, wa, ha, θa): xa, ya, za are the anchor's position in the lidar coordinate system, la, wa, ha its length, width, and height, and θa its angle.

Therefore, the regression targets between each anchor and its GT box are encoded with the following formula:
insert image description here
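For readability, the encoding shown in the figure (as defined in the original paper) is:

\Delta x = \frac{x^g - x^a}{d^a},\quad \Delta y = \frac{y^g - y^a}{d^a},\quad \Delta z = \frac{z^g - z^a}{h^a}

\Delta l = \log\frac{l^g}{l^a},\quad \Delta w = \log\frac{w^g}{w^a},\quad \Delta h = \log\frac{h^g}{h^a},\quad \Delta\theta = \theta^g - \theta^a

where the superscript g denotes the ground-truth box and a the anchor.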

where d^a = sqrt((l^a)^2 + (w^a)^2) is the length of the diagonal of the anchor's base:
insert image description here
The above defines the 7 regression residuals for each anchor; the total loss also includes the classification prediction for each anchor, so the total loss is defined as follows:

insert image description here

Here p_i^pos is the softmax probability that a positive anchor contains an object, and p_j^neg is the softmax probability for a negative anchor; u_i and u_i* are the regression output and target, and the regression loss is computed only for positive anchors. Both the background and the category classification terms use the BCE loss; 1/N_pos and 1/N_neg normalize each classification term, and α, β are two balancing coefficients, set to 1.5 and 1 in the paper. The regression loss uses the SmoothL1 function.
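For reference, the total loss shown above has the form (as given in the original paper):

L = \alpha\,\frac{1}{N_{pos}}\sum_i L_{cls}(p_i^{pos}, 1) \;+\; \beta\,\frac{1}{N_{neg}}\sum_j L_{cls}(p_j^{neg}, 0) \;+\; \frac{1}{N_{pos}}\sum_i L_{reg}(u_i, u_i^*)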

5. Data Augmentation

1. In the point cloud annotations, each GT box already records which points lie inside it, so those points can be translated or rotated together with the box to create a large amount of varied data. After moving them, a collision check is needed: if the transformed GT box overlaps another GT box, the transform is discarded, since that cannot happen in reality.

2. All GT boxes are scaled up or down, with the scale drawn from [0.95, 1.05]; introducing scaling helps the network generalize to objects of different sizes, which is very common in image tasks.

3. All GT boxes are randomly rotated, with the angle drawn uniformly from [-45, 45] degrees; rotating an object's yaw angle imitates an object turning a corner in reality. A minimal sketch of the global scaling and rotation follows this list.
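A minimal NumPy sketch of global scaling and yaw rotation applied jointly to the points and the boxes (the ranges follow the text; the per-box collision check from item 1 is omitted):

import numpy as np

def global_scale_and_rotate(points, boxes):
    # points: (M, 4) -> (x, y, z, r); boxes: (B, 7) -> (x, y, z, l, w, h, theta)
    scale = np.random.uniform(0.95, 1.05)
    angle = np.random.uniform(-np.pi / 4, np.pi / 4)      # [-45, 45] degrees
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]], dtype=np.float32)   # rotation around the z axis
    points = points.copy()
    boxes = boxes.copy()
    points[:, :3] *= scale
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :6] *= scale                                  # centers and sizes scale together
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle                                   # yaw angles shift by the global rotation
    return points, boxes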

6. Results

insert image description here

Compared with bird's-eye-view detection, which only needs to localize objects accurately on a 2D plane, 3D detection must localize objects accurately in 3D space and is therefore more challenging. Table 2 summarizes the comparison. For the Car class, VoxelNet significantly outperforms all other methods in AP at all difficulty levels. Specifically, using only LiDAR, VoxelNet outperforms the state-of-the-art LiDAR+RGB method MV (BV+FV+RGB) [5] by 10.68%, 2.78%, and 6.29% on the easy, moderate, and hard levels respectively. HC-baseline has accuracy similar to the MV [5] method. As with the bird's-eye-view evaluation, VoxelNet is also compared with HC-baseline on 3D pedestrian and cyclist detection. Because of their highly variable 3D poses and shapes, successful detection of these two classes requires a better 3D shape representation. As Table 2 shows, the advantage of VoxelNet grows on the more challenging 3D detection task (from about 8% in bird's-eye view to about 12% in 3D detection), which suggests that VoxelNet is more effective at capturing 3D shape information than hand-crafted features.

Algorithm comparison

VoxelNet:

The spatial range along (z, y, x) is [-3, 1] x [-40, 40] x [0, 70.4] meters, with a resolution of (0.4, 0.2, 0.2). The space is therefore divided into (D, H, W) = (10, 400, 352) small voxels.

In a sense, it can be considered that voxels are now new "point clouds".
The number of points in each voxel is T, with T <= 35.

  1. VFE layer
    The job of the VFE layer is to aggregate the point features within a single voxel.
    insert image description here

In this way, each voxel can be represented by (T, 7), where T is the number of points and 7 is the dimension of a single point.
The VFE layer is FCN + MaxPooling + Concat.
The specific operation is:

(T, 7) --FC with a (7, 16) weight matrix--> (T, 16) --BN, ReLU--> (T, 16)

(T, 16) --MaxPooling over the T dimension--> (1, 16), a vector that summarizes the whole voxel

(1, 16) is concatenated to the (T, 16) output of the FC, i.e. each individual point vector is concatenated with the voxel's global vector, giving (T, 32)

The VFE is repeated twice; the second pass is

(T, 32) --FC with a (32, 64) weight matrix--> (T, 64) --BN, ReLU--> (T, 64)

(T, 64) --MaxPooling over the T dimension--> (1, 64), a vector that summarizes the whole voxel

(1, 64) is concatenated to the (T, 64) output of the FC, i.e. each individual point vector is concatenated with the voxel's global vector, giving (T, 128)

Finally, MaxPooling is applied once more, so (T, 128) ----> (1, 128), giving the feature vector (1, 128) that represents a single voxel.
The final feature of the entire voxel map is (C, D, H, W) = (128, 10, 400, 352).

  2. 3D convolution and RPN
    After VFE processing, the data is already a regular voxel map, which can simply be regarded as an extension of a 2D image: 3D convolution is used to extract features and the RPN performs detection.
    (128, 10, 400, 352) --3D conv--> (64, 2, 400, 352) --reshape--> (128, 400, 352) --> RPN

Disadvantages:
Obviously, the fixed voxel partitioning is relatively rigid.

SECOND:

  1. Uses sparse convolution to compute the 3D convolutions
  2. It seems to drop the VFE and simply use the mean of the points in each voxel; this needs a closer look at the code (11.5)
  3. Its data augmentation methods deserve attention.
  4. Adds an extra loss

PointPillars

insert image description here


Origin blog.csdn.net/weixin_44077556/article/details/128970949