Autonomous (Intelligent) Driving Series | (2) Environmental Perception and Recognition (2)

Following the previous article, this one focuses on lidar-based object detection and is divided into two parts: an overview of point clouds and point cloud deep learning.

Table of contents

1. Overview of point cloud

2. Point cloud deep learning

2.1 PointNet and PointNet++

2.2 VoxelNet

2.3 SECOND

2.4 PointPillars


1. Overview of point cloud

In this part, the goal is lidar-based object detection, so the object of study is the dense point cloud produced by the lidar.

First of all, studying point clouds can be compared to processing images, but a point cloud is unordered: it has no strict coordinate relationships like pixels do, there is no notion of adjacent pixels, and the data is sparse, so at first point clouds are quite difficult to work with. To deal with the disorder, we can build an octree or a kd-tree to impose structure and make the points searchable; although points have no adjacency, they do have neighborhoods, usually defined either as the k nearest neighbors (the k points with the smallest Euclidean distance) or as a radius neighborhood (all points within radius r). Point clouds also come in several representations, such as a raw set of points (pointCloud) or a triangular mesh obtained by triangulating the points. Common point cloud file formats include .pcd and .ply.
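To make the two neighborhood definitions concrete, here is a minimal Python sketch (using scipy's KD-tree rather than PCL; the point array, query point, and radius are made-up toy values):

import numpy as np
from scipy.spatial import cKDTree

# A toy point cloud: 1000 random 3D points (placeholder data).
points = np.random.rand(1000, 3)

# Building a KD-tree gives the unordered points a searchable structure.
tree = cKDTree(points)
query = np.array([0.5, 0.5, 0.5])

# k-nearest-neighbor search: the k points with the smallest Euclidean distance to the query.
dists, knn_idx = tree.query(query, k=8)

# Radius neighborhood: all points within radius r of the query.
radius_idx = tree.query_ball_point(query, r=0.1)

print("8-NN indices:", knn_idx)
print("points within r = 0.1:", len(radius_idx))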

The library most often used to process point clouds is PCL (the Point Cloud Library).

The Point Cloud Library (PCL) is a standalone, large-scale, open project for 2D/3D image and point cloud processing (https://pointclouds.org/). It contains a wide variety of point cloud processing operations.

As in the image domain, boundaries can be detected from changes in the normal vector; point cloud features are likewise divided into single-point features (such as position, intensity, normal, curvature), local features (PFH, FPFH, SHOT, etc.), and global features (VFH), among others.

Through these features we can perform feature point extraction, matching, and registration of point clouds, and recognize objects based on them. The figure below shows the models with the highest similarity obtained through a VFH cluster query:

We will stop here for traditional point cloud processing.

2. Point cloud deep learning

First, here is a development timeline of 3D classification and segmentation:

At present, there are three mainstream approaches to point cloud processing: point-wise methods (such as the PointNet series), voxelization (such as VoxelNet), and pseudo-image methods (such as BEV-based methods).

2017 can be called a milestone year for point cloud deep learning. The emergence of PointNet marked the birth of deep learning methods that operate directly on points, while Apple's VoxelNet pioneered voxelization-based processing.

2.1 PointNet and PointNet++

Let's talk about PointNet first. The following is its network structure, which is relatively simple.

The point cloud is fed in and first passes through a transformation network (T-Net) that aligns the points to a canonical pose, keeping correspondences consistent when matching (ablations later show this network adds little). A shared-weight MLP then produces an n×64 feature map; after a second feature-space alignment, another MLP expands the dimension to n×1024, and a max-pool operation (the key operation of the paper) extracts a 1024-dimensional global feature. For classification, this global feature goes through an MLP to produce class scores. For segmentation, the global feature is copied n times and concatenated with the n×64 point features, and two MLPs reduce the output to n×m, so that each point gets m scores for per-point classification.
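To make the pipeline concrete, here is a heavily simplified PointNet-style classifier in PyTorch: a shared-weight MLP, a max pool to obtain the global feature, and a classification head. The T-Net alignment branches are omitted and the layer sizes only roughly follow the paper; this is an illustrative sketch, not the authors' implementation.

import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    # Shared-MLP + max-pool skeleton of PointNet (T-Nets omitted).
    def __init__(self, num_classes=40):
        super().__init__()
        # A "shared-weight MLP" is a 1x1 convolution applied to every point independently.
        self.mlp1 = nn.Sequential(nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
                                  nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, 256), nn.ReLU(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):                       # x: (B, 3, N) point coordinates
        point_feat = self.mlp1(x)               # (B, 64, N) per-point features
        feat = self.mlp2(point_feat)            # (B, 1024, N)
        global_feat = feat.max(dim=2).values    # symmetric max pool over points -> (B, 1024)
        return self.head(global_feat)           # (B, num_classes) class scores

x = torch.randn(2, 3, 1024)                     # a batch of 2 clouds with 1024 points each
print(TinyPointNet()(x).shape)                  # torch.Size([2, 40])

For segmentation, the (B, 1024) global feature would instead be repeated N times and concatenated with the (B, 64, N) point features before the final MLPs.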

PointNet achieved good results on ShapeNet, but its shortcoming is obvious: being purely point-wise, it does not capture local features well. Moreover, applying PointNet at every position in space is inefficient because of the sparsity of the point cloud.

 In the same year, Charles Qi improved PointNet and proposed PointNet++:

Compared with PointNet, PointNet++ builds local neighborhood relationships through sampling and grouping; instead of a single global max pool, it downsamples step by step (Set Abstraction) to obtain local and global features at different levels. In the segmentation task it upsamples with skip connections, applying the PointNet structure repeatedly to output per-point scores.
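As a rough illustration of the sampling-and-grouping step, here is a numpy sketch of farthest point sampling (to pick the group centroids) and a ball query (to collect each centroid's neighborhood); the sizes and radius are toy values, and this is not the authors' implementation:

import numpy as np

def farthest_point_sampling(points, m):
    # Iteratively pick the point farthest from all previously chosen centroids.
    n = points.shape[0]
    chosen = np.zeros(m, dtype=np.int64)        # chosen[0] = 0: arbitrary starting point
    dist = np.full(n, np.inf)                   # distance of each point to its nearest chosen centroid
    for i in range(1, m):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        chosen[i] = np.argmax(dist)
    return chosen

def ball_query(points, centroid, radius, k):
    # Grouping: up to k points within `radius` of a centroid.
    d2 = np.sum((points - centroid) ** 2, axis=1)
    idx = np.where(d2 <= radius ** 2)[0]
    return idx[:k]

pts = np.random.rand(2048, 3)
centroids = farthest_point_sampling(pts, 512)            # sampling layer
group0 = ball_query(pts, pts[centroids[0]], 0.2, 32)     # grouping layer, one centroid
print(centroids.shape, group0.shape)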

To address the problem that it may not work well in relatively sparse regions, two solutions are proposed:

This is essentially an encoder structure:

MSG (Multi-Scale Grouping): for each centroid, neighborhood features are extracted at several different radii, giving concentric spheres of different sizes whose features are concatenated. Its biggest problem is cost: as the original paper notes, it is expensive because it runs a local PointNet on large-scale neighborhoods for every centroid point.

MRG (Multi-Resolution Grouping): splices features from two levels. As shown in the figure on the right above, the feature is the concatenation of two vectors: the left one summarizes the features coming from the sub-regions of the lower level, while the right one is obtained by running a single PointNet directly on all raw points of the local region, hence "multi-resolution". When the local point density is low, the first vector is less reliable, because it is computed from the even sparser lower level, so its weight is reduced and the weight of the second vector is increased; in the opposite case (high density), the first vector provides finer detail, since it reflects lower-level features and can recursively inspect the region at higher resolution. MRG is more efficient than MSG and is the method the authors use.

In the segmentation task, features are propagated back to denser point sets by inverse-distance-weighted interpolation:

Then, through the skip connections, the interpolated features are concatenated with the features of the corresponding encoder layer, so each point carries both global and local information.
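A minimal numpy sketch of this inverse-distance interpolation (the paper uses the k = 3 nearest known points with weights 1/d²; the arrays below are toy placeholders):

import numpy as np

def interpolate_features(query_xyz, known_xyz, known_feat, k=3, eps=1e-8):
    # f(x) = sum_i w_i * f_i / sum_i w_i with w_i = 1 / d_i^2 over the k nearest known points.
    d2 = np.sum((query_xyz[:, None, :] - known_xyz[None, :, :]) ** 2, axis=-1)   # (Q, K) squared distances
    nbrs = np.argsort(d2, axis=1)[:, :k]                      # indices of the k nearest known points
    w = 1.0 / (np.take_along_axis(d2, nbrs, axis=1) + eps)    # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                         # normalize
    return np.einsum('qk,qkc->qc', w, known_feat[nbrs])       # weighted sum of neighbor features

dense = np.random.rand(1024, 3)        # points of the finer level (features to be interpolated)
sparse = np.random.rand(256, 3)        # points of the coarser level (features already known)
feat = np.random.rand(256, 64)
print(interpolate_features(dense, sparse, feat).shape)        # (1024, 64), then concatenated with skip features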

Classification results:

Segmentation results:

Runtime:

The speed is still relatively slow. Charles Qi later extended this line of work with F-PointNet and others.

2.2 VoxelNet

Next, VoxelNet

It is an end-to-end network that requires no hand-crafted feature engineering. The paper also mentions PointNet and PointNet++, pointing out their high computation and memory requirements. Inspired by the widely used RPN, the authors wanted to apply that approach, but an RPN needs dense, organized data, and raw lidar point clouds clearly do not satisfy this. They therefore propose the voxel feature encoding (VFE) layer, which encodes the interactions between points inside each voxel, and stack multiple VFE layers to extract finer 3D features.

VoxelNet divides the point cloud into equally sized 3D voxels, encodes each voxel with stacked VFE layers, extracts local features through 3D convolution to turn the point cloud into a high-dimensional voxel representation, and finally produces detection results through an RPN. As shown below, it is divided into three parts. (The figure also shows that the sparsity differs across regions of space.)

Its biggest innovation is the VFE layer: compute the local mean (centroid) of the points in a voxel, represent each point by its offset to that centroid, encode each point with an FC layer, obtain a voxel-wise feature vector through element-wise max pooling, and then concatenate that aggregated feature back onto each encoded point to get the final features.
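A simplified PyTorch sketch of one VFE layer following the description above (point-wise FC, element-wise max pool over the voxel, then concatenating the aggregated feature back onto every point). The BN layers and the masking of padded points are omitted, so this is an illustrative skeleton rather than the paper's code:

import torch
import torch.nn as nn

class VFELayer(nn.Module):
    # Voxel Feature Encoding: per-point FC -> max pool over the voxel -> concat back to each point.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c_in, c_out // 2), nn.ReLU())

    def forward(self, x):                                      # x: (V voxels, T points, c_in features)
        pointwise = self.fc(x)                                 # (V, T, c_out/2) point-wise encoding
        voxelwise = pointwise.max(dim=1, keepdim=True).values  # (V, 1, c_out/2) aggregated voxel feature
        repeated = voxelwise.expand(-1, x.shape[1], -1)        # broadcast back onto every point
        return torch.cat([pointwise, repeated], dim=-1)        # (V, T, c_out)

voxels = torch.randn(100, 35, 7)       # toy input: 100 voxels, 35 points each, 7 dims (x, y, z, r + offsets)
print(VFELayer(7, 32)(voxels).shape)   # torch.Size([100, 35, 32])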

The middle convolutional layers consist of 3D convolution, BN, and ReLU, and produce voxel-wise features.

# Conv3D(c_in, c_out, kernel_size, stride, padding)
Conv3D(128, 64, 3, (2,1,1), (1,1,1)),
Conv3D(64, 64, 3, (1,1,1), (0,1,1)),
Conv3D(64, 64, 3, (2,1,1), (1,1,1))
The final tensor shape is (64, 2, 400, 352): the Z dimension has been compressed, so the result can be treated as a 2D image-like feature map.
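Expressed in PyTorch (reading the argument order above as in-channels, out-channels, kernel, stride, padding), the middle layers can be sketched as follows; the input grid size (128, 10, 400, 352) is the VFE output for the car setting described in the paper:

import torch
import torch.nn as nn

middle = nn.Sequential(
    nn.Conv3d(128, 64, 3, stride=(2, 1, 1), padding=(1, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
    nn.Conv3d(64, 64, 3, stride=(1, 1, 1), padding=(0, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
    nn.Conv3d(64, 64, 3, stride=(2, 1, 1), padding=(1, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
)

x = torch.randn(1, 128, 10, 400, 352)      # (batch, C, D, H, W) voxel feature grid
y = middle(x)                              # (1, 64, 2, 400, 352): Z squeezed from 10 down to 2
bev = y.reshape(1, 64 * 2, 400, 352)       # merge C and D into 128 channels -> a 2D "image" for the RPN
print(bev.shape)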

RPN architecture (three blocks of fully convolutional layers; their outputs are deconvolved to a common size and combined into one feature map):

Each category has its own network; the figure shows the structure for one category (for example, cars). The outputs are a classification map (2 channels, one per anchor, since two anchors at 0 and 90 degrees are used for cars) and an anchor regression map (2×7 channels for the 3D anchors, where the 7 values are the offsets in x, y, z, w, h, l plus an angular offset relative to 0 or 90 degrees).
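For reference, here is a sketch of what those 7 regression channels encode, following the residual encoding described in the paper (d_a is the diagonal of the anchor's base; the box tuples and numbers are illustrative):

import numpy as np

def encode_box(gt, anchor):
    # 7-dim regression target: normalized center offsets, log size ratios, and the yaw offset.
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)                        # diagonal of the anchor's base
    return np.array([(xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
                     np.log(lg / la), np.log(wg / wa), np.log(hg / ha),
                     tg - ta])                             # angle offset relative to the 0 or 90 degree anchor

car_anchor = (10.0, 2.0, -1.0, 3.9, 1.6, 1.56, 0.0)        # (x, y, z, l, w, h, yaw), KITTI car anchor size
gt_box = (10.5, 2.3, -0.9, 4.1, 1.7, 1.5, 0.1)
print(encode_box(gt_box, car_anchor))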

Pedestrians, cars, and cyclists use different anchor settings and different IoU thresholds to distinguish positive and negative samples. Data augmentation is applied before input.

Two sets of results on KITTI are shown above (the KITTI training set is split into 3712 training samples and 3769 validation samples); only lidar is used.

Project the 3D bbox into the image:

The paper does not dwell on running speed: the 3D convolutions consume a lot of memory, and at roughly 2 FPS the network is far from real-time.

2.3 SECOND

SECOND (Sparsely Embedded CONvolutional Detection)

This paper makes three major improvements over VoxelNet:

1. Added sparse convolution, which greatly improves the efficiency;

2. The orientation term of VoxelNet's loss function is corrected (VoxelNet penalizes a prediction pointing in exactly the opposite direction too heavily, which is problematic);

3. A new data augmentation method is proposed.

The 3D convolutions in VoxelNet are time-consuming, and a large fraction of spatial locations are empty (often more than 90%), which greatly hurts efficiency. SECOND therefore adopts sparse convolution. For an accessible explanation of sparse convolution, see:

An easy-to-understand explanation of the Sparse Convolution process - 知乎

The rule-based sparse convolution algorithm (the traditional implementation requires frequent data exchange between GPU and CPU, which is slow):

 The middle feature extraction layer uses two sparse convolutional layers to obtain data similar to 2D images by compressing the z-axis:
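A toy sketch of the idea behind the sparse representation (not the spconv library itself): only the non-empty voxels are stored as coordinate/feature pairs, and a dense BEV map is materialized only at the end, when the z-axis is collapsed:

import numpy as np

# Sparse storage: coordinates and features exist only for non-empty voxels (often < 10% of the grid).
coords = np.array([[5, 100, 200], [6, 100, 200], [2, 300, 50]])    # (z, y, x) of the active voxels
feats = np.random.rand(len(coords), 64)                            # one 64-dim feature per active voxel

# Densify only at the end: collapse z by accumulating the active features into a (C, Y, X) BEV map.
bev = np.zeros((64, 400, 352))
for (z, y, x), f in zip(coords, feats):
    bev[:, y, x] += f

print("grid cells:", 10 * 400 * 352, "stored voxels:", len(coords))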

Loss change: smooth L1 is used for regression, and the orientation residual is wrapped in a sine, i.e. SmoothL1(sin(θ_p − θ_t)).

The advantage is that the flipping problem (0 vs π) is solved, together with an IoU formulation suited to angle flipping (treating boxes as all-positive); the remaining ambiguity between the two directions is resolved by a direction classifier based on the ground-truth yaw around the z-axis (greater than zero is positive, otherwise negative). The commonly used focal loss is adopted for the classification branch.
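A small PyTorch-style sketch of the angle handling described above (toy values; smooth_l1_loss stands in for the regression loss):

import torch
import torch.nn.functional as F

pred_yaw = torch.tensor([0.10, 3.04])    # predicted yaw angles (radians)
gt_yaw = torch.tensor([0.05, -0.10])     # ground-truth yaw angles

# The orientation residual is wrapped in a sine, so a box and its 180-degree flip give the same loss...
angle_loss = F.smooth_l1_loss(torch.sin(pred_yaw - gt_yaw), torch.zeros_like(gt_yaw))

# ...and the flip ambiguity is resolved by a separate direction classifier,
# whose target is simply whether the ground-truth yaw is positive.
dir_target = (gt_yaw > 0).long()
print(angle_loss.item(), dir_target)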

In the data augmentation stage, they build a ground-truth database; during each training step some ground-truth objects with their point clouds are sampled and inserted into the current point cloud, enriching the scene and simulating detection in different environments. Because the inserted objects may land in physically impossible positions, overlapping samples are removed via collision detection.

Comparative Results:

 BEV results:

 It can be seen that the speed is much faster.

2.4 PointPillars

At the end of this section, we will talk about one more network: PointPillars.

Recommended reading:

PointPillars paper analysis and OpenPCDet code analysis - NNNNNathan's blog (CSDN)

Also inspired by SECOND, it adopts a sparse structure. It comes from industry and has become a very mainstream network, often used as a baseline for judging the quality of a point cloud dataset. It is a single-stage method:

 PP stands for PointPillars in this article, F stands for F-Pointnet, A stands for AVOD, S stands for SECOND, and V stands for VoxelNet.

The comparison is the BEV performance on Kitti.

Unlike SECOND, PointPillars applies 2D convolutions to a pseudo-image.

The biggest highlight of the paper is the proposal of pillars: vertical columns into which the point cloud is divided.

For lidar point cloud data, the basic four dimensions are (x, y, z, r), where r is the intensity. The space is divided into pillars; within each pillar, the mean point is computed and each point is augmented with five additional offset dimensions (similar to VoxelNet): the offsets from the pillar's mean point in x, y, z, and the offsets from the pillar's cell center in x and y. Together with the original four, this gives nine dimensions in total (some code implementations also add the z offset from the cell center, for ten dimensions).

In this way, the N×3 point structure is turned into a (D, P, N) tensor, where D is the point feature dimension (9 here), P is the number of non-empty pillars (their indices), and N is the number of points per pillar. A simplified PointNet is then applied to extract features, giving a (C, P, N) tensor; max pooling over the points produces (C, P); finally, each pillar's feature is scattered back to its original index, yielding a pseudo-image.
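A rough numpy sketch of this pillar pipeline: decorate each point to 9 dimensions, run a tiny point-wise encoder with max pooling, and scatter the pillar features back to their cells to form the pseudo-image. All sizes are toy values, the 0.16 m pillar size is an assumption, and the random linear layer stands in for the simplified PointNet:

import numpy as np

grid = (200, 176)                                  # pseudo-image size in (y, x) pillar cells
pillar_yx = np.array([[50, 30], [120, 90]])        # (y, x) cell index of each non-empty pillar, P = 2
pillar_pts = np.random.rand(2, 20, 4)              # P pillars, N = 20 points each, features (x, y, z, r)

# Decoration: offsets to the pillar's mean point (xc, yc, zc) and to the pillar's cell center (xp, yp).
mean = pillar_pts[:, :, :3].mean(axis=1, keepdims=True)
center_offsets = pillar_pts[:, :, :3] - mean                                  # 3 extra dims
cell_center = pillar_yx[:, None, ::-1] * 0.16 + 0.08                          # metric (x, y) of each cell center
plane_offsets = pillar_pts[:, :, :2] - cell_center                            # 2 extra dims
decorated = np.concatenate([pillar_pts, center_offsets, plane_offsets], -1)   # (P, N, 9): the (D, P, N) tensor in (P, N, D) layout

# Simplified PointNet: a per-point linear map + ReLU, then max pooling over the pillar's points.
W = np.random.rand(9, 64)
pillar_feat = np.maximum(decorated @ W, 0).max(axis=1)                        # (P, C) with C = 64

# Scatter each pillar feature back to its (y, x) cell -> (C, H, W) pseudo-image.
pseudo_image = np.zeros((64, *grid))
pseudo_image[:, pillar_yx[:, 0], pillar_yx[:, 1]] = pillar_feat.T
print(pseudo_image.shape)                                                     # (64, 200, 176)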

In the 2D backbone, only the pseudo-image cells corresponding to non-empty pillars carry information, so the input is still quite sparse; feature extraction and fusion are performed by a top-down convolutional network whose multi-scale feature maps are upsampled and concatenated.

For the detection head, an SSD-style anchor-based design is used, with three anchor scales corresponding to cars, pedestrians, and cyclists. Each anchor comes in two orientations, 0 and 90 degrees, and the positive/negative IoU thresholds also differ per class. (Some implementations simply use a single network for all classes.)

The loss function is similar to the previous networks and uses the same sine-based angle encoding as SECOND: the localization error uses smooth L1, the classification task uses focal loss, and the anchors' direction is classified as in SECOND:

Results:

 The frame rate is greatly improved.

There are many more networks in this area; we will stop here for now. Later developments, such as PV-RCNN in 2020, are not covered here and may be added when there is an opportunity.

The next article covers methods for other tasks. Everyone is welcome to discuss!

Source: blog.csdn.net/m0_46611008/article/details/125716667