Understanding the PointNet family in one article: a powerful line of point cloud processing neural networks

Author: Li Guopu, contracted author of the 3D Vision Developer Community, CSDN blog expert, Huawei Cloud sharing expert
First published on the WeChat public account [3D Vision Developer Community]

Foreword

PointNet is a model proposed by Charles R. Qi et al. of Stanford University in the paper "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation". It processes point clouds directly: for each point it learns a spatial encoding, and the features of all points are then aggregated into a global point cloud feature. The global features extracted by PointNet handle the classification task well, but its local feature extraction ability is weak, which makes it difficult to analyze complex scenes.

PointNet++ is an improved version of PointNet proposed by the Charles R. Qi team. Its core contribution is a multi-level feature extraction structure that effectively extracts both local and global features.

F-PointNet extends PointNet to 3D object detection, and can use either PointNet or PointNet++ for point cloud processing. Before processing the point cloud, it uses image information to obtain a prior search range, which improves both efficiency and accuracy.

1. PointNet

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Paper address: https://arxiv.org/abs/1612.00593

Open source code - original paper implementation: https://github.com/charlesq34/pointnet

Open source code - PyTorch implementation: https://github.com/fxia22/pointnet.pytorch


1.1 PointNet processing pipeline

1) The input is the set of all points in one frame of point cloud data, expressed as an n×3 2D tensor, where n is the number of points and 3 corresponds to the xyz coordinates.

2) The input is first aligned by multiplying it with a transformation matrix learned by a T-Net, which gives the model invariance to certain spatial transformations.

3) Per-point features are then extracted with several shared MLP layers, after which a second T-Net is used to align the features.

4) A max pooling operation over all points is applied to each feature dimension to obtain the final global feature.

5) For classification, the global feature is passed through an MLP to predict the final class scores; for segmentation, the global feature is concatenated with the previously learned per-point local features, and an MLP then predicts a class for each point.
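To make the pipeline above concrete, here is a minimal PyTorch sketch of the classification path (shared MLPs followed by max pooling). Both T-Net alignment steps are omitted for brevity; the layer widths (64, 128, 1024) follow the paper, but the class names and shapes are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of the PointNet classification pipeline (no T-Nets).
import torch
import torch.nn as nn

class PointNetClsSketch(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # Shared MLP implemented as 1x1 convolutions over the point dimension
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Classification head applied to the global feature
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                    # points: (B, n, 3)
        x = points.transpose(1, 2)                # (B, 3, n) for Conv1d
        x = self.shared_mlp(x)                    # (B, 1024, n) per-point features
        global_feat = torch.max(x, dim=2).values  # max pool over points -> (B, 1024)
        return self.head(global_feat)             # (B, num_classes)

# Example: classify a batch of 8 clouds with 1024 points each
logits = PointNetClsSketch()(torch.randn(8, 1024, 3))
```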

1.2 PointNet network structure

The "global features" it extracts can perform classification tasks well. Let's take a look at the framework structure of PointNet:
insert image description here
The following explains the role of each component in a network.

**1) transform:** The first T-Net (3×3) aligns the input point cloud: the pose is adjusted so that the transformed cloud is better suited to classification/segmentation. The second T-Net (64×64) aligns the 64-dimensional features.

**2) mlp:** A multi-layer perceptron used to extract point features, implemented here as shared-weight convolutions.

**3) max pooling:** Aggregates the information of all points via max pooling to obtain the global feature of the point cloud.

**4) Segmentation part:** A structure that combines local and global information (concatenation), used for semantic segmentation.

**5) Loss:** The classification loss is cross entropy; the segmentation loss combines the classification and per-point segmentation terms plus an L2 regularization that encourages the feature transform predicted by the T-Net to be close to an orthogonal matrix.
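As a concrete illustration of that regularization term, the following is a minimal sketch (an assumed form, not the authors' code) of the penalty that pushes a K×K feature transform A toward an orthogonal matrix, i.e. ||I - A Aᵀ||²:

```python
import torch

def feature_transform_regularizer(trans):
    """Sketch of the orthogonality regularizer: penalize ||I - A A^T||_F^2.

    trans: (B, K, K) feature transform matrices predicted by the T-Net.
    """
    K = trans.size(1)
    identity = torch.eye(K, device=trans.device).unsqueeze(0)   # (1, K, K)
    diff = identity - torch.bmm(trans, trans.transpose(1, 2))   # (B, K, K)
    return (diff ** 2).sum(dim=(1, 2)).mean()

# Hypothetical usage: total_loss = ce_loss + 0.001 * feature_transform_regularizer(trans64)
```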

1.3 T-Net network structure

The input point cloud is treated as an n×3×1 single-channel image. After three convolution layers and one pooling layer it is reshaped to a 1024-dimensional vector, followed by two fully connected layers. Every layer except the last uses a ReLU activation and batch normalization.
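Below is a minimal PyTorch sketch of the 3×3 input T-Net following this description (three convolutions, global max pooling, fully connected layers, and a final layer that outputs the 3×3 matrix). The channel widths follow the original implementation, but the code itself is an illustrative assumption, not the reference code.

```python
import torch
import torch.nn as nn

class TNet3Sketch(nn.Module):
    """Sketch of the 3x3 input-transform T-Net."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.fcs = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, 9),   # last layer: no ReLU/BN, outputs the 3x3 matrix
        )

    def forward(self, points):                   # points: (B, n, 3)
        x = self.convs(points.transpose(1, 2))   # (B, 1024, n)
        x = torch.max(x, dim=2).values           # global max pooling -> (B, 1024)
        mat = self.fcs(x).view(-1, 3, 3)
        # Add the identity so the transform starts close to a no-op
        return mat + torch.eye(3, device=mat.device)

# Aligned points: multiply each point by the learned 3x3 matrix
pts = torch.randn(4, 1024, 3)
aligned = torch.bmm(pts, TNet3Sketch()(pts))
```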

1.4 Model effect

[Figure: classification results on ModelNet40]

[Figure: segmentation results on part of the ShapeNet dataset]
Shortcoming: PointNet lacks the ability to extract local information at different scales.

2. PointNet++

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

Paper address: https://arxiv.org/abs/1706.02413

Open source code address: https://github.com/charlesq34/pointnet2

The global features extracted by PointNet handle the classification task well. However, the model essentially processes each point independently (the implementation uses 2D convolutions with 1×1 kernels) and only integrates them with a single max pooling, so its local feature extraction ability is weak, which makes it difficult to analyze complex scenes.

The core of PointNet++ is a multi-level feature extraction structure that effectively extracts both local and global features.

2.1 Basic idea

First, some points in the input point set are selected as center points; then the points surrounding each center point are gathered to form a region, and each region is fed into PointNet as one input sample, producing a feature vector that describes that region.

After that, the center points are kept, the regions are enlarged, and the features obtained in the previous step are fed into the next PointNet as input, and so on. This process keeps extracting local features while expanding the local range, finally yielding a set of global features that are then used for classification.

2.2 Overall network structure

PointNet++ extracts local features at different scales and obtains deep features through a multi-layer network structure. Like PointNet, it is divided into a classification network and a segmentation network depending on the task, and their inputs and outputs are consistent with the two networks in PointNet.

[Figure: PointNet++ network architecture (classification and segmentation branches)]
PointNet++ first samples (sampling) and partitions (grouping) the point cloud, then uses a basic PointNet network for feature extraction in each small region (with MSG or MRG grouping, described below), and iterates this process.

For classification, PointNet is used directly to extract the global feature, and fully connected layers produce the score of each class. For segmentation, inverse-distance interpolation of the higher-level point features is used to obtain features for the same number of points as the lower level, these are fused with the lower-level features, and PointNet is used again to extract features. Comparing the two task branches of PointNet++: after the highest-level feature is obtained, the classification branch uses a small PointNet plus fully connected layers to produce the final classification scores, while the segmentation branch repeatedly interpolates and fuses information through "skip connections" to finally obtain point-by-point semantic segmentation results. (The "skip connections" correspond to the skip links in the figure above; low-level feature maps have higher resolution and retain richer detail, although their semantic information is weaker.)
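To make the inverse-distance interpolation step concrete, here is a minimal sketch of one common way to implement it (an assumption following the paper's formula with k = 3 neighbors), not the authors' code:

```python
import torch

def inverse_distance_interpolation(xyz_dst, xyz_src, feat_src, k=3, eps=1e-8):
    """Propagate features from a sparse set (src) to a denser set (dst).

    xyz_dst:  (B, N, 3) coordinates of the dense (lower-level) points
    xyz_src:  (B, M, 3) coordinates of the sparse (higher-level) points
    feat_src: (B, M, C) features of the sparse points
    Returns:  (B, N, C) features interpolated with inverse-distance weights.
    """
    dist = torch.cdist(xyz_dst, xyz_src)                 # (B, N, M) pairwise distances
    dist, idx = dist.topk(k, dim=-1, largest=False)      # k nearest sparse points
    weight = 1.0 / (dist + eps)
    weight = weight / weight.sum(dim=-1, keepdim=True)   # normalized weights (B, N, k)
    B, N, _ = idx.shape
    batch_idx = torch.arange(B, device=idx.device).view(B, 1, 1).expand(B, N, k)
    neighbors = feat_src[batch_idx, idx]                 # gather neighbor features (B, N, k, C)
    return (weight.unsqueeze(-1) * neighbors).sum(dim=2) # weighted sum -> (B, N, C)

# Usage: propagate 64-dim features from 128 sparse points back to 1024 dense points
f = inverse_distance_interpolation(torch.randn(2, 1024, 3),
                                   torch.randn(2, 128, 3),
                                   torch.randn(2, 128, 64))
```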

2.3 Network Structure Components

**1) Sampling layer (sampling):** A single frame of lidar can contain as many as 100k points, so extracting local features around every point would be extremely expensive. The authors therefore sample the data points first, using farthest point sampling (FPS), which covers the whole sampling space better than random sampling (see the sketch after this list).

**2) Grouping layer (grouping):** To extract the local features of a point, we first need to define what "local" means for that point. For an image pixel, the local neighborhood is the set of pixels within a certain Manhattan distance, usually determined by the kernel size of the convolution layer. Similarly, for a point in a point cloud, the local neighborhood consists of the other points inside a sphere of a given radius around it. The grouping layer finds the points that make up the neighborhood of each sampled point, so that features can subsequently be extracted for each local region.

**3) Feature extraction layer (feature learning):** Since PointNet provides a feature extraction network for point cloud data, it can be applied to each region produced by the grouping layer to obtain local features. Note that although the regions produced by the grouping layer may contain different numbers of points, they all yield features of the same dimension after passing through PointNet.
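As promised above, here is a minimal sketch of the sampling and grouping steps (farthest point sampling plus ball query). It is written from the description above as an illustrative assumption, not the PointNet++ reference code, and operates on a single (N, 3) cloud for clarity:

```python
import torch

def farthest_point_sampling(xyz, n_samples):
    """Pick n_samples points that spread over the cloud. xyz: (N, 3)."""
    N = xyz.size(0)
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist_to_set = torch.full((N,), float("inf"))
    farthest = 0                                   # start from an arbitrary point
    for i in range(n_samples):
        selected[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)
        dist_to_set = torch.minimum(dist_to_set, d)
        farthest = int(dist_to_set.argmax())       # next center: farthest from the chosen set
    return selected                                # indices of the sampled centers

def ball_query(xyz, centers_idx, radius, max_points=32):
    """For each center, collect up to max_points neighbors within `radius`."""
    groups = []
    for c in centers_idx:
        d = ((xyz - xyz[c]) ** 2).sum(dim=1).sqrt()
        idx = torch.nonzero(d < radius).flatten()[:max_points]
        groups.append(xyz[idx] - xyz[c])           # local coordinates relative to the center
    return groups                                  # list of (k_i, 3) tensors, k_i <= max_points

# Usage: sample 512 centers and group neighbors within a 0.2 radius
pts = torch.randn(2048, 3)
centers = farthest_point_sampling(pts, 512)
regions = ball_query(pts, centers, radius=0.2)
```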

2.4 Grouping methods for non-uniform point clouds

Unlike image data, which lies on a regular pixel grid with uniform density, point cloud data is distributed irregularly and non-uniformly in space. When the point cloud is uneven, using the same spherical radius in every sub-region leaves sparse regions with too few sampled points. The authors propose two solutions: multi-scale grouping (MSG) and multi-resolution grouping (MRG).

1) Multi-scale grouping (MSG): Several radii are set for each selected center point, and the features extracted by PointNet from each scale are concatenated (concat) as the feature of that center point. Many of the resulting features overlap, which helps retain and emphasize local key features, but the weights computed at different scales are hard to share, and the amount of computation grows considerably.
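A minimal sketch of the MSG idea: one small per-scale network (a stand-in for a mini PointNet, an assumption rather than the authors' module) per radius, with the per-scale features max-pooled and concatenated:

```python
import torch
import torch.nn as nn

class MSGSketch(nn.Module):
    """Sketch of multi-scale grouping: one small network per radius, features concatenated."""
    def __init__(self, radii=(0.1, 0.2, 0.4), out_dim=64):
        super().__init__()
        self.radii = radii
        # One shared MLP per scale (stand-in for a small PointNet)
        self.mlps = nn.ModuleList(nn.Sequential(nn.Linear(3, out_dim), nn.ReLU())
                                  for _ in radii)

    def forward(self, xyz, center):            # xyz: (N, 3), center: (3,)
        feats = []
        d = (xyz - center).norm(dim=1)
        for radius, mlp in zip(self.radii, self.mlps):
            local = xyz[d < radius] - center   # neighbors at this scale, centered
            if local.numel() == 0:             # empty scale contributes zeros
                feats.append(torch.zeros(mlp[0].out_features))
            else:
                feats.append(mlp(local).max(dim=0).values)  # max pool over the group
        return torch.cat(feats)                # concatenated multi-scale feature

# Usage: multi-scale feature for one center point of a random cloud
pts = torch.randn(2048, 3)
feature = MSGSketch()(pts, pts[0])
```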

[Figure: (a) multi-scale grouping (MSG), (b) multi-resolution grouping (MRG)]
2) Multi-resolution grouping (MRG): MRG avoids the heavy computation of MSG while still retaining the ability to adaptively aggregate information according to the distribution of the points, by concatenating features extracted at different resolutions (feature levels). Taking panel (b) of the figure as an example, the final concatenated feature consists of a left and a right part: the lower-level points are grouped and passed through a PointNet, and the result is concatenated with the higher-level feature. The idea is essentially a skip connection within the feature extraction.

When a local point cloud region is sparse, the features extracted by the upper layer may be less reliable than those from the bottom layer, so the weight of the bottom-layer features should be increased; conversely, when the point cloud is dense, finer features can be extracted. This method mitigates the problem of extracting features directly from sparse point clouds and is also more efficient than MSG.

Which to choose: when the density of a local region is low, the first vector may be less reliable than the second, because the sub-regions from which the first vector is computed contain even sparser points and suffer more from undersampling. In this case, the second vector should be weighted higher.

On the other hand, when the density of the local region is high, the first vector provides finer details, because it can recursively inspect the point cloud at higher resolution at lower levels.

2.5 Model effect

Classification comparison:

[Figure: classification results comparison]

Segmentation comparison:

[Figure: segmentation results comparison]

Summary

Point clouds of complex scenes are generally processed with PointNet++, while simple scenes can be handled by PointNet. Looking only at the two tasks of point cloud classification and segmentation: the classification task only needs the feature obtained after the max pooling operation, whereas the segmentation task requires more fine-grained local context information.

3. F-PointNet

Frustum PointNets for 3D Object Detection from RGB-D Data

Paper address: https://arxiv.org/pdf/1711.08488.pdf

Open source code: https://github.com/charlesq34/frustum-pointnets

F-PointNet also processes point cloud data directly, but this approach faces a challenge: how to efficiently locate the possible positions of objects in 3D space, i.e., how to generate 3D candidate boxes; a global search would consume a huge amount of computing power and time.

F-PointNet uses image information to obtain a prior search range before point cloud processing, which improves both efficiency and accuracy.

3.1 Basic idea

We first run a 2D detector on the RGB image; each 2D bounding box defines a 3D frustum region. Then, based on the 3D point cloud inside each frustum, we perform 3D instance segmentation and amodal 3D bounding box estimation using PointNet/PointNet++ networks. The idea can be summarized as follows:

  1. 2D object detection on the image.

  2. Generation of frustum regions from the 2D boxes.

  3. Point cloud instance segmentation inside each frustum using PointNet/PointNet++ networks.

Using image information to obtain a prior search range before point cloud processing improves both efficiency and accuracy. Take a look at the figure below:

In this figure, the upper left part indicates that the image and the point cloud are calibrated first (this is extrinsic calibration between the sensors, performed before perception; by obtaining the rotation matrix and translation vector between the two sensors, their relative pose is known).

[Figure: image and point cloud calibration (upper left), 2D detection box (lower left), and the resulting frustum (right)]
The lower left part is the bounding box of an object detected by the 2D detector. Once the bounding box is obtained, a frustum is formed by taking the camera as the origin and extending along the direction of the bounding box (the right half of the figure above). The word "frustum" in the paper title refers to this truncated cone.

When the point cloud is then used to recognize the object, the search only needs to be carried out inside this frustum, which greatly reduces the search range.

3.2 Model framework

The model structure is as follows:

[Figure: F-PointNet model architecture]
The network is divided into three parts: the first uses the image to detect objects and generate frustum regions; the second performs point cloud instance segmentation inside the frustum; the third regresses the 3D bounding box of the segmented object.

3.3 Generating the frustum region from the image

Since the detected object is not necessarily at the exact center of the image, the axis of the generated frustum does not necessarily coincide with the camera's coordinate axis, as shown in (a) in the figure below. To make the network more rotation invariant, the frustum is rotated so that the camera Z axis coincides with the frustum axis, as shown in (b).

[Figure: (a) frustum before rotation, (b) frustum axis aligned with the camera Z axis]
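As an illustration of this rotation normalization, here is a minimal sketch (an assumption, not the paper's code) that rotates a frustum point cloud about the camera's vertical axis so that the frustum's central viewing direction aligns with the Z axis; computing the frustum angle itself (e.g., from the 2D box center and camera intrinsics) is left out:

```python
import numpy as np

def rotate_frustum_to_center(points, frustum_angle):
    """Rotate points about the camera Y (up) axis so the frustum center ray
    aligns with the +Z axis.

    points:        (N, 3) array in camera coordinates (z forward)
    frustum_angle: angle (radians) between the frustum center ray and the +Z axis,
                   measured in the X-Z plane.
    """
    c, s = np.cos(-frustum_angle), np.sin(-frustum_angle)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return points @ rot_y.T

# Usage (hypothetical numbers): a frustum whose center ray is 0.3 rad off the Z axis
pts = np.random.rand(1000, 3)
normalized = rotate_frustum_to_center(pts, frustum_angle=0.3)
```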

3.4 Point cloud instance segmentation within the frustum

Instance segmentation uses PointNet. Only one object is extracted per frustum, because each frustum is generated from one bounding box in the image, and a bounding box contains only one complete object.

Rotation invariance was handled when generating the frustum. After the segmentation step, translation invariance must also be considered: the centroid of the segmented object generally does not coincide with the camera origin, and since the object we work with is a point cloud, the origin should be translated to the object, as shown in (c) in the figure below.
[Figure: (c) coordinates centered at the centroid of the segmented points]
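A tiny sketch of this centering step (illustrative, with placeholder data): subtract the centroid of the segmented points so the object's local frame is centered on its mask centroid:

```python
import numpy as np

# seg_points: (M, 3) points predicted by instance segmentation to belong to the object
seg_points = np.random.rand(500, 3)            # placeholder data
mask_centroid = seg_points.mean(axis=0)        # centroid of the segmented points
centered_points = seg_points - mask_centroid   # translate the origin to the object
```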

3.5 Generating accurate bounding boxes

Network structure for generating accurate bounding boxes:
[Figure: bounding box estimation network (T-Net followed by box regression)]
This structure shows that a T-Net is applied before the bounding box is generated. Its role is to predict a translation: the object center obtained in the previous step is not fully accurate, so the object's centroid is adjusted once more here in order to estimate the bounding box more precisely, as shown in (d) in the figure below.

[Figure: (d) coordinates centered at the predicted object center after the T-Net adjustment]
Next comes the bounding box regression. A bounding box has seven parameters in total:
center: $c_x, c_y, c_z$
size (length, width, height): $l, w, h$
heading angle: $\theta$

The final predicted center is the sum of the mask centroid and the residuals predicted by the T-Net and the box regression network, and the loss function is constructed accordingly:

$$C_{pred} = C_{mask} + \Delta C_{t\text{-}net} + \Delta C_{box\text{-}net}$$
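A tiny numerical sketch of this residual accumulation (illustrative, with hypothetical values and variable names):

```python
import numpy as np

# Each term is an (x, y, z) vector; the numbers are hypothetical placeholders.
c_mask = np.array([1.20, 0.05, 8.40])         # centroid of the segmented points
delta_tnet = np.array([0.10, -0.02, 0.30])    # residual predicted by the T-Net
delta_boxnet = np.array([0.03, 0.01, -0.05])  # residual predicted by the box network

c_pred = c_mask + delta_tnet + delta_boxnet   # final predicted box center
```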

3.6 Key points of F-PointNet

3.6.1 F-PointNet uses 2D RGB images
F-PointNet uses 2D RGB images for the following reasons:

1. At the time, 3D object detection based on pure 3D point cloud data did not perform well on small objects. F-PointNet therefore first performs 2D object detection on the RGB image to locate the target, and then, based on the 2D detection result, regresses the 3D bounding box inside the corresponding point cloud frustum to achieve 3D object detection.

2. With pure 3D point cloud data the amount of computation would be particularly large; efficiency is another advantage of this method.

A mature 2D CNN object detector (Mask R-CNN) is used to generate the 2D detection box and output a one-hot class vector (i.e., classification based on the 2D RGB image).

3.6.2 Frustum proposal generation

The 2D detection box is combined with depth information: the nearest and farthest planes containing the detection box define the 3D frustum region (the frustum proposal). All 3D points inside the frustum proposal are then collected to form the frustum point cloud.
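A minimal sketch of collecting the frustum point cloud (an assumption about one straightforward implementation, not the paper's code): project the lidar points into the image using hypothetical calibration matrices and keep the points whose projections fall inside the 2D box:

```python
import numpy as np

def extract_frustum_points(points_xyz, box2d, P, Tr_velo_to_cam):
    """Keep the 3D points whose image projections fall inside a 2D box.

    points_xyz:     (N, 3) lidar points
    box2d:          (xmin, ymin, xmax, ymax) 2D detection box in pixels
    P:              (3, 4) camera projection matrix (hypothetical calibration)
    Tr_velo_to_cam: (4, 4) lidar-to-camera transform (hypothetical calibration)
    """
    N = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((N, 1))])   # homogeneous lidar points
    pts_cam = (Tr_velo_to_cam @ pts_h.T).T              # (N, 4) in the camera frame
    in_front = pts_cam[:, 2] > 0                         # only points in front of the camera
    uvw = (P @ pts_cam.T).T                               # project to the image plane
    uv = uvw[:, :2] / uvw[:, 2:3]                         # pixel coordinates (invalid ones filtered below)
    xmin, ymin, xmax, ymax = box2d
    in_box = (uv[:, 0] >= xmin) & (uv[:, 0] <= xmax) & \
             (uv[:, 1] >= ymin) & (uv[:, 1] <= ymax)
    return points_xyz[in_front & in_box]                  # the frustum point cloud
```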

3.7 Experimental results

Comparison with other models:

[Figure: quantitative comparison with other methods]

[Figure: additional quantitative comparison]

Model results:

[Figure: visualization of detection results]

3.8 Advantages

(1) Abandoning the global search improves detection efficiency; precise localization of the 3D proposal is achieved step by step (2D to 3D) through the 2D detector and the 3D instance segmentation PointNet, which greatly shortens the search over the point cloud. The figure below shows the search range being reduced from 9-55 m to 12-16 m through 3D instance segmentation.

[Figure: search range reduction through 3D instance segmentation]
(2) Compared with 3D detection on BEV (bird's eye view) projections, F-PointNet directly processes the raw point cloud without any loss of dimensional information. Using PointNet allows it to learn more complete spatial geometric information, and it performs particularly well on small-object detection. The comparison figure comes from Hao Su's course at the beginning of 2018; the current KITTI ranking has changed slightly since then.

(3) Using a mature 2D detector to classify the proposals (providing a one-hot class vector as a label) plays a guiding role and greatly reduces the difficulty for PointNet of learning objects in 3D space.

3.9 Model Code

Open source code: GitHub - charlesq34/frustum-pointnets: Frustum PointNets for 3D Object Detection from RGB-D Data
The author's code runs in the following environment:
System: Ubuntu 14.04 or Ubuntu 16.04
Deep learning framework: TensorFlow 1.2 (GPU version) or TensorFlow 1.4 (GPU version)
Other dependencies: cv2, mayavi, etc.

Copyright statement: This article was originally published with the authorization of its author, a contracted author of the Obi Zhongguang 3D Vision Developer Community, and may not be reproduced without authorization. It is shared for academic purposes only, and the copyright belongs to the original author. If any infringement is involved, please contact us to delete the article.

The 3D Vision Developer Community is a sharing and communication platform created by Obi Zhongguang for all developers, aiming to open 3D vision technology to developers. The platform provides developers with free courses in the field of 3D vision, exclusive Obi Zhongguang resources, and professional technical support. Click to join the 3D Vision Developer Community and discuss and share with other developers.

You can also follow the official WeChat public account "3D Vision Developer Community" to get more useful content.
