3D point cloud semantic segmentation—PointNet++

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Charles R. Qi Li Yi Hao Su Leonidas J. Guibas
Stanford University
The previous article introduced PointNet and its PyTorch implementation for the semantic segmentation task on the S3DIS dataset. Compared with traditional approaches based on hand-crafted features or voxelization, directly processing unordered point clouds was a significant step forward. However, as its network structure shows, PointNet maps each point through a multi-layer perceptron and then uses max pooling to extract a single global feature; even with two MLP stages, the resulting features are single-scale. PointNet++ addresses this by extracting local features at multiple scales.
        
Look directly at the overall network structure:

Hierarchical point set feature learning: the encoder is a stack of set abstraction layers, each consisting of three parts:

1) Sampling: farthest point sampling (FPS) selects the cluster centers. Each new center is the point farthest from all previously selected points, which ensures the samples cover the point set as evenly as possible. Unlike the fixed receptive field of a CNN convolution, this sampling depends on the data itself.

2) Grouping: points in the neighborhood of each sampled center are grouped into a cluster. The author considers two schemes: KNN, which takes the fixed K points closest to the center, and ball query (spherical retrieval), which takes the points within a sphere of a given radius, so K is not fixed but only upper-bounded. Ball query better guarantees a consistent spatial scale for feature extraction in each set abstraction layer, so it is the scheme chosen.

3) PointNet feature extraction: taking each cluster as a unit, the xyz coordinates within the cluster are converted to local coordinates (relative to the center) and fed to a PointNet; the extracted features are concatenated after the coordinate dimensions.

For a set abstraction layer, the input is (N, d+C): N points, each with d-dimensional coordinates and C-dimensional features. After farthest point sampling and grouping, we obtain N1 center points, with each cluster (neighborhood) containing K points. For each cluster, PointNet aggregates the K points into a single feature, so one cluster corresponds to one feature and N1 clusters yield N1 features. The new features are concatenated with the d-dimensional coordinate information, so the output is (N1, d+C1). This output then serves as the input to the next set abstraction layer.
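To make the dimension bookkeeping concrete, here is a NumPy shape walkthrough of one set abstraction layer with illustrative sizes. The grouping uses random indices as a stand-in for FPS + ball query, and a single random linear layer stands in for the shared MLP; this is only a sketch of the tensor shapes, not the real layer.

```python
import numpy as np

N, d, C = 1024, 3, 64        # input: N points, xyz (d=3) + C features
N1, K, C1 = 256, 32, 128     # N1 centroids, K points per cluster, new dim C1

points = np.random.rand(N, d + C)                # (N, d+C)
group_idx = np.random.randint(0, N, (N1, K))     # stand-in for FPS + ball query
clusters = points[group_idx]                     # (N1, K, d+C): one cluster per centroid
W = np.random.rand(d + C, C1)                    # stand-in for the shared MLP
# per-point MLP + max pooling over the K points of each cluster
cluster_feats = np.maximum(clusters @ W, 0).max(axis=1)   # (N1, C1)
centroids_xyz = points[group_idx[:, 0], :d]      # (N1, d) centroid coordinates
out = np.concatenate([centroids_xyz, cluster_feats], axis=1)   # (N1, d+C1)
```

So the layer maps (1024, 67) to (256, 131), exactly the (N, d+C) to (N1, d+C1) pattern described above.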

For the classification network, the final features can simply be flattened and fed into fully connected layers for classification.

For the segmentation network, per-point classification is required, so there is an upsampling process. The author calls this part FP (Feature Propagation): features are propagated from the downsampled points back to the original points, mainly through interpolation and skip connections. The interpolation uses an inverse-distance weighted average over the k nearest neighbor points:

f(x) = ( Σ_{i=1..k} w_i(x) · f_i ) / ( Σ_{i=1..k} w_i(x) ),  where w_i(x) = 1 / d(x, x_i)^p

Here p and k are empirical parameters, set to 2 and 3 respectively in the experiments, and d(·) is the distance function.

The skip connection concatenates the output features of the corresponding earlier set abstraction layer with the interpolated point features. In my understanding, calling this step "interpolation" is slightly misleading: during upsampling, the xyz positions of the points are taken directly from the un-downsampled layer. For each point of that layer, we find its k nearest neighbors (k=3) among the downsampled points and take a distance-weighted sum of their features as this point's feature. The feature dimension after upsampling (interpolation) is therefore the same as the feature dimension before upsampling. For example, in the orange upsampling layer of the network diagram, the feature dimension before upsampling is d+C2; the interpolated features are still d+C2, and concatenating the earlier C1 features gives d+C1+C2.
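A minimal NumPy sketch of this inverse-distance interpolation step (the repo's version is batched PyTorch; the function and argument names here are my own):

```python
import numpy as np

def interpolate_features(xyz_dense, xyz_sparse, feats_sparse, k=3, p=2, eps=1e-8):
    """Inverse-distance-weighted interpolation used in Feature Propagation.
    xyz_dense:   (N, 3) coordinates of the un-downsampled layer.
    xyz_sparse:  (M, 3) coordinates of the downsampled layer.
    feats_sparse:(M, C) features on the downsampled layer.
    Returns (N, C) interpolated features for the dense layer."""
    # pairwise squared distances, shape (N, M)
    d2 = np.sum((xyz_dense[:, None, :] - xyz_sparse[None, :, :]) ** 2, axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest sparse points
    d = np.sqrt(np.take_along_axis(d2, idx, axis=1))  # (N, k) distances
    w = 1.0 / (d ** p + eps)                          # weights 1/d^p
    w = w / w.sum(axis=1, keepdims=True)              # normalize weights
    return np.einsum('nk,nkc->nc', w, feats_sparse[idx])
```

A dense point that coincides with a downsampled point gets essentially that point's feature back, since its weight dominates the other neighbors.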

The resulting features are then passed through a unit PointNet (similar to a 1x1 convolution in a CNN: it changes only the feature dimension, not the number of points; in practice it is an MLP). This process is repeated until the features have been propagated back to the original point set.
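The "unit PointNet" is just a shared MLP applied independently to every point, which is why it is equivalent to a 1x1 convolution. A one-layer NumPy sketch with hypothetical names:

```python
import numpy as np

def unit_pointnet(feats, W, b):
    """One layer of a shared per-point MLP (linear + ReLU).
    feats: (N, C_in) per-point features; W: (C_in, C_out); b: (C_out,).
    The same weights are applied to every point, so N is unchanged."""
    return np.maximum(feats @ W + b, 0.0)
```

Because the same weights are applied to every point, permuting the input points simply permutes the output rows; only the feature dimension changes.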

PointNet++: Robust to non-uniform sampling density

As noted at the beginning, PointNet++ can extract local features at multiple scales. This property is particularly important for point clouds because, during real-world collection, regions close to the sensor are inevitably denser than distant regions, so the density varies across the cloud. Concretely, two multi-scale feature fusion schemes are proposed: MSG (multi-scale grouping), which uses several ball query radii and concatenates the extracted features, and MRG (multi-resolution grouping), which concatenates features obtained from different set abstraction layers. The author calls the combination of the hierarchical network with these density-adaptive structures PointNet++.
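A NumPy sketch of the MSG idea: each centroid's neighborhood is grouped at several radii, a separate pointnet runs per scale, and the per-scale features are concatenated. The sizes are illustrative; grouping is replaced by random indices and each scale's pointnet by one random linear layer, so this only shows the data flow, not the real layer.

```python
import numpy as np

N, S = 1024, 256                            # input points, sampled centroids
radii_dims = {0.1: 32, 0.2: 64, 0.4: 128}   # radius -> feature dim at that scale
xyz = np.random.rand(N, 3)

scale_feats = []
for r, c_out in radii_dims.items():
    K = 16                                       # points per ball at this scale
    group_idx = np.random.randint(0, N, (S, K))  # stand-in for ball query at radius r
    clusters = xyz[group_idx]                    # (S, K, 3)
    W = np.random.rand(3, c_out)                 # stand-in for this scale's MLP
    scale_feats.append(np.maximum(clusters @ W, 0).max(axis=1))  # (S, c_out)

msg_feats = np.concatenate(scale_feats, axis=1)  # (S, 32 + 64 + 128)
```

The concatenated feature carries information from small, medium, and large neighborhoods at once, which is what makes the layer robust when the local density changes.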

Here is the author's comparison of several structures on classification tasks under different data densities:

Here DP refers to random dropout of the input points. The word "vanilla" is quite interesting; I had previously seen something similar in "vanilla convolution". Generally, when "vanilla" is used as an adjective it can mean "ordinary, original, plain", but its literal meaning is "n. vanilla; adj. vanilla-flavored". Since the most common and original flavor of ice cream is vanilla, the figurative meaning derives from that. There is a bit of language and culture in there.

Comparing MSG and MRG, the former performs slightly better, but because it must select neighborhoods at several different radii and feed each into a pointnet, its computational cost is high; the latter only concatenates the outputs of different layers of the original network, so its cost is much lower.

Semantic segmentation experiment:

To compare with PointNet, the same S3DIS dataset is used for semantic segmentation. For the dataset introduction and the data processing part, see the previous article; the code is in the same repository.

As before, read the code in the order: model files, data processing files, training files, test files. Since the data processing, training, and testing files are essentially the same as PointNet's, we will not repeat them here and focus on the model files.

Model files: models\pointnet2_utils.py, models\pointnet2_sem_seg_msg.py

The main components of the network, PointNetSetAbstractionMsg and PointNetFeaturePropagation, together with the helper functions needed to build them, are defined in models\pointnet2_utils.py:

farthest_point_sample (farthest point sampling): input the point coordinates and the number of points to sample; return the indices of the sampled points in the input data. For each batch, a point is randomly chosen as the first sample. A distance table records, for every point, its minimum distance to the already-sampled points (including its distance to itself), and is updated and maintained after each new sample. To decide whether a point is the farthest one (i.e., the next to sample), we compare each point's minimum distance to the previously sampled points; the point with the largest minimum distance is the farthest point.
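A simplified NumPy re-implementation of this logic (the repo's farthest_point_sample is batched PyTorch; this sketch keeps only the core loop, with the same function name for readability):

```python
import numpy as np

def farthest_point_sample(xyz, npoint):
    """Greedy farthest point sampling.
    xyz: (N, 3) point coordinates; npoint: number of centroids to pick.
    Returns the indices of the sampled points."""
    N = xyz.shape[0]
    centroids = np.zeros(npoint, dtype=np.int64)
    # distance[i] = min distance from point i to any centroid chosen so far
    distance = np.full(N, np.inf)
    farthest = np.random.randint(N)              # random first centroid
    for i in range(npoint):
        centroids[i] = farthest
        d = np.sum((xyz - xyz[farthest]) ** 2, axis=1)
        distance = np.minimum(distance, d)       # update the running minimum table
        farthest = int(np.argmax(distance))      # next: largest minimum distance
    return centroids
```

Note that already-sampled points have a minimum distance of zero, so they can never be selected again; sampling all N points therefore returns a permutation of the input.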

For example, to determine which of points 1 and 2 is farther, the value compared is each point's distance to the closest point among the previously sampled points.

index_points: input point data and indices, and return the data corresponding to those indices. (Farthest point sampling and ball query both return indices, so this function is needed to convert indices into data.)

query_ball_point (ball neighborhood search): input the radius, the number of points per neighborhood (K), the coordinates of all points, and the coordinates of the ball centers obtained by farthest point sampling; return the indices of the points grouped into each cluster. It computes the distance from every point to each center; points farther than the given radius are uniformly assigned a large placeholder index (N), then the indices are sorted so out-of-radius points move to the end, and the first K indices are kept (the points within the radius). If the placeholder still appears among the first K (i.e., fewer than K points fall within the radius), those entries are replaced by the first in-radius point, so the shortfall is filled by duplication.
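A NumPy sketch of this placeholder-and-sort trick (the repo's query_ball_point is batched PyTorch; this keeps the same name but simplifies to a single batch):

```python
import numpy as np

def query_ball_point(radius, nsample, xyz, centroids_xyz):
    """Ball query: group up to nsample points within `radius` of each center.
    xyz: (N, 3) all points; centroids_xyz: (S, 3) cluster centers.
    Returns (S, nsample) indices of the grouped points."""
    N = xyz.shape[0]
    S = centroids_xyz.shape[0]
    # squared distance from every centroid to every point: (S, N)
    sqrdists = np.sum((centroids_xyz[:, None, :] - xyz[None, :, :]) ** 2, axis=2)
    group_idx = np.tile(np.arange(N), (S, 1))
    group_idx[sqrdists > radius ** 2] = N            # mark out-of-radius points
    group_idx = np.sort(group_idx, axis=1)[:, :nsample]  # placeholders sort to the end
    # fill any remaining placeholders with the first in-radius point
    first = np.tile(group_idx[:, :1], (1, nsample))
    mask = group_idx == N
    group_idx[mask] = first[mask]
    return group_idx
```

When a ball contains fewer than K points, the duplicated entries mean the subsequent max pooling is unaffected, since max over repeated features equals max over the unique ones.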

PointNetSetAbstractionMsg: the input parameters are the number of sampled center points, the sampling radii, the number of points per cluster, the input feature dimension, and the number of neurons in each MLP layer; it outputs the sampled points and their features. Internally it contains three steps: 1) FPS farthest point sampling to obtain the sampling centers; 2) ball query neighborhood search, finding up to K points within the given radius of each center (if fewer than K are found, the shortfall is filled by duplication); 3) PointNet: each cluster is fed through the MLP and max pooling, finally aggregating the features of each cluster's K points into one feature.

PointNetFeaturePropagation (feature propagation): the input parameters are the input feature dimension and the number of neurons in each layer of the MLP (unit pointnet); it outputs the upsampled (propagated) points and features. Internally it contains three steps: 1) interpolation: the point positions of the upsampled layer are taken directly from the corresponding layer before downsampling in the SA module; for each such point, its k nearest neighbors among the downsampled points are found and their features are distance-weighted and summed (the features here include the xyz information); 2) skip connection: the interpolated features are concatenated with the original features of the un-downsampled layer; 3) unit pointnet, i.e. the MLP.

models\pointnet2_sem_seg_msg.py defines the model and the loss function. Using

torch.rand(6, 9, 2048)  # B C N

as an example input, the model structure and the input/output of each layer are as follows:

Model training

The repository provides a pre-trained model trained without MSG (i.e., SSG), which can be used directly for testing, but the results are not ideal. Since the paper says MSG and MRG give better results, why not train an MSG model ourselves:

python train_semseg.py --model pointnet2_sem_seg_msg --test_area 5 --log_dir pointnet2_sem_seg

The model is relatively large: with batch_size set to 32, training requires about 11 GB of GPU memory, and one epoch takes about two hours on a Titan X.

Model testing

python test_semseg.py --log_dir pointnet2_sem_seg_msg --test_area 5 --visual

After training it myself, I found the results were still not very satisfactory (the mIoU should reasonably be at least 0.5). Checking the repository author's logs, even the SSG scheme achieved a tested mIoU over 50%. Only during testing did I notice that my Area 5 contains just 67 rooms, whereas Area 5 of the dataset should contain 68 (the original eval.txt log in the repository also shows 68 rooms); I could not find the cause of the discrepancy. Below is the less-than-ideal final test result, which is almost the same as PointNet's.

Pick a few rooms at random to visualize:

Raw | Ground Truth | Prediction

        


Origin blog.csdn.net/weixin_42371376/article/details/118291592