PointNeXt: Revisiting PointNet++ with improved training and model scaling strategies

Original link:https://www.techbeat.net/article-info?id=3669
Author: Gordon

Paper link:
https://arxiv.org/abs/2206.04670
Code link (open source):
https://github.com/guochengqian/pointnext

Summary

PointNet++ is one of the most influential neural network models for point cloud understanding. Although PointNet++ has been surpassed in performance by recent methods such as PointMLP and Point Transformer, we found that the performance improvements of these methods largely come from better training strategies (data augmentation and optimization methods) and larger models, rather than from architectural innovations.

In this work, we re-explore PointNet++ through a systematic study of training strategies and model scaling strategies. Our main contributions are as follows:

  1. We propose a set of more effective training strategies that significantly improve the performance of PointNet++ on various datasets. For example, without changing the network structure, the overall accuracy (OA) of PointNet++ on ScanObjectNN increases from 77.9% to 86.1%, even surpassing the current best model, PointMLP;

  2. We introduce an inverted residual bottleneck and separable MLPs into PointNet++ and propose a variant architecture, PointNeXt, to enable more efficient model scaling. PointNeXt surpasses SOTA in both point cloud classification and segmentation tasks. On the ScanObjectNN classification dataset, PointNeXt achieves 87.8% OA, 2.3 points higher than the SOTA model PointMLP, while being 10 times faster in inference. On the S3DIS semantic segmentation dataset, PointNeXt reaches 74.9% mIoU (6-fold), surpassing the SOTA model Point Transformer (73.5% mIoU).

Introduction

Most work on 3D point clouds focuses on developing elaborate modules to extract local details of point clouds, such as the pseudo-grid convolution in KPConv [3] and the self-attention layers in Point Transformer [4]. These newer methods far outperform the classic point cloud understanding network PointNet++ on various tasks, giving the impression that PointNet++ is too simple to learn complex point cloud representations. In this work, we found that what limits the performance of PointNet++ is not its network modules, but its outdated training and model scaling strategies.

First, we find that most of the performance gains of SOTA methods originate from improved training strategies (i.e., data augmentation and optimization techniques). For example, randomly dropping color information during training can improve performance on S3DIS by 5 points of mIoU. Unfortunately, compared with improvements to network architectures, advances in training strategies are rarely reported or studied publicly.

Second, another major source of performance gain in SOTA methods is increased model size. However, we found that simply increasing the number of convolutions and the channel size of PointNet++ does not improve accuracy. Therefore, the model scaling strategy, that is, how to effectively increase the depth (more convolutional layers) and width (larger channel size) of the model, is a topic worth studying.

Based on the above two points, this article has made the following contributions:

  • We conducted the first systematic study of training strategies in the point cloud field and showed that the performance of PointNet++ can be greatly improved simply by adopting improved training strategies. For example, OA on the ScanObjectNN object classification task increases by 8.2 points, and mIoU on S3DIS semantic segmentation increases by 13.6 points. The improved training strategies are general and can easily be applied to improve the performance of other networks (such as PointNet [1], DGCNN [9], and PointMLP [17]).

  • We propose PointNeXt, the next version of PointNets. Compared with PointNet++, PointNeXt achieves larger accuracy gains as the model is scaled up. PointNeXt surpasses SOTA on all studied tasks, including 3D object classification (ScanObjectNN OA 87.8%, vs. 85.7% for PointMLP), semantic segmentation (S3DIS 6-fold mIoU 74.9%, vs. 73.5% for Point Transformer), and object part segmentation (ShapeNetPart mIoU 87.2%, vs. 86.8% for CurveNet), while being faster than SOTA in inference.

Background knowledge: PointNet++

Our PointNeXt is built on PointNet++, which uses a U-Net structure consisting of an encoder and a decoder. The encoder uses a series of Set Abstraction (SA) modules to hierarchically abstract the features of the point cloud, while the decoder uses the same number of Feature Propagation modules to upsample the features. The SA module consists of a downsampling layer (Subsampling), a neighbor query layer (Grouping), a set of shared multi-layer perceptrons (MLPs) for feature extraction, and a Reduction layer for aggregating features within each neighborhood. The combination of the Grouping, MLPs, and Reduction layers can be expressed as:

$$\mathbf{x}_i^{l+1} = \mathcal{R}_{j:(i, j)\in \mathcal{N}}\left\{h_{\mathbf{\Theta}}\left([\mathbf{x}_j^l; \mathbf{p}_j^l - \mathbf{p}_i^l]\right)\right\}, \qquad (1)$$

where $\mathcal{R}$ is the Reduction layer (e.g., max-pooling), which aggregates the features of the neighbors $\{j:(i, j)\in \mathcal{N}\}$ of point $i$. $\mathbf{p}_i^l$, $\mathbf{x}_i^l$, and $\mathbf{x}_j^l$ are, respectively, the input coordinates of point $i$, the input features of point $i$, and the features of its neighbor $j$ in the $l$-th layer. $h_{\mathbf{\Theta}}$ denotes the shared MLPs, whose input is the concatenation of $\mathbf{x}_j^l$ and the relative coordinates $(\mathbf{p}_j^l - \mathbf{p}_i^l)$ along the feature dimension.
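As a rough illustration of Equation (1), the Grouping, shared-MLP, and Reduction steps can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the helper names (`ball_query`, `shared_mlp`, `set_abstraction`), the padding convention, the random weights, and the random choice of centers (used here as a stand-in for the Subsampling layer) are all assumptions made for illustration.

```python
import numpy as np

def ball_query(points, centers, radius, k):
    """Indices (M, k) of up to k points within `radius` of each center.

    Assumes every center is itself one of `points`, so each ball is non-empty.
    Unfilled slots are padded with the first neighbor found (an assumption).
    """
    dists = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    idx = np.zeros((centers.shape[0], k), dtype=np.int64)
    for i, row in enumerate(dists):
        inside = np.flatnonzero(row <= radius)[:k]
        pad = np.full(k - inside.size, inside[0])
        idx[i] = np.concatenate([inside, pad])
    return idx

def shared_mlp(x, weights):
    """h_Theta: the same linear + ReLU layers applied to every neighbor."""
    for w, b in weights:
        x = np.maximum(x @ w + b, 0.0)
    return x

def set_abstraction(points, feats, center_idx, radius, k, weights):
    centers = points[center_idx]                          # (M, 3)   p_i
    idx = ball_query(points, centers, radius, k)          # (M, k)   Grouping
    rel = points[idx] - centers[:, None, :]               # (M, k, 3) p_j - p_i
    grouped = np.concatenate([feats[idx], rel], axis=-1)  # [x_j ; p_j - p_i]
    return shared_mlp(grouped, weights).max(axis=1)       # Reduction R = max

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(64, 3))
feats = rng.normal(size=(64, 4))
center_idx = rng.choice(64, size=16, replace=False)  # stand-in for Subsampling
weights = [(rng.normal(size=(7, 8)), np.zeros(8)),
           (rng.normal(size=(8, 16)), np.zeros(16))]
out = set_abstraction(points, feats, center_idx, radius=0.5, k=8, weights=weights)
print(out.shape)  # (16, 16)
```

Each of the 16 sampled centers ends up with one 16-dimensional feature vector, obtained by max-pooling over its 8 grouped neighbors.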

Methodology: From PointNet++ to PointNeXt

In this section, we demonstrate how to improve the performance of PointNet++ through more advanced training strategies and model scaling strategies. We introduce them in two sections: (1) Training strategy modernization; (2) Network architecture modernization.

Figure 1. PointNeXt network structure.

PointNeXt has the same Set Abstraction and Feature Propagation modules as PointNet++. The red boxes mark PointNeXt's improvements over PointNet++: an additional MLP layer at the model input, the Inverted Residual MLP (InvResMLP) module for scaling the model architecture, and a decoder whose channel sizes are symmetric to those of the encoder.

Modernizing training strategies

In this section, we systematically quantify the effect of each data augmentation and optimization strategy. Here we only outline our methodology; the specific training strategies can be found in the ablation experiments section below.

Data augmentation

Data augmentation is one of the most important ways to improve the performance of neural networks. PointNet++ uses simple combinations of augmentations such as random rotation, scaling, translation, and jitter, applied to different datasets. Some recent methods use stronger augmentations; for example, KPConv [3] randomly drops part of the color information during training. In this work, we collect the common data augmentation methods used in recent work and quantify the effect of each one on each dataset through overlay experiments. For each dataset, we propose a set of improved data augmentation methods that significantly improves the performance of PointNet++.
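To make the classic augmentations named above concrete (random rotation, scaling, translation, and jitter), here is a minimal NumPy sketch. The function names and every numeric range (scale range, shift, noise level, clipping, rotation about the z-axis) are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def rotate_z(xyz, angle):
    """Rotate points (N, 3) about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return xyz @ rot.T

def augment(xyz, rng, scale_range=(0.9, 1.1), shift=0.1, sigma=0.01, clip=0.05):
    """Apply rotation, scaling, translation, and jitter in sequence.

    All default ranges are hypothetical placeholders.
    """
    xyz = rotate_z(xyz, rng.uniform(0.0, 2.0 * np.pi))   # random rotation
    xyz = xyz * rng.uniform(*scale_range)                # random isotropic scaling
    xyz = xyz + rng.uniform(-shift, shift, size=(1, 3))  # random translation
    noise = np.clip(rng.normal(scale=sigma, size=xyz.shape), -clip, clip)
    return xyz + noise                                   # per-point jitter

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))
print(augment(cloud, rng).shape)  # (1024, 3)
```

Each transform is drawn fresh per call, so the same cloud yields a different augmented copy in every training iteration.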

Optimization Strategy

Optimization techniques mainly comprise the loss function, the optimizer, the learning rate scheduler, and hyperparameters. With advances in machine learning theory, modern neural networks can be trained with theoretically better optimizers (such as AdamW) and better loss functions (cross-entropy with label smoothing). Cosine learning rate decay has also been widely adopted in recent years because, compared with step decay, it is simpler to tune and achieves comparable results. In this work, we quantify the impact of each optimization strategy on PointNet++ through overlay experiments. Likewise, for each dataset, we propose a set of improved optimization techniques that further improves network performance.
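The two optimization ingredients mentioned above, cosine learning rate decay and cross-entropy with label smoothing, have simple closed forms. The sketch below writes them out in plain Python and NumPy under stated assumptions: the function names, the base learning rate, and the smoothing factor are illustrative choices, and the smoothed target follows the common convention of mixing the one-hot label with a uniform distribution.

```python
import math
import numpy as np

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr (step 0) to min_lr (step == total_steps)."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos

def smooth_cross_entropy(logits, labels, smoothing=0.2):
    """Mean cross-entropy against (1 - s) * one_hot + s / num_classes.

    The default smoothing value is a placeholder, not taken from the paper.
    """
    n, c = logits.shape
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    target = np.full((n, c), smoothing / c)
    target[np.arange(n), labels] += 1.0 - smoothing
    return float(-(target * log_probs).sum(axis=1).mean())

for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, base_lr=0.01))

logits = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
labels = np.array([0, 1])
print(smooth_cross_entropy(logits, labels))
```

Unlike step decay, the cosine schedule has no milestones or decay factors to choose, only the total number of steps and the two endpoint rates, which is one reason it is simpler to tune.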

Model architecture modernization: small modifications → big improvements

Receptive field scaling

The receptive field is an important design factor of a neural network. There are at least two ways to increase the receptive field.

  • Use a larger radius to query neighbors (radius of ball query),

  • Adopt a hierarchical structure

Since PointNet++ already adopts a hierarchical structure, here we mainly study the impact of the query radius on performance. We found that the initial radius has a large impact on the results, and that the optimal query radius differs across datasets. Furthermore, we find that the relative coordinates $\Delta_p = \mathbf{p}_j^l - \mathbf{p}_i^l$ make network optimization more difficult, causing performance degradation. Therefore, we propose to normalize $\Delta_p$ by dividing the relative coordinates by the query radius. The improved version of Equation (1) is as follows:

$$\mathbf{x}_i^{l+1} = \mathcal{R}_{j:(i, j)\in \mathcal{N}}\left\{h_{\mathbf{\Theta}}\left([\mathbf{x}_j^l; (\mathbf{p}_j^l - \mathbf{p}_i^l) / r^l]\right)\right\}. \qquad (2)$$

Without normalization, the values of the relative coordinates $\Delta_p$ are very small (less than the radius), which requires the network to learn larger weights for $\Delta_p$. This makes optimization difficult, especially since weight decay regularization limits the magnitude of the network weights.
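A small numerical check illustrates the effect of the normalization in Equation (2). The sketch below samples neighbors inside balls of two different radii and compares the magnitude of the raw and normalized relative coordinates; the helper names, the radii, and the sampling scheme are assumptions made only for this illustration.

```python
import numpy as np

def relative_coords(neighbors, center, radius, normalize=True):
    """Delta_p = p_j - p_i, optionally divided by the query radius r."""
    delta = neighbors - center
    return delta / radius if normalize else delta

def sample_ball(center, radius, n, rng):
    """Draw n points inside a ball of the given radius (hypothetical helper)."""
    d = rng.normal(size=(n, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    return center + d * rng.uniform(0.0, radius, size=(n, 1))

rng = np.random.default_rng(0)
center = np.array([1.0, 2.0, 3.0])
for radius in (0.1, 0.4):
    nbrs = sample_ball(center, radius, 1000, rng)
    raw = np.abs(relative_coords(nbrs, center, radius, normalize=False)).max()
    norm = np.abs(relative_coords(nbrs, center, radius)).max()
    print(f"r={radius}: max |raw|={raw:.3f}, max |normalized|={norm:.3f}")
```

The raw offsets are bounded by the radius and therefore change scale from stage to stage, while the normalized offsets always lie in the unit ball, so the MLP sees inputs of comparable magnitude regardless of the radius.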

Model scaling

PointNet++ is a relatively small network: the models it uses for classification and segmentation have fewer than 2M parameters, whereas current networks generally have more than 10M [3, 4]. Interestingly, we found that neither using more SA modules nor using a larger channel size significantly improves accuracy; instead, both cause a significant drop in throughput. The lack of accuracy gains is mainly due to vanishing gradients and overfitting. Therefore, in this subsection, we study how to scale PointNet++ in an effective and efficient manner.

We propose the Inverted Residual MLP (InvResMLP) module to achieve efficient and practical model scaling. This module is built on the SA module, as shown in the middle of Figure 1. There are three differences between the InvResMLP and SA modules:

  • To alleviate the vanishing gradient problem [21] (especially when the network is deeper), we add a residual connection between the input and output of the module.

  • To reduce computation and strengthen point-wise feature extraction, we introduce separable MLPs. The 3 MLP layers in SA all operate on neighborhood features. InvResMLP splits them into 1 layer that acts on neighborhood features (between the Grouping layer and the Reduction layer) and 2 layers that act on point features (after the Reduction layer).

  • We adopt the inverted bottleneck design [23], expanding the output channels of the second MLP by 4 times to improve the feature extraction capability.
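The three differences listed above can be put together in a small NumPy sketch of an InvResMLP-style block. It is a sketch under assumptions rather than the authors' code: k-nearest neighbors replaces ball query for brevity, the weights are random, normalization layers are omitted, and the placement of the final activation after the residual sum is an assumed detail. The names `inv_res_mlp`, `init_params`, and `knn_indices` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def knn_indices(points, k):
    """Indices (N, k) of each point's k nearest neighbors (stand-in for ball query)."""
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def init_params(c, expansion=4, seed=0):
    """Random weights for a block with C channels and a C -> 4C -> C bottleneck."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(scale=0.1, size=(c + 3, c)), "b1": np.zeros(c),
        "w2": rng.normal(scale=0.1, size=(c, expansion * c)), "b2": np.zeros(expansion * c),
        "w3": rng.normal(scale=0.1, size=(expansion * c, c)), "b3": np.zeros(c),
    }

def inv_res_mlp(points, feats, params, k=8, radius=1.0):
    idx = knn_indices(points, k)                          # Grouping
    rel = (points[idx] - points[:, None]) / radius        # normalized p_j - p_i
    grouped = np.concatenate([feats[idx], rel], axis=-1)  # (N, k, C + 3)
    # Separable MLP, part 1: a single layer on neighborhood features, then Reduction.
    h = relu(grouped @ params["w1"] + params["b1"]).max(axis=1)  # (N, C)
    # Separable MLP, part 2: two point-wise layers with an inverted bottleneck.
    h = relu(h @ params["w2"] + params["b2"])             # (N, 4C) expand
    h = h @ params["w3"] + params["b3"]                   # (N, C) project back
    return relu(feats + h)                                # residual connection

rng = np.random.default_rng(0)
points = rng.uniform(size=(32, 3))
feats = rng.normal(size=(32, 16))
out = inv_res_mlp(points, feats, init_params(16))
print(out.shape)  # (32, 16)
```

Because only the first layer touches the (N, k, C + 3) neighborhood tensor, the two wider layers run on N rows instead of N × k rows, which is where the computational saving of the separable design comes from.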

Based on PointNet++, combined with InvResMLP and the macro-architectural changes shown in Figure 1, we propose PointNeXt. We denote the channel size of the stem MLP as C and the number of InvResMLP modules as B. By changing the values of C and B, PointNeXt can be scaled in both width and depth.

When B = 0, only one SA module is used per stage and no InvResMLP module is used. The SA module has 2 MLP layers, and a residual connection (Residual Connection) is added inside each SA module.

When B ≠ 0, the InvResMLP module is appended after the original SA module. In this case, the SA module has only 1 MLP to reduce computing resources. The configuration of our PointNeXt series is summarized below:

  • PointNeXt-S: C = 32, B = 0

  • PointNeXt-B: C = 32, B = (1, 2, 1, 1)

  • PointNeXt-L: C = 32, B = (2, 4, 2, 2)

  • PointNeXt-XL: C = 64, B = (3, 6, 3, 3)
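The four configurations above can be captured in a small data structure, which makes the depth and width of each variant easy to compare. This is only an illustrative sketch: the class and field names are hypothetical, and writing B = 0 as four zero entries assumes that PointNeXt-S has the same four stages as the other variants.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PointNeXtConfig:
    name: str
    stem_channels: int        # C: channel size of the stem MLP
    blocks: Tuple[int, ...]   # B: number of InvResMLP blocks per stage

    @property
    def total_inv_res_blocks(self) -> int:
        return sum(self.blocks)

CONFIGS = {
    # B = 0 written per stage; the four-stage layout is an assumption.
    "S": PointNeXtConfig("PointNeXt-S", 32, (0, 0, 0, 0)),
    "B": PointNeXtConfig("PointNeXt-B", 32, (1, 2, 1, 1)),
    "L": PointNeXtConfig("PointNeXt-L", 32, (2, 4, 2, 2)),
    "XL": PointNeXtConfig("PointNeXt-XL", 64, (3, 6, 3, 3)),
}

for cfg in CONFIGS.values():
    print(f"{cfg.name}: C={cfg.stem_channels}, InvResMLP blocks={cfg.total_inv_res_blocks}")
# PointNeXt-S: C=32, InvResMLP blocks=0
# PointNeXt-B: C=32, InvResMLP blocks=5
# PointNeXt-L: C=32, InvResMLP blocks=10
# PointNeXt-XL: C=64, InvResMLP blocks=15
```

From S to L only the depth grows, while XL additionally doubles the stem width.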

Experiments

PointNeXt surpasses existing SOTA methods on all studied datasets and performs well in both accuracy and efficiency. On S3DIS semantic segmentation, PointNeXt-XL surpasses Point Transformer [4] with mIoU/OA/mAcc = 74.9%/90.3%/83.0% while being faster in inference. On ScanObjectNN classification, PointNeXt-S surpasses the current SOTA method PointMLP [17] with ten times faster inference. On ShapeNetPart part segmentation, the widened model PointNeXt-S (C=160) reaches 87.2 instance mIoU, surpassing the SOTA method CurveNet. It is worth mentioning that PointNeXt is the first point-based point cloud recognition algorithm to cross the 87% threshold.
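For readers unfamiliar with the three segmentation metrics quoted here, the sketch below computes overall accuracy (OA), mean class accuracy (mAcc), and mean intersection over union (mIoU) from a confusion matrix using their standard definitions. The function name and the toy matrix are assumptions for illustration, and the code assumes every class occurs at least once so that no division by zero arises.

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (OA, mAcc, mIoU) from a confusion matrix.

    conf[g, p] counts points with ground-truth class g predicted as class p.
    """
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                        # correctly classified points per class
    gt = conf.sum(axis=1)                     # ground-truth points per class
    pred = conf.sum(axis=0)                   # predicted points per class
    oa = tp.sum() / conf.sum()                # overall accuracy
    macc = np.mean(tp / gt)                   # mean per-class accuracy
    miou = np.mean(tp / (gt + pred - tp))     # mean intersection over union
    return oa, macc, miou

# Toy two-class example (made-up counts, not results from the paper).
conf = np.array([[8, 2],
                 [1, 9]])
oa, macc, miou = segmentation_metrics(conf)
print(f"OA={oa:.4f}, mAcc={macc:.4f}, mIoU={miou:.4f}")
# OA=0.8500, mAcc=0.8500, mIoU=0.7386
```

mIoU penalizes both false positives and false negatives per class, which is why it is usually the lowest of the three numbers.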

Table 1. Results of semantic segmentation on the S3DIS dataset (6-fold cross-validation). The improvement of PointNeXt over PointNet++ is marked in green.

Table 2. Classification results on the ScanObjectNN dataset (hardest variant).

Table 3. Segmentation results on the ShapeNetPart dataset.


Ablation experiments

Tables 4, 5, and 6 show the overlay experimental results for each training strategy (data augmentation, optimization method) and scaling strategy on the ScanObjectNN, S3DIS, and ShapeNetPart datasets, respectively.

Through overlay experiments, we quantified the impact of each strategy on model performance improvement, demonstrated the step-by-step improvement process of model performance, and provided some inspiration for future research on training strategies and scaling strategies.

For data augmentation, we draw the following conclusions:

  • Data scaling, such as using the entire point cloud as input on S3DIS and point resampling on ScanObjectNN (randomly sampling 1024 points from 2048 points as input during training), steadily improves network performance.

  • Height appending, i.e., using the per-point height as an additional input feature, improves network performance, especially for object classification tasks.

  • Color dropping, i.e., randomly dropping color information, greatly improves the segmentation results on S3DIS (+5.9 mIoU).

  • Larger models favor stronger data augmentation. For example, randomly rotating point clouds on S3DIS reduces the accuracy of small models (PointNet++, PointNeXt-S) but improves the accuracy of large models (PointNeXt-{B, L, XL}).
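The three augmentations singled out in this list (point resampling, height appending, and color dropping) are each only a few lines of NumPy. The sketch below is illustrative: the function names, the choice of up-axis, measuring height from the lowest point, zeroing colors for the whole cloud at once, and the drop probability are assumptions rather than details taken from the paper.

```python
import numpy as np

def resample_points(cloud, n_out, rng):
    """Randomly keep n_out of the input points (e.g. 1024 out of 2048)."""
    idx = rng.choice(cloud.shape[0], size=n_out, replace=False)
    return cloud[idx]

def append_height(xyz, feats, up_axis=2):
    """Append per-point height, measured from the lowest point, as an extra feature.

    The up-axis depends on the dataset convention; axis 2 is an assumption.
    """
    height = xyz[:, up_axis:up_axis + 1] - xyz[:, up_axis].min()
    return np.concatenate([feats, height], axis=1)

def drop_color(colors, rng, p=0.2):
    """With probability p, zero out the colors of the whole cloud (p is a placeholder)."""
    return np.zeros_like(colors) if rng.random() < p else colors

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(2048, 6))           # columns: x, y, z, r, g, b
sample = resample_points(cloud, 1024, rng)
xyz, rgb = sample[:, :3], sample[:, 3:]
feats = append_height(xyz, drop_color(rgb, rng))
print(feats.shape)  # (1024, 4)
```

Dropping colors for some training samples forces the network to rely on geometry as well, which is one plausible reading of why it helps on S3DIS.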

For the optimization method, we have the following conclusions:

  • AdamW + SmoothCrossEntropy is a stronger optimization combination

  • Cosine learning rate decay is easier to tune than step decay and can also achieve SOTA performance.

For the scaling strategy, we have the following conclusions:

  • Each model improvement proposed (each component in InvResMLP, and the improvement of the macro network structure) improves the performance of the model;

  • The improved model is more scalable. Compared with naive scaling (increasing the depth and breadth of PointNet++), our model has significant performance and inference speed improvements.

Table 4. Overlay experiments applying training and scaling strategies sequentially on the ScanObjectNN classification dataset.

Table 5. Overlay experiments applying training and scaling strategies sequentially on the S3DIS segmentation dataset. Light green, purple, yellow, and pink background colors denote data augmentation, optimization techniques, receptive field scaling, and model scaling, respectively.

Table 6. Overlay experiments applying training and scaling strategies sequentially on the ShapeNetPart segmentation dataset.


In addition, as shown in Table 7, we applied the training strategies to different neural networks to test their generalizability. We found that the improved training strategies generalize well: they can be applied to other algorithms, such as PointNet [1], DGCNN [9], and PointMLP [17], and significantly improve their performance on the ScanObjectNN classification task.

Table 7. Generalizability of the improved training strategies. Experiments were conducted with different methods on the ScanObjectNN classification dataset using the improved training strategies.


We performed ablation experiments on model scaling using the best-performing model, PointNeXt-XL, as the baseline. We studied the impact of each component of InvResMLP and of the stage ratio on performance on S3DIS Area 5. We also compared against naive scaling.

Table 8. Ablation experiments on model structure changes on S3DIS Area 5. "-" denotes removing the module from the baseline; TP stands for throughput.


Discussion and conclusion

In this paper, we show that with better training and model scaling strategies, the classic PointNet++ model can surpass SOTA. We quantify the effectiveness of each data augmentation and optimization technique and propose a set of improved training strategies. These strategies can be applied not only to PointNeXt but also to other representative algorithms (such as PointNet, DGCNN, and PointMLP), significantly improving their performance. We further propose PointNeXt, a scaling-friendly network architecture obtained by making small modifications to PointNet++. PointNeXt demonstrates SOTA performance on multiple datasets, scales easily, and maintains fast inference. We hope that the findings of this work will encourage researchers to pay more attention to training and model scaling strategies and inspire more research in similar directions.

References

[1] Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652-660).

[2] Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30.

[3] Thomas, H., Qi, C. R., Deschaud, J. E., Marcotegui, B., Goulette, F., & Guibas, L. J. (2019). Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6411-6420).

[4] Zhao, H., Jiang, L., Jia, J., Torr, P. H., & Koltun, V. (2021). Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16259-16268).

[5] Uy, M. A., Pham, Q. H., Hua, B. S., Nguyen, T., & Yeung, S. K. (2019). Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1588-1597).

[6] Armeni, I., Sener, O., Zamir, A. R., Jiang, H., Brilakis, I., Fischer, M., & Savarese, S. (2016). 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1534-1543).

[7] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., … & Yu, F. (2015). Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012.

[8] Li, Y., Bu, R., Sun, M., Wu, W., Di, X., & Chen, B. (2018). Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31.

[9] Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., & Solomon, J. M. (2019). Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5), 1-12.

[10] Li, G., Muller, M., Thabet, A., & Ghanem, B. (2019). Deepgcns: Can gcns go as deep as cnns?. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9267-9276).

[11] Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., … & Markham, A. (2020). Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11108-11117).

[12] Qiu, S., Anwar, S., & Barnes, N. (2021). Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1757-1767).

[13] Tang, L., Zhan, Y., Chen, Z., Yu, B., & Tao, D. (2022). Contrastive Boundary Learning for Point Cloud Segmentation. arXiv preprint arXiv:2203.05272.

[14] Qiu, S., Anwar, S., & Barnes, N. (2021). Dense-resolution network for point cloud classification and segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3813-3822).

[15] Qiu, S., Anwar, S., & Barnes, N. (2021). Geometric back-projection network for point cloud classification. IEEE Transactions on Multimedia.

[16] Cheng, S., Chen, X., He, X., Liu, Z., & Bai, X. (2021). Pra-net: Point relation-aware network for 3d point cloud analysis. IEEE Transactions on Image Processing, 30, 4436-4448.

[17] Ma, X., Qin, C., You, H., Ran, H., & Fu, Y. (2022). Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123.

[18] Xiang, T., Zhang, C., Song, Y., Yu, J., & Cai, W. (2021). Walk in the cloud: Learning curves for point clouds shape analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 915-924).

[19] Qian, G., Hammoud, H. A. A. K., Li, G., Thabet, A., & Ghanem, B. (2021). Anisotropic Separable Set Abstraction for Efficient Point Cloud Representation Learning. arXiv preprint arXiv:2110.10538.

[20] Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., … & Jia, J. (2022). Stratified Transformer for 3D Point Cloud Segmentation. arXiv preprint arXiv:2203.14508.

[21] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

[22] Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.

[23] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
