Overview of the application of Transformers to 3D point clouds (detection / tracking / segmentation / denoising / completion)

1 Abstract

Transformers have recently been at the heart of natural language processing (NLP) and computer vision (CV). Their great success in NLP and CV has inspired researchers to explore their use in point cloud processing. But how does the Transformer cope with the irregularity and unordered nature of point clouds? How well does it suit different 3D representations such as points or voxels? How competent is it at the various 3D processing tasks? To date, there has been no systematic survey of research on these issues. The paper provides a comprehensive overview of Transformer algorithms for 3D point cloud analysis. First, the theory of the Transformer architecture is introduced and its applications in the 2D/3D fields are reviewed. Then, three different taxonomies (based on implementation, data representation, and task) are proposed, which classify current Transformer-based methods from multiple perspectives. In addition, the paper surveys variants and improvements of the self-attention mechanism in 3D. To demonstrate the merits of Transformers in point cloud analysis, the paper presents a comprehensive comparison of Transformer-based classification, segmentation, and object detection methods. Finally, three potential research directions are proposed as useful references for the development of 3D Transformers.

2 Introduction

The Transformer, as an encoder-decoder architecture, is now the dominant neural network structure in NLP. Given its powerful ability to model long-range dependencies, it has also been successfully applied in CV [1]-[3] fields such as autonomous driving, visual computing, intelligent monitoring, and industrial inspection. A standard Transformer encoder usually consists of six main components, as shown in the following figure:

  • Input (word) embedding;

  • Positional encoding;

  • Self-attention;

  • Normalization;

  • Feed-forward network;

  • Skip connection.

[Figure: architecture of a standard Transformer encoder]
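To make the six components above concrete, here is a minimal PyTorch sketch of such an encoder; the layer sizes, depth, and post-norm ordering are illustrative choices of ours, not taken from any particular 3D Transformer.

```python
# Minimal sketch of a standard Transformer encoder (PyTorch).
# Sizes and ordering are illustrative, not from any specific 3D Transformer.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=256, heads=4, ff_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # self-attention
        self.norm1 = nn.LayerNorm(dim)                                   # normalization
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, dim))                  # feed-forward network
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, dim) token features
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)           # skip connection + normalization
        x = self.norm2(x + self.ff(x))         # skip connection + normalization
        return x

class TinyEncoder(nn.Module):
    def __init__(self, in_dim=3, dim=256, depth=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)    # input embedding (3D coordinates instead of words)
        self.pos = nn.Linear(in_dim, dim)      # learned positional encoding from coordinates
        self.blocks = nn.ModuleList([EncoderBlock(dim) for _ in range(depth)])

    def forward(self, xyz):                    # xyz: (B, N, 3)
        x = self.embed(xyz) + self.pos(xyz)
        for blk in self.blocks:
            x = blk(x)
        return x

feats = TinyEncoder()(torch.rand(2, 1024, 3))  # (2, 1024, 256) per-point features
```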

As for the Transformer decoder, it is usually designed as a mirror image of the encoder, except that it additionally takes the latent features of the encoder as input. However, for 3D point cloud applications the decoder can be specially designed (i.e., not a pure Transformer) for dense prediction tasks such as part segmentation and semantic segmentation in 3D point cloud analysis. 3D vision researchers often adopt PointNet++ [4] or a convolutional backbone that contains a Transformer module.

Thanks to its excellent global feature learning capability and its permutation-equivariant operations, the Transformer is inherently well suited to point cloud processing and analysis. Researchers have already proposed many 3D Transformer backbones for point cloud classification and segmentation (see the figure below) [7], [12], [34], [37], [69], [70], as well as detection [35], [54], tracking [56]-[58], registration [59]-[63], [71], [72], and completion [50], [66]-[68], [73], [74], to name a few. Furthermore, 3D Transformer networks are also used in various practical applications, such as structure monitoring [75], medical image analysis [37], and autonomous driving [76], [77]. A systematic survey of 3D Transformers is therefore necessary.

[Figure: Transformer-based methods for point cloud classification, segmentation, detection, tracking, registration, and completion]

The paper designs three different taxonomies, as shown in the figure below: 1) based on implementation, 2) based on data representation, and 3) based on task. In this way, Transformer networks can be classified and analyzed from multiple perspectives, although these taxonomies are not mutually exclusive. Take Point Transformer (PT) [7] as an example: 1) in terms of Transformer implementation, it belongs to the local Transformer category and operates in local neighborhoods of the target point cloud; 2) in terms of data representation, it is a multi-scale point-based Transformer, hierarchically extracting geometric and semantic features; 3) in terms of 3D tasks, it is designed specifically for point cloud classification and segmentation. Additionally, the paper investigates different variants of self-attention in 3D point cloud processing.

[Figure: the three taxonomies of 3D Transformers]

The main contributions of the paper are as follows:

  • This paper is the first work dedicated to comprehensively surveying Transformers in point clouds for 3D vision tasks;

  • The paper investigates a series of self-attention variants in point cloud analysis, including newly introduced self-attention mechanisms aimed at improving the performance and efficiency of 3D Transformers;

  • The paper compares and analyzes Transformer-based methods on several 3D vision tasks, including 3D shape classification, 3D shape/semantic segmentation, and 3D object detection, on several public benchmarks;

  • The paper introduces readers to the latest advances in Transformer-based point cloud processing and the corresponding SOTA methods.

3 Transformer implementation

In this section, the paper broadly classifies Transformers in 3D point clouds from multiple perspectives. First, in terms of operating scale, 3D Transformers can be divided into Global Transformers and Local Transformers; the operating scale refers to the scope over which the algorithm operates on the point cloud, i.e., the global domain or a local domain. Second, in terms of operating space, 3D Transformers can be divided into Point-wise Transformers and Channel-wise Transformers; the operating space refers to the dimension in which the algorithm computes, i.e., the spatial dimension or the channel dimension. Finally, the paper reviews efficient Transformer networks designed to reduce computational complexity.

3.1 Operation scale

Depending on the operating scale, 3D Transformers can be divided into two categories: Global Transformers and Local Transformers. The former apply the Transformer module to the entire input point cloud to extract global features, while the latter apply the Transformer module to local patches to extract local features.

3.1.1 Global Transformers

Many existing algorithms [8], [10]-[12], [31], [33], [37], [38], [52], [81] focus on the Global Transformer. In a Global Transformer module, every output feature can be connected to all input features. It is equivariant to the permutation of the inputs and is able to learn global contextual features [12].

Following PointNet [5], PCT [12] was proposed as a pure global Transformer network. Taking 3D coordinates as input, PCT first applies a neighborhood embedding structure to map the point cloud into a high-dimensional feature space; this operation also incorporates local information into the embedded features. The features are then fed into four stacked global Transformer modules to learn semantic information. Finally, global features are extracted through global max and average pooling for classification and segmentation. PCT's improved self-attention module is named Offset-Attention.

Compared with the single scale of PCT, a Cross-Level, Cross-Scale, and Cross-Attention Transformer network named 3CROSSNet was proposed in [31]. First, the Farthest Point Sampling (FPS) algorithm [4] is performed on the original input point cloud to obtain three point cloud subsets with different resolutions. Second, stacked shared multi-layer perceptron (MLP) modules are used to extract local features for each sampled point. Third, a Transformer module is applied to each point cloud subset for global feature extraction. Finally, a Cross-Level Cross-Attention (CLCA) module and a Cross-Scale Cross-Attention (CSCA) module are proposed to establish connections between point cloud subsets of different resolutions and features of different levels, modeling long-range inter-level and intra-level dependencies.
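Because FPS [4] reappears throughout the surveyed methods (3CROSSNet above, and PT, Pointformer, and TD-Net later), a short NumPy sketch of it may help; this is the standard greedy formulation, not code from any of the cited papers.

```python
# Farthest Point Sampling (FPS): greedily pick the point farthest from the set
# already selected. Standard formulation; not code from any cited paper.
import numpy as np

def farthest_point_sampling(points, m):
    """points: (N, 3) array; returns indices of m sampled points."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)            # distance to the nearest selected point
    selected[0] = np.random.randint(n)   # arbitrary seed point
    for i in range(1, m):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = np.argmax(dist)    # farthest from all points selected so far
    return selected

pts = np.random.rand(2048, 3)
idx = farthest_point_sampling(pts, 512)
subset = pts[idx]   # repeated FPS with different m gives multi-resolution subsets, as in 3CROSSNet
```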

For other related work [33], [5], please refer to the paper.

3.1.2 Local Transformers

In contrast to the global Transformer, the local Transformer [7], [34], [35], [42], [85], [86] aims to aggregate features within local patches rather than over the entire point cloud.

PT [7] adopts the hierarchical architecture of PointNet++ [4] for point cloud classification and segmentation. It focuses on local patch processing and replaces the shared MLP module in PointNet++ with a local Transformer module. PT contains five local Transformers that operate on progressively downsampled point sets, each applied to the K-nearest-neighbor (KNN) neighborhood of the sampled points. Notably, the self-attention operator used by PT is vector attention [87] rather than scalar attention; the former has been shown to be more effective for point cloud processing, as it assigns per-channel attention weights instead of a single weight for the entire feature vector.
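Local Transformers like PT apply attention inside the KNN neighborhood of each sampled point. The sketch below only covers the neighborhood-gathering step, with a brute-force NumPy KNN and hypothetical sizes; the attention applied to each gathered (K, C) group is omitted here (a vector attention sketch appears in Section 6.1).

```python
# Brute-force KNN grouping: for each query point, gather the features of its
# K nearest neighbors so a local Transformer can attend within each group.
import numpy as np

def knn_group(query_xyz, all_xyz, all_feat, k=16):
    """query_xyz: (M, 3), all_xyz: (N, 3), all_feat: (N, C) -> (M, k, C)."""
    d2 = ((query_xyz[:, None, :] - all_xyz[None, :, :]) ** 2).sum(-1)  # (M, N) squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                                # (M, k) neighbor indices
    return all_feat[idx]                                               # grouped neighbor features

xyz = np.random.rand(1024, 3)
feat = np.random.rand(1024, 32)
centers = xyz[:256]                       # e.g. an FPS-sampled subset
groups = knn_group(centers, xyz, feat)    # (256, 16, 32): one local patch per center
```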

Pointformer was proposed in [35] to combine local and global features extracted by Transformers for 3D object detection. Pointformer mainly contains three modules: a Local Transformer (LT) block, a Global Transformer (GT) block, and a Local-Global Transformer (LGT) block. First, the LT applies dense self-attention operations within the neighborhood of each centroid point generated by FPS [4]. Second, taking the entire point cloud as input, the GT module learns global context-aware features through the self-attention mechanism. Finally, the LGT block adopts a multi-scale cross-attention module to establish connections between the local features of the LT and the global features of the GT.

Inspired by Swin Transformer [21], [51] proposed Stratified Transformer for 3D point cloud segmentation. It segments the point cloud into a set of non-overlapping cubic windows through 3D voxelization and performs local Transformer operations in each window. For other related work [88], please refer to the paper.

3.2 Operation space

According to the operating space, 3D Transformers can be divided into two categories: Point-wise and Channel-wise Transformers. The former measure the similarity between input points, while the latter assign attention weights along feature channels [40]. In general, the attention maps of these two kinds of Transformer can be expressed as:

A_point = softmax(Q K^T / sqrt(d)) ∈ R^{N×N},   A_channel = softmax(Q^T K / sqrt(d)) ∈ R^{D×D},

where Q, K ∈ R^{N×D} are the Query and Key matrices, N is the number of points, and D is the number of feature channels.

3.2.1 Point-wise Transformers

Point-wise Transformers aim to study the spatial correlations between points, representing the output feature map as a weighted sum of all input features.

The difference between the global Transformer and the local Transformer lies only in the spatial operating scale, i.e., the entire point cloud versus local patches, so all the above methods [7], [8], [10]-[12], [31], [33]-[35], [37], [38], [42], [51], [52], [81] can be considered Point-wise Transformers.

Point-wise Transformers are also widely used in other tasks. [36] proposed an encoder-decoder Transformer network (TD-Net) for point cloud denoising. The encoder consists of a coordinate-based input embedding module, an adaptive sampling module, and four stacked point-wise self-attention modules. The outputs of the four self-attention modules are concatenated as the input to the decoder. In addition, TD-Net uses an adaptive sampling method that automatically learns an offset for each sampled point generated by FPS [4]. The decoder constructs the underlying manifold from the extracted high-level features, and the denoised point cloud is finally reconstructed by manifold sampling. Other related algorithms [37], [10], [52] can be found in the paper.

3.2.2 Channel-wise Transformers

Compared with point-wise Transformers, channel-wise Transformers [38]-[41], [69] focus on measuring the similarity between different feature channels. They improve contextual information modeling by emphasizing cross-channel interactions [39].

[40] proposed a back-projection module for local feature capture, using the idea of an error-correcting feedback structure. They designed a Channel-wise Affinity Attention (CAA) module to obtain better feature representations. Specifically, the CAA module consists of two sub-modules: the Compact Channel-wise Comparator (CCC) module and the Channel Affinity Estimator (CAE) module. The CCC generates a similarity matrix in channel space. The CAE then computes an affinity matrix in which elements with higher attention values represent lower similarity between the corresponding two channels; this sharpens the attention weights and avoids aggregating similar or redundant information. As a result, each channel of the output feature has sufficient interaction with the other channels, which benefits model learning.

3.3 Efficient Transformers

Despite its great success in point cloud processing, the standard Transformer incurs high computation and memory consumption due to its large number of linear operations. Given N input points, the computational and memory complexity of the standard self-attention module is quadratic in N, i.e., O(N^2). This is the main obstacle to applying Transformers to large-scale point cloud datasets.

Recently, several 3D Transformers have studied improving the self-attention module to increase computational efficiency. For example, the Centroid Transformer [42] takes N point features as input and outputs M point features (M < N), so that the key information of the input point cloud is summarized by a smaller number of outputs, called centroids. Specifically, it first constructs M centroids from the N input points by optimizing a general "soft K-means" objective function. The M centroids and the N input points are then used to generate the Query and Key matrices respectively. The size of the attention map is thus reduced from N×N to M×N, and the computational cost of self-attention drops from O(N^2) to O(MN). To save further computation, the authors use a KNN approximation, which essentially converts the global Transformer into a local Transformer: the similarity matrix is generated by measuring the relationship between each query feature vector and its K neighboring key vectors instead of all N key vectors, so the computational cost is further reduced to O(MK). Similarly, PatchFormer [89] also attempts to reduce the size of the attention map.
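The saving in Centroid Transformer comes purely from the shape of the attention map: M centroid queries attend over N input keys, so the map is M×N rather than N×N. The NumPy sketch below illustrates that shape change with random projection matrices standing in for learned layers; the soft K-means centroid construction and the KNN variant are omitted.

```python
# Reduced-size attention: M centroid queries attend over N input points,
# giving an (M, N) attention map instead of (N, N). Centroid construction
# (soft K-means) from the original paper is omitted here.
import numpy as np

def centroid_attention(centroids, points, d=64):
    """centroids: (M, C), points: (N, C) -> (M, d) aggregated features."""
    rng = np.random.default_rng(0)
    wq, wk, wv = (rng.standard_normal((centroids.shape[1], d)) for _ in range(3))
    q, k, v = centroids @ wq, points @ wk, points @ wv        # (M, d), (N, d), (N, d)
    logits = q @ k.T / np.sqrt(d)                             # (M, N) instead of (N, N)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                   # softmax over the N keys
    return attn @ v                                           # (M, d)

pts = np.random.rand(2048, 32)
cents = pts[np.random.choice(2048, 256, replace=False)]       # stand-in for learned centroids
out = centroid_attention(cents, pts)                          # cost O(M*N) instead of O(N*N)
```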

The Light-weight Transformer Network (LightTN) [43] adopts a different approach to reducing computational cost: it simplifies the main components of the standard Transformer, improving efficiency while maintaining performance. First, the positional encoding is removed, since the input 3D coordinates already contain position information, saving that part of the computation. Second, a small shared linear layer is used as the input embedding layer; compared with the neighbor embedding in [12], the dimensionality of the embedded features is halved, further reducing the computation. Third, a single-head autocorrelation layer is proposed as the self-attention module. Since the attention map is generated solely from the input autocorrelation parameters, this self-attention module is named the autocorrelation module, which can be expressed as:

[Equation: single-head autocorrelation attention of LightTN; see [43]]

4 Data representation

3D data can be represented in many forms, such as point clouds and voxels, which can all serve as input to a 3D Transformer. Since point clouds can be converted into voxels, voxel-based 3D Transformers can also be applied to point cloud data. According to the input format, the paper divides 3D Transformers into Voxel-based Transformers and Point-based Transformers.

4.1 Voxel-based Transformers

Unlike images, 3D point clouds are usually unstructured and cannot be processed directly by conventional convolution operators. However, they can easily be converted into 3D voxels, whose structure is similar to that of images. Therefore, some Transformer-based works [45]-[47], [51] explore converting 3D point clouds into voxel-based representations. The most common voxelization method can be described as follows [91]: the bounding box of the point cloud is first rasterized into regular 3D cuboids, and only the voxels containing points are kept, yielding a voxel representation of the point cloud.
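A minimal sketch of the voxelization just described: points are binned into a regular grid over their bounding box and only occupied voxels are kept. The voxel size below is an arbitrary example value, not one used by any cited method.

```python
# Voxelize a point cloud: rasterize the bounding box into a regular grid and
# keep only the occupied (non-empty) voxels. Voxel size is an example value.
import numpy as np

def voxelize(points, voxel_size=0.05):
    """points: (N, 3) -> dict mapping occupied voxel index (i, j, k) to point indices."""
    origin = points.min(axis=0)                                         # corner of the bounding box
    coords = np.floor((points - origin) / voxel_size).astype(np.int64)  # (N, 3) voxel indices
    voxels = {}
    for n, ijk in enumerate(map(tuple, coords)):
        voxels.setdefault(ijk, []).append(n)     # empty voxels are simply never created
    return voxels

pts = np.random.rand(4096, 3)
vox = voxelize(pts)
print(len(vox), "occupied voxels")               # sparse voxel representation of the cloud
```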

Inspired by sparse convolution [92], [93], Mao et al. [46] first proposed the Voxel Transformer (VoTr) backbone for 3D object detection. They proposed a submanifold voxel module and a sparse voxel module to extract features from non-empty and empty voxels respectively. In both modules, local attention and dilated attention operations are implemented on top of the multi-head self-attention (MSA) mechanism to keep computation low for large numbers of voxels. The proposed VoTr can be integrated into most voxel-based 3D detectors. To address the computational cost of Transformers in voxel-based outdoor 3D detectors, the Voxel Set Transformer (VoxSeT) [47] detects outdoor objects in a set-to-set manner. Exploiting the low-rank property of the self-attention matrix, it designs a Voxel-based Self-Attention (VSA) module that assigns a set of trainable "latent codes" to each voxel, inspired by the Set Transformer [94]. Other related algorithms [48], [49] can be found in the paper.

4.2 Point-based Transformers

Since voxels are a regular format and point clouds are not, the conversion to voxels causes a certain loss of geometric information [4], [5]. The point cloud, on the other hand, is a raw representation that retains the complete geometric information of the data. Therefore, most Transformer-based point cloud processing frameworks fall into the category of point-based Transformers. Their structures are usually divided into two major categories: uniform-scale structures [12], [33], [50], [63], [81] and multi-scale structures [7], [8], [37], [39], [51], [52].

4.2.1 Uniform scale

Uniform-scale structures keep the scale of the point cloud features constant throughout processing: the number of output features of each module equals the number of input features. The most representative work is PCT [12]. After the input embedding stage, PCT's four global Transformer modules are directly stacked to refine the point cloud features. There is no hierarchical feature aggregation, which facilitates dense prediction tasks such as point cloud segmentation. Feeding all points into the Transformer benefits global feature learning; however, due to the lack of local neighborhood information, uniform-scale Transformers are often weak at extracting local features. Furthermore, processing the entire point cloud directly leads to high computation and memory consumption.

4.2.2 Multi-scale

Multi-scale Transformers, also known as hierarchical Transformers, adopt a progressive point cloud sampling strategy during feature extraction. PT [7] is a pioneering design that introduces a multi-scale structure into a pure Transformer network: its Transformer layers are applied to progressively subsampled point sets. On the one hand, the sampling operation speeds up the whole network by reducing the number of Transformer parameters; on the other hand, these hierarchical structures usually come with KNN-based local feature aggregation. Such local aggregation benefits tasks that require fine semantic perception, such as segmentation and completion, while the highly aggregated local features in the last layer can serve as global features for point cloud classification. In addition, many multi-scale Transformer networks [8], [37], [51], [52] use EdgeConv [13] or KPconv [88] for local feature extraction and the Transformer for global feature extraction, combining the strong local modeling ability of convolution with the excellent global feature learning ability of the Transformer to obtain better semantic feature representations.

5 3D tasks

Similar to image processing [29], tasks related to 3D point clouds can also be divided into two major categories: high-level tasks and low-level tasks. High-level tasks involve semantic analysis, which focuses on converting 3D point clouds into human-understandable information. Low-level tasks such as denoising and completion focus on exploring basic geometric information. They are not directly related to human semantic understanding, but can indirectly facilitate high-level tasks.

5.1 High-level tasks

In the field of 3D point cloud processing, high-level tasks usually include classification and segmentation [7], [11], [12], [32]-[34], [37], [39], [40], [42], [44], [45], [51], [85], [86], [89], [95]-[99], object detection [35], [46], [47], [53]-[55], [69], [77], [100]-[102], tracking [56]-[58], registration [59]-[63], [71], [72], [103], etc.

5.1.1 Classification and segmentation

Similar to image classification [104]-[107], 3D point cloud classification methods aim to classify given 3D shapes into specific categories, such as chairs, beds, and sofas for indoor scenes, or pedestrians, cyclists, and cars for outdoor scenes. In the field of 3D point cloud processing, since the encoder of a segmentation network is usually developed from a classification network, the paper introduces these two tasks together.

Xie et al. [11] introduced the self-attention mechanism into the point cloud recognition task for the first time. Inspired by the success of shape context [108] in shape matching and object recognition, the authors first convert the input point cloud into a form of shape context representation. The representation consists of a set of concentric shell bins. Based on the proposed novel representation, they then introduced ShapeContextNet (SCN) to extract point cloud features. In order to automatically capture rich local and global information, the dot product self-attention module is further used, resulting in Attentional ShapeContextNet (A-SCN).

Inspired by self-attention networks in image analysis [87], [109] and NLP [83], Zhao et al. [7] designed a point cloud Transformer layer based on vector attention. The Point Transformer block is constructed in a residual manner based on the Point Transformer layer. PT's encoder is built using only Point Transformer blocks, point-wise transformations, and pooling operations for point cloud classification. In addition, PT also uses U-Net structure to implement point cloud segmentation, in which the decoder and encoder are symmetrical. It proposes a Transition Up module to restore the original point cloud with semantic features from the downsampled point cloud set. Furthermore, skip connections are introduced to facilitate backpropagation. With these carefully designed modules, PT became the first semantic segmentation model to achieve more than 70% mIoU (70.4%) on Area 5 of the S3DIS dataset [110]. As for the shape classification task on the ModelNet40 data set, Point Transformer also achieved an overall accuracy of 93.7%.

5.1.2 Object detection

Thanks to the popularity of 3D point cloud scanners, 3D object detection is becoming an increasingly popular research topic. Similar to the 2D object detection task, the 3D object detector aims to output 3D bounding boxes. Recently, [15] proposed the first Transformer-based 2D object detector DETR. It combines Transformer and CNN and abandons non-maximum suppression (NMS). Since then, Transformer-related works have also shown a booming trend in the field of point cloud-based 3D target detection.

Based on VoteNet [113], [53], [114] introduced the Transformer's self-attention mechanism into indoor 3D object detection for the first time. They proposed Multi-Level Context VoteNet (MLCVNet) to improve detection performance by encoding contextual information. In this work, each point cloud patch and vote cluster is regarded as a token, and the self-attention mechanism enhances the corresponding feature representations by capturing the relationships between point cloud patches and between vote clusters. Thanks to the integrated self-attention modules, MLCVNet outperforms the baseline model on both the ScanNet [80] and SUN RGB-D [79] datasets. PQ-Transformer [100] attempts to detect 3D objects and predict room layout simultaneously; with room layout estimation and features refined by the Transformer decoder, it achieves 67.2% mAP@0.25 on ScanNet.

The above methods adopt handcrafted grouping schemes to obtain the features of object candidates by learning from the points in corresponding local regions. However, [54] argued that grouping points within a limited region often limits the performance of 3D object detection, and therefore proposed a group-free framework based on the attention mechanism in Transformers. The core idea is that the features of candidate objects should come from all points in the scene rather than a subset of them. After obtaining candidate objects, [54] first uses a self-attention module to capture contextual information between candidates, and then a cross-attention module refines the object features using information from all points. With an improved attention stacking scheme, [54] achieves 69.1% mAP@0.25 on the ScanNet dataset. Other related algorithms [55], [69], [115] can be found in the paper.

5.1.3 Object tracking

3D object tracking takes two point clouds as input, a template point cloud and a search point cloud, and outputs the 3D bounding box of the target (template) in the search point cloud. This involves feature extraction from the point clouds and feature fusion between the template and search point clouds.

[56] argued that most existing tracking methods ignore how attention over the target region changes during tracking; that is, different regions of the search point cloud should contribute different importance to the feature fusion process. Based on this observation, they proposed a LiDAR-based 3D object tracking Transformer network (LTTR), which improves the fusion of template and search point cloud features by capturing changes in attention over the tracking period. Specifically, a Transformer encoder is first constructed to improve the feature representations of the template and search point clouds separately; a Transformer decoder built on cross-attention then fuses the features by capturing the relationships between the two point clouds. Benefiting from this Transformer-based feature fusion, LTTR achieves an average accuracy of 65.8% on the KITTI tracking dataset. [57] likewise proposed a Point Relation Transformer (PRT) module to improve feature fusion in the coarse-to-fine Point Tracking TRansformer (PTTR) framework. Similar to LTTR, PRT uses self-attention and cross-attention to encode relationships within and between point clouds respectively, but it adopts Offset-Attention [12] to mitigate the impact of noisy data. PTTR surpasses LTTR by 8.4% and 10.4% in average success rate and accuracy respectively, becoming the new SOTA on the KITTI tracking benchmark.
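Stripped to its core, the fusion step in LTTR and PTTR is cross-attention: queries come from one point cloud and keys/values from the other. The PyTorch sketch below shows this generic form only; the actual decoder designs of the two papers differ in details (e.g., PTTR's Offset-Attention) that are not reproduced here.

```python
# Generic cross-attention fusion: search-region features query template features,
# so each search token aggregates the template information most relevant to it.
# A sketch of the idea, not the exact decoder of LTTR or PTTR.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, Ns, dim) queries; template_feat: (B, Nt, dim) keys/values
        fused, _ = self.cross(search_feat, template_feat, template_feat)
        return self.norm(search_feat + fused)          # residual, fused search features

fusion = CrossAttentionFusion()
out = fusion(torch.rand(1, 512, 128), torch.rand(1, 256, 128))   # (1, 512, 128)
```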

Unlike the above two methods, which focus on the feature fusion step, [58] introduced a Point-Track-Transformer (PTT) module to enhance the feature representation after feature fusion. On the Car category of KITTI, the accuracy of PTT-Net is 9.0% higher than that of P2B.

5.1.4 Point cloud registration

Given two point clouds as input, point cloud registration aims to find a transformation matrix that aligns them.

The Deep Closest Point (DCP) model proposed in [59] introduced the Transformer encoder into the point cloud registration task. The unaligned point clouds are first fed into a feature embedding module, such as PointNet [5] or DGCNN [13], to map the 3D coordinates into feature space. A standard Transformer encoder then performs contextual aggregation between the two feature sets. Finally, DCP uses a differentiable singular value decomposition (SVD) layer to compute the rigid transformation matrix. DCP is the first work to use a Transformer to improve feature extraction for registration. Similarly, STORM [60] uses a Transformer to refine the point-wise features extracted by EdgeConv [13] and capture long-range relationships between the point clouds, achieving better performance than DCP on the ModelNet40 dataset. [61] likewise uses multi-head self-attention and cross-attention to learn contextual information between the target and source point clouds, focusing on outdoor scenes such as the KITTI dataset [117].
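The SVD layer at the end of DCP solves the classical least-squares rigid alignment (the Kabsch/Procrustes solution) once correspondences are available. Below is a plain NumPy version of that closed-form step, assuming hard one-to-one correspondences for simplicity; DCP itself applies a differentiable SVD to soft correspondences.

```python
# Closed-form rigid alignment (Kabsch/Procrustes): given corresponding points,
# recover rotation R and translation t minimising ||R @ src + t - tgt||^2.
import numpy as np

def rigid_transform(src, tgt):
    """src, tgt: (N, 3) corresponding points -> (R: (3, 3), t: (3,))."""
    src_c, tgt_c = src.mean(0), tgt.mean(0)
    h = (src - src_c).T @ (tgt - tgt_c)          # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))       # avoid reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = tgt_c - r @ src_c
    return r, t

# quick self-check with a known transform
src = np.random.rand(100, 3)
r_true, _ = np.linalg.qr(np.random.randn(3, 3))
if np.linalg.det(r_true) < 0:                    # make it a proper rotation
    r_true[:, 0] *= -1
tgt = src @ r_true.T + np.array([0.1, -0.2, 0.3])
r_est, t_est = rigid_transform(src, tgt)
print(np.allclose(r_est, r_true, atol=1e-6))     # True
```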

Recently, [72] argued that explicit feature matching and RANSAC-based outlier filtering can be replaced by attention mechanisms in point cloud registration. They designed an end-to-end Transformer framework named REGTR to directly find point cloud correspondences. In REGTR, point cloud features from a KPconv [88] backbone are fed into several multi-head self-attention and cross-attention layers to compare the source and target point clouds. With this simple design, REGTR became a SOTA point cloud registration method on the ModelNet40 [112] and 3DMatch [118] datasets. Similarly, GeoTransformer [71] uses self-attention and cross-attention to find robust superpoint correspondences. In terms of registration recall, both REGTR and GeoTransformer achieve 92.0% on the 3DMatch dataset, but GeoTransformer outperforms REGTR by 10.2% on the 3DLoMatch [119] dataset. Other related algorithms [62] can be found in the paper.

5.1.5 Point cloud video understanding

The 3D world around us is dynamic and continuous in time, which cannot be fully represented by a single, fixed point cloud frame. In contrast, point cloud video, a sequence of point clouds captured at a fixed frame rate, is a better representation of dynamic real-world scenes, and understanding dynamic scenes and objects is important for deploying point cloud models in the real world. Point cloud video understanding involves processing time series of 3D point clouds, so the Transformer, which excels at modeling global long-range interactions, is a natural choice.

Based on the above discussion, [64] proposed P4Transformer to process point cloud videos for action recognition. To extract local spatio-temporal features, the input is first represented as a set of spatio-temporal local regions, and point 4D convolution encodes the features of each region. A Transformer encoder then integrates the features of the local regions by capturing long-range relationships across the entire video. P4Transformer has been successfully applied to 3D action recognition and 4D semantic segmentation, achieving higher accuracy than PointNet++-based methods on many benchmarks (for example, the 3D action recognition datasets MSR-Action3D [120] and NTU RGB+D 60 [121] and the 4D semantic segmentation dataset Synthia 4D [93]). These results demonstrate the effectiveness of Transformers in point cloud video understanding.

5.2 Low-level tasks

The input data for low-level tasks is usually raw scanned point clouds with occlusions, noise, and uneven density. The ultimate goal of low-level tasks is therefore to obtain high-quality point clouds, which in turn benefits high-level tasks. Typical low-level tasks include point cloud downsampling [43], upsampling [38], denoising [36], [65], and completion [50], [66]-[68], [73], [74], [123], [124].

5.2.1 Downsampling

Given a point cloud with N points, the downsampling method aims to output a smaller size point cloud with M points while retaining the geometric information of the input point cloud. Taking advantage of the powerful learning capabilities of Transformers, LightTN [43] first removes the positional encoding and then uses a small-sized shared linear layer as the embedding layer. Furthermore, the MSA module is replaced by a single-head autocorrelation layer. Experimental results show that the above strategy significantly reduces the computational cost. With only 32 points sampled, a classification accuracy of 86.18% can still be achieved. Furthermore, the lightweight Transformer network is designed as a detachable module that can be easily plugged into other neural networks.

5.2.2 Upsampling

In contrast to downsampling, upsampling aims to recover fine geometric information that has been lost [125]: given a sparse point cloud, the upsampled point cloud should reflect the true geometry and lie on the underlying surface of the target. PU-Transformer [38] is the first work to apply the Transformer to point cloud upsampling. It designs two novel modules: a Positional Fusion (PosFus) module, which captures local position-related information, and a Shifted Channel Multi-head Self-Attention (SC-MSA) module, which addresses the lack of interaction between the outputs of different heads in conventional MSA. Experimental results show the great potential of Transformer-based models for point cloud upsampling.

5.2.3 Denoising

Denoising takes point clouds corrupted by noise as input and uses local geometric information to output clean point clouds. The first related work, TD-Net [36], treats each point as a word token, showing that the NLP Transformer [6] is suitable for point cloud feature extraction. The Transformer-based encoder maps the input points into a high-dimensional feature space and learns the semantic relationships among them; from the extracted features, a latent manifold of the noisy input point cloud is obtained. Finally, a clean point cloud is generated by sampling each patch manifold.

Another type of point cloud denoising method filters noisy points directly out of the input point cloud. For example, some LiDAR point clouds contain large numbers of virtual (noise) points produced by specular reflections from glass or other reflective materials. To detect these reflection noise points, [65] first projects the input 3D LiDAR point cloud into a 2D range image; a Transformer-based autoencoder network then predicts a noise mask, from which the reflection noise points are obtained.

5.2.4 Completion

In most 3D practical applications, it is often difficult to obtain a complete point cloud of an object or scene due to occlusion from other objects or self-occlusion. This problem makes point cloud completion an important low-level task in the field of 3D vision.

[66] first proposed PoinTr to convert point cloud completion into a set-to-set translation task. Specifically, the authors claim that the input point cloud can be represented by a set of local point clouds, called "point cloud proxies". It takes a series of point cloud proxies as input and carefully designs a geometry-aware Transformer block to generate point cloud proxies for missing parts. In a coarse-to-fine manner, the point cloud is finally completed based on predicted point cloud proxies using FoldingNet [126]. For other related algorithms [67], [68], please refer to the paper.

6 3D self-attention variants

Based on the standard self-attention module, there are many variants designed to improve the performance of Transformers in 3D point cloud processing.

6.1 Point-wise variants

PA [10] ((a) below) and A-SCN [11] ((b) below) use different residual structures in their Transformer encoders. The former strengthens the connection between the module's output and its input, while the latter establishes a relationship between the module's output and the Value matrix. Experiments show that residual connections promote model convergence [11].

Inspired by the Laplacian matrix in graph convolutional networks [82], PCT [12] further proposed the Offset-Attention module ((c) below). This module computes the offset (difference) between the self-attention (SA) features and the input features X by matrix subtraction, analogous to a discrete Laplacian operation.
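A sketch of the Offset-Attention idea: compute ordinary self-attention features, take the offset between the input and the attention output, pass it through a linear-BatchNorm-ReLU (LBR) transform, and add it back to the input. The attention normalization and the exact LBR placement are simplified here relative to the original PCT design.

```python
# Offset-Attention sketch (after PCT): compute self-attention features, take the
# offset (input - attention output), transform it, and add it back to the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, x):                          # x: (N, dim) features of one point cloud
        attn = F.softmax(self.q(x) @ self.k(x).T / x.shape[1] ** 0.5, dim=-1)
        sa = attn @ self.v(x)                      # standard self-attention features
        offset = x - sa                            # offset, analogous to a Laplacian (x - A x)
        return x + self.lbr(offset)                # residual connection back to the input

out = OffsetAttention()(torch.rand(1024, 128))     # (1024, 128)
```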

PT [7] ((d) below) introduced a vector (subtraction-based) attention operator into its Transformer network, replacing the commonly used scalar dot-product attention. Compared with scalar attention, vector attention is more expressive because it adaptively modulates individual feature channels rather than entire feature vectors, which appears to be very beneficial for 3D data processing [7]. Other related algorithms [8], [43], [38], [44] can be found in the paper.
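The essential difference from scalar attention is that the weight is a vector of the same size as the feature, produced by an MLP on the query-key difference, so each channel is modulated individually. The sketch below shows this over pre-gathered KNN neighborhoods; PT's relative position encoding and its exact MLP design are omitted.

```python
# Vector attention sketch (after PT): the attention weight is a vector produced
# by an MLP on (query - key), so each feature channel is weighted individually.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, center_feat, neighbor_feat):
        # center_feat: (M, dim); neighbor_feat: (M, K, dim) gathered by KNN
        q = self.to_q(center_feat)[:, None, :]            # (M, 1, dim)
        k = self.to_k(neighbor_feat)                      # (M, K, dim)
        v = self.to_v(neighbor_feat)                      # (M, K, dim)
        w = F.softmax(self.weight_mlp(q - k), dim=1)      # (M, K, dim): per-channel weights
        return (w * v).sum(dim=1)                         # (M, dim) aggregated features

out = VectorAttention()(torch.rand(256, 64), torch.rand(256, 16, 64))   # (256, 64)
```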

[Figure: point-wise self-attention variants: (a) PA, (b) A-SCN, (c) Offset-Attention of PCT, (d) vector attention of PT]

6.2 Channel-wise variants

The Dual Transformer Network (DT-Net) [39] proposes channel-wise MSA, which applies the self-attention mechanism in channel space. As shown in (a) below, unlike standard self-attention, channel-wise MSA multiplies the transposed Query matrix with the Key matrix, so the resulting attention map measures the similarity between different channels.

As shown in (b) below, the CAA module [40] uses a similar method to generate a similarity matrix between channels; it further designs a CAE module to produce an affinity matrix that strengthens the connections between distinct channels and avoids aggregating similar or redundant information. The Transformer-Conv module proposed in [41] learns the latent relationship between feature channels and coordinate channels: as shown in (c) below, its Query and Key matrices are generated from the coordinates and the features of the point cloud respectively.

[Figure: channel-wise self-attention variants: (a) channel-wise MSA of DT-Net, (b) CAA, (c) Transformer-Conv]
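Common to these channel-wise variants is an attention map over feature channels, obtained by multiplying the transposed Query matrix with the Key matrix so that its size is D×D rather than N×N. The NumPy sketch below shows only this basic form, with random projections standing in for learned layers; the multi-head design of DT-Net and the affinity re-weighting of CAA are not reproduced.

```python
# Channel-wise attention sketch: the attention map is (D, D) over feature
# channels (Q^T K) rather than (N, N) over points.
import numpy as np

def channel_attention(x, d=64, seed=0):
    """x: (N, C) point features -> (N, d) channel-refined features."""
    rng = np.random.default_rng(seed)
    wq, wk, wv = (rng.standard_normal((x.shape[1], d)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv                      # (N, d) each
    logits = q.T @ k / np.sqrt(x.shape[0])                # (d, d) channel-to-channel map
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over channels
    return v @ attn                                       # each output channel mixes input channels

out = channel_attention(np.random.rand(1024, 32))          # (1024, 64)
```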

7 Comparison and analysis

This section provides an overall comparison and analysis of 3D Transformers on several mainstream tasks: classification, part segmentation, semantic segmentation, and object detection.

7.1 Classification and segmentation

3D point cloud classification and segmentation are two fundamental but challenging tasks in which Transformers play a key role. Classification best reflects a network's ability to extract salient features. The following table shows the classification accuracy of different methods on the ModelNet40 [112] dataset; for a fair comparison, the input data type and input size are also listed.

As the table shows, the recent surge of Transformer-based point cloud processing methods began in 2020, when the Transformer architecture was first used for image classification in the ViT paper [17]. Transformers quickly took the lead in this task thanks to their powerful global information aggregation. The classification accuracy of most 3D Transformers is around 93.0%, and the latest PVT [45] pushes this to 94.0%, exceeding most contemporary non-Transformer algorithms. As an emerging technique, the success of Transformers in point cloud classification demonstrates their huge potential for 3D point cloud processing. The paper also lists several state-of-the-art non-Transformer methods for reference; their recent classification accuracy has exceeded 94.0%, with the highest, 94.5%, achieved by PointMLP [141]. The attention mechanisms used in Transformer-based methods are general-purpose and leave great room for future breakthroughs: applying innovations from general point cloud processing to Transformer-based methods could yield state-of-the-art results. For example, the geometric affine module in PointMLP could easily be integrated into Transformer-based networks.

[Table: classification accuracy of Transformer-based and non-Transformer methods on ModelNet40]

For part segmentation, results on the ShapeNet part segmentation dataset [142] are compared, using the commonly adopted part-average Intersection-over-Union (pIoU) as the performance metric. As shown in Table 2, all Transformer-based methods achieve approximately 86% pIoU, except ShapeContextNet [11], an early model released before 2019. Note that Stratified Transformer [51] achieves the highest pIoU of 86.6%; it is also the best model for semantic segmentation on the S3DIS dataset [110] (Table 3).

[Tables 2 and 3: part segmentation results on ShapeNet and semantic segmentation results on S3DIS]

7.2 Object detection

Transformers are still relatively rare in point cloud 3D object detection; to date there are only a few Transformer- or attention-based methods, perhaps because object detection is more complex than classification. Table 4 summarizes the performance of related algorithms on two datasets, SUN RGB-D [79] and ScanNetV2 [80], with VoteNet [113], a pioneering work in 3D object detection, included as a reference. All Transformer-based methods outperform VoteNet in terms of AP@25 on the ScanNetV2 dataset. Pointformer [35] and MLCVNet [53] are both built on VoteNet and achieve similar performance, each using the self-attention mechanism to enhance feature representation. GroupFree3D [54] does not use the local voting strategy of these two methods but directly aggregates semantic information from all points in the scene to extract object features; its 69.1% AP@25 shows that feature aggregation via self-attention is more effective than the local voting strategy. 3DETR [55], the first end-to-end Transformer-based 3D object detector, achieves second place on ScanNetV2 with 65.0%.

[Table 4: 3D object detection results on SUN RGB-D and ScanNetV2]

8 Discussion and conclusion

8.1 Discussion

As in 2D computer vision, Transformers have shown great potential in 3D point cloud processing. From the perspective of 3D tasks, Transformer-based methods mainly focus on high-level tasks such as classification and segmentation. The paper argues that this is because the Transformer excels at extracting global contextual information by capturing long-range dependencies, which corresponds to the semantic information needed by high-level tasks, whereas low-level tasks such as denoising and sampling focus on local geometric features. From a performance perspective, 3D Transformers improve accuracy on the above tasks and outperform most existing methods, but for some tasks a gap remains compared with state-of-the-art non-Transformer methods. This shows that simply using a Transformer as the backbone is not enough: other innovative point cloud processing techniques must also be employed. Although 3D Transformers are developing rapidly, as an emerging technique they still need further exploration and improvement.

Based on the characteristics of Transformer and its successful application in the 2D field, the paper points out several potential future directions of 3D Transformers.

8.1.1 Patch-wise Transformers

3D Transformers can be divided into Point-wise and Channel-wise Transformers. In addition, following the exploration of Transformers in 2D image processing [87], Point-wise Transformers can be further divided into Pair-wise and Patch-wise Transformers according to their operation form: the former compute the attention weight of a feature vector from the corresponding point pair, while the latter combine the information of all points within a given patch.

Currently, there is little research on patch-wise Transformers in the field of 3D point cloud processing. Considering the advantages of the patch-wise form and its excellent performance in image processing, we believe that introducing patch-wise Transformers into point cloud processing would benefit performance.

8.1.2 Adaptive Set Abstraction

PointNet++ [4] proposes a Set Abstraction (SA) module to hierarchically extract semantic features of point clouds. It uses FPS and query ball grouping to perform sampled-point selection and local patch construction respectively. However, the sampled points generated by FPS tend to be uniformly distributed over the original point cloud, ignoring the geometric and semantic differences between parts. For example, the geometry of an aircraft tail is more complex and distinctive than that of the fuselage, so the former requires more sampled points to describe. In addition, query ball grouping searches for neighboring points purely by Euclidean distance, ignoring semantic feature differences, so points with different semantics are easily grouped into the same local patch (a sketch of this grouping is given below). Developing adaptive set abstraction is therefore beneficial for improving the performance of 3D Transformers. Several Transformer-based methods have recently explored adaptive sampling in the 3D domain [43], but few fully exploit the rich short-range and long-range dependencies produced by the self-attention mechanism. In image processing, the Deformable Attention Transformer (DAT) [143] generates deformed sampling points by introducing an offset network and achieves impressive benchmark results at low computational cost. It would therefore be meaningful to propose a hierarchical, self-attention-based adaptive sampling method for Transformers. In addition, inspired by 2D superpixels [144], we believe it is feasible to use the attention maps in 3D Transformers to obtain "superpoints" for point cloud over-segmentation [145], converting point-level 3D data into neighborhood-level data; such an adaptive clustering technique could replace the query ball grouping method.
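For reference, the query ball grouping criticized above simply collects every point within a fixed Euclidean radius of each sampled center, up to a maximum group size. The NumPy sketch below is the standard PointNet++-style formulation with example radius and group size, not an adaptive replacement.

```python
# Query ball grouping: for each sampled center, collect up to `max_k` points
# within a fixed radius. Purely Euclidean, so semantically different points can
# end up in the same group, which is the limitation discussed above.
import numpy as np

def ball_query(centers, points, radius=0.1, max_k=32):
    """centers: (M, 3), points: (N, 3) -> (M, max_k) point indices."""
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (M, N) squared distances
    groups = np.zeros((centers.shape[0], max_k), dtype=np.int64)
    for i, row in enumerate(d2):
        idx = np.flatnonzero(row < radius ** 2)[:max_k]
        if idx.size == 0:
            idx = np.array([np.argmin(row)])       # fall back to the nearest point
        groups[i, :idx.size] = idx
        groups[i, idx.size:] = idx[0]              # pad by repeating the first index
    return groups

pts = np.random.rand(2048, 3)
grp = ball_query(pts[:256], pts)                   # (256, 32) local patch indices
```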

8.1.3 Self-supervised Transformer Pre-training

The success of Transformers in NLP and on 2D image tasks stems not only from their excellent scalability but also from large-scale self-supervised pre-training [83]. Vision Transformer [17] conducted a series of self-supervised experiments demonstrating the potential of self-supervised Transformers. In point cloud processing, despite significant progress in supervised methods, point cloud annotation remains labor-intensive, and the limited labeled datasets hinder the development of supervised methods, especially for point cloud segmentation. Recently, a series of self-supervised methods have been proposed to deal with this problem, drawing on techniques from the 2D domain such as generative adversarial networks (GANs) [146], autoencoders (AEs) [147], [148], and Gaussian mixture models (GMMs) [149]. These methods use autoencoders and generative models to achieve self-supervised point cloud representation learning [96] and demonstrate its effectiveness. However, few self-supervised Transformers have so far been applied to 3D point cloud processing. With the increasing availability of large-scale 3D point clouds, self-supervised 3D Transformers for point cloud representation learning are worth exploring.

8.2 Summary

The Transformer model has attracted widespread attention in the field of 3D point cloud processing and has achieved impressive results in various 3D tasks. This article provides a comprehensive review of Transformer-based networks recently applied to point cloud-related tasks, such as point cloud classification, segmentation, object detection, registration, sampling, denoising, completion and other practical applications. The paper first introduces the theory of Transformer and describes the development and application of 2D and 3D Transformer. Then, the paper uses three different taxonomies to classify the methods in the existing literature into multiple groups and analyze them from multiple perspectives. In addition, the paper describes a series of self-attention variants designed to improve performance and reduce computational cost. This paper provides a brief comparison of the reviewed methods in terms of point cloud classification, segmentation and object detection. Finally, the paper proposes three potential future research directions for the development of 3D Transformer. It is hoped that this survey will provide researchers with a comprehensive understanding of 3D Transformers and stimulate their interest in further innovating research in this field.

9 References

[1] Transformers in 3D Point Clouds: A Survey

Reprinted on the public account: The Heart of Autonomous Driving


Origin blog.csdn.net/abcwsp/article/details/127433394