FDGCNN (Paper)

Title: Faster Dynamic Graph CNN: Faster Deep Learning on 3D Point Cloud Data

Abstract:

  • Due to the unstructured and unordered nature of point cloud data, it is difficult to feed point clouds directly into Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs).
  • In this study, this problem is addressed by aligning the point cloud data in a canonical space with a graph CNN.
  • The proposed graph CNN works dynamically at each layer of the network and learns global geometric features by capturing the neighborhood information of each point.
  • Furthermore, by recalibrating the information of each layer with a squeeze-and-excitation (SE) module, we achieve a good trade-off between performance and computational cost, and design a residual-style skip-connection network to efficiently train deep models.
  • Using the proposed model, we achieve state-of-the-art performance on classification and segmentation on benchmark datasets (i.e., ModelNet40 and ShapeNet), while training 2 to 2.5 times faster than other similar models.

I. Introduction

The main contributions are as follows:

  • Applying attention recalibration blocks to the edge convolution blocks improves the expressiveness of edge features and point feature maps.

  • Using skip-dense networks, we can train models with more layers faster.

  • We conduct experiments with the proposed model and achieve state-of-the-art performance on benchmark datasets, training 2 to 2.5 times faster than other similar models.

II. Related Works

A. Deep neural network architecture
B. 3D data representation
C. Geometric deep learning

III. Method

  • Our proposed model is strongly influenced by the DGCNN model [12]. Based on edge convolutions, it captures and learns the geometric features (or edge features) between points.
  • The DGCNN model is built on top of a multi-layer perceptron (MLP), whereas we build deeper and faster networks by adding our own skip network and recalibration blocks.

A. Classification Model

1) Pipeline Model

Our proposed classification model receives n points as input and computes edge feature maps through a spatial transform block and edge convolution blocks. The output edge feature maps are recalibrated by the SE module, and the recalibrated feature maps are aggregated. The aggregated feature map is finally passed through the skip-dense network to output the classification scores of the labels.

  • The spatial transform block and the edge convolution blocks are the main elements of the backbone model.
  • The spatial transform block aims to align the point cloud input to a canonical space by applying an estimated 3×3 matrix. To estimate this 3×3 matrix, a tensor concatenating the coordinate differences between each point and its k neighbors is used (Fig. 3(a)).
  • Each point's coordinates are concatenated with the coordinate differences to its k nearest neighbors. Therefore, as shown in Fig. 3(a), the size of the feature map after the k-NN step is n×k×(3+3) = n×k×6; a sketch of this construction is given below. The edge convolution block computes edge features for each point and applies an aggregation function, outputting a tensor of shape n×an, where n is the number of input points and an is the output size of the applied MLP (Fig. 3(b)).
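To make the shapes above concrete, here is a minimal sketch (PyTorch; the function name and the use of torch.cdist are our own choices, not taken from the paper) of building the n×k×6 tensor that feeds the spatial transform block:

```python
import torch

def knn_edge_features(points: torch.Tensor, k: int) -> torch.Tensor:
    """points: (n, 3) point cloud -> (n, k, 6) tensor for the spatial transform block."""
    # Pairwise Euclidean distances between all points: (n, n).
    dists = torch.cdist(points, points)
    # Indices of the k nearest neighbors of each point (dropping the point itself).
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # (n, k)
    neighbors = points[knn_idx]                                 # (n, k, 3)
    # Coordinate differences pj - pi to each neighbor.
    offsets = neighbors - points.unsqueeze(1)                   # (n, k, 3)
    # Concatenate each center point with its offsets: n x k x (3+3).
    centers = points.unsqueeze(1).expand(-1, k, -1)             # (n, k, 3)
    return torch.cat([centers, offsets], dim=-1)                # (n, k, 6)

# Example: 1024 points with k = 20 -> torch.Size([1024, 20, 6])
print(knn_edge_features(torch.randn(1024, 3), k=20).shape)
```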

  • The detailed description of the edge convolution block is as follows. Assume that an F-dimensional point cloud X = {p1, p2, ⋯, pn} ⊆ RF is given as input. For most 3D point cloud data, F = 3 and pi = (xi, yi, zi); F increases when information such as texture or color is added. From this X, we construct a directed graph G = (V, E), where V = {p1, p2, ⋯, pn} is the vertex set and E ⊆ V×V is the edge set. The edge features over E are expressed as follows:

    eij = fΘ(pi, pj)     (1)

  • Here, fΘ: RF×RF→RF′ is a nonlinear function and Θ is a learnable parameter. Based on the configured V and E, G is constructed as a k-nearest-neighbor graph and used in the edge convolution block. The function fΘ defines the representation of edge features as follows:

    fΘ(pi, pj) = f̄Θ(pi, pj − pi)     (2)

  • This asymmetric function combines the global shape structure, centered at pi, with the local neighborhood information, captured by pj − pi. Finally, the edge feature of the l-th channel is represented by the MLP as:

    e′ijl = ReLU(θl · (pj − pi) + ϕl · pi)     (3)

  • After constructing a k-NN graph G for the n-point set X, we perform the edge convolution process with G as input. In edge convolution, we apply a symmetric aggregation function g to the features of all edges connected to each vertex. Through this process, the edge features become unaffected by the ordering of the neighbors. The edge convolution result pi′ at the i-th point pi can be expressed as follows:

    pi′ = g({fΘ(pi, pj) : (i, j) ∈ E})     (4)

  • This symmetric function takes n vectors as input and outputs a new vector that is robust (or invariant) to the order of the inputs. Therefore, given an F-dimensional point cloud with n points, the edge convolution block produces a point cloud with the same number of points in F′ dimensions. Methods such as attention, long short-term memory (LSTM), average pooling, and max pooling can all serve as the order-invariant function g; since max pooling showed the highest accuracy in the comparison of methods, it was selected as g. Therefore, the edge convolution result of (4) becomes:

    pi′l = max{j:(i,j)∈E} e′ijl     (5)

  • Because the edge feature function f is a symmetric function, it is invariant to permutation, and the feature aggregation function g, which is max pooling in our model, is also invariant to permutation. Therefore, the result pi′ of (5) is invariant to permutations of the input points pj.
  • Furthermore, according to (6), when each point is translated by T, the part of the edge features that depends on pj − pi is preserved:

    e′ijl = ReLU(θl · ((pj + T) − (pi + T)) + ϕl · (pi + T)) = ReLU(θl · (pj − pi) + ϕl · (pi + T))     (6)

    For ϕl = 0, the edge features are completely translation-invariant. In this case, however, the model only utilizes the relations (or edge features) between points while ignoring the geometric information of each point itself. Therefore, for ϕl ≠ 0, by taking both pi and pj − pi as inputs, the model can consider local region information while maintaining the original shape information. A minimal EdgeConv sketch follows below.
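Here is a minimal, self-contained sketch of the edge convolution formulated in (3) and (5), written in PyTorch; the class name, layer sizes, and k-NN recomputation per call are our own assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Edge convolution sketch: f_Theta over [pi, pj - pi], then max over neighbors."""
    def __init__(self, in_dim: int, out_dim: int, k: int):
        super().__init__()
        self.k = k
        # Shared MLP realizing ReLU(theta_l . (pj - pi) + phi_l . pi) per channel l.
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (n, F) point features -> (n, F') features, one per input point."""
        n = x.size(0)
        # Dynamically rebuild the k-NN graph from the current features.
        idx = torch.cdist(x, x).topk(self.k + 1, largest=False).indices[:, 1:]
        neighbors = x[idx]                                           # (n, k, F)
        centers = x.unsqueeze(1).expand(-1, self.k, -1)              # (n, k, F)
        edge_in = torch.cat([centers, neighbors - centers], dim=-1)  # (n, k, 2F)
        edge_feat = self.mlp(edge_in.reshape(n * self.k, -1)).reshape(n, self.k, -1)
        # Symmetric aggregation g: max pooling over the k neighbors, as in (5).
        return edge_feat.max(dim=1).values                           # (n, F')

# Example: EdgeConv(3, 64, k=20)(torch.randn(1024, 3)).shape -> torch.Size([1024, 64])
```

Recomputing the k-NN graph from the current features at every layer is what makes the graph CNN dynamic.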

2) SE Module

  • In a CNN structure, each convolutional filter learns local features of the image or feature map, i.e., a combination of the information in its local receptive field. By combining these through activation functions, the network derives nonlinear relationships, and through methods such as pooling it reduces large feature maps so that they can be taken in at once. As a result, CNNs have been able to outperform humans in areas such as image classification because of their ability to efficiently handle relationships over the global receptive field.
  • The SE module models the dependencies among convolutional features to further enhance the expressiveness of existing CNNs. It consists of a squeeze operation that summarizes the overall information of each feature map and an excitation operation that scales each feature map according to its importance. With the SE module, the improvement in model performance is substantial compared to the small increase in the number of parameters, while the complexity of the model and its computation do not increase significantly.
  • The squeeze operation literally squeezes the features: only the important information is extracted from each channel. Extracting this core information matters in the parts of the network where the local receptive field is very small. We use global average pooling (GAP), one of the most common methods for extracting core information; GAP compresses global spatial information into a channel descriptor.
  • After squeezing the core information, the feature maps are recalibrated through the excitation operation, which computes the dependencies between channels.
  • Fscale(⋅,⋅) is a channel-wise multiplication, and X̃ denotes the H×W×C feature map from before the squeeze operation. Ultimately, the scale values produced by the excitation operation lie between 0 and 1, so each channel is scaled according to its importance.
  • In this study, the SE operation is applied to each feature map produced by an edge convolution block, and the results are then combined into a point cloud feature. Furthermore, a deeper feature map can be constructed by adding the channel-wise weighted SE outputs at each step. Through this process, high-dimensional point cloud data can be processed more efficiently, and faster learning and improved performance can be expected with negligible additional computation; a minimal SE sketch is given below.
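As an illustration, here is a minimal squeeze-and-excitation sketch in PyTorch for an image-style H×W×C feature map (the reduction ratio r = 16 follows the original SE paper; applying it to edge feature maps would only change the pooled axes):

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze (GAP) -> excitation (two FC layers + sigmoid) -> channel-wise scaling."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),   # scale values between 0 and 1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W) feature map -> recalibrated map of the same shape."""
        s = x.mean(dim=(2, 3))   # squeeze: GAP compresses H x W into a channel descriptor
        scale = self.excite(s)   # excitation: per-channel importance in [0, 1]
        # F_scale: channel-wise multiplication with the original feature map.
        return x * scale[:, :, None, None]

# Example: SEModule(64)(torch.randn(8, 64, 32, 32)).shape -> torch.Size([8, 64, 32, 32])
```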

3) Skip-Dense Network

 We take the output of the above backbone network as the input of the skip-dense network [65] (Fig. 6(a)). A skip-dense network consists of stacked fully connected layers with skip connections, expressed as follows:

    Il+1 = W · ReLU(BNγ,β(Il)) + b + α · Il     (7)

In (7), Il is the input of skip-dense layer l, and BNγ,β is batch normalization, where γ and β are the parameters of batch normalization. This is followed by the ReLU activation function and the fully connected layer, where W and b are the parameters of the fully connected layer. α is a coefficient that adjusts the proportion of the skip connection. This pure skip-dense network increases the depth of the model while improving performance, but at the cost of greatly increasing the number of parameters and computational complexity. Therefore, SE modules are applied to the skip-dense network to improve learning speed and performance (Fig. 6(b)); a minimal sketch of one skip-dense layer follows below.
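A minimal sketch of one skip-dense layer, reading (7) literally (PyTorch; the layer width, the α value, and the final classifier head are our placeholders):

```python
import torch
import torch.nn as nn

class SkipDenseLayer(nn.Module):
    """One layer of (7): I_{l+1} = W . ReLU(BN(I_l)) + b + alpha . I_l."""
    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.bn = nn.BatchNorm1d(dim)   # BN with learnable gamma and beta
        self.fc = nn.Linear(dim, dim)   # fully connected layer with W and b

    def forward(self, I_l: torch.Tensor) -> torch.Tensor:
        # alpha adjusts the proportion of the skip connection.
        return self.fc(torch.relu(self.bn(I_l))) + self.alpha * I_l

# Stacked layers form the skip-dense network; an SE module can additionally
# recalibrate the output of each layer, as in Fig. 6(b).
net = nn.Sequential(SkipDenseLayer(512), SkipDenseLayer(512), nn.Linear(512, 40))
print(net(torch.randn(8, 512)).shape)   # torch.Size([8, 40])
```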

 B. Part Segmentation Model

Figure 5 shows the segmentation model. The segmentation model is similar to the classification model, but additionally considers a categorical label vector. The difference between the segmentation model and the classification model is that the label vector is aggregated into the recalibrated feature map. By considering the label vector and bundling the point cloud features and the segmentation labels of the points into the same feature map, local and global information are learned simultaneously. Finally, the model predicts an n×p segmentation label, as sketched below.
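A minimal sketch (PyTorch; shapes and names are our assumptions) of folding the categorical label vector into the recalibrated per-point feature map before predicting the n×p labels:

```python
import torch

def fuse_label_vector(point_feats: torch.Tensor, label_onehot: torch.Tensor) -> torch.Tensor:
    """point_feats: (B, n, C) per-point features; label_onehot: (B, L) object category."""
    B, n, _ = point_feats.shape
    # Broadcast the object-level label vector to every point and concatenate,
    # so each point carries both its local features and the global category.
    label_per_point = label_onehot.unsqueeze(1).expand(B, n, -1)   # (B, n, L)
    return torch.cat([point_feats, label_per_point], dim=-1)       # (B, n, C+L)

# Example: 64-dim point features, 16 categories -> fused (2, 1024, 80);
# a per-point linear head on this tensor then yields the n x p label scores.
fused = fuse_label_vector(torch.randn(2, 1024, 64), torch.eye(16)[:2])
print(fused.shape)   # torch.Size([2, 1024, 80])
```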

IV. Experiment
