[Paper reading] PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding (ECCV 2020)

Paper: https://link.springer.com/chapter/10.1007/978-3-030-58580-8_34

Code: the original repository is no longer maintained; an upgraded PointContrast implementation is included in the follow-up work ContrastiveSceneContexts, https://github.com/facebookresearch/ContrastiveSceneContexts

1 Introduction

At present, the main reasons why 3D self-supervised learning lags behind 2D are as follows:

  • Lack of large-scale, high-quality data: compared with 2D images, 3D data is harder to collect, more expensive to label, and the diversity of sensing devices can cause large domain gaps.
  • Lack of a unified backbone: unlike 2D, where ResNet is the standard pre-trained backbone that gets fine-tuned, 3D has no widely shared architecture.
  • Lack of comprehensive datasets and high-level tasks for evaluation.

How the paper addresses these issues:

Investigate unsupervised pre-training and supervised fine-tuning to advance 3D scene understanding.

  • Choose a large dataset (ScanNet) for pre-training

  • Identify a backbone architecture (sparse residual U-Net) that can be shared across many different tasks;

  • Define unsupervised objectives for pre-training the backbone (i.e., ways to evaluate the effect of pre-training): the Hardest-contrastive loss and the PointInfoNCE loss;

  • Define a set of evaluation protocols for different downstream tasks:

    –> Semantic segmentation: S3DIS, ScanNetV2, ShapeNetPart, Synthia 4D
    –> Object detection: SUN RGB-D, ScanNetV2

Author's approach:

Pre-training dataset: ScanNet

Backbone network: U-Net

Pre-training objective: two contrastive losses, the Hardest-contrastive loss [10] and PointInfoNCE [42] (adapted from 2D).

Target datasets and downstream tasks include: semantic segmentation on S3DIS [2], ScanNetV2 [11], ShapeNetPart [77], and Synthia 4D [52]; object detection on SUN RGB-D [57, 55, 32, 70] and ScanNetV2.

Furthermore, the authors find relatively little advantage of supervised over unsupervised pre-training. This suggests that future efforts in collecting pre-training data should be biased towards scale rather than precise annotation.

Author's contributions:
  • The transferability of high-level 3D point cloud representations across different scenes and tasks is evaluated for the first time.
  • Experimental results show that unsupervised pre-training effectively improves performance across different downstream tasks and datasets.
  • With this unsupervised pre-training, the best results are achieved on 6 different benchmarks.
  • The success of this work should encourage more research in related areas.

2. Related work

3D representation learning

Deep neural networks are data hungry. This makes the ability to transfer learned representations between datasets and tasks very powerful.

Here, we hope to advance research in 3D representation learning by focusing on transferability to higher-level tasks in more complex scenes.

Deep Architecture for Point Cloud Processing

In this work, the focus is on learning useful representations for point cloud data. A unified U-Net [51] architecture built with the Minkowski Engine is used as the backbone network in all experiments, and it is shown to transfer gracefully between tasks and datasets.

3. Method

(Section 3.1) Pilot Study: Does ShapeNet Pre-training Work?

Previous unsupervised learning on point clouds has focused on ShapeNet, a synthetic single-object CAD dataset. The authors ran supervised pre-training on ShapeNet and fine-tuned on the downstream dataset S3DIS. The results show that this pre-training brings almost no benefit to the downstream task and can even hurt performance slightly.
The paper highlights two main reasons why ShapeNet pre-training does not help downstream, together with the corresponding remedies:

  • Domain gap between source and target data: objects in ShapeNet are synthetic, scale-normalized, pose-aligned, and lack scene context, so the pre-training and fine-tuning data distributions are very different. Remedy for the domain gap: pre-train on a dataset of complex scenes such as ScanNet instead of a "clean" object dataset like ShapeNet.

  • Point-level representations are important: in 3D deep learning, local geometric features, such as those encoded by a point and its neighbors, have been shown to matter for 3D tasks [47, 48]. Training on object instances to obtain only global representations may not be sufficient. PointContrast therefore needs a pretext task and framework that capture point-level information: not just instance/global representations, but dense/local features at the point level.

(Section 3.2) Review of FCGF: Fully Convolutional Geometric Features for local feature learning

In the FCGF paper, the authors represent 3D data with sparse tensors, use Minkowski (sparse) convolutions instead of traditional convolutions, propose a ResUNet to extract a feature for every point of the input point cloud, and introduce new losses for fully convolutional metric learning. The network requires no data preprocessing (no hand-crafted feature extraction) and no patch-based input, and it produces state-of-the-art, discriminative, high-resolution features. The authors verified the representational power and feature extraction speed of FCGF (Fully Convolutional Geometric Features) on the 3DMatch and KITTI datasets.
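As a rough illustration of how a raw point cloud becomes the sparse-tensor input that FCGF (and later PointContrast) consumes, here is a minimal sketch using the MinkowskiEngine API; the 5 cm voxel size matches the value reported later for fine-tuning, the random point cloud is a stand-in for a real scan, and the exact call signatures should be checked against the installed MinkowskiEngine version.

```python
import numpy as np
import torch
import MinkowskiEngine as ME

def pointcloud_to_sparse_tensor(points, colors, voxel_size=0.05):
    """Quantize a point cloud (N, 3) with per-point colors (N, 3) into a
    MinkowskiEngine SparseTensor.  Sketch only; API details may differ
    between MinkowskiEngine versions."""
    # Discretize continuous coordinates into integer voxel indices,
    # keeping one feature vector per occupied voxel.
    coords, feats = ME.utils.sparse_quantize(
        coordinates=points, features=colors, quantization_size=voxel_size)
    # Prepend the batch index expected by MinkowskiEngine.
    coords = ME.utils.batched_coordinates([coords])
    feats = torch.as_tensor(feats, dtype=torch.float32)
    return ME.SparseTensor(features=feats, coordinates=coords)

# Example with a random point cloud standing in for a real indoor scan.
pts = np.random.rand(10000, 3).astype(np.float32) * 5.0
rgb = np.random.rand(10000, 3).astype(np.float32)
stensor = pointcloud_to_sparse_tensor(pts, rgb)
```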

Advantages of FCGF:

FCGF combines (1) a fully convolutional design and (2) point-level metric learning. Built as a fully convolutional network (FCN) [38], FCGF operates on the entire input point cloud (e.g., a complete indoor or outdoor scene) without cropping the scene into patches as in previous work; this way, local descriptors can aggregate information from a large number of neighboring points (up to the size of the receptive field), and point-level metric learning becomes natural. FCGF uses a U-Net architecture with full-resolution output (i.e., for N input points, the network outputs N associated feature vectors) and defines positive/negative pairs for metric learning at the point level. Despite having a fundamentally different goal in mind, FCGF suggests a solution to the design challenge of the pretext task: a fully convolutional design allows pre-training on target data distributions involving complex scenes with large numbers of points, and the pretext task can be defined directly on points. From this perspective, the question becomes: can we repurpose FCGF as a pretext task for high-level 3D understanding?

The authors take the best-performing FCGF model published in [10], which achieves a high Feature Matching Recall (FMR) of 0.958 on registration. However, this model does not perform well when transferred to S3DIS segmentation. Conversely, the PointContrast models that perform best for segmentation achieve a lower FMR when applied to the registration task. The conclusion is that low-level and high-level tasks in 3D may require different design choices.

(Section 3.3) PointContrast as a pretext task for unsupervised pre-training

In the design of the self-supervised task, the paper follows FCGF and applies two transformations (rotation, translation, and scaling) to a point cloud. Training under such transformations forces the network to cope with this variation, which helps transfer to other datasets. A contrastive loss then compares the points of the point cloud under the two views: the distance between matched point pairs is minimized and the distance between unmatched points is maximized. In other words, the network must learn invariance under geometric transformations.

Algorithmic flow: the pretext task is a point-level comparison between two transformed point clouds. Conceptually, given a point cloud x sampled from some scene, two partial scans x1 and x2 are first generated from different views and aligned in the same world coordinates (with at least 30% overlap guaranteed). Random rigid transformations T1 and T2 are then applied to x1 and x2 respectively, pushing the two views further apart to increase the difficulty of the task; finally, the two point clouds are encoded and trained with contrastive learning.
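As a concrete illustration of the pair-generation step above, here is a minimal numpy/scipy sketch of applying independent random rigid transformations to two aligned partial scans; the rotation and scale ranges follow the augmentation details reported in the appendix, while the function names and the random stand-in point clouds are illustrative, not the authors' code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_rigid_transform(points, rng):
    """Rotate by a random angle (0-360 deg) about a random axis and apply a
    random scale in [0.8, 1.2], mirroring the augmentations described in the
    appendix.  Illustrative sketch, not the authors' code."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    R = Rotation.from_rotvec(angle * axis).as_matrix()
    scale = rng.uniform(0.8, 1.2)
    return (scale * points) @ R.T

rng = np.random.default_rng(0)
# x1, x2 stand in for two partial scans of the same scene that are already
# aligned in world coordinates and share at least 30% of their points.
x1 = rng.random((8192, 3)).astype(np.float32)
x2 = rng.random((8192, 3)).astype(np.float32)

# T1 and T2 are drawn independently, so the backbone must learn features
# that are invariant to the relative rigid transform between the two views.
x1_t = random_rigid_transform(x1, rng)
x2_t = random_rigid_transform(x2, rng)
```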
The official data-loading code (screenshot omitted here) confirms that the inputs are raw point clouds rather than rendered images.
A network trained this way captures local, point-level information and tolerates large geometric variation, which makes it suitable for a range of downstream tasks.

(Section 3.4) Two contrastive loss functions

Hardest-Contrastive Loss: following the best-performing loss in FCGF, this is a margin-based contrastive loss with hard-negative mining.
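For intuition, the following is a minimal PyTorch sketch of a margin-based loss with hardest-negative mining in the spirit of FCGF [10]; the margin values and the size of the negative-candidate set are illustrative assumptions, not the exact settings of the paper.

```python
import torch
import torch.nn.functional as F

def hardest_contrastive_loss(f1, f2, pos_margin=0.1, neg_margin=1.4,
                             num_neg_candidates=256):
    """Margin-based contrastive loss with hardest-negative mining, in the
    spirit of FCGF [10].  f1, f2: (P, C) features of P matched point pairs
    (row i of f1 corresponds to row i of f2).  Margins and candidate-set
    size are illustrative, not the authors' exact settings."""
    f1 = F.normalize(f1, dim=1)
    f2 = F.normalize(f2, dim=1)

    # Positive term: pull matched features within the positive margin.
    pos_dist = (f1 - f2).norm(dim=1)
    pos_loss = F.relu(pos_dist - pos_margin).pow(2).mean()

    # Hardest-negative mining: for each anchor in f1, take the closest
    # non-matching feature among a random subset of f2 as its negative.
    idx = torch.randperm(f2.shape[0])[:num_neg_candidates]
    dists = torch.cdist(f1, f2[idx])                      # (P, K)
    # Mask out true correspondences that happen to be in the candidate set.
    mask = idx.unsqueeze(0) == torch.arange(f1.shape[0]).unsqueeze(1)
    dists = dists.masked_fill(mask, float("inf"))
    hardest_neg = dists.min(dim=1).values
    neg_loss = F.relu(neg_margin - hardest_neg).pow(2).mean()

    return pos_loss + neg_loss
```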

PointInfoNCE Loss: adapted from InfoNCE in 2D vision, it casts contrastive learning as a classification problem modeled with a softmax (cross-entropy) loss.

Compared with Hardest-Contrastive Loss, PointInfoNCE training is more stable.

(Section 3.5) Sparse residual U-Net as a shared backbone

We use the Sparse Residual U-Net (SR-UNet) architecture in this work. It is a 34-layer U-Net [51] architecture with an encoder network of 21 convolutional layers and a decoder network of 13 convolutional/deconvolutional layers. It follows the 2D ResNet basic block design, with each conv/deconv layer in the network followed by batch normalization (BN) [31] and ReLU activations. The entire U-Net architecture has 37.85 million parameters. We provide more information and network visualizations in the appendix. The SR-UNet architecture, originally designed in [9], shows significant improvements over previous methods on the challenging ScanNet semantic segmentation benchmark. In this work, we explore whether this architecture can be used as a unified design for a pre-training task and a different set of fine-tuning tasks.
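To make the conv → BN → ReLU residual design concrete, here is a rough sketch of a single sparse residual basic block written against the MinkowskiEngine API; it is a simplified stand-in for the blocks inside SR-UNet, and the exact layer signatures may differ across MinkowskiEngine versions.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class SparseBasicBlock(nn.Module):
    """A ResNet-style basic block built from sparse 3D convolutions,
    following the conv -> BN -> ReLU pattern described above.  A sketch of
    the building block only, not the full 34-layer SR-UNet."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ME.MinkowskiConvolution(
            channels, channels, kernel_size=3, stride=1, dimension=3)
        self.bn1 = ME.MinkowskiBatchNorm(channels)
        self.conv2 = ME.MinkowskiConvolution(
            channels, channels, kernel_size=3, stride=1, dimension=3)
        self.bn2 = ME.MinkowskiBatchNorm(channels)
        self.relu = ME.MinkowskiReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Stride-1 convolutions keep the coordinate map, so the sparse
        # tensors can be added directly for the residual connection.
        out += residual
        return self.relu(out)
```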

Appendix: Visualizing the SR-UNet structure

We use the SR-UNet architecture as a shared backbone for both pre-training and fine-tuning. For segmentation and detection tasks, both encoder and decoder weights are fine-tuned; for classification downstream tasks, only the encoder network is kept and fine-tuned.
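A hedged sketch of what this weight transfer can look like in PyTorch; the tiny stand-in module and the `encoder.`/`decoder.` key prefixes are placeholders for illustration, not identifiers from the released code.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the SR-UNet: an 'encoder' and a 'decoder' submodule."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(3, 32)
        self.decoder = nn.Linear(32, 13)

# Pretend this state dict holds PointContrast pre-trained weights.
pretrained = TinyBackbone().state_dict()

# Segmentation / detection: initialize both encoder and decoder, then
# fine-tune the whole network on the downstream task.
seg_model = TinyBackbone()
seg_model.load_state_dict(pretrained, strict=False)

# Classification: keep only the encoder weights and discard the decoder.
cls_model = TinyBackbone()
encoder_only = {k: v for k, v in pretrained.items() if k.startswith("encoder.")}
cls_model.load_state_dict(encoder_only, strict=False)
```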

(Section 3.6) Pre-training dataset

For local geometric feature learning methods, including FCGF [10], training and evaluation are usually performed on domain- and task-specific datasets such as KITTI odometry [17] or 3DMatch [81]. Common registration datasets are limited in scale (training samples collected from only a few dozen scenes) or in generality (focused on one specific application scenario, such as indoor scenes or LiDAR scans from autonomous vehicles), or both. To facilitate future research on 3D unsupervised representation learning, this work uses the ScanNet dataset for pre-training, aiming to address the scale issue. ScanNet is a collection of about 1,500 indoor scenes captured with a lightweight RGB-D scanning procedure, and it is the largest dataset of its kind currently available.

Here, a dataset of point cloud pairs is created on top of ScanNet. Given a scene x, pairs of partial scans x1 and x2 are extracted from different views. More precisely, for each scene the RGB-D scans are subsampled from the raw ScanNet video every 25 frames, and the 3D point clouds are aligned in the same world coordinates (using the estimated camera poses). Point cloud pairs are then collected from the sampled frames, with the requirement that the two point clouds in a pair overlap by at least 30%. A total of 870K point cloud pairs are sampled. Since the partial views within a ScanNet scene are aligned, the correspondence map M between two views can be computed directly with a nearest-neighbor search. Although ScanNet captures only indoor data distributions, as shown in Section 4.4, it generalizes surprisingly well to other target distributions. Additional visualizations of the pre-training dataset are provided in the appendix.
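As a rough illustration of how the correspondence map M and the 30% overlap requirement can be computed for two views already aligned in world coordinates, here is a nearest-neighbor sketch; the distance threshold is an assumed value, not one reported in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def correspondences_and_overlap(x1, x2, dist_thresh=0.05):
    """Match each point in x1 to its nearest neighbor in x2 (both already in
    the same world coordinates) and report the fraction of x1 that has a
    match within dist_thresh.  Sketch only; the threshold is an assumption."""
    tree = cKDTree(x2)
    dists, nn_idx = tree.query(x1, k=1)
    matched = dists < dist_thresh
    # M: array of (index in x1, index in x2) matched pairs.
    M = np.stack([np.nonzero(matched)[0], nn_idx[matched]], axis=1)
    overlap = matched.mean()
    return M, overlap

# A pair is kept for pre-training only if the views overlap enough, e.g.:
# M, overlap = correspondences_and_overlap(x1, x2)
# keep_pair = overlap >= 0.30
```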

Appendix: Visualizing point cloud pairs in ScanNet

Each row is a randomly sampled scene. Each column is a pair of different point clouds sampled from the same scene. Different colors correspond to two different views (partial scans). At least 30% of the points overlap in both views.

4. Fine-tuning on downstream tasks

The most important motivation for representation learning is to learn features that transfer well to different downstream tasks. Different evaluation protocols exist for measuring the usefulness of learned representations, for example evaluation with a linear classifier [19] or evaluation in a semi-supervised setting [27]. The supervised fine-tuning strategy, which uses pre-trained weights as initialization and further refines them on the target downstream task, is arguably the most practical way to evaluate feature transferability. With this setup, good features can directly translate into improved performance on downstream tasks.

From this perspective, this section performs an extensive evaluation of the PointContrast framework by fine-tuning the pretrained weights on multiple downstream tasks and datasets. The goal is to cover a range of high-level 3D understanding tasks of varying nature, such as semantic segmentation, object detection, and classification. In all experiments, the same backbone network is used, pretrained on the proposed ScanNet pair dataset (Section 3.6) with the PointInfoNCE and Hardest-Contrastive loss objectives.

4.2 S3DIS segmentation

Setup. The Stanford Large-Scale 3D Indoor Spaces (S3DIS) [2] dataset consists of 3D scans of 6 large indoor areas collected in 3 office buildings. Scans are represented as point clouds and annotated with semantic labels for 13 object categories. Among the datasets used for evaluation here, S3DIS is probably the most similar to ScanNet. Transferring features to S3DIS represents a typical fine-tuning scenario: **the downstream dataset is similar to, but much smaller than, the pre-training dataset.** For the commonly used benchmark split (test on Area 5), there are only about 240 samples in the training set. We follow [9] for preprocessing and use standard data augmentation; see the appendix for details.

Result: a 2.7% mIoU gain was achieved using the Hardest-Contrastive loss, and the PointInfoNCE variant achieved an average improvement of 2.1% mIoU.
Appendix: Detailed PointContrast pre-training settings

In the experiments, the T1 and T2 transformations applied to the two views x1 and x2 involve random rotations (0 to 360°) about arbitrary axes, applied independently to the two views. Scale augmentation is applied to both views (0.8 to 1.2 times the input scale). Other augmentations, such as translation, point coordinate jittering, and point dropout, were also tried, but no noticeable difference in fine-tuning performance was found.

For the hardest contrastive loss, the positive sample size is 1024, and the hardest negative sample size is 256. More details can be found in [10]. For the PointInfoNCE loss, we provide detailed PyTorch-like pseudocode (and explanatory notes) in Algorithm 2.
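Algorithm 2 itself is not reproduced here; instead, the following is a hedged PyTorch sketch of a PointInfoNCE-style loss consistent with the description above, in which each sampled matched pair is a positive and the other sampled points of the second view act as negatives. The pair-sampling size and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def point_info_nce(f1, f2, temperature=0.07, num_pairs=4096):
    """PointInfoNCE-style loss sketch: f1, f2 are (P, C) features of matched
    point pairs from the two views (row i of f1 matches row i of f2).
    Matched pairs are positives; all other sampled pairs act as negatives."""
    # Subsample matched pairs to keep the softmax over negatives tractable.
    sel = torch.randperm(f1.shape[0])[:num_pairs]
    q = F.normalize(f1[sel], dim=1)
    k = F.normalize(f2[sel], dim=1)

    # logits[i, j] is the similarity between point i in view 1 and point j
    # in view 2; the correct "class" for row i is column i.
    logits = q @ k.t() / temperature
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)
```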

Appendix: S3DIS Segmentation Experimental Details

We use the same hyperparameter settings. Specifically, we train the model for 10,000 iterations with data parallelism across 8 V100 GPUs and a batch size of 48. Batch normalization is applied independently on each GPU. We use the SGD optimizer with momentum, an initial learning rate of 0.8, and a polynomial LR scheduler with power 0.9. The weight decay is 0.0001 and the voxel size is 0.05 (5 cm). We use the same data augmentation techniques as in [9], such as hue/saturation augmentation, jittering, and scale augmentation (0.9x to 1.1x). Table 9 shows a detailed per-category performance breakdown of our model and previous methods.
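Based on the hyperparameters listed above, here is a rough PyTorch sketch of the optimizer and polynomial learning-rate schedule; the momentum value and the stand-in backbone module are assumptions for illustration.

```python
import torch

max_iter = 10_000                      # total fine-tuning iterations (from the text)
backbone = torch.nn.Linear(3, 13)      # stand-in for the SR-UNet backbone

# SGD with momentum; the momentum value itself is an assumption, while the
# learning rate, weight decay, and schedule follow the settings listed above.
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.8,
                            momentum=0.9, weight_decay=1e-4)

# Polynomial LR decay with power 0.9: lr(t) = 0.8 * (1 - t / max_iter) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** 0.9)

for it in range(max_iter):
    # ... forward pass and loss.backward() on a training batch go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```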
