Unsupervised 3D Object Segmentation Learned from Dynamic Point Clouds


Song Ziyang is a second-year Ph.D. student in the Department of Computing at The Hong Kong Polytechnic University, supervised by YANG Bo. His research interest is unsupervised 3D scene understanding.

This paper presents a general unsupervised 3D object segmentation method, OGC (Object Geometry Consistency), that can segment multiple objects: the method is trained on completely unlabeled point cloud sequences, learns 3D object segmentation from motion information, and, after training, can perform object segmentation directly on single-frame point clouds.


OGC: Unsupervised 3D Object Segmentation from Rigid Dynamics of Point Clouds

Paper: https://arxiv.org/abs/2210.04458

Code: https://github.com/vLAR-group/OGC

OGC handles object part segmentation as well as indoor and outdoor object segmentation, all without any manual labels.

01

Introduction

3D point cloud object segmentation is one of the key problems in 3D scene understanding and underpins applications such as autonomous driving and intelligent robotics. However, current mainstream methods are based on supervised learning and require large amounts of manually labeled data, and labeling point clouds is very time-consuming and labor-intensive.

02

Motivation

This paper aims to find an unsupervised 3D object segmentation method, and we observe that motion information holds promise for achieving this goal. As shown in Figure 1 below, within the blue/orange circle in the left image, all points on a car move forward together while the other points in the scene remain stationary. In theory, then, based on the motion of each point, we can separate the points belonging to the car from the other points in the scene and achieve the effect shown on the right.


Figure 1. Motivation: using motion information to segment objects.

The idea of using motion information to segment 3D objects has been explored in some existing works. For example, [1] and [2] use traditional sparse subspace clustering to segment moving objects from point cloud sequences; SLIM [3] proposes the first learning-based approach to separate the moving foreground from the static background. However, existing methods are limited in one or more of the following aspects:

1) They are only applicable to specific scenarios and are not general-purpose;

2) They can only perform binary segmentation between the moving foreground and the static background, and cannot further distinguish multiple objects within the foreground;

3) (A limitation of almost all existing methods) They require a multi-frame point cloud sequence as input and can only segment moving objects. In theory, however, once we have used motion information to learn to recognize certain objects, we should still be able to recognize them when they appear static in a single-frame point cloud.

In response to the above problems, we aim to design a general unsupervised 3D object segmentation method that can segment multiple objects: the method is trained on completely unlabeled point cloud sequences and learns 3D object segmentation from motion information; after training, it can perform object segmentation directly on single-frame point clouds. To this end, this paper proposes OGC (Object Geometry Consistency), an unsupervised 3D object segmentation method. The main contributions of this paper are the following three points:

1) We propose OGC, the first general-purpose unsupervised 3D object segmentation framework. It requires no manual annotation during training and learns from the motion information contained in point cloud sequences; after training, it can perform object segmentation directly on single-frame point clouds.

2) As the core of the OGC framework, we design a set of loss functions that use the consistency of object geometry during motion as a constraint, effectively turning motion information into supervision signals for object segmentation.

3) We achieve very good results on object part segmentation and on indoor and outdoor object segmentation tasks.

03

Method

3.1

Overview

As shown in Figure 2 below, our framework consists of three parts:

1) An object segmentation network (orange part), which estimates object segmentation masks from a single-frame point cloud;

2) A self-supervised scene flow estimation network (green part), which estimates the motion (scene flow) between two point cloud frames;

3) A set of loss functions (blue part), which use the motion estimated by 2) to provide supervision signals for the object segmentation masks output by 1).

During training, the three parts work together; after training, only the object segmentation network in 1) is retained and can be used to segment single-frame point clouds.
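As a rough illustration of how these parts interact during training, here is a minimal sketch of one training step. All names (seg_net, flow_net, ogc_losses) are illustrative placeholders under these assumptions, not the actual API of the released code.

```python
import torch

def train_step(seg_net, flow_net, ogc_losses, pc1, pc2, optimizer):
    """One hypothetical training step on a pair of point cloud frames of shape (N, 3)."""
    with torch.no_grad():
        flow = flow_net(pc1, pc2)        # 2) self-supervised scene flow between the two frames
    masks = seg_net(pc1)                  # 1) per-point soft object masks, shape (N, K)
    loss = ogc_losses(pc1, flow, masks)   # 3) dynamic + smoothness + invariance losses
    optimizer.zero_grad()
    loss.backward()                       # the supervision signal flows into the segmentation network
    optimizer.step()
    return loss.item()

# After training, only the segmentation network is kept:
# masks = seg_net(single_frame_point_cloud)
```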


Figure 2. Schematic diagram of the OGC framework.

For the object segmentation network and the scene flow estimation network in the OGC framework, we can directly adopt existing network architectures, as shown in Figure 3 below; in particular, the scene flow is estimated with the self-supervised FlowStep3D network [5].

Figure 3. Network architectures used for object segmentation and scene flow estimation.

3.2

OGC Losses

The key to the OGC framework is how to use motion information to provide supervision signals for object segmentation. To this end, we design the following loss functions:

1) Dynamic loss: The motion of most objects in the real world can be described by a rigid-body transformation. In this loss function, we therefore require that, within each estimated object segmentation mask, the motion of all contained points obeys a single rigid-body transformation, where the transformation is fitted on that mask. If a mask actually contains two objects moving in different directions, the motions of the points on these two objects cannot obey the same rigid-body transformation; the transformation forcibly fitted to these points will then disagree with their actual motion, and the mask is penalized by the loss. The dynamic loss thus helps us separate objects that move in different directions. However, if points that actually belong to the same object are split into two pieces, i.e. "over-segmentation", the dynamic loss cannot penalize this situation.
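For concreteness, a plausible way to write the dynamic loss down, following the description above (soft masks $o_i^k$ over $K$ objects, points $p_i$, estimated scene flow $f_i$, and a rigid transformation $(R_k, t_k)$ fitted to the $k$-th mask by weighted alignment; the exact weighting in the paper may differ), is:

$$\ell_{\mathrm{dynamic}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} o_i^k \,\big\| (R_k p_i + t_k) - (p_i + f_i) \big\|_2$$

Each term compares where the fitted rigid motion would take point $p_i$ with where the estimated scene flow actually takes it, weighted by how strongly the point is assigned to object $k$.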

2) Smoothness loss: The points of an object are generally spatially connected, otherwise the object would break apart. Based on this fact, we impose a smoothness prior on the object segmentation masks, requiring that points adjacent to each other in a local neighborhood be assigned to the same object, where each point's neighborhood contains H points. Note that the dynamic loss and the smoothness loss counteract each other: the former separates points according to their motion, while the latter aggregates spatially adjacent points according to their neighborhood relations, offsetting the potential "over-segmentation" problem. Together, these two losses provide sufficient supervision signals for segmenting the moving objects in a scene.
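A minimal sketch of such a smoothness term, assuming the same soft masks $o_i^k$ and writing $\mathcal{N}_i$ for the $H$-point neighborhood of $p_i$ (the paper's exact form may differ), is:

$$\ell_{\mathrm{smooth}} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{k=1}^{K}\frac{1}{H}\sum_{j\in\mathcal{N}_i}\big| o_i^k - o_j^k \big|$$

It simply penalizes a point whose soft assignment differs from that of its spatial neighbors.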

3) Invariance loss: We want the segmentation learned from moving objects to generalize fully to static objects with similar shapes. To this end, we require the object segmentation network to segment the same object consistently when it appears in different poses. Specifically, we apply two different spatial transformations (rotation, translation and scaling), v1 and v2, to the same scene so that the poses of the objects change, and we then require the segmentation results of the two transformed scenes to be identical.

The invariance loss effectively generalizes the segmentation strategy learned from moving objects to static objects in different poses.
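Writing $O_1 = g(v_1(P))$ and $O_2 = g(v_2(P))$ for the masks predicted on the two transformed copies of a scene $P$ by the segmentation network $g$, one permutation-invariant way to express this constraint (a sketch under these assumptions; the paper's actual formulation may match and compare the masks differently) is to compare the point-pair affinity matrices:

$$\ell_{\mathrm{invariance}} = \frac{1}{N^2}\big\| O_1 O_1^{\top} - O_2 O_2^{\top} \big\|_1$$

The product $O O^{\top}$ encodes, for every pair of points, how strongly they are assigned to the same object, so the comparison does not depend on the order in which objects are listed.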

3.3

Iterative Optimization

When we learn to segment objects from motion information, we can in theory also use the estimated object segmentation to improve the quality of the motion (scene flow) estimates, and then learn to segment objects better from the more accurate motion. To achieve this, we propose the "object segmentation-motion estimation" iterative optimization algorithm shown in Figure 4 below. In the initial stage, we estimate motion with the FlowStep3D network [5]. In each round, we first learn object segmentation from the currently estimated motion; we then use our Object-aware ICP algorithm to improve the motion estimates based on the estimated object segmentation, and pass the improved motion estimates into the next round.


Figure 4. Schematic diagram of the "object segmentation-motion estimation" iterative optimization algorithm.

The Object-aware ICP algorithm used in the iterative process can be regarded as an extension of the traditional ICP algorithm to multi-object scenarios. For details of the algorithm, please refer to Appendix A.2 of the paper.
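The round structure can be summarized by the following sketch; train_segmentation and object_aware_icp are hypothetical placeholders standing in for the procedures described above, not functions from the released repository.

```python
def iterative_optimization(frame_pairs, flow_net, num_rounds=3):
    """Hypothetical outline of the 'object segmentation-motion estimation' iteration."""
    # Initial stage: estimate scene flow with the self-supervised flow network (FlowStep3D).
    flows = [flow_net(p1, p2) for p1, p2 in frame_pairs]
    seg_net = None
    for _ in range(num_rounds):
        # Step 1: learn object segmentation from the current motion estimates.
        seg_net = train_segmentation(frame_pairs, flows)
        # Step 2: refine the motion estimates with Object-aware ICP, fitting one rigid
        # transformation per estimated object mask, and pass them to the next round.
        flows = [object_aware_icp(p1, p2, seg_net(p1)) for p1, p2 in frame_pairs]
    return seg_net
```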

04

Experiments

Evaluation on Synthetic Datasets

We first evaluate OGC on object part segmentation and indoor object segmentation tasks using the SAPIEN dataset and our own synthesized OGC-DR/OGC-DRSV datasets. As the following two tables show, on high-quality synthetic datasets OGC not only outperforms traditional unsupervised motion segmentation and clustering methods, but also achieves results close to, or even surpassing, fully supervised methods.


Figure 5 Comparison of quantitative results of different methods on the SAPIEN dataset


Figure 6 Comparison of quantitative results of different methods on the OGC-DR/OGC-DRSV dataset

Evaluation on Real-World Outdoor Datasets

Next, we evaluate the performance of OGC on the extremely challenging outdoor object segmentation task. First, we evaluate on the KITTI Scene Flow (KITTI-SF) dataset. KITTI-SF contains 200 pairs of point clouds for training and 200 single-frame point clouds for testing. The experimental results are shown in the table below: Our method achieves excellent performance close to that of fully supervised methods.


Figure 7 Comparison of quantitative results of different methods on the KITTI-SF dataset

In practical applications, sequence data containing motion sometimes cannot be collected, but we can instead generalize an OGC model trained on similar scenes. Here, we take the OGC model trained on the KITTI-SF dataset above and directly use it to segment single-frame point clouds from the KITTI Detection (KITTI-Det) and SemanticKITTI datasets. Note that the point clouds in KITTI-Det and SemanticKITTI are collected by LiDAR and are much sparser than the point clouds in KITTI-SF, which come from a stereo camera; moreover, the data scales of KITTI-Det (3769 frames) and SemanticKITTI (23201 frames) are both much larger than that of KITTI-SF. The experimental results are shown in the following two tables: the OGC model trained on KITTI-SF generalizes directly to sparse LiDAR point clouds and achieves performance close to that of fully supervised methods.


Figure 8. Comparison of quantitative results on the KITTI-Det dataset (* indicates the model is trained on KITTI-SF).


Figure 9. Comparison of quantitative results on the SemanticKITTI dataset (* indicates the model is trained on KITTI-SF).

Ablation Studies

We conduct ablation experiments on the core components of the OGC framework on the SAPIEN dataset:

1) Loss function design: As shown in the figure and tables below, combining the three loss functions of OGC gives the best results. If the dynamic loss is removed, all points are assigned to the same object; if the smoothness loss is removed, the "over-segmentation" problem appears.

2) Iterative optimization algorithm: It can be seen that as the number of iterations increases, higher-quality motion estimation does lead to better object segmentation performance.


Figure 10. Ablation studies on the SAPIEN dataset (left figure and upper table: loss function design; lower table: iterative optimization algorithm).

05

Summary

To conclude, we present the first framework for unsupervised 3D object segmentation on point clouds. At the core of this framework is a set of loss functions based on the consistency of object geometry, which leverage motion information to effectively supervise object segmentation. Our method is trained on completely unlabeled point cloud sequences; after training, it can be used directly to segment single-frame point clouds, and it shows very good results across a variety of task scenarios. In the future, OGC can be further extended in two directions:

1) When a small amount of labeled data is available, how to combine the unsupervised OGC model with this labeled data to achieve better performance;

2) When multiple frames are available as input, how to use the multi-frame information for better segmentation.

References

[1] U. M. Nunes and Y. Demiris. 3D motion segmentation of articulated rigid bodies based on RGB-D data. BMVC, 2018.

[2] C. Jiang, D. P. Paudel, D. Fofi, et al. Moving Object Detection by 3D Flow Field Analysis. TITS, 22(4):1950–1963, 2021.

[3] S. A. Baur, D. J. Emmerichs, F. Moosmann, et al. SLIM: Self-Supervised LiDAR Scene Flow and Motion Segmentation. ICCV, 2021.

[4] B. Cheng, A. G. Schwing, and A. Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. NeurIPS, 2021.

[5] Y. Kittenplon, Y. C. Eldar, and D. Raviv. FlowStep3D: Model Unrolling for Self-Supervised Scene Flow Estimation. CVPR, 2021.
