Classic Literature Reading: Bidirectional Camera-LiDAR Fusion (A New Paradigm for Camera-LiDAR Fusion)

0. Introduction

Multimodal fusion between LiDAR and vision cameras is very important. The paper "Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion" proposes a multi-stage bidirectional fusion framework and builds two models, CamLiPWC and CamLiRAFT, on top of the PWC and RAFT architectures respectively. The related code can be found at https://github.com/MCG-NJU/CamLiFlow . Let's take a closer look at the details of this article.

1. Main contributions

This paper focuses on the multimodal fusion of camera and LiDAR. The specific task is the joint estimation of 2D optical flow and 3D scene flow (scene flow is the 3D counterpart of optical flow, i.e. a dense 3D motion field). The paper makes four main contributions:

  1. We propose a bidirectional and multi-stage camera-LiDAR fusion pipeline for optical flow and scene flow estimation. Our process is general and can be applied to various network architectures.
  2. We instantiate the bidirectional fusion pipeline on two architectures: one based on a pyramidal coarse-to-fine design (called CamLiPWC), and the other based on recurrent all-pairs field transforms (called CamLiRAFT).
  3. We design a learnable fusion operator (Bi-CLFM) that aligns and fuses image and point features in a bidirectional manner via learnable interpolation and bilinear sampling. A gradient truncation strategy is also introduced to prevent one modality from dominating the training.
  4. On FlyingThings3D and KITTI, our method achieves state-of-the-art performance in both camera-LiDAR and LiDAR-only settings. Experiments on Sintel also demonstrate its strong generalization ability and its capacity to handle non-rigid motion.

2. Related work

This paper is relatively recent work, and the author believes a review of related work is necessary so that readers can understand how this field has developed.

2.1 Optical flow method (2D)

Optical flow estimation aims to predict dense 2D motion for each pixel from a pair of frames. We roughly divide related optical flow estimation methods into two categories: traditional methods and convolutional neural network (CNN) based methods.
Traditional approaches: Traditional approaches formulate optical flow estimation as an energy minimization problem. Horn and Schunck [18] proposed a variational method that estimates optical flow by trading off a data term against a regularization term. Black and Anandan [5] introduced a robust framework to address over-smoothing and noise sensitivity. Other methods improve the data term [68] and the matching cost.

CNN-based methods: Since Krizhevsky et al. [27] showed that convolutional neural networks perform well on large-scale image classification, many researchers have explored CNN-based methods for various computer vision tasks. FlowNet [13] is the first end-to-end trainable CNN for optical flow estimation, employing an encoder-decoder architecture. FlowNet2 [21] stacks multiple FlowNets into a larger network. PWC-Net [50] and several other methods [19], [20], [46], [63] iteratively refine the flow over a coarse-to-fine pyramid. These coarse-to-fine methods tend to miss small, fast-moving objects that vanish at the coarse levels. To address this issue, RAFT [51] constructs a 4D cost volume over all pairs of pixels and iteratively updates the optical flow at high resolution. In this paper, we implement our bidirectional fusion pipeline on two representative optical flow architectures: PWC-Net [50] and RAFT [51].

2.2 Scene flow methods (3D)

Scene flow estimation is similar to optical flow estimation, except that scene flow is a motion field defined in 3D space, while optical flow is defined in 2D space. Some methods estimate pixel-wise dense scene flow from RGB-D input, while others focus on estimating sparse scene flow from point clouds.

Scene flow from RGB-D frames: RGB-D scene flow refers to estimating dense 3D motion for each pixel from pairs of stereo or RGB-D frames. Similar to optical flow, traditional methods explore variational and discrete optimization, treating scene flow estimation as an energy minimization problem. Recent methods divide scene flow estimation into multiple subtasks and build modular networks with one or more submodules per subtask. Specifically, DRISF [36] estimates optical flow, depth and segmentation from two consecutive stereo images, and uses a Gauss-Newton solver to find the optimal 3D rigid motion. RigidMask [65] predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Despite remarkable progress, their submodules are independent of each other and cannot exploit the complementary features of different modalities. RAFT-3D [52] explores feature-level fusion: it concatenates images and depth maps into RGB-D frames at an early stage, and then uses a unified 2D network to iteratively update a dense field of pixel-wise SE3 motion. However, this "early fusion" makes it difficult for 2D CNNs to exploit 3D structural information.

Scene flow from point clouds: PointNet is a pioneering work in deep learning on point sets that directly processes 3D points (e.g. from LiDAR). Since then, researchers [15], [26], [32], [33], [41], [55], [56], [59] have explored point-based methods for scene flow estimation. Built on PointNet++ [44], FlowNet3D [32] uses a flow embedding layer to represent the motion of points. FlowNet3D++ [55] achieves better performance by adding geometric constraints. Inspired by bilateral convolutional layers, HPLFlowNet [15] projects points onto a permutohedral lattice. PointPWC-Net [59] introduces a learnable point cloud cost volume and estimates scene flow in a coarse-to-fine fashion. FLOT [41] regards scene flow estimation as a matching problem between corresponding points in adjacent frames and solves it with optimal transport. PV-RAFT [56] proposes point-voxel correlation fields to capture both local and long-range dependencies of point pairs. FlowStep3D [26] designs a recurrent architecture that learns to iteratively refine scene flow predictions. However, these methods do not take advantage of the color features provided by images, which limits their performance.

2.3 Camera-LiDAR Fusion

Cameras and LiDAR have complementary properties that benefit many computer vision tasks, such as depth estimation, scene flow estimation, and 3D object detection. Fusion methods can be classified into result-level and feature-level fusion.

Result-level fusion: Some researchers construct modular networks and perform result-level fusion. FPointNet [42] uses off-the-shelf 2D object detectors to restrict the search space for 3D object detection, which significantly reduces computation and improves runtime. IPOD [66] replaces the 2D object detector with 2D semantic segmentation and uses point-based proposal generation. PointPainting [54] projects LiDAR point clouds into the output of an image semantic segmentation network and attaches a class score to each point. However, the performance of result-level fusion is limited by the submodules, since the entire network depends on their results. In contrast, our method uses feature-level fusion and can be trained in an end-to-end manner.

Feature-level fusion: Another line of work performs feature-level fusion. PointFusion [60] uses a 2D object detector to generate 2D boxes, and then fuses image and point features with a CNN and a point-based network for 3D object detection. MVX-Net [49] uses a pretrained 2D Faster R-CNN to extract image features and VoxelNet to generate the final boxes; points or voxels are projected onto the image plane and the corresponding features are fused with the 3D features. [30] uses continuous convolutions to fuse image features onto BEV feature maps of different resolutions. BEVFusion [34] transforms camera features into BEV space using a lift-splat-shoot operation and fuses the two modalities with a BEV encoder. TransFusion [2] follows a two-stage pipeline: queries are generated from LiDAR features and interact with 2D and 3D features respectively. CMT [62] explores cross-modal transformers to implicitly align multimodal features using coordinate encodings. Different from previous work, we propose a multi-stage bidirectional fusion pipeline, which not only fully exploits the characteristics of each modality but also maximizes the complementarity between modalities.

3. Method details: Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM)

This section introduces the Bidirectional Camera-LiDAR Fusion Module (Bi-CLFM), which fuses dense image features and sparse point features in a bidirectional manner (2D to 3D and 3D to 2D). As shown in Figure 3, Bi-CLFM takes as input the image features $F_{2D} \in \mathbb{R}^{H \times W \times C_{2D}}$, the point features $G_{3D} = \{g_i \mid i = 1, ..., N\} \in \mathbb{R}^{N \times C_{3D}}$, and the point positions $P = \{p_i \mid i = 1, ..., N\} \in \mathbb{R}^{N \times 3}$, where $N$ denotes the number of points. The output consists of the fused image and point features. Bi-CLFM thus performs bidirectional fusion without changing the spatial structure of the input features, and can be easily plugged into any point-image two-branch structure. For each direction (2D to 3D or 3D to 2D), the features are first aligned to the same spatial structure using bilinear grid sampling and learnable interpolation. Next, the aligned features are adaptively fused based on selective kernel convolution [29]. Furthermore, we introduce a gradient truncation strategy to address the problem of mismatched gradient scales.
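To make the interface concrete, here is a minimal PyTorch-style sketch of what a Bi-CLFM block consumes and produces. The class and argument names are our own and the body is a placeholder, not the authors' implementation; only the tensor shapes follow the description above.

```python
import torch
from torch import nn, Tensor

class BiCLFM(nn.Module):
    """Interface sketch of Bi-CLFM (shapes only; not the authors' code).

    Inputs
        feat_2d: [B, C2d, H, W]  dense image features F_2D
        feat_3d: [B, C3d, N]     sparse point features G_3D
        xyz:     [B, N, 3]       point positions P (camera coordinates)
    Outputs
        fused image and point features with unchanged spatial structure,
        so the module can be dropped into any two-branch architecture.
    """

    def forward(self, feat_2d: Tensor, feat_3d: Tensor, xyz: Tensor):
        # A real Bi-CLFM would (1) align features across modalities (Sec. 3.1),
        # (2) fuse them adaptively (Sec. 3.2), and (3) truncate cross-modal
        # gradients (Sec. 3.3). Here we just pass the inputs through.
        fused_2d, fused_3d = feat_2d, feat_3d  # placeholder pass-through
        return fused_2d, fused_3d
```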

3.1 Feature Alignment

Since image features are dense and point features are sparse, we need to align the features of the two modalities before fusion. Specifically, image features need to be sampled to become sparse, while point features need to be interpolated to become dense.


2D->3D: For 2D-to-3D alignment, we first project the points onto the image plane and sample the corresponding 2D features, handling non-integer coordinates with bilinear interpolation. A 1×1 convolution is then used to align the channel dimension of the sampled features with that of the input point features.
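A minimal PyTorch sketch of this 2D-to-3D alignment, assuming a pinhole camera with intrinsics K; the function name and the exact projection handling are our own, not taken from the repository.

```python
import torch
import torch.nn.functional as F

def sample_image_feats(feat_2d, xyz, K):
    """Project points with pinhole intrinsics K and bilinearly sample image features.

    feat_2d: [B, C2d, H, W], xyz: [B, N, 3] in camera coordinates, K: [B, 3, 3]
    returns: [B, C2d, N] image features sampled at the projected point locations
    """
    B, _, H, W = feat_2d.shape
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy
    uvw = torch.einsum('bij,bnj->bni', K, xyz)           # [B, N, 3]
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)     # [B, N, 2] pixel coords
    # Normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)   # [B, N, 2]
    sampled = F.grid_sample(feat_2d, grid.unsqueeze(1),          # [B, C2d, 1, N]
                            mode='bilinear', align_corners=True)
    return sampled.squeeze(2)                                    # [B, C2d, N]

# A 1x1 convolution (e.g. nn.Conv1d(c2d, c3d, kernel_size=1)) would then
# match the channel dimension of the sampled features to the point branch.
```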

3D->2D: The alignment from 3D to 2D is similar. We project the points onto the image plane and use a new learnable interpolation module (described below) to obtain a dense feature map from the sparse point features. A 1×1 convolution is then used to align the channel dimension of the interpolated point features with that of the input image features.

Learnable interpolation: For each pixel in the dense feature map, we find the k nearest projected points in the image plane. A ScoreNet then generates a weight for each neighboring feature based on the coordinate offset, assigning each one a score in the (0, 1) interval. Finally, the neighboring features are weighted by these scores and aggregated with a symmetric operation such as max or sum.
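The following is a simplified, brute-force PyTorch sketch of this learnable interpolation (single batch, dense k-NN via torch.cdist). The class name, MLP width and value of k are illustrative assumptions; the actual implementation is presumably far more efficient.

```python
import torch
from torch import nn

class LearnableInterpolation(nn.Module):
    """Brute-force sketch: scatter sparse point features into a dense map."""

    def __init__(self, k=3):
        super().__init__()
        self.k = k
        # ScoreNet: a small MLP on the 2D coordinate offset, sigmoid-activated
        self.score_net = nn.Sequential(
            nn.Linear(2, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, feat_3d, uv, hw):
        """feat_3d: [N, C3d] point features, uv: [N, 2] projected pixel coords,
        hw: (H, W) output resolution. Returns a dense map [C3d, H, W]."""
        H, W = hw
        ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                                torch.arange(W, dtype=torch.float32),
                                indexing='ij')
        pixels = torch.stack([xs, ys], dim=-1).reshape(-1, 2)       # [H*W, 2]
        # k nearest projected points for every pixel (brute force)
        dist = torch.cdist(pixels, uv)                              # [H*W, N]
        _, knn_idx = dist.topk(self.k, largest=False)               # [H*W, k]
        offsets = pixels.unsqueeze(1) - uv[knn_idx]                 # [H*W, k, 2]
        scores = self.score_net(offsets)                            # [H*W, k, 1]
        neigh = feat_3d[knn_idx]                                    # [H*W, k, C3d]
        dense = (scores * neigh).max(dim=1).values                  # max aggregation
        return dense.t().reshape(-1, H, W)                          # [C3d, H, W]
```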

3.2 Adaptive Feature Fusion

After aligning the features, we need to fuse them. The simplest options are concatenation or addition, but they are not adaptive. We therefore adopt an SKNet-based method for adaptive feature fusion, which can adaptively select the channels to be fused from each modality.
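Below is a hedged sketch of what such SK-style adaptive fusion might look like, following the selective-kernel idea of [29]; the class name and layer sizes are illustrative, not the authors' exact settings.

```python
from torch import nn

class SKFusion(nn.Module):
    """Sketch of SK-style channel-adaptive fusion of two aligned feature maps."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        # one channel-attention head per branch (image-aligned vs. point-aligned)
        self.attn = nn.Linear(hidden, channels * 2)

    def forward(self, feat_a, feat_b):
        """feat_a, feat_b: [B, C, ...] aligned features from the two modalities."""
        B, C = feat_a.shape[:2]
        summed = feat_a + feat_b
        # global average pooling over all spatial / point dimensions
        context = summed.flatten(2).mean(dim=2)            # [B, C]
        weights = self.attn(self.squeeze(context))         # [B, 2C]
        weights = weights.view(B, 2, C).softmax(dim=1)     # softmax across branches
        w_a, w_b = weights[:, 0], weights[:, 1]            # [B, C] each
        shape = (B, C) + (1,) * (feat_a.dim() - 2)
        return w_a.view(shape) * feat_a + w_b.view(shape) * feat_b
```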
[Figure 4: Details of learnable interpolation. For each target pixel, we find the k closest projected points around it; a lightweight MLP followed by a sigmoid activation is then used to weight the neighboring features.]

3.3 Gradient truncation

When performing multimodal fusion, one may encounter a mismatch between the gradient magnitudes of the two modalities. If left untreated, one modality may dominate training. We analyzed the gradients of the two modalities and found that the 2D gradients are about 40 times larger than the 3D ones. Therefore, in Bi-CLFM we truncate the gradient coming from the other modality, so that the two modalities do not interfere with each other's optimization.
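In code, this truncation corresponds to detaching (stop-gradient) the cross-modal feature before fusion. A minimal sketch, with hypothetical names:

```python
def fuse_with_detach(own_feat, other_feat_aligned, fusion):
    """Gradient truncation: the feature coming from the other modality is
    detached so its (much larger or smaller) gradients do not flow back
    across the fusion point. `fusion` is any fusion op, e.g. an SK-style module."""
    return fusion(own_feat, other_feat_aligned.detach())

# e.g. in the 2D branch: fused_2d = fuse_with_detach(feat_2d, interp_3d_to_2d, sk_fusion_2d)
#      in the 3D branch: fused_3d = fuse_with_detach(feat_3d, sampled_2d_to_3d, sk_fusion_3d)
```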

4. PWC pipeline (CamLiPWC)

PWC-Net [50] is designed according to simple and well-established principles: pyramidal processing, warping, and the use of cost volumes. Optical flow computed at a coarse level is upsampled and used to warp features at the next finer level. As shown in Figure 6, we build a two-branch network based on the PWC architecture, named CamLiPWC.
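For reference, the backward-warping step used in PWC-style pyramids can be sketched as follows; this is the standard operation, not code specific to CamLiPWC.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Backward-warp frame-2 features with the current flow estimate.

    feat: [B, C, H, W] features of frame 2
    flow: [B, 2, H, W] flow in pixels (channel 0 = x, channel 1 = y)
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=feat.dtype, device=feat.device),
        torch.arange(W, dtype=feat.dtype, device=feat.device), indexing='ij')
    grid = torch.stack([xs, ys], dim=0).unsqueeze(0) + flow        # [B, 2, H, W]
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid_x = 2 * grid[:, 0] / (W - 1) - 1
    grid_y = 2 * grid[:, 1] / (H - 1) - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)                   # [B, H, W, 2]
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)
```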

4.1 Basic Architecture

We use IRR-PWC [20] as the image branch. The only difference is that we replace bilinear upsampling with a learnable convex upsampling module [51], which produces the final prediction from the desired level. The point branch is based on PointPWC-Net [59] with two main modifications. First, we increase the number of pyramid levels to match the image branch, so the point pyramid has 6 levels with 8192, 4096, 2048, 1024, 512 and 256 points respectively. Second, the decoder weights are shared across all pyramid levels. Following IRR-PWC, iterative residual refinement with weight sharing reduces the number of parameters and increases accuracy.
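A toy sketch of the shared-decoder idea for the point branch: the decoder and the upsampling helper are stand-ins (a real decoder would use cost volumes and point-wise convolutions); only the weight-sharing pattern and the level sizes follow the text.

```python
import torch
from torch import nn

POINTS_PER_LEVEL = [8192, 4096, 2048, 1024, 512, 256]   # finest -> coarsest

def upsample_flow(flow, n_target):
    """Toy upsampling: repeat coarse flow to the finer level's point count."""
    reps = -(-n_target // flow.shape[0])                 # ceil division
    return flow.repeat_interleave(reps, dim=0)[:n_target]

class SharedDecoderPyramid(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # a single decoder instance is reused at every pyramid level
        self.decoder = nn.Linear(feat_dim + 3, 3)        # stand-in decoder

    def forward(self, pyramid_feats):
        """pyramid_feats: list of [N_l, feat_dim] tensors, coarsest level last."""
        flow = torch.zeros(pyramid_feats[-1].shape[0], 3)
        for feats in reversed(pyramid_feats):            # coarse -> fine
            flow = upsample_flow(flow, feats.shape[0])
            # iterative residual refinement with shared decoder weights
            flow = flow + self.decoder(torch.cat([feats, flow], dim=-1))
        return flow
```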

…For details, please refer to Gu Yueju
