ICCV 2023 | Highlights of Megvii Research Institute's Accepted Papers

Recently, the International Conference on Computer Vision (ICCV) announced its 2023 paper acceptance results: 8,068 papers were submitted, with an acceptance rate of 26.8%. ICCV is one of the top academic conferences in computer vision worldwide and is held every two years; ICCV 2023 will take place in Paris, France, in October this year. This year, 14 papers from Megvii Research Institute were accepted, covering vision-only 3D object detection, multi-modal 3D detection, image matching, optical flow estimation, 3D point cloud registration, and other areas. An overview of the accepted papers follows:

01

PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

PETRv2: A unified vision-only 3D perception framework

PETRv2 is a unified framework for vision-only 3D perception. Building on PETR, PETRv2 first extends PETR's 3D position encoding to temporal modeling, aligning object positions across frames. To support multi-task learning (such as BEV segmentation and 3D lane detection), PETRv2 designs task-specific query vectors and decodes them with a unified Transformer decoder. PETRv2 achieves state-of-the-art performance on 3D object detection, BEV segmentation, and 3D lane detection, and shows strong robustness to noise. We also provide a detailed robustness analysis of the PETR framework, and we hope PETRv2 can serve as a strong baseline for 3D perception.
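
As a rough illustration of the multi-task decoding idea described above, the sketch below keeps one set of learnable queries per task and runs them through a single shared Transformer decoder. All names, sizes, and layer choices are assumptions for illustration, not the released PETRv2 code.

```python
import torch
import torch.nn as nn

class MultiTaskDecoder(nn.Module):
    """Hypothetical sketch: task-specific queries, one shared decoder."""
    def __init__(self, dim=256, n_det=900, n_seg=256, n_lane=100, n_layers=6):
        super().__init__()
        self.det_query = nn.Embedding(n_det, dim)    # 3D detection queries
        self.seg_query = nn.Embedding(n_seg, dim)    # BEV segmentation queries
        self.lane_query = nn.Embedding(n_lane, dim)  # 3D lane queries
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, img_tokens):  # img_tokens: (B, N, dim), already position-encoded
        b = img_tokens.size(0)
        queries = torch.cat([self.det_query.weight,
                             self.seg_query.weight,
                             self.lane_query.weight], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(queries, img_tokens)      # one decoder serves all tasks
        n_det, n_seg = self.det_query.num_embeddings, self.seg_query.num_embeddings
        return out[:, :n_det], out[:, n_det:n_det + n_seg], out[:, n_det + n_seg:]
```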


Keywords: 3D position encoding, multi-task learning, lane detection, robustness

Paper link: https://arxiv.org/pdf/2206.01256.pdf

Code link: https://github.com/megvii-research/PETR.git

02 

Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

StreamPETR: An Object-Centric Temporal Modeling Framework for Vision-only 3D Detection

We propose StreamPETR, a vision-only 3D object detection framework for long-horizon temporal modeling. The algorithm is designed for video streams: it is trained on a configurable, finite number of frames, yet adapts to longer or even unbounded sequences at test time. StreamPETR uses a memory queue of object queries as an efficient temporal representation and models time with an attention mechanism, which greatly improves the detection performance of a single-frame detector at almost no additional computational cost. On the nuScenes leaderboard, StreamPETR is the first online vision-only 3D object detection algorithm whose performance is comparable to LiDAR-based methods.
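
The following sketch illustrates the object-centric memory-queue idea in a simplified form: top-scoring object queries from past frames are kept in a fixed-length queue, and the current frame's queries attend to them. The class names and hyperparameters are illustrative assumptions, not the StreamPETR implementation.

```python
import torch
import torch.nn as nn

class QueryMemoryQueue:
    """Keeps the top-k object queries of the last few frames (illustrative)."""
    def __init__(self, max_frames=4, top_k=256):
        self.max_frames, self.top_k, self.buffer = max_frames, top_k, []

    def push(self, queries, scores):                  # queries: (N, C), scores: (N,)
        k = min(self.top_k, scores.numel())
        idx = scores.topk(k).indices
        self.buffer.append(queries[idx].detach())     # stop gradients through time
        self.buffer = self.buffer[-self.max_frames:]  # keep only the recent frames

    def read(self):
        return torch.cat(self.buffer, dim=0) if self.buffer else None

class TemporalInteraction(nn.Module):
    """Current queries attend to the memory of past queries."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, cur_queries, memory):           # (N, C) and (M, C) or None
        if memory is None:
            return cur_queries
        q, m = cur_queries.unsqueeze(0), memory.unsqueeze(0)
        out, _ = self.attn(q, m, m)                   # propagate history into current queries
        return cur_queries + out.squeeze(0)
```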


Keywords: temporal modeling, sparse object queries, fast

Paper link: https://arxiv.org/pdf/2303.11926.pdf

Code link: https://github.com/exiawsh/StreamPETR

03

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Cross-Modal Transformer: Fast and Robust Multimodal Fusion 3D Detection Framework

We propose Cross Modal Transformer (CMT), a fast and robust 3D detector. The model keeps the DETR design: features from different modalities are fused only at the token level, and the fusion is a simple concatenation. Our single-model architecture achieves a state-of-the-art 74.1% NDS on the nuScenes test set, with inference speed exceeding all existing approaches. In addition, the model is highly robust to sensor failure and jitter: even if the entire LiDAR fails at runtime, it still maintains the accuracy of a vision-only model.
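
A minimal sketch of the token-level fusion described above, assuming both branches already produce position-encoded tokens of the same channel width; this is an illustration of the idea, not the released CMT code.

```python
import torch

def fuse_tokens(img_tokens, pts_tokens):
    # img_tokens: (B, N_img, C) from the camera backbone (+ position encoding)
    # pts_tokens: (B, N_pts, C) from the LiDAR backbone (+ position encoding)
    return torch.cat([img_tokens, pts_tokens], dim=1)   # "the simplest concat"

# The object queries then cross-attend to this fused memory in a DETR-style decoder;
# if one sensor drops out at runtime, its tokens can simply be omitted from the concat.
```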


Keywords: fast, robust, sensor failure, high precision

Paper link: https://arxiv.org/pdf/2301.01283.pdf

Code link: https://github.com/junjie18/CMT

04

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

OnlineRefer: A Simple Online Framework for Referring Video Object Segmentation

The RVOS task aims to segment objects in video according to a language expression, and the current mainstream solutions are offline models. In this paper, we break the prior assumption that only offline models suit RVOS and present an online baseline called OnlineRefer. Built on Deformable DETR, the method uses the boxes predicted in the previous frame as the reference points of the current frame (query propagation) to segment the target frame by frame. Applying this simple query propagation to a single-frame detector achieves SOTA performance on Refer-YouTube-VOS and Refer-DAVIS17. We also hope this work offers inspiration for applying the Segment Anything Model (SAM) to video.
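
The query-propagation loop can be summarized by the hedged pseudocode below, where `model` is an assumed single-frame detector-segmenter interface rather than the actual OnlineRefer API.

```python
def online_refer(frames, text_feat, model, init_queries):
    """`model` is an assumed callable: (frame, text, queries, references) -> (queries, boxes, mask)."""
    queries, references, masks = init_queries, None, []
    for frame in frames:
        queries, boxes, mask = model(frame, text_feat, queries, references)
        references = boxes          # boxes predicted on this frame guide the next one
        masks.append(mask)
    return masks
```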


Keywords: video segmentation, prompt-based segmentation, SAM

Paper link: https://arxiv.org/abs/2307.09356

Code: https://github.com/wudongming97/OnlineRefer

05

Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching

Robust and Efficient Stereo Matching via Uncertainty-Guided Adaptive Image Warping

For depth estimation in binocular vision, correlation-based stereo matching is the current mainstream solution. However, existing methods struggle to maintain stable performance across diverse, complex scenes with a single set of fixed model parameters. We therefore study the robustness of stereo matching in depth, propose an uncertainty-guided adaptive warping module, and design a new stereo matching framework, CREStereo++, which markedly improves model robustness. The algorithm won first place in the Robust Vision Challenge 2022, and its lightweight version also outperforms other algorithms of comparable computational cost on the KITTI dataset.
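
To make the idea of uncertainty-guided adaptive warping concrete, here is a hedged sketch in which a per-pixel uncertainty map scales the candidate sampling offsets, so uncertain regions are warped over a wider search range. Shapes and the exact modulation are assumptions, not the paper's module.

```python
import torch
import torch.nn.functional as F

def adaptive_warp(right_feat, disparity, offsets, uncertainty):
    # right_feat: (B, C, H, W); disparity: (B, 1, H, W)
    # offsets: (B, K, H, W) candidate horizontal offsets; uncertainty: (B, 1, H, W) >= 0
    b, c, h, w = right_feat.shape
    xs = torch.linspace(-1, 1, w, device=right_feat.device).view(1, 1, 1, w).expand(b, 1, h, w)
    ys = torch.linspace(-1, 1, h, device=right_feat.device).view(1, 1, h, 1).expand(b, 1, h, w)
    warped = []
    for k in range(offsets.size(1)):
        # wider search around the current disparity when uncertainty is high
        dx = (disparity + uncertainty * offsets[:, k:k + 1]) * 2.0 / (w - 1)
        grid = torch.cat([xs - dx, ys], dim=1).permute(0, 2, 3, 1)   # (B, H, W, 2), (x, y) order
        warped.append(F.grid_sample(right_feat, grid, align_corners=True))
    return torch.stack(warped, dim=1)   # (B, K, C, H, W) candidate warped features
```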


Keywords: stereo matching, adaptive, robustness

Paper link: https://arxiv.org/abs/2307.14071

06

OccNet: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions

An Occlusion-Aware Matching Network: Robust Image Matching Based on 3D Occupancy Estimation

Most image matching methods ignore the occlusion relationships between objects caused by camera motion and scene structure. To address this, we propose OccNet, an image matching method that uses a 3D occupancy model to describe occlusion relationships between objects and to find matching points in occluded regions. Thanks to the inductive bias encoded in the occupancy estimation (OE) module, combined with the occlusion-aware (OA) module, OccNet greatly simplifies bootstrapping a multi-view-consistent 3D representation. We evaluate OccNet on real-world and simulated datasets, and the results show that it outperforms existing state-of-the-art methods in scenes both with and without occlusion.
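
The sketch below illustrates, under stated assumptions, why an occupancy volume helps occluded matching: a pixel is lifted to 3D, re-projected into the second view, and the occupancy is queried to decide whether the predicted match is occluded. `query_fn` and the inputs are hypothetical placeholders, not OccNet's OE/OA modules.

```python
import numpy as np

def project_and_check_occlusion(p_xy, depth, K1, K2, T_1to2, occupancy, query_fn):
    # p_xy: (2,) pixel in view 1; depth: metric depth at that pixel
    # K1, K2: (3, 3) intrinsics; T_1to2: (4, 4) relative pose view1 -> view2
    # occupancy / query_fn: user-supplied 3D occupancy volume and lookup function
    xyz1 = np.linalg.inv(K1) @ np.array([p_xy[0], p_xy[1], 1.0]) * depth  # view-1 camera frame
    xyz2 = (T_1to2 @ np.append(xyz1, 1.0))[:3]                            # view-2 camera frame
    uvw = K2 @ xyz2
    match_xy = uvw[:2] / uvw[2]              # predicted match location, even if occluded
    occluded = query_fn(occupancy, xyz2)     # is some occupied voxel closer along view 2's ray?
    return match_xy, occluded
```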


The method matches not only the visible points but also the points linked by lines in the figure (occluded points).

Keywords: matching, occlusion, occupancy estimation, 3D, pose

07

DOT: A Distillation-Oriented Trainer

DOT: A Distillation-Oriented Optimizer

Knowledge distillation transfers knowledge from a large model to a small one, and its loss function typically combines a task-specific loss with a distillation loss. We find that after the distillation loss is introduced, the student model's task loss actually becomes larger, which is a counterintuitive trade-off. We attribute this to under-optimization of the distillation loss: since the teacher's task loss is lower than the student's, a lower distillation loss pulls the student closer to the teacher and should therefore yield better convergence of the task loss. To address this insufficient optimization of the distillation loss, this paper proposes a distillation-oriented trainer, DOT. DOT handles the gradients of the task loss and the distillation loss separately, and applies a larger momentum to the distillation loss to accelerate its optimization. Experiments show that DOT breaks the trade-off, i.e., both losses are fully optimized.
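
A minimal sketch of the separate-momentum idea, assuming plain SGD with momentum and illustrative hyperparameters (the real DOT settings may differ): the distillation gradient gets the larger momentum so its loss is optimized more aggressively.

```python
import torch

class DOTSGD:
    """Sketch: separate momentum buffers for task and distillation gradients."""
    def __init__(self, params, lr=0.05, momentum=0.9, delta=0.075):
        self.params = list(params)
        self.lr = lr
        self.mu_task = momentum - delta          # smaller momentum for the task loss
        self.mu_kd = momentum + delta            # larger momentum accelerates the KD loss
        self.v_task = [torch.zeros_like(p) for p in self.params]
        self.v_kd = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self, task_grads, kd_grads):
        for p, g_t, g_kd, v_t, v_kd in zip(self.params, task_grads, kd_grads,
                                           self.v_task, self.v_kd):
            v_t.mul_(self.mu_task).add_(g_t)     # momentum update for the task gradient
            v_kd.mul_(self.mu_kd).add_(g_kd)     # momentum update for the distillation gradient
            p.add_(v_t + v_kd, alpha=-self.lr)   # apply the combined update

# Usage: obtain the two gradient sets separately, e.g. with
# torch.autograd.grad(task_loss, params, retain_graph=True) and
# torch.autograd.grad(kd_loss, params), then call step(task_grads, kd_grads).
```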


Keywords: knowledge distillation, optimization algorithm, momentum method

Paper link: https://arxiv.org/abs/2307.08436

08

Cumulative Spatial Knowledge Distillation for Vision Transformers

CSKD: Cumulative Spatial Knowledge Distillation for ViT

Distilling knowledge from a CNN is a double-edged sword for ViT. The image-friendly local inductive bias of CNNs helps ViT learn faster and better, but it brings two problems: (1) the network designs of CNNs and ViTs are completely different, so their intermediate features sit at different semantic levels, making spatial-wise knowledge transfer inefficient; (2) distilling from a CNN limits convergence late in training, because ViT's ability to integrate global information is suppressed by the CNN's local-inductive-bias supervision. To this end, we propose Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills the CNN's spatial knowledge to the corresponding ViT tokens without introducing intermediate features. CSKD uses a Cumulative Knowledge Fusion (CKF) module, which introduces the CNN's global responses and gradually emphasizes their importance during training. In this way, CKF exploits the CNN's local inductive bias in the early training stage and fully exploits ViT's global capability in the later stage.
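
As a hedged illustration of the cumulative weighting, the snippet below linearly shifts the distillation target from the CNN's dense spatial responses to its global response over training; the schedule and names are assumptions, not the paper's exact CKF formulation.

```python
def cumulative_fusion_target(spatial_logits, global_logits, epoch, total_epochs):
    # spatial_logits: (B, N, num_classes) dense CNN predictions per spatial location
    # global_logits:  (B, 1, num_classes) CNN prediction after global pooling
    alpha = min(epoch / total_epochs, 1.0)        # grows from 0 to 1 over training
    return (1 - alpha) * spatial_logits + alpha * global_logits
```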


Keywords: knowledge distillation, heterogeneous network, inductive bias

Paper link: https://arxiv.org/abs/2307.08500

09

Supervised Homography Learning with Realistic Dataset Generation

Supervised homography learning with realistic training-data generation

This paper proposes an iterative framework, consisting of a generation phase and a training phase, to produce realistic training data for supervised homography learning. In the generation phase, given a set of unlabeled image pairs, a pre-estimated dominant-plane mask and the homography between each image pair are used to generate ground-truth image pairs with realistic motion. In the training phase, the generated data are refined and used to train the network via two proposed modules, CCM and QAM. The trained network then updates the pre-estimated homographies for the next round; through this iterative strategy, data quality and network performance improve together.
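
The alternation between the two phases can be sketched as below; every helper is passed in as a placeholder callable, since the actual CCM/QAM modules and data generator are defined in the paper, not here.

```python
def iterative_training(unlabeled_pairs, network, estimate_homography,
                       estimate_plane_mask, generate_pairs, train_step, num_rounds=3):
    """Sketch of the generate-then-train loop; all helpers are caller-supplied placeholders."""
    homographies = [estimate_homography(a, b) for a, b in unlabeled_pairs]   # pre-estimated H
    for _ in range(num_rounds):
        masks = [estimate_plane_mask(a, b) for a, b in unlabeled_pairs]      # dominant-plane masks
        labeled = generate_pairs(unlabeled_pairs, masks, homographies)       # generation phase
        network = train_step(network, labeled)                               # training phase (data refined by CCM/QAM)
        homographies = [network(a, b) for a, b in unlabeled_pairs]           # update H for the next round
    return network
```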


Keywords: Homography Estimation, Supervised Learning, Data Generation

Paper link: https://arxiv.org/abs/2307.15353

10

MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion

MEFLUT: Multi-exposure fusion based on unsupervised 1D lookup table

This paper introduces a new method for multi-exposure image fusion (MEF). We find that the fusion weights of an exposure can be encoded as a 1D lookup table (1D LUT), which takes a pixel intensity value as input and outputs the corresponding fusion weight. We learn an independent 1D LUT for each exposure, so all pixels of each exposure can independently query its 1D LUT for high-quality, efficient fusion. To learn these 1D LUTs, we introduce attention mechanisms along multiple dimensions of the constructed MEF network, which significantly improves fusion quality. Moreover, unlike previous methods that rarely consider practical deployment, we build the 1D LUTs from the already-trained network: at deployment time only the 1D LUTs need to be shipped rather than the whole network, so the method is free of platform constraints and can be deployed with high quality and efficiency. Additionally, we collected a new MEF dataset containing 960 samples. Extensive experiments on the collected dataset and on publicly available datasets verify the effectiveness of our method.
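
Deployment then reduces to table lookups. The sketch below shows a simplified single-channel version of LUT-based fusion (the real pipeline builds the LUTs from the trained network and operates on full images); it is an illustration, not the paper's code.

```python
import numpy as np

def fuse_with_luts(exposures, luts):
    # exposures: list of K uint8 grayscale images (H, W); luts: list of K arrays of shape (256,)
    weights = [lut[img] for img, lut in zip(exposures, luts)]      # per-pixel weight lookup
    weights = np.stack(weights).astype(np.float32)
    weights /= weights.sum(axis=0, keepdims=True) + 1e-6           # normalize across exposures
    stack = np.stack([img.astype(np.float32) for img in exposures])
    return (weights * stack).sum(axis=0).clip(0, 255).astype(np.uint8)
```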


Keywords: multiple exposure images, high dynamic range, unsupervised, fast, efficient

11

Learning Optical Flow from Event Camera with Rendered Dataset

Optical Flow Learning for Event Cameras Based on Rendered Data

This paper proposes MDR, a high-quality dataset with accurate event data and optical flow labels, built with computer-graphics rendering. It also proposes a plug-and-play adaptive adjustment module, ADM, which adjusts the input event data to its optimal density range and works with the optical flow network to produce more accurate estimates. Experiments show that the MDR dataset facilitates learning of event-based optical flow estimation: previous optical flow networks consistently improve when trained on our dataset, and mainstream optical flow pipelines equipped with our ADM module improve further.
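
As a very rough, hedged analogy for density adjustment, the snippet below subsamples an over-dense event slice toward a target count; the actual ADM is a learned module, so treat this purely as an illustration of the input-density idea.

```python
import numpy as np

def adjust_density(events, t_start, t_end, target=200_000):
    # events: structured array with fields ("t", "x", "y", "p"), sorted by timestamp "t"
    window = events[(events["t"] >= t_start) & (events["t"] < t_end)]
    if len(window) > target:
        keep = np.random.choice(len(window), target, replace=False)
        return window[np.sort(keep)]     # subsample over-dense slices
    # For under-dense slices one could widen the time window instead (omitted here).
    return window
```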


Keywords: event camera, optical flow, synthetic dataset

Paper link: https://arxiv.org/abs/2303.11011

12

GAFlow: Incorporating Gaussian Attention into Optical Flow

GAFlow: Optical Flow Estimation with Gaussian Attention Mechanism

In this paper, we propose a novel optical flow estimation method, GAFlow, which introduces Gaussian attention into the optical flow model to emphasize local features during representation learning and to enforce motion correlation during matching. Specifically, we propose a Gaussian Constrained Layer (GCL) and a Gaussian Guided Attention Module (GGAM); these Gaussian-based modules integrate naturally into existing optical flow frameworks. The GCL can be plugged into existing Transformer blocks to strengthen feature learning in local neighborhoods that contain fine-grained structural information, while the GGAM not only inherits the locality of the Gaussian distribution but also focuses on scene-dependent, dynamically learnable regions. Experiments show that GAFlow achieves better performance on both generalization tests and online benchmarks.
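
The general mechanism of Gaussian attention can be sketched as a Gaussian locality bias added to the attention logits, as below; this is a generic formulation for illustration, not the exact GCL/GGAM design.

```python
import torch

def gaussian_attention_bias(h, w, sigma=3.0, device="cpu"):
    # Bias favoring spatially nearby tokens; added to (HW, HW) attention logits.
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()   # (HW, 2)
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return -dist2 / (2 * sigma ** 2)

# Example use inside an attention block:
#   logits = q @ k.transpose(-1, -2) / dim ** 0.5 + gaussian_attention_bias(h, w)
#   attn = logits.softmax(dim=-1)   # attention now concentrates on spatial neighborhoods
```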


Keywords: optical flow, Gaussian attention

13

Explicit Motion Disentangling for Efficient Optical Flow Estimation

Efficient Optical Flow Estimation Based on Explicit Motion Decoupling

This paper proposes EMD-Flow, a new optical flow estimation framework that separates global motion learning from local optical flow estimation, so that global matching and local refinement can be handled with fewer computing resources. The network contains two new modules: Multi-scale Motion Aggregation (MMA) and Confidence-Guided Flow Propagation (CFP), which fully exploit cross-scale matching information and self-contained confidence maps to handle the uncertainty of dense matching and to generate a denser initial flow. Finally, a lightweight decoding module handles small displacements, yielding an efficient and stable optical flow estimation framework. Experiments demonstrate that EMD-Flow achieves a better balance between performance and runtime on standard optical flow datasets.
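
A hedged sketch of confidence-guided propagation: low-confidence flow vectors are replaced by a confidence-weighted average of their neighborhood. This is an assumed simplification of the general mechanism, not the paper's CFP module.

```python
import torch
import torch.nn.functional as F

def propagate_flow(flow, conf, kernel=5):
    # flow: (B, 2, H, W); conf: (B, 1, H, W) in [0, 1]
    pad = kernel // 2
    weighted = F.avg_pool2d(flow * conf, kernel, stride=1, padding=pad)
    norm = F.avg_pool2d(conf, kernel, stride=1, padding=pad) + 1e-6
    smoothed = weighted / norm                     # confidence-weighted neighborhood average
    return conf * flow + (1 - conf) * smoothed     # keep confident flow, fill uncertain flow
```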


Keywords: optical flow, motion decoupling, efficient model

14

SIRA-PCR: Sim-to-Real Adaptation for 3D Point Cloud Registration

SIRA-PCR: 3D point cloud registration based on synthetic-to-real domain adaptation

Based on the simulated indoor-scene dataset 3D-FRONT, we construct FlyingShapes, the first large-scale synthetic indoor-scene dataset for 3D point cloud registration. We also propose SIRA, a generative domain-adaptation pipeline from synthetic to real data, in which an adaptive resampling module removes low-level distribution differences between synthetic and real point clouds. With this approach, our trained model achieves state-of-the-art registration results on the indoor dataset 3DMatch and the outdoor dataset ETH, with registration recall rates of 94.1% and 99.0%, respectively.
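
For intuition about resampling, the snippet below shows a generic voxel-grid resampling that evens out point density between clouds; the paper's adaptive resampling module is learned, so this is only an analogy.

```python
import numpy as np

def voxel_resample(points, voxel_size=0.025):
    # points: (N, 3) float array; keep one averaged point per occupied voxel so the
    # synthetic cloud's point density better matches a real scanner's.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    out = np.zeros((inverse.max() + 1, 3))
    counts = np.bincount(inverse).reshape(-1, 1)
    np.add.at(out, inverse, points)                # scatter-add points into their voxels
    return out / counts                            # voxel-wise mean position
```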


Keywords: Point Cloud Registration, Domain Adaptation, Synthetic Datasets
