【3D Object Detection】2019 CVPR Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

CVPR 2019

Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

3D object detection

  • 2D monocular images
  • autonomous driving scenarios

Proposal

  • lift the 2D images to 3D representations using a learned neural network:
    the 3D representations are generated with state-of-the-art GANs
  • leverage existing networks working directly on 3D data to perform 3D object detection and localization:
    the 3D data is also used for ground plane estimation via recent 3D networks

Results

  • achieves higher accuracy than many methods working on actual 3D inputs acquired from physical sensors

  • a late fusion of the outputs of networks trained on

    • generated 3D images
    • real 3D images

    improves performance

Introduction

Two approaches have been widely used for the 3D object detection problem

Our results are important because

  • (i) only monocular images are used at inference:
    the efforts directed towards collecting high-quality 3D data can help in scenarios where explicit 3D data cannot be acquired at test time
  • (ii) the method can be used as a plug-and-play module:
    it works with any existing 3D method that takes BEV images, allowing seamless switching between RGB cameras and 3D scanners while leveraging the same underlying object detection platform

The paper builds on the following lines of related work

Related work

Object Detection in 3D

Images

The approaches for 3D object detection

Inferring 3D using RGB images

Methods

Generating 3D data from 2D

Image to image translation

Our work addresses the specific task of 3D object detection by translating RGB images to BEV


In recent years, image-to-image translation has drawn attention for its applications in style transfer, such as the recent results of pix2pix ① or ②.

While 3D object detection may be less challenging than generating a complete and accurate 3D scene, it remains a very challenging and relevant task for the autonomous driving use case. Here, 3D data is generated as an intermediate step, but unlike ① and ②, the focus is not on the quality of the generated 3D data; instead, the method is designed and evaluated directly from monocular images.

① [CVPR, 2017] Image-to-image translation with conditional adversarial networks
② [NeurIPS, 2017] Toward multimodal image-to-image translation

Approach

A. Generating BEV images from 2D images (BirdGAN)

based on GAN image-to-image translation

BirdGAN

The quality of the data used to train the GAN has a large impact on final performance; two ways of training the GAN to generate BEV images are proposed and evaluated:

  • take all the objects in the scene
  • take only the ‘well defined’ objects in the scene
    motivated by the fact that point clouds become relatively noisy, and possibly uninformative for object detection, as distance increases, due to very small objects and occlusions

RGB: shows only the front view

the top-mounted LiDAR point cloud: covers the front, back, and side views

The LiDAR point cloud is therefore clipped appropriately so that only the information shared by the two modalities remains; far-away BEV points are also removed, since they are heavily occluded in the RGB image (e.g. the objects marked by the red arrows).
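
A minimal numpy sketch of this clipping step, assuming the 90° front field of view and the 25 m range cutoff mentioned later in the Experiments section (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def clip_point_cloud(points, fov_deg=90.0, max_range=25.0):
    """Keep only LiDAR points visible in the RGB front view.

    points: (N, 4) array of [x, y, z, intensity], with x pointing
    forward and y to the left (KITTI LiDAR convention).
    """
    x, y = points[:, 0], points[:, 1]
    # Front field of view: in front of the car and |azimuth| <= fov/2.
    azimuth = np.degrees(np.arctan2(y, x))
    in_fov = (x > 0) & (np.abs(azimuth) <= fov_deg / 2.0)
    # Drop far-away points, which are heavily occluded in the RGB image.
    in_range = np.hypot(x, y) <= max_range
    return points[in_fov & in_range]
```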

B. BirdNet 3D object detection

  • Input
    extracted from the full LiDAR point cloud
    • 3-channel BEV image consisting of the height, density, and intensity of the points
    • ground plane estimation for determining the height of the 3D bounding boxes
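
A rough sketch of how such a 3-channel BEV image can be rasterized from a point cloud (grid extents, cell size, and the density normalization are illustrative assumptions, not BirdNet's exact encoding):

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0, 50), y_range=(-25, 25), cell=0.1):
    """Rasterize [x, y, z, intensity] points into a height/density/intensity BEV."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    height = np.full((nx, ny), -np.inf, dtype=np.float32)
    density = np.zeros((nx, ny), dtype=np.float32)
    intensity = np.zeros((nx, ny), dtype=np.float32)

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    # Per-cell max height, point count, and summed reflectance intensity.
    np.maximum.at(height, (ix[ok], iy[ok]), points[ok, 2])
    np.add.at(density, (ix[ok], iy[ok]), 1.0)
    np.add.at(intensity, (ix[ok], iy[ok]), points[ok, 3])

    occupied = density > 0
    height[~occupied] = 0.0                    # empty cells get height 0
    intensity[occupied] /= density[occupied]   # mean intensity per cell
    density = np.minimum(1.0, np.log1p(density) / np.log(64))  # normalize count
    return np.stack([height, density, intensity])  # (3, nx, ny) BEV image
```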

Proposed pipeline with BirdNet

  • BirdGAN
    translates the 2D RGB images into 3-channel BEV images
    the 3 BEV channels are the height, density, and intensity of the points

  • Image to 3D network
    like [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision (PTN)

    • input
      3-channel RGB image
    • generates either a point cloud or its voxelized version as the 3D model
    • the 3D model is used to obtain the ground plane estimate
      for constructing the 3D bounding boxes around the detected objects
  • The BEV detections are then converted to 3D detections using the ground plane estimation
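
As a toy illustration of this conversion (the box parameterization and names are assumptions, not the paper's code): the bottom of the BEV box is anchored on the estimated ground plane and extruded by the object height.

```python
def bev_to_3d_box(bev_box, ground_z, obj_height):
    """Lift a BEV detection to 3D using the ground plane estimate.

    bev_box: (cx, cy, length, width, yaw) in ground-plane coordinates.
    ground_z: estimated ground height below the object.
    obj_height: estimated (or class-average) object height.
    Returns (cx, cy, cz, length, width, height, yaw).
    """
    cx, cy, length, width, yaw = bev_box
    cz = ground_z + obj_height / 2.0  # box center sits half a height above ground
    return (cx, cy, cz, length, width, obj_height, yaw)
```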

C. MV3D as base architecture

Proposed pipeline with MV3D

  • BirdGANs

    • input
      2D RGB image
    • translate to
      (M+2)-channel BEV images
  • Image to Depth Net

    • input
      2D RGB image
    • outputs the corresponding depth image
    • the depth map is back-projected to obtain the 3D point cloud
    • the point cloud is used to generate the LiDAR FV (front view) image
  • MV3D

    • input
      RGB , FV , BEV images
      obtain 3D detections
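
A minimal sketch of the depth-map-to-point-cloud step via pinhole back-projection (the intrinsics fx, fy, cx, cy would come from the KITTI calibration; names are illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) in meters to an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    u, v = u.reshape(-1), v.reshape(-1)
    valid = z > 0  # skip pixels with no depth estimate
    x = (u[valid] - cx) * z[valid] / fx
    y = (v[valid] - cy) * z[valid] / fy
    return np.stack([x, y, z[valid]], axis=1)  # camera coordinates
```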

The difference between MV3D and BirdNet

  • the format of BEV

    • BirdNet

      takes a 3-channel BEV image (i.e. height, intensity, density)

    • MV3D

      pre-processes the height channel to encode more detailed height information

      • divides the point cloud into M slices

      • computes a height map for each slice

      • giving a BEV image of M+2 channels

      • using multiple independently trained BirdGANs to generate the M height channels of the BEV image works better than directly generating the (M+2)-channel BEV image
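
A sketch of this M-slice height encoding (grid parameters and the height range are illustrative; the per-slice max-height idea follows the MV3D-style encoding described above):

```python
import numpy as np

def sliced_height_maps(points, m_slices=4, z_range=(-2.0, 1.0),
                       x_range=(0, 50), y_range=(-25, 25), cell=0.1):
    """Compute M per-slice max-height maps for an MV3D-style BEV."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    maps = np.zeros((m_slices, nx, ny), dtype=np.float32)

    z0, z1 = z_range
    slice_h = (z1 - z0) / m_slices
    for s in range(m_slices):
        lo, hi = z0 + s * slice_h, z0 + (s + 1) * slice_h
        sel = points[(points[:, 2] >= lo) & (points[:, 2] < hi)]
        ix = ((sel[:, 0] - x_range[0]) / cell).astype(int)
        iy = ((sel[:, 1] - y_range[0]) / cell).astype(int)
        ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
        # max height within the slice, per BEV cell
        np.maximum.at(maps[s], (ix[ok], iy[ok]), sel[ok, 2] - lo)
    return maps  # stack with density and intensity for the (M+2)-channel BEV
```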

D. Ground plane estimation

The ground plane estimation

  • BirdNet
    uses the ground plane,
    i.e. the bottom-most points, to estimate the height of the object for constructing the 3D bounding boxes

  • MV3D
    obtains the 3D localizations by projecting the 3D bounding boxes to the ground plane.
    The ground plane estimation is an important step here, especially for MV3D, as it governs the size of the projected objects on the BEV impacting the quality of 3D object localization.

Two ways to obtain the ground plane

Methods of this paper

  • chooses the former paradigm: reconstruct the 3D object/scene with PTN

  • The ground plane is then estimated by fitting a plane using RANSAC [31].
    RANSAC: [ICCAS, 2014] Robust ground plane detection from 3d point clouds
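
A compact numpy RANSAC plane fit of the kind referred to above (the threshold and iteration count are illustrative choices, not the paper's settings):

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, threshold=0.15, rng=None):
    """Fit a plane (n, d) with n·p + d = 0 to 3D points via RANSAC."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        # Sample 3 points and compute the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:  # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n.dot(p0)
        # Count points within the distance threshold of the plane.
        inliers = np.sum(np.abs(points @ n + d) < threshold)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane
```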

Experiments

Dataset

KITTI

training: 7,481 images
testing: 7,518 images

validation

Training data for BirdGAN

training BirdGAN on two types of training data

  • w/o clipping
    use the data in the field of view of the RGB images, i.e. 90° in the front view

  • clipping
    In the KITTI dataset, far-away objects are difficult to detect, mostly due to occlusion.
    Only the nearby objects are used for training the GANs, i.e. the BEV image data corresponding to points more than 25 meters away is removed, and these modified BEV images are used to train the GAN-based translation model.

A. Quantitative results

BEV Object Detection

  • MV3D outperforms BirdNet with both real and generated data
  • the clipped-data variants improve performance by 10-25% over the corresponding networks trained without clipping
    training on lower-noise data improves test-time performance by learning a better-quality BEV generator

3D Object Detection

Generalization Ability

Experiments on AVOD demonstrate that the proposed method can be used as a drop-in replacement.

2D Object Detection

It can be observed that even with entirely generated data, the method performs close to the base networks on 2D object detection.

B. Qualitative Results

  • actual BEV images for compared methods
    (first three columns)

  • generated BEV images for the proposed methods

  • the first and second columns show that the detections of "ours MV3D" using generated BEVs are very close to the results of MV3D on real BEV images

  • the second column shows that "ours BirdNet" is highly sensitive to occlusion (a pole interferes with the detection of the car)

C. Ablation Studies

the impact of various channels within BEV on 3D object detection and localization

Analyze whether the generated data, combined with real data, can improve detection and localization performance

  • First, the ground-truth training images and the generated BEV images were merged into a common dataset for training
    performance dropped

    This is likely because the network fails to optimize over the merged dataset: for the same scene, one image is real and the other is generated, and their differing statistics can confuse the detector when trained together.

  • Instead, two separate networks are trained
    and their outputs are combined with a fusion operation (e.g. concatenation or averaging)

    • one pretrained on ground truth BEVs

    • one pretrained on generated BEVs

    • fused with an averaging operation

Performance improves, except for a drop on the hard category (possibly because those images contain severe occlusions, where the generated BEVs degrade performance).
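
A toy sketch of this late fusion by averaging (how detections from the two networks are matched is an assumption here; the text only specifies the fusion operation):

```python
import numpy as np

def late_fuse(dets_real, dets_gen):
    """Average matched detections from the real-BEV and generated-BEV networks.

    Each detection is (box, score), with box a length-7 numpy array
    (cx, cy, cz, l, w, h, yaw). Detections are assumed matched by index.
    """
    return [((box_r + box_g) / 2.0, (score_r + score_g) / 2.0)
            for (box_r, score_r), (box_g, score_g) in zip(dets_real, dets_gen)]
```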

Conclusion

  • Generating 3D data from 2D images with GANs can achieve performance close to state-of-the-art 3D object detectors
  • Two generation mechanisms are proposed to pair with two different state-of-the-art 3D object detection architectures, along with training strategies that yield good detection performance
  • A late fusion of networks trained separately on real and generated 3D data improves over each of them individually


Reposted from blog.csdn.net/qq_31622015/article/details/103002098