[3D Object Detection] 2019 CVPR Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

CVPR 2019

Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

3D object detection

  • 2D monocular images
  • autonomous driving scenarios

Proposal

  • lift 2D images to 3D representations using learned neural networks
    the 3D representations are generated with state-of-the-art GANs
  • leverage existing networks working directly on 3D data to perform 3D object detection and localization
    the 3D data is also used for ground plane estimation with recent 3D networks

Results

  • higher accuracy than many methods working on actual 3D inputs acquired from physical sensors

  • a late fusion of the outputs of networks trained on

    • generated 3D data
    • real 3D data

    further improves performance

Introduction

Two approaches have been widespread for the 3D object detection problem

Our results are of importance because:

  • (i) by using only monocular images at inference,
    the efforts directed towards collecting high-quality 3D data can still help in scenarios where explicit 3D data cannot be acquired at test time.
  • (ii) the method can be used as a plug-and-play module
    with any existing 3D method which works with BEV images, allowing operations with seamless switching between RGB and 3D scanners while leveraging the same underlying object detection platform.

This paper refers to the following methods

Related work

Object Detection in 3D

Images

The approaches for 3D object detection

Inferring 3D using RGB images

Methods

Generating 3D data from 2D

Image to image translation

Our work addresses the specific task of 3D object detection by translating RGB images to BEV

In recent years, image-to-image translation has attracted attention due to its applications in style transfer, e.g. the recent results of pix2pix ① and ②.

Although 3D object detection may not be as demanding as complete and accurate 3D scene generation, it remains a very challenging and relevant task for autonomous driving use cases. Here, 3D data is generated only as an intermediate step; unlike ① and ②, we do not focus on the quality of the generated 3D data itself, but design and evaluate our method directly on 3D object detection from monocular images.

[CVPR, 2017] Image-to-image translation with conditional adversarial networks
[NeurIPS, 2017] Toward multimodal image-to-image translation

Approach

A. Generating BEV images from 2D images (BirdGAN)

based on GANs

BirdGAN

The quality of the data used to train the GAN has a large impact on the final performance. Two ways of training the GAN to generate BEV images are proposed and tested:

  • take all the objects in the scene
  • take only the ‘well defined’ objects in the scene
    motivated by the fact that the point clouds become relatively noisy, and possibly uninformative for object detection, as the distance increases due to very small objects and occlusions

RGB: only shows the front view

the top-mounted LiDAR point cloud: covers the front, back and side views

The LiDAR point cloud is therefore cropped so that both modalities contain only the corresponding information. Distant BEV points are also removed because they are heavily occluded in the RGB image (e.g. the object marked with a red arrow in the paper's figure)

B. BirdNet 3D object detection

  • Input
    extracted from the full LiDAR point cloud
    • 3-channel BEV image consisting of height, density and intensity of the points (a BEV rasterization sketch follows below)
    • ground plane estimation for determining the height of the 3D bounding boxes
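A minimal sketch of how a LiDAR point cloud might be rasterized into such a 3-channel BEV image. The grid extents, resolution, and density normalization below are illustrative assumptions, not BirdNet's exact values.

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0, 70), y_range=(-35, 35), res=0.1):
    """Rasterize an (N, 4) point cloud [x, y, z, intensity] into a 3-channel
    BEV image: max height, point density, max intensity.
    Ranges/resolution are illustrative; assumes z has been shifted to be >= 0."""
    x, y, z, inten = points.T
    # keep only points inside the BEV grid
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z, inten = x[keep], y[keep], z[keep], inten[keep]

    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    row = ((x - x_range[0]) / res).astype(int)
    col = ((y - y_range[0]) / res).astype(int)

    bev = np.zeros((H, W, 3), dtype=np.float32)
    np.maximum.at(bev[:, :, 0], (row, col), z)       # height channel (max z per cell)
    np.add.at(bev[:, :, 1], (row, col), 1.0)         # density channel (point count per cell)
    np.maximum.at(bev[:, :, 2], (row, col), inten)   # intensity channel (max reflectance)
    bev[:, :, 1] = np.minimum(1.0, np.log1p(bev[:, :, 1]) / np.log(64))  # normalize density
    return bev
```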

Proposed pipeline with BirdNet

  • BirdGAN
    translates the 2D RGB image into a 3-channel BEV image
    the 3 BEV channels are height, density and intensity of the points

  • Image to 3D network
    like the [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision

    • input
      3 channel RGB
    • generate either the point clouds or their voxelized version as the 3D model
    • the 3D model is used to obtain the ground plane estimate
      for constructing the 3D bounding boxes around the detected objects
  • The BEV detections are then converted to 3D detections using the ground plane estimate (a schematic pipeline sketch follows below)
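A hedged, schematic sketch of the data flow described above. The callables birdgan, img_to_3d, estimate_ground_plane, and birdnet stand in for the trained networks and are assumptions, not the authors' actual interfaces; the lifting step is a toy illustration.

```python
import numpy as np

def detect_3d_from_monocular(rgb, birdgan, img_to_3d, estimate_ground_plane, birdnet):
    """Schematic monocular 3D detection pipeline (BirdNet variant).
    All callables are placeholders for trained networks / utilities."""
    bev = birdgan(rgb)                        # RGB -> 3-channel BEV (height, density, intensity)
    coarse_3d = img_to_3d(rgb)                # RGB -> point cloud / voxels (PTN-style network)
    plane = estimate_ground_plane(coarse_3d)  # fit a ground plane, e.g. with RANSAC
    bev_detections = birdnet(bev, plane)      # detections in the BEV
    # lift each BEV box to a 3D box using the estimated ground plane
    return [lift_bev_box_to_3d(det, plane) for det in bev_detections]

def lift_bev_box_to_3d(bev_box, plane, default_height=1.5):
    """Toy lifting: place the BEV box on the ground plane with an assumed object height."""
    x, y, w, l, yaw = bev_box
    a, b, c, d = plane                        # plane ax + by + cz + d = 0
    z_ground = -(a * x + b * y + d) / c       # ground height under the box center
    return dict(x=x, y=y, z=z_ground + default_height / 2.0,
                w=w, l=l, h=default_height, yaw=yaw)
```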

C. MV3D as base architecture

Proposed pipeline with MV3D

  • BirdGANs

    • input
      2D RGB image
    • translate to
      (M+2)-channel BEV images
  • Image to Depth Net

    • input
      2D RGB image
    • predicts the corresponding depth image
    • the depth map is back-projected to obtain the 3D point cloud (see the sketch after this list)
    • the point cloud is used to generate the LiDAR FV (front view) image
  • MV3D

    • input
      RGB, FV and BEV images
    • obtain 3D detections
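For the "depth map to 3D point cloud" step, a standard pinhole back-projection could look like the following. The intrinsics K are assumed to come from the KITTI calibration files; this is a generic sketch, not the paper's exact code.

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a dense depth map (H, W), in meters, into camera-frame 3D
    points using pinhole intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]   # drop pixels with no valid depth
```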

The difference between MV3D and BirdNet

  • the format of BEV

    • BirdNet

      takes a 3-channel BEV image (i.e. height, intensity, density)

    • MV3D

      pre-processes the height channel to encode more detailed height information

      • divides the point cloud into M slices

      • computes a height map for each slice

      • giving a BEV image of M+2 channels

      • using multiple independently trained BirdGANs to generate the M height channels of the BEV image works better than directly generating the (M+2)-channel BEV image (a slicing sketch follows below)
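A small sketch of the MV3D-style multi-slice height encoding described above. The z-range and M are illustrative assumptions; in the proposed pipeline each of these M channels would be produced by its own BirdGAN rather than computed from a real point cloud.

```python
import numpy as np

def multi_slice_height_maps(points, M=4, z_range=(-2.5, 1.0),
                            x_range=(0, 70), y_range=(-35, 35), res=0.1):
    """Split (N, 3) points into M vertical slices and compute a max-height map per
    slice, giving the M height channels of an (M+2)-channel MV3D-style BEV."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[keep], y[keep], z[keep]

    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    row = ((x - x_range[0]) / res).astype(int)
    col = ((y - y_range[0]) / res).astype(int)
    slice_idx = ((z - z_range[0]) / (z_range[1] - z_range[0]) * M).astype(int)
    slice_idx = np.clip(slice_idx, 0, M - 1)

    height_maps = np.full((M, H, W), z_range[0], dtype=np.float32)
    np.maximum.at(height_maps, (slice_idx, row, col), z)   # per-slice max height
    return height_maps   # stack with density and intensity channels for M+2 total
```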

D. Ground plane estimation

The ground plane estimation

  • BirdNet
    uses the ground plane,
    i.e. the bottom-most points, to estimate the height of the object for constructing the 3D bounding boxes

  • MV3D
    obtains the 3D localizations by projecting the 3D bounding boxes to the ground plane.
    The ground plane estimation is an important step here, especially for MV3D, as it governs the size of the projected objects on the BEV impacting the quality of 3D object localization.

Two ways to obtain the ground plane

Methods of this paper

  • choose the former paradigm with PTN and reconstruct the 3D object/scene

  • The ground plane is then estimated by fitting a plane using RANSAC [31] (a minimal plane-fitting sketch follows below).
    RANSAC: [ICCAS, 2014] Robust ground plane detection from 3D point clouds
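A minimal RANSAC plane-fitting sketch for the ground plane estimate, assuming the reconstructed 3D output has been converted to an (N, 3) point array. The distance threshold and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, dist_thresh=0.05, seed=None):
    """Fit a plane ax + by + cz + d = 0 to (N, 3) points with a simple RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, 0
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:                      # degenerate (collinear) sample, skip
            continue
        normal /= norm
        d = -normal.dot(p1)
        dist = np.abs(points @ normal + d)   # point-to-plane distances
        inliers = int((dist < dist_thresh).sum())
        if inliers > best_inliers:
            best_inliers = inliers
            best_plane = np.append(normal, d)
    return best_plane                        # (a, b, c, d)
```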

Experiments

Dataset

KITTI

training: 7,481 images
testing: 7,518 images

validation

Training data for BirdGAN

training BirdGAN on two types of training data

  • w/o clipping
    use the data in the field of view of the RGB image, i.e. the 90° front view

  • clipping
    In the KITTI dataset, objects that are far away are difficult to detect, mostly due to occlusion.
    Only nearby objects are used for training the GANs, i.e. BEV image data corresponding to points more than 25 meters away is removed, and these modified BEV images are used to train the GAN-based translation model (a cropping sketch follows below).
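A small sketch of the two cropping operations described above (keeping only the camera's forward field of view, and optionally clipping points beyond 25 m). The exact FOV handling in the paper may differ; this just illustrates the idea, assuming x forward and y left in the LiDAR frame.

```python
import numpy as np

def crop_for_training(points, max_dist=25.0, clip_distance=True):
    """Keep LiDAR points inside the forward ~90-degree field of view of the camera
    and optionally drop points farther than max_dist meters (the 'clipping' setting)."""
    x, y = points[:, 0], points[:, 1]
    keep = (x > 0) & (np.abs(y) <= x)                 # ~90° cone facing forward
    if clip_distance:
        keep &= np.sqrt(x**2 + y**2) <= max_dist      # remove distant, occluded points
    return points[keep]
```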

A. Quantitative results

BEV Object Detection

  • MV3D is better than BirdNet on both real and generated data
  • The networks trained with clipped data perform 10-25% better than the corresponding networks trained without clipping.
    Training on lower-noise data yields a better-quality BEV generator and hence better test performance.

3D Object Detection

Generalization Ability

Demonstrated on AVOD that the proposed method can be used as a drop-in replacement

2D Object Detection

Even with entirely generated data, the method performs close to the base networks on 2D object detection

B. Qualitative Results

  • actual BEV images for compared methods
    (first three columns)

  • generated BEV images for the proposed methods

  • The first and second columns show that the detections of our MV3D on generated BEVs are very close to the MV3D results on real BEV images.

  • The second column also shows that our BirdNet is highly sensitive to occlusion (the pole affects the detection of the car).

C. Ablation Studies

the impact of various channels within BEV on 3D object detection and localization

Analyze whether combining the generated data with real data can improve detection and localization performance

  • First, merge the ground-truth training images and the generated BEV images into a common dataset for training.
    Performance decreases.

    This may be because the network cannot optimize well over the merged dataset: for the same scene, one BEV is real and the other is generated, so they have different statistics and may confuse the detector when trained together.

  • Then train two independent networks separately and
    combine their outputs with a late fusion operation (such as concatenation or averaging)

    • pretrained on ground truth

    • pretrained on generated BEV

    • mean (averaging) operation for fusion

Performance improves overall but drops on the Hard category (possibly because images with severe occlusion yield lower-quality generated BEVs). A minimal fusion sketch follows below.
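A toy sketch of the late fusion described above: two independently trained detectors (one on real BEVs, one on generated BEVs), both assumed to output dense per-cell confidence maps, are combined by averaging or concatenation. This illustrates the idea only, not the authors' exact fusion code.

```python
import numpy as np

def late_fusion(scores_real_trained, scores_gen_trained, mode="mean"):
    """Fuse confidence maps from two detectors trained on real and generated BEVs.
    'mean' averages them; 'concat' stacks them so a small fusion head can be trained."""
    if mode == "mean":
        return 0.5 * (scores_real_trained + scores_gen_trained)
    if mode == "concat":
        return np.concatenate([scores_real_trained, scores_gen_trained], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")
```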

Conclusion

  • Using GANs to generate 3D data from 2D images can bring performance close to state-of-the-art 3D object detectors
  • Two generation mechanisms are proposed to match two recent 3D object detection architectures, together with training strategies that lead to better detection performance
  • Late fusion of networks trained on real and generated 3D data can improve their respective performance

Origin blog.csdn.net/qq_31622015/article/details/103002098