CVPR 2019
Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
3D object detection
- from 2D monocular images
- in autonomous driving scenarios
Proposal
- lift the 2D images to 3D representations using a learned neural network (state-of-the-art GANs)
- leverage existing networks working directly on 3D data to perform 3D object detection and localization
- use the generated 3D data for ground plane estimation with recent 3D networks
Results
- higher results than many methods working on actual 3D inputs acquired from physical sensors
- a late fusion of the outputs of networks trained on generated and real 3D images further improves performance
Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Introduction
Two approaches have been widespread for 3D object detection
- detect objects in 2D using monocular images and then infer 3D
- use 3D data (e.g. LiDAR) to detect bounding boxes directly in 3D
Comparing the two
- methods based on 2D monocular images significantly lag behind methods that use 3D data
  - monocular methods attempt to implicitly infer 3D information from the input
  - the availability of depth information (derived or explicit) greatly increases the performance of methods that use 3D data
- a monocular-image-based 3D object detection method would be highly practical
  - if it closes the performance gap with methods requiring explicit 3D data
  - cheaper and lighter 2D cameras vs. expensive and bulky 3D scanners
Our results are of importance because
- (i) only monocular images are used at inference: efforts directed towards collecting high-quality 3D data can help in scenarios where explicit 3D data cannot be acquired at test time
- (ii) the method can be used as a plug-and-play module with any existing 3D method that works with BEV images, allowing seamless switching between RGB cameras and 3D scanners while leveraging the same underlying object detection platform
This paper builds on the following methods
- 3D reconstruction from single images
- depth estimation
Related work
Object Detection in 3D
Input modalities
- 3D data
- monocular images
Approaches for 3D object detection
- propose new neural network architectures
- propose novel object representations
- utilize other modalities along with 3D, e.g.
  - corresponding 2D images: MV3D [CVPR, 2017] Multi-view 3d object detection network for autonomous driving
  - structure from motion: [CVPR, 2016] A continuous occlusion model for road scene understanding
- follow the success of 2D object detection methods, generating 3D proposals and classifying them
- take multi-view projections of the 3D data for use with 2D image networks, followed by fusion mechanisms: [ICCV, 2015] Multi-view convolutional neural networks for 3d shape recognition
Inferring 3D using RGB images
Methods
- predict 2D keypoint heat maps and recover 3D object structure: [ECCV, 2016] Single image 3d interpreter network
- use a single RGB image to obtain detailed 3D structure, using MRFs on small homogeneous patches to predict plane parameters that encode the 3D locations and orientations of the patches: [TPAMI, 2009] Make3d: Learning 3d scene structure from a single still image
- learn to predict 3D human pose from a single image using a fine discretization of the 3D space around the subject, predicting per-voxel likelihoods for each joint with a coarse-to-fine scheme: [CVPR, 2017] Coarse-to-fine volumetric prediction for single-image 3d human pose
Generating 3D data from 2D
- use Generative Adversarial Networks (GANs) to generate 3D objects with volumetric networks, extending the vanilla GAN and VAE-GAN to 3D
- [3DV, 2017] 3d shape induction from 2d views of multiple objects: proposes PrGAN (projective generative adversarial networks) for obtaining 3D structures from multiple 2D views
- [CVPR, 2017] Transformation-grounded image generation network for novel 3d view synthesis: synthesizes novel views from a single image by inferring geometric information followed by image completion, using a combination of adversarial and perceptual losses
- [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision: proposes Perspective Transformer Nets (PTNs), an encoder-decoder network with a novel projection loss based on perspective transformation, for learning from 2D observations without explicit 3D supervision
- [AAAI, 2018] Learning adversarial 3d model generation with 2d image enhancer: generates 3D models with an enhancer neural network that extracts information from other corresponding domains (e.g. images)
- [ICCV, 2017] 3d object reconstruction from a single depth view with adversarial learning: uses a GAN to generate 3D objects from a single depth image, combining autoencoders and a conditional GAN
- [arXiv, 2017] Improved adversarial systems for 3d object generation and reconstruction: uses a GAN to generate 3D from 2D images and performs shape completion from occluded 2.5D views, using a Wasserstein objective
Image to image translation
Our work addresses the specific task of 3D object detection by translating RGB images to BEV
In recent years, image-to-image translation has attracted attention for applications such as style transfer, e.g. the recent pix2pix ① or ②.
Although the task here may not be as challenging as complete and accurate 3D scene generation, 3D object detection remains a very challenging and relevant task for autonomous driving use cases. 3D data is generated as an intermediate step, but unlike ① and ②, the focus is not on the quality of the generated 3D data; the method is designed and evaluated for 3D object detection directly from monocular images.
① [CVPR, 2017] Image-to-image translation with conditional adversarial networks
② [NeurIPS, 2017] Toward multimodal image-to-image translation
Approach
A. Generating BEV images from 2D images (BirdGAN)
BirdGAN is a GAN-based translation network
- Encoder: VGG-16 encodes the input RGB image
- Decoder: a DCGAN-style generator, conditioned on the encoded vector, produces the BEV image
  - DCGAN: [arXiv, 2015] Unsupervised representation learning with deep convolutional generative adversarial networks
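The notes above only name the generator components; as a rough illustration, here is a minimal PyTorch sketch of what such a conditional generator could look like, assuming a VGG-16 feature encoder and a DCGAN-style transposed-convolution decoder. The layer sizes, 64x64 output resolution, and the name BirdGANGenerator are illustrative assumptions, not the paper's exact architecture; a full setup would also need a BEV discriminator and adversarial/reconstruction losses.

```python
import torch.nn as nn
from torchvision.models import vgg16


class BirdGANGenerator(nn.Module):
    """Sketch: encode the RGB image with VGG-16, then decode the code into a
    BEV image with a DCGAN-style stack of transposed convolutions."""

    def __init__(self, bev_channels=3, latent_dim=512):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = vgg16().features          # VGG-16 conv backbone (load ImageNet weights in practice)
        self.pool = nn.AdaptiveAvgPool2d(1)      # (B, 512, h, w) -> (B, 512, 1, 1)
        self.fc = nn.Linear(512, latent_dim * 4 * 4)
        self.decoder = nn.Sequential(            # 4x4 -> 64x64, DCGAN-style upsampling
            nn.ConvTranspose2d(latent_dim, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, bev_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, rgb):                                  # rgb: (B, 3, H, W)
        z = self.pool(self.encoder(rgb)).flatten(1)          # image -> 512-d code
        z = self.fc(z).view(-1, self.latent_dim, 4, 4)       # reshape to a spatial seed
        return self.decoder(z)                               # conditional BEV output
```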
The quality of the data used to train the GAN has a large impact on the final performance. Two ways of training the GAN to generate BEV images are proposed and tested:
- take all the objects in the scene
- take only the 'well defined' objects in the scene
  - motivated by the fact that point clouds become relatively noisy, and possibly uninformative for object detection, as distance increases, due to very small objects and occlusions
The RGB image only shows the front view, while the top-mounted LiDAR point cloud covers the front, back, and side views. The LiDAR point cloud is therefore cropped so that only the information common to both modalities remains. Distant BEV points are also removed because the corresponding objects are heavily occluded in the RGB image. A sketch of this preprocessing follows.
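A minimal numpy sketch of this preprocessing, assuming KITTI-style LiDAR coordinates (x forward, y left, z up), a roughly 90° front-facing camera field of view, and the 25 m cutoff used in the clipping experiments; the function name and thresholds are illustrative.

```python
import numpy as np


def crop_to_front_view(points, max_dist=25.0, fov_deg=90.0):
    """Keep only LiDAR points that are visible in the front-facing RGB image
    and closer than max_dist (KITTI convention: x forward, y left, z up)."""
    x, y = points[:, 0], points[:, 1]
    in_front = x > 0.0                                  # discard everything behind the camera
    in_fov = np.abs(np.arctan2(y, x)) < np.deg2rad(fov_deg / 2.0)
    near = np.sqrt(x ** 2 + y ** 2) < max_dist          # drop distant, heavily occluded points
    return points[in_front & in_fov & near]
```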
B. BirdNet 3D object detection ([ITSC, 2018] Birdnet: A 3d object detection framework from lidar information)
Inputs of the original BirdNet
- a 3-channel BEV image (height, density, intensity of the points) extracted from the full LiDAR point cloud; a rasterisation sketch follows this list
- a ground plane estimate for determining the height of the 3D bounding boxes
Proposed pipeline with BirdNet
- BirdGAN
  - translates the 2D RGB image into a 3-channel BEV image (height, density, intensity)
- Image-to-3D network
  - like [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision
  - input: 3-channel RGB image
  - generates either a point cloud or its voxelized version as the 3D model
  - the 3D model is used to obtain the ground plane estimate for constructing the 3D bounding boxes around the detected objects
- BirdNet
  - input: the generated BEV image and the ground plane estimate
  - the BEV detections are then converted to 3D detections using the ground plane estimate
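For reference, a minimal numpy sketch of how such a 3-channel BEV image could be rasterised from a LiDAR point cloud (the target format that BirdGAN learns to produce); the grid extents, 0.1 m resolution, height offset, and log-based density normalisation are illustrative assumptions, not BirdNet's exact encoding.

```python
import numpy as np


def make_bev(points, x_range=(0.0, 70.0), y_range=(-35.0, 35.0), res=0.1, z_min=-2.5):
    """Rasterise (x, y, z, intensity) points into a 3-channel BEV image:
    channel 0 = max height (shifted by z_min), 1 = point density, 2 = intensity."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((3, h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    xi, yi = xi[ok], yi[ok]
    z, refl = points[ok, 2], points[ok, 3]
    np.maximum.at(bev[0], (xi, yi), z - z_min)          # height channel (non-negative)
    np.add.at(bev[1], (xi, yi), 1.0)                    # raw point counts per cell
    np.maximum.at(bev[2], (xi, yi), refl)               # reflectance/intensity channel
    bev[1] = np.minimum(1.0, np.log1p(bev[1]) / np.log(64.0))  # normalised density
    return bev
```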
C. MV3D as base architecture
Proposed pipeline with MV3D
- BirdGANs
  - input: 2D RGB image
  - translate it into an (M+2)-channel BEV image
- Image-to-Depth network
  - input: 2D RGB image
  - obtains the corresponding depth image
  - the depth map is back-projected into a 3D point cloud (see the sketch after this list)
  - the point cloud is used to generate the LiDAR FV (front view) image
- MV3D
  - input: RGB, FV, and BEV images
  - outputs the 3D detections
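A minimal numpy sketch of the depth-to-point-cloud step, assuming a standard pinhole camera model with intrinsics (fx, fy, cx, cy); the resulting camera-frame points would then be rendered into the LiDAR-style front-view image.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map of shape (H, W) into camera-frame 3D points
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                             # keep pixels with valid depth
```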
The difference between MV3D and BirdNet lies in the format of the BEV
- BirdNet
  - takes a 3-channel BEV image (height, intensity, density)
- MV3D
  - pre-processes the height channel to encode more detailed height information
  - divides the point cloud into M slices
  - computes a height map for each slice
  - giving a BEV image of M+2 channels (see the sketch below)
- using multiple independently trained BirdGANs to generate the M height channels of the BEV image works better than directly generating the (M+2)-channel BEV image
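A minimal numpy sketch of this (M+2)-channel encoding, assuming M equal-height slices between fixed z_min and z_max; the slice boundaries, grid parameters, and density normalisation are illustrative assumptions rather than MV3D's exact settings.

```python
import numpy as np


def make_mv3d_bev(points, M=4, z_min=-2.4, z_max=1.0,
                  x_range=(0.0, 70.0), y_range=(-35.0, 35.0), res=0.1):
    """(M+2)-channel BEV: one max-height map per z-slice, plus density and intensity."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((M + 2, h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    xi, yi = xi[ok], yi[ok]
    z, refl = points[ok, 2], points[ok, 3]
    sl = np.clip(((z - z_min) / (z_max - z_min) * M).astype(int), 0, M - 1)
    for m in range(M):                                   # per-slice max-height maps
        sel = sl == m
        np.maximum.at(bev[m], (xi[sel], yi[sel]), z[sel] - z_min)
    np.add.at(bev[M], (xi, yi), 1.0)                     # density channel
    bev[M] = np.minimum(1.0, np.log1p(bev[M]) / np.log(64.0))
    np.maximum.at(bev[M + 1], (xi, yi), refl)            # intensity channel
    return bev
```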
D. Ground plane estimation
The ground plane estimate is used differently by the two base architectures
- BirdNet
  - uses the ground plane, i.e. the bottom-most points, to estimate the height of the object for constructing the 3D bounding boxes
- MV3D
  - obtains the 3D localizations by projecting the 3D bounding boxes onto the ground plane
Ground plane estimation is an important step here, especially for MV3D, as it governs the size of the projected objects on the BEV, impacting the quality of 3D object localization.
Two ways to obtain the ground plane
- by reconstructing a 3D model from a single RGB image, e.g. with
  - Perspective Transformer Networks: [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision
  - point set generation
  - depth estimation: [CVPR, 2017] Unsupervised monocular depth estimation with left-right consistency
- by using the image to directly estimate the ground plane without transforming the image into 3D
  - requires the explicit presence of strong 2D object proposals or texture/color patterns
Method of this paper
- choose the former paradigm: reconstruct the 3D object/scene with PTN
- the ground plane is then estimated by fitting a plane using RANSAC [31] (a minimal sketch is given below)
- RANSAC [31]: [ICCAS, 2014] Robust ground plane detection from 3d point clouds
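A minimal numpy RANSAC sketch for fitting a plane to the reconstructed points; the inlier threshold and iteration count are illustrative assumptions, not the exact settings of [31].

```python
import numpy as np


def ransac_ground_plane(points, n_iters=200, thresh=0.05, seed=0):
    """Fit a plane n·p + d = 0 to an (N, 3) point set with RANSAC;
    returns (normal, d) of the model with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)                   # candidate plane normal
        if np.linalg.norm(n) < 1e-6:
            continue                                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p1)
        inliers = np.sum(np.abs(points @ n + d) < thresh)
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (n, d)
    return best_model
```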
Experiments
Dataset
KITTI
- training: 7,481 images
- testing: 7,518 images
- validation
  - following [NeurIPS, 2015] 3d object proposals for accurate object class detection, the KITTI training set is split into train and validation sets, each containing half of the images
  - the training and validation sets do not come from the same video sequences; the bounding box proposals are evaluated on the validation set
Training data for BirdGAN
BirdGAN is trained on two types of training data
- w/o clipping
  - use all the data in the field of view of the RGB image, i.e. 90° in the front view
- clipping
  - in the KITTI dataset, far-away objects are difficult to detect, mostly due to occlusion
  - only nearby objects are used for training the GANs, i.e. the BEV data corresponding to points more than 25 meters away is removed, and these modified BEV images are used to train the GAN-based translation model
A. Quantitative results
BEV Object Detection
- MV3D is better than BirdNet on both real and generated data
- networks trained with clipped data perform 10-25% higher than the corresponding networks trained without clipping: the lower-noise training data yields a better quality BEV generator and thus better test performance
3D Object Detection
Generalization Ability
Demonstrated on AVOD that the proposed method can be used as a drop-in replacement
2D Object Detection
Even with entirely generated data, the method performs close to the base networks for 2D object detection
B. Qualitative Results
- actual BEV images for the compared methods (first three columns) vs. generated BEV images for the proposed methods
- first and second columns: the detections of our MV3D using generated BEVs are very close to the MV3D results on real BEV images
- the second column shows that our BirdNet is highly sensitive to occlusion (the pole affects the detection of the car)
C. Ablation Studies
Studied
- the impact of the various channels within the BEV on 3D object detection and localization
- whether combining generated and real data can improve detection and localization performance
  - first attempt: merge the ground truth training images and the generated BEV images into a common data set for training. Performance drops, possibly because the network cannot optimize over the merged data set: for the same scene, one image is real and the other is generated, so they have different statistics and may confuse the detector when trained together
  - second attempt: train two independent networks, one on ground truth BEVs and one on generated BEVs, and combine their outputs with a fusion operation (concatenation or averaging); a sketch of such a fusion follows this list
    - with the mean operation, performance improves, although it drops on the hard cases (possibly because they contain severe occlusion, where the generated BEVs hurt performance)
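A minimal sketch of such a fusion, under the simplifying assumption that the two detectors (one trained on real BEVs, one on generated BEVs) produce spatially aligned output tensors of the same shape; the function and argument names are illustrative.

```python
import numpy as np


def fuse_outputs(out_real, out_gen, mode="mean"):
    """Late-fuse per-anchor outputs of the ground-truth-trained and the
    generated-BEV-trained networks before the final detection stage."""
    assert out_real.shape == out_gen.shape
    if mode == "mean":                                   # element-wise averaging
        return 0.5 * (out_real + out_gen)
    if mode == "concat":                                 # channel-wise concatenation
        return np.concatenate([out_real, out_gen], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")
```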
Conclusion
- Using GANs to generate 3D data from 2D images can bring performance close to state-of-the-art 3D object detectors
- Two generation mechanisms, matched to two recent 3D object detection architectures, are proposed along with training strategies that yield better detection performance
- Late fusion of networks trained on real and generated 3D data can improve their respective performance