CVPR 2019
Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
3D object detection
- from 2D monocular images
- in autonomous driving scenarios
Proposal
- lift the 2D images to 3D representations using a learned neural network (state-of-the-art GANs)
- leverage existing networks working directly on 3D data to perform 3D object detection and localization
- use the generated 3D data for ground plane estimation with recent 3D networks
Results
- higher results than many methods working on actual 3D inputs acquired from physical sensors
- a late fusion of the outputs of networks trained on generated and real 3D images further improves performance
Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Introduction
Two approaches have been widespread for 3D object detection
- detect objects in 2D using monocular images and then infer 3D
- use 3D data (e.g. LiDAR) to detect bounding boxes directly in 3D
Comparing the two
- methods based on 2D monocular images significantly lag behind methods that use 3D data
  - monocular methods attempt to implicitly infer 3D information from the input
  - the availability of depth information (derived or explicit) greatly increases the performance of methods that use 3D data
- a monocular-image-based 3D object detection method would be highly practical
  - if it closes the performance gap with methods requiring explicit 3D data
  - cheaper and lighter 2D cameras vs. expensive and bulky 3D scanners
Our results are of importance because
- (i) only monocular images are used at inference: efforts directed towards collecting high-quality 3D data can help in scenarios where explicit 3D data cannot be acquired at test time
- (ii) the method can be used as a plug-and-play module with any existing 3D method that works with BEV images, allowing seamless switching between RGB cameras and 3D scanners while leveraging the same underlying object detection platform
This paper builds on the following methods
- 3D reconstruction from single images
- depth estimation
Related work
Object Detection in 3D
Input modalities
- 3D data
- monocular images
Approaches for 3D object detection
- propose new neural network architectures
- propose novel object representations
- utilize other modalities along with 3D, e.g.
  - corresponding 2D images: MV3D [CVPR, 2017] Multi-view 3d object detection network for autonomous driving
  - structure from motion: [CVPR, 2016] A continuous occlusion model for road scene understanding
- follow the success of 2D object detection methods, generating 3D proposals and classifying them
- take multi-view projections of the 3D data for use with 2D image networks, followed by fusion mechanisms: [ICCV, 2015] Multi-view convolutional neural networks for 3d shape recognition
Inferring 3D using RGB images
Methods
- predict 2D keypoint heat maps and recover 3D object structure: [ECCV, 2016] Single image 3d interpreter network
- use a single RGB image to obtain detailed 3D structure, using MRFs on small homogeneous patches to predict plane parameters that encode the 3D locations and orientations of the patches: [TPAMI, 2009] Make3d: Learning 3d scene structure from a single still image
- learn to predict 3D human pose from a single image using a fine discretization of the 3D space around the subject, predicting per-voxel likelihoods for each joint with a coarse-to-fine scheme: [CVPR, 2017] Coarse-to-fine volumetric prediction for single-image 3d human pose
Generating 3D data from 2D
- use Generative Adversarial Networks (GANs) to generate 3D objects with volumetric networks, extending the vanilla GAN and VAE-GAN to 3D
- [3DV, 2017] 3d shape induction from 2d views of multiple objects: proposes PrGAN (projective generative adversarial networks) for obtaining 3D structures from multiple 2D views
- [CVPR, 2017] Transformation-grounded image generation network for novel 3d view synthesis: synthesizes novel views from a single image by inferring geometric information followed by image completion, using a combination of adversarial and perceptual losses
- [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision: proposes Perspective Transformer Nets (PTNs), an encoder-decoder network with a novel projection loss based on perspective transformation, for learning from 2D observations without explicit 3D supervision
- [AAAI, 2018] Learning adversarial 3d model generation with 2d image enhancer: generates 3D models with an enhancer neural network that extracts information from other corresponding domains (e.g. images)
- [ICCV, 2017] 3d object reconstruction from a single depth view with adversarial learning: uses a GAN to generate 3D objects from a single depth image, combining autoencoders and a conditional GAN
- [arXiv, 2017] Improved adversarial systems for 3d object generation and reconstruction: uses a GAN to generate 3D from 2D images and performs shape completion from occluded 2.5D views, using a Wasserstein objective
Image to image translation
Our work addresses the specific task of 3D object detection by translating RGB images to BEV
In recent years, image-to-image translation has attracted attention for applications such as style transfer, e.g. the recent pix2pix ① or ②.
Although the task here may not be as challenging as complete and accurate 3D scene generation, 3D object detection remains a very challenging and relevant task for autonomous driving use cases. 3D data is generated as an intermediate step, but unlike ① and ②, the focus is not on the quality of the generated 3D data; the method is designed and evaluated for 3D object detection directly from monocular images.
① [CVPR, 2017] Image-to-image translation with conditional adversarial networks
② [NeurIPS, 2017] Toward multimodal image-to-image translation
Approach
A. Generating BEV images from 2D images (BirdGAN)
BirdGAN is a GAN-based translation network
- Encoder: VGG-16 encodes the input RGB image
- Decoder: a DCGAN-style generator, conditioned on the encoded vector, produces the BEV image
  - DCGAN: [arXiv, 2015] Unsupervised representation learning with deep convolutional generative adversarial networks
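The notes above only name the generator components; as a rough illustration, here is a minimal PyTorch sketch of what such a conditional generator could look like, assuming a VGG-16 feature encoder and a DCGAN-style transposed-convolution decoder. The layer sizes, 64x64 output resolution, and the name BirdGANGenerator are illustrative assumptions, not the paper's exact architecture; a full setup would also need a BEV discriminator and adversarial/reconstruction losses.

```python
import torch.nn as nn
from torchvision.models import vgg16


class BirdGANGenerator(nn.Module):
    """Sketch: encode the RGB image with VGG-16, then decode the code into a
    BEV image with a DCGAN-style stack of transposed convolutions."""

    def __init__(self, bev_channels=3, latent_dim=512):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = vgg16().features          # VGG-16 conv backbone (load ImageNet weights in practice)
        self.pool = nn.AdaptiveAvgPool2d(1)      # (B, 512, h, w) -> (B, 512, 1, 1)
        self.fc = nn.Linear(512, latent_dim * 4 * 4)
        self.decoder = nn.Sequential(            # 4x4 -> 64x64, DCGAN-style upsampling
            nn.ConvTranspose2d(latent_dim, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, bev_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, rgb):                                  # rgb: (B, 3, H, W)
        z = self.pool(self.encoder(rgb)).flatten(1)          # image -> 512-d code
        z = self.fc(z).view(-1, self.latent_dim, 4, 4)       # reshape to a spatial seed
        return self.decoder(z)                               # conditional BEV output
```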
The quality of the data used to train the GAN has a large impact on the final performance. Two ways of training the GAN to generate BEV images are proposed and tested:
- take all the objects in the scene
- take only the 'well defined' objects in the scene
  - motivated by the fact that point clouds become relatively noisy, and possibly uninformative for object detection, as distance increases, due to very small objects and occlusions
The RGB image only shows the front view, while the top-mounted LiDAR point cloud covers the front, back, and side views. The LiDAR point cloud is therefore cropped so that only the information common to both modalities remains. Distant BEV points are also removed because the corresponding objects are heavily occluded in the RGB image. A sketch of this preprocessing follows.
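A minimal numpy sketch of this preprocessing, assuming KITTI-style LiDAR coordinates (x forward, y left, z up), a roughly 90° front-facing camera field of view, and the 25 m cutoff used in the clipping experiments; the function name and thresholds are illustrative.

```python
import numpy as np


def crop_to_front_view(points, max_dist=25.0, fov_deg=90.0):
    """Keep only LiDAR points that are visible in the front-facing RGB image
    and closer than max_dist (KITTI convention: x forward, y left, z up)."""
    x, y = points[:, 0], points[:, 1]
    in_front = x > 0.0                                  # discard everything behind the camera
    in_fov = np.abs(np.arctan2(y, x)) < np.deg2rad(fov_deg / 2.0)
    near = np.sqrt(x ** 2 + y ** 2) < max_dist          # drop distant, heavily occluded points
    return points[in_front & in_fov & near]
```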
B. BirdNet 3D object detection ([ITSC, 2018] Birdnet: A 3d object detection framework from lidar information)
Inputs of the original BirdNet
- a 3-channel BEV image (height, density, intensity of the points) extracted from the full LiDAR point cloud; a rasterisation sketch follows this list
- a ground plane estimate for determining the height of the 3D bounding boxes
Proposed pipeline with BirdNet
- BirdGAN
  - translates the 2D RGB image into a 3-channel BEV image (height, density, intensity)
- Image-to-3D network
  - like [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision
  - input: 3-channel RGB image
  - generates either a point cloud or its voxelized version as the 3D model
  - the 3D model is used to obtain the ground plane estimate for constructing the 3D bounding boxes around the detected objects
- BirdNet
  - input: the generated BEV image and the ground plane estimate
  - the BEV detections are then converted to 3D detections using the ground plane estimate
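For reference, a minimal numpy sketch of how such a 3-channel BEV image could be rasterised from a LiDAR point cloud (the target format that BirdGAN learns to produce); the grid extents, 0.1 m resolution, height offset, and log-based density normalisation are illustrative assumptions, not BirdNet's exact encoding.

```python
import numpy as np


def make_bev(points, x_range=(0.0, 70.0), y_range=(-35.0, 35.0), res=0.1, z_min=-2.5):
    """Rasterise (x, y, z, intensity) points into a 3-channel BEV image:
    channel 0 = max height (shifted by z_min), 1 = point density, 2 = intensity."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((3, h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    xi, yi = xi[ok], yi[ok]
    z, refl = points[ok, 2], points[ok, 3]
    np.maximum.at(bev[0], (xi, yi), z - z_min)          # height channel (non-negative)
    np.add.at(bev[1], (xi, yi), 1.0)                    # raw point counts per cell
    np.maximum.at(bev[2], (xi, yi), refl)               # reflectance/intensity channel
    bev[1] = np.minimum(1.0, np.log1p(bev[1]) / np.log(64.0))  # normalised density
    return bev
```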
C. MV3D as base architecture
Proposed pipeline with MV3D
- BirdGANs
  - input: 2D RGB image
  - translate it into an (M+2)-channel BEV image
- Image-to-Depth network
  - input: 2D RGB image
  - obtains the corresponding depth image
  - the depth map is back-projected into a 3D point cloud (see the sketch after this list)
  - the point cloud is used to generate the LiDAR FV (front view) image
- MV3D
  - input: RGB, FV, and BEV images
  - outputs the 3D detections
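A minimal numpy sketch of the depth-to-point-cloud step, assuming a standard pinhole camera model with intrinsics (fx, fy, cx, cy); the resulting camera-frame points would then be rendered into the LiDAR-style front-view image.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map of shape (H, W) into camera-frame 3D points
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                             # keep pixels with valid depth
```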
The difference between MV3D and BirdNet lies in the format of the BEV
- BirdNet
  - takes a 3-channel BEV image (height, intensity, density)
- MV3D
  - pre-processes the height channel to encode more detailed height information
  - divides the point cloud into M slices
  - computes a height map for each slice
  - giving a BEV image of M+2 channels (see the sketch below)
- using multiple independently trained BirdGANs to generate the M height channels of the BEV image works better than directly generating the (M+2)-channel BEV image
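A minimal numpy sketch of this (M+2)-channel encoding, assuming M equal-height slices between fixed z_min and z_max; the slice boundaries, grid parameters, and density normalisation are illustrative assumptions rather than MV3D's exact settings.

```python
import numpy as np


def make_mv3d_bev(points, M=4, z_min=-2.4, z_max=1.0,
                  x_range=(0.0, 70.0), y_range=(-35.0, 35.0), res=0.1):
    """(M+2)-channel BEV: one max-height map per z-slice, plus density and intensity."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((M + 2, h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    xi, yi = xi[ok], yi[ok]
    z, refl = points[ok, 2], points[ok, 3]
    sl = np.clip(((z - z_min) / (z_max - z_min) * M).astype(int), 0, M - 1)
    for m in range(M):                                   # per-slice max-height maps
        sel = sl == m
        np.maximum.at(bev[m], (xi[sel], yi[sel]), z[sel] - z_min)
    np.add.at(bev[M], (xi, yi), 1.0)                     # density channel
    bev[M] = np.minimum(1.0, np.log1p(bev[M]) / np.log(64.0))
    np.maximum.at(bev[M + 1], (xi, yi), refl)            # intensity channel
    return bev
```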
D. Ground plane estimation
The ground plane estimate is used differently by the two base architectures
- BirdNet
  - uses the ground plane, i.e. the bottom-most points, to estimate the height of the object for constructing the 3D bounding boxes
- MV3D
  - obtains the 3D localizations by projecting the 3D bounding boxes onto the ground plane
Ground plane estimation is an important step here, especially for MV3D, as it governs the size of the projected objects on the BEV, impacting the quality of 3D object localization.
Two ways to obtain the ground plane
- by reconstructing a 3D model from a single RGB image, e.g. with
  - Perspective Transformer Networks: [NeurIPS, 2016] Learning single-view 3d object reconstruction without 3d supervision
  - point set generation
  - depth estimation: [CVPR, 2017] Unsupervised monocular depth estimation with left-right consistency
- by using the image to directly estimate the ground plane without transforming the image into 3D
  - requires the explicit presence of strong 2D object proposals or texture/color patterns
Method of this paper
- choose the former paradigm: reconstruct the 3D object/scene with PTN
- the ground plane is then estimated by fitting a plane using RANSAC [31] (a minimal sketch is given below)
- RANSAC [31]: [ICCAS, 2014] Robust ground plane detection from 3d point clouds
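A minimal numpy RANSAC sketch for fitting a plane to the reconstructed points; the inlier threshold and iteration count are illustrative assumptions, not the exact settings of [31].

```python
import numpy as np


def ransac_ground_plane(points, n_iters=200, thresh=0.05, seed=0):
    """Fit a plane n·p + d = 0 to an (N, 3) point set with RANSAC;
    returns (normal, d) of the model with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)                   # candidate plane normal
        if np.linalg.norm(n) < 1e-6:
            continue                                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p1)
        inliers = np.sum(np.abs(points @ n + d) < thresh)
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (n, d)
    return best_model
```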
Experiments
Dataset
KITTI
- training: 7,481 images
- testing: 7,518 images
- validation
  - following [NeurIPS, 2015] 3d object proposals for accurate object class detection, the KITTI training set is split into train and validation sets, each containing half of the images
  - the training and validation sets do not come from the same video sequences; the bounding box proposals are evaluated on the validation set
Training data for BirdGAN
BirdGAN is trained on two types of training data
- w/o clipping
  - use all the data in the field of view of the RGB image, i.e. 90° in the front view
- clipping
  - in the KITTI dataset, far-away objects are difficult to detect, mostly due to occlusion
  - only nearby objects are used for training the GANs, i.e. the BEV data corresponding to points more than 25 meters away is removed, and these modified BEV images are used to train the GAN-based translation model
A. Quantitative results
BEV Object Detection
- MV3D is better than BirdNet on both real and generated data
- networks trained with clipped data perform 10-25% higher than the corresponding networks trained without clipping: the lower-noise training data yields a better quality BEV generator and thus better test performance
3D Object Detection
Generalization Ability
Demonstrated on AVOD that the proposed method can be used as a drop-in replacement
2D Object Detection
Even with entirely generated data, the method performs close to the base networks for 2D object detection
B. Qualitative Results
- actual BEV images for the compared methods (first three columns) vs. generated BEV images for the proposed methods
- first and second columns: the detections of our MV3D using generated BEVs are very close to the MV3D results on real BEV images
- the second column shows that our BirdNet is highly sensitive to occlusion (the pole affects the detection of the car)
C. Ablation Studies
Studied
- the impact of the various channels within the BEV on 3D object detection and localization
- whether combining generated and real data can improve detection and localization performance
  - first attempt: merge the ground truth training images and the generated BEV images into a common data set for training. Performance drops, possibly because the network cannot optimize over the merged data set: for the same scene, one image is real and the other is generated, so they have different statistics and may confuse the detector when trained together
  - second attempt: train two independent networks, one on ground truth BEVs and one on generated BEVs, and combine their outputs with a fusion operation (concatenation or averaging); a sketch of such a fusion follows this list
    - with the mean operation, performance improves, although it drops on the hard cases (possibly because they contain severe occlusion, where the generated BEVs hurt performance)
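A minimal sketch of such a fusion, under the simplifying assumption that the two detectors (one trained on real BEVs, one on generated BEVs) produce spatially aligned output tensors of the same shape; the function and argument names are illustrative.

```python
import numpy as np


def fuse_outputs(out_real, out_gen, mode="mean"):
    """Late-fuse per-anchor outputs of the ground-truth-trained and the
    generated-BEV-trained networks before the final detection stage."""
    assert out_real.shape == out_gen.shape
    if mode == "mean":                                   # element-wise averaging
        return 0.5 * (out_real + out_gen)
    if mode == "concat":                                 # channel-wise concatenation
        return np.concatenate([out_real, out_gen], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")
```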
Conclusion
- Using GANs to generate 3D data from 2D images can bring performance close to state-of-the-art 3D object detectors
- Two generation mechanisms, matched to two recent 3D object detection architectures, are proposed along with training strategies that yield better detection performance
- Late fusion of networks trained on real and generated 3D data can improve their respective performance