Many real-life application scenarios involve 3D information. Given the complex and diverse application scenarios of 3D vision technology, the large number of 3D perception tasks, and the complexity of the workflows involved, PaddlePaddle provides developers with PaddleDepth, a low-cost depth information collection solution, and Paddle3D, a full-process development kit for autonomous driving 3D perception.
3D Vision Technology Application Scenarios
3D vision has become a very popular field in recent years. It focuses on making computers mimic the human brain in understanding and analyzing data collected by sensors. In the past, 2D vision tasks were mostly about understanding and analyzing the color image information collected by cameras, but many real-life scenes require 3D information, so 3D vision tasks have emerged to meet this need.
As shown in the figure below, 3D vision technology has great application value in fields such as intelligent manufacturing, intelligent unmanned systems, and intelligent healthcare.
Depending on the application scenario, we need to model the problem in a targeted way. For example, in intelligent sports scenes, an athlete's posture must be quantitatively analyzed through 3D pose estimation; in autonomous driving, vehicles around the unmanned vehicle must be detected in real time through 3D object detection and tracking.
In summary, the scenarios that 3D vision tasks face are complex and changeable. We need to select appropriate sensors for specific business scenarios and determine the modeling method for the task based on the collected data. From this, two core problems emerge that must be solved to put 3D vision technology into practice.
3D Vision Application Difficulties and Solutions
In terms of data acquisition, existing 3D acquisition equipment suffers from problems such as high prices and sparse or low-resolution output. To address these problems, we propose PaddleDepth, the PaddlePaddle depth enhancement development kit. Commonly used depth acquisition devices fall into two categories: lidar and ToF (Time of Flight) devices. Lidar is often used in outdoor scenes; the depth information it collects is relatively sparse and cannot be used directly for dense 3D reconstruction, so the depth information needs to be completed. ToF devices are often used in indoor scenes; the depth information they collect is generally stored as low-resolution images, so super-resolution must be applied to the depth information. In addition, existing depth devices are expensive, which greatly limits their applicability in real scenarios. We therefore also consider estimating the depth of a scene directly from color images, which greatly reduces the cost of obtaining depth information; this is depth estimation. We have open-sourced these depth enhancement technologies in PaddleDepth, providing a low-cost depth information collection solution.
In the field of 3D perception, the end-to-end process from training and evaluation to deployment is very complicated. To address this, PaddlePaddle proposes Paddle3D, which focuses on 3D perception, covers a large number of 3D perception models, and provides full-process tutorials from training and evaluation to deployment, reducing users' development costs.
PaddlePaddle Depth Enhancement Development Kit—PaddleDepth
As shown in the figure below, PaddleDepth aims to provide a low-cost depth information collection solution with full coverage of three types of depth enhancement technology: depth completion, depth super-resolution, and depth estimation. At present, PaddleDepth contains more than 10 cutting-edge models and more than 4 self-developed algorithms open-sourced for the first time.
In terms of technical influence, PaddleDepth's self-developed algorithms for depth completion, depth super-resolution, and monocular/binocular depth estimation have achieved SOTA performance on various public datasets.
On the open-source KITTI dataset, PaddleDepth ranks first in self-supervised monocular depth estimation, supervised binocular depth estimation, and depth completion. On the Middlebury dataset, PaddleDepth ranks first in depth super-resolution, and it won the ECCV2020 Robust Vision Challenge stereo matching track. Its depth enhancement technology is industry-leading.
The results are shown below:
Depth completion results
Compared with the sparse depth map obtained directly from lidar, depth completion gives users dense depth estimates and enables better 3D reconstruction.
Depth Completion Results
Point cloud reconstruction result after completion
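To make the completion step concrete, the toy sketch below densifies a sparse depth map by copying each missing pixel from its nearest valid measurement. This only illustrates the input/output relationship; PaddleDepth's completion models are learned networks that also use the color image as guidance, and `densify_nearest` is a hypothetical helper name, not a PaddleDepth API.

```python
import numpy as np

def densify_nearest(sparse_depth):
    """Fill zero (missing) pixels with the nearest valid depth value.

    A naive stand-in for learned depth completion: real models use
    the color image as guidance instead of plain nearest-neighbour fill.
    """
    h, w = sparse_depth.shape
    valid = np.argwhere(sparse_depth > 0)        # coordinates of known depths
    if len(valid) == 0:
        return sparse_depth.copy()
    dense = np.empty_like(sparse_depth)
    for i in range(h):
        for j in range(w):
            # pick the valid pixel with the smallest squared distance
            d2 = (valid[:, 0] - i) ** 2 + (valid[:, 1] - j) ** 2
            y, x = valid[np.argmin(d2)]
            dense[i, j] = sparse_depth[y, x]
    return dense

# 4x4 map with only two lidar hits; all other pixels get filled in
sparse = np.zeros((4, 4), dtype=np.float32)
sparse[0, 0] = 2.0
sparse[3, 3] = 8.0
dense = densify_nearest(sparse)
```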
Depth map super-resolution results
Through depth image super-resolution, users can obtain denser 3D reconstruction results.
Left: super-resolution result; right: original point cloud result
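The naive baseline below upsamples a low-resolution depth map by nearest-neighbour replication, which is exactly what learned super-resolution improves upon: models like those in PaddleDepth use the high-resolution color image to recover sharp depth boundaries. `upsample_nearest` is an illustrative name, not a PaddleDepth API.

```python
import numpy as np

def upsample_nearest(depth, scale):
    """Nearest-neighbour upsampling: replicate each depth value in a
    scale x scale block. Learned super-resolution replaces this crude
    replication with color-guided reconstruction of fine structure."""
    return np.kron(depth, np.ones((scale, scale), dtype=depth.dtype))

low = np.array([[1.0, 2.0],
                [3.0, 4.0]], dtype=np.float32)
high = upsample_nearest(low, 2)   # 2x2 depth map -> 4x4 depth map
```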
Monocular depth estimation results
Through monocular depth estimation, users can reconstruct the 3D information of the original object from a single image.
Monocular Depth Estimation Results
Monocular depth estimation point cloud reconstruction results
Binocular depth estimation results
Through the principle of binocular ranging (stereo triangulation), users can reconstruct the three-dimensional information of the original object more accurately.
Binocular Depth Estimation Results
Binocular Depth Estimation Point Cloud Reconstruction Results
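The triangulation behind binocular ranging reduces to the formula Z = f·B/d: depth is focal length times camera baseline divided by per-pixel disparity. The sketch below applies it directly; the focal length and baseline are illustrative values, and the hard part a stereo network actually solves is estimating the disparity itself.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert stereo disparity to metric depth via triangulation:
    Z = focal_length * baseline / disparity.

    disparity  : per-pixel horizontal shift between left/right views (px)
    focal_px   : camera focal length in pixels
    baseline_m : distance between the two cameras in meters
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)   # zero disparity -> infinitely far
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative KITTI-like rig: ~721 px focal length, 0.54 m baseline
depth = disparity_to_depth([[72.1, 36.05]], focal_px=721.0, baseline_m=0.54)
# halving the disparity doubles the estimated depth
```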
As shown in the figure below, comparing the 3D reconstruction results shows that the methods above all produce more reasonable reconstructions. Among them, depth completion and binocular depth estimation are the most accurate, thanks respectively to the lidar input and the geometric constraints of the binocular setup.
PaddleDepth-point cloud reconstruction results display
In summary, given the limitations of existing 3D information collection equipment, we propose PaddleDepth to provide a low-cost depth information collection solution.
- Depth map super-resolution mainly solves the problem of low resolution in collected depth images;
- Depth completion mainly solves the problem of sparse collected depth images;
- Depth estimation directly from input color images further reduces the cost of 3D information collection.
PaddlePaddle 3D Perception Development Kit—Paddle3D
As mentioned earlier, one of the difficulties in 3D perception development is that there are many tasks and the workflows are complex. Against this background, we designed and developed the Paddle3D perception development kit.
The figure below shows the overall architecture of Paddle3D. The bottom layer is the framework layer, built on the PaddlePaddle core framework. On top of it, we provide basic components, including integrations of common datasets and operators specific to the 3D domain. Above that is the algorithm layer, which contains the different types of algorithms. The top layer is the tool layer, which integrates other PaddlePaddle tools.
Paddle3D has four key features: a rich model library, flexible framework design, end-to-end full-process coverage, and seamless integration with Apollo for deployment.
Rich model library
Paddle3D covers cutting-edge and classic models in many different directions. For monocular 3D detection based on a single camera, it includes classic models such as SMOKE and CaDDN; the advantage of this class of methods is that cameras are cheap and costs are controllable. Paddle3D also integrates lidar-based detection models, i.e., point cloud detection models such as PointPillars and IA-SSD; since point cloud data carries three-dimensional information, point-cloud-based 3D detection achieves higher accuracy than monocular 3D detection. Paddle3D also supports multi-modal models, which fuse data from different modalities and therefore offer better robustness. In addition, Paddle3D supports currently popular multi-view detection models such as BEVFormer and PETR. Users can choose the appropriate model to verify against their actual scenarios.
In point-cloud-based 3D detection, a common problem is that GPU memory and computation costs are very high. To avoid these costs, many methods adjust the model structure and map features from 3D space down to 2D space, reducing memory consumption but also reducing model accuracy. Paddle's solution to this problem is sparse convolution (SparseConv), which uses rule tables to skip computation on empty space, addressing both the memory and the computation problem without collapsing the 3D representation.
PaddlePaddle framework version 2.4 provides these capabilities, and Paddle3D has integrated many cutting-edge models built on SparseConv, such as PV-RCNN and Voxel R-CNN.
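To illustrate the idea behind sparse convolution (this is a conceptual sketch, not Paddle's SparseConv API), the toy code below stores only occupied voxels and uses a hash map in the role of the rule table: for each output site it looks up which kernel offsets land on occupied inputs, so empty space costs nothing. It implements the submanifold variant, where outputs exist only at already-occupied sites.

```python
import numpy as np
from itertools import product

def submanifold_sparse_conv3d(coords, feats, kernel, k=3):
    """Toy submanifold sparse 3D convolution.

    coords : (N, 3) integer voxel coordinates of occupied sites
    feats  : (N, C_in) features per occupied site
    kernel : (k, k, k, C_in, C_out) dense kernel weights
    """
    # hash map from voxel coordinate to row index: the "rule table"
    index = {tuple(c): i for i, c in enumerate(coords)}
    r = k // 2
    out = np.zeros((len(coords), kernel.shape[-1]))
    # outputs only at occupied sites; empty voxels are never visited
    for oi, c in enumerate(coords):
        for off in product(range(-r, r + 1), repeat=3):
            nb = (c[0] + off[0], c[1] + off[1], c[2] + off[2])
            ii = index.get(nb)
            if ii is not None:                     # skip empty neighbours
                w = kernel[off[0] + r, off[1] + r, off[2] + r]
                out[oi] += feats[ii] @ w
    return out

# two occupied voxels out of an (implicitly) huge grid
coords = np.array([[0, 0, 0], [0, 0, 1]])
feats = np.ones((2, 1))
kernel = np.ones((3, 3, 3, 1, 1))
out = submanifold_sparse_conv3d(coords, feats, kernel)
# each site sees itself and its one occupied neighbour -> [[2.], [2.]]
```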
The model accuracy and speed metrics listed in the figure above show strong results.
Flexible framework design
The framework design of Paddle3D can meet the needs of different users. Users who need to integrate Paddle3D into specific tasks can carry out rapid secondary development based on the APIs Paddle3D provides.
Taking model training as an example, as shown below, Paddle3D completes model construction, dataset loading, optimizer definition, and so on through six APIs, then starts training. Users who do not need secondary development can instead configure the components in the provided configuration files and start training with a single command.
1. Six APIs complete model training, meeting secondary development and integration needs
```python
# Specify the training dataset
train_dataset = KittiMonoDataset(
    dataset_root='datasets/KITTI',
    mode='train',
    transforms=[
        T.LoadImage(reader='pillow', to_chw=False),
        T.Gt2SmokeTarget(mode='train', num_classes=3),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

# Define the model
model = SMOKE(
    backbone=DLA34(),
    head=SMOKEPredictor(num_classes=3),
    depth_ref=[28.01, 16.32],
    dim_ref=[[3.88, 1.63, 1.53], [1.78, 1.70, 0.58], [0.88, 1.73, 0.67]])

# Learning rate update strategy
lr_scheduler = paddle.optimizer.lr.MultiStepDecay(
    milestones=[36000, 55000],
    learning_rate=1.25e-4)

# Define the optimizer
optimizer = paddle.optimizer.Adam(
    learning_rate=lr_scheduler,
    parameters=model.parameters())

# Specify the trainer
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    iters=20,
    train_dataset=train_dataset)

# Start training
trainer.train()
```
2. Configuration-file-driven training, started with a single command line
```yaml
batch_size: 8
iters: 70000

train_dataset:
  type: KittiMonoDataset
  dataset_root: datasets/KITTI
  transforms:
    - type: LoadImage
      reader: pillow
      to_chw: False
    - type: Normalize
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]

lr_scheduler:
  type: MultiStepDecay
  milestones: [36000, 55000]
  learning_rate: 1.25e-4

optimizer:
  type: Adam
```

```shell
python tools/train.py --config configs/smoke/smoke_dla34_no_dcn_kitti.yml --iters 20 --log_interval 1 --num_worker 5
```
End-to-end full process coverage
Starting from data preparation, Paddle3D provides interfaces for common point cloud datasets and scripts for generating training data. During training, Paddle3D integrates VisualDL so that metrics can be viewed in real time. For the final model deployment stage, complete and detailed tutorials and deployment scripts are provided, along with deep optimization of model inference performance.
Seamless integration with Apollo
Following the Paddle3D-based perception model development process, after training a model with Paddle3D, users place the model into the Apollo project, replace the original perception model, call the relevant perception interfaces, and then launch DreamView, Apollo's autonomous-driving front end, to view the model's predictions.
This supports rapid verification of model performance and high-performance fusion of multi-modal models, enabling efficient construction of a full-stack autonomous driving technology solution.
Perception model development process based on Paddle3D
In summary, PaddlePaddle addresses two difficulties in 3D perception tasks.
- 3D data collection. Acquisition equipment is expensive, its resolution is low, and the depth maps collected by lidar are sparse. PaddleDepth provides developers with a low-cost depth information collection solution.
- 3D information application. Difficulties include the many ways to model each task and the high cost of getting started. Paddle3D provides developers with a full-process development solution for 3D perception, covering a large number of 3D perception models and offering full-process tutorials from training and evaluation to deployment, so users can quickly verify results.