Real-time 3D Pose Estimation with a Monocular Camera Using Deep Learning and Object Priors On an Autonomous Racecar

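Background

Projecting a 3D object onto a plane loses one dimension: the distance to the object is no longer known. With prior information about the 3D object, however, that distance can be recovered.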
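
As a minimal illustration of why a size prior restores the lost dimension, the sketch below uses the pinhole camera model, where an object of real height H at depth Z spans h = f * H / Z pixels. The focal length, cone height and pixel height are made-up numbers for illustration, not values from the paper.

```python
def depth_from_known_height(focal_px: float, real_height_m: float, pixel_height: float) -> float:
    """Pinhole model: h = f * H / Z, so Z = f * H / h."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height


# Hypothetical numbers: focal length 1000 px, object 0.25 m tall, 50 px tall in the image.
depth_m = depth_from_known_height(1000.0, 0.25, 50.0)
print(depth_m)  # 5.0
```

The same object appearing half as tall in the image would be twice as far away, which is the intuition the pipeline builds on.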

To this end, we propose a low-latency real-time pipeline to detect and estimate the 3D position of multiple objects of interest using just a single measurement, i.e. a single image, without the need for any special external markers.

We propose a novel “keypoint regression” scheme that exploits prior information about the object’s shape and size to regress and find specific feature points on the image.

We propose a complete pipeline that detects objects and simultaneously estimates the pose of these multiple objects using just a single image by exploiting object priors.

As per the rules of the competition, the track is marked by cones. The left and right track limits are marked by blue and yellow traffic cones respectively.

A novel feature regression scheme, “keypoint regression”, is introduced, which is used to match 2D-3D correspondences.

This section shifts the focus to how to estimate the 3D position of multiple objects from a single image. Although this is an ill-posed problem, it becomes solvable with a priori information in the form of the shape, size and geometry of the object of interest, as elaborated in this chapter.

Advantages of using ROS

1. ROS communicates through nodes and provides message types for a variety of sensors and for navigation.
2. ROS is open-source and comes with a range of visualization and simulation tools.

The pipeline’s sub-modules are run as nodes using Robot Operating System or ROS [5] as the framework that eases handling of communication and data messages across multiple systems as well as different nodes. Different sub-modules communicate via messages: they receive data and output processed information. Another important aspect is that ROS is open-source and provides tools for visualization, monitoring and simulation, making it easy to integrate, test, diagnose and develop the complete software system.

Visual perception system (two parts)

Stereo
Monocular
The stereo and the monocular pipeline. The stereo pipeline uses the sub-modules explained in this section to have an extremely efficient way of triangulating and estimating depth from binocular vision. This methodology of drastically reducing the search space and cleverly tackling the issue of having numerous and often incorrect feature matches

Monocular pipeline

The monocular pipeline has 3 crucial sub-modules which enable it to detect multiple objects of interest and accurately estimate their 3D position up to a distance of 15 meters by making use of a single measurement in the form of an image captured by the monocular camera.

Three sub-modules

The monocular pipeline can be broken down into three parts: (1) multiple object detection, (2) keypoint regression and (3) 2D-3D correspondence followed by 3D pose estimation from a single image.

4.2 Multiple object detection

Object recognition has 4 main categories of tasks:
(1) classification, (2) classification and localization, (3) object detection and (4) instance segmentation.

Instead of using slow and computationally intensive cascade and sliding-window approaches, we employ a quick, real-time and powerful object detector in our pipeline in the form of YOLOv2.

4.2.1 Importance of color information

The path planning then has a cost function with a penalization term for potential paths that drive the car through same-colored cones.
How is the cone color obtained?
We design the detector such that the cone color information can be directly obtained from it. In other words, we treat each colored cone as a different class for the object detector.

4.2.2 Customizing YOLOv2 for Formula Student Driverless

Controlling the thresholds
We choose YOLOv2 for the purpose of detecting different colored cones. Thresholds for it are chosen such that false positives, incorrect detections and misclassification are avoided at any cost, even if that translates to not being able to detect all cones in a given image.
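
The trade-off described above (precision at the expense of recall) can be sketched as a simple confidence filter applied to the detector output. This is only an illustration: the `Detection` type, the `keep_confident` helper and the 0.8 threshold are hypothetical and are not taken from the paper or from the YOLOv2 code.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str                                # one detector class per cone color
    confidence: float                         # detector score in [0, 1]
    box: Tuple[float, float, float, float]    # (x, y, w, h) in pixels


CONE_CLASSES = ("yellow", "blue", "orange")


def keep_confident(detections: List[Detection], threshold: float = 0.8) -> List[Detection]:
    """Drop anything below the threshold; missing a cone is preferred to a false positive."""
    return [d for d in detections
            if d.label in CONE_CLASSES and d.confidence >= threshold]


detections = [
    Detection("blue", 0.95, (100.0, 200.0, 20.0, 60.0)),
    Detection("yellow", 0.40, (300.0, 210.0, 18.0, 55.0)),   # uncertain, discarded
    Detection("orange", 0.85, (500.0, 190.0, 25.0, 70.0)),
]
kept = keep_confident(detections)
print([d.label for d in kept])  # ['blue', 'orange']
```

Raising the threshold removes more doubtful detections, but it also drops real cones that the detector is unsure about.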

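I do not fully follow this part, but it amounts to recomputing the anchor boxes so that they match the tall, thin shape of the cone annotations.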
Since the annotations for cones are long and thin rectangular bounding boxes, we exploit such prior information by re-calculating the anchor boxes used by YOLOv2. This is done by performing k-means clustering on the aspect-ratio of the rectangle annotations in the dataset and improves the object detector’s performance.
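
A rough sketch of anchor re-calculation follows. It clusters annotation (width, height) pairs with k-means using 1 - IoU as the distance, which is the convention from the YOLOv2 paper; the passage above mentions clustering on aspect ratio, so treat the exact feature and distance as an assumption. The box sizes, the `kmeans_anchors` name and the deterministic initialization are all made up for illustration.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h), both placed at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """k-means over (w, h) pairs, assigning each box to the centroid with highest IoU."""
    # Deterministic init (an illustrative choice): spread centroids over boxes sorted by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical cone annotations in pixels: far (small) and near (large) cones, all tall and thin.
boxes = [(10, 30), (12, 36), (11, 33), (40, 120), (44, 130), (42, 126)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

Every resulting anchor is taller than it is wide, so the detector starts from box shapes that already resemble cones.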

The detector needs to distinguish and detect ‘yellow’, ‘blue’ and ‘orange’ cones that provide information about the track.

4.2.3 Training to detect cones (training samples)

4.3 Keypoint Regression

How is the geometry in the prior information known?
However, since there is prior information about the 3D shape, size and geometry of the cone, one has hope to recover 3D pose from a single measurement.

4.3.1 From patches to features: The need for “keypoint regression”

Using an object detector, cones can be detected in an image. However, one needs more information to go from detections on the image to 3D positions. We exploit a priori knowledge about the cone and a calibrated camera to help estimate its depth via 2D-3D correspondences.

When the resolution is low, or in other difficult cases, not enough 3D information can be extracted.
To this end, we introduce a feature extraction scheme that is inspired by classical computer vision but has a flavor of learning from data via machine learning.

4.3.2 Design and architecture of the “keypoint regressor”

Convolutional neural network
The primary difference between this scheme and any other feature extraction process is that this is very specific as compared to commonly used techniques.

In our case, we want to find the position of very specific points on the image that correspond to 3D counterparts whose locations can be measured in 3D from an arbitrary world frame Fw.

4.3.3 Loss function

The “keypoint network” also exploits a priori information about the object’s 3D geometry and appearance through the loss function. It uses the concept of the cross-ratio.
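
To make the cross-ratio less abstract, here is a small sketch showing the property that makes it useful as a geometric prior: for four collinear points it is unchanged by a projective transformation, so a value measured on the 3D object should also hold in the image. The formula below is one common convention for the cross-ratio; how exactly the paper embeds it in the loss is not covered in these notes, and the points and homography are arbitrary made-up values.

```python
import math


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four collinear points: (d13 / d14) / (d23 / d24)."""
    return (dist(p1, p3) / dist(p1, p4)) / (dist(p2, p3) / dist(p2, p4))


def apply_homography(H, p):
    """Map a 2D point through a 3x3 projective transformation."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Four collinear points (all on the line y = 2x) and an arbitrary homography.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]
H = [[1.2, 0.1, 5.0],
     [0.2, 0.9, -3.0],
     [0.001, 0.002, 1.0]]

before = cross_ratio(*points)
after = cross_ratio(*[apply_homography(H, p) for p in points])
print(before, after)  # equal up to floating-point rounding
```

A loss term can then penalize predicted keypoints whose cross-ratio deviates from the value known from the object's geometry.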

4.3.4 Training scheme

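The cone has 7 keypoints. A convolutional neural network is trained on samples until it can detect these 7 points. I do not fully understand how the loss function is defined or how the training scheme works.
[Figure: the 7 keypoints detected on cone images]
I do not fully follow the process, but the result is as shown in the figure above: even when a sample is blurry or partly occluded, deep learning can still locate the 7 keypoints precisely.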

4.4 2D-3D Correspondences and 3D Pose Estimation

The “keypoint network” provides accurate locations of very specific features, the keypoints. Since there is a priori information available about the shape, size, appearance and 3D geometry of the object (the cone in this case), 2D-3D correspondences can be matched. With access to a calibrated camera and 2D-3D correspondences, it is possible to estimate the pose of the object in question from a single image.

We use Perspective n-Point or PnP to estimate the pose of every detected cone (that is, solving for the transformation between the world frame and the camera frame).
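
The sketch below does not implement a PnP solver. It only shows the quantity such solvers typically minimize: the reprojection error between observed 2D keypoints and the 3D model points projected under a candidate pose (R, t). The seven model points, the intrinsics and the poses are hypothetical values chosen for illustration; in practice a library routine such as OpenCV's `solvePnP` would search for the pose.

```python
import math


def to_camera(R, t, p):
    """Apply a rigid transform: rotate the model point by R, then translate by t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a point given in the camera frame."""
    X, Y, Z = p_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)


def reprojection_rmse(model_pts, image_pts, R, t, K):
    """Root-mean-square pixel error of the model projected under pose (R, t)."""
    fx, fy, cx, cy = K
    sq = 0.0
    for P, (u, v) in zip(model_pts, image_pts):
        u_hat, v_hat = project(to_camera(R, t, P), fx, fy, cx, cy)
        sq += (u_hat - u) ** 2 + (v_hat - v) ** 2
    return math.sqrt(sq / len(model_pts))


# Hypothetical 3D keypoints of a cone in its own frame (meters, y pointing down).
MODEL = [(0.0, -0.30, 0.0), (-0.03, -0.20, 0.0), (0.03, -0.20, 0.0),
         (-0.06, -0.10, 0.0), (0.06, -0.10, 0.0), (-0.09, 0.0, 0.0), (0.09, 0.0, 0.0)]
K = (1000.0, 1000.0, 640.0, 360.0)            # assumed fx, fy, cx, cy in pixels
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

true_t = (0.5, 0.2, 8.0)                      # cone 8 m in front of the camera
observed = [project(to_camera(IDENTITY, true_t, P), *K) for P in MODEL]

err_true = reprojection_rmse(MODEL, observed, IDENTITY, true_t, K)
err_wrong = reprojection_rmse(MODEL, observed, IDENTITY, (0.5, 0.2, 10.0), K)
print(err_true, err_wrong)
```

The error vanishes at the pose that generated the observations and grows for a wrong depth, which is what lets a solver recover the cone's position from the seven correspondences.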


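6.1 Possible improvements

Change the YOLOv2 detection thresholds, weighing the trade-offs involved.
The field of view is limited; the camera orientation could be changed.
The biggest problem is data latency and processing speed; better hardware could help.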

6.2 Using the “keypoint regression” for efficient stereo triangulation
Keypoint regression applied with two monocular cameras
We use the “keypoint regression” and PnP on a single image from the left camera to acquire the 3D position of detected cones. This 3D position is further improved via additional information in the form of a second image of the same scene (captured at the same time instance) from the right camera. The position accuracy is improved by performing triangulation.
Presumably this narrows the search: the position detected with the left camera is used to narrow the search region in the right camera image.
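
A minimal sketch of this idea, assuming a rectified stereo pair where depth and disparity are related by Z = f * B / d. The monocular depth estimate predicts where the cone should appear in the right image, which narrows the search window, and the measured disparity then gives a refined depth. The focal length, baseline, pixel coordinates, margin and function names are hypothetical, not values from the paper.

```python
def expected_disparity(depth_m, focal_px, baseline_m):
    """Rectified stereo: disparity d = f * B / Z, in pixels."""
    return focal_px * baseline_m / depth_m


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Inverse relation: Z = f * B / d, in meters."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def right_search_window(x_left, depth_guess_m, focal_px, baseline_m, margin_px=10.0):
    """Column range in the right image where the same keypoint is expected."""
    centre = x_left - expected_disparity(depth_guess_m, focal_px, baseline_m)
    return (centre - margin_px, centre + margin_px)


focal_px, baseline_m = 1000.0, 0.25     # assumed calibration
x_left = 520.0                          # keypoint column in the left image
mono_depth = 9.0                        # rough depth from keypoint regression + PnP

window = right_search_window(x_left, mono_depth, focal_px, baseline_m)
x_right = 495.0                         # match found inside the window
refined_depth = depth_from_disparity(x_left - x_right, focal_px, baseline_m)
print(window, refined_depth)
```

Only a narrow band of the right image has to be searched, rather than matching features across the whole frame.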

iqiu

