A Low-Cost Ranging Solution: Monocular Depth Estimation

 
  


Article guide

Introduction: With the continuous development of computer vision, especially in cutting-edge applications such as autonomous driving, the depth information of an image is very important. Monocular distance measurement is favored by researchers for its low cost. I have recently been studying monocular distance measurement, and in this article I share BTS, a monocular depth estimation method, so that we can learn it together.

Part 01

The difference between monocular and binocular ranging principles

Monocular and binocular cameras are two different types of camera. Both can recover distance information by computing on the images they capture, but their ranging principles are completely different. Monocular ranging, usually called depth estimation, has relatively low accuracy: it first recognizes the target through image matching and then estimates the target's distance from its size in the image. Binocular ranging instead computes distance from the disparity map between the two images; this method does not need to identify the target's category, and its accuracy is higher than that of monocular ranging.
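As a quick illustration of the binocular principle, here is a minimal sketch of the disparity-to-depth conversion for a rectified stereo pair (the function name and the KITTI-like numbers are illustrative, not from the article):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Rectified stereo: depth Z = f * B / d, with f in pixels,
    the baseline B in meters, and the disparity d in pixels."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = focal_px * baseline_m / np.maximum(disparity, eps)
    depth[disparity <= 0] = np.inf  # zero/negative disparity carries no depth
    return depth

# Example with a KITTI-like setup: f ~ 721 px, B ~ 0.54 m
d = np.array([[10.0, 20.0], [40.0, 80.0]])
print(disparity_to_depth(d, focal_px=721.0, baseline_m=0.54))
```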

Part 02

Advantages and disadvantages of monocular and binocular ranging

The advantages of monocular distance measurement are low cost, a simple system structure, and modest computational requirements. Its disadvantage is that it must build and maintain a huge sample database to guarantee a high recognition rate, and its overall ranging accuracy is low.

The advantage of binocular ranging is high precision: it measures distance directly from the disparity map, without maintaining a sample database. The disadvantages are that a binocular system costs more than a monocular one and places very high demands on computing performance, usually requiring a dedicated image-processing chip.

Part 03

Difficulties in monocular ranging

Monocular depth estimation is an ill-posed problem, because infinitely many 3D scenes can project onto the same 2D image (a one-line example follows the note below). To understand the geometric configuration from a single image, one must consider not only local cues but also the global context.

Note: "well-posed problem" and "ill-posed problem" are terms from mathematics. A well-posed problem must satisfy the following three conditions; if any one of them is violated, the problem is called ill-posed:

(1) a solution exists;

(2) the solution is unique;

(3) the solution's behavior changes continuously with the initial conditions, i.e. the solution is stable and does not jump.
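To make the ambiguity concrete, consider a pinhole camera with focal length f: a 3D point (X, Y, Z) and any scaled copy (sX, sY, sZ) project to the same pixel, so a single image can never determine the absolute scale:

$$u = f\,\frac{X}{Z} = f\,\frac{sX}{sZ}, \qquad v = f\,\frac{Y}{Z} = f\,\frac{sY}{sZ} \quad \text{for any } s > 0$$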

Part 04

Proposal of the BTS method

A convolutional neural network for dense prediction usually consists of two parts: an encoder for dense feature extraction and a decoder for predicting the desired depth. In this encoder-decoder scheme, repeated strided convolutions and spatial pooling layers reduce the spatial resolution of the output, and skip connections or multi-layer deconvolution networks are employed to restore it to the input resolution, so that effective dense prediction is achieved. However, most current methods restore the feature maps to the original resolution in a rather direct way and lose information in the process; improving this step is the core contribution of the BTS paper.
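For reference, here is a minimal sketch of the kind of encoder-decoder the paper criticizes: the decoder restores resolution by direct upsampling plus skip connections (PyTorch; the layer sizes are illustrative, and this is not the BTS architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """Naive depth network: strided encoder + upsample-and-skip decoder."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)   # -> H/2
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)  # -> H/4
        self.dec2 = nn.Conv2d(64 + 32, 32, 3, padding=1)
        self.dec1 = nn.Conv2d(32, 1, 3, padding=1)             # 1-channel depth

    def forward(self, x):
        e1 = F.relu(self.enc1(x))                   # H/2 features
        e2 = F.relu(self.enc2(e1))                  # H/4 features
        u2 = F.interpolate(e2, scale_factor=2)      # direct upsampling (lossy)
        d2 = F.relu(self.dec2(torch.cat([u2, e1], dim=1)))  # skip connection
        u1 = F.interpolate(d2, scale_factor=2)
        return self.dec1(u1)                        # depth at input resolution

depth = TinyEncoderDecoder()(torch.randn(1, 3, 64, 64))
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```

The two F.interpolate calls are exactly the "direct" resolution recovery that BTS improves on with its LPG layers.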

The BTS network structure is shown in the figure below:

[Figure: BTS network structure diagram]

The network consists of an encoder, skip connections, atrous spatial pyramid pooling (ASPP), and LPG (local planar guidance) layers.

The innovation of BTS: a network structure called local planar guidance (LPG) layers is proposed, which associates the features at different scales in the decoding stage with the final depth prediction. A conventional encoder-decoder imposes the training-loss constraint only on the final decoded output, the depth map; in my view, the LPG layers proposed in this paper additionally act as constraints on the network's intermediate features.

Network performance: at the time of writing, the BTS method ranks 7th on the KITTI monocular depth estimation leaderboard, with an inference time of 60 ms, striking a balance between accuracy and speed.

Part 05

The specific implementation of the LPG layer

The core idea of the LPG layer: instead of the traditional approach of simply using nearest-neighbor upsampling plus skip connections to restore features to the original size, BTS defines a direct and explicit relation between the internal features and the final output. The LPG layers guide features to the full resolution, and their outputs are combined to obtain the final depth estimate.

Specifically, given a feature map with spatial resolution H/k, the proposed LPG layer estimates a 4D plane coefficient for each spatial cell, with the plane size corresponding to the feature resolution. The coefficients fit locally defined k × k patches at the full resolution H, and the results are concatenated and passed through a final convolutional layer for the final prediction. For example, when the input feature resolution is 1/4 of the original, the 4D vector output at each position fits a 4 × 4 plane. In short, each 4D plane coefficient fits a plane at a higher resolution than the input feature, so that feature maps of different resolutions all produce outputs of the same full size. The schematic diagram of the LPG layer is as follows:

[Figure: schematic diagram of the LPG layer]

(1)    Use 1×1 convolutions to reduce the number of channels, halving the channel count with each 1×1 convolution until only 3 channels remain (two for the plane normal's angles and one for the plane-to-origin distance, as described in the next steps), giving an H/k × H/k × 3 feature map.

(2)    Channels 1 and 2 represent the two degrees of freedom of the plane's normal vector, the polar angle θ and the azimuthal angle φ. The first two channels of the feature map are therefore interpreted as angles and converted into a unit normal vector using the following formula:

$$\mathbf{n} = (n_1, n_2, n_3) = (\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta)$$

Channel 3 represents the perpendicular distance n4 between the plane and the origin.

(3)    After this transformation, each spatial position corresponds to a 4D plane-coefficient vector (n1, n2, n3, n4).

(4)    To guide the features using the local-plane assumption, a ray-plane intersection is used to convert each estimated 4D plane coefficient into k × k local depth cues. The conversion formula is shown below, where (ui, vi) are the k × k patch-wise normalized coordinates of pixel i and c̃i is the resulting local depth cue; a code sketch combining all four steps follows the formula.

$$\tilde{c}_i = \frac{n_4}{n_1 u_i + n_2 v_i + n_3}$$
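Below is a minimal PyTorch sketch of an LPG-style layer written from the description above, assuming the 1×1 reductions to 3 channels have already been applied; the angle/distance bounds and the patch-coordinate convention are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class LocalPlanarGuidance(nn.Module):
    """Turn per-cell (theta, phi, distance) maps at resolution H/k
    into full-resolution local depth cues via ray-plane intersection."""
    def __init__(self, k, max_depth=80.0):
        super().__init__()
        self.k = k
        self.max_depth = max_depth  # assumed depth bound (e.g. 80 m for KITTI)
        # k x k patch-wise normalized coordinates in (0, 1]
        coords = (torch.arange(k, dtype=torch.float32) + 0.5) / k
        v, u = torch.meshgrid(coords, coords, indexing="ij")
        self.register_buffer("u", u)
        self.register_buffer("v", v)

    def forward(self, feat3):
        # feat3: (B, 3, H/k, W/k); channels = (theta, phi, distance)
        theta = torch.sigmoid(feat3[:, 0]) * (torch.pi / 3)  # bounded polar angle
        phi = torch.sigmoid(feat3[:, 1]) * (2 * torch.pi)    # azimuthal angle
        n4 = torch.sigmoid(feat3[:, 2]) * self.max_depth     # plane-to-origin distance
        # step (2): angles -> unit normal n = (sin t cos p, sin t sin p, cos t)
        n1 = torch.sin(theta) * torch.cos(phi)
        n2 = torch.sin(theta) * torch.sin(phi)
        n3 = torch.cos(theta)
        # broadcast each cell's coefficients over its k x k patch
        k = self.k
        n1, n2, n3, n4 = (x.repeat_interleave(k, dim=-2).repeat_interleave(k, dim=-1)
                          for x in (n1, n2, n3, n4))
        u = self.u.repeat(n1.shape[-2] // k, n1.shape[-1] // k)
        v = self.v.repeat(n1.shape[-2] // k, n1.shape[-1] // k)
        # step (4): ray-plane intersection, clamped to avoid degenerate planes
        return n4 / (n1 * u + n2 * v + n3).clamp(min=1e-3)

lpg = LocalPlanarGuidance(k=4)
cues = lpg(torch.randn(1, 3, 16, 16))  # 1/4-resolution features
print(cues.shape)                      # torch.Size([1, 64, 64])
```

As described above, BTS applies such a layer at several decoder scales and concatenates the resulting full-resolution cues for the final convolutional prediction.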

Part 06

Experiments

Evaluation indicators:

[Figure: definitions of the evaluation metrics]
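These are the standard monocular-depth evaluation metrics (absolute relative error, squared relative error, RMSE, log RMSE, and threshold accuracies δ < 1.25^t). A small NumPy sketch of how they are typically computed (the function and variable names are mine):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over valid (gt > 0) pixels."""
    mask = gt > 0                        # ignore pixels without ground truth
    pred, gt = pred[mask], gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel":  np.mean(np.abs(pred - gt) / gt),
        "sq_rel":   np.mean((pred - gt) ** 2 / gt),
        "rmse":     np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta1":   np.mean(ratio < 1.25),       # threshold accuracy
        "delta2":   np.mean(ratio < 1.25 ** 2),
        "delta3":   np.mean(ratio < 1.25 ** 3),
    }

print(depth_metrics(np.array([2.0, 4.1]), np.array([2.2, 4.0])))
```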

Experimental results: The author conducted experiments on two large public data sets: KITTI and NYU Depth V2. The following are the specific experimental results.

[Figures: quantitative comparisons and qualitative depth-map results on KITTI and NYU Depth V2]

Part 07

Summary

In this paper, the authors study the encoder-decoder structure and analyze the shortcomings of the brute-force upsampling used in the decoder of existing methods. They propose the LPG layer structure, which associates the features at different scales in the decoder stage with the final depth prediction, making fuller and more effective use of the features and thereby improving the overall performance of the network. I think this module could be migrated to other tasks, where it should likewise help improve network performance.



Origin: blog.csdn.net/qq_42722197/article/details/131278596