Monocular Depth Estimation--Deep Learning

One: Depth Estimation Application Background

1. Definition of Depth Estimation

Suppose we have a 2D image $I$; we need a function $F$ that maps it to the corresponding depth $d$. This process can be written as:

$$d = F(I)$$

The depth $d$ here represents, for each pixel of the 2D image, the actual distance between the camera and the 3D point that projects onto that pixel.
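To make $d = F(I)$ concrete, here is a minimal PyTorch sketch. `TinyF` is a made-up stand-in for a real trained depth network, used only so the example runs end to end:

```python
import torch
import torch.nn as nn

# TinyF is a hypothetical stand-in for a trained depth network F;
# it maps a 3-channel RGB image to a 1-channel per-pixel depth map.
class TinyF(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),   # 1 output channel: depth
        )

    def forward(self, I):
        return self.net(I)

F = TinyF().eval()
I = torch.rand(1, 3, 256, 832)   # one RGB image, (batch, channels, H, W)
with torch.no_grad():
    d = F(I)                      # depth map, shape (1, 1, 256, 832)
print(d.shape)
```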

But as we all know, $F$ is a very complex function: recovering depth from a single image amounts to inferring three-dimensional structure from a two-dimensional projection. Even the human visual system needs two eyes to localize objects in the natural world, and it still makes mistakes, let alone a single photo. Traditional methods therefore perform poorly on monocular depth estimation, and research focused instead on stereo vision, i.e. obtaining depth information from multiple pictures: two views of the same scene yield a disparity that changes with viewpoint, from which depth can be recovered. But more on that later; let's look back first.

2. Application scenarios of depth estimation

[Figures: two example application scenarios of depth estimation]
In addition to the application scenarios shown in the figures above, depth estimation also feeds a series of downstream tasks that require depth information, such as 3D reconstruction, obstacle detection, and SLAM. Depth estimation therefore usually appears as an upstream task, and its importance is self-evident.

3. Several methods of depth estimation

  • LiDAR or structured light
    Reflect a laser or structured-light pattern off object surfaces to obtain a high-precision depth point cloud. This is the "money is no object" method: the sensor measures depth directly, but the hardware is expensive!

  • Traditional binocular ranging
    Binocular stereo vision uses two cameras which, like a pair of human eyes, perceive objects in three dimensions and recover their length, width, and depth. The cameras are usually calibrated manually (e.g. with Zhang Zhengyou's camera calibration algorithm), and the camera's intrinsic and extrinsic matrices are then derived from the positions of target points in the image and world coordinate systems; the whole procedure is essentially a chain of coordinate transformations. A small triangulation sketch follows this list.

  • Traditional monocular ranging
    Monocular vision only recovers two-dimensional information about an object, i.e. length and width, so to measure distance you need several images from different viewpoints (over time), solved with a family of methods in the style of Mobileye's monocular ranging. The computation is heavy and the accuracy is lower than binocular, so it tends to be used when conditions allow nothing better.
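As a concrete illustration of the binocular case, the classic triangulation relation is $Z = f \cdot B / \text{disparity}$ (focal length $f$ in pixels, baseline $B$ in meters). A tiny NumPy sketch with illustrative numbers:

```python
import numpy as np

f_px = 721.5        # focal length in pixels (illustrative value)
baseline_m = 0.54   # distance between the two cameras in meters (illustrative)

# Disparity: horizontal pixel shift of the same point between left/right views.
disparity = np.array([[10.0, 20.0],
                      [40.0, 80.0]])

depth_m = f_px * baseline_m / disparity   # per-pixel depth in meters
print(depth_m)  # larger disparity -> closer object
```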

4. Advantages and disadvantages of using deep learning estimation

With the groundwork laid on the common traditional methods, let's talk about today's protagonist: monocular depth estimation with deep learning. As the name suggests, the first thing deep learning brings to mind is End2End: throw the image into a trained network, with no manual intervention, and get the final depth map directly. In a word, convenient! At the same time, all we need is a monocular camera. In a word, low cost!

What are the disadvantages? First, the accuracy of depth estimation within roughly 80 m is decent, but beyond that the error grows very large, so limited accuracy and limited estimation range are both shortcomings. Of course, there is also the problem deep learning can never escape: it requires a large training set, which is clearly a non-trivial obstacle in environments where training data is scarce.

But it does sit at the cutting edge of the field, so it is well worth a proper look. Let's get to the point.

Two: Monocular depth estimation model

1. Dataset used

The depth estimation model explained here uses the KITTI dataset, captured on urban and rural roads. This dataset is widely used across many research fields, as shown in the figures below:
[Figures: sample scenes from the KITTI dataset]
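If you want to poke at the ground truth yourself: in the official KITTI depth benchmark, depth maps are stored as 16-bit PNGs where depth in meters equals the pixel value divided by 256, and 0 marks pixels with no LiDAR measurement. A minimal reading sketch (the file path is a placeholder):

```python
import numpy as np
from PIL import Image

png = Image.open("kitti/gt_depth/0000000005.png")  # placeholder path
raw = np.asarray(png, dtype=np.float32)

valid = raw > 0          # 0 means "no ground-truth measurement here"
depth_m = raw / 256.0    # official KITTI convention: meters = value / 256
print(depth_m[valid].min(), depth_m[valid].max())
```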

2. Overall network architecture

The depth estimation model takes an image as input and outputs an image containing depth information, so it is a generative model, and it inevitably revolves around the core process of encoding and decoding, as shown below:
[Figure: encoder-decoder pipeline]
Of course, real network architectures are not this simple, but they all revolve around encoding and decoding. Below is a network architecture recently published at CVPR, and I will explain this "behemoth" module by module:
[Figure: overall network architecture]
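To make the encode-decode pattern concrete before diving in, here is a toy PyTorch encoder-decoder. It is only a sketch of the general idea, not the CVPR architecture in the figure:

```python
import torch
import torch.nn as nn

class ToyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # downsample, grow channels
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # upsample back to input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

depth = ToyDepthNet()(torch.rand(1, 3, 192, 640))
print(depth.shape)  # torch.Size([1, 1, 192, 640])
```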

3. Module analysis

Ⅰ: Level

In fact, the resolution is halved after each of several pooling/striding stages. The backbone here is ResNet-101. The idea is similar to U-Net-style networks: extract feature maps at several scales for the later operations, as shown in the figure:
[Figure: multi-scale features from the ResNet-101 backbone]
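A sketch of pulling multi-scale feature maps out of a torchvision ResNet-101, one common way to realize such a backbone (pretrained weights are skipped just for the demo):

```python
import torch
from torchvision.models import resnet101

backbone = resnet101(weights=None)  # no pretrained weights, demo only
x = torch.rand(1, 3, 224, 224)

x = backbone.relu(backbone.bn1(backbone.conv1(x)))  # 1/2 resolution
x = backbone.maxpool(x)                             # 1/4
c2 = backbone.layer1(x)   # 1/4,  256 channels
c3 = backbone.layer2(c2)  # 1/8,  512 channels
c4 = backbone.layer3(c3)  # 1/16, 1024 channels
c5 = backbone.layer4(c4)  # 1/32, 2048 channels
for c in (c2, c3, c4, c5):
    print(c.shape)
```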

Ⅱ: ASPP

The author applies ASPP to the last feature map of the backbone. I won't repeat what ASPP is here; it is the combination of atrous convolution and SPP. The purpose of using ASPP is to add some feature diversity while retaining a certain resolution (a routine trick in the image segmentation domain), as shown in the figure:
[Figure: ASPP module]
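A minimal ASPP-style block, assuming the usual design of parallel dilated 3x3 convolutions fused by a 1x1 convolution; the dilation rates here are illustrative, and real ASPP usually adds a 1x1 branch and global pooling:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # One dilated conv per rate; padding == dilation keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = MiniASPP(2048, 256)(torch.rand(1, 2048, 7, 7))
print(y.shape)  # torch.Size([1, 256, 7, 7])
```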

Ⅲ: Feature map subtraction operation

In depth estimation research, the depth of object contours is a known challenge. To attack it effectively, the author performs an ingenious operation: two feature maps are subtracted, $A - B$, where feature map B has been upsampled to the same size as A. The difference extracts the contour features and yields the contour feature map L, as shown below:
[Figures: feature-map subtraction producing the contour map L]
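A sketch of the subtraction itself; the sizes and channel counts are made up, and bilinear `F.interpolate` stands in for whatever upsampling the paper actually uses:

```python
import torch
import torch.nn.functional as F

A = torch.rand(1, 64, 48, 160)   # finer-level feature map
B = torch.rand(1, 64, 24, 80)    # coarser-level feature map

B_up = F.interpolate(B, size=A.shape[-2:], mode="bilinear",
                     align_corners=False)  # bring B to A's size
L = A - B_up                     # the difference highlights contours
print(L.shape)  # torch.Size([1, 64, 48, 160])
```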

IV: Feature Fusion

This is a stacking step. First, the higher-level feature map is fused with the current level's feature map to obtain the intermediate feature map X, adding multi-scale information. Then the higher level's prediction R'' is concatenated with the current level's contour map L, and after this hodgepodge the current level's prediction R comes out. Every level is processed this way, as shown in the figure:
[Figure: per-level feature fusion]
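A hedged sketch of this fusion step; the channel counts and the exact fusion ops (addition, concatenation, a single conv head) are illustrative choices, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_cur = torch.rand(1, 64, 48, 160)  # current-level features
feat_hi = torch.rand(1, 64, 24, 80)    # higher-level features
R_prev = torch.rand(1, 1, 24, 80)      # higher-level prediction R''
L = torch.rand(1, 64, 48, 160)         # contour map from Module III

def up(t):  # upsample to the current level's spatial size
    return F.interpolate(t, size=feat_cur.shape[-2:], mode="bilinear",
                         align_corners=False)

X = feat_cur + up(feat_hi)                      # multi-scale fusion -> X
head = nn.Conv2d(64 + 64 + 1, 1, 3, padding=1)  # prediction head
R = head(torch.cat([X, L, up(R_prev)], dim=1))  # this level's prediction R
print(R.shape)  # torch.Size([1, 1, 48, 160])
```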

Ⅴ: Coarse-to-Fine

Finally comes the stage of "sculpting" the details, where the R from every level is fused to obtain the final prediction R''', as shown in the figure:
[Figure: coarse-to-fine fusion of the per-level predictions]
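A sketch of the coarse-to-fine merge; simple upsampling and averaging stand in for whatever learned fusion the paper uses:

```python
import torch
import torch.nn.functional as F

Rs = [torch.rand(1, 1, 24, 80),    # coarsest-level prediction
      torch.rand(1, 1, 48, 160),
      torch.rand(1, 1, 96, 320)]   # finest-level prediction

full = Rs[-1].shape[-2:]
up = [F.interpolate(r, size=full, mode="bilinear", align_corners=False)
      for r in Rs]
R_final = torch.stack(up).mean(dim=0)  # merge all levels into one map
print(R_final.shape)  # torch.Size([1, 1, 96, 320])
```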

Ⅵ: Weight standardization (WS) and the pre_act operation

The real network adds a few more details. First, a weight standardization (WS) step is applied so that the distribution of the weight parameters is more even (and plays more nicely with activation functions such as Mish, Leaky ReLU, or Swish?). Then pre_act is added, which simply means applying ReLU to x first and only then entering the convolution layer. Compared with the plain baseline, their experiments do show a real jump in accuracy, as shown below:
[Figures: WS and pre-activation details]
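A sketch of both tricks, assuming the standard formulations: weight standardization normalizes each filter's weights to zero mean and unit variance before convolving, and pre-activation applies ReLU before the convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose filters are standardized on every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias,
                        self.stride, self.padding)

class PreActBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = WSConv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return self.conv(F.relu(x))  # ReLU first, then the convolution

y = PreActBlock(32)(torch.rand(1, 32, 16, 16))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```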

VII: Loss function

[Figures: the full and simplified loss functions]

The $d$ in the simplified loss function is the difference between the predicted depth and the real depth at each pixel. The interesting part is the later $\sum d_i d_j$ term: what does it mean, and why is there a negative sign in front of it? The overall shape matches the scale-invariant loss popularized by Eigen et al., $L = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\sum_{i,j} d_i d_j$. As an example, take two pixels whose prediction errors are $d_1$ and $d_2$. If both differences are negative, their product is positive, and with the negative sign in front the term is not punished. Conversely, if the two have different signs, the term is punished. So the purpose of this term in the loss function is to encourage predictions that are uniformly a little low or uniformly a little high, instead of overshooting here and undershooting there, which would actually make the result worse and the model less reliable.
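A sketch of that loss in the scale-invariant form described above; `lambda_ = 0.5` is a common choice, and note that $\sum_{i,j} d_i d_j = (\sum_i d_i)^2$, which is how it is computed here:

```python
import torch

def scale_invariant_loss(pred, gt, lambda_=0.5):
    d = pred - gt                # per-pixel error (often taken in log depth)
    n = d.numel()
    # Mean squared error minus the penalty-softening cross term.
    return (d ** 2).sum() / n - lambda_ * d.sum() ** 2 / n ** 2

pred = torch.rand(1, 1, 8, 8)
gt = torch.rand(1, 1, 8, 8)
print(scale_invariant_loss(pred, gt))
```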


  So far, I have briefly explained the principle of using deep learning for monocular depth estimation. I hope it will be helpful to everyone. If you have any questions or suggestions, please leave a comment below.

I am a Jiangnan salted fish struggling in the CV quagmire, let's work hard together and leave no regrets!
