Depth Estimation and Applications

I. Introduction
      Estimating depth from a 2D image is a key step in scene understanding and reconstruction tasks such as 3D object detection and segmentation. Recovering depth information from a single image is known as the Monocular Depth Estimation (MDE) problem.

II. References and Data

Reference papers:
1. Deep Ordinal Regression Network for Monocular Depth Estimation (CVPR 2018)
2. MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization (AAAI 2019, oral)

Reference articles:

1. https://blog.csdn.net/kingsleyluoxin/article/details/82377902

2. https://cloud.tencent.com/developer/article/1399535

3. https://blog.csdn.net/qq_26697045/article/details/84796815

Reference code:

1. https://github.com/hufu6371/DORN

2. https://github.com/Zengyi-Qin/MonoGRNet

III. Overview

Depth estimation is part of the 3D reconstruction problem in computer vision, i.e., Shape from X, where X includes stereo, multi-view stereo, silhouette, motion (SfM, SLAM), focus, haze, shading, occlusion, texture, vanishing points, and so on. The first five of these take multiple images as input and derive depth from spatial geometry, temporal relations, or a varying focal length; the rest take a single image as input.

Depth estimation can be used for 3D modeling, scene understanding, depth-aware image compositing, and other fields.

Monocular depth estimation is based on the relationship between pixel values and depth; the approach is to fit a function that maps an image to a depth map. From the plotted contours and the resulting depth maps, it can be seen that such a function can indeed recover relative depth values from pixel values, as illustrated in the sketch below.
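As a minimal sketch of this idea (a hypothetical toy model in PyTorch, not the architecture from either reference paper), a small convolutional network can be trained to regress one depth value per pixel:

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    # Toy model: fits a function f(image) -> per-pixel depth map.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),   # one depth value per pixel
        )

    def forward(self, x):
        return self.net(x)

model = TinyDepthNet()
image = torch.randn(1, 3, 128, 160)           # dummy RGB input
pred = model(image)                           # (1, 1, 128, 160) depth map
target = torch.rand(1, 1, 128, 160)           # dummy ground-truth depth
loss = nn.functional.l1_loss(pred, target)    # supervised regression loss
loss.backward()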

If a model of image blur is available, depth can also be estimated from a single image based on the blur response at image edges, i.e., shape from defocus.

Conventional monocular depth estimation methods typically take a single-view image as input and directly predict a depth value for each pixel. This formulation means they generally require large amounts of depth-annotated data, which is usually expensive to acquire. As a result, much recent work on MDE relies heavily on unsupervised learning.

IV. Network Structure (DORN)


  The network is composed of a dense feature extractor, a multi-scale feature learner (ASPP), cross-channel information learning, a full-image encoder, and an ordinal regression module.

1. Dense feature extractor

  Conventional DCNNs contain repeated max-pooling and striding, which greatly reduces the spatial resolution of the feature maps. Following the paper, the last few pooling layers of the DCNN are removed and dilated (atrous) convolutions are used instead, which enlarges the receptive field without reducing the spatial resolution or increasing the number of parameters.
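A small PyTorch comparison (illustrative layer sizes, not DORN's actual configuration) shows how a dilated convolution enlarges coverage while preserving resolution, whereas a pooled stage halves it:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 64, 64)

# Conventional stage: pooling halves the spatial resolution.
pooled = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.MaxPool2d(2))
print(pooled(x).shape)   # torch.Size([1, 64, 32, 32])

# Dilated stage: same parameter count as a plain 3x3 convolution, but
# each output sees a 5x5 input window (dilation=2); resolution is kept.
dilated = nn.Conv2d(64, 64, 3, padding=2, dilation=2)
print(dilated(x).shape)  # torch.Size([1, 64, 64, 64])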


2. Scene understanding module

  It consists of three parts: ASPP, cross-channel learning, and full-image encoding.

  ASPP:

  The ASPP part applies dilated convolutions with different dilation rates (12, 18) in parallel, which yields effective convolutions with different receptive field sizes without changing the feature resolution, and thus produces fused multi-scale features. To obtain these multi-scale features, the ASPP module replaces part of the conventional feature extraction network with multi-scale dilated convolutions, so that features at different scales characterize image regions of different sizes (see the sketch below).
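A PyTorch sketch of such an ASPP-style block (dilation rates and channel counts here are placeholders, not necessarily DORN's exact values):

import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 12, 18)):
        super().__init__()
        # Parallel dilated convolutions: different receptive fields,
        # identical output resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        )
        # 1x1 convolution mixes cross-channel information after fusion.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

y = ASPP()(torch.randn(1, 256, 33, 45))   # resolution unchanged: 33x45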

  A 1×1 convolution can learn complex cross-channel information.

  The full-image encoding captures global image context information.


Conventional methods obtain global context information through fully connected layers, where each output element is connected to every pixel of the feature map in order to capture the global characteristics of the image; this is costly in parameters. The full-image encoder proposed here achieves the same scene-level understanding with fewer parameters: the feature map is first passed through a pooling layer with kernel size k; the pooled feature is fed to a fully connected layer to obtain a C-dimensional vector, which is treated as a 1x1xC feature map; a 1x1 convolution then mixes the C channels to obtain a channel feature; finally, the resulting feature is copied along the spatial dimensions to generate a new WxHxC feature map.

(channel attention?)
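A minimal PyTorch sketch of this encoder, with placeholder sizes for k, C, and the pooled feature shape (the real DORN settings may differ):

import torch
import torch.nn as nn

class FullImageEncoder(nn.Module):
    def __init__(self, in_ch=512, C=512, k=4, pooled_hw=(8, 10)):
        super().__init__()
        self.pool = nn.AvgPool2d(k, stride=k)        # pooling with kernel k
        self.fc = nn.Linear(in_ch * pooled_hw[0] * pooled_hw[1], C)
        self.mix = nn.Conv2d(C, C, 1)                # 1x1 conv, channel mixing

    def forward(self, x):
        H, W = x.shape[2], x.shape[3]
        v = self.fc(self.pool(x).flatten(1))         # global C-dim feature
        v = self.mix(v.view(-1, v.shape[1], 1, 1))   # treat as a 1x1xC map
        return v.expand(-1, -1, H, W)                # copy spatially to WxHxC

feat = torch.randn(1, 512, 32, 40)                   # assumed input size
out = FullImageEncoder()(feat)                       # (1, 512, 32, 40)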

V. Depth Estimation for 3D Object Detection (MonoGRNet)

Problem: objects contain far fewer pixels than the background, so a depth estimation algorithm that performs well overall does not guarantee accurate depth estimates for vehicles.

Here, semantic information and geometric information are fused, and the network performs instance-level depth estimation, i.e., Instance Depth Estimation (IDE).

IDE estimates the depth of the center of the 3D bounding box.

Early features provide global information (added to the low-level features); a coarse instance depth is computed from these global early features.

Deeper features are processed with RoIAlign to obtain fine-grained information about the detected object, and these features are combined with the early-feature depth to refine the instance depth (see the sketch below).
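A hypothetical PyTorch sketch of this coarse-plus-refinement idea (all shapes, layer names, and the single-image batch are illustrative; see the MonoGRNet repository for the actual design):

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class InstanceDepthHead(nn.Module):
    def __init__(self, early_ch=256, deep_ch=512):
        super().__init__()
        self.coarse = nn.Linear(early_ch, 1)          # depth from global feature
        self.refine = nn.Linear(deep_ch * 7 * 7, 1)   # residual from RoI feature

    def forward(self, early_feat, deep_feat, boxes):
        # Coarse instance depth from globally pooled early features
        # (assumes a single image in the batch for simplicity).
        g = early_feat.mean(dim=(2, 3))               # (1, early_ch)
        coarse = self.coarse(g)                       # (1, 1)
        # Fine-grained object features via RoIAlign on the deep feature map.
        roi = roi_align(deep_feat, boxes, output_size=(7, 7),
                        spatial_scale=1.0 / 16)
        delta = self.refine(roi.flatten(1))           # (num_boxes, 1)
        return coarse + delta                         # refined depth per box

early = torch.randn(1, 256, 48, 160)
deep = torch.randn(1, 512, 24, 80)                    # stride-16 feature map
boxes = [torch.tensor([[32., 40., 96., 120.]])]       # one box (x1, y1, x2, y2)
depth = InstanceDepthHead()(early, deep, boxes)       # (1, 1)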


VI. Discussion
