Depth Estimation (1)

What is monocular depth estimation?
Bronze: in an image, depth means distance, so depth estimation is distance estimation. The distance in question is the distance between each pixel in the image and the camera.
Silver: for monocular depth estimation, the input is a single image and the output is the depth value of every pixel in that image. That is, a neural network extracts features from the input image and then performs a per-pixel regression to predict how far each pixel is from the camera.
What does depth estimation do?
Input: a color image.
Output: a depth map.
That is to say, given a color image, the output is the depth value, i.e. the distance value, of each pixel in that image.
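To make "per-pixel regression" concrete, here is a minimal PyTorch sketch. It is not the paper's network: the tiny encoder and the channel counts are made up purely for illustration; the only point is that the output has one channel (a depth value) for every input pixel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthNet(nn.Module):
    """Toy example: extract features from an RGB image, then regress one depth value per pixel."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # feature extraction
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, 1, 3, padding=1)     # 1 output channel = depth per pixel

    def forward(self, x):
        h, w = x.shape[-2:]
        feat = self.encoder(x)                         # features at 1/4 resolution
        depth = self.head(feat)                        # coarse depth map
        # bring the depth map back to the input resolution so every pixel gets a value
        return F.interpolate(depth, size=(h, w), mode='bilinear', align_corners=False)

img = torch.randn(1, 3, 224, 224)    # a dummy color image
print(TinyDepthNet()(img).shape)     # torch.Size([1, 1, 224, 224])
```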

Paper: Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals

The first trick: feature extraction network structure design (backbone)

Overall structure diagram (figure omitted).
Decomposition of the overall structure diagram (figure omitted).
A short story before the explanation: consider the math problem $-\frac{2}{5}+\frac{3}{4}$.
To me in first grade: what is this? I don't understand it at all.
To me in second grade: it looks vaguely familiar, but I don't know how to solve it.
To me in sixth grade: that's it? I can write the answer down right away.
In other words, we view the same thing differently at different ages.

The feature extraction network in this paper follows the same idea. For an input color image (the "thing"), if the feature extraction process is divided into several stages (the different "ages"), then what each stage captures has a different meaning (the different "views").

Question 1: why divide the feature extraction process into several stages?

As the decomposed structure diagram above shows, convolving the input color image (S X S X 3) involves many convolution operations. During this continuous convolution, we take out the feature maps whose size is 1/2, 1/4, 1/8, and 1/16 of the original input image, and these extracted feature maps are put to other uses.

Note: S/2 denotes a feature map whose size is half that of the input image. It might be the feature map produced by the 10th convolution, or by the 20th; S/2 describes only the size, not how many convolutions were performed to reach it.

Answer to question 1: to obtain feature diversity along the network's depth (another kind of feature diversity appears later). During feature extraction, the feature map produced by, say, the 10th convolution contains only shallow features, such as the image's textures or edge information; by the 20th convolution the extracted features are deeper and carry more abstract information than those from the 10th.

A traditional network simply convolves over and over, attaches a fully connected layer after the last convolution, flattens the feature map into a one-dimensional vector, and finally outputs a result to complete image classification or recognition.

In this paper, the feature maps produced along the way during this continuous convolution are extracted. Doing so serves two purposes: first, it realizes feature diversity (another manifestation of diversity appears later); second, these extracted features provide reference for, or directly participate in, the subsequent feature fusion.

Summary of the first trick: collect the features at each level, i.e. take feature maps of different levels out of the backbone.

Every convolution produces feature maps, so we can take them out along the way: when the feature map reaches size S/2, take a copy of it, and do the same at the other scales; the extracted feature maps are then used for other purposes.

For example, if the input color image is 224X224, then during feature extraction we separately take out the feature maps of size 112, 56, 28, and 14.
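A minimal sketch of this idea in PyTorch (the backbone here is a stand-in with one stride-2 stage per scale, not the paper's actual backbone; channel counts are arbitrary):

```python
import torch
import torch.nn as nn

def stage(c_in, c_out):
    # one stride-2 convolution stands in for "several convolutions" at each stage
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class MultiScaleBackbone(nn.Module):
    """Return the intermediate feature maps at 1/2, 1/4, 1/8 and 1/16 of the input size."""
    def __init__(self):
        super().__init__()
        self.stage1 = stage(3, 32)      # S   -> S/2
        self.stage2 = stage(32, 64)     # S/2 -> S/4
        self.stage3 = stage(64, 128)    # S/4 -> S/8
        self.stage4 = stage(128, 256)   # S/8 -> S/16

    def forward(self, x):
        s2 = self.stage1(x)
        s4 = self.stage2(s2)
        s8 = self.stage3(s4)
        s16 = self.stage4(s8)
        return s2, s4, s8, s16          # kept for later fusion instead of being thrown away

feats = MultiScaleBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape[-1] for f in feats])     # [112, 56, 28, 14]
```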

The second trick: taking differences

What is a difference?
For an image, before we can estimate how far each object is from the camera, we first have to tell the objects in the image apart; only then can we say how far the car is from the camera and how far the pedestrian is. So how do we tell the objects in the image apart?

Put another way: for a given pixel, how do I know whether it lies on an object's boundary or in its interior? That is, how can a neural network recover the contour information of the objects in the image?

So the question is how to capture the differences between the objects in the image, i.e. how to represent their contour information. It is somewhat similar to separating the objects in an image segmentation task.

The method used in this paper is the "difference", which feels very much like plain "subtraction".

As shown in the figure (the part outside the red box; figure omitted), the specific steps are as follows. Recall from the first trick that S/2 denotes a feature map half the size of the input image. Likewise, S/4, S/8, and S/16 denote feature maps at 1/4, 1/8, and 1/16 of the input size. All of these are obtained by downsampling.

Now do the following. Apply a convolution to the S/16 feature map (denote it $S_{16A}$); the result is still of size S/16 (denote it $S_{16B}$). Then upsample $S_{16B}$. The purpose of upsampling is, first, to obtain more features, and second, to double the S/16 feature map so that it becomes an S/8 feature map, which we denote $S_{8B}$; the S/8 feature map already taken from the backbone is denoted $S_{8A}$. The difference $L_4$ is then:
$L_4 = S_{8A} - S_{8B}$

Continue by upsampling $S_{8B}$ to obtain $S_{4B}$. The difference $L_3$ is:
$L_3 = S_{4A} - S_{4B}$

Upsample $S_{4B}$ to obtain $S_{2B}$, and upsample $S_{2B}$ to obtain $S_{1B}$. The differences $L_2$ and $L_1$ are:
$L_2 = S_{2A} - S_{2B}$
$L_1 = S_{1A} - S_{1B}$

A simple way to understand downsampling and upsampling: downsampling shrinks the feature map by a factor of n (here n is 2), much like a pooling operation with stride 2; upsampling enlarges the feature map by a factor of n (here also 2).
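A tiny PyTorch illustration, assuming average pooling with stride 2 as the downsampling and bilinear interpolation as the upsampling (these are common choices, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)                      # a feature map of size S/4
down = F.avg_pool2d(x, kernel_size=2, stride=2)     # shrink by 2 -> 28x28 (S/8)
up = F.interpolate(x, scale_factor=2, mode='bilinear',
                   align_corners=False)             # enlarge by 2 -> 112x112 (S/2)
print(down.shape, up.shape)
```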

Summary of the second trick: extract the contour information of the objects in the image by taking differences. A difference is computed by using up- and down-sampling to obtain feature maps of the same size and then subtracting them directly. The reason for computing so many differences, $L_1$, $L_2$, $L_3$, and $L_4$, is to obtain differences at several scales and thereby estimate the image's depth better (a single scale would also work, but the boundary information in the final result would be very blurry, as the paper's comparison figure shows).

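To make the whole chain of differences concrete, here is a minimal PyTorch sketch following the description above. It is only an illustration, not the paper's implementation: it assumes every A-branch map has the same channel count C (an arbitrary 32 here) so the maps can be subtracted directly, and it uses the identity in place of the convolution that turns $S_{16A}$ into $S_{16B}$.

```python
import torch
import torch.nn.functional as F

def up2(x):
    # double the spatial size of a feature map
    return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

# A-branch: feature maps taken out of the backbone at each scale (dummy tensors here)
C = 32
s16a = torch.randn(1, C, 14, 14)    # S/16
s8a  = torch.randn(1, C, 28, 28)    # S/8
s4a  = torch.randn(1, C, 56, 56)    # S/4
s2a  = torch.randn(1, C, 112, 112)  # S/2
s1a  = torch.randn(1, C, 224, 224)  # S

# B-branch: start from the (convolved) S/16 map and keep upsampling by 2
s16b = s16a                         # identity stands in for "convolve S16A, size unchanged"
s8b = up2(s16b)
s4b = up2(s8b)
s2b = up2(s4b)
s1b = up2(s2b)

# the differences at each scale
L4, L3, L2, L1 = s8a - s8b, s4a - s4b, s2a - s2b, s1a - s1b
print(L4.shape, L3.shape, L2.shape, L1.shape)
```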

The third trick: ASPP

Before the trick: SPP

Purpose: feature diversity (width).

When extracting features, one way to obtain more of them is to extend the "length", i.e. the number of layers in the network. For example, in the first trick we convolve the input color image many times and take out four of the intermediate feature maps for later computation; that achieves feature diversity in the length direction.

In fact, there is another way to obtain more features from the image: expanding the "width".

For example, when describing a person we might say she is tall, thin, has long hair, and fair skin. These are all descriptions of her appearance, which feels a bit one-sided. We can also describe her from other angles, such as her personality or how she treats people; through this "diverse" description she becomes more three-dimensional.

In the same way, this paper also obtains feature diversity by expanding the "width", which is achieved through different pooling operations.

A conventional pooling operation (e.g. max pooling) specifies a window size (such as 2X2) and a sliding stride (such as 2), slides the window over the feature map, selects the maximum value within each window, and passes the result on to the next layer.

SPP works the other way around. Conventional pooling fixes the window size and the stride first, so the size of the resulting feature map depends on which window and stride are used.

SPP, on the contrary, fixes the size of the feature map that the pooling operation must produce. So in SPP neither the window size nor the stride is specified in advance; its focus is the final pooled result.

For example, the input feature maps might be 100X100, 120X120, or 125X125; to SPP it doesn't matter. SPP only requires that these differently sized feature maps all come out at a fixed size after pooling, such as 60X60, 40X40, or 30X30 (pooling to several different output sizes is itself another manifestation of feature diversity).
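In PyTorch this "only the output size is fixed" behavior is exactly what adaptive pooling does, so a minimal demonstration looks like this (the 256 channels and the 30X30 target are arbitrary choices for the example):

```python
import torch
import torch.nn as nn

# SPP-style pooling: fix the OUTPUT size instead of the window size and stride
pool = nn.AdaptiveMaxPool2d(output_size=(30, 30))

for s in (100, 120, 125):                            # feature maps of different sizes
    x = torch.randn(1, 256, s, s)
    print(tuple(x.shape[-2:]), '->', tuple(pool(x).shape[-2:]))   # always (30, 30)
```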

Because SPP does not care about the size of the input feature map and only cares about the size of the pooled output, it places no restriction on the size of the initial input image. By contrast, AlexNet requires 227X227 input images, and VGG and GoogLeNet require 224X224. These networks restrict the input size because their last few layers are fully connected, and the fully connected layers need inputs of a fixed size.

To meet a model's requirement on input size, a resize (reshape) is usually performed as the first step when the image is fed in, i.e. the image size is changed manually. For example, an image whose original size is 100X200 can be converted to 224X224 by the resize operation: an input that started out rectangular is forced into a square to meet the model's needs.

Although this meets the model's needs and changes the image size, it inevitably affects the image itself, for example by losing some features. Adapting a small image to a larger size is relatively harmless, since it mostly means filling in some gray values; but if a large image has to fit a smaller input size, its pixels are inevitably compressed after resizing, and a lot of the original image's information is naturally lost.
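For reference, a resize of this kind is typically just an interpolation; here is a minimal sketch using tensors and bilinear interpolation so the example stays self-contained (image libraries provide equivalent resize functions):

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 100, 200)   # a rectangular image, H=100, W=200
square = F.interpolate(img, size=(224, 224), mode='bilinear', align_corners=False)
print(square.shape)                 # torch.Size([1, 3, 224, 224]) -- forced into a square
```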

SPP does not care about the input image size; it only focuses on the result. If the required final pooled size is 20X20 or 30X30, then no matter what size the input image is, after the series of convolutions and poolings the result is 20X20 or 30X30.

Summary: SPP is a pooling scheme that focuses only on the result. Advantage 1: several different poolings can be performed, which yields richer features, i.e. feature diversity (width). Advantage 2: there is no requirement on the input image size, so the input does not need to be resized and no feature information is lost to resizing.

Before the trick: dilated (atrous) convolution

Ordinary convolution vs. dilated convolution (figures omitted).

Dilated (atrous) convolution:
Bronze: convolution that skips across the grid, leaving holes between the sampled positions.
Silver: dilated convolution is also called atrous convolution or expanded convolution. It was originally proposed to solve problems in image segmentation: common segmentation algorithms use convolutional and pooling layers to enlarge the receptive field, which also shrinks the feature map, so upsampling is then needed to restore the size. For example, the difference computation in the second trick uses upsampling, and one of its roles there is to make the feature maps the same size so that they can be subtracted.

However, upsampling only stretches the feature map back up, and image information is inevitably lost along the way. So a solution is needed that enlarges the receptive field while keeping the feature map size unchanged, thereby replacing the down- and up-sampling operations. Dilated convolution was proposed to solve exactly this problem.

What is the receptive field?
When you look at a mural (a mural because it is big enough), you cannot take in the whole painting at a single glance; you scan it bit by bit, from top to bottom or starting from the region you care about. The area your eyes can take in at once (the portion of the mural they can hold) is the receptive field. In a neural network, by analogy, the receptive field is the region of the feature map covered by the window during a convolution or pooling. For example, a 3X3 convolution has a 3X3 receptive field, and a 2X2 pooling has a 2X2 receptive field.
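The receptive field of a stack of layers can be worked out with the standard recursion (this is general CNN arithmetic, not something specific to this paper); a small helper makes the numbers above easy to check:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, applied in order.
    Returns the receptive field (in input pixels) of one output unit."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = (k - 1) * d + 1          # effective kernel size of a dilated convolution
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1, 1)]))               # 3  -> a 3X3 conv sees 3x3 pixels
print(receptive_field([(2, 2, 1)]))               # 2  -> a 2X2 pooling sees 2x2 pixels
print(receptive_field([(3, 1, 1), (3, 1, 1)]))    # 5  -> two stacked 3X3 convs see 5x5
print(receptive_field([(3, 1, 2)]))               # 5  -> one 3X3 conv with dilation 2
```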

As the figures above illustrate, dilated convolution enlarges the area of the feature map covered by the kernel without changing the number of kernel weights. It introduces a hyperparameter called the dilation rate (the number of holes), which is the spacing between the kernel's sampling positions during the convolution. With a dilation rate of 1, it reduces to the ordinary convolution we have always used.
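In PyTorch the dilation rate is just the `dilation` argument of `nn.Conv2d`; the sketch below (arbitrary channel counts) shows that with matching padding the feature map size stays the same while the covered area grows:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# ordinary 3x3 convolution: dilation 1, covers a 3x3 area
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
# dilated 3x3 convolution: dilation 2, covers a 5x5 area, still only 3x3 = 9 weights
aconv = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, aconv(x).shape)   # both keep the 32x32 spatial size
```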

Summary: with dilated convolution, different features can be obtained by setting different dilation rates; with SPP, different features can likewise be obtained by setting different final pooled sizes. Combining dilated convolution and SPP gives ASPP (atrous spatial pyramid pooling). In other words, ASPP obtains feature diversity by expanding the "width".
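A minimal sketch of an ASPP-style module, assuming parallel 3X3 convolutions with different dilation rates whose outputs are concatenated and fused; the rates (1, 6, 12, 18) and channel counts are illustrative defaults, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel dilated convolutions at several rates, concatenated and fused."""
    def __init__(self, c_in, c_out, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
            for r in rates
        ])
        self.fuse = nn.Conv2d(c_out * len(rates), c_out, 1)   # 1x1 conv to merge the branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = SimpleASPP(256, 64)(torch.randn(1, 256, 14, 14))
print(y.shape)   # torch.Size([1, 64, 14, 14]) -- same spatial size, fused multi-rate features
```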


Origin blog.csdn.net/qq_41769706/article/details/128328192