AdaBins: Depth Estimation using Adaptive Bins

Summary

The core idea of the article is to perform a global statistical analysis of the output of a conventional encoder-decoder architecture and to refine that output with a learned post-processing block operating at the highest resolution. This processing block is built around a Transformer, and by combining it with a CNN the authors propose the AdaBins model, which shows clear advantages in monocular depth estimation.

Innovation

The paper addresses the problem of estimating a high-quality dense depth map from a single RGB image, and proposes a Transformer-based architectural module, AdaBins, that performs global processing of the scene information.

Performance

At the time of publication, this method ranked first on the KITTI benchmark and was also the SOTA model on NYU-Depth-v2, demonstrating strong performance and making it well worth studying.

Method

Four Design Options

When designing the network architecture, the authors weigh four design choices.
(1) Use an adaptive binning strategy to discretize the depth interval $D = (d_{min}, d_{max})$ into N bins. The interval itself is fixed for a given dataset, determined by the dataset specification or set manually to a reasonable range. The final scheme is compared with three other possible designs, as shown in Figure 1 (a code sketch contrasting these strategies follows this list):
a. Fixed bins with uniform width: the depth interval D is divided into N bins of equal width.

b. Fixed bins with widths on a logarithmic scale: the depth interval D is divided into bins of equal size in log-depth space.

c. Trained bin widths: the bin widths are learned from a specific dataset, but all images share the same subdivision of the depth interval D.

d. The final scheme, AdaBins: the bin-width vector b is computed adaptively for each image.
Figure 1: Comparison of the four binning strategies.

(2) Discretizing the depth interval D into bins and assigning each pixel to a single bin would introduce depth discretization artifacts. Instead, the final depth prediction is a linear combination of the bin centers, which lets the model estimate smoothly varying depth values.
(3) The authors found that applying attention to tensors at higher spatial resolution gives better results. The order of the network modules is therefore arranged as encoder + decoder + attention, whereas most previous methods place the attention mechanism in the middle of the network (at the bottleneck).

(4) The proposed method combines a baseline encoder-decoder convolutional network with a Transformer-based building block. The network uses a pre-trained EfficientNet-B5 as the encoder backbone, and the decoder performs standard feature upsampling. The core contribution of the paper is the proposed adaptive bin-width estimation block, AdaBins. The input to this module is the decoded feature tensor $X_d \in \mathbb{R}^{H \times W \times C_d}$; after processing by the AdaBins module, an $H \times W \times 1$ tensor (the depth map) is produced.
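To make design choice (1) concrete, here is a minimal PyTorch sketch (not the authors' code) contrasting the fixed binning baselines (a) and (b) with the adaptive scheme (d): in (a) and (b) the bin edges are computed once for the whole dataset, while in (d) a per-image bin-width vector b is predicted by the network (a random placeholder stands in for it below).

```python
import math
import torch

d_min, d_max, N = 1e-3, 10.0, 256   # assumed depth range (e.g. NYU) and bin count

# (a) fixed bins with uniform width
uniform_edges = torch.linspace(d_min, d_max, N + 1)

# (b) fixed bins with uniform width in log-depth space
log_edges = torch.logspace(math.log10(d_min), math.log10(d_max), N + 1)

# (d) AdaBins: a per-image bin-width vector b (non-negative, summing to 1)
# is predicted by the network; a random placeholder stands in for it here.
# (c) would look the same, except that one learned b is shared by all images.
b = torch.softmax(torch.randn(N), dim=0)
adaptive_edges = d_min + (d_max - d_min) * torch.cat(
    [torch.zeros(1), torch.cumsum(b, dim=0)])
```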

Four Sub-modules

The four sub-modules of AdaBins are: mini-ViT, Bin-widths, Range attention maps, and Hybrid regression.
mini-ViT: This module is a simplified version of ViT, scaled down to fit smaller datasets; its role is to use global attention to compute a bin-width vector for each input image. The structure of the module is shown in Figure 2. mini-ViT produces two outputs: 1) the bin-widths vector b, which defines how the depth interval is partitioned for this input image; 2) range attention maps of size H×W×C, which contain information useful for pixel-level depth computation.
Figure 2: Structure of the mini-ViT module.

Bin-widths: The Transformer requires a fixed-size sequence of vectors as input, while the available input is the decoded feature tensor $X_d \in \mathbb{R}^{H \times W \times C_d}$. The input is therefore passed through a convolution with kernel size $p \times p$, stride $p$, and $E$ output channels, producing a tensor of size $h/p \times w/p \times E$ (with $h \times w$ the spatial size of the decoded features). This result is reshaped into a tensor $X_p \in \mathbb{R}^{S \times E}$, where $S = hw/p^2$ serves as the effective sequence length of the Transformer and the $E$-dimensional vectors are used as patch embeddings. The Transformer output is $X_o \in \mathbb{R}^{S \times E}$; an MLP head is applied to the first output embedding and its output $b'$ is normalized to obtain the bin-widths vector $b$, as follows:
$$b_i = \frac{b_i' + \varepsilon}{\sum_{j=1}^{N}\left(b_j' + \varepsilon\right)}$$
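A minimal sketch of this bin-widths branch, under assumed hyperparameters ($C_d = E = 128$, $p = 16$, 4 Transformer layers and heads, positional embeddings omitted for brevity); it illustrates the mechanism rather than the released implementation:

```python
import torch
import torch.nn as nn

class BinWidthHead(nn.Module):
    """Patch-embed the decoded features, run a small Transformer encoder,
    and turn the first output embedding into a normalized bin-width vector."""
    def __init__(self, c_d=128, E=128, p=16, n_bins=256, eps=1e-3):
        super().__init__()
        self.patch_embed = nn.Conv2d(c_d, E, kernel_size=p, stride=p)
        layer = nn.TransformerEncoderLayer(d_model=E, nhead=4,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.mlp_head = nn.Sequential(nn.Linear(E, 256), nn.LeakyReLU(),
                                      nn.Linear(256, n_bins), nn.ReLU())
        self.eps = eps  # the small epsilon in the normalization formula above

    def forward(self, x_d):                    # x_d: (B, C_d, h, w)
        x_p = self.patch_embed(x_d)            # (B, E, h/p, w/p)
        x_p = x_p.flatten(2).transpose(1, 2)   # (B, S, E), S = hw / p^2
        x_o = self.transformer(x_p)            # (B, S, E)
        b_raw = self.mlp_head(x_o[:, 0, :])    # MLP head on the 1st embedding
        b = (b_raw + self.eps) / (b_raw + self.eps).sum(dim=1, keepdim=True)
        return b, x_o                          # bin widths b and embeddings X_o
```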

Range attention maps: The decoded features carry high-resolution, local pixel-level information, while the Transformer output embeddings contain more global information. Output embeddings 2 through C+1 of the Transformer are used as a set of C $1 \times 1$ convolutional kernels and convolved with the decoded features (after a $3 \times 3$ convolutional layer) to obtain the range attention maps R. In this way, the final depth prediction combines the global information carried by R and b with the local pixel-level features.
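A sketch of this step, using the same assumed shapes as above ($E$-dimensional embeddings, C = 128 attention-map channels); the "convolution with 1×1 kernels" reduces to a per-pixel dot product between each embedding and the projected feature vector:

```python
import torch
import torch.nn as nn

class RangeAttentionMaps(nn.Module):
    """Use Transformer output embeddings 2..C+1 as C kernels of size 1x1xE
    and convolve them with the (3x3-convolved) decoded features."""
    def __init__(self, c_d=128, E=128):
        super().__init__()
        self.proj = nn.Conv2d(c_d, E, kernel_size=3, padding=1)

    def forward(self, x_d, x_o, C=128):
        # x_d: (B, C_d, h, w) decoded features; x_o: (B, S, E) embeddings,
        # with the sequence length S assumed to be at least C + 1
        feat = self.proj(x_d)                         # (B, E, h, w)
        kernels = x_o[:, 1:1 + C, :]                  # embeddings 2..C+1
        # equivalent to convolving feat with C kernels of shape 1x1xE
        R = torch.einsum('bce,behw->bchw', kernels, feat)
        return R                                      # (B, C, h, w)
```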

Hybrid regression: The range attention maps R are passed through a $1 \times 1$ convolutional layer to obtain N channels, followed by a softmax activation. The depth-bin centers $c(b)$ are computed from the bin-width vector b as follows:
$$c(b_i) = d_{min} + (d_{max} - d_{min})\left(\frac{b_i}{2} + \sum_{j=1}^{i-1} b_j\right)$$
Finally, the depth value $\tilde d$ of each pixel is the linear combination of that pixel's softmax scores and the depth-bin centers:
$$\tilde d = \sum_{k=1}^{N} c(b_k)\, p_k$$
where $p_k$ is the softmax score for the k-th of the N bins.
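A minimal sketch of the hybrid regression step, assuming the 1×1 convolution producing the N-channel logits has already been applied to R:

```python
import torch

def hybrid_regression(logits, b, d_min=1e-3, d_max=10.0):
    """logits: (B, N, h, w) from a 1x1 conv on R; b: (B, N) bin widths
    summing to 1; d_min/d_max are the dataset's assumed depth range."""
    B, N = b.shape
    p = torch.softmax(logits, dim=1)                       # per-pixel scores p_k
    # bin centers: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)
    left_edges = torch.cumsum(b, dim=1) - b                # sum_{j<i} b_j
    centers = d_min + (d_max - d_min) * (left_edges + b / 2)   # (B, N)
    # depth = softmax-weighted linear combination of the bin centers
    depth = (p * centers.view(B, N, 1, 1)).sum(dim=1, keepdim=True)
    return depth                                           # (B, 1, h, w)
```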

Loss Function

The loss function consists of two parts: a pixel-level depth loss and a bin-center density loss.

The pixel-level depth loss is a scaled version of the scale-invariant log (SILog) loss, defined as follows:
$$L_{pixel} = \alpha \sqrt{\frac{1}{T}\sum_i g_i^2 - \frac{\lambda}{T^2}\Big(\sum_i g_i\Big)^2}$$
where:
$$g_i = \log \tilde d_i - \log d_i$$
and T is the number of pixels with valid ground-truth depth.
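A sketch of this pixel-level loss; λ = 0.85 and α = 10 are the values reported in the paper, and `mask` selects the T pixels with valid ground truth:

```python
import torch

def silog_loss(pred, target, mask, lam=0.85, alpha=10.0):
    """Scale-invariant log loss over the T valid pixels selected by mask."""
    g = torch.log(pred[mask]) - torch.log(target[mask])    # g_i
    T = g.numel()
    return alpha * torch.sqrt((g ** 2).sum() / T
                              - lam * (g.sum() / T) ** 2)
```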
The bin-center density loss encourages the predicted bin centers to follow the distribution of the actual depth values in the image. It is defined using the bidirectional Chamfer distance as follows:
$$L_{bins} = \text{chamfer}(X, c(\mathbf{b})) + \text{chamfer}(c(\mathbf{b}), X)$$
where X denotes the set of ground-truth depth values in the image.
Finally, the total loss is defined as follows:
$$L_{total} = L_{pixel} + \beta L_{bins}$$
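A sketch of the bin-center density loss and the total loss for a single image; the Chamfer distance is written here with squared distances, and β = 0.1 is the weight reported in the paper:

```python
import torch

def bins_chamfer_loss(depth_values, centers):
    """Bidirectional Chamfer distance between the set X of ground-truth depth
    values of one image (1-D tensor) and its bin centers c(b) (1-D tensor)."""
    d = torch.cdist(depth_values.unsqueeze(-1), centers.unsqueeze(-1)) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# total loss: L_total = L_pixel + beta * L_bins, with L_pixel from the SILog
# sketch above and beta = 0.1
# l_total = silog_loss(pred, target, mask) + 0.1 * bins_chamfer_loss(X, centers)
```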

Experiment

The authors conducted experiments on the two large datasets KITTI and NYU-Depth-v2, achieving SOTA performance on both.

Conclusion

The article studies monocular depth estimation by combining a CNN with a Transformer and proposes a new module called AdaBins, which achieves excellent performance on two major public datasets. The main contribution is to exploit the Transformer's strength at processing global information together with the CNN's strength at local feature processing, designing the network so that the two complement each other.
