EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation

Reference code: None

Introduction

In the SurroundDepth algorithm, self-attention is used for multi-view feature aggregation. One drawback is that its computation and memory overhead are relatively large: every view has to search for useful information across all views, which also slows down network convergence. Based on the imaging geometry of a multi-camera rig, this article instead performs the attention operation for the current view only against its left and right neighbor views (and only the parts of those views that overlap with the current view), which greatly reduces compute and GPU-memory overhead (a sketch of the idea follows the figure below). With the saved budget, one can try raising the input resolution or adding multi-frame (temporal) inputs (though the gain from the latter is not obvious in the results) to improve self-supervised depth estimation. The figure below compares the article's method with FSM and SurroundDepth in terms of performance and computation:

[Figure: performance vs. computation compared with FSM and SurroundDepth]
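To make the idea concrete, here is a minimal sketch of the guided-attention idea (not the official EGA-Depth code; class, argument, and tensor names are my own assumptions): the current view builds the query and attends only to its left/right neighbor views, instead of all N views attending to each other.

```python
import torch
import torch.nn as nn

class GuidedCrossAttention(nn.Module):
    """Hypothetical sketch: query from the current view, key/value from neighbors."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cur_feat, left_feat, right_feat):
        # all inputs: (B, C, H, W) feature maps from the depth encoder
        B, C, H, W = cur_feat.shape
        q = cur_feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
        kv = torch.cat([left_feat, right_feat], dim=-1)  # stitch neighbors: (B, C, H, 2W)
        kv = kv.flatten(2).transpose(1, 2)               # (B, 2*H*W, C)
        out, _ = self.attn(q, kv, kv)                    # queries see neighbors only
        return out.transpose(1, 2).reshape(B, C, H, W)
```

Compared with joint cross-view self-attention, where all N·H·W tokens attend to all N·H·W tokens, each view's H·W queries here see only 2·H·W keys, i.e. roughly N/2× fewer attention pairs even before the overlap-region crop described later.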

It can be seen that this attention operation still has room for improvement. In this article, only the left and right views relevant to the current view participate in the attention computation; such an operation could be further replaced by deformable attention, which could save even more computation and improve performance, similar to the effect of Deformable DETR. One caveat is camera exposure and synchronization timing: if the cameras are not synchronized, additional alignment operations and modifications are needed.
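As a rough illustration of that suggestion (purely hypothetical, not part of the paper; all names are made up), a deformable variant would let each query sample a handful of learned offset locations on the neighbor view instead of attending densely:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableNeighborAttention(nn.Module):
    """Single-head sketch in the spirit of Deformable DETR, applied to one neighbor."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset = nn.Linear(dim, 2 * n_points)  # (dx, dy) per sample point
        self.weight = nn.Linear(dim, n_points)      # aggregation weight per point
        self.value = nn.Linear(dim, dim)

    def forward(self, q, ref_xy, neighbor_feat):
        # q: (B, L, C) queries from the current view
        # ref_xy: (B, L, 2) reference points in [-1, 1] on the neighbor view
        # neighbor_feat: (B, C, H, W) neighbor feature map
        B, L, C = q.shape
        off = self.offset(q).view(B, L, self.n_points, 2)
        w = self.weight(q).softmax(-1)                      # (B, L, P)
        grid = (ref_xy.unsqueeze(2) + off).clamp(-1, 1)     # (B, L, P, 2)
        v = F.grid_sample(neighbor_feat, grid, align_corners=False)  # (B, C, L, P)
        v = self.value(v.permute(0, 2, 3, 1))               # (B, L, P, C)
        return (w.unsqueeze(-1) * v).sum(2)                 # (B, L, C)
```

Because the sampling locations are learned, the cost per query drops from the size of the key set to a constant `n_points`; the synchronization caveat above still applies, since the reference points assume both views were captured at the same instant.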

Method Design

In fact, the method of this article is roughly the same as SurroundDepth; the difference is that the self-attention operation is replaced with the efficient guided attention shown in the figure below:
[Figure: structure of the efficient guided attention module]

In the figure above, the current view is used to construct the query, while the surrounding views (from the current moment, or from other moments when temporal features are used) serve as the key and value; the current view's representation is then refined by relating it to its neighbors. As a prior, roughly $\frac{1}{3}$ of each neighboring view's image area can be selected as the effective (overlapping) region, which greatly reduces computation and memory overhead (a sketch of this cropping follows the table below). The benefit is that the resolution of the feature maps involved in the attention can then be increased to improve depth estimation performance. The following table shows the impact of feature-map resolution on depth estimation performance:
[Table: effect of feature-map resolution on depth estimation performance]
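A hedged sketch of the prior-region trick mentioned above, assuming the overlap sits at the inner edges of the neighbor views (the exact overlap region depends on the camera rig; this placement is my assumption):

```python
import torch

def crop_overlap(left_feat, right_feat, frac=1.0 / 3.0):
    """Keep only the (assumed) overlapping strip of each neighbor as key/value:
    the right third of the left neighbor and the left third of the right one."""
    W = left_feat.shape[-1]
    k = max(1, int(W * frac))
    left_kv = left_feat[..., W - k:]   # (B, C, H, ~W/3)
    right_kv = right_feat[..., :k]     # (B, C, H, ~W/3)
    return left_kv, right_kv

# e.g.: left_kv, right_kv = crop_overlap(torch.rand(1, 64, 48, 96), torch.rand(1, 64, 48, 96))
```

Combined with attending to only two neighbors, this cuts the number of keys per query from N·H·W to about 2·H·W/3; for a six-camera rig that is roughly a 9x reduction in attention cost, which is exactly the budget the article spends on higher-resolution feature maps.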

While reducing computation and memory overhead, it also becomes feasible to integrate features from multiple frames into the depth estimation at the current moment. The impact of introducing different temporal features on depth estimation performance is shown in the following table:
[Table: effect of introducing different temporal features on depth estimation performance]

It can be seen that adding temporal features as input brings only a small improvement for the method of this article, while the performance of SurroundDepth actually declines. This is presumably because self-attention cannot pick out the useful information from such a large pool, and the extra input simply enlarges the search space.
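A sketch of how temporal inputs could be folded in under this design (the interface is my assumption, not the paper's code): previous-frame neighbor features are simply appended to the key/value token set, so the current-frame query also attends across time.

```python
import torch

def build_kv(neighbor_feats_t, neighbor_feats_prev=None):
    # neighbor_feats_*: lists of (B, C, H, W') cropped neighbor feature maps
    feats = list(neighbor_feats_t)
    if neighbor_feats_prev is not None:
        feats += list(neighbor_feats_prev)                  # extra temporal keys/values
    tokens = [f.flatten(2).transpose(1, 2) for f in feats]  # each (B, H*W', C)
    return torch.cat(tokens, dim=1)                         # (B, total_tokens, C)
```

The key/value set grows only linearly with the number of extra frames, which is why temporal input stays affordable after the overlap cropping above.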

Experimental Results

Performance comparison on nuScenes:
[Table: performance comparison on nuScenes]

Performance comparison on DDAD:
[Table: performance comparison on DDAD]

Source: blog.csdn.net/m_buddy/article/details/132013996