Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection - Paper Notes

Reference code: SOLOFusion

1. Overview

Introduction: Driving is an inherently time-varying process, so a perception system for this setting should naturally bring in the time dimension. The article points out that existing temporal BEV perception algorithms use temporal information at a coarse feature granularity (small feature maps) and over a short horizon (few frames). Sensitivity to temporal information also varies with distance: the farther away a region is, the more frames are needed, while the limited compute of the actual machine must still be respected. To address this, the article proposes temporal fusion at both high resolution (a cost volume built in image feature space) and low resolution (fusion in BEV feature space), with the two complementing each other. Note that at high resolution the cost volume is built in the manner of MVSNet, while at low resolution the features must be aligned across frames, which requires knowing the corresponding inter-frame poses.

When estimating depth from images, we naturally want the displacement of corresponding pixels across views to be large enough that they can be clearly distinguished and matched. For temporal data, this means a longer video span is preferable, since it produces larger differences between the images, as shown in the figure below. This improves what the article calls the localization potential, a measure of how easy or hard depth estimation from multiple views is at a given point.
[Figure: a longer temporal span produces larger pixel displacement between frames, improving localization potential]

The impact of the amount of temporal data:
Let a point $a$ in frame $t-1$ be mapped to its corresponding point $b$ in frame $t$ through the camera intrinsics/extrinsics and the pose transformation $[R|t]$ between the two frames; this mapping can be written out directly. What the article focuses on, however, is the rate of change of the projected location of $b$ with respect to the depth of $a$, which it calls the localization potential:
$$Localization\ Potential = \left|\frac{\partial x_b}{\partial d_a}\right| = \frac{f\,\bar{t}\,\cos(\alpha)\,\left|\sin\!\big(\alpha-(\theta+\beta)\big)\right|}{\big(d_a\cos(\alpha-\theta)+t_z\cos(\alpha)\big)^2}$$
From this formula, keeping the localization potential of distant points (large $d_a$ in the denominator) high naturally requires a larger numerator, i.e. a larger translation $\bar{t}$ accumulated over more frames. The figure below likewise shows that depth perception at long range needs more frames of data:
[Figure: more frames of data are needed for depth perception at longer distances]
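To make the trend concrete, here is a minimal numerical sketch of the formula above (my own, in NumPy; the focal length, angles, and translation values are made-up illustrative numbers): it evaluates the localization potential for a near and a far point as the accumulated translation grows.

```python
import numpy as np

def localization_potential(f, t_bar, t_z, d_a, alpha, theta, beta):
    """Rate of change of the projected location x_b with respect to the depth d_a
    of point a, following the formula above. Angles are in radians."""
    num = f * t_bar * np.cos(alpha) * np.abs(np.sin(alpha - (theta + beta)))
    den = (d_a * np.cos(alpha - theta) + t_z * np.cos(alpha)) ** 2
    return num / den

# Made-up illustrative values for focal length and viewing/translation angles.
f, alpha, theta, beta = 800.0, 0.3, 0.1, 0.05
for d_a in (10.0, 60.0):                  # a near point and a far point (metres)
    for t_bar in (1.0, 4.0, 16.0):        # translation accumulated over more frames
        # Assume most of the accumulated translation is along the camera axis (t_z).
        lp = localization_potential(f, t_bar, t_z=0.9 * t_bar, d_a=d_a,
                                    alpha=alpha, theta=theta, beta=beta)
        print(f"d_a={d_a:5.1f} m  t_bar={t_bar:5.1f} m  potential={lp:.3f} px/m")
```

With these numbers, the far point has a much smaller localization potential than the near one, and only a large accumulated translation (i.e. a long temporal window) brings it back up, which is the article's argument for using more frames at long range.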

The impact of feature map resolution:
Besides the positive effect of more temporal data discussed above, the size of the feature map is itself an important factor. Intuitively, the higher the resolution, the more information about distant regions can be represented, so less temporal data is needed as the resolution increases. A similar conclusion can be drawn from the analysis figure given in the article:
[Figure: analysis of feature map resolution versus the amount of temporal data]
Note, however, that feature map resolution and the amount of temporal data are complementary, and the trade-off has to be chosen according to the available compute budget. The corresponding experiment also shows that a better depth representation (obtained by using more frames in the figure below) benefits target detection:
[Figure: better depth estimates, obtained by using more frames, improve detection]

Comparison with other temporal fusion methods:
The article compares its temporal feature fusion strategy with other methods along the dimensions of sampling scheme, sampling resolution, and feature association method, summarised in the table below (note that many of the methods require pose-based alignment):
[Table: comparison of temporal fusion strategies by sampling scheme, sampling resolution, and feature association method]

2. Method design

2.1 Overall pipeline

The overall pipeline of the method is shown in the figure below:
[Figure: overall pipeline of the method]
From the figure above, the fusion of temporal features can be divided into two parts:

  • 1) High-resolution short-term fusion: an MVSNet-like module is applied to the backbone feature output to produce a more accurate depth estimate for the current reference view.
  • 2) Low-resolution long-term fusion: the (low-resolution) BEV features of past frames are warped with the pose information and fused into a cost volume, yielding the BEV feature of the reference frame. A structural sketch of how the two streams fit together follows this list.
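As a structural sketch only (my own paraphrase in PyTorch style, not the official SOLOFusion code; every sub-module here is a caller-supplied placeholder), the two streams could be wired together roughly as follows:

```python
import torch
import torch.nn as nn

class TwoStreamTemporalFusion(nn.Module):
    """Schematic of the two fusion paths described above; all sub-modules are
    caller-supplied placeholders, not the official implementation."""

    def __init__(self, img_backbone, short_term_stereo, view_transform,
                 align_bev, bev_encoder, det_head):
        super().__init__()
        self.img_backbone = img_backbone            # per-frame image feature extractor
        self.short_term_stereo = short_term_stereo  # MVSNet-style matching with the adjacent frame
        self.view_transform = view_transform        # lift image features + depth into BEV
        self.align_bev = align_bev                  # warp a past BEV map into the current ego frame
        self.bev_encoder = bev_encoder
        self.det_head = det_head

    def forward(self, cur_imgs, prev_imgs, cam_params, bev_history, rel_poses):
        feat_cur = self.img_backbone(cur_imgs)
        feat_prev = self.img_backbone(prev_imgs)
        # 1) High-resolution, short-term fusion: stereo-style matching against the
        #    neighbouring frame sharpens the per-pixel depth distribution.
        depth = self.short_term_stereo(feat_cur, feat_prev, cam_params)
        bev_cur = self.view_transform(feat_cur, depth, cam_params)
        # 2) Low-resolution, long-term fusion: previously computed BEV features are
        #    pose-aligned to the current frame and fused with the current BEV map.
        aligned = [self.align_bev(bev, pose) for bev, pose in zip(bev_history, rel_poses)]
        fused = self.bev_encoder(torch.cat([bev_cur, *aligned], dim=1))
        return self.det_head(fused)
```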

2.2 Temporal Fusion at High and Low Resolution

Temporal fusion at high resolution:
This part feeds the stride-4 feature map from the backbone into an MVSNet-style module to obtain a depth estimate at high resolution; correspondingly, the number of frames used here is reduced. In addition, the depth bins are not uniformly spaced but use Gaussian-Spaced Top-k sampling, which yields better performance. The table below compares it with several other sampling schemes:
[Table: comparison of depth bin sampling schemes]
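The notes do not spell out the sampling procedure; one plausible way to realise "Gaussian-spaced" depth candidates (an illustrative interpretation of mine, not necessarily the paper's exact scheme) is to place bins at equal-probability quantiles of a Gaussian centred on a coarse depth guess, so that candidates are densest where the depth is most likely:

```python
import torch

def gaussian_spaced_bins(mu, sigma, k, d_min=2.0, d_max=58.0):
    """Place k depth candidates at equal-probability quantiles of N(mu, sigma^2),
    clamped to the usable depth range. mu/sigma are per-pixel or per-image tensors.
    Illustrative interpretation only, not the official implementation."""
    # Mid-bin probabilities 1/(2k), 3/(2k), ... avoid the +-inf quantiles.
    probs = (torch.arange(k, dtype=torch.float32) + 0.5) / k
    # Inverse CDF of the standard normal via erfinv, then shift/scale by mu, sigma.
    z = torch.erfinv(2.0 * probs - 1.0) * (2.0 ** 0.5)
    bins = mu.unsqueeze(-1) + sigma.unsqueeze(-1) * z        # (..., k)
    return bins.clamp(d_min, d_max)

# Example: candidates concentrate around the 20 m guess instead of being uniform.
mu = torch.tensor([20.0])
sigma = torch.tensor([5.0])
print(gaussian_spaced_bins(mu, sigma, k=8))
```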

Temporal fusion at low resolution:
This fusion operates mainly on BEV features, so the feature map resolution is low. According to the earlier analysis, more video frames are needed as input to compensate for the weaker representation of distant regions. When building the multi-frame cost volume, the pose information is used to warp the source-frame features into the current reference frame before they are fused. The table below shows the impact of the number of frames on performance:
[Table: impact of the number of frames on performance]
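For the pose-based alignment step, a common way to warp a past BEV feature map into the current ego frame is a 2D rigid transform applied with grid_sample. The sketch below is my own (the axis and sign conventions and the metric half-extent bev_range are assumptions that depend on the BEV layout), not the paper's code:

```python
import torch
import torch.nn.functional as F

def warp_bev_to_current(bev_prev, yaw, tx, ty, bev_range=51.2):
    """Warp a past BEV feature map (B, C, H, W) into the current ego frame given the
    relative ego motion (yaw in radians, translation tx/ty in metres).
    Illustrative sketch: sign/axis conventions depend on the BEV layout."""
    B, C, H, W = bev_prev.shape
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    # 2x3 affine matrices mapping current-frame grid coordinates to previous-frame ones.
    # Translation is normalised by the metric half-extent of the BEV map.
    theta = torch.zeros(B, 2, 3, device=bev_prev.device)
    theta[:, 0, 0], theta[:, 0, 1], theta[:, 0, 2] = cos, -sin, tx / bev_range
    theta[:, 1, 0], theta[:, 1, 1], theta[:, 1, 2] = sin,  cos, ty / bev_range
    grid = F.affine_grid(theta, size=(B, C, H, W), align_corners=False)
    return F.grid_sample(bev_prev, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=False)

# Example: align a 1-channel 128x128 BEV map from the previous frame.
prev_bev = torch.randn(1, 1, 128, 128)
aligned = warp_bev_to_current(prev_bev, yaw=torch.tensor([0.05]),
                              tx=torch.tensor([1.5]), ty=torch.tensor([0.0]))
print(aligned.shape)  # torch.Size([1, 1, 128, 128])
```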

The impact of the high- and low-resolution fusion strategies on overall network performance:
[Table: ablation of the high- and low-resolution fusion strategies]

3. Experimental results

Performance comparison on NuScenes val:
[Table: performance comparison on NuScenes val]

Performance comparison on NuScenes test:
[Table: performance comparison on NuScenes test]

Comparison with BEVDepth (better performance at smaller image resolutions):
[Image: comparison with BEVDepth at different image resolutions]

Comparison with approaches using fewer frames (more frames lead to better performance):
[Image: effect of the number of frames on performance compared with fewer-frame approaches]


Origin: blog.csdn.net/m_buddy/article/details/128766421