BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Reference code: BEVSimDet

1. Overview

Introduction: When a model is actually deployed, factors such as missing sensors and limited compute often force the deployed model to be trimmed down, and performance naturally drops. Knowledge distillation is the usual way to compensate for this loss. However, the common intra-modal, cross-modal, and multi-modal distillation methods require the source (teacher) and target (student) to take the same set of sensor inputs for distillation to work. In practice, not every model in autonomous-driving scenarios has lidar available, so how can perception performance be improved for vehicles without lidar? To this end, the paper proposes to add a simulated lidar feature predicted from the image features, transferring part of the lidar information and thus compensating for the performance hit caused by the missing lidar.

2. Method implementation

The method builds on BEVFusion, so it involves fusing lidar and camera features; however, this fusion exists only in the source (teacher), while lidar is absent in the target (student). The paper therefore predicts a lidar-like feature from the image features to stand in for the missing lidar. The structure of the distillation framework is shown in the figure below:
[Figure: overall BEVSimDet distillation framework, showing the CMD, IMD, MMD-F, and MMD-P paths between teacher and student]
For this kind of knowledge distillation, the gap between source and target is fairly large, so all knowledge transfer is carried out in BEV space, i.e. through the CMD, IMD, MMD-F, and MMD-P paths shown above. Consider first the transfers over features that exist in both source and target: IMD, MMD-F, and MMD-P.

IMD:
The image features are lifted to BEV features through the view transform, and a constraint is then placed on the corresponding regions of the source and target features:
$$L_{IMD} = \mathrm{MSE}(F_{C_{bev}}^T, F_{C_{bev}}^S)$$

MMD-F:
This part distills the features obtained after fusing the lidar and camera features, using the constraint:

$$L_{MMD-F} = \mathrm{MSE}(F_{U_{bev}}^T, F_{U_{bev}}^S)$$
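Both IMD and MMD-F are plain MSE losses between teacher and student feature maps that already live on the same BEV grid, so they reduce to a few lines of PyTorch. Below is a minimal sketch (not the official implementation; the tensor names and the detached teacher are my own assumptions):

```python
import torch
import torch.nn.functional as F

def imd_loss(cam_bev_student: torch.Tensor, cam_bev_teacher: torch.Tensor) -> torch.Tensor:
    """IMD: distill the camera BEV features (teacher features are detached)."""
    return F.mse_loss(cam_bev_student, cam_bev_teacher.detach())

def mmd_f_loss(fused_bev_student: torch.Tensor, fused_bev_teacher: torch.Tensor) -> torch.Tensor:
    """MMD-F: same form, applied to the fused BEV features after lidar/camera fusion."""
    return F.mse_loss(fused_bev_student, fused_bev_teacher.detach())
```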

MMD-P:
This part performs distillation on the prediction head outputs, distinguishing the bbox regression results ($P_B$) from the classification features ($P_C$). Since the classification feature is a continuous value in $[0, 1]$, Quality Focal Loss (QFL) is used here to balance the data distribution. The constraint is:
$$L_{MMD-P} = \mathrm{SmoothL1}(P_B^T, P_B^S) \cdot s + \mathrm{QFL}(P_C^T, P_C^S) \cdot s$$
where $s$ is a loss weighting coefficient determined by the IoU between the predicted bbox and the GT.
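A hedged sketch of how this term could be computed is given below. It assumes the teacher's classification output is a soft score in $[0, 1]$ used as the QFL target, and that `s` is a per-prediction weight derived from the IoU between the predicted box and its matched GT; the interfaces are illustrative, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(student_logits, teacher_scores, beta: float = 2.0):
    """Quality Focal Loss with the teacher's soft score as the continuous target."""
    pred = student_logits.sigmoid()
    scale = (teacher_scores - pred).abs().pow(beta)  # modulating factor |y - sigma|^beta
    bce = F.binary_cross_entropy_with_logits(student_logits, teacher_scores, reduction="none")
    return scale * bce

def mmd_p_loss(pb_student, pb_teacher, pc_student_logits, pc_teacher_scores, s):
    """L_MMD-P = SmoothL1(P_B^T, P_B^S) * s + QFL(P_C^T, P_C^S) * s, averaged over predictions."""
    box_term = F.smooth_l1_loss(pb_student, pb_teacher.detach(), reduction="none").mean(-1)
    cls_term = quality_focal_loss(pc_student_logits, pc_teacher_scores.detach()).mean(-1)
    return ((box_term + cls_term) * s).mean()
```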

CMD:
Since the target has no lidar sensor, the lidar feature is instead predicted from the image features through an added branch. Part of this generation uses a Deformable Attention Layer (experiments also verify that it works better than DeformConv) to build the Geometry Compensation Module (GCM), which produces features with stronger geometric expressiveness. The simulated lidar feature is generated as:
$$F_{L_{bev}}^S = GCM_{bev}\big(\rho\big(\Upsilon(GCM_{uv}(F_{C_{uv}}^S)) \times \Theta(GCM_{uv}(F_{C_{uv}}^S))\big)\big)$$
where $\rho$, $\Upsilon$, and $\Theta$ denote the bev_pool operation, context feature generation, and depth feature prediction, respectively.
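The structure of this branch can be sketched roughly as follows. The GCM blocks are shown as simple placeholder modules (the paper builds them from a deformable attention layer, omitted here for brevity), and `bev_pool` stands for a BEVFusion-style pooling that splats frustum features onto the BEV grid; the channel sizes and interfaces are assumptions:

```python
import torch
import torch.nn as nn

class GCMPlaceholder(nn.Module):
    """Stand-in for the Geometry Compensation Module (deformable attention in the paper)."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.refine(x)

class SimulatedLidarBranch(nn.Module):
    def __init__(self, in_ch: int, context_ch: int, depth_bins: int, bev_pool):
        super().__init__()
        self.gcm_uv = GCMPlaceholder(in_ch)                   # GCM in image (uv) space
        self.context_head = nn.Conv2d(in_ch, context_ch, 1)   # Υ: context features
        self.depth_head = nn.Conv2d(in_ch, depth_bins, 1)     # Θ: depth distribution
        self.gcm_bev = GCMPlaceholder(context_ch)             # GCM in BEV space
        self.bev_pool = bev_pool                              # ρ: frustum-to-BEV pooling (injected)

    def forward(self, cam_feats_uv, geom):
        x = self.gcm_uv(cam_feats_uv)
        context = self.context_head(x)                        # (B*N, C, H, W)
        depth = self.depth_head(x).softmax(dim=1)             # (B*N, D, H, W)
        frustum = depth.unsqueeze(1) * context.unsqueeze(2)   # outer product: (B*N, C, D, H, W)
        bev = self.bev_pool(frustum, geom)                    # splat to BEV: (B, C, X, Y)
        return self.gcm_bev(bev)                              # simulated lidar BEV feature
```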

After the simulated lidar feature is obtained, it needs to be aligned with the real lidar feature. Taking each GT center $(\bar{x}_i, \bar{y}_i)$ in BEV as the center, a 2D Gaussian distribution is drawn:
$$H(x, y) = \sum_i \exp\left(-\frac{(x - \bar{x}_i)^2 + (y - \bar{y}_i)^2}{2\sigma^2}\right)$$
where $\sigma = \max(f(h, w), \tau)$ with $\tau = 2$, following the settings in CenterPoint. The GT bboxes are also projected into BEV space to obtain a binary mask $B$:
$$M_o = B \odot H$$
This mask is then used to distill the lidar features, focusing on the foreground regions:
$$L_{CMD} = M_o \odot \mathrm{MSE}(F_{L_{bev}}^T, F_{L_{bev}}^S)$$
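A simplified sketch of the CMD loss: draw a Gaussian around each GT center on the BEV grid, multiply by the binarized box mask, and use the result to weight an element-wise MSE between the real and simulated lidar BEV features. Grid-coordinate handling and the normalization are my own simplifications:

```python
import torch
import torch.nn.functional as F

def center_heatmap(centers_xy, grid_hw, sigma: float = 2.0, device="cpu"):
    """H(x, y) = sum_i exp(-((x - x_i)^2 + (y - y_i)^2) / (2 * sigma^2))."""
    h, w = grid_hw
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    heat = torch.zeros(h, w, device=device)
    for cx, cy in centers_xy:                  # centers already given in BEV grid coordinates
        heat += torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heat

def cmd_loss(sim_lidar_bev, real_lidar_bev, centers_xy, box_mask):
    """L_CMD = M_o ⊙ MSE(F_L_bev^T, F_L_bev^S), with M_o = B ⊙ H. box_mask has shape (H, W)."""
    h, w = sim_lidar_bev.shape[-2:]
    heat = center_heatmap(centers_xy, (h, w), device=sim_lidar_bev.device)
    m_o = box_mask * heat                      # foreground-focused weighting mask
    err = F.mse_loss(sim_lidar_bev, real_lidar_bev.detach(), reduction="none")
    return (err * m_o).sum() / m_o.sum().clamp(min=1.0)   # one possible normalization choice
```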

The actual performance impact of these modules:
[Table: ablation of the individual distillation modules]

3. Experimental results

Effectiveness of the distillation method:
[Table: effectiveness of the distillation method]
