DETR Series | Mask Frozen-DETR: High Quality Instance Segmentation with One GPU (Paper Reading Notes)


Preface

  The highlight of this paper is obviously that it can be trained with just one GPU! The way the topic is framed is worth learning from.

1. Abstract

  This paper builds an instance segmenter, Mask Frozen-DETR, that requires minimal training time and GPU resources and can convert any DETR-based object detection model into an instance segmentation model. The proposed method only needs to train an additional lightweight mask network that predicts instance masks within the bounding boxes produced by a frozen DETR-based object detector. It works very well on the COCO dataset and only requires a single V100 16G card for training.

Insert image description here

2. Introduction

  The introduction first points out the difficulty of instance segmentation and lists some recent methods: Cascade Mask R-CNN, Mask DINO. These methods usually use ResNet-50 or Swin-L as the backbone and require a lot of GPU time to train.
  This paper shows that existing DETR-based object detection models can be transformed into instance segmentation models. Taking $\mathcal{H}$-DETR and DINO-DETR as examples, two contributions are made: a lightweight instance segmentation network is designed to effectively utilize the output of the frozen DETR-based object detector, and the effectiveness of the proposed method is demonstrated on models of different scales.

3. Related work

Object Detection

  Object detection is a fundamental research field with a large body of work: Faster R-CNN, Cascade R-CNN, YOLO, DETR, Deformable DETR, DINO-DETR, $\mathcal{H}$-DETR. The method in this paper is built on these DETR-based methods and achieves SOTA performance with roughly 10 times faster training.

Instance Segmentation

  The earliest instance segmentation methods are based on the R-CNN series, following the idea of detecting first and then segmenting, for example Mask R-CNN, Cascade Mask R-CNN, and HTC. Some methods have relatively simple designs: SOLO, QueryInst. There are also DETR-based methods: MaskFormer, Mask2Former, Mask DINO. And there are 3D instance segmentation methods, such as SPFormer.

Discussion

  Most instance segmentation models do not utilize the weights of offline object detection models and therefore require a large amount of training time. The method in this article adopts a frozen DETR-based model and introduces a lightweight instance segmentation head. The figure below is a comparison of some methods:

Insert image description here

4. Methods of this article

4.1 Baseline settings

  $\mathcal{H}$-DETR + ResNet-50 is used as the baseline, and comparative results are reported for $\mathcal{H}$-DETR + Swin-L and DINO-DETR + FocalNet-L. The $\mathcal{H}$-DETR + ResNet-50 model is pre-trained on the Objects365 dataset and then fine-tuned on the COCO dataset.
  First, based on the query scores output by the last Transformer decoder layer of the detector, the top 100 object queries are selected for mask prediction. These queries $\{\mathbf{q}_i \mid \mathbf{q}_i \in \mathbb{R}^{d}\}_{i=1}^{N}$ are then multiplied with the image feature map $\mathbf{F} \in \mathbb{R}^{\frac{HW}{16} \times d}$ at $1/4$ of the original resolution to obtain the instance segmentation masks:
$$\begin{aligned} \mathbf{F} &= \mathbf{C}_1 + \text{interpolate}(\mathbf{E}) \\ \mathbf{M}_i &= \text{interpolate}(\text{reshape}(\text{Sigmoid}(\mathbf{q}_i\mathbf{F}^\top))) \end{aligned}$$
where $\mathbf{C}_1 \in \mathbb{R}^{\frac{HW}{16} \times d}$ is the first-stage feature map output by the backbone, $\mathbf{E} \in \mathbb{R}^{\frac{HW}{64} \times d}$ is the Transformer encoder output feature map at $1/8$ of the original resolution, and $H$, $W$, $d$ denote the height and width of the input image and the hidden feature dimension, respectively. $\mathbf{M}_i \in \mathbb{R}^{HW}$ is the final predicted mask probability map. The figure below shows the overall pipeline:

Insert image description here
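  To make the computation concrete, here is a minimal PyTorch-style sketch of this baseline mask prediction. The tensor layouts, interpolation modes, and helper signatures are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F_

def predict_masks(queries, C1, E, img_h, img_w):
    """Baseline mask prediction (sketch, single image).

    queries: (N, d)        top-scoring object queries from the frozen decoder
    C1:      (d, H/4, W/4) first-stage backbone feature map
    E:       (d, H/8, W/8) Transformer encoder output feature map
    """
    # F = C1 + interpolate(E): upsample the encoder feature map to 1/4 resolution
    feat = C1 + F_.interpolate(E[None], size=C1.shape[-2:],
                               mode="bilinear", align_corners=False)[0]
    d, h4, w4 = feat.shape
    # M_i = interpolate(reshape(Sigmoid(q_i F^T)))
    logits = queries @ feat.flatten(1)                   # (N, H/4 * W/4)
    masks = torch.sigmoid(logits).view(-1, 1, h4, w4)
    masks = F_.interpolate(masks, size=(img_h, img_w),
                           mode="bilinear", align_corners=False)
    return masks.squeeze(1)                              # (N, H, W) probability maps
```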
  Next, the confidence score is calculated to reflect the quality of the mask:
$$s_i = c_i \times \frac{\text{sum}(\mathbf{M}_i[\mathbf{M}_i > 0.5])}{\text{sum}([\mathbf{M}_i > 0.5])}$$
where $c_i$ denotes the classification score of the $i$-th object query.
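  A small sketch of this mask-quality weighting, assuming the mask tensor from the previous snippet (the 0.5 threshold comes from the formula above):

```python
def mask_confidence(cls_score, mask, thresh=0.5):
    """s_i = c_i * mean mask probability over pixels predicted as foreground."""
    fg = mask > thresh
    if fg.sum() == 0:
        return cls_score * 0.0          # no foreground pixels: zero confidence
    return cls_score * mask[fg].sum() / fg.sum()
```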
  Next, RoIAlign is applied within the predicted bounding boxes to aggregate region feature maps for localization:
$$\mathbf{R}_i = \text{reshape}(\text{RoIAlign}(\text{reshape}(\mathbf{F}), \mathbf{b}_i))$$
where $\mathbf{R}_i \in \mathbb{R}^{hw \times d}$ and $\mathbf{b}_i$ denotes the bounding box predicted from $\mathbf{q}_i$. By default $h = 32$ and $w = 32$. The instance segmentation mask is then computed as:
$$\mathbf{M}_i^r = \text{paste}(\text{interpolate}(\text{reshape}(\text{Sigmoid}(\mathbf{q}_i\mathbf{R}_i^\top))))$$
The predicted region masks are first resized to the box size in the original image and then pasted onto an empty full-image mask. The confidence score is computed from $\mathbf{M}_i^r$ in the same way as above.
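  Below is a hedged PyTorch sketch of this box-level prediction using torchvision's `roi_align`. Box clipping, the pasting loop, and the scale factor are illustrative assumptions that only mirror the formulas above.

```python
import torch
import torch.nn.functional as F_
from torchvision.ops import roi_align

def predict_region_masks(queries, feat, boxes, img_h, img_w, out=32):
    """Box-level mask prediction (sketch, single image).

    queries: (N, d)        object queries
    feat:    (d, H/4, W/4) fused feature map F from the baseline step
    boxes:   (N, 4)        predicted boxes (x1, y1, x2, y2) in image pixels
    """
    n = queries.shape[0]
    # R_i = RoIAlign(F, b_i): pool an out x out region feature for each predicted box
    rois = torch.cat([boxes.new_zeros(n, 1), boxes], dim=1)        # (N, 5), batch index 0
    regions = roi_align(feat[None], rois, output_size=(out, out),
                        spatial_scale=0.25)                        # (N, d, out, out)
    # per-box mask logits q_i R_i^T, kept in the out x out region layout
    region_masks = torch.sigmoid(torch.einsum("nd,ndhw->nhw", queries, regions))

    # paste: resize each region mask to its box and place it on an empty full-image mask
    full = feat.new_zeros(n, img_h, img_w)
    for i, (x1, y1, x2, y2) in enumerate(boxes.round().long().tolist()):
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(max(x2, x1 + 1), img_w), min(max(y2, y1 + 1), img_h)
        if x2 <= x1 or y2 <= y1:
            continue                                               # degenerate box, skip
        m = F_.interpolate(region_masks[i][None, None], size=(y2 - y1, x2 - x1),
                           mode="bilinear", align_corners=False)[0, 0]
        full[i, y1:y2, x1:x2] = m
    return full                                                    # (N, H, W)
```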

Results

Insert image description here
  The baseline experiments in the table above show that the model's performance can be improved from three aspects: the design of the image feature encoder, the design of the box region feature encoder, and the design of the query feature encoder.
  In the following ablation experiments, the entire object detector is frozen and only the additionally introduced parameters are fine-tuned, with training lasting about 6 epochs.

Experimental setup

  AdamW optimizer with an initial learning rate of $1.5\times10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay $5\times10^{-5}$. The model is trained for 88,500 iterations, i.e. about 6 epochs, and the learning rate is decayed at 0.9 and 0.95 of the total iterations. Data preprocessing follows Deformable DETR, the batch size is 8, and all experiments are run on a single V100 16G card. Evaluation uses the COCO metrics: AP, AP50, AP75, APS, APM, APL.
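  For reference, a minimal sketch of this optimization setup in PyTorch. The `mask_head` argument stands for the newly introduced lightweight parameters (the detector stays frozen), and the decay factor of 0.1 at the two milestones is an assumption; the note above only states where the decay happens.

```python
import torch

def build_optimizer(mask_head: torch.nn.Module, total_iters: int = 88500):
    """Optimizer and schedule for the lightweight mask head only (sketch)."""
    optimizer = torch.optim.AdamW(mask_head.parameters(), lr=1.5e-4,
                                  betas=(0.9, 0.999), weight_decay=5e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[int(0.9 * total_iters), int(0.95 * total_iters)],
        gamma=0.1)  # decay factor assumed
    return optimizer, scheduler
```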

4.2 Image feature encoder

  First, a trainable image feature encoder is introduced to map the image feature map into a feature space better suited to predicting instance segmentation masks. The figure below shows the overall design:

Insert image description here
  As shown in the figure above, the image feature encoder is applied to the Transformer encoder output feature map $\mathbf{E}$:
$$\mathbf{F} = \mathbf{C}_1 + \text{interpolate}(\mathcal{F}_e(\mathbf{E}))$$
where $\mathcal{F}_e$ denotes the image feature encoder, which refines the image feature map shared by all object queries at once. Three types of convolution or Transformer blocks are considered for its structure.

Deformable encoder block

  Following the multi-scale deformable encoder design, stacked multi-scale deformable encoder blocks are used to enhance the multi-scale feature maps:
$$\begin{aligned} \mathbf{E} &= [\mathbf{E}_1, \mathbf{E}_2, \mathbf{E}_3, \mathbf{E}_4] \\ \mathcal{F}_e(\mathbf{E}) &= \text{MultiScaleDeformableEnc}([\mathbf{E}_1, \mathbf{E}_2, \mathbf{E}_3, \mathbf{E}_4]) \end{aligned}$$
where $\mathbf{E}_1$, $\mathbf{E}_2$, $\mathbf{E}_3$, $\mathbf{E}_4$ denote the feature maps of different scales from the Transformer encoder of the object detector. Each multi-scale deformable encoder block follows the design MSDeformAttn → LayerNorm → FFN → LayerNorm, and the FFN follows Linear → GELU → Linear.
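  A sketch of the block wiring is below. Multi-scale deformable attention is not part of stock PyTorch, so the attention module (e.g. MSDeformAttn from the Deformable DETR codebase) is passed in; its call signature here follows that reference implementation and may differ in other versions.

```python
import torch.nn as nn

class DeformableEncoderBlock(nn.Module):
    """One block: MSDeformAttn -> LayerNorm -> FFN -> LayerNorm (post-norm, sketch)."""
    def __init__(self, dim, deform_attn, ffn_dim=1024):
        super().__init__()
        self.attn = deform_attn                       # multi-scale deformable attention module
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, reference_points, spatial_shapes, level_start_index):
        # x: (B, sum_l H_l * W_l, dim) flattened tokens of the four encoder levels E_1..E_4
        x = self.norm1(x + self.attn(x, reference_points, x, spatial_shapes, level_start_index))
        x = self.norm2(x + self.ffn(x))
        return x
```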

Swin Transformer encoder block

  Following the design of Swin Transformer, stacked Swin Transformer blocks are applied to the highest-resolution feature map $\mathbf{E}_1$:
$$\mathcal{F}_e(\mathbf{E}_1) = \text{SwinTransformerEnc}(\mathbf{E}_1)$$
Each Swin Transformer block follows the route LayerNorm → W-MSA → LayerNorm → FFN, where W-MSA denotes window-based multi-head self-attention.
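  A simplified sketch of one such block is given below; for brevity it uses plain (non-shifted) window attention without the relative position bias of the full Swin design, so treat it as a structural illustration only.

```python
import torch.nn as nn

class SwinStyleBlock(nn.Module):
    """LayerNorm -> W-MSA -> LayerNorm -> FFN over non-overlapping windows (sketch)."""
    def __init__(self, dim, num_heads=8, window=8, mlp_ratio=4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                       # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # partition into non-overlapping w x w windows: (B * num_windows, w*w, C)
        win = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        h = self.norm1(win)
        h, _ = self.attn(h, h, h)
        win = win + h
        win = win + self.ffn(self.norm2(win))
        # reverse the window partition back to (B, H, W, C)
        return win.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```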

ConvNext encoder block

  ConvNeXt encoder blocks are applied to the Transformer encoder feature map:
$$\mathcal{F}_e(\mathbf{E}_1) = \text{ConvNextBlock}(\mathbf{E}_1)$$
Each ConvNeXt block follows the route DWC → LayerNorm → FFN, where DWC denotes a depth-wise convolution with a large $7\times 7$ kernel.
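  A minimal sketch of one such block (layer scale and stochastic depth from the original ConvNeXt are omitted; the MLP expansion ratio is an assumption):

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of one block: 7x7 depth-wise conv -> LayerNorm -> FFN (point-wise layers)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):            # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)    # to channels-last for LayerNorm / Linear
        x = self.ffn(self.norm(x))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x
```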

Results

Insert image description here

4.3 Box feature encoder

  The design of the box feature encoder is shown below:

Insert image description here
  As shown in the figure above, an extra transformer layer is simply applied to $\mathbf{R}_i$:
$$\mathbf{R}_i = \mathcal{F}_b(\mathbf{R}_i)$$
Similar to the image feature encoder, the impact of different box feature encoder structures on model performance is studied.
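  A sketch of $\mathcal{F}_b$ using a standard PyTorch transformer encoder layer; the head count, depth, and FFN width are placeholders, not values from the paper.

```python
import torch.nn as nn

class BoxFeatureEncoder(nn.Module):
    """Sketch: a standard transformer layer applied to each box's region tokens R_i."""
    def __init__(self, dim, num_heads=8, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, regions):       # regions: (N, h*w, d) RoI-aligned features per box
        return self.encoder(regions)  # refined R_i, same shape
```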

channel mapper

  A channel mapper, implemented as a simple linear layer, is used to reduce the channel dimension of $\mathbf{F} \in \mathbb{R}^{\frac{HW}{16} \times d}$, and the box region feature encoder is then applied on the updated feature map.

Results

Insert image description here
Insert image description here

4.4 Query feature encoder

Object-to-Object attention

  Objects close to each other may cause multiple instances to fall inside the same bounding box, so object-to-object attention is added to help the object queries distinguish themselves from one another. Specifically, a multi-head self-attention mechanism is applied to the queries:
$$[\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_N] = \text{SelfAttention}([\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_N])$$
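  A minimal sketch of this step; the residual connection and LayerNorm around the attention are assumptions about the layer layout.

```python
import torch.nn as nn

class ObjectToObjectAttention(nn.Module):
    """Sketch: multi-head self-attention over the N object queries."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries):                 # queries: (B, N, d)
        out, _ = self.attn(queries, queries, queries)
        return self.norm(queries + out)         # residual + norm (assumed layout)
```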

Box-to-object attention

  In a frozen DETR-based object detector, the object queries are trained for detection and attend to whole-image features rather than individual region features. Box-to-object attention is therefore introduced to adapt the queries to the segmentation task:
$$\mathbf{q}_i = \text{CrossAttention}(\mathbf{q}_i, \mathbf{R}_i)$$
where each object query is updated by attending to its box region features.
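  Again a hedged sketch, this time with each query cross-attending to its own RoI tokens; the residual connection and normalization details are assumed.

```python
import torch.nn as nn

class BoxToObjectAttention(nn.Module):
    """Sketch: each query cross-attends to its own box region tokens R_i."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, regions):
        # queries: (N, d) one query per box; regions: (N, h*w, d) RoI features per box
        q = queries.unsqueeze(1)                        # (N, 1, d): treat each box as a "batch"
        out, _ = self.attn(q, regions, regions)         # cross-attention q_i -> R_i
        return self.norm(q + out).squeeze(1)            # updated queries, (N, d)
```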

FFN

  An FFN may also play an auxiliary role in adjusting the object query representations.

Results

Insert image description here
  Conclusion: only the box-to-object attention design is needed.

4.5 Other improvements

Mask Loss on sampled pixels

  A mask loss computed on sampled points is introduced into the final loss. Specifically, consider $N$ points, an oversampling rate $k~(k>1)$, and a sampling ratio $\beta~(\beta\in[0,1])$. First, $kN$ points are randomly sampled from the output mask and $\beta N$ of them are selected; the remaining $(1-\beta)N$ points are then sampled at random, and the loss is computed on these $N$ points.
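  A sketch of this sampling is below. The note does not spell out how the $\beta N$ points are chosen, so the snippet assumes the common PointRend-style rule of keeping the most uncertain candidates; the default values of $N$, $k$, and $\beta$ are illustrative only.

```python
import torch
import torch.nn.functional as F_

def sampled_point_loss(pred_logits, target, num_points=12544, k=3.0, beta=0.75):
    """Mask loss on sampled points (sketch, single mask).

    pred_logits, target: (H, W) predicted mask logits and binary ground truth.
    """
    h, w = pred_logits.shape
    flat_logits, flat_target = pred_logits.flatten(), target.flatten().float()

    # oversample k*N candidate points and rank them by uncertainty (small |logit|)
    cand = torch.randint(0, h * w, (int(k * num_points),), device=pred_logits.device)
    uncertainty = -flat_logits[cand].abs()
    n_importance = int(beta * num_points)
    importance = cand[uncertainty.topk(n_importance).indices]

    # fill the remaining (1 - beta) * N points uniformly at random
    rand = torch.randint(0, h * w, (num_points - n_importance,), device=pred_logits.device)
    idx = torch.cat([importance, rand])

    return F_.binary_cross_entropy_with_logits(flat_logits[idx], flat_target[idx])
```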

Mask scoring

  Since the classification score output by the frozen DETR cannot reflect the quality of the segmentation mask, mask scoring is introduced to calibrate it. Specifically, the mask scoring head takes the mask and the box region features as input and predicts the IoU between the output mask and the ground truth:
$$\text{iou}_i = \text{MLP}(\text{Flatten}(\text{Conv}(\text{Cat}(\mathbf{M}_i, \mathbf{R}_i))))$$
where Cat, Conv, and MLP denote the concatenation operation, convolutional layers, and a multi-layer perceptron, respectively. The IoU predicted by the mask scoring head is used to calibrate the classification score:
$$s_i = c_i \cdot \text{iou}_i$$
where $s_i$ is the confidence score of the output mask.
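  A sketch of such a head is below; the number of convolution layers, their widths, and the hidden size of the MLP are placeholders, since the note only specifies the Cat → Conv → Flatten → MLP structure.

```python
import torch
import torch.nn as nn

class MaskScoringHead(nn.Module):
    """Sketch: predict mask IoU from the region mask and box region features."""
    def __init__(self, dim, out=32, hidden=256):
        super().__init__()
        # Cat(M_i, R_i) -> Conv -> Flatten -> MLP -> iou_i
        self.conv = nn.Sequential(
            nn.Conv2d(dim + 1, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU())
        self.mlp = nn.Sequential(
            nn.Linear(hidden * (out // 4) * (out // 4), hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, region_mask, region_feat):
        # region_mask: (N, out, out) probabilities; region_feat: (N, dim, out, out)
        x = torch.cat([region_mask.unsqueeze(1), region_feat], dim=1)
        x = self.conv(x).flatten(1)
        return self.mlp(x).squeeze(1)    # predicted IoU, later used as s_i = c_i * iou_i
```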

Neck structure for Backbone features

  The feature map $\mathbf{C}_1 \in \mathbb{R}^{\frac{HW}{16} \times d}$ may not contain sufficient semantic information, so a simple neck block is introduced to encode semantic information into the high-resolution feature map $\mathbf{C}_1$:
$$\mathbf{C}_1 = \text{GN}(\text{PWConv}(\mathbf{C}_1))$$
where $\text{GN}$ and $\text{PWConv}$ denote group normalization and point-wise convolution, respectively.
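  A minimal sketch of this neck block (the number of groups and the output width are assumptions):

```python
import torch.nn as nn

class NeckBlock(nn.Module):
    """Sketch: point-wise conv followed by group normalization on the backbone feature C1."""
    def __init__(self, in_dim, out_dim, num_groups=32):
        super().__init__()
        self.pwconv = nn.Conv2d(in_dim, out_dim, kernel_size=1)   # point-wise convolution
        self.gn = nn.GroupNorm(num_groups, out_dim)               # out_dim must be divisible

    def forward(self, c1):            # c1: (B, in_dim, H/4, W/4)
        return self.gn(self.pwconv(c1))
```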

Results

Insert image description here

5. Comparison with SOTA

Insert image description here

6. Ablation experiment and analysis

Output size of RoIAlign

Insert image description here

Training Epoch

Insert image description here

Large-scale jitter

Insert image description here

Design of the instance mask head

Insert image description here

Batch Size

Insert image description here

Layer index of encoder feature map

Insert image description here

Impact of the depth of the image feature encoder and the box feature encoder

Insert image description here

Effect of fine-tuning DETR

Insert image description here

Qualitative results

Insert image description here

7. Conclusion

  This article shows in detail how to convert an offline DETR-based object detection model into an instance segmentation model with minimal training time and cost. Experiments show that it is very effective.

Afterword

  After procrastinating for so long, I have finally finished this blog post. The idea of the paper is to borrow from parameter-efficient fine-tuning and introduce new structures on top of a frozen model, which is quite clever. The experiments in particular are very thorough. There are also shortcomings: Fig. 1 is not cited in the text, and the ablation experiments are so numerous that they need not all be presented separately; several could be merged.
