Swin Transformer test process

The overall structure consists of four parts: backbone, neck, rpn_head, and roi_head. Two detector structures are available, Mask R-CNN and Cascade R-CNN, as shown in the upper-left corner of Figure 1.

1.backbone

The backbone consists of four BasicLayers, each made up of blocks and a downsample module. The blocks contain several basic transformer modules, and downsample halves the spatial resolution. Taking an input image of shape (3, 800, 1216) as an example: patch_embed first turns it into (96, 200, 304), and the result is then passed through the 4 BasicLayers in turn. The outputs are (1, 96, 200, 304), (1, 192, 100, 152), (1, 384, 50, 76), and (1, 768, 25, 38). These outputs are sent to the FPN.
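The shape bookkeeping above can be sketched in a few lines (a toy calculation of the stage output shapes, not the actual Swin code):

```python
# Sketch of the backbone shape flow described above: patch_embed turns
# (3, 800, 1216) into (96, 200, 304), then each downsample halves the
# spatial size and doubles the channels before the next stage.

def backbone_shapes(h, w, embed_dim=96, num_stages=4, patch_size=4):
    """Return the (C, H, W) output shape of each backbone stage."""
    h, w = h // patch_size, w // patch_size          # patch_embed: stride-4
    shapes = []
    c = embed_dim
    for _ in range(num_stages):
        shapes.append((c, h, w))                     # stage output (pre-downsample)
        h, w, c = (h + 1) // 2, (w + 1) // 2, c * 2  # downsample between stages
    return shapes

print(backbone_shapes(800, 1216))
# [(96, 200, 304), (192, 100, 152), (384, 50, 76), (768, 25, 38)]
```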

2.neck

As shown in the figure above, each backbone output is first projected to 256 channels (through four 1×1 lateral convolutions), and then each upper (coarser) feature map is upsampled by bilinear interpolation (F.interpolate) and added to the feature map of the level below it — a typical feature-pyramid structure. Finally, a fifth feature map (256, 13, 19) is generated directly from the topmost feature map (256, 25, 38) by max pooling.
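The neck behaviour described above can be sketched in PyTorch. This is a shape-level toy, not mmdetection's exact FPN (the real one also applies 3×3 convolutions after merging); `TinyFPN` and its parameters are illustrative names:

```python
# Hedged sketch of the FPN step: 1x1 lateral convs bring every level to 256
# channels, coarser maps are upsampled and added to finer ones, and an extra
# 5th level comes from stride-2 max pooling of the coarsest output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):
        # 1x1 convs: every level -> 256 channels
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map and add it to the finer one
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:],
                mode="bilinear", align_corners=False)
        # extra 5th level by stride-2 max pooling of the coarsest output
        laterals.append(F.max_pool2d(laterals[-1], kernel_size=1, stride=2))
        return laterals

feats = [torch.randn(1, c, h, w) for c, h, w in
         [(96, 200, 304), (192, 100, 152), (384, 50, 76), (768, 25, 38)]]
outs = TinyFPN()(feats)
print([tuple(o.shape) for o in outs])
```

The last printed shape is (1, 256, 13, 19), matching the fifth level described above.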

3.rpn_head

As shown in part 4 of the figure above, three convolutions turn each of the 5 feature-map levels into classification and regression outputs. The 3-channel output represents 3 anchors per location (aspect ratios 2:1, 1:1, 1:2), and the 12-channel output represents 4 coordinate values × 3 anchors.
The specific calculation is as follows: at each level, the top 1000 proposals are selected by score (the score map of the first level is, e.g., (200, 304, 3)); concatenating all levels gives bbox: (4741, 5), because the fifth level has only 13 × 19 × 3 = 741 anchors, fewer than 1000. These are sent to NMS, which leaves 1426 targets, of which only the top 1000 are kept as the final output Rois: (1000, 5).
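The proposal bookkeeping above reduces to simple arithmetic (feature-map sizes taken from the shapes listed earlier; this is just the counting, not the real RPN code):

```python
# Per-level proposal counts: at most the top-1000 scoring anchors per level,
# except the 5th level, which has fewer than 1000 anchors in total.
levels = [(200, 304), (100, 152), (50, 76), (25, 38), (13, 19)]
num_anchors = 3
per_level = [min(h * w * num_anchors, 1000) for h, w in levels]
print(per_level)       # [1000, 1000, 1000, 1000, 741]
print(sum(per_level))  # 4741 proposals are concatenated and sent to NMS
```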

4.roi_head

The inputs are the 5 feature-map levels (each with 256 channels) and the rois (1000, 5).

4.1. Each roi is assigned to one of the 5 feature-map levels according to its area (see the formula in the blue figure below).

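The assignment rule can be sketched as follows. This reproduces, from memory, the formula used by mmdetection's SingleRoIExtractor (the `finest_scale = 56` default is an assumption taken from that implementation):

```python
# An RoI with area w*h is mapped to level floor(log2(sqrt(w*h) / finest_scale)),
# clamped to the valid range [0, num_levels - 1].
import math

def map_roi_level(w, h, finest_scale=56, num_levels=4):
    scale = math.sqrt(w * h)
    lvl = int(math.floor(math.log2(scale / finest_scale + 1e-6)))
    return max(0, min(num_levels - 1, lvl))

print(map_roi_level(40, 40))    # small box  -> level 0 (finest, stride-4 map)
print(map_roi_level(400, 400))  # large box  -> level 2
```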

4.2. Through RoIAlign, each roi's features are pooled to a fixed 7×7 grid:

# loop over the 4 feature-map levels i:
roi_feats_t = self.roi_layers[i](feats[i], rois)
# self.roi_layers:
#   RoIAlign(out=(7, 7), scale=0.25)
#   RoIAlign(out=(7, 7), scale=0.125)
#   RoIAlign(out=(7, 7), scale=0.0625)
#   RoIAlign(out=(7, 7), scale=0.03125)

Taking the first level as an example, feats[i] is (1, 256, 200, 304) and rois is (689, 5) — the 689 objects out of the 1000 rois that were assigned to the first level.
The output roi_feats_t has shape (689, 256, 7, 7).
Writing each level's result back gives the final output roi_feats of shape (1000, 256, 7, 7).
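The merge of the per-level results can be sketched like this (assumed shapes; the real extractor does the same with a boolean mask per level, and the dummy tensors here stand in for the real RoIAlign outputs):

```python
# Each level's RoIAlign output is scattered back into one (1000, 256, 7, 7)
# tensor at the indices of the rois that were assigned to that level.
import torch

num_levels, out = 4, torch.zeros(1000, 256, 7, 7)
target_lvls = torch.randint(0, num_levels, (1000,))  # stand-in level assignment
for i in range(num_levels):
    inds = torch.nonzero(target_lvls == i, as_tuple=False).squeeze(1)
    # in the real code: roi_feats_t = self.roi_layers[i](feats[i], rois[inds]);
    # here a dummy tensor of the right shape stands in for it
    roi_feats_t = torch.full((inds.numel(), 256, 7, 7), float(i + 1))
    out[inds] = roi_feats_t
print(out.shape)  # torch.Size([1000, 256, 7, 7])
```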

4.3. The pooled features (1000, 256, 7, 7) go through classification + regression:

cls_score, bbox_pred = self.bbox_head(bbox_feats)
# cls_score: (1000, 81) — 80 classes + background
# bbox_pred: (1000, 320) — 4 coordinates for each of the 80 classes
det_bbox, det_label = multiclass_nms(bbox, score)
# det_bbox: (57, 5), det_label: (57,)
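A self-contained pure-PyTorch sketch of what a per-class NMS like multiclass_nms does (a naive O(N²) version for illustration, not mmdetection's implementation; the function names here are made up):

```python
import torch

def nms(boxes, scores, iou_thr=0.5):
    """Naive NMS: greedily keep the best box, drop overlaps above iou_thr."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        rest = order[1:]
        lt = torch.maximum(boxes[i, :2], boxes[rest, :2])  # intersection top-left
        rb = torch.minimum(boxes[i, 2:], boxes[rest, 2:])  # intersection bottom-right
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]
    return torch.tensor(keep, dtype=torch.long)

def multiclass_nms_sketch(boxes, scores, labels, iou_thr=0.5):
    """Run NMS independently per class; return (dets (M, 5), labels (M,))."""
    dets, out_labels = [], []
    for c in labels.unique():
        m = torch.nonzero(labels == c, as_tuple=False).squeeze(1)
        keep = nms(boxes[m], scores[m], iou_thr)
        dets.append(torch.cat([boxes[m][keep], scores[m][keep, None]], dim=1))
        out_labels.append(labels[m][keep])
    return torch.cat(dets), torch.cat(out_labels)

# two heavily-overlapping boxes of class 0 and one box of class 1
boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
labels = torch.tensor([0, 0, 1])
dets, labs = multiclass_nms_sketch(boxes, scores, labels)
print(dets.shape, labs.tolist())  # torch.Size([2, 5]) [0, 1]
```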

This concludes the inference process.


Origin blog.csdn.net/qq_45752541/article/details/119978260