CVPR19 (HTC): "Hybrid Task Cascade for Instance Segmentation"

First Look

Cascade is a powerful architecture that boosts model performance on many tasks, as Cascade R-CNN demonstrates for object detection, but there is currently no good way to integrate it into instance segmentation. Naively fusing Cascade R-CNN with Mask R-CNN brings only a limited gain in mask AP (1.8%), while bbox AP improves by 3.5%.

The author argues that an important reason for this gap is that Cascade Mask R-CNN has poor information flow between the mask branches of different stages: a later stage's mask branch benefits only from more accurate bounding boxes and has no direct connection to the earlier mask predictions, as shown in the figure below.

[Figure: information flow in Cascade Mask R-CNN]

To address this, the paper proposes HTC (Hybrid Task Cascade) for instance segmentation. The core idea is to interleave the cascade with multi-tasking at every stage to improve the information flow, and to exploit spatial context to further improve accuracy. As shown in the figure below, HTC combines bbox regression and mask prediction in a multi-task manner at each stage, and additionally builds a direct connection between the mask branches of different stages: the mask features of each stage are encoded and passed to the next stage.

For object detection, contextual information provides very important clues for object localization and classification, so HTC also adds a fully convolutional branch for semantic segmentation. This branch encodes contextual information not only from foreground instances but also from background regions, further improving the prediction accuracy of bounding boxes and instance masks.
[Figure: HTC overview]
Finally, combining a stronger backbone with several tricks, the author's team won the 2018 COCO object detection challenge with HTC.


Main Method

[Figure: evolution from Cascade Mask R-CNN (a) to HTC (d)]
The figure above shows how Cascade Mask R-CNN evolves into HTC: a) the naive Cascade Mask R-CNN; b) abandon the parallel structure and instead perform bbox regression and mask prediction alternately; c) build connections between the mask branches (the mask features of the previous stage are passed to the next stage); d) enhance the information flow further by adding an extra semantic segmentation branch and merging it with the bbox and mask branches to capture more contextual information. Each step of this evolution is described in more detail next.

Figure a): The directly fused Cascade Mask R-CNN can be expressed by the formula below. Box and mask predictions are performed in parallel at each stage, using the boxes from the previous stage as the candidate boxes of the current stage.
$x_t^{box} = P(x, r_{t-1}), \quad r_t = B_t(x_t^{box})$

$x_t^{mask} = P(x, r_{t-1}), \quad m_t = M_t(x_t^{mask})$

Here $t$ denotes the $t$-th stage; $P$ denotes the pooling operation (RoI Align or RoI pooling); $x_t^{box}$ is the box-branch feature extracted by pooling at stage $t$, $B_t$ is the box branch of stage $t$, and $r_t$ is its box prediction; $x_t^{mask}$ is the mask-branch feature extracted by pooling at stage $t$, $M_t$ is the mask branch of stage $t$, and $m_t$ is its mask prediction.
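The data flow of figure a) can be sketched with toy stand-in functions; the sketch below is illustrative only (features and boxes are plain floats, and `P`, `B`, `M` are placeholders for RoI pooling and the real box/mask heads), with names mirroring the symbols in the formula.

```python
def P(x, r):                      # pooling stand-in: mix image feature with boxes
    return (x + r) / 2

def B(feat):                      # box-branch stand-in: refine the boxes
    return feat + 0.1

def M(feat):                      # mask-branch stand-in: predict a mask score
    return feat * 0.5

def cascade_mask_rcnn(x, r0, num_stages=3):
    """Figure a): box and mask are pooled in parallel with the same r_{t-1}."""
    r, masks = r0, []
    for _ in range(num_stages):
        x_box = P(x, r)           # x_t^box  = P(x, r_{t-1})
        x_mask = P(x, r)          # x_t^mask = P(x, r_{t-1})  -- same old boxes
        masks.append(M(x_mask))   # m_t = M_t(x_t^mask)
        r = B(x_box)              # r_t = B_t(x_t^box), cascaded to next stage
    return r, masks
```

Note that the mask branch only ever sees the previous stage's boxes, which is exactly the weakness the next two variants address.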

Figure b): The drawback of a) is that the box and mask branches both take only the box output of the previous stage, with no direct interaction within the same stage. Therefore an interleaved structure is introduced, expressed by the formula below: each stage first performs box prediction, then performs mask segmentation based on that box prediction.
$x_t^{box} = P(x, r_{t-1}), \quad r_t = B_t(x_t^{box})$

$x_t^{mask} = P(x, r_t), \quad m_t = M_t(x_t^{mask})$
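The interleaved structure changes only one thing relative to a): the mask feature is pooled with the current stage's refined boxes $r_t$ instead of $r_{t-1}$. A minimal sketch, again with toy stand-ins passed in for pooling and the two heads:

```python
def interleaved_stage(x, r_prev, P, B, M):
    """Figure b): run the box branch first, then pool the mask with r_t."""
    x_box = P(x, r_prev)          # x_t^box = P(x, r_{t-1})
    r_t = B(x_box)                # r_t = B_t(x_t^box)
    x_mask = P(x, r_t)            # x_t^mask = P(x, r_t)  -- refined boxes
    m_t = M(x_mask)               # m_t = M_t(x_t^mask)
    return r_t, m_t
```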
Figure c): In b) there is still no information flow between the mask branches of different stages. The author first analyzed an important factor in the success of Cascade R-CNN: the input feature of the box branch at each stage is jointly determined by the box output of the previous stage and the backbone output. Following this design concept, information flow is introduced between the mask branches, with the formula as follows.
$x_t^{mask} = P(x, r_t), \quad m_t = M_t(\mathcal{F}(x_t^{mask}, m_{t-1}^{-}))$

Here $m_{t-1}^{-}$ denotes the intermediate feature of $M_{t-1}$, and $\mathcal{F}$ denotes a function that fuses the features of the current stage with those of the previous stage.

The concrete implementation is shown below: $m_{t-1}^{-}$ is the output of the previous stage's mask branch $M_{t-1}$ after its four conv layers but before the deconv layer; it is passed through a 1x1 conv and combined with $x_t^{mask}$ by element-wise addition, after which 4 conv layers + a deconv layer produce the segmentation prediction. Likewise, the current stage's $m_t^{-}$ is forwarded to the next stage in the same way.
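The fusion step can be sketched as follows, treating features as flat lists of floats; `G` is a toy stand-in for the 1x1 conv that embeds the previous stage's pre-deconv mask feature, not the real layer.

```python
def G(m_prev, w=1.0):             # 1x1 conv stand-in: per-element scaling
    return [w * v for v in m_prev]

def fuse(x_mask, m_prev):
    """F(x_t^mask, m_{t-1}^-) = x_t^mask + G(m_{t-1}^-), element-wise."""
    if m_prev is None:            # the first stage has no incoming mask feature
        return list(x_mask)
    return [a + b for a, b in zip(x_mask, G(m_prev))]
```

In the real model the fused feature then goes through the 4 conv layers + deconv of the current mask head.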
[Figure: implementation of the mask information flow]
Through this design, adjacent mask branches interact directly, and the mask features of different stages are no longer isolated: all of them are supervised by the loss and receive gradients.

Figure d): Cascaded training of strongly related tasks can improve the feature representation and benefit the final task. The author therefore additionally adds a fully convolutional branch for semantic segmentation. Its features serve as strongly complementary features for the box and mask branches, helping to further distinguish foreground from background. With this branch added, the box and mask branches apply the pooling operation not only to the backbone output features but also to the semantic segmentation features, and combine the two by element-wise addition as the branch input. The formula is as follows:

$x_t^{box} = P(x, r_{t-1}) + P(S(x), r_{t-1}), \quad r_t = B_t(x_t^{box})$

$x_t^{mask} = P(x, r_t) + P(S(x), r_t), \quad m_t = M_t(\mathcal{F}(x_t^{mask}, m_{t-1}^{-}))$

Here $S(x)$ denotes the feature of the semantic segmentation branch.

The semantic segmentation branch is implemented as follows. First, the features of the different FPN levels are fused: each level goes through a 1x1 conv plus upsampling/downsampling to a common resolution, and the results are combined by element-wise addition. Then, after 4 conv layers, one 1x1 conv produces the semantic segmentation prediction, and another 1x1 conv produces the semantic segmentation feature used in the formula above.
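The level-fusion step above can be sketched on toy 1-D "feature maps" (plain float lists); nearest-neighbour resizing stands in for the branch's up/downsampling and the 1x1 convs are omitted, so this only illustrates the resample-then-add structure.

```python
def resample(feat, size):
    # nearest-neighbour resize of a 1-D feature to the target length,
    # standing in for the branch's upsampling/downsampling
    return [feat[int(i * len(feat) / size)] for i in range(size)]

def fuse_levels(levels, size):
    # bring every FPN level to one common resolution, combine by element-wise add
    out = [0.0] * size
    for feat in levels:
        for i, v in enumerate(resample(feat, size)):
            out[i] += v
    return out
```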

[Figure: structure of the semantic segmentation branch]
Loss function: HTC is trained end-to-end. The box branch performs object classification and bounding-box regression for all RoIs, the mask branch predicts a pixel-level mask for each positive RoI, and the semantic segmentation branch predicts the semantic segmentation of the whole image. The overall loss is:

$L = \sum_{t=1}^{T} \alpha_t \left( L_{bbox}^{t} + L_{mask}^{t} \right) + \beta L_{seg}$

Here $L_{bbox}^{t}$ is the box-branch loss at stage $t$, mainly a classification loss plus a regression loss, consistent with Cascade R-CNN; $L_{mask}^{t}$ is the mask-branch loss at stage $t$, mainly a binary cross-entropy loss; and $L_{seg}$ is the semantic segmentation loss, mainly a cross-entropy loss. $\alpha$ and $\beta$ are balancing weights; in the experiments $\alpha$ takes the values $[1, 0.5, 0.25]$ and $\beta = 1$.
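The weighted combination is straightforward to sketch; the per-stage loss values fed in below are placeholder numbers, not real measurements, while the weights are the ones reported above.

```python
def htc_loss(bbox_losses, mask_losses, seg_loss,
             alphas=(1.0, 0.5, 0.25), beta=1.0):
    """L = sum_t alpha_t * (L_bbox^t + L_mask^t) + beta * L_seg."""
    stage_sum = sum(a * (lb + lm)
                    for a, lb, lm in zip(alphas, bbox_losses, mask_losses))
    return stage_sum + beta * seg_loss
```

With the default weights, later stages contribute progressively less, matching the Cascade R-CNN convention.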

Selected Experiments

For more experiments and details, please refer to the original article

The figure below compares HTC with the SOTA methods of the time; HTC improves performance significantly.
[Figure: comparison with state-of-the-art methods]
The figure below shows the gains from HTC's individual modules: interleaved prediction adds 0.2 points, while mask information flow and semantic feature fusion add 0.6 points each.
[Figure: ablation of HTC components]

HTC Competition Tricks and Extensions

The following table shows how the author gradually improved the accuracy of his scheme in the COCO competition:

  • Adding deformable convolutions in the last stage of ResNet improves overall mAP by 0.6;
  • Replacing BN with SyncBN in the backbone and heads improves by 1.2;
  • Multi-scale training, with the shorter side sampled from [400, 1400] and the longer side fixed at 1600, improves by 1.8;
  • Replacing the backbone with SENet-154 (the best single-model performance) improves by 1.8;
  • Multi-scale testing with 5 scales plus horizontal flipping, ensembling the results of (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100);
  • Model ensembling of SENet-154, ResNeXt-101 (64x4d + 32x8d), DPN-107 and FishNet improves by 1.6 points.

[Table: step-by-step improvements in the COCO competition]
The author also reports the performance of modules commonly used for detection and segmentation at the time, including ASPP, PAFPN, GCN, PrRoIPool and Soft-NMS.

[Table: performance of commonly used modules]

Review

HTC is an improvement on Cascade R-CNN, proposed for instance segmentation. The original Cascade R-CNN is designed mainly for object detection, so its gain in mask AP is limited. HTC improves on it in three respects: first, it drops the parallel prediction of box and mask, instead detecting the box first and then predicting the mask from the refined box; second, it constructs an information flow between the mask branches of different stages; finally, it adds an extra semantic segmentation branch to provide complementary features.

In addition, this post lists some tricks and comparative experiments that may offer ideas for similar competitions and industrial applications.

That said, when I tried the HTC model in an instance segmentation competition, its performance was worse than Mask R-CNN's. I am not sure why; however, time and resources were limited, so I did little hyperparameter tuning.

Origin blog.csdn.net/qq_36560894/article/details/123093620