Introduction
Cascade is a powerful architecture that boosts performance on many tasks, as Cascade R-CNN showed for object detection, but there was previously no good way to carry it over to instance segmentation. Naively combining Cascade R-CNN with Mask R-CNN brings only a limited gain in mask AP (1.8%), while bbox AP improves by 3.5%.
The authors argue that an important reason for this gap is the poor information flow between the mask branches of different stages in Cascade R-CNN: in later stages, the mask branch benefits only from more accurate bounding boxes and has no direct connection to the mask predictions of earlier stages, as shown in the figure below. To solve this problem, the paper proposes HTC (Hybrid Task Cascade) for instance segmentation. The core idea is to interweave cascade and multi-tasking at each stage to improve the information flow, and to exploit spatial context to further improve accuracy. As shown in the figure below, HTC combines bbox regression and mask prediction in a multi-task manner at each stage, and builds a direct connection between the mask branches of different stages: the mask features of each stage are encoded and passed on to the next stage.
For object detection, contextual information provides important cues for localization and classification, so HTC also adds a fully convolutional branch for semantic segmentation. This branch encodes contextual information not only from foreground instances but also from background regions, further improving the accuracy of both bounding boxes and instance masks.
Finally, combined with a stronger backbone and a number of tricks, HTC won the 2018 COCO object detection challenge.
Main method
The figure above shows the evolution from Cascade Mask R-CNN to HTC: a) the naive Cascade Mask R-CNN; b) drop the parallel structure and instead perform bbox regression and mask prediction alternately; c) build connections between the mask branches (the mask features of the previous stage are fed into the next stage); d) add an extra semantic segmentation branch and merge it with the box and mask branches to capture more contextual information. The evolution of each module is described in more detail below.
Figure a): the directly fused Cascade Mask R-CNN can be expressed by the formula below. Box and mask predictions are performed in parallel at each stage, and the boxes of the previous stage serve as proposals for the next stage.
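The formula itself appears to have been an image in the original post; it can be reconstructed from the symbol definitions that follow:

$$x^{box}_t = P(x, r_{t-1}), \qquad r_t = B_t(x^{box}_t)$$
$$x^{mask}_t = P(x, r_{t-1}), \qquad m_t = M_t(x^{mask}_t)$$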
Here $t$ denotes the $t$-th stage and $P$ the pooling operation (RoI Align or RoI pooling); $x^{box}_t$ is the box-branch feature extracted by pooling at stage $t$, $B_t$ is the box branch of stage $t$, and $r_t$ is its box prediction; $x^{mask}_t$ is the mask-branch feature extracted by pooling at stage $t$, $M_t$ is the mask branch of stage $t$, and $m_t$ is its mask prediction.
Figure b): the drawback of a) is that the box and mask branches both take only the box output of the previous stage; there is no direct interaction between them within a stage. An interleaved structure is therefore introduced, expressed by the formula below: each stage first performs box prediction, and then performs mask segmentation based on the predicted box.
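A reconstruction of the missing formula, consistent with the definitions above; the only change from a) is that the mask branch now pools with the current stage's box $r_t$ instead of $r_{t-1}$:

$$x^{box}_t = P(x, r_{t-1}), \qquad r_t = B_t(x^{box}_t)$$
$$x^{mask}_t = P(x, r_t), \qquad m_t = M_t(x^{mask}_t)$$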
Figure c): b) still introduces no information flow between the mask branches of different stages. The authors first analyze an important factor in the success of Cascade R-CNN: the input features of the box branch at each stage are jointly determined by the box output of the previous stage and the backbone output. Following this design principle, information flow is introduced between the mask branches, formulated as follows.
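A reconstruction of the missing formula (the box branch is unchanged from b)):

$$x^{mask}_t = P(x, r_t), \qquad m_t = M_t\big(F(x^{mask}_t,\, m^{-}_{t-1})\big)$$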
Here $m^{-}_{t-1}$ denotes the intermediate features of $M_{t-1}$, and $F$ is a function that fuses the features of the current stage with those of the previous stage.
The concrete implementation is as follows: $m^{-}_{t-1}$ is the output of the 4 conv layers of the previous-stage mask branch $M_{t-1}$ (i.e. the features just before its deconv layer). It is passed through a 1x1 conv and combined with $x^{mask}_t$ by element-wise addition, after which 4 conv layers + a deconv layer produce the segmentation prediction. Likewise, the current stage's $m^{-}_{t}$ is passed on to the next stage in the same way.
With this design, adjacent mask branches interact directly, and the mask features of different stages are no longer isolated; all of them receive supervision through the loss.
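The fusion step above can be sketched in PyTorch as follows. This is a hypothetical minimal re-implementation for illustration, not the official HTC code; the channel count (256) and the 14x14 RoI resolution are assumptions.

```python
import torch
import torch.nn as nn


class MaskInfoFlow(nn.Module):
    """Illustrative sketch of the mask information flow (not the official
    HTC code). The previous stage's intermediate mask feature m_prev (the
    output of its 4 conv layers, before the deconv) is embedded by a 1x1
    conv and added element-wise to the current stage's pooled RoI feature
    x_mask; 4 conv layers and a deconv then follow."""

    def __init__(self, channels=256):
        super().__init__()
        # 1x1 conv embedding of the previous stage's mask feature
        self.embed = nn.Conv2d(channels, channels, kernel_size=1)
        # the current stage's 4 conv layers
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(4)
        ])
        # deconv (x2 upsampling) before the final mask prediction
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)

    def forward(self, x_mask, m_prev):
        fused = x_mask + self.embed(m_prev)  # F(x_t^mask, m_{t-1}^-): element-wise add
        feat = self.convs(fused)             # this is m_t^-, forwarded to the next stage
        return self.deconv(feat), feat


# shape check on 14x14 pooled RoI features with 256 channels
x = torch.randn(2, 256, 14, 14)
out, m_cur = MaskInfoFlow()(x, x)
print(tuple(out.shape), tuple(m_cur.shape))  # (2, 256, 28, 28) (2, 256, 14, 14)
```

Note that `feat` is returned alongside the upsampled output so the next stage can consume it, which is exactly what makes the mask features of different stages interact.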
Figure d): jointly training strongly related tasks in a cascade can improve the feature representation and benefit the final task. The authors therefore add an extra fully convolutional branch for semantic segmentation. Its features act as strong complementary features for the box and mask branches, helping to further distinguish foreground from background. With this branch added, the box and mask branches apply the pooling operation not only to the backbone features but also to the semantic segmentation features, and combine the two by element-wise addition as their input features. The formula is as follows:
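A reconstruction of the missing formula, consistent with the definitions before and after it:

$$x^{box}_t = P(x, r_{t-1}) + P(S(x), r_{t-1}), \qquad r_t = B_t(x^{box}_t)$$
$$x^{mask}_t = P(x, r_t) + P(S(x), r_t), \qquad m_t = M_t\big(F(x^{mask}_t,\, m^{-}_{t-1})\big)$$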
Here $S(x)$ denotes the features of the semantic segmentation branch.
The semantic segmentation branch is implemented as follows. First, the features of the different FPN levels are fused: each level is passed through a 1x1 conv, up-/down-sampled to a common resolution, and combined by element-wise addition. After 4 conv layers, a simple 1x1 conv produces the semantic segmentation prediction, and another 1x1 conv produces the semantic segmentation features $S(x)$ used in the formula above.
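The branch can be sketched as follows; again a hypothetical illustration rather than the official code, with the channel count (256), the number of classes (183, assuming COCO-stuff labels) and the FPN level used as the common resolution all being assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticBranch(nn.Module):
    """Illustrative sketch of the semantic segmentation branch. Each FPN
    level goes through a 1x1 conv, is resized to one common resolution and
    summed; 4 convs follow, then one 1x1 conv predicts the segmentation
    logits and another 1x1 conv produces the feature S(x) that is fused
    into the box and mask branches."""

    def __init__(self, in_channels=256, num_classes=183, num_levels=5, fuse_level=1):
        super().__init__()
        self.fuse_level = fuse_level  # FPN level whose resolution the others are resized to
        self.lateral = nn.ModuleList(
            [nn.Conv2d(in_channels, in_channels, 1) for _ in range(num_levels)])
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU())
            for _ in range(4)
        ])
        self.seg_pred = nn.Conv2d(in_channels, num_classes, 1)  # segmentation logits
        self.seg_feat = nn.Conv2d(in_channels, in_channels, 1)  # feature S(x)

    def forward(self, fpn_feats):
        target_size = fpn_feats[self.fuse_level].shape[-2:]
        fused = 0
        for lat, feat in zip(self.lateral, fpn_feats):
            feat = lat(feat)  # 1x1 conv per level
            if feat.shape[-2:] != target_size:
                # up-/down-sample to the common resolution
                feat = F.interpolate(feat, size=target_size, mode="bilinear",
                                     align_corners=False)
            fused = fused + feat  # element-wise addition across levels
        x = self.convs(fused)
        return self.seg_pred(x), self.seg_feat(x)


# toy FPN pyramid with strides 4..64 on a 256x256 image
feats = [torch.randn(1, 256, 256 // s, 256 // s) for s in (4, 8, 16, 32, 64)]
logits, sx = SemanticBranch()(feats)
print(tuple(logits.shape), tuple(sx.shape))  # (1, 183, 32, 32) (1, 256, 32, 32)
```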
Loss function: HTC is trained end-to-end. The box branch performs object classification and box regression for all RoIs, the mask branch predicts a pixel-level mask for each positive RoI, and the semantic segmentation branch predicts the semantic segmentation of the whole image. The overall loss looks like this:
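A reconstruction of the missing formula, matching the term definitions that follow:

$$L = \sum_{t=1}^{T} \alpha_t \big(L^{t}_{bbox} + L^{t}_{mask}\big) + \beta\, L_{seg}$$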
Here $L^{t}_{bbox}$ is the box-branch loss of stage $t$, consisting of a classification loss and a regression loss, identical to Cascade R-CNN; $L^{t}_{mask}$ is the mask-branch loss of stage $t$, a binary cross-entropy loss; and $L_{seg}$ is the semantic segmentation branch loss, a cross-entropy loss. $\alpha_t$ and $\beta$ are balancing weights; in the experiments $\alpha = [1, 0.5, 0.25]$ and $\beta = 1$.
Selected experiments
For more experiments and details, please refer to the original paper.
The figure below compares HTC with the SOTA methods of the time; HTC improves performance significantly.
The figure below shows the gains brought by the individual HTC modules: interleaved prediction adds 0.2 points, and the mask information flow and the semantic feature fusion each add 0.6 points.
Competition tricks and extensions
The following table shows how the authors gradually improved the accuracy of their solution in the COCO competition:
- Add deformable convolutions in the last stage of ResNet: overall mAP +0.6;
- Replace BN with SyncBN in the backbone and heads: +1.2;
- Multi-scale training (shorter side sampled from [400, 1400], longer side fixed at 1600): +1.8;
- Replace the backbone with SENet-154 (best single-model performance): +1.8;
- Multi-scale testing with 5 scales plus horizontal flipping, ensembling the results at (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100);
- Multi-model ensemble of SENet-154, ResNeXt-101 (64x4d + 32x8d), DPN-107 and FishNet: +1.6 points.
The authors also report results for modules commonly used in detection and segmentation at the time, including ASPP, PAFPN, GCN, PrRoIPool and soft-NMS.
Review
HTC is an improvement on Cascade R-CNN, proposed for instance segmentation. The original Cascade R-CNN was designed mainly for object detection, so its gain in mask AP is limited. HTC improves on it in three ways: first, it drops the parallel prediction of box and mask, instead detecting the box first and then predicting the mask based on the refined box; second, it constructs an information flow between the mask branches of different stages; finally, it adds an extra semantic segmentation branch that provides complementary features.
In addition, the article provides some tricks and comparative experiments that can offer ideas for similar competitions and industrial applications.
However, when I tried the HTC model in an instance segmentation competition, it performed worse than Mask R-CNN. It is not clear why; with limited time and resources, I did not do much parameter tuning.