TOOD: Task-aligned One-stage Object Detection principle and code analysis

Paper: TOOD: Task-aligned One-stage Object Detection

Code: https://github.com/fcjian/TOOD

Existing problems

Object detection consists of two sub-tasks: classification and localization. The features learned for classification mainly focus on the key or salient regions of an object, while localization focuses on the object's boundaries in order to locate the entire object accurately. Because the two tasks have different learning mechanisms, the spatial distributions of the features they learn may differ, which leads to a certain degree of misalignment when two independent branches are used for prediction. As shown in the figure below, the first row of the Result column is ATSS's prediction for the dining table. The red and green patches are the anchor with the highest confidence and the anchor with the largest IoU, respectively, and the red and green boxes are their corresponding predicted bounding boxes. The bounding box predicted by the anchor with the highest classification score has a larger IoU with the pizza than with the dining table, while the anchor whose prediction has the largest IoU with the target has a very low classification score, reflecting the misalignment between the two tasks. The second row shows TOOD's prediction, where the anchor with the highest classification score is also the one with the largest IoU.

The innovations of this paper

To solve the problems above, this paper proposes a task-aligned one-stage object detection model, TOOD (Task-aligned One-stage Object Detection), which aligns the two tasks more accurately by designing a new head and an alignment-oriented learning method. The details are as follows.

Task-aligned head. Unlike traditional one-stage object detection models, where classification and localization use two parallel branches, this paper designs a task-aligned head (T-head) to enhance the interaction between the two tasks so that their predictions remain consistent. The T-head computes task-interactive features, makes predictions through the newly proposed Task-Aligned Predictor (TAP), and then aligns the spatial distributions of the two predictions according to the information provided by task alignment learning.

Task alignment learning. To further alleviate the misalignment problem, the author proposes Task Alignment Learning (TAL) to pull the optimal anchors of the two tasks closer together. This is achieved through a new sample assignment method and a task-aligned loss function: the former assigns labels by computing a task alignment metric for each anchor, and the latter gradually unifies the anchors best suited to the two tasks during training. As a result, at inference time the anchor with the highest classification score also has the highest localization accuracy.

Method introduction

This paper aligns the two sub-tasks through the newly designed T-head and TAL; as shown in the figure below, the two components work together to improve the alignment of the two tasks. Specifically, the T-head first makes classification and localization predictions on the FPN features; TAL then computes an alignment metric that measures how well the two tasks are aligned; finally, the T-head uses the computed alignment metric to automatically adjust the classification probabilities and localization predictions.

Task-aligned Head

As shown in the figure above, (a) is the head structure of a typical one-stage detector, and (b) is the T-head proposed in this paper, which consists of a simple feature extractor and two TAPs. To enhance the interaction between classification and localization, the feature extractor learns task-interactive features from multiple convolutional layers, as shown in the blue part in the middle of (b). \(X^{fpn} \in \mathbb{R}^{H\times W\times C}\) is the FPN feature, and the feature extractor uses \(N\) consecutive convolutional layers with activation functions to compute the task-interactive features, as shown below.
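
In the paper this is Eq. (1):

\[
X_{k}^{inter}=\begin{cases}\delta(conv_{k}(X^{fpn})), & k=1\\ \delta(conv_{k}(X_{k-1}^{inter})), & k>1\end{cases},\quad \forall k\in\{1,2,...,N\}
\]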

Where \(conv_{k}\) is the \(k\)-th convolutional layer and \(\delta\) is the ReLU function. In this way, a single branch in the head extracts rich multi-scale features from the FPN features, and these interactive features are then fed into two TAPs to align classification and localization.

Task-aligned Predictor (TAP)

Due to the single-branch design, the task-interactive features inevitably introduce a certain degree of feature conflict between the two tasks, because classification and localization have different goals and thus focus on different types of features, e.g. different levels and receptive fields. Therefore, this paper proposes a layer attention mechanism that performs task decomposition by dynamically computing task-specific features at the layer level. As shown in Figure 3(c), the task-specific features are computed separately for classification and localization:
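
That is, Eq. (2) of the paper:

\[
X_{k}^{task}=\omega_{k}\cdot X_{k}^{inter},\quad \forall k\in\{1,2,...,N\}
\]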

Where \(\omega_{k}\) is the \(k\)-th element of the learned layer attention weight \(\omega \in \mathbb{R}^{N}\). \(\omega\) is computed from the cross-layer task-interactive features and can capture the interaction between layers:
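
That is, Eq. (3) of the paper:

\[
\omega=\sigma(fc_{2}(\delta(fc_{1}(x^{inter}))))
\]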

Where \(fc_{1}\) and \(fc_{2}\) are two fully connected layers, \(\sigma\) is the sigmoid activation function, \(x^{inter}\) is obtained by applying average pooling to \(X^{inter}\), and \(X^{inter}\) is obtained by concatenating the \(X^{inter}_{k}\).

Finally, the classification and localization predictions are computed from each task's \(X^{task}\) as follows:
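
That is, Eq. (4) of the paper:

\[
Z^{task}=conv_{2}(\delta(conv_{1}(X^{task})))
\]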

Among them, \(X^{task}\) is obtained by concatenating the \(X^{task}_{k}\), and \(conv_{1}\) is a 1x1 convolution for dimensionality reduction. \(Z^{task}\) is converted by a sigmoid into the classification score \(P\in \mathbb{R}^{H\times W\times 80}\), or converted by distance-to-bbox into the localization prediction \(B\in \mathbb{R}^{H\times W\times 4}\).

Prediction alignment.

In the prediction stage, the two tasks are further aligned by adjusting the spatial distributions of the two predictions. Unlike previous models that use a centerness branch or an IoU branch, which can only adjust the classification score based on either the classification or the localization features, this paper considers both tasks by aligning the two predictions with the task-interactive features. Note that the author applies the alignment to the two tasks separately. As shown in figure (c) above, a spatial probability map \(M\in \mathbb{R}^{H\times W\times 1}\) is used to adjust the classification prediction:
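
That is, Eq. (5) of the paper:

\[
P^{align}=\sqrt{P\times M}
\]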

where \(M\) is calculated from the task interaction features, so that the consistency of the two tasks can be learned at each spatial location.

Meanwhile, to align the localization prediction, a spatial offset map \(O\in \mathbb{R}^{H\times W\times 8}\) is also learned from the task-interactive features; specifically, the learned spatial offsets enable the most aligned anchor to identify the best boundary predictions around it:
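
That is, Eq. (6) of the paper:

\[
B^{align}(i,j,c)=B\big(i+O(i,j,2\times c),\, j+O(i,j,2\times c+1),\, c\big)
\]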

The index \((i,j,c)\) denotes position \((i,j)\) in channel \(c\) of a tensor. Equation (6) is implemented by bilinear interpolation, and the offsets of each channel are learned independently, which means each boundary of an object has its own learned offset. This makes the predictions of the four boundaries more accurate, because each boundary can be learned from the most accurate anchor near it.

The two alignment maps \(M\) and \(O\) are computed as follows
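
In the paper, both are computed from \(X^{inter}\):

\[
M=\sigma\big(conv_{2}(\delta(conv_{1}(X^{inter})))\big),\qquad O=conv_{4}\big(\delta(conv_{3}(X^{inter}))\big)
\]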

Among them, \(conv_{1}\) and \(conv_{3}\) are two 1x1 convolutions for dimensionality reduction, and \(M\) and \(O\) are learned through Task Alignment Learning (TAL), which is introduced next.

Task Alignment Learning

TAL differs from previous methods in two ways. First, from the perspective of task alignment, it dynamically selects high-quality anchors based on a specifically designed metric. Second, it considers both anchor assignment and anchor weighting. Concretely, it consists of a sample assignment strategy and new loss functions designed specifically for aligning the two tasks.

Task-aligned Sample Assignment

To cope with NMS, the sample assignment of a training example should satisfy the following criteria: (1) both the classification and localization predictions of a well-aligned anchor should be accurate; (2) a misaligned anchor should have a low classification score so that it can be suppressed during NMS. This paper designs an anchor alignment metric to measure the task alignment of anchors, and integrates it into the sample assignment and loss functions to dynamically refine the prediction of each anchor.

Anchor alignment metric.

The classification score and the IoU between the predicted bounding box and the ground truth indicate the prediction quality of the two tasks, respectively, so the author combines the two into a new alignment metric, as follows
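
In the notation of the paper:

\[
t=s^{\alpha}\times u^{\beta}
\]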

Where \(s\) and \(u\) represent the classification score and IoU, respectively, and \(\alpha\) and \(\beta\) are the weight coefficients used to control the influence of the two tasks on the task alignment metrics.

Training sample assignment.

The author incorporates the task alignment metric into sample assignment. Specifically, for each target, the \(m\) anchors with the largest \(t\) values are selected as positive samples, and the rest are negative samples.

Task-aligned Loss

Classification objective.

The author replaces the binary classification label 1 of positive samples with the \(t\) value. However, the author finds that when \(\alpha\) and \(\beta\) increase, \(t\) becomes very small and the network cannot converge, so a normalized value \(\hat{t}\) is used instead, where the maximum of \(\hat{t}\) for each object equals the maximum IoU for that object. The cross-entropy loss with the replaced labels is as follows
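
In the paper, the positive-sample term becomes:

\[
L_{cls\_pos}=\sum_{i=1}^{N_{pos}}BCE(s_{i},\hat{t}_{i})
\]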

To alleviate the sample imbalance problem, the author adopts a focal-loss formulation; after replacing the corresponding labels, the final classification loss is as follows
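
In the paper it takes the form:

\[
L_{cls}=\sum_{i=1}^{N_{pos}}\left|\hat{t}_{i}-s_{i}\right|^{\gamma}BCE(s_{i},\hat{t}_{i})+\sum_{j=1}^{N_{neg}}s_{j}^{\gamma}BCE(s_{j},0)
\]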

Where \(j\) denotes the \(j\)-th of the \(N_{neg}\) negative anchors, and \(\gamma\) is the focusing parameter.

Localization objective.

As with classification, \(\hat{t}\) is also used to re-weight the regression loss. The re-weighted GIoU loss is as follows
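
In the paper it is written as:

\[
L_{reg}=\sum_{i=1}^{N_{pos}}\hat{t}_{i}\,L_{GIoU}(b_{i},\bar{b}_{i})
\]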

Where \(b\) and \(\bar{b}\) represent the predicted bounding box and the corresponding gt box, respectively.

Code analysis

Assume batch_size=2 and a model input shape of (2, 3, 300, 300), with backbone=ResNet-50 and neck=FPN. The FPN outputs are [(2,256,38,38), (2,256,19,19), (2,256,10,10), (2,256,5,5), (2,256,3,3)]. The head then iterates over each FPN level and computes the classification and localization predictions separately. The TOOD head below is implemented in tood_head.py. Taking the first level (2, 256, 38, 38) as an example, the task-interactive features are first extracted according to formula (1). The code is as follows, where self.inter_convs is 6 consecutive conv3x3-GN-ReLU blocks, finally producing \(X_{k}^{inter}\).

# extract task interactive features
inter_feats = []
for inter_conv in self.inter_convs:
    x = inter_conv(x)
    inter_feats.append(x)  
# [(2,256,38,38),(2,256,38,38),(2,256,38,38),(2,256,38,38),(2,256,38,38),(2,256,38,38)]
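
For context, self.inter_convs can be built roughly as below. This is only a sketch based on the description above (6 conv3x3-GN-ReLU blocks with 256 channels); the GN group count and ConvModule defaults are assumptions, not the verbatim mmdet code.

import torch.nn as nn
from mmcv.cnn import ConvModule

feat_channels = 256
inter_convs = nn.ModuleList()
for _ in range(6):  # N = 6 stacked conv3x3-GN-ReLU blocks
    inter_convs.append(
        ConvModule(
            feat_channels,
            feat_channels,
            3,
            stride=1,
            padding=1,
            norm_cfg=dict(type='GN', num_groups=32, requires_grad=True)))  # GN + ConvModule's default ReLU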

Then the \(X_{k}^{inter}\) are concatenated into \(X^{inter}\), \(x^{inter}\) is obtained via average pooling, and task decomposition is performed for classification and localization separately, corresponding to formulas (2) to (4).

feat = torch.cat(inter_feats, 1)  # X^{inter}_{k} -> X^{inter}, (2,1536,38,38),1536=256x6
# task decomposition
avg_feat = F.adaptive_avg_pool2d(feat, (1, 1))  # x^{inter}, (2,1536,1,1)
cls_feat = self.cls_decomp(feat, avg_feat)  # (2,256,38,38)
reg_feat = self.reg_decomp(feat, avg_feat)  # (2,256,38,38)

Task decomposition includes layer attention and dimensionality reduction. Note that the task-interactive features are the outputs of 6 convolutional layers with 256 channels each, so the concatenated feature has 256x6=1536 channels, but layer attention assigns a single weight per layer, so the weight shape is (2, 6, 1, 1).

The mmdet implementation first multiplies the layer attention weights with the weight matrix of the dimensionality-reduction convolution \(conv_{1}\), and then applies the combined weights to the task-interactive features, whereas PaddleDetection performs these two steps separately (tood_head.py). Below, the Paddle-style implementation is reproduced to obtain tmp_feat, and its result matches the commented-out part of the following code. Note that atol=1e-2 must be set in torch.allclose; printing the results shows a difference in the fourth decimal place, which is probably due to the frameworks' underlying implementations.

In addition, \(conv_{2}\) of formula (4) is absent in both the mmdet and Paddle implementations.

class TaskDecomposition(nn.Module):
    """Task decomposition module in task-aligned predictor of TOOD.

    Args:
        feat_channels (int): Number of feature channels in TOOD head.
        stacked_convs (int): Number of conv layers in TOOD head.
        la_down_rate (int): Downsample rate of layer attention.
        conv_cfg (dict): Config dict for convolution layer.
        norm_cfg (dict): Config dict for normalization layer.
    """

    def __init__(self,
                 feat_channels,
                 stacked_convs,
                 la_down_rate=8,  # 48
                 conv_cfg=None,
                 norm_cfg=None):
        super(TaskDecomposition, self).__init__()
        self.feat_channels = feat_channels  # 256
        self.stacked_convs = stacked_convs  # 6
        self.in_channels = self.feat_channels * self.stacked_convs  # 256x6=1536
        self.norm_cfg = norm_cfg
        self.layer_attention = nn.Sequential(
            nn.Conv2d(self.in_channels, self.in_channels // la_down_rate, 1),  # 1536//48=32
            nn.ReLU(inplace=True),
            nn.Conv2d(
                self.in_channels // la_down_rate,  # 32
                self.stacked_convs,  # 6
                1,
                padding=0), nn.Sigmoid())

        self.reduction_conv = ConvModule(
            self.in_channels,
            self.feat_channels,
            1,
            stride=1,
            padding=0,
            conv_cfg=conv_cfg,
            norm_cfg=norm_cfg,
            bias=norm_cfg is None)

    def init_weights(self):
        for m in self.layer_attention.modules():
            if isinstance(m, nn.Conv2d):
                normal_init(m, std=0.001)
        normal_init(self.reduction_conv.conv, std=0.01)

    def forward(self, feat, avg_feat=None):  # (2,1536,38,38),(2,1536,1,1)
        b, c, h, w = feat.shape
        if avg_feat is None:
            avg_feat = F.adaptive_avg_pool2d(feat, (1, 1))
        weight = self.layer_attention(avg_feat)  # (2,6,1,1)


        # tmp_weight = weight.unsqueeze(-1)  # (2,6,1,1,1)
        # tmp_feat = feat.reshape(2, 6, 256, 38, 38)
        # tmp_feat = tmp_feat * tmp_weight  # (2,6,256,38,38)
        # # tmp_feat = tmp_feat.reshape(2, 1536, 38, 38)
        # tmp_feat = tmp_feat.flatten(1, 2)  # (2,1536,38,38)
        # tmp_feat = self.reduction_conv(tmp_feat)  # (2,256,38,38)


        # here we first compute the product between layer attention weight and
        # conv weight, and then compute the convolution between new conv weight
        # and feature map, in order to save memory and FLOPs.
        conv_weight = weight.reshape(
            b, 1, self.stacked_convs,
            1) * self.reduction_conv.conv.weight.reshape(
                1, self.feat_channels, self.stacked_convs, self.feat_channels)
        # (2,6,1,1)->(2,1,6,1) * (256,1536,1,1)->(1,256,6,256) = (2,256,6,256)
        conv_weight = conv_weight.reshape(b, self.feat_channels,
                                          self.in_channels)  # (2,256,1536)
        feat = feat.reshape(b, self.in_channels, h * w)  # (2,1536,1444)
        feat = torch.bmm(conv_weight, feat).reshape(b, self.feat_channels, h,
                                                    w)  # (2,256,1444)->(2,256,38,38)
        if self.norm_cfg is not None:
            feat = self.reduction_conv.norm(feat)
        feat = self.reduction_conv.activate(feat)  # ReLU(inplace=True), (2,256,38,38)
        # note: conv2 from formula (4) appears to be missing here


        # t = torch.allclose(tmp_feat, feat, atol=1e-2)
        # print(tmp_feat[0][0][0])
        # print(feat[0][0][0])
        # print(t)
        # exit()

        return feat

Next, the classification prediction and the classification spatial probability map are computed, where sigmoid_geometric_mean applies a sigmoid to both inputs, multiplies them, and takes the square root, corresponding to formula (5).

# cls prediction and alignment
cls_logits = self.tood_cls(cls_feat)  # P before sigmoid, (2,20,38,38)
cls_prob = self.cls_prob_module(feat)  # M before sigmoid, (2,1,38,38)
cls_score = sigmoid_geometric_mean(cls_logits, cls_prob)  # P^{align}, (2,20,38,38)
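
A minimal functional sketch of sigmoid_geometric_mean is shown below; mmdet wraps the same computation in a custom autograd function for a numerically stable backward pass, so this only mirrors the forward behaviour.

import torch

def sigmoid_geometric_mean_sketch(cls_logits: torch.Tensor, cls_prob: torch.Tensor) -> torch.Tensor:
    # formula (5): P^{align} = sqrt(P * M), with P = sigmoid(cls_logits), M = sigmoid(cls_prob)
    return (cls_logits.sigmoid() * cls_prob.sigmoid()).sqrt()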

Then the localization prediction and the localization spatial offset map are computed, where deform_sampling adjusts the localization prediction according to the learned spatial offset map; the concrete implementation uses a deformable convolution whose kernel is all ones, which is equivalent to bilinear interpolation at the offset locations.

# reg prediction and alignment
if self.anchor_type == 'anchor_free':
    reg_dist = scale(self.tood_reg(reg_feat).exp()).float()  # (2,4,38,38); why exp? it keeps the predicted distances positive
    reg_dist = reg_dist.permute(0, 2, 3, 1).reshape(-1, 4)  # (2,4,38,38)->(2,38,38,4)->(2888,4)
    reg_bbox = distance2bbox(
        self.anchor_center(anchor) / stride[0],
        reg_dist).reshape(b, h, w, 4).permute(0, 3, 1,
                                              2)  # (2888,4)->(2,38,38,4)->(2,4,38,38)
reg_offset = self.reg_offset_module(feat)  # O, (2,8,38,38)
bbox_pred = self.deform_sampling(reg_bbox.contiguous(),
                                 reg_offset.contiguous())  # (2,4,38,38)
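
The bilinear sampling can be realized with a depthwise deformable convolution whose 1x1 kernel is all ones. mmdet's deform_sampling follows this idea; the sketch below only illustrates it and may differ from the exact call in tood_head.py.

import torch
from mmcv.ops import deform_conv2d

def deform_sampling_sketch(feat: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    # feat: (b, 4, h, w) bbox predictions; offset: (b, 8, h, w), one (dy, dx) pair per bbox side
    b, c, h, w = feat.shape
    weight = feat.new_ones(c, 1, 1, 1)  # all-ones 1x1 kernel: pure resampling, no channel mixing
    # positional args: stride=1, padding=0, dilation=1, groups=c, deform_groups=c,
    # so each of the 4 channels is sampled with its own offset pair
    return deform_conv2d(feat, offset, weight, 1, 0, 1, c, c)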

After adjusting the classification and localization predictions to obtain the model's final outputs, the next step is to assign positive and negative samples and compute the losses. In task_aligned_assigner.py, the \(t\) value is computed for each object and the \(m\) anchors with the largest \(t\) values are selected as positive samples, with the rest as negatives. The code is as follows, where m=13.

# select top-k bboxes as candidates for each gt
alignment_metrics = bbox_scores ** alpha * overlaps ** beta  # t
topk = min(self.topk, alignment_metrics.size(0))  # 13
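
A self-contained toy version of this selection step is sketched below; the function name and interface are illustrative and do not match mmdet's assigner API, and details such as ties or restricting candidates to anchors inside the gt box are omitted.

import torch

def assign_topk_positives(bbox_scores: torch.Tensor,
                          overlaps: torch.Tensor,
                          alpha: float = 1.0,
                          beta: float = 6.0,
                          topk: int = 13) -> torch.Tensor:
    """bbox_scores/overlaps: (num_anchors, num_gt). Returns a bool mask of positive anchors."""
    t = bbox_scores.pow(alpha) * overlaps.pow(beta)  # alignment metric t = s^alpha * u^beta
    k = min(topk, t.size(0))
    _, topk_idxs = t.topk(k, dim=0, largest=True)    # top-k anchors for each gt (column-wise)
    is_pos = torch.zeros_like(t, dtype=torch.bool)
    is_pos[topk_idxs, torch.arange(t.size(1), device=t.device)] = True
    return is_pos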

After positive and negative samples are assigned, the losses are computed. Back in tood_head.py, the classification label and regression target for each anchor are computed in _get_target_single(), where the \(t\) value is normalized to obtain \(\hat{t}\). The code is as follows

pos_norm_alignment_metrics = pos_alignment_metrics / (
        pos_alignment_metrics.max() + 10e-8) * pos_ious.max()  # normalized t

Note that in the mmdet implementation, ATSS label assignment with topk=9 and Focal Loss are used for the first 4 epochs. After that, label assignment is switched to the TaskAligned assignment proposed in this paper with topk=13, and the classification loss is switched to Quality Focal Loss. The localization loss is GIoU loss throughout.

As for the other hyper-parameters: the number of consecutive convolutional layers for extracting task-interactive features is \(N=6\); in the anchor alignment metric \(t\), \(\alpha=1\) and \(\beta=6\); and label assignment takes the \(m=13\) anchors with the largest \(t\) values as positive samples.

Experimental results

The following are some test images for the T-head+TAL proposed in this paper, compared with ATSS. It can be seen that T-head+TAL aligns the two predictions well: the prediction with the highest final classification score also has the largest IoU.

The following is a comparison with some current mainstream single-stage detection models and label assignment methods. TOOD has achieved SOTA results.


Original post: blog.csdn.net/ooooocj/article/details/128193646