Generalized Focal Loss (GFLv1) paper understanding and code analysis

GFLv1 was an early paper to point out the inconsistency between localization-quality estimation and classification across training and inference. The two are usually trained independently (YOLO has a separate objectness/confidence dimension that is decoupled from classification during training; FCOS and ATSS use centerness as the quality estimate, likewise decoupled from classification), yet at test time the two scores are multiplied together and used jointly for ranking. Moreover, the quality prediction is supervised only on positive samples, so for the large number of negative samples it is effectively undefined behavior. As a result, a true negative with a low classification score may be ranked ahead of a true positive simply because it predicts an implausibly high quality score.

Questions:

In summary, GFLv1 poses two problems:
1. The way localization quality is estimated and used is inconsistent between training and inference;
2. In complex scenes the Dirac delta (impulse-like) representation of box regression is not flexible enough, so localization is inaccurate.

Methods:

1. Localization quality representation

To address these problems, GFLv1 first merges the localization quality estimate directly into the classification score: the category vector is kept, but the score of each category now means the IoU with the GT box (a classification-IoU joint representation). With this representation, positive and negative samples are trained under the same target, and there is no longer a gap between training and testing.
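As a rough illustration of this joint representation (a minimal sketch, not the paper's code; num_classes, gt_class and iou_with_gt are made-up values), the per-anchor training target could be built like this:

import torch

# For a positive sample, the target vector is all zeros except at the
# ground-truth class, where it equals the IoU with the matched GT box.
num_classes = 80
gt_class = 3          # hypothetical ground-truth category id
iou_with_gt = 0.71    # hypothetical IoU between the predicted box and the GT box

target = torch.zeros(num_classes)
target[gt_class] = iou_with_gt   # soft label: the classification score should predict IoU
# Negative samples simply keep an all-zero target vector.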

2. General distribution

For the box representation, GFLv1 models each edge offset with a general (arbitrary) distribution rather than a Dirac delta, without imposing any prior constraints. This not only yields reliable and accurate predictions, but also exposes the underlying uncertainty: for ambiguous or occluded boundaries the learned distribution is flat and smooth, while for clear boundaries it is sharp.
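A small numeric sketch (the distributions are made up, not taken from the paper) of how two distributions over the same discrete range 0-16 can share an expectation while expressing very different certainty:

import torch

# Discrete bins 0..16, the regression range used in GFLv1.
bins = torch.arange(17, dtype=torch.float32)

# Sharp distribution: almost all mass on bin 8 -> a confident, clear boundary.
sharp = torch.zeros(17)
sharp[8] = 0.9
sharp[7] = 0.05
sharp[9] = 0.05

# Flat distribution: mass spread from bin 4 to 12 -> an ambiguous boundary.
flat = torch.zeros(17)
flat[4:13] = 1.0 / 9

print((sharp * bins).sum())  # expectation ~8.0, sharp and certain
print((flat * bins).sum())   # expectation ~8.0 as well, but far less certain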

In practice, combining the two strategies above raises an optimization problem. In conventional one-stage detectors the classification branch is optimized with Focal Loss, which is designed for discrete one-hot labels. Once the localization quality is merged into the classification score, the target becomes a class-specific continuous IoU score in [0, 1], and Focal Loss can no longer be applied directly. The paper therefore generalizes Focal Loss into GFL (Generalized Focal Loss), which can optimize against continuous-valued targets. GFL has two concrete forms: QFL (Quality Focal Loss) and DFL (Distribution Focal Loss). QFL keeps the focus on hard samples while regressing the continuous quality score of the corresponding category; DFL models the general distribution of the box offsets, providing richer information and more accurate localization.

Below, I will walk through the code to explain the idea and implementation of GFL.

Focal Loss (FL)

$$\mathrm{FL}(p) = -(1 - p_t)^{\gamma}\log(p_t), \qquad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$$

FL is mainly used to address the extreme imbalance between positive and negative samples in one-stage detectors. It consists of the standard cross-entropy term $-\log(p_t)$ and a scaling factor $(1-p_t)^{\gamma}$; the scaling factor down-weights easy samples, raising the share of hard samples in the loss and alleviating the imbalance problem.
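For reference, a minimal sketch of sigmoid focal loss under these definitions (an illustration only, not the mmdetection implementation), assuming pred holds raw logits and target holds float 0/1 labels of the same shape:

import torch.nn.functional as F

def focal_loss(pred, target, gamma=2.0):
    # pred: raw logits; target: float 0/1 labels of the same shape
    p = pred.sigmoid()
    ce = F.binary_cross_entropy_with_logits(pred, target, reduction='none')
    p_t = p * target + (1 - p) * (1 - target)   # probability of the true class
    return (1 - p_t).pow(gamma) * ce            # down-weight easy samples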

Quality Focal Loss (QFL)

Since FL only supports discrete $\{0, 1\}$ labels, it has to be extended before it can supervise the continuous label that combines classification and localization quality. First, the cross-entropy term $-\log(p_t)$ is expanded to its complete form $-((1-y)\log(1-\sigma) + y\log(\sigma))$; then the scaling factor is generalized to the absolute difference between the prediction $\sigma$ and the continuous label $y$ (here, the IoU between the anchor and its GT box), i.e. $|y-\sigma|^{\beta}$. Combining the two gives QFL:

$$\mathrm{QFL}(\sigma) = -|y - \sigma|^{\beta}\bigl((1 - y)\log(1 - \sigma) + y\log(\sigma)\bigr)$$

import torch.nn.functional as F


def quality_focal_loss(pred, target, beta=2.0):
    assert len(target) == 2, """target for QFL must be a tuple of two elements,
        including category label and quality label, respectively"""
    # label denotes the category id, score denotes the quality score
    label, score = target

    # negatives are supervised by 0 quality score
    pred_sigmoid = pred.sigmoid()
    scale_factor = pred_sigmoid
    zerolabel = scale_factor.new_zeros(pred.shape)
    loss = F.binary_cross_entropy_with_logits(
        pred, zerolabel, reduction='none') * scale_factor.pow(beta)

    # FG cat_id: [0, num_classes -1], BG cat_id: num_classes
    bg_class_ind = pred.size(1)
    pos = ((label >= 0) & (label < bg_class_ind)).nonzero().squeeze(1)
    pos_label = label[pos].long()
    # positives are supervised by bbox quality (IoU) score
    scale_factor = score[pos] - pred_sigmoid[pos, pos_label]
    loss[pos, pos_label] = F.binary_cross_entropy_with_logits(
        pred[pos, pos_label], score[pos],
        reduction='none') * scale_factor.abs().pow(beta)

    loss = loss.sum(dim=1, keepdim=False)
    return loss

In the code above, every position is first supervised with a quality score of 0 (the negative-sample target), so the QFL scaling factor reduces to $|0-\sigma|^{\beta} = \sigma^{\beta}$. Then the indices pos of the positive samples are found from label, and pos_label gives the category channel of pred to supervise. score is the quality score, i.e. the IoU between the anchor and its GT box. For positives the scaling factor becomes score[pos] - pred_sigmoid[pos, pos_label], i.e. $|y-\sigma|^{\beta}$, and the loss at those positions is overwritten to obtain the final loss.
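A quick sanity check of the function above with made-up shapes and values: pred is (N, num_classes) logits, label holds category ids with num_classes meaning background, and score holds the IoU targets of the positives:

import torch

num_classes = 4
pred = torch.randn(6, num_classes)                    # classification logits
label = torch.tensor([0, 2, 4, 4, 1, 4])              # 4 == background, so 3 positives
score = torch.tensor([0.8, 0.6, 0.0, 0.0, 0.9, 0.0])  # IoU targets (0 for negatives)

loss = quality_focal_loss(pred, (label, score), beta=2.0)
print(loss.shape)  # torch.Size([6]) -- one loss value per sample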

Distribution Focal Loss (DFL)

Instead of regressing a single value per box edge, the paper outputs n+1 values, each representing the probability of the corresponding discrete regression distance, and takes the expectation (a discrete integral) to obtain the final distance.
Statistics on the COCO data show that the regression distances fall roughly within 0-16, so each edge's single regression value is replaced by a 17-dimensional output corresponding to distances 0-16. In the code below, x is the regression prediction and self.project is the tensor [0, 1, ..., 16]. x is first normalized with softmax so that it sums to 1, which lets the expectation of project under x represent the regression distance; F.linear performs the matrix multiplication between x and project to compute that expectation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Integral(nn.Module):
    """Turn the discrete distribution over {0, ..., reg_max} into its expectation."""

    def __init__(self, reg_max=16):
        super(Integral, self).__init__()
        self.reg_max = reg_max
        # project = [0, 1, ..., reg_max], the candidate regression distances
        self.register_buffer('project',
                             torch.linspace(0, self.reg_max, self.reg_max + 1))

    def forward(self, x):
        # x: (N, 4 * (reg_max + 1)) distribution logits, four edges per box
        x = F.softmax(x.reshape(-1, self.reg_max + 1), dim=1)
        # expectation over the bins, reshaped back to (N, 4)
        x = F.linear(x, self.project.type_as(x)).reshape(-1, 4)
        return x
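A usage sketch with made-up shapes: for N positions the regression head outputs 4 * (reg_max + 1) logits, and Integral converts them into four distances (left, top, right, bottom):

integral = Integral(reg_max=16)
reg_logits = torch.randn(8, 4 * 17)   # 8 positions, 17 bins per edge
distances = integral(reg_logits)
print(distances.shape)                # torch.Size([8, 4]), each value in [0, 16]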

Considering that the learned distribution should be concentrated near the regression target $y$, the paper proposes DFL to force the network to increase the probabilities of the two bins $y_i$ and $y_{i+1}$ closest to $y$ (with $y_i \le y \le y_{i+1}$), as shown in Equation 6, where $S_i$ denotes the predicted probability of bin $y_i$ (in the code below, dis_left corresponds to $y_i$). Equation 6 can be understood as follows: if regression over the bins 0-16 is treated as a multi-class classification problem, then pred is the class prediction and dis_left, dis_right are the labels; DFL encourages pred to concentrate its mass on dis_left and dis_right, weighted by how close $y$ is to each. In other words, DFL forces the network to raise the probabilities of $y_i$ and $y_{i+1}$, the bins nearest to $y$.

$$\mathrm{DFL}(S_i, S_{i+1}) = -\bigl((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\bigr) \tag{6}$$

import torch.nn.functional as F


def distribution_focal_loss(pred, label):
    # label: continuous regression target in [0, reg_max]
    dis_left = label.long()          # nearest bin on the left
    dis_right = dis_left + 1         # nearest bin on the right
    # linear interpolation weights: the closer bin gets the larger weight
    weight_left = dis_right.float() - label
    weight_right = label - dis_left.float()
    loss = F.cross_entropy(pred, dis_left, reduction='none') * weight_left \
        + F.cross_entropy(pred, dis_right, reduction='none') * weight_right
    return loss
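A usage sketch with made-up values, continuing from the function above: a target of 5.3 spreads its supervision over bins 5 and 6 with weights 0.7 and 0.3:

import torch

pred = torch.randn(3, 17)                  # bin logits for 3 box edges
label = torch.tensor([5.3, 0.0, 15.9])     # continuous regression targets
loss = distribution_focal_loss(pred, label)
print(loss.shape)                          # torch.Size([3])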


Origin blog.csdn.net/litt1e/article/details/127257802