Zero-Basics Object Detection Series Learning Notes (2): A Detailed Explanation of the YOLO Series v3 Algorithm

Study notes for the Baidu PaddlePaddle zero-basics hands-on deep learning object detection series.

YOLOv3 structure

YOLOv3 network structure features:

1. Fully convolutional: no pooling layers; downsampling is done with strided convolutions
2. Three feature maps detect objects of different sizes
3. Residual (ResNet-style) structures in the backbone
4. The three feature maps are fused by concatenation (concat); add is used only inside the residual blocks
5. Sigmoid instead of softmax, enabling multi-label classification
Backbone: **Darknet-53**. The backbone provides three feature maps at 1/8, 1/16, and 1/32 of the input resolution for feature fusion, and uses residual structures throughout.
Why Darknet-53: at comparable accuracy it is much faster than ResNet, and compared with Darknet-19 it is slower but more accurate.
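As a minimal sketch (not the original repository's code), a Darknet-53-style residual unit can be written as follows: a 1 × 1 convolution halves the channels, a 3 × 3 convolution restores them, and a shortcut add closes the block; downsampling elsewhere in the network uses stride-2 convolutions instead of pooling.

    import torch.nn as nn

    class DarknetResidual(nn.Module):
        # Sketch of a Darknet-53 residual unit: 1x1 conv halves the channels,
        # 3x3 conv restores them, then a shortcut add (no pooling anywhere).
        def __init__(self, channels):
            super().__init__()
            half = channels // 2
            self.block = nn.Sequential(
                nn.Conv2d(channels, half, kernel_size=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1),
                nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1),
            )

        def forward(self, x):
            return x + self.block(x)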

Shape format and prior boxes

S × S × 3 × (C + 5): S × S is the grid size, 3 is the number of prior boxes per cell, 4 values give the box position (x, y, w, h), 1 is the detection confidence, and C is the class dimension.
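A quick worked example of that shape (values chosen for illustration):

    # One output scale: S = 13 grid, C = 80 classes (e.g. COCO)
    S, C = 13, 80
    channels = 3 * (C + 5)   # 3 priors x (4 box coords + 1 confidence + C classes)
    print(S, S, channels)    # 13 13 255 -> a 13 x 13 x 3 x 85 prediction tensor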
K-means clustering on the labelled ground-truth boxes yields 9 prior boxes:

(10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326)
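The paper derives these priors with k-means on the training boxes, using 1 − IoU as the distance. A minimal sketch of that idea (function names here are illustrative, not from the course code):

    import numpy as np

    def iou_wh(boxes, centroids):
        # IoU between (w, h) pairs, assuming all boxes share the same centre
        inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        union = boxes[:, 0:1] * boxes[:, 1:2] \
                + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    def kmeans_anchors(boxes, k=9, iters=100):
        # boxes: (N, 2) array of ground-truth widths and heights
        boxes = np.asarray(boxes, dtype=np.float64)
        centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(1 - iou_wh(boxes, centroids), axis=1)
            for i in range(k):
                if np.any(assign == i):
                    centroids[i] = boxes[assign == i].mean(axis=0)
        return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area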

(Figure: the detection boxes output by the network.)

Processing flow of the boxes

1. The three feature maps yield 8 × 8 × 3, 16 × 16 × 3, and 32 × 32 × 3 boxes (3 boxes per grid cell), 4032 boxes in total.
2. During training, all of these boxes are sent to the loss calculation.
3. During inference, a confidence threshold is applied first, then NMS outputs the final predictions (see the sketch below).

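A minimal sketch of step 3, assuming the decoded (N, 5 + C) prediction produced by the head code later in this post; `postprocess`, `conf_thresh`, and the class-agnostic `torchvision.ops.nms` call are my choices for illustration, not necessarily what the repository does:

    import torch
    from torchvision.ops import nms

    def postprocess(pred_bbox, conf_thresh=0.5, iou_thresh=0.45):
        # pred_bbox: (N, 5 + C) decoded rows: x, y, w, h, confidence, C class probs
        xywh, conf, prob = pred_bbox[:, :4], pred_bbox[:, 4:5], pred_bbox[:, 5:]
        scores, classes = (conf * prob).max(dim=-1)   # best class score per box
        keep = scores > conf_thresh                   # 1) confidence threshold
        xywh, scores, classes = xywh[keep], scores[keep], classes[keep]
        # centre format -> corner format expected by NMS
        xyxy = torch.cat([xywh[:, :2] - xywh[:, 2:] / 2,
                          xywh[:, :2] + xywh[:, 2:] / 2], dim=-1)
        idx = nms(xyxy, scores, iou_thresh)           # 2) non-maximum suppression
        return xyxy[idx], scores[idx], classes[idx]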

Training strategy and loss function

Prediction boxes fall into three cases: positive examples, negative examples, and ignored examples.

Positive example: for each ground truth, compute the IoU with the predicted boxes; the box with the largest IoU is the positive example. Positive examples generate confidence loss, box regression loss, and classification loss.

Negative example: a box whose IoU with every ground truth is below the threshold (0.5). Negative examples generate only the confidence loss, with the confidence label set to 0; classification and box regression generate no loss.

Ignored example: a box that is not the positive example but whose IoU with some ground truth exceeds the threshold (the paper uses 0.5). Ignored examples generate no loss at all.

Note: negative examples participate only in the focal loss on the confidence score; they do not contribute to the classification or box regression losses, because they contain no target object and therefore have nothing for those branches to learn.

Although negative examples contain no target, the network may still predict them as some object: backgrounds can carry features very similar to a target. If negative examples were excluded entirely, the network would never be penalized for such false positives. Penalizing them in the confidence loss pushes the network toward more accurate predictions on background, which is why negative examples must participate in the focal loss on confidence.
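Putting the three cases together, a hedged sketch of the mask logic (shapes and names are illustrative, not the repository's):

    import torch

    def split_examples(iou, best_idx, ignore_thresh=0.5):
        # iou: (num_pred, num_gt) IoU between each predicted box and each ground truth
        # best_idx: for each ground truth, the prediction with the highest IoU
        positive = torch.zeros(iou.shape[0], dtype=torch.bool)
        positive[best_idx] = True                          # one positive per ground truth
        overlaps = iou.max(dim=1).values > ignore_thresh   # close to some ground truth
        ignored = overlaps & ~positive                     # not positive but IoU > 0.5: no loss
        negative = ~positive & ~ignored                    # confidence label 0, confidence loss only
        return positive, negative, ignored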

The loss function has three parts: confidence loss, box regression loss, and classification loss.

Confidence loss: `FOCAL = FocalLoss(gamma=2, alpha=1.0, reduction="none")`
Box regression loss: `giou = tools.GIOU_xywh_torch(p_d_xywh, label_xywh).unsqueeze(-1)`
Classification loss: `BCE = nn.BCEWithLogitsLoss(reduction="none")`
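The `FocalLoss` above comes from the course repository; as a plausible sketch of what such a module computes (the actual implementation may differ in details), it scales a binary cross-entropy by a factor that down-weights easy examples:

    import torch
    import torch.nn as nn

    class FocalLoss(nn.Module):
        # Sketch: BCE-with-logits scaled by alpha * |target - p|^gamma,
        # so well-classified (easy) examples contribute less loss.
        def __init__(self, gamma=2.0, alpha=1.0, reduction="none"):
            super().__init__()
            self.gamma, self.alpha = gamma, alpha
            self.bce = nn.BCEWithLogitsLoss(reduction="none")
            self.reduction = reduction

        def forward(self, logits, target):
            loss = self.bce(logits, target)
            p = torch.sigmoid(logits)
            loss = loss * self.alpha * torch.abs(target - p) ** self.gamma
            if self.reduction == "mean":
                return loss.mean()
            if self.reduction == "sum":
                return loss.sum()
            return loss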

YOLOv3 code interpretation

1. Network forward part

    def forward(self, x):
        out = []

        # Extract three feature maps with the Darknet-53 backbone
        x_s, x_m, x_l = self.__backnone(x)
        # Concatenate and produce outputs through the FPN
        x_s, x_m, x_l = self.__fpn(x_l, x_m, x_s)
        # Feed the three feature maps of different scales into the heads,
        # which decode the raw outputs into predicted boxes
        out.append(self.__head_s(x_s))
        out.append(self.__head_m(x_m))
        out.append(self.__head_l(x_l))

        if self.training:
            p, p_d = list(zip(*out))
            return p, p_d  # small, medium, large
        else:
            p, p_d = list(zip(*out))
            return p, torch.cat(p_d, 0)

2. Yolo_head part

The head implements the following decoding formulas:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

where (c_x, c_y) is the top-left corner of the grid cell and (p_w, p_h) are the prior box width and height; the code below also multiplies by the stride to map back to input-image coordinates.

    class Yolo_head(nn.Module):
        def __init__(self, nC, anchors, stride):
            super(Yolo_head, self).__init__()
            # anchors: prior boxes (three per scale in v3); nA: number of priors;
            # nC: number of classes; stride: downsampling stride of this scale
            self.__anchors = anchors
            self.__nA = len(anchors)
            self.__nC = nC
            self.__stride = stride

        def forward(self, p):
            # Batch size and feature-map size of the input
            bs, nG = p.shape[0], p.shape[-1]
            # Reshape p to (bs, nA, 5 + nC, nG, nG), then permute to
            # (bs, nG, nG, nA, 5 + nC). Each anchor's output holds
            # tx, ty, tw, th, confidence and the class probabilities, hence 5 + C.
            p = p.view(bs, self.__nA, 5 + self.__nC, nG, nG).permute(0, 3, 4, 1, 2)
            # Decode the raw feature map into predicted boxes
            p_de = self.__decode(p.clone())

            return (p, p_de)

        def __decode(self, p):
            # Batch size and grid size
            batch_size, output_size = p.shape[:2]
            # Current device
            device = p.device
            # Stride and prior boxes, moved to the same device
            stride = self.__stride
            anchors = (1.0 * self.__anchors).to(device)

            # Raw predictions: centre offsets, width/height, confidence, class probabilities
            conv_raw_dxdy = p[:, :, :, :, 0:2]
            conv_raw_dwdh = p[:, :, :, :, 2:4]
            conv_raw_conf = p[:, :, :, :, 4:5]
            conv_raw_prob = p[:, :, :, :, 5:]

            # Build the grid of cell offsets that maps cell-relative coordinates to the full image
            y = torch.arange(0, output_size).unsqueeze(1).repeat(1, output_size)
            x = torch.arange(0, output_size).unsqueeze(0).repeat(output_size, 1)
            grid_xy = torch.stack([x, y], dim=-1)
            grid_xy = grid_xy.unsqueeze(0).unsqueeze(3).repeat(batch_size, 1, 1, 3, 1).float().to(device)
            # Decode centre coordinates and width/height
            pred_xy = (torch.sigmoid(conv_raw_dxdy) + grid_xy) * stride
            pred_wh = (torch.exp(conv_raw_dwdh) * anchors) * stride
            # Concatenate into the decoded bounding boxes
            pred_xywh = torch.cat([pred_xy, pred_wh], dim=-1)
            # Decode confidence and class probabilities
            pred_conf = torch.sigmoid(conv_raw_conf)
            pred_prob = torch.sigmoid(conv_raw_prob)
            # Concatenate boxes, confidence and class probabilities into the final prediction
            pred_bbox = torch.cat([pred_xywh, pred_conf, pred_prob], dim=-1)

            return pred_bbox.view(-1, 5 + self.__nC) if not self.training else pred_bbox

This part of the code determines the centre points of the anchors: each centre point is the top-left corner of a grid cell, and every cell generates three prior boxes.
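A tiny demo of the `grid_xy` construction for `output_size = 3`, showing that each cell's offset is its top-left corner:

    import torch

    output_size = 3
    y = torch.arange(0, output_size).unsqueeze(1).repeat(1, output_size)
    x = torch.arange(0, output_size).unsqueeze(0).repeat(output_size, 1)
    grid_xy = torch.stack([x, y], dim=-1)   # shape (3, 3, 2)
    print(grid_xy[0])
    # tensor([[0, 0],
    #         [1, 0],
    #         [2, 0]])  -> top-left corners of the first row of cells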


Origin: blog.csdn.net/m0_63495706/article/details/130062471