Single-stage object detection, YOLO series (1): YOLOv3 detailed explanation and code implementation

Table of contents

1. YOLOv3 network structure

1.1 Backbone: Darknet-53

1.2 Building the Feature Pyramid

1.3 YOLO Head

2. Decoding the YOLOv3 model's prediction results

2.1 Prior boxes

2.2 Bounding box decoding

2.3 Confidence decoding

2.4 Class decoding

3. Training strategy and loss function of the YOLOv3 model


1. YOLOv3 network structure

        The network structure of the YOLOv3 model is shown roughly in the figure below. It consists of three main parts: a Backbone network that extracts image features, a feature pyramid (FPN) that fuses features from different levels, and YOLO Heads that produce the prediction results.

1.1 Backbone: Darknet-53

        The YOLOv3 model uses Darknet-53 as its Backbone network to extract image features. Darknet-53 stacks multiple residual blocks, and between adjacent residual blocks there is a convolutional layer with kernel size 3×3 and stride 2 that is mainly used for downsampling. Darknet-53 was originally designed for image classification: it has 53 layers in total, 52 convolutional layers that extract image features plus a final fully connected layer for classification, as shown in the figure below. The commonly used input size of 416×416 is taken as the example here; the input size only needs to be a multiple of 32, because Darknet-53 performs 5 downsampling operations and each one halves the size of the feature map. Also note that the Backbone in YOLOv3 only uses the first 52 convolutional layers of Darknet-53 to extract image features.

        In each residual block of Darknet-53, the input feature map passes through two convolutions with kernel sizes 1×1 and 3×3, and a residual connection is used in essentially the same way as in ResNet: the block's input feature map is added to the output of the two convolutions, as shown in the figure below.

         In addition, each Convolutional module in Darknet-53 is conv2d + BN + LeakyReLU, as shown in the figure below.

        The Darknet-53 network is implemented in code as follows:

import math
from collections import OrderedDict

import torch.nn as nn


# ---------------------------------------------------------------------#
#   Residual block
#   A 1x1 convolution reduces the number of channels, then a 3x3 convolution
#   extracts features and restores the number of channels; finally the block
#   input is added back through the residual connection.
# ---------------------------------------------------------------------#
class BasicBlock(nn.Module):
    def __init__(self, inplanes, planes):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes[0], kernel_size=1, stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(planes[0])
        self.relu1 = nn.LeakyReLU(0.1)

        self.conv2 = nn.Conv2d(planes[0], planes[1], kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes[1])
        self.relu2 = nn.LeakyReLU(0.1)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)

        out += residual
        return out


class DarkNet(nn.Module):
    def __init__(self, layers):
        super(DarkNet, self).__init__()
        self.inplanes = 32
        # 416,416,3 -> 416,416,32
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu1 = nn.LeakyReLU(0.1)

        # 416,416,32 -> 208,208,64
        self.layer1 = self._make_layer([32, 64], layers[0])
        # 208,208,64 -> 104,104,128
        self.layer2 = self._make_layer([64, 128], layers[1])
        # 104,104,128 -> 52,52,256
        self.layer3 = self._make_layer([128, 256], layers[2])
        # 52,52,256 -> 26,26,512
        self.layer4 = self._make_layer([256, 512], layers[3])
        # 26,26,512 -> 13,13,1024
        self.layer5 = self._make_layer([512, 1024], layers[4])

        self.layers_out_filters = [64, 128, 256, 512, 1024]

        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    # ---------------------------------------------------------------------#
    #   In each layer, a 3x3 convolution with stride 2 first downsamples,
    #   then the residual blocks are stacked.
    # ---------------------------------------------------------------------#
    def _make_layer(self, planes, blocks):
        layers = []
        # Downsampling: stride 2, kernel size 3
        layers.append(("ds_conv", nn.Conv2d(self.inplanes, planes[1], kernel_size=3, stride=2, padding=1, bias=False)))
        layers.append(("ds_bn", nn.BatchNorm2d(planes[1])))
        layers.append(("ds_relu", nn.LeakyReLU(0.1)))
        # Stack the residual blocks
        self.inplanes = planes[1]
        for i in range(0, blocks):
            layers.append(("residual_{}".format(i), BasicBlock(self.inplanes, planes)))
        return nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.layer1(x)
        x = self.layer2(x)
        out3 = self.layer3(x)
        out4 = self.layer4(out3)
        out5 = self.layer5(out4)

        return out3, out4, out5


def darknet53():
    model = DarkNet([1, 2, 8, 8, 4])
    return model
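
As a quick sanity check, the backbone can be instantiated and fed a dummy 416×416 input to confirm the three output shapes (a minimal sketch, assuming it is run in the same file as the code above):

import torch

model = darknet53()
dummy = torch.randn(1, 3, 416, 416)   # one 416x416 RGB image
out3, out4, out5 = model(dummy)
print(out3.shape)   # torch.Size([1, 256, 52, 52])
print(out4.shape)   # torch.Size([1, 512, 26, 26])
print(out5.shape)   # torch.Size([1, 1024, 13, 13])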

1.2 Building the Feature Pyramid

       Before explaining how YOLOv3 builds its feature pyramid, let us briefly introduce the feature pyramid structure itself. It was first proposed in the paper "Feature Pyramid Networks for Object Detection", and the main problem it addresses is how poorly object detection handles multi-scale variation. Many networks perform detection on a single deep feature map (for example, Faster R-CNN performs object classification and bounding-box regression on a single heavily downsampled convolutional feature map), but this has an obvious drawback: small objects carry little pixel information to begin with, and that information is easily lost during downsampling. The classic way to handle large differences in object size is to use image pyramids for multi-scale augmentation, but this brings a huge computational cost. The FPN paper therefore proposed the feature pyramid structure shown in the figure below, which handles multi-scale variation in object detection at a very small additional cost.

However, it should be noted that in that paper the low-resolution feature map is upsampled and then fused with the high-resolution feature map by element-wise addition (the corresponding feature values on the two maps are added, a fusion style that first appeared in ResNet). There is another commonly used fusion method, concat (concatenation along the channel dimension, which first appeared in DenseNet). The feature pyramid built in YOLOv3 uses the concat method for feature fusion.
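
The difference between the two fusion methods can be illustrated with a small sketch (the tensors here are made up for illustration and are not part of the YOLOv3 code):

import torch
import torch.nn.functional as F

deep = torch.randn(1, 256, 13, 13)                              # low resolution, semantically strong
deep_up = F.interpolate(deep, scale_factor=2, mode='nearest')   # upsample to (1, 256, 26, 26)
shallow = torch.randn(1, 256, 26, 26)                           # high resolution, spatially detailed

fused_add = deep_up + shallow                   # FPN-style element-wise addition: (1, 256, 26, 26)
fused_cat = torch.cat([deep_up, shallow], 1)    # channel concatenation as in YOLOv3: (1, 512, 26, 26)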

        After Darknet-53 extracts features from the input image, three of its feature layers are selected to build a feature pyramid and effectively fuse features from different levels. These three feature layers come from different depths of the Darknet-53 network, and their shapes are (52,52,256), (26,26,512) and (13,13,1024), as shown in the figure below.

The process of constructing a feature pyramid FPN using these three effective feature layers is as follows:

① The 13×13×1024 feature layer goes through 5 convolutions to obtain the first enhanced feature layer, 13×13×512. This enhanced feature layer is then upsampled (nn.Upsample) and concatenated along the channel dimension with the 26×26×512 feature layer, fusing the two; the resulting feature layer has shape (26,26,768);

② The new (26,26,768) feature layer goes through 5 convolutions to obtain the second enhanced feature layer, 26×26×256. This enhanced feature layer is then upsampled and concatenated along the channel dimension with the 52×52×256 feature layer; the resulting feature layer has shape (52,52,384);

③ The new (52,52,384) feature layer goes through 5 convolutions to obtain the third enhanced feature layer, 52×52×128.

Note: the 5 convolutions use kernel sizes 1×1, 3×3, 1×1, 3×3, 1×1 in that order. The 1×1 convolutions mainly reduce the number of channels, while the 3×3 convolutions further extract image features and increase the number of channels.

1.3 YOLO Head

       Building the feature pyramid above gives us 3 enhanced feature layers with shapes (13,13,512), (26,26,256) and (52,52,128). Each of them is then passed to a YOLO Head to obtain the model's predictions. A YOLO Head is essentially a 3×3 convolution followed by a 1×1 convolution. Taking the VOC dataset (20 object classes) as an example, the 3×3 convolutions first turn the enhanced feature layers into feature maps of shape (13,13,1024), (26,26,512) and (52,52,256), and the 1×1 convolutions then produce 3 outputs of shape (13,13,75), (26,26,75) and (52,52,75). The 75 depends on the total number of classes in the dataset: 75 = 3×(20+1+4), where 3 means there are 3 prediction boxes at each feature point of the output feature map, 20 is the number of object classes, 1 indicates whether the prediction box contains an object, and 4 are the box adjustment parameters, namely the center offsets x_offset and y_offset and the height h and width w of the prediction box.
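
A minimal sketch of a single YOLO Head (the channel sizes follow the 13×13 branch for VOC; this is only illustrative, the full implementation below is the authoritative version):

import torch
import torch.nn as nn

num_anchors, num_classes = 3, 20
out_channels = num_anchors * (num_classes + 5)   # 3 * (20 + 1 + 4) = 75

# A YOLO Head is a 3x3 convolution (with BN and LeakyReLU) followed by a plain 1x1 convolution.
yolo_head_13 = nn.Sequential(
    nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(1024),
    nn.LeakyReLU(0.1),
    nn.Conv2d(1024, out_channels, kernel_size=1, stride=1, padding=0, bias=True),
)

pred = yolo_head_13(torch.randn(1, 512, 13, 13))
print(pred.shape)   # torch.Size([1, 75, 13, 13])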

        The complete code implementation of the YOLOv3 network structure is as follows:

from collections import OrderedDict

import torch
import torch.nn as nn

from nets.darknet import darknet53


def conv2d(filter_in, filter_out, kernel_size):
    pad = (kernel_size - 1) // 2 if kernel_size else 0
    return nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(filter_in, filter_out, kernel_size=kernel_size, stride=1, padding=pad, bias=False)),
        ("bn", nn.BatchNorm2d(filter_out)),
        ("relu", nn.LeakyReLU(0.1)),
    ]))


# ------------------------------------------------------------------------#
#   make_last_layers contains seven convolutions in total: the first five
#   extract features, and the last two produce the yolo prediction.
# ------------------------------------------------------------------------#
def make_last_layers(filters_list, in_filters, out_filter):
    m = nn.Sequential(
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        nn.Conv2d(filters_list[1], out_filter, kernel_size=1, stride=1, padding=0, bias=True)
    )
    return m


class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, pretrained=False):
        super(YoloBody, self).__init__()
        # ---------------------------------------------------#
        #   Build the darknet53 backbone and obtain three
        #   effective feature layers with shapes:
        #   52,52,256
        #   26,26,512
        #   13,13,1024
        # ---------------------------------------------------#
        self.backbone = darknet53()
        if pretrained:
            self.backbone.load_state_dict(torch.load("model_data/darknet53_backbone_weights.pth"))

        # ---------------------------------------------------#
        #   out_filters : [64, 128, 256, 512, 1024]
        # ---------------------------------------------------#
        out_filters = self.backbone.layers_out_filters

        # ------------------------------------------------------------------------#
        #   Compute the output channels of each yolo_head; for the VOC dataset,
        #   final_out_filter0 = final_out_filter1 = final_out_filter2 = 75
        # ------------------------------------------------------------------------#
        self.last_layer0 = make_last_layers([512, 1024], out_filters[-1], len(anchors_mask[0]) * (num_classes + 5))

        self.last_layer1_conv = conv2d(512, 256, 1)
        self.last_layer1_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer1 = make_last_layers([256, 512], out_filters[-2] + 256, len(anchors_mask[1]) * (num_classes + 5))

        self.last_layer2_conv = conv2d(256, 128, 1)
        self.last_layer2_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer2 = make_last_layers([128, 256], out_filters[-3] + 128, len(anchors_mask[2]) * (num_classes + 5))

    def forward(self, x):
        # ---------------------------------------------------#
        #   Obtain the three effective feature layers with shapes:
        #   52,52,256; 26,26,512; 13,13,1024
        # ---------------------------------------------------#
        x2, x1, x0 = self.backbone(x)

        # ---------------------------------------------------#
        #   First yolo head output (VOC):
        #   out0 = (batch_size, 75, 13, 13)
        # ---------------------------------------------------#
        # 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512 -> 13,13,1024 -> 13,13,512
        out0_branch = self.last_layer0[:5](x0)
        out0 = self.last_layer0[5:](out0_branch)

        # 13,13,512 -> 13,13,256 -> 26,26,256
        x1_in = self.last_layer1_conv(out0_branch)
        x1_in = self.last_layer1_upsample(x1_in)

        # 26,26,256 + 26,26,512 -> 26,26,768
        x1_in = torch.cat([x1_in, x1], 1)
        # ---------------------------------------------------#
        #   Second yolo head output (VOC):
        #   out1 = (batch_size, 75, 26, 26)
        # ---------------------------------------------------#
        # 26,26,768 -> 26,26,256 -> 26,26,512 -> 26,26,256 -> 26,26,512 -> 26,26,256
        out1_branch = self.last_layer1[:5](x1_in)
        out1 = self.last_layer1[5:](out1_branch)

        # 26,26,256 -> 26,26,128 -> 52,52,128
        x2_in = self.last_layer2_conv(out1_branch)
        x2_in = self.last_layer2_upsample(x2_in)

        # 52,52,128 + 52,52,256 -> 52,52,384
        x2_in = torch.cat([x2_in, x2], 1)
        # ---------------------------------------------------#
        #   Third yolo head output (VOC):
        #   out2 = (batch_size, 75, 52, 52)
        # ---------------------------------------------------#
        # 52,52,384 -> 52,52,128 -> 52,52,256 -> 52,52,128 -> 52,52,256 -> 52,52,128
        out2 = self.last_layer2(x2_in)
        return out0, out1, out2
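
A quick shape check of the full network (a sketch that assumes the darknet code above is saved as nets/darknet.py so the import works, uses the usual anchor index grouping, and sets pretrained=False so no weight file is needed):

import torch

anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]   # anchor indices for the 13x13, 26x26 and 52x52 heads
model = YoloBody(anchors_mask, num_classes=20, pretrained=False)

out0, out1, out2 = model(torch.randn(1, 3, 416, 416))
print(out0.shape)   # torch.Size([1, 75, 13, 13])
print(out1.shape)   # torch.Size([1, 75, 26, 26])
print(out2.shape)   # torch.Size([1, 75, 52, 52])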

2. Decoding the YOLOv3 model's prediction results

        Suppose an image from the VOC dataset (20 object classes) of size 416×416×3 is fed into the YOLOv3 network. The three output feature maps of different scales represent three different sets of predictions, with shapes 13×13×75, 26×26×75 and 52×52×75. The small feature map predicts large objects, and the large feature map predicts small objects. To understand how each prediction relates to the input image, take the 13×13×75 output as an example: it is equivalent to dividing the input image into a 13×13 grid, i.e. every 32×32 block of pixels in the input image is mapped by the network to one feature point of the output (the 13×13 output corresponds to downsampling the input by a factor of 32). Each feature point of each output feature map is associated with 3 prior boxes of different aspect ratios; the widths and heights of these prior boxes are set in advance from experience and are adjusted through the network's predictions during training. For each prior box, the YOLOv3 network predicts the confidence that the detection box contains an object, the box adjustment parameters x, y, w, h, and the per-class confidences: 3×(1+4+20)=75, which is why each output feature map has 75 channels. How this detection information is decoded is described in detail below.
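
In code, each raw prediction tensor is usually reshaped first so that the 3 prior boxes and their 25 values per feature point become explicit (a sketch for the 13×13 VOC output; the variable names are only illustrative):

import torch

pred = torch.randn(1, 75, 13, 13)                 # raw 13x13 head output for VOC
batch, _, grid_h, grid_w = pred.shape
pred = pred.view(batch, 3, 25, grid_h, grid_w)    # 3 prior boxes, 25 = 4 box params + 1 objectness + 20 classes
pred = pred.permute(0, 1, 3, 4, 2).contiguous()   # -> (batch, 3, 13, 13, 25)

tx, ty, tw, th = pred[..., 0], pred[..., 1], pred[..., 2], pred[..., 3]
objectness = pred[..., 4]
class_scores = pred[..., 5:]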

2.1 Prior boxes

        There are 9 prior box sizes: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326), written as w×h. The 13×13 output feature map uses the three prior boxes (116,90), (156,198), (373,326); the 26×26 output uses (30,61), (62,45), (59,119); and the 52×52 output uses (10,13), (16,30), (33,23). Note that these 9 prior box sizes are given relative to the input image, while the code usually operates on the output feature maps, so the conversion between the two must be handled. Also, the prior boxes only affect the decoded w and h of the detection box, not its x and y.
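
A small sketch of how the 9 anchors are grouped per scale and converted from input-image pixels to feature-map units (divide by the stride; the grouping follows the text above):

anchors = [(10, 13), (16, 30), (33, 23),        # used on the 52x52 feature map (stride 8)
           (30, 61), (62, 45), (59, 119),       # used on the 26x26 feature map (stride 16)
           (116, 90), (156, 198), (373, 326)]   # used on the 13x13 feature map (stride 32)

def anchors_for_scale(indices, stride):
    # Convert anchor (w, h) from input-image pixels to feature-map units.
    return [(w / stride, h / stride) for (w, h) in (anchors[i] for i in indices)]

print(anchors_for_scale([6, 7, 8], 32))   # anchors for the 13x13 feature map
print(anchors_for_scale([3, 4, 5], 16))   # anchors for the 26x26 feature map
print(anchors_for_scale([0, 1, 2], 8))    # anchors for the 52x52 feature map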

2.2 Bounding box decoding

        Given the prior boxes and the output feature maps, the detection boxes can be decoded with the following formulas:

b_{x}=\sigma (t_{x})+c_{x}

b_{y}=\sigma (t_{y})+c_{y}

b_{w}=p_{w}e^{t_{w}}

b_{h}=p_{h}e^{t_{h}}

Here b_{x}, b_{y}, b_{w}, b_{h} are the decoded center coordinates, width and height of the detection box; t_{x}, t_{y}, t_{w}, t_{h} are the four parameters output by the YOLOv3 model (the prediction); c_{x}, c_{y} are the coordinates of the top-left grid point of the cell (in the code, each feature point of the output feature map, i.e. the top-left corner of its grid cell, serves as the reference point of the prior box); \sigma (t_{x}) and \sigma (t_{y}) are the offsets of the detection box center relative to that grid point, where σ is the sigmoid activation function; and p_{w}, p_{h} are the width and height of the prior box. In the figure below, the dashed box is the prior box, the blue box is the detection box, and the red square is the grid cell just mentioned; the adjustment of the detection box center is constrained to stay within that cell.

         To make the decoding process clearer, refer to the figure below, again using the 13×13 output feature map as an example. The blue points correspond to the feature points of the output feature map. The left image shows the three prior boxes centered on the marked feature point; the right image shows the detection box obtained by adjusting the center position, width and height of the prior box with the four formulas above, based on the YOLOv3 prediction.
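
The four formulas translate almost directly into code. The sketch below decodes one scale, assuming pred has already been reshaped to (batch, 3, grid, grid, 25) as in section 2 above and scaled_anchors holds the prior box (w, h) in feature-map units; it is a simplified illustration, not the exact implementation from the original repository:

import torch

def decode_boxes(pred, scaled_anchors):
    # pred: (batch, 3, grid_h, grid_w, 25); returns boxes in feature-map units.
    batch, num_anchors, grid_h, grid_w, _ = pred.shape

    # c_x, c_y: top-left grid coordinates of each cell, broadcast over batch and anchors.
    cx = torch.arange(grid_w).float().view(1, 1, 1, grid_w)
    cy = torch.arange(grid_h).float().view(1, 1, grid_h, 1)

    # p_w, p_h: prior box width/height, broadcast over the grid.
    pw = torch.tensor([a[0] for a in scaled_anchors]).float().view(1, num_anchors, 1, 1)
    ph = torch.tensor([a[1] for a in scaled_anchors]).float().view(1, num_anchors, 1, 1)

    bx = torch.sigmoid(pred[..., 0]) + cx   # b_x = sigma(t_x) + c_x
    by = torch.sigmoid(pred[..., 1]) + cy   # b_y = sigma(t_y) + c_y
    bw = pw * torch.exp(pred[..., 2])       # b_w = p_w * exp(t_w)
    bh = ph * torch.exp(pred[..., 3])       # b_h = p_h * exp(t_h)
    return torch.stack([bx, by, bw, bh], dim=-1)   # (batch, 3, grid_h, grid_w, 4)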

2.3 Confidence decoding

        The objectness confidence is central to the YOLO design, as it directly affects the precision and recall of the detector. The confidence occupies one fixed position among the 25 values predicted per prior box and is decoded with the sigmoid function; after decoding it lies in [0, 1] and represents the probability that the detection box contains an object.

2.4 Class decoding

        The VOC dataset has 20 classes, so the class scores occupy 20 of the 25 values predicted per prior box, each representing the confidence of one class. YOLOv3 uses per-class sigmoid activations instead of the softmax used in YOLOv2, removing the mutual exclusivity between classes and making the network more flexible. Across the three output scales, a total of 13×13×3 + 26×26×3 + 52×52×3 = 10647 boxes, together with their classes and confidences, can be decoded. These 10647 boxes are used differently during training and inference:

① During training, all 10647 boxes are passed to the labelling function, which assigns labels and then computes the loss function.

② During inference, a confidence threshold is chosen to filter out low-confidence boxes, and NMS (non-maximum suppression) is then applied to produce the final predictions.
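
A simplified inference-time filtering sketch (per image; it uses torchvision's nms, assumes the boxes have already been decoded and converted to (x1, y1, x2, y2) corner format, and the two thresholds are typical values rather than values from the original post):

import torch
from torchvision.ops import nms

def filter_predictions(boxes_xyxy, obj_conf, class_probs, conf_thres=0.5, nms_thres=0.4):
    # boxes_xyxy: (N, 4) decoded boxes; obj_conf: (N,); class_probs: (N, 20)
    class_scores, class_ids = class_probs.max(dim=1)
    scores = obj_conf * class_scores                  # combined confidence per box

    keep = scores > conf_thres                        # 1) drop low-confidence boxes
    boxes, scores, class_ids = boxes_xyxy[keep], scores[keep], class_ids[keep]

    results = []
    for c in class_ids.unique():                      # 2) NMS within each class
        idx = (class_ids == c).nonzero(as_tuple=True)[0]
        kept = nms(boxes[idx], scores[idx], nms_thres)
        results.append((boxes[idx][kept], scores[idx][kept], c))
    return results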

3. Training strategy and loss function of the YOLOv3 model

        When a neural network is trained, computing the loss is essentially a comparison between the predictions and the ground-truth labels; in most cases MSE, MAE or cross-entropy can be computed directly. The training strategy of the YOLOv3 model is more involved and requires the following steps:

① Label the 10647 boxes generated during training. A box can receive one of three kinds of labels: positive example, negative example, or ignored example; a simplified code sketch of this assignment follows the three definitions below:

        Positive example: for each ground truth, compute the IOU with all 10647 boxes; the prediction box with the largest IOU is a positive example (it remains positive even if that largest IOU is below the threshold). A prediction box can be assigned to at most one ground truth: if the first ground truth has already matched a box as its positive example, the next ground truth picks the box with the largest IOU among the remaining boxes, and the order in which the ground truths are processed does not matter. Positive examples contribute confidence loss, detection box loss and class loss: the box target is the corresponding ground truth box, the class target is 1 for the matching class and 0 for the others, and the confidence target is 1.

        Negative example: a box that is not a positive example and whose IOU with every ground truth is below the threshold. Negative examples only contribute a confidence loss, with a confidence target of 0.

        Ignored example: a box that is not a positive example but whose IOU with some ground truth is above the threshold. Ignored examples contribute no loss at all. Ignored examples are needed because YOLOv3 uses multi-scale feature maps, and detections from different scales can overlap. For example, suppose a real object is assigned during training to the third box of feature map 1, with an IOU of 0.98, while the first box of feature map 2 also has an IOU of 0.95 with the same ground truth and therefore also detects it; if its confidence were forcibly labelled 0, the network would be penalized for a correct detection and training would suffer.
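
A heavily simplified sketch of this three-way labelling (per image; iou_matrix is assumed to be the precomputed IOU between every ground truth and every candidate box, and ignore_thres is the IOU threshold; the real implementation also records which ground truth each positive box matches):

import torch

def assign_labels(iou_matrix, ignore_thres=0.5):
    # iou_matrix: (num_gt, num_boxes) IOU between each ground truth and every candidate box.
    num_gt, num_boxes = iou_matrix.shape
    label = torch.zeros(num_boxes, dtype=torch.long)      # 0 = negative by default

    # Ignored examples: not positive, but IOU with some ground truth exceeds the threshold.
    label[(iou_matrix > ignore_thres).any(dim=0)] = -1    # -1 = ignored (contributes no loss)

    # Positive examples: for each ground truth, the box with the largest IOU
    # (even if that IOU is below the threshold); each box matches at most one ground truth.
    taken = torch.zeros(num_boxes, dtype=torch.bool)
    for g in range(num_gt):
        ious = iou_matrix[g].clone()
        ious[taken] = -1.0                                # an already-assigned box cannot be reused
        best = torch.argmax(ious)
        label[best] = 1                                   # 1 = positive
        taken[best] = True
    return label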

② Compute the loss. For the detection box parameters x, y, w, h, MSE is used as the loss function; smooth L1 loss (from Faster R-CNN) can also be used and makes training smoother, and some later papers use GIOU loss for network training instead. Since the confidence and class targets are 0/1 binary labels, cross-entropy is used as their loss function.
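
Putting the pieces together, the per-component losses look roughly like this (a sketch assuming the pred_* and target_* tensors have already been gathered according to the labelling above, with sigmoid already applied to the confidence and class outputs; it is not the full loss of the original implementation):

import torch
import torch.nn as nn

mse = nn.MSELoss()   # detection box loss
bce = nn.BCELoss()   # binary cross-entropy on sigmoid-activated outputs

def yolo_loss(pred_box, target_box,      # (num_pos, 4) x, y, w, h of the positive boxes
              pred_obj, target_obj,      # (num_pos + num_neg,) objectness, targets are 1 or 0
              pred_cls, target_cls):     # (num_pos, 20) class probabilities, one-hot targets
    loss_box = mse(pred_box, target_box)   # only positive examples contribute
    loss_obj = bce(pred_obj, target_obj)   # positives (target 1) and negatives (target 0)
    loss_cls = bce(pred_cls, target_cls)   # only positive examples contribute
    return loss_box + loss_obj + loss_cls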

Source: blog.csdn.net/Mike_honor/article/details/126379701