Yolov3 model: PyTorch implementation

Paper link: YOLOv3: An Incremental Improvement

Improvements of Yolov3:

1. Use Darknet53 as the backbone;
2. Multi-scale feature prediction (similar to FPN structure);
3. Other tricks.

The structure of Yolov3:

The backbone is the feature-extraction part of Darknet53, where Convolutional means Conv+BN+LeakyReLU and Residual means a residual connection.
The backbone extracts three feature maps from the input image, labeled feature0, feature1 and feature2 from shallow to deep.
First, feature2 passes through the Convolutional Layers. The result then branches two ways: Convolutional+Conv produces out2, while Convolutional+Upsampling produces a map that is concatenated with feature1.
The concatenated features pass through Convolutional Layers and branch the same way: Convolutional+Conv produces out1, while Convolutional+Upsampling produces a map that is concatenated with feature0.
Finally, these concatenated features pass through Convolutional Layers and then Convolutional+Conv to produce out0.
Convolutional Layers is a stack of five Convolutional blocks with kernel sizes $[1, 3, 1, 3, 1]$.
Except for the first Convolutional after each Concat and the first Convolutional applied to the input image, a Convolutional with kernel size 3 doubles the number of channels, and one with kernel size 1 halves it.
[Figure: Yolov3 structure]
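
The channel rule for the five-block stack can be traced with a few lines of plain Python. This is only an illustrative sketch: `conv_stack_plan` is a hypothetical helper, not part of the original code, and the channel numbers are the ones used in the model listing further down.

```python
def conv_stack_plan(in_channels, hidden_channels):
    """Return (kernel_size, in_channels, out_channels) for each of the five blocks."""
    plan = []
    channels = in_channels
    for kernel_size in (1, 3, 1, 3, 1):
        # 1x1 blocks output hidden_channels, 3x3 blocks output twice that.
        out_channels = hidden_channels if kernel_size == 1 else hidden_channels * 2
        plan.append((kernel_size, channels, out_channels))
        channels = out_channels
    return plan


deep = conv_stack_plan(1024, 512)          # applied to feature2
mid = conv_stack_plan(256 + 512, 256)      # upsampled branch (256) concatenated with feature1 (512)
shallow = conv_stack_plan(128 + 256, 128)  # upsampled branch (128) concatenated with feature0 (256)

for kernel_size, c_in, c_out in mid:
    print(kernel_size, c_in, c_out)
```

The first block of `mid` maps 768 channels to 256, which is the exception to the halving rule mentioned above; every later block halves or doubles.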

Output of Yolov3:

The network produces three outputs. For the COCO object detection task with an input image of size (3,416,416), they are:
Out 0 (255,52,52), for predicting small objects;
Out 1 (255,26,26), for predicting medium-sized objects;
Out 2 (255,13,13), for predicting large objects.
As in Yolov2, the 52x52, 26x26 and 13x13 grids are the preset anchor positions. The channel count is $255 = (4 + 1 + 80) \times 3$: 4 is the number of box regression parameters, 1 is the object confidence, 80 is the conditional probability for each of the 80 categories, and 3 is the number of anchor sizes at each position (for each Out layer).
(The code only implements the model structure part)
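
The output shapes above follow from simple arithmetic, sketched below. The function name `yolov3_output_shapes` and its parameters are made up for illustration. The strides 8, 16 and 32 come from counting the stride-2 convolutions in the backbone listing, and the input size is assumed to be a multiple of 32.

```python
def yolov3_output_shapes(input_size=416, num_classes=80, num_anchors=3):
    """Return the (channels, height, width) of Out 0, Out 1 and Out 2."""
    # Each anchor predicts 4 box parameters, 1 confidence and num_classes scores.
    channels = num_anchors * (4 + 1 + num_classes)
    # The three feature maps are downsampled by 8, 16 and 32 relative to the input.
    return [(channels, input_size // stride, input_size // stride)
            for stride in (8, 16, 32)]


print(yolov3_output_shapes())
# [(255, 52, 52), (255, 26, 26), (255, 13, 13)]
print(yolov3_output_shapes(input_size=608, num_classes=20))
# [(75, 76, 76), (75, 38, 38), (75, 19, 19)]
```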

Yolov3's attempt:

Regarding the prediction form and loss function, the authors tried several strategies that did not work, among which Focal Loss is worth noting. Using Focal Loss as the basic form of the loss function reduced mAP by about two points, and the authors were not entirely sure why it did not help.
[Figure: Focal loss]
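
For reference, here is a minimal sketch of the commonly used binary form of Focal Loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. It is not taken from the Yolov3 paper or from the code below, which does not implement any loss. The function name `binary_focal_loss` and the defaults `alpha=0.25`, `gamma=2.0` are assumptions chosen for illustration, and how the Yolov3 authors actually wired Focal Loss into their loss is not specified here.

```python
import torch
import torch.nn.functional as F


def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Mean binary focal loss; `targets` holds 0/1 floats with the same shape as `logits`."""
    # Per-element cross entropy, i.e. -log(p_t).
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma down-weights examples that are already classified well.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(binary_focal_loss(logits, targets))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross entropy, which makes a convenient sanity check.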

import torch
import torch.nn as nn


def convolutional(in_channels, out_channels, kernel_size, stride):  # Conv+BN+LeakyReLU
    padding = 1 if kernel_size == 3 else 0  # keeps the spatial size unchanged at stride 1
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.1)
    )


class Residual(nn.Module):  # Residual block
    def __init__(self, in_channels, hidden_channels):
        super(Residual, self).__init__()
        self.residual_block = nn.Sequential(
            convolutional(in_channels, hidden_channels, 1, 1),
            convolutional(hidden_channels, in_channels, 3, 1)
        )

    def forward(self, x):
        return x + self.residual_block(x)  # x+F(x)


class Darknet53(nn.Module):  # feature-extraction part of Darknet53
    def __init__(self):
        super(Darknet53, self).__init__()
        self.feature0 = nn.Sequential(
            convolutional(3, 32, 3, 1),
            convolutional(32, 64, 3, 2),
            Residual(64, 32),
            convolutional(64, 128, 3, 2),
            *[Residual(128, 64) for i in range(2)],
            convolutional(128, 256, 3, 2),
            *[Residual(256, 128) for i in range(8)],
        )
        self.feature1 = nn.Sequential(
            convolutional(256, 512, 3, 2),
            *[Residual(512, 256) for i in range(8)],
        )
        self.feature2 = nn.Sequential(
            convolutional(512, 1024, 3, 2),
            *[Residual(1024, 512) for i in range(4)],
        )

    def forward(self, x):
        feature0 = self.feature0(x)  # shallow features
        feature1 = self.feature1(feature0)  # middle features
        feature2 = self.feature2(feature1)  # deep features
        return feature0, feature1, feature2


class Convlayers(nn.Module):  # stack of 5 Convolutional blocks
    def __init__(self, in_channels, hidden_channels):
        super(Convlayers, self).__init__()
        self.convlayers = nn.Sequential(
            convolutional(in_channels, hidden_channels, 1, 1),
            convolutional(hidden_channels, hidden_channels * 2, 3, 1),
            convolutional(hidden_channels * 2, hidden_channels, 1, 1),
            convolutional(hidden_channels, hidden_channels * 2, 3, 1),
            convolutional(hidden_channels * 2, hidden_channels, 1, 1),
        )

    def forward(self, x):
        return self.convlayers(x)


class Yolov3(nn.Module):  # Yolov3 model
    def __init__(self):
        super(Yolov3, self).__init__()
        self.backbone = Darknet53()
        self.convlayers2 = Convlayers(1024, 512)
        self.convlayers1 = Convlayers(512 + 256, 256)
        self.convlayers0 = Convlayers(256 + 128, 128)
        self.final_conv2 = nn.Sequential(
            convolutional(512, 1024, 3, 1),
            nn.Conv2d(1024, 255, 1, 1, 0),
        )
        self.final_conv1 = nn.Sequential(
            convolutional(256, 512, 3, 1),
            nn.Conv2d(512, 255, 1, 1, 0),
        )
        self.final_conv0 = nn.Sequential(
            convolutional(128, 256, 3, 1),
            nn.Conv2d(256, 255, 1, 1, 0),
        )
        self.upsample2 = nn.Sequential(
            convolutional(512, 256, 1, 1),
            nn.Upsample(scale_factor=2)
        )
        self.upsample1 = nn.Sequential(
            convolutional(256, 128, 1, 1),
            nn.Upsample(scale_factor=2)
        )

    def forward(self, x):
        # (B,256,52,52),(B,512,26,26),(B,1024,13,13)
        feature0, feature1, feature2 = self.backbone(x)  # the backbone extracts 3 feature maps from the input image
        f2 = self.convlayers2(feature2)  # deep features pass through Convolutional Layers to give f2, (B,1024,13,13)-->(B,512,13,13)
        out2 = self.final_conv2(f2)  # f2 passes through Convolutional+Conv to give out2, (B,512,13,13)-->(B,255,13,13)

        f1 = self.convlayers1(  # f2 passes through Convolutional+Upsampling, is concatenated with the middle features, then passes through Convolutional Layers to give f1
            torch.cat([self.upsample2(f2), feature1], dim=1))  # (B,256,26,26)cat(B,512,26,26)-->(B,256,26,26)
        out1 = self.final_conv1(f1)  # f1 passes through Convolutional+Conv to give out1, (B,256,26,26)-->(B,255,26,26)

        f0 = self.convlayers0(  # f1 passes through Convolutional+Upsampling, is concatenated with the shallow features, then passes through Convolutional Layers to give f0
            torch.cat([self.upsample1(f1), feature0], dim=1))  # (B,128,52,52)cat(B,256,52,52)-->(B,128,52,52)
        out0 = self.final_conv0(f0)  # f0 passes through Convolutional+Conv to give out0, (B,128,52,52)-->(B,255,52,52)
        return out0, out1, out2
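
The shape comments in the listing rely on two facts: a 3x3 convolution with padding 1 keeps the spatial size at stride 1 and halves it at stride 2, and `nn.Upsample(scale_factor=2)` doubles it so the result can be concatenated with the next shallower feature map. The standalone sketch below checks this on dummy tensors. The 8 output channels are an arbitrary choice for illustration, and the snippet does not use the model classes above.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 3, 416, 416)  # dummy input image batch

same = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1, bias=False)
down = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1, bias=False)

print(same(x).shape)  # torch.Size([1, 8, 416, 416])
print(down(x).shape)  # torch.Size([1, 8, 208, 208])

# Neck step: upsample the 13x13 branch and concatenate it with the 26x26 feature map.
deep_branch = torch.zeros(1, 256, 13, 13)
feature1 = torch.zeros(1, 512, 26, 26)
merged = torch.cat([nn.Upsample(scale_factor=2)(deep_branch), feature1], dim=1)
print(merged.shape)  # torch.Size([1, 768, 26, 26])
```

Five stride-2 convolutions in a row therefore take 416 down to 13, which is where the deepest 13x13 output comes from.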

Origin blog.csdn.net/Peach_____/article/details/128762798