睿智的目标检测33——Pytorch搭建Efficientdet目标检测平台

学习前言
什么是Efficientdet目标检测算法
源码下载
Efficientdet实现思路

一、预测部分

1、主干网络介绍
2、BiFPN加强特征提取
3、从特征获取预测结果
4、预测结果的解码
5、在原图上进行绘制

二、训练部分

1、真实框的处理
2、利用处理完的真实框与对应图片的预测结果计算loss

训练自己的Efficientdet模型

学习前言

一起来看看Efficientdet的keras实现吧，顺便训练一下自己的数据。
在这里插入图片描述

什么是Efficientdet目标检测算法

最近，谷歌大脑 Mingxing Tan、Ruoming Pang 和 Quoc V. Le 提出新架构 EfficientDet，结合 EfficientNet（同样来自该团队）和新提出的 BiFPN，实现新的 SOTA 结果。
在这里插入图片描述

源码下载

https://github.com/bubbliiiing/efficientdet-pytorch
喜欢的可以点个star噢。

Efficientdet实现思路

一、预测部分

1、主干网络介绍

在这里插入图片描述
Efficientdet采用Efficientnet作为主干特征提取网络。EfficientNet-B0对应Efficientdet-D0；EfficientNet-B1对应Efficientdet-D1；以此类推。

EfficientNet模型具有很独特的特点，这个特点是参考其它优秀神经网络设计出来的。经典的神经网络特点如下：
1、利用残差神经网络增大神经网络的深度，通过更深的神经网络实现特征提取。
2、改变每一层提取的特征层数，实现更多层的特征提取，得到更多的特征，提升宽度。
3、通过增大输入图片的分辨率也可以使得网络可以学习与表达的东西更加丰富，有利于提高精确度。

EfficientNet就是将这三个特点结合起来，通过一起缩放baseline模型（MobileNet中就通过缩放α实现缩放模型，不同的α有不同的模型精度，α=1时为baseline模型；ResNet其实也是有一个baseline模型，在baseline的基础上通过改变图片的深度实现不同的模型实现），同时调整深度、宽度、输入图片的分辨率完成一个优秀的网络设计。

在EfficientNet模型中，其使用一组固定的缩放系数统一缩放网络深度、宽度和分辨率。
假设想使用 2^N倍的计算资源，我们可以简单的对网络深度扩大α^N倍、宽度扩大β^N 、图像尺寸扩大γ^N倍，这里的α,β,γ都是由原来的小模型上做微小的网格搜索决定的常量系数。
如图为EfficientNet的设计思路，从三个方面同时拓充网络的特性。
在这里插入图片描述
本博客以Efficientnet-B0和Efficientdet-D0为例，进行Efficientdet的解析。

Efficientnet-B0由1个Stem+16个大Blocks堆叠构成，16个大Blocks可以分为1、2、2、3、3、4、1个Block。Block的通用结构如下，其总体的设计思路是Inverted residuals结构和残差结构，在3x3或者5x5网络结构前利用1x1卷积升维，在3x3或者5x5网络结构后增加了一个关于通道的注意力机制，最后利用1x1卷积降维后增加一个大残差边。
在这里插入图片描述
整体结构如下：

最终获得三个有效特征层传入到BIFPN当中进行下一步的操作。

import torch
from torch import nn
from torch.nn import functional as F

from nets.layers import (
    round_filters,
    round_repeats,
    drop_connect,
    get_same_padding_conv2d,
    get_model_params,
    efficientnet_params,
    load_pretrained_weights,
    Swish,
    MemoryEfficientSwish,
)


class MBConvBlock(nn.Module):
    '''
    EfficientNet-b0:
    [BlockArgs(kernel_size=3, num_repeat=1, input_filters=32, output_filters=16, expand_ratio=1, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=2, input_filters=24, output_filters=40, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=3, input_filters=40, output_filters=80, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=3, input_filters=80, output_filters=112, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=4, input_filters=112, output_filters=192, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=1, input_filters=192, output_filters=320, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25)]
    
    GlobalParams(batch_norm_momentum=0.99, batch_norm_epsilon=0.001, dropout_rate=0.2, num_classes=1000, width_coefficient=1.0, 
                    depth_coefficient=1.0, depth_divisor=8, min_depth=None, drop_connect_rate=0.2, image_size=224)
    '''
    def __init__(self, block_args, global_params):
        super().__init__()
        self._block_args = block_args
        # 获得标准化的参数
        self._bn_mom = 1 - global_params.batch_norm_momentum
        self._bn_eps = global_params.batch_norm_epsilon

        # 注意力机制的缩放比例
        self.has_se = (self._block_args.se_ratio is not None) and (
            0 < self._block_args.se_ratio <= 1)
        # 是否需要短接边
        self.id_skip = block_args.id_skip 

        Conv2d = get_same_padding_conv2d(image_size=global_params.image_size)

        # 1x1卷积通道扩张
        inp = self._block_args.input_filters  # number of input channels
        oup = self._block_args.input_filters * self._block_args.expand_ratio  # number of output channels
        if self._block_args.expand_ratio != 1:
            self._expand_conv = Conv2d(
                in_channels=inp, out_channels=oup, kernel_size=1, bias=False)
            self._bn0 = nn.BatchNorm2d(
                num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)

        # 深度可分离卷积
        k = self._block_args.kernel_size
        s = self._block_args.stride
        self._depthwise_conv = Conv2d(
            in_channels=oup, out_channels=oup, groups=oup,
            kernel_size=k, stride=s, bias=False)
        self._bn1 = nn.BatchNorm2d(
            num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)

        # 注意力机制模块组，先进行通道数的收缩再进行通道数的扩张
        if self.has_se:
            num_squeezed_channels = max(
                1, int(self._block_args.input_filters * self._block_args.se_ratio))
            self._se_reduce = Conv2d(
                in_channels=oup, out_channels=num_squeezed_channels, kernel_size=1)
            self._se_expand = Conv2d(
                in_channels=num_squeezed_channels, out_channels=oup, kernel_size=1)

        # 输出部分
        final_oup = self._block_args.output_filters
        self._project_conv = Conv2d(
            in_channels=oup, out_channels=final_oup, kernel_size=1, bias=False)
        self._bn2 = nn.BatchNorm2d(
            num_features=final_oup, momentum=self._bn_mom, eps=self._bn_eps)
        self._swish = MemoryEfficientSwish()

    def forward(self, inputs, drop_connect_rate=None):
        x = inputs
        if self._block_args.expand_ratio != 1:
            x = self._swish(self._bn0(self._expand_conv(inputs)))

        x = self._swish(self._bn1(self._depthwise_conv(x)))

        # 添加了注意力机制
        if self.has_se:
            x_squeezed = F.adaptive_avg_pool2d(x, 1)
            x_squeezed = self._se_expand(
                self._swish(self._se_reduce(x_squeezed)))
            x = torch.sigmoid(x_squeezed) * x

        x = self._bn2(self._project_conv(x))

        # 满足以下条件才可以短接
        input_filters, output_filters = self._block_args.input_filters, self._block_args.output_filters
        if self.id_skip and self._block_args.stride == 1 and input_filters == output_filters:
            if drop_connect_rate:
                x = drop_connect(x, p=drop_connect_rate,
                                 training=self.training)
            x = x + inputs  # skip connection
        return x

    def set_swish(self, memory_efficient=True):
        """Sets swish function as memory efficient (for training) or standard (for export)"""
        self._swish = MemoryEfficientSwish() if memory_efficient else Swish()

class EfficientNet(nn.Module):
    '''
    EfficientNet-b0:
    [BlockArgs(kernel_size=3, num_repeat=1, input_filters=32, output_filters=16, expand_ratio=1, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=2, input_filters=24, output_filters=40, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=3, input_filters=40, output_filters=80, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=3, input_filters=80, output_filters=112, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25), 
     BlockArgs(kernel_size=5, num_repeat=4, input_filters=112, output_filters=192, expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25), 
     BlockArgs(kernel_size=3, num_repeat=1, input_filters=192, output_filters=320, expand_ratio=6, id_skip=True, stride=[1], se_ratio=0.25)]
    
    GlobalParams(batch_norm_momentum=0.99, batch_norm_epsilon=0.001, dropout_rate=0.2, num_classes=1000, width_coefficient=1.0, 
                    depth_coefficient=1.0, depth_divisor=8, min_depth=None, drop_connect_rate=0.2, image_size=224)
    '''
    def __init__(self, blocks_args=None, global_params=None):
        super().__init__()
        assert isinstance(blocks_args, list), 'blocks_args should be a list'
        assert len(blocks_args) > 0, 'block args must be greater than 0'
        self._global_params = global_params
        self._blocks_args = blocks_args
        # 获得一种卷积方法
        Conv2d = get_same_padding_conv2d(image_size=global_params.image_size)

        # 获得标准化的参数
        bn_mom = 1 - self._global_params.batch_norm_momentum
        bn_eps = self._global_params.batch_norm_epsilon

        # 网络主干部分开始
        # 设定输入进来的是RGB三通道图像
        in_channels = 3  
        # 利用round_filters可以使得通道数在扩张的时候可以被8整除
        out_channels = round_filters(32, self._global_params)

        # 卷积+标准化
        self._conv_stem = Conv2d(
            in_channels, out_channels, kernel_size=3, stride=2, bias=False)
        self._bn0 = nn.BatchNorm2d(
            num_features=out_channels, momentum=bn_mom, eps=bn_eps)

        # 对每个block的参数进行修改
        self._blocks = nn.ModuleList([])
        for i in range(len(self._blocks_args)):
            # 对每个block的参数进行修改，根据所选的efficient版本进行修改
            self._blocks_args[i] = self._blocks_args[i]._replace(
                input_filters=round_filters(
                    self._blocks_args[i].input_filters, self._global_params),
                output_filters=round_filters(
                    self._blocks_args[i].output_filters, self._global_params),
                num_repeat=round_repeats(
                    self._blocks_args[i].num_repeat, self._global_params)
            )

            # 第一次大的Block里面的卷积需要考虑步长和输入进来的通道数！
            self._blocks.append(MBConvBlock(self._blocks_args[i], self._global_params))

            if self._blocks_args[i].num_repeat > 1:
                self._blocks_args[i] = self._blocks_args[i]._replace(input_filters=self._blocks_args[i].output_filters, stride=1)
            for _ in range(self._blocks_args[i].num_repeat - 1):
                self._blocks.append(MBConvBlock(self._blocks_args[i], self._global_params))

        # 增加了head部分
        in_channels = self._blocks_args[len(self._blocks_args)-1].output_filters
        out_channels = round_filters(1280, self._global_params)

        # 卷积+标准化
        self._conv_head = Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self._bn1 = nn.BatchNorm2d(num_features=out_channels, momentum=bn_mom, eps=bn_eps)

        # 最后的线性全连接层
        self._avg_pooling = nn.AdaptiveAvgPool2d(1)
        self._dropout = nn.Dropout(self._global_params.dropout_rate)
        self._fc = nn.Linear(out_channels, self._global_params.num_classes)
        # 进行swish激活函数
        self._swish = MemoryEfficientSwish()

    def set_swish(self, memory_efficient=True):
        """Sets swish function as memory efficient (for training) or standard (for export)"""
        # swish函数
        self._swish = MemoryEfficientSwish() if memory_efficient else Swish()
        for block in self._blocks:
            block.set_swish(memory_efficient)

    def extract_features(self, inputs):
        """ Returns output of the final convolution layer """

        # Stem
        x = self._swish(self._bn0(self._conv_stem(inputs)))

        # Blocks
        for idx, block in enumerate(self._blocks):
            drop_connect_rate = self._global_params.drop_connect_rate
            if drop_connect_rate:
                drop_connect_rate *= float(idx) / len(self._blocks)
            x = block(x, drop_connect_rate=drop_connect_rate)
        # Head
        x = self._swish(self._bn1(self._conv_head(x)))

        return x

    def forward(self, inputs):
        """ Calls extract_features to extract features, applies final linear layer, and returns logits. """
        bs = inputs.size(0)
        # Convolution layers
        x = self.extract_features(inputs)

        # Pooling and final linear layer
        x = self._avg_pooling(x)
        x = x.view(bs, -1)
        x = self._dropout(x)
        x = self._fc(x)
        return x

    @classmethod
    def from_name(cls, model_name, override_params=None):
        cls._check_model_name_is_valid(model_name)
        blocks_args, global_params = get_model_params(model_name, override_params)
        return cls(blocks_args, global_params)

    @classmethod
    def from_pretrained(cls, model_name, load_weights=True, advprop=True, num_classes=1000, in_channels=3):
        model = cls.from_name(model_name, override_params={'num_classes': num_classes})
        if load_weights:
            load_pretrained_weights(model, model_name, load_fc=(num_classes == 1000), advprop=advprop)
        if in_channels != 3:
            Conv2d = get_same_padding_conv2d(image_size = model._global_params.image_size)
            out_channels = round_filters(32, model._global_params)
            model._conv_stem = Conv2d(in_channels, out_channels, kernel_size=3, stride=2, bias=False)
        return model

    @classmethod
    def get_image_size(cls, model_name):
        cls._check_model_name_is_valid(model_name)
        _, _, res, _ = efficientnet_params(model_name)
        return res

    @classmethod
    def _check_model_name_is_valid(cls, model_name):
        """ Validates model name. """
        valid_models = ['efficientnet-b'+str(i) for i in range(9)]
        if model_name not in valid_models:
            raise ValueError('model_name should be one of: ' + ', '.join(valid_models))

2、BiFPN加强特征提取

在这里插入图片描述
BiFPN简单来讲是一个加强版本的FPN，上图是BiFPN，下图是普通的FPN，大家可以看到，与普通的FPN相比，BiFPN的FPN构建更加复杂，中间还增加了许多连接。

构建BiFPN可以分为多步：
1、获得P3_in、P4_in、P5_in、P6_in、P7_in，通过主干特征提取网络，我们已经可以获得P3、P4、P5，还需要进行两次下采样获得P6、P7。
P3、P4、P5在经过1x1卷积调整通道数后，就可以作为P3_in、P4_in、P5_in了，在构建BiFPN的第一步，需要构建两个P4_in、P5_in（原版是这样设计的）。
在这里插入图片描述
实现代码如下：

p3_in = self.p3_down_channel(p3)
p4_in_1 = self.p4_down_channel(p4)
p5_in_1 = self.p5_down_channel(p5)
p4_in_2 = self.p4_down_channel_2(p4)
p5_in_2 = self.p5_down_channel_2(p5)
p6_in = self.p5_to_p6(p5)
p7_in = self.p6_to_p7(p6_in)

2、在获得P3_in、P4_in_1、P4_in_2、P5_in_1、P5_in_2、P6_in、P7_in之后需要对P7_in进行上采样，上采样后与P6_in堆叠获得P6_td；之后对P6_td进行上采样，上采样后与P5_in_1进行堆叠获得P5_td；之后对P5_td进行上采样，上采样后与P4_in_1进行堆叠获得P4_td；之后对P4_td进行上采样，上采样后与P3_in进行堆叠获得P3_out。
在这里插入图片描述
实现代码如下：

# 简单的注意力机制，用于确定更关注p7_in还是p6_in
p6_w1 = self.p6_w1_relu(self.p6_w1)
weight = p6_w1 / (torch.sum(p6_w1, dim=0) + self.epsilon)
p6_up = self.conv6_up(self.swish(weight[0] * p6_in + weight[1] * self.p6_upsample(p7_in)))

# 简单的注意力机制，用于确定更关注p6_up还是p5_in
p5_w1 = self.p5_w1_relu(self.p5_w1)
weight = p5_w1 / (torch.sum(p5_w1, dim=0) + self.epsilon)
p5_up = self.conv5_up(self.swish(weight[0] * p5_in_1 + weight[1] * self.p5_upsample(p6_up)))

# 简单的注意力机制，用于确定更关注p5_up还是p4_in
p4_w1 = self.p4_w1_relu(self.p4_w1)
weight = p4_w1 / (torch.sum(p4_w1, dim=0) + self.epsilon)
p4_up = self.conv4_up(self.swish(weight[0] * p4_in_1 + weight[1] * self.p4_upsample(p5_up)))

# 简单的注意力机制，用于确定更关注p4_up还是p3_in
p3_w1 = self.p3_w1_relu(self.p3_w1)
weight = p3_w1 / (torch.sum(p3_w1, dim=0) + self.epsilon)
p3_out = self.conv3_up(self.swish(weight[0] * p3_in + weight[1] * self.p3_upsample(p4_up)))

3、在获得P3_out、P4_td、P4_in_2、P5_td、P5_in_2、P6_in、P6_td、P7_in之后，之后需要对P3_out进行下采样，下采样后与P4_td、P4_in_2堆叠获得P4_out；之后对P4_out进行下采样，下采样后与P5_td、P5_in_2进行堆叠获得P5_out；之后对P5_out进行下采样，下采样后与P6_in、P6_td进行堆叠获得P6_out；之后对P6_out进行下采样，下采样后与P7_in进行堆叠获得P7_out。
在这里插入图片描述
实现代码如下：

# 简单的注意力机制，用于确定更关注p4_in_2还是p4_up还是p3_out
p4_w2 = self.p4_w2_relu(self.p4_w2)
weight = p4_w2 / (torch.sum(p4_w2, dim=0) + self.epsilon)
# Connections for P4_0, P4_1 and P3_2 to P4_2 respectively
p4_out = self.conv4_down(
    self.swish(weight[0] * p4_in_2 + weight[1] * p4_up + weight[2] * self.p4_downsample(p3_out)))

# 简单的注意力机制，用于确定更关注p5_in_2还是p5_up还是p4_out
p5_w2 = self.p5_w2_relu(self.p5_w2)
weight = p5_w2 / (torch.sum(p5_w2, dim=0) + self.epsilon)
p5_out = self.conv5_down(
    self.swish(weight[0] * p5_in_2 + weight[1] * p5_up + weight[2] * self.p5_downsample(p4_out)))

# 简单的注意力机制，用于确定更关注p6_in还是p6_up还是p5_out
p6_w2 = self.p6_w2_relu(self.p6_w2)
weight = p6_w2 / (torch.sum(p6_w2, dim=0) + self.epsilon)
p6_out = self.conv6_down(
    self.swish(weight[0] * p6_in + weight[1] * p6_up + weight[2] * self.p6_downsample(p5_out)))

# 简单的注意力机制，用于确定更关注p7_in还是p7_up还是p6_out
p7_w2 = self.p7_w2_relu(self.p7_w2)
weight = p7_w2 / (torch.sum(p7_w2, dim=0) + self.epsilon)
p7_out = self.conv7_down(self.swish(weight[0] * p7_in + weight[1] * self.p7_downsamp

4、将获得的P3_out、P4_out、P5_out、P6_out、P7_out作为P3_in、P4_in、P5_in、P6_in、P7_in，重复2、3步骤进行堆叠即可，对于Effiicientdet B0来讲，还需要重复2次，需要注意P4_in_1和P4_in_2此时不需要分开了，P5也是。
在这里插入图片描述
实现代码如下：

p3_in, p4_in, p5_in, p6_in, p7_in = inputs

# 简单的注意力机制，用于确定更关注p7_in还是p6_in
p6_w1 = self.p6_w1_relu(self.p6_w1)
weight = p6_w1 / (torch.sum(p6_w1, dim=0) + self.epsilon)
p6_up = self.conv6_up(self.swish(weight[0] * p6_in + weight[1] * self.p6_upsample(p7_in)))

# 简单的注意力机制，用于确定更关注p6_up还是p5_in
p5_w1 = self.p5_w1_relu(self.p5_w1)
weight = p5_w1 / (torch.sum(p5_w1, dim=0) + self.epsilon)
p5_up = self.conv5_up(self.swish(weight[0] * p5_in + weight[1] * self.p5_upsample(p6_up)))

# 简单的注意力机制，用于确定更关注p5_up还是p4_in
p4_w1 = self.p4_w1_relu(self.p4_w1)
weight = p4_w1 / (torch.sum(p4_w1, dim=0) + self.epsilon)
p4_up = self.conv4_up(self.swish(weight[0] * p4_in + weight[1] * self.p4_upsample(p5_up)))

# 简单的注意力机制，用于确定更关注p4_up还是p3_in
p3_w1 = self.p3_w1_relu(self.p3_w1)
weight = p3_w1 / (torch.sum(p3_w1, dim=0) + self.epsilon)
p3_out = self.conv3_up(self.swish(weight[0] * p3_in + weight[1] * self.p3_upsample(p4_up)))


# 简单的注意力机制，用于确定更关注p4_in还是p4_up还是p3_out
p4_w2 = self.p4_w2_relu(self.p4_w2)
weight = p4_w2 / (torch.sum(p4_w2, dim=0) + self.epsilon)
# Connections for P4_0, P4_1 and P3_2 to P4_2 respectively
p4_out = self.conv4_down(
    self.swish(weight[0] * p4_in + weight[1] * p4_up + weight[2] * self.p4_downsample(p3_out)))

# 简单的注意力机制，用于确定更关注p5_in还是p5_up还是p4_out
p5_w2 = self.p5_w2_relu(self.p5_w2)
weight = p5_w2 / (torch.sum(p5_w2, dim=0) + self.epsilon)
p5_out = self.conv5_down(
    self.swish(weight[0] * p5_in + weight[1] * p5_up + weight[2] * self.p5_downsample(p4_out)))

# 简单的注意力机制，用于确定更关注p6_in还是p6_up还是p5_out
p6_w2 = self.p6_w2_relu(self.p6_w2)
weight = p6_w2 / (torch.sum(p6_w2, dim=0) + self.epsilon)
p6_out = self.conv6_down(
    self.swish(weight[0] * p6_in + weight[1] * p6_up + weight[2] * self.p6_downsample(p5_out)))

# 简单的注意力机制，用于确定更关注p7_in还是p7_up还是p6_out
p7_w2 = self.p7_w2_relu(self.p7_w2)
weight = p7_w2 / (torch.sum(p7_w2, dim=0) + self.epsilon)
p7_out = self.conv7_down(self.swish(weight[0] * p7_in + weight[1] * self.p7_downsample(p6_out)))

3、从特征获取预测结果

通过第二部的重复运算，我们获得了P3_out, P4_out, P5_out, P6_out, P7_out。

为了和普通特征层区分，我们称之为有效特征层，将这五个有效的特征层传输过ClassNet+BoxNet就可以获得预测结果了。

对于Efficientdet-B0来讲：
ClassNet采用3次64通道的卷积和1次num_priors x num_classes的卷积，num_priors指的是该特征层所拥有的先验框数量，num_classes指的是网络一共对多少类的目标进行检测。

BoxNet采用3次64通道的卷积和1次num_priors x 4的卷积，num_priors指的是该特征层所拥有的先验框数量，4指的是先验框的调整情况。

需要注意的是，每个特征层所用的ClassNet是同一个ClassNet；每个特征层所用的BoxNet是同一个BoxNet。

其中：
num_priors x 4的卷积 用于预测 该特征层上 每一个网格点上每一个先验框的变化情况。**

num_priors x num_classes的卷积 用于预测 该特征层上 每一个网格点上 每一个预测框对应的种类。

实现代码为：

class BoxNet(nn.Module):
    def __init__(self, in_channels, num_anchors, num_layers, onnx_export=False):
        super(BoxNet, self).__init__()
        self.num_layers = num_layers

        self.conv_list = nn.ModuleList(
            [SeparableConvBlock(in_channels, in_channels, norm=False, activation=False) for i in range(num_layers)])
        # 每一个有效特征层对应的Batchnor不同
        self.bn_list = nn.ModuleList(
            [nn.ModuleList([nn.BatchNorm2d(in_channels, momentum=0.01, eps=1e-3) for i in range(num_layers)]) for j in
             range(5)])
        self.header = SeparableConvBlock(in_channels, num_anchors * 4, norm=False, activation=False)
        self.swish = MemoryEfficientSwish() if not onnx_export else Swish()

    def forward(self, inputs):
        feats = []
        # 对每个特征层循环
        for feat, bn_list in zip(inputs, self.bn_list):
            # 每个特征层需要进行num_layer次卷积+标准化+激活函数
            for i, bn, conv in zip(range(self.num_layers), bn_list, self.conv_list):
                feat = conv(feat)
                feat = bn(feat)
                feat = self.swish(feat)
            feat = self.header(feat)

            feat = feat.permute(0, 2, 3, 1)
            feat = feat.contiguous().view(feat.shape[0], -1, 4)
            feats.append(feat)
        # 进行一个堆叠
        feats = torch.cat(feats, dim=1)

        return feats


class ClassNet(nn.Module):
    def __init__(self, in_channels, num_anchors, num_classes, num_layers, onnx_export=False):
        super(ClassNet, self).__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        self.num_layers = num_layers
        self.conv_list = nn.ModuleList(
            [SeparableConvBlock(in_channels, in_channels, norm=False, activation=False) for i in range(num_layers)])
        # 每一个有效特征层对应的Batchnor不同
        self.bn_list = nn.ModuleList(
            [nn.ModuleList([nn.BatchNorm2d(in_channels, momentum=0.01, eps=1e-3) for i in range(num_layers)]) for j in
             range(5)])
        self.header = SeparableConvBlock(in_channels, num_anchors * num_classes, norm=False, activation=False)
        self.swish = MemoryEfficientSwish() if not onnx_export else Swish()

    def forward(self, inputs):
        feats = []
        # 对每个特征层循环
        for feat, bn_list in zip(inputs, self.bn_list):
            for i, bn, conv in zip(range(self.num_layers), bn_list, self.conv_list):
                # 每个特征层需要进行num_layer次卷积+标准化+激活函数
                feat = conv(feat)
                feat = bn(feat)
                feat = self.swish(feat)
            feat = self.header(feat)

            feat = feat.permute(0, 2, 3, 1)
            feat = feat.contiguous().view(feat.shape[0], feat.shape[1], feat.shape[2], self.num_anchors,
                                          self.num_classes)
            feat = feat.contiguous().view(feat.shape[0], -1, self.num_classes)

            feats.append(feat)
        # 进行一个堆叠
        feats = torch.cat(feats, dim=1)
        # 取sigmoid表示概率
        feats = feats.sigmoid()

        return feats

4、预测结果的解码

在这里插入图片描述
我们通过对每一个特征层的处理，可以获得三个内容，分别是：

num_priors x 4的卷积 用于预测 该特征层上 每一个网格点上每一个先验框的变化情况。**

num_priors x num_classes的卷积 用于预测 该特征层上 每一个网格点上 每一个预测框对应的种类。

每一个有效特征层对应的先验框对应着该特征层上 每一个网格点上预先设定好的9个框。

我们利用 num_priors x 4的卷积 与 每一个有效特征层对应的先验框 获得框的真实位置。

每一个有效特征层对应的先验框就是，如图所示的作用：
每一个有效特征层将整个图片分成与其长宽对应的网格，如P3的特征层就是将整个图像分成64x64个网格；然后从每个网格中心建立9个先验框，一共64x64x9个，36864‬个先验框。
在这里插入图片描述
先验框虽然可以代表一定的框的位置信息与框的大小信息，但是其是有限的，无法表示任意情况，因此还需要调整，Efficientdet利用3次64通道的卷积+num_priors x 4的卷积的结果对先验框进行调整。

num_priors x 4中的num_priors表示了这个网格点所包含的先验框数量，其中的4表示了框的左上角xy轴，右下角xy的调整情况。

Efficientdet解码过程就是将对应的先验框的左上角和右下角进行位置的调整，调整完的结果就是预测框的位置了。

当然得到最终的预测结构后还要进行得分排序与非极大抑制筛选这一部分基本上是所有目标检测通用的部分。
1、取出每一类得分大于confidence_threshold的框和得分。
2、利用框的位置和得分进行非极大抑制。
实现代码如下：

def bbox_iou(box1, box2, x1y1x2y2=True):
    """
        计算IOU
    """
    if not x1y1x2y2:
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * \
                 torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)
                 
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

    return iou


def non_max_suppression(prediction, num_classes, conf_thres=0.5, nms_thres=0.4):

    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):

        # 获得种类及其置信度
        class_conf, class_pred = torch.max(image_pred[:, 4:], 1, keepdim=True)

        # 利用置信度进行第一轮筛选
        conf_mask = (class_conf >= conf_thres).squeeze()
        image_pred = image_pred[conf_mask]
        class_conf, class_pred = class_conf[conf_mask], class_pred[conf_mask]

        if not image_pred.size(0):
            continue

        # 获得的内容为(x1, y1, x2, y2, class_conf, class_pred)
        detections = torch.cat((image_pred[:, :4], class_conf.float(), class_pred.float()), 1)

        # 获得种类
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()

        for c in unique_labels:
            # 获得某一类初步筛选后全部的预测结果
            detections_class = detections[detections[:, -1] == c]
            # 按照存在物体的置信度排序
            _, conf_sort_index = torch.sort(detections_class[:, 4], descending=True)
            detections_class = detections_class[conf_sort_index]
            # 进行非极大抑制
            max_detections = []
            while detections_class.size(0):
                # 取出这一类置信度最高的，一步一步往下判断，判断重合程度是否大于nms_thres，如果是则去除掉
                max_detections.append(detections_class[0].unsqueeze(0))
                if len(detections_class) == 1:
                    break
                ious = bbox_iou(max_detections[-1], detections_class[1:])
                detections_class = detections_class[1:][ious < nms_thres]
            # 堆叠
            max_detections = torch.cat(max_detections).data
            # Add max detections to outputs
            output[image_i] = max_detections if output[image_i] is None else torch.cat(
                (output[image_i], max_detections))

    return output

5、在原图上进行绘制

通过第三步，我们可以获得预测框在原图上的位置，而且这些预测框都是经过筛选的。这些筛选后的框可以直接绘制在图片上，就可以获得结果了。

二、训练部分

1、真实框的处理

从预测部分我们知道，每个特征层的预测结果，num_priors x 4的卷积 用于预测 该特征层上 每一个网格点上每一个先验框的变化情况。

也就是说，我们直接利用Efficientdet网络预测到的结果，并不是预测框在图片上的真实位置，需要解码才能得到真实位置。

而在训练的时候，我们需要计算loss函数，这个loss函数是相对于Efficientdet网络的预测结果的。我们需要把图片输入到当前的Efficientdet网络中，得到预测结果；同时还需要把真实框的信息，进行编码，这个编码是把真实框的位置信息格式转化为Efficientdet预测结果的格式信息。

也就是，我们需要找到 每一张用于训练的图片的每一个真实框对应的先验框，并求出如果想要得到这样一个真实框，我们的预测结果应该是怎么样的。

从预测结果获得真实框的过程被称作解码，而从真实框获得预测结果的过程就是编码的过程。

因此我们只需要将解码过程逆过来就是编码过程了。

编码过程首先需要找到真实框对应的先验框，在Efficientdet中采取的办法是求取真实框和所有先验框的重合程度IOU，IOU＞0.5以上的则认为可以利用这些先验框调整获得预测框。

而IOU在0.4-0.5之间的，给予忽略。

IOU在0.4以下的作为负样本。

实现代码如下：

def calc_iou(a, b):
    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iw = torch.min(torch.unsqueeze(a[:, 3], dim=1), b[:, 2]) - torch.max(torch.unsqueeze(a[:, 1], 1), b[:, 0])
    ih = torch.min(torch.unsqueeze(a[:, 2], dim=1), b[:, 3]) - torch.max(torch.unsqueeze(a[:, 0], 1), b[:, 1])
    iw = torch.clamp(iw, min=0)
    ih = torch.clamp(ih, min=0)
    ua = torch.unsqueeze((a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1]), dim=1) + area - iw * ih
    ua = torch.clamp(ua, min=1e-8)
    intersection = iw * ih
    IoU = intersection / ua

    return IoU

def get_target(anchor, bbox_annotation, classification):
    IoU = calc_iou(anchor[:, :], bbox_annotation[:, :4])

    IoU_max, IoU_argmax = torch.max(IoU, dim=1)

    # compute the loss for classification
    targets = torch.ones_like(classification) * -1
    if torch.cuda.is_available():
        targets = targets.cuda()

    targets[torch.lt(IoU_max, 0.4), :] = 0

    positive_indices = torch.ge(IoU_max, 0.5)

    num_positive_anchors = positive_indices.sum()

    assigned_annotations = bbox_annotation[IoU_argmax, :]

    targets[positive_indices, :] = 0
    targets[positive_indices, assigned_annotations[positive_indices, 4].long()] = 1
    return targets, num_positive_anchors, positive_indices, assigned_annotations

def encode_bbox(assigned_annotations, positive_indices, anchor_widths, anchor_heights, anchor_ctr_x, anchor_ctr_y):
    assigned_annotations = assigned_annotations[positive_indices, :]

    anchor_widths_pi = anchor_widths[positive_indices]
    anchor_heights_pi = anchor_heights[positive_indices]
    anchor_ctr_x_pi = anchor_ctr_x[positive_indices]
    anchor_ctr_y_pi = anchor_ctr_y[positive_indices]

    gt_widths = assigned_annotations[:, 2] - assigned_annotations[:, 0]
    gt_heights = assigned_annotations[:, 3] - assigned_annotations[:, 1]
    gt_ctr_x = assigned_annotations[:, 0] + 0.5 * gt_widths
    gt_ctr_y = assigned_annotations[:, 1] + 0.5 * gt_heights

    # efficientdet style
    gt_widths = torch.clamp(gt_widths, min=1)
    gt_heights = torch.clamp(gt_heights, min=1)

    targets_dx = (gt_ctr_x - anchor_ctr_x_pi) / anchor_widths_pi
    targets_dy = (gt_ctr_y - anchor_ctr_y_pi) / anchor_heights_pi
    targets_dw = torch.log(gt_widths / anchor_widths_pi)
    targets_dh = torch.log(gt_heights / anchor_heights_pi)

    targets = torch.stack((targets_dy, targets_dx, targets_dh, targets_dw))
    targets = targets.t()
    return targets

2、利用处理完的真实框与对应图片的预测结果计算loss

loss的计算分为两个部分：
1、Smooth Loss：获取所有正标签的框的预测结果的回归loss。
2、Focal Loss：获取所有未被忽略的种类的预测结果的交叉熵loss。

由于在Efficientdet的训练过程中，正负样本极其不平衡，即存在对应真实框的先验框可能只有若干个，但是不存在对应真实框的负样本却有上万个，这就会导致负样本的loss值极大，因此引入了Focal Loss进行正负样本的平衡，关于Focal Loss的介绍可以看这个博客。
https://blog.csdn.net/weixin_44791964/article/details/102853782

实现代码如下：

class FocalLoss(nn.Module):
    def __init__(self):
        super(FocalLoss, self).__init__()

    def forward(self, classifications, regressions, anchors, annotations, alpha = 0.25, gamma = 2.0):
        # 设置
        dtype = regressions.dtype
        batch_size = classifications.shape[0]
        classification_losses = []
        regression_losses = []

        # 获得先验框，将先验框转换成中心宽高的形势
        anchor = anchors[0, :, :].to(dtype)
        # 转换成中心，宽高的形式
        anchor_widths = anchor[:, 3] - anchor[:, 1]
        anchor_heights = anchor[:, 2] - anchor[:, 0]
        anchor_ctr_x = anchor[:, 1] + 0.5 * anchor_widths
        anchor_ctr_y = anchor[:, 0] + 0.5 * anchor_heights

        for j in range(batch_size):
            # 取出真实框
            bbox_annotation = annotations[j]

            if bbox_annotation.shape[0] == 0:
                alpha_factor = torch.ones_like(classification) * alpha
                
                if torch.cuda.is_available():
                    alpha_factor = alpha_factor.cuda()
                alpha_factor = 1. - alpha_factor
                focal_weight = classification
                focal_weight = alpha_factor * torch.pow(focal_weight, gamma)
                
                bce = -(torch.log(1.0 - classification))
                
                cls_loss = focal_weight * bce
                
                if torch.cuda.is_available():
                    regression_losses.append(torch.tensor(0).to(dtype).cuda())
                else:
                    regression_losses.append(torch.tensor(0).to(dtype))
                classification_losses.append(cls_loss.sum())

            # 获得每张图片的分类结果和回归预测结果
            classification = classifications[j, :, :]
            regression = regressions[j, :, :]
            # 平滑标签
            classification = torch.clamp(classification, 1e-4, 1.0 - 1e-4)
            # 获得目标预测结果
            targets, num_positive_anchors, positive_indices, assigned_annotations = get_target(anchor, bbox_annotation, classification)
            
            alpha_factor = torch.ones_like(targets) * alpha
            if torch.cuda.is_available():
                alpha_factor = alpha_factor.cuda()

            alpha_factor = torch.where(torch.eq(targets, 1.), alpha_factor, 1. - alpha_factor)
            focal_weight = torch.where(torch.eq(targets, 1.), 1. - classification, classification)
            focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

            bce = -(targets * torch.log(classification) + (1.0 - targets) * torch.log(1.0 - classification))

            cls_loss = focal_weight * bce

            zeros = torch.zeros_like(cls_loss)
            if torch.cuda.is_available():
                zeros = zeros.cuda()
            cls_loss = torch.where(torch.ne(targets, -1.0), cls_loss, zeros)

            classification_losses.append(cls_loss.sum() / torch.clamp(num_positive_anchors.to(dtype), min=1.0))

            if positive_indices.sum() > 0:
                targets = encode_bbox(assigned_annotations, positive_indices, anchor_widths, anchor_heights, anchor_ctr_x, anchor_ctr_y)
               
                regression_diff = torch.abs(targets - regression[positive_indices, :])

                regression_loss = torch.where(
                    torch.le(regression_diff, 1.0 / 9.0),
                    0.5 * 9.0 * torch.pow(regression_diff, 2),
                    regression_diff - 0.5 / 9.0
                )
                regression_losses.append(regression_loss.mean())
            else:
                if torch.cuda.is_available():
                    regression_losses.append(torch.tensor(0).to(dtype).cuda())
                else:
                    regression_losses.append(torch.tensor(0).to(dtype))
        
        loss = torch.stack(classification_losses).mean() + torch.stack(regression_losses).mean()
        return loss

训练自己的Efficientdet模型

Efficientdet整体的文件夹构架如下：
在这里插入图片描述
本文使用VOC格式进行训练。
训练前将标签文件放在VOCdevkit文件夹下的VOC2007文件夹下的Annotation中。

训练前将图片文件放在VOCdevkit文件夹下的VOC2007文件夹下的JPEGImages中。

在训练前利用voc2efficientdet.py文件生成对应的txt。
在这里插入图片描述
再运行根目录下的voc_annotation.py，运行前需要将classes改成你自己的classes。

classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

在这里插入图片描述
就会生成对应的2007_train.txt，每一行对应其图片位置及其真实框的位置。

在训练前需要修改model_data里面的voc_classes.txt文件，需要将classes改成你自己的classes。

运行train.py即可开始训练。

修改train.py文件下的phi可以修改efficientdet的版本，训练前注意权重文件与Efficientdet版本的对齐。