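Note: this series of articles covers SSD, from the algorithm's principles and code through to deployment on the RK3588 chip. Everything runs on TF2.2; for installation details, please refer to other articles online.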

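I. Introduction to SSD

SSD is an excellent one-stage object detection algorithm: it localizes and classifies objects in a single pass. The main idea is to extract features with a CNN and then sample densely at different positions on the image, using different scales and aspect ratios. Classification and bounding-box regression are done at the same time, which is why it is fast.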

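II. How SSD Is Implemented

1. Backbone network

Figure 1. SSD architecture

This figure shows the SSD network structure clearly. The backbone is VGG16, modified as follows:

1) FC6 and FC7 of VGG are converted into convolutional layers

2) All Dropout layers and the FC8 layer are removed

3) Conv6, Conv7, Conv8 and Conv9 are added

Overall flow:

1) Input: a 300×300×3 image (three RGB channels)

2) Conv1: two [3,3] convolutions, output [300,300,64]; then [2,2] max pooling with stride 2, output [150,150,64]

3) Conv2: two [3,3] convolutions, output [150,150,128]; then [2,2] max pooling with stride 2, output [75,75,128]

4) Conv3: three [3,3] convolutions, output [75,75,256]; then [2,2] max pooling with stride 2, output [38,38,256]

5) Conv4: three [3,3] convolutions, output [38,38,512]; then [2,2] max pooling with stride 2, output [19,19,512]

6) Conv5: three [3,3] convolutions, output [19,19,512]; then [3,3] max pooling with stride 1, output [19,19,512]

7) FC6, FC7: one [3,3] convolution (dilated, with rate 6 in the code) and one [1,1] convolution, each with 1024 output channels, output [19,19,1024]

8) Conv6: one [1,1] convolution to adjust the channel count, then one stride-2 [3,3] convolution, output [10,10,512]

9) Conv7: one [1,1] convolution to adjust the channel count, then one stride-2 [3,3] convolution, output [5,5,256]

10) Conv8: one [1,1] convolution to adjust the channel count, then one [3,3] convolution with valid padding, output [3,3,256]

11) Conv9: one [1,1] convolution to adjust the channel count, then one [3,3] convolution with valid padding, output [1,1,256]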
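The spatial sizes in the list above can be checked with a few lines of arithmetic. The sketch below is only an illustration (the helper names `same_out` and `valid_out` are made up here): it assumes that a `padding='same'` layer with stride s yields ceil(n / s) outputs, and that a `padding='valid'` layer yields floor((n - k) / s) + 1, which is how Keras behaves.

```python
import math

def same_out(size, stride):
    # padding='same': the output size is ceil(size / stride)
    return math.ceil(size / stride)

def valid_out(size, kernel, stride=1):
    # padding='valid': the output size is floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

size = 300
trace = []
for _ in range(4):   # pool1..pool4: 2x2 max pooling, stride 2, 'same'
    size = same_out(size, 2)
    trace.append(size)
for _ in range(2):   # Conv6, Conv7: pad 1 pixel per side, then 3x3 stride-2 'valid' conv
    size = valid_out(size + 2, 3, 2)
    trace.append(size)
for _ in range(2):   # Conv8, Conv9: 3x3 'valid' conv, stride 1
    size = valid_out(size, 3)
    trace.append(size)

print(trace)  # [150, 75, 38, 19, 10, 5, 3, 1]
```

The step from 75 to 38 is where the rounding matters: 75 / 2 = 37.5 is rounded up because the pooling layer uses 'same' padding.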

2. Backbone network code

The overall structure of the SSD network is fairly clear; the implementation follows.

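# Imports needed by the listings in this article (TF 2.x Keras API)
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import (Activation, Concatenate, Conv2D, Flatten, Input, InputSpec,
                                     Layer, MaxPooling2D, Reshape, ZeroPadding2D)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


# L2-normalizes the input along the channel axis, then multiplies by a per-channel scale (gamma)
class Normalize(Layer):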
    def __init__(self, scale, **kwargs):
        self.axis = 3
        self.scale = scale
        super(Normalize, self).__init__(**kwargs)

    def build(self, input_shape):
        self.input_spec = [InputSpec(shape=input_shape)]
        shape = (input_shape[self.axis],)
        init_gamma = self.scale * np.ones(shape)
        self.gamma = K.variable(init_gamma, name='{}_gamma'.format(self.name))

    def call(self, x, mask=None):
        output = K.l2_normalize(x, self.axis)
        output *= self.gamma
        return output

# class_num is the number of classes to detect (required)
# input_shape is normally [300, 300, 3]
def ssd_net(class_num, input_shape=[300, 300, 3], weight_decay=5e-4):
    # The first layers of SSD are VGG
    input_tensor = Input(shape=input_shape)
    print('input_tensor: ' + str(input_tensor))

    # The SSD model; net is a dict mapping layer names to tensors
    net = {}

    # Block 0  input layer
    net['input'] = input_tensor

    # Block 1  300,300,3 -> 150,150,64
    # Two [3, 3] convolutions with 64 output channels give [300, 300, 64]; 2×2 max pooling with stride 2 then gives [150, 150, 64]
    net['conv1_1'] = Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv1_1')(net['input'])
    net['conv1_2'] = Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv1_2')(
        net['conv1_1'])
    net['pool1'] = MaxPooling2D((2, 2), strides=(2, 2), padding='same', name='pool1')(net['conv1_2'])

    # Block 2  150,150,64 -> 75,75,128
    # Two [3, 3] convolutions with 128 output channels give [150, 150, 128]; 2×2 max pooling with stride 2 then gives [75, 75, 128]
    net['conv2_1'] = Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv2_1')(
        net['pool1'])
    net['conv2_2'] = Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv2_2')(
        net['conv2_1'])
    net['pool2'] = MaxPooling2D((2, 2), strides=(2, 2), padding='same',  name='pool2')(net['conv2_2'])

    # Block 3   75,75,128 -> 38,38,256
    # Three [3, 3] convolutions with 256 output channels give [75, 75, 256]; 2×2 max pooling with stride 2 then gives [38, 38, 256]
    net['conv3_1'] = Conv2D(256, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv3_1')(
        net['pool2'])
    net['conv3_2'] = Conv2D(256, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv3_2')(
        net['conv3_1'])
    net['conv3_3'] = Conv2D(256, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv3_3')(
        net['conv3_2'])
    net['pool3'] = MaxPooling2D((2, 2), strides=(2, 2), padding='same', name='pool3')(net['conv3_3'])

    # Block 4   38,38,256 -> 19,19,512
    # Three [3, 3] convolutions with 512 output channels give [38, 38, 512]; 2×2 max pooling with stride 2 then gives [19, 19, 512]
    net['conv4_1'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv4_1')(
        net['pool3'])
    net['conv4_2'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv4_2')(
        net['conv4_1'])
    net['conv4_3'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv4_3')(
        net['conv4_2'])
    net['pool4'] = MaxPooling2D((2, 2), strides=(2, 2), padding='same', name='pool4')(net['conv4_3'])

    # Block 5   19,19,512 -> 19,19,512
    # Three [3, 3] convolutions with 512 output channels give [19, 19, 512]; 3×3 max pooling with stride 1 then gives [19, 19, 512]
    net['conv5_1'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv5_1')(
        net['pool4'])
    net['conv5_2'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv5_2')(
        net['conv5_1'])
    net['conv5_3'] = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv5_3')(
        net['conv5_2'])
    net['pool5'] = MaxPooling2D((3, 3), strides=(1, 1), padding='same', name='pool5')(net['conv5_3'])

    # FC6         19,19,512 -> 19,19,1024
    # One [3, 3] convolution (fc6, dilation rate 6) and one [1, 1] convolution (fc7), each with 1024 output channels, give [19, 19, 1024]
    net['fc6'] = Conv2D(1024, kernel_size=(3, 3), dilation_rate=(6, 6), activation='relu', padding='same',
                        kernel_regularizer=l2(weight_decay), name='fc6')(net['pool5'])
    # FC7         19,19,1024 -> 19,19,1024
    net['fc7'] = Conv2D(1024, kernel_size=(1, 1), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='fc7')(net['fc6'])
    #  ---------------------- Everything above is VGG (with fc6 and fc7 modified) ------------------------ #

    # Block 6     19,19,1024 -> 10,10,512
    # One [1, 1] convolution to adjust the channel count, then one stride-2 [3, 3] convolution with 512 output channels gives [10, 10, 512]
    net['conv6_1'] = Conv2D(256, kernel_size=(1, 1), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv6_1')(net['fc7'])
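    # ZeroPadding2D pads the 2D input with one row/column of zeros on every side (rows + 2, columns + 2).
    # This prepares for the stride-2 [3, 3] convolution below, which uses the default 'valid' padding:
    # 19 -> 21 after padding -> (21 - 3) // 2 + 1 = 10 after the convolution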
    net['conv6_2'] = ZeroPadding2D(padding=((1, 1), (1, 1)), name='conv6_padding')(net['conv6_1'])
    net['conv6_2'] = Conv2D(512, kernel_size=(3, 3), strides=(2, 2), activation='relu', kernel_regularizer=l2(weight_decay), name='conv6_2')(
        net['conv6_2'])

    # Block 7      10,10,512 -> 5,5,256
    # One [1, 1] convolution to adjust the channel count, then one stride-2 [3, 3] convolution with 256 output channels gives [5, 5, 256]
    net['conv7_1'] = Conv2D(128, kernel_size=(1, 1), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv7_1')(
        net['conv6_2'])
    net['conv7_2'] = ZeroPadding2D(padding=((1, 1), (1, 1)), name='conv7_padding')(net['conv7_1'])
    net['conv7_2'] = Conv2D(256, kernel_size=(3, 3), strides=(2, 2), activation='relu', padding='valid', kernel_regularizer=l2(weight_decay),
                            name='conv7_2')(net['conv7_2'])

    # Block 8      5,5,256 -> 3,3,256
    # One [1, 1] convolution to adjust the channel count, then one [3, 3] convolution with valid padding and 256 output channels gives [3, 3, 256]
    net['conv8_1'] = Conv2D(128, kernel_size=(1, 1), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv8_1')(
        net['conv7_2'])
    net['conv8_2'] = Conv2D(256, kernel_size=(3, 3), strides=(1, 1), activation='relu', padding='valid', kernel_regularizer=l2(weight_decay),
                            name='conv8_2')(net['conv8_1'])

    # Block 9      3,3,256 -> 1,1,256
    # One [1, 1] convolution to adjust the channel count, then one [3, 3] convolution with valid padding and 256 output channels gives [1, 1, 256]
    net['conv9_1'] = Conv2D(128, kernel_size=(1, 1), activation='relu', padding='same', kernel_regularizer=l2(weight_decay), name='conv9_1')(
        net['conv8_2'])
    net['conv9_2'] = Conv2D(256, kernel_size=(3, 3), strides=(1, 1), activation='relu', padding='valid', kernel_regularizer=l2(weight_decay),
                            name='conv9_2')(net['conv9_1'])
    # ----------------------------End of the backbone feature extractor--------------------------- #

    # -----------------------Process the extracted backbone features--------------------------- #
    # L2-normalize conv4_3 across its channels
    # 38,38,512
    net['conv4_3_norm'] = Normalize(20, name='conv4_3_norm')(net['conv4_3'])
    num_priors = 4
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['conv4_3_norm_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                          name='conv4_3_norm_mbox_loc')(net['conv4_3_norm'])
    net['conv4_3_norm_mbox_loc_flat'] = Flatten(name='conv4_3_norm_mbox_loc_flat')(net['conv4_3_norm_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['conv4_3_norm_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                           name='conv4_3_norm_mbox_conf')(net['conv4_3_norm'])
    net['conv4_3_norm_mbox_conf_flat'] = Flatten(name='conv4_3_norm_mbox_conf_flat')(net['conv4_3_norm_mbox_conf'])

    # Process the fc7 layer
    # 19,19,1024
    num_priors = 6
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['fc7_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay), name='fc7_mbox_loc')(
        net['fc7'])
    net['fc7_mbox_loc_flat'] = Flatten(name='fc7_mbox_loc_flat')(net['fc7_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['fc7_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay), name='fc7_mbox_conf')(
        net['fc7'])
    net['fc7_mbox_conf_flat'] = Flatten(name='fc7_mbox_conf_flat')(net['fc7_mbox_conf'])

    # Process conv6_2
    # 10,10,512
    num_priors = 6
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['conv6_2_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same',kernel_regularizer=l2(weight_decay),  name='conv6_2_mbox_loc')(
        net['conv6_2'])
    net['conv6_2_mbox_loc_flat'] = Flatten(name='conv6_2_mbox_loc_flat')(net['conv6_2_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['conv6_2_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                      name='conv6_2_mbox_conf')(net['conv6_2'])
    net['conv6_2_mbox_conf_flat'] = Flatten(name='conv6_2_mbox_conf_flat')(net['conv6_2_mbox_conf'])

    # Process conv7_2
    # 5,5,256
    num_priors = 6
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['conv7_2_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay), name='conv7_2_mbox_loc')(
        net['conv7_2'])
    net['conv7_2_mbox_loc_flat'] = Flatten(name='conv7_2_mbox_loc_flat')(net['conv7_2_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['conv7_2_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                      name='conv7_2_mbox_conf')(net['conv7_2'])
    net['conv7_2_mbox_conf_flat'] = Flatten(name='conv7_2_mbox_conf_flat')(net['conv7_2_mbox_conf'])

    # Process conv8_2
    # 3,3,256
    num_priors = 4
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['conv8_2_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay), name='conv8_2_mbox_loc')(
        net['conv8_2'])
    net['conv8_2_mbox_loc_flat'] = Flatten(name='conv8_2_mbox_loc_flat')(net['conv8_2_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['conv8_2_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                      name='conv8_2_mbox_conf')(net['conv8_2'])
    net['conv8_2_mbox_conf_flat'] = Flatten(name='conv8_2_mbox_conf_flat')(net['conv8_2_mbox_conf'])

    # Process conv9_2
    # 1,1,256
    num_priors = 4
    # Localization head
    # num_priors is the number of prior boxes per grid cell; 4 is the number of offsets (x, y, h, w)
    net['conv9_2_mbox_loc'] = Conv2D(num_priors * 4, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay), name='conv9_2_mbox_loc')(
        net['conv9_2'])
    net['conv9_2_mbox_loc_flat'] = Flatten(name='conv9_2_mbox_loc_flat')(net['conv9_2_mbox_loc'])
    # num_priors is the number of prior boxes per grid cell; class_num is the number of classes
    net['conv9_2_mbox_conf'] = Conv2D(num_priors * class_num, kernel_size=(3, 3), padding='same', kernel_regularizer=l2(weight_decay),
                                      name='conv9_2_mbox_conf')(net['conv9_2'])
    net['conv9_2_mbox_conf_flat'] = Flatten(name='conv9_2_mbox_conf_flat')(net['conv9_2_mbox_conf'])
	
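    # NOTE: the original listing uses net['predictions'] without ever defining it.
    # The lines below are an assumed reconstruction, chosen to match decode_box further down,
    # which expects predictions of shape (batch, 8732, 4 + class_num) with softmax class scores.
    # Concatenate, Reshape and Activation come from tensorflow.keras.layers.
    heads = ['conv4_3_norm', 'fc7', 'conv6_2', 'conv7_2', 'conv8_2', 'conv9_2']
    net['mbox_loc'] = Concatenate(axis=1, name='mbox_loc')([net[h + '_mbox_loc_flat'] for h in heads])
    net['mbox_conf'] = Concatenate(axis=1, name='mbox_conf')([net[h + '_mbox_conf_flat'] for h in heads])
    net['mbox_loc'] = Reshape((-1, 4), name='mbox_loc_final')(net['mbox_loc'])
    net['mbox_conf'] = Reshape((-1, class_num), name='mbox_conf_logits')(net['mbox_conf'])
    net['mbox_conf'] = Activation('softmax', name='mbox_conf_final')(net['mbox_conf'])
    net['predictions'] = Concatenate(axis=-1, name='predictions')([net['mbox_loc'], net['mbox_conf']])

    # Final model: the input is net['input'], the output is net['predictions']
    model = Model(net['input'], net['predictions'])
    # print('add net finish!')
    return model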

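That covers the backbone that runs horizontally across the top of Figure 1. This part extracts features; next comes the part that uses them.

In Figure 1, the third convolution of Conv4, FC7, and the second convolutions of Conv6, Conv7, Conv8 and Conv9 all branch downward. The outputs of these six layers are processed further, yielding features at different scales. This is what lets SSD detect objects of different sizes.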
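One detail of the listing that is easy to miss is the Normalize layer applied to conv4_3. As a rough illustration, the pure-Python sketch below reproduces what it does at a single pixel: the channel vector is divided by its L2 norm and multiplied by the scale (20 in the listing). The function name `l2_normalize_scale` is invented for this example, and a single scalar scale stands in for the per-channel gamma, which in the real layer is initialized to the same value.

```python
import math

def l2_normalize_scale(channels, scale=20.0, eps=1e-12):
    # Divide the channel vector by its L2 norm, then multiply by the scale.
    # eps guards against division by zero for an all-zero vector.
    norm = math.sqrt(max(sum(v * v for v in channels), eps))
    return [scale * v / norm for v in channels]

pixel = [3.0, 4.0, 0.0]          # a toy 3-channel pixel; conv4_3 really has 512 channels
out = l2_normalize_scale(pixel)
print(out)                       # [12.0, 16.0, 0.0]
```

After this step every pixel's channel vector has the same norm (20 here), whatever its original magnitude was.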

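3. Prior boxes

Before going further into the six feature layers above, a word on prior boxes.

Figure 2. An image divided into 8×8

Suppose a [300,300] image is divided into an [8,8] grid (note: no layer in the SSD network actually uses an [8,8] grid; this is only an example). The prior boxes are the dashed boxes in Figure 2. If an object lies inside the four dashed boxes at the lower left, it can be found through those four boxes. The final size of those boxes is not fixed; the adjustments applied to them have to be learned. However large the object is, there is always some grid resolution at which one dashed box frames it (a simplification: in practice an object may not fit in the image, or may be only partly inside it). So training an SSD model really means learning the relationship between the four parameters (x,y,w,h) of the dashed boxes and the actual object. The actual object's (x,y,w,h) is annotated in advance with a labeling tool. Together with the object's name, that makes five parameters: (x,y,w,h,name).

These dashed boxes are called prior boxes, and training an SSD model means training the parameters attached to them. SSD does not regress (x,y,w,h) directly. Instead it learns the relative relationship between the object's true coordinates and a prior box (x0,y0,w0,h0) fixed in advance. The ground truth therefore has to be converted (encoded) before training, and at prediction time the outputs have to be converted back (decoded) before they are displayed.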
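To make the conversion concrete, here is a small sketch for a single box in (cx, cy, w, h) form. The `decode` function follows the formulas used by decode_boxes later in this article (centre shift scaled by the prior size and a variance, width and height through exp). The `encode` function is simply its algebraic inverse and is an assumption, since the training-side code is not shown here; both function names are chosen for this example only.

```python
import math

VARIANCES = (0.1, 0.1, 0.2, 0.2)

def encode(gt, prior, v=VARIANCES):
    # gt and prior are (cx, cy, w, h); returns the four regression targets
    gcx, gcy, gw, gh = gt
    pcx, pcy, pw, ph = prior
    return ((gcx - pcx) / pw / v[0],
            (gcy - pcy) / ph / v[1],
            math.log(gw / pw) / v[2],
            math.log(gh / ph) / v[3])

def decode(t, prior, v=VARIANCES):
    # Inverse of encode: turns regression outputs back into (cx, cy, w, h)
    pcx, pcy, pw, ph = prior
    return (pcx + t[0] * pw * v[0],
            pcy + t[1] * ph * v[1],
            pw * math.exp(t[2] * v[2]),
            ph * math.exp(t[3] * v[3]))

prior = (0.5, 0.5, 0.2, 0.2)      # a prior box in normalized coordinates
gt = (0.55, 0.45, 0.3, 0.1)       # a ground-truth box
targets = encode(gt, prior)
restored = decode(targets, prior)
print(targets)
print(restored)                   # matches gt up to floating-point error
```

Because the sizes are encoded with a logarithm, decoding with exp always yields a positive width and height, whatever value the network outputs.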

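For the six feature layers, the number of prior boxes per pixel is [4,6,6,6,4,4] (each layer has a different resolution).

For Conv4, the image becomes a [38,38] grid (the counterpart of the [8,8] example above) with 4 prior boxes per pixel, so the layer has 38×38×4=5776 prior boxes. Likewise, the other layers have 2166, 600, 150, 36 and 4 prior boxes, for 8732 in total. One can therefore say that SSD places 8732 prior boxes of different sizes and positions over a [300,300] image and uses them to detect objects. A single object may be detected by several prior boxes, so non-maximum suppression is applied to the detections to keep the most suitable one as the final result.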
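The prior box counts can be verified with a short calculation, using only the grid sizes and per-cell counts quoted in the text:

```python
feature_sizes = [38, 19, 10, 5, 3, 1]     # resolution of the six feature layers
priors_per_cell = [4, 6, 6, 6, 4, 4]      # prior boxes per grid cell

per_layer = [s * s * k for s, k in zip(feature_sizes, priors_per_cell)]
total = sum(per_layer)

print(per_layer)  # [5776, 2166, 600, 150, 36, 4]
print(total)      # 8732
```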

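Each of the six feature layers goes through a convolution with anchors_num×4 output channels (the process described above; the 4 is, as I understand it, (x,y,w,h)). In addition, each goes through a convolution with anchors_num×classes_num output channels that predicts the class, so every prior box has its own class prediction.

The prior box code follows; it generates the 8732 prior boxes.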
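The sketch below works out the channel counts of these two convolutions per layer, and the size of the flattened outputs. The value class_num=21 is only an assumption for illustration (20 object classes plus background, as in VOC); the helper name `head_shapes` is also invented here.

```python
def head_shapes(feature_sizes, priors_per_cell, class_num):
    # For each layer: (grid size, loc channels, conf channels,
    #                  flattened loc length, flattened conf length)
    rows = []
    for s, k in zip(feature_sizes, priors_per_cell):
        loc_channels = k * 4
        conf_channels = k * class_num
        rows.append((s, loc_channels, conf_channels,
                     s * s * loc_channels, s * s * conf_channels))
    return rows

rows = head_shapes([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4], class_num=21)
total_loc = sum(r[3] for r in rows)    # 8732 * 4
total_conf = sum(r[4] for r in rows)   # 8732 * 21

for r in rows:
    print(r)
print(total_loc, total_conf)
```

Reshaped per prior box, these become 8732 rows of 4 offsets and 8732 rows of class_num scores, which is the layout the decoding code later slices with `[:, :, :4]` and `[:, :, 4:]`.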

# Compute the anchor boxes of each effective feature layer
# The six layers are
# 38*38*4  19*19*6 10*10*6 5*5*6 3*3*4 1*1*4
#  5776     2166    600     150   36     4
class AnchorBox():
    def __init__(self, input_shape, min_size, max_size=None, aspect_ratios=None, flip=True):
        self.input_shape = input_shape

        self.min_size = min_size
        self.max_size = max_size

        self.aspect_ratios = []  # with aspect_ratios=[1, 2] this becomes [1, 1.0, 2, 0.5]
        for ar in aspect_ratios:
            self.aspect_ratios.append(ar)
            self.aspect_ratios.append(1.0 / ar)
        # print('AnchorBox aspect_ratios ' + str(self.aspect_ratios))
    def call(self, layer_shape, mask=None):
        # --------------------------------- #
        #   Height and width of the incoming feature layer,
        #   e.g. 38x38
        # --------------------------------- #
        layer_height = layer_shape[0]
        layer_width = layer_shape[1]
        # print('AnchorBox layer_height ' + str(layer_height))
        # print('AnchorBox layer_width ' + str(layer_width))
        # --------------------------------- #
        #   Height and width of the input image,
        #   e.g. 300x300
        # --------------------------------- #
        img_height = self.input_shape[0]
        img_width = self.input_shape[1]
        # print('AnchorBox img_height ' + str(img_height))
        # print('AnchorBox img_width ' + str(img_width))

        box_widths = []
        box_heights = []
        # --------------------------------- #
        #   self.aspect_ratios normally takes one of two values
        #   [1, 1, 2, 1/2]
        #   [1, 1, 2, 1/2, 3, 1/3]
        # --------------------------------- #
        for ar in self.aspect_ratios:
            # print('AnchorBox box_widths ' + str(len(box_widths)))
            # First add a smaller square
            if ar == 1 and len(box_widths) == 0:
                box_widths.append(self.min_size)
                box_heights.append(self.min_size)
            # Then add a larger square
            elif ar == 1 and len(box_widths) > 0:
                box_widths.append(np.sqrt(self.min_size * self.max_size))
                box_heights.append(np.sqrt(self.min_size * self.max_size))
            # Then add the rectangles
            elif ar != 1:
                box_widths.append(self.min_size * np.sqrt(ar))
                box_heights.append(self.min_size / np.sqrt(ar))
        # print('AnchorBox box_widths ' + str(box_widths))
        # print('AnchorBox box_heights ' + str(box_heights))
        # --------------------------------- #
        #   Half-widths and half-heights of all prior boxes
        # --------------------------------- #
        box_widths = 0.5 * np.array(box_widths)
        box_heights = 0.5 * np.array(box_heights)
        # print('AnchorBox box_widths ' + str(box_widths))
        # print('AnchorBox box_heights ' + str(box_heights))
        # --------------------------------- #
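        #   Step of each feature layer:
        #   the layer is a [layer_width, layer_height] grid, and the step is
        #   the distance on the [300, 300] image between neighbouring grid points.
        #   E.g. for a [3, 3] layer, step_x = 300 / 3 = 100, so each grid cell spans 100 pixels of the [300, 300] image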
        # --------------------------------- #
        step_x = img_width / layer_width
        step_y = img_height / layer_height
        # print('AnchorBox layer_width ' + str(layer_width))
        # print('AnchorBox layer_height ' + str(layer_height))
        # print('AnchorBox step_x ' + str(step_x))
        # print('AnchorBox step_y ' + str(step_y))

        # --------------------------------- #
        #   Generate the grid centres:
        #   on each feature layer, all cell centres from left to right
        #   linx and liny have layer_width and layer_height entries, e.g. [3, 3]
        # --------------------------------- #
        linx = np.linspace(0.5 * step_x, img_width - 0.5 * step_x, layer_width)  # [ 50. 150. 250.]
        liny = np.linspace(0.5 * step_y, img_height - 0.5 * step_y, layer_height)  # [ 50. 150. 250.]
        # print('AnchorBox linx ' + str(linx))
        # print('AnchorBox liny ' + str(liny))

        # Turn the x and y sequences into coordinate grids of size (layer_height × layer_width)
        centers_x, centers_y = np.meshgrid(linx, liny)  # for a [3, 3] layer both are 3×3 matrices
        # print('AnchorBox centers_x ' + str(centers_x))
        centers_x = centers_x.reshape(-1, 1)  # flatten the 3×3 matrix into 9 rows, 1 column
        centers_y = centers_y.reshape(-1, 1)
        # print('AnchorBox centers_x ' + str(centers_x))
        # print('AnchorBox centers_y ' + str(centers_y))


        # Each prior box needs the centre (centers_x, centers_y) twice: once for the top-left corner, once for the bottom-right corner
        num_anchors_ = len(self.aspect_ratios)  # 4
        # print('AnchorBox num_anchors_ ' + str(num_anchors_))

        anchor_boxes = np.concatenate((centers_x, centers_y), axis=1)  # centres of the 9 grid cells, one (x, y) per row (9 × 2 values); the first one is (50, 50)
        # print('AnchorBox anchor_boxes ' + str(anchor_boxes))

        anchor_boxes = np.tile(anchor_boxes, (1, 2 * num_anchors_))  # 9 rows, 16 columns (num_anchors_ * 4 = 16): each row's (x, y) centre is repeated 8 times
        # print('AnchorBox anchor_boxes ' + str(len(anchor_boxes)))
        # print('AnchorBox anchor_boxes ' + str(anchor_boxes))
        # Top-left and bottom-right corners of the prior boxes
        anchor_boxes[:, ::4] -= box_widths    # columns 0, 4, 8, 12 (xmin of each of the 4 priors): subtract that prior's half-width
        anchor_boxes[:, 1::4] -= box_heights
        anchor_boxes[:, 2::4] += box_widths   # columns 2, 6, 10, 14 (xmax of each prior): add that prior's half-width
        anchor_boxes[:, 3::4] += box_heights

        # anchor_boxes now has 9 rows of 16 values each
        # print('AnchorBox anchor_boxes ' + str(len(anchor_boxes)))
        # print('AnchorBox anchor_boxes ' + str(anchor_boxes))
        # --------------------------------- #
        #   Convert the prior boxes to fractions of the image size
        #   (normalization)
        # --------------------------------- #
        anchor_boxes[:, ::2] /= img_width
        anchor_boxes[:, 1::2] /= img_height
        anchor_boxes = anchor_boxes.reshape(-1, 4)  # 4 columns, rows inferred: 36 rows here, one (xmin, ymin, xmax, ymax) per prior box, with 4 prior boxes per cell
        # print('AnchorBox anchor_boxes ' + str(len(anchor_boxes)))
        # print('AnchorBox anchor_boxes ' + str(anchor_boxes))
        anchor_boxes = np.minimum(np.maximum(anchor_boxes, 0.0), 1.0)  # clip the coordinates to [0, 1]
        # print('AnchorBox anchor_boxes ' + str(len(anchor_boxes)))
        # print('AnchorBox anchor_boxes ' + str(anchor_boxes))
        return anchor_boxes



#---------------------------------------------------#
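#   Computes the sizes of the shared feature layers
#   For (height, width) = (300, 300) the sizes are [150, 75, 38, 19, 10, 5, 3, 1]
#   The last six, [38, 19, 10, 5, 3, 1], are the resolutions of the six effective feature layers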
#   [38, 38, 512], [19, 19, 1024], [10, 10, 512],
#   [ 5,  5, 256], [ 3,  3,  256], [ 1,  1, 256]
#---------------------------------------------------#
def get_img_output_length(height, width):
    filter_sizes    = [3, 3, 3, 3, 3, 3, 3, 3]
    padding         = [1, 1, 1, 1, 1, 1, 0, 0]
    stride          = [2, 2, 2, 2, 2, 2, 1, 1]
    feature_heights = []
    feature_widths  = []
    # print('get_img_output_length height ' + str(height))
    # print('get_img_output_length width ' + str(width))
    # print('get_img_output_length filter_sizes ' + str(len(filter_sizes)))
    for i in range(len(filter_sizes)):
        height  = (height + 2 * padding[i] - filter_sizes[i]) // stride[i] + 1
        width   = (width + 2 * padding[i] - filter_sizes[i]) // stride[i] + 1
        # print(str(i) + ' height ' + str(height)  + ' width ' + str(width))
        feature_heights.append(height)
        feature_widths.append(width)
    return np.array(feature_heights)[-6:], np.array(feature_widths)[-6:]


# Build all anchor boxes over the six effective feature layers
def get_anchors(input_shape = [300,300], anchors_size = [30, 60, 111, 162, 213, 264, 315]):
    feature_heights, feature_widths = get_img_output_length(input_shape[0], input_shape[1])  # h, w = (300, 300)
    aspect_ratios = [[1, 2], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2]]
    anchors = []
    # print('get_anchors feature_heights ' + str(feature_heights))
    # print('get_anchors feature_heights ' + str(feature_heights))

    for i in range(len(feature_heights)):
        # Anchor boxes of each effective feature layer, corresponding to
        # 38*38*4  19*19*6 10*10*6 5*5*6 3*3*4 1*1*4
        #  5776     2166    600     150   36     4
        anchors.append(AnchorBox(input_shape, anchors_size[i], max_size = anchors_size[i+1],
                    aspect_ratios = aspect_ratios[i]).call([feature_heights[i], feature_widths[i]]))

    # print('get_anchors anchors ' + str(len(anchors)))
    anchors = np.concatenate(anchors, axis=0)
    return anchors
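As a worked example of the logic in AnchorBox, the pure-Python sketch below reproduces two pieces of it without NumPy: the four (width, height) pairs for the first feature layer (min_size=30, max_size=60, aspect ratios [1, 2], values taken from get_anchors above) and the grid centres of a 3×3 layer. The function name `prior_sizes` is made up for this illustration.

```python
import math

def prior_sizes(min_size, max_size, aspect_ratios):
    # Expand [1, 2] into [1, 1.0, 2, 0.5], as AnchorBox.__init__ does
    ratios = []
    for ar in aspect_ratios:
        ratios.append(ar)
        ratios.append(1.0 / ar)
    sizes = []
    for ar in ratios:
        if ar == 1 and not sizes:
            sizes.append((min_size, min_size))             # small square
        elif ar == 1:
            side = math.sqrt(min_size * max_size)          # larger square
            sizes.append((side, side))
        else:                                              # rectangles with area min_size**2
            sizes.append((min_size * math.sqrt(ar), min_size / math.sqrt(ar)))
    return sizes

sizes = prior_sizes(30, 60, [1, 2])
for w, h in sizes:
    print(round(w, 2), round(h, 2))

# Cell centres of a 3x3 feature layer on a 300-pixel image
grid, img = 3, 300
step = img / grid
centers = [(i + 0.5) * step for i in range(grid)]
print(centers)  # [50.0, 150.0, 250.0]
```

The two rectangles have the same area as the small square; only the aspect ratio changes.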

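Below is the code for the encoding/decoding step mentioned earlier (the decoding side).

# Decode predictions into real boxes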
class BBoxUtility(object):
    def __init__(self, num_classes, nms_thresh=0.45, top_k=300):
        self.num_classes = num_classes
        self._nms_thresh = nms_thresh
        self._top_k = top_k

    def ssd_correct_boxes(self, box_xy, box_wh, input_shape, image_shape, letterbox_image):
        # -----------------------------------------------------------------#
        #   y is put first so the boxes can be multiplied directly by the image (height, width)
        # -----------------------------------------------------------------#
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        input_shape = np.array(input_shape)
        image_shape = np.array(image_shape)

        if letterbox_image:
            # -----------------------------------------------------------------#
            #   offset is the offset of the valid image area from the top-left corner of the input
            #   new_shape is the image size after scaling
            # -----------------------------------------------------------------#
            new_shape = np.round(image_shape * np.min(input_shape / image_shape))
            offset = (input_shape - new_shape) / 2. / input_shape
            scale = input_shape / new_shape

            box_yx = (box_yx - offset) * scale
            box_hw *= scale

        box_mins = box_yx - (box_hw / 2.)
        box_maxes = box_yx + (box_hw / 2.)
        boxes = np.concatenate([box_mins[..., 0:1], box_mins[..., 1:2], box_maxes[..., 0:1], box_maxes[..., 1:2]],
                               axis=-1)
        boxes *= np.concatenate([image_shape, image_shape], axis=-1)
        return boxes

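    # Process the regression output to get the positions of the real boxes
    # Decodes the predictions for a single image
    # mbox_loc: regression output for the 8732 prior boxes, four values each (x, y, w, h offsets)
    # anchors: all 8732 anchor boxes
    def decode_boxes(self, mbox_loc, anchors, variances=[0.1, 0.1, 0.2, 0.2]):
        # Width and height of the prior boxes
        # Each anchor holds four normalized values: (xmin, ymin, xmax, ymax)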
        # print('decode_boxes mbox_loc ' + str(len(mbox_loc[0])))

        anchor_width = anchors[:, 2] - anchors[:, 0]  # 每个锚点框对应的w,h
        anchor_height = anchors[:, 3] - anchors[:, 1]
        # print('decode_boxes anchors ' + str(len(anchors)))
        # print('decode_boxes anchors ' + str(len(anchors[0])))
        # print('decode_boxes anchor_width ' + str(len(anchor_width)))
        # print('decode_boxes anchor_height ' + str(len(anchor_height)))
        # print('decode_boxes anchor_width ' + str(anchor_width))


        # Centre of every prior box
        anchor_center_x = 0.5 * (anchors[:, 2] + anchors[:, 0])
        anchor_center_y = 0.5 * (anchors[:, 3] + anchors[:, 1])

        # Combine the predicted offsets with the prior boxes.
        # Each prior box has a fixed size; the predicted offset times the prior size (and the variance)
        # is the shift of the real box centre from the prior centre.
        decode_bbox_center_x = anchor_center_x + mbox_loc[:, 0] * anchor_width * variances[0]
        decode_bbox_center_y = anchor_center_y + mbox_loc[:, 1] * anchor_height * variances[1]

        # print('decode_boxes decode_bbox_center_x ' + str(len(decode_bbox_center_x)))
        # print('decode_boxes decode_bbox_center_y ' + str(len(decode_bbox_center_y)))

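        # Width and height of the real boxes
        decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[2])  # exp() inverts the logarithm applied when sizes are encoded as log(w / anchor_w), and keeps the decoded size positive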
        decode_bbox_width *= anchor_width
        decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[3])
        decode_bbox_height *= anchor_height

        # Top-left and bottom-right corners of the real boxes
        decode_bbox_xmin = decode_bbox_center_x - 0.5 * decode_bbox_width
        decode_bbox_ymin = decode_bbox_center_y - 0.5 * decode_bbox_height
        decode_bbox_xmax = decode_bbox_center_x + 0.5 * decode_bbox_width
        decode_bbox_ymax = decode_bbox_center_y + 0.5 * decode_bbox_height

        # Stack the top-left and bottom-right coordinates of all boxes into one array
        decode_bbox = np.concatenate((decode_bbox_xmin[:, None],
                                      decode_bbox_ymin[:, None],
                                      decode_bbox_xmax[:, None],
                                      decode_bbox_ymax[:, None]), axis=-1)
        # Clip to the range [0, 1]
        decode_bbox = np.minimum(np.maximum(decode_bbox, 0.0), 1.0)
        return decode_bbox

    # Decode the predictions of the SSD model
    # anchors: all anchor boxes
    # image_shape: size of the original image, which varies, e.g. 1330×1330
    # input_shape: input size of the SSD model, fixed at 300×300
    def decode_box(self, predictions, anchors, image_shape, input_shape, letterbox_image,
                   variances=[0.1, 0.1, 0.2, 0.2], confidence=0.5):
        # print('decode_box anchors ' + str(len(anchors)))
        # print('decode_box image_shape ' + str(image_shape))
        # print('decode_box input_shape ' + str(input_shape))
        # ---------------------------------------------------#
        #   [:4] holds the regression output
        # ---------------------------------------------------#
        mbox_loc = predictions[:, :, :4]   # box regression output for all 8732 prior boxes
        # print('decode_box mbox_loc ' + str(len(mbox_loc[0])))
        # print('decode_box mbox_loc ' + str(mbox_loc))
        # ---------------------------------------------------#
        #   Class confidences
        # ---------------------------------------------------#
        mbox_conf = predictions[:, :, 4:]   # class confidences for all 8732 prior boxes
        # print('decode_box mbox_conf ' + str(len(mbox_conf[0])))
        # print('decode_box mbox_conf ' + str(mbox_conf))
        results = []
        # ----------------------------------------------------------------------------------------------------------------#
        #   Process each image; predict.py feeds a single image, so for i in range(len(mbox_loc)) runs only once
        # ----------------------------------------------------------------------------------------------------------------#
        for i in range(len(mbox_loc)):
            results.append([])
            # --------------------------------#
            #   Decode the prior boxes with the regression output
            # --------------------------------#
            decode_bbox = self.decode_boxes(mbox_loc[i], anchors, variances)  # all decoded real boxes, four values each: top-left and bottom-right corners
            # print('decode_box decode_bbox ' + str(len(decode_bbox)))
            # print('decode_box decode_bbox ' + str(len(decode_bbox[0])))

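            # Process all decoded boxes class by class (index 0 is skipped, presumably the background class)
            for c in range(1, self.num_classes):
                # --------------------------------#
                #   Take the confidences of all boxes for this class
                #   and compare them with the threshold
                # --------------------------------#
                c_confs = mbox_conf[i, :, c]  # confidences of all predicted boxes for this class
                c_confs_m = c_confs > confidence  # mask: confidence above the threshold
                # print('decode_box c_confs_m ' + '   ' + str(c)  + '  ' + str(len(c_confs[c_confs_m])))
                #  len(c_confs[c_confs_m]) is the number of boxes of this class whose confidence exceeds confidence
                #  c_confs[c_confs_m] are the confidences of those boxes
                if len(c_confs[c_confs_m]) > 0: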
                    # -----------------------------------------#
                    #   Keep the boxes whose score exceeds confidence
                    # -----------------------------------------#
                    # boxes_to_process and confs_to_process both have len(c_confs[c_confs_m]) elements
                    # They hold the boxes and confidences of the predictions above the threshold
                    boxes_to_process = decode_bbox[c_confs_m]  # decoded boxes above the threshold
                    confs_to_process = c_confs[c_confs_m]     # their confidences
                    # print('decode_box boxes_to_process ' + str(len(boxes_to_process)))
                    # print('decode_box boxes_to_process ' + str(len(confs_to_process)))
                    # -----------------------------------------#
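                    #   IoU-based non-maximum suppression: several predicted boxes may overlap on one object; among overlapping boxes only the highest-scoring one is kept
                    #   len(idx) is the number of objects of this class that are finally detected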
                    # -----------------------------------------#
                    idx = tf.image.non_max_suppression(tf.cast(boxes_to_process, tf.float32),
                                                       tf.cast(confs_to_process, tf.float32),
                                                       self._top_k,
                                                       iou_threshold=self._nms_thresh).numpy()
                    # print('decode_box idx ' + str(idx))
                    # -----------------------------------------#
                    #   Keep what survived non-maximum suppression
                    #   Each detected object has a position and a confidence
                    #   good_boxes holds the positions of all detected objects of this class and confs the matching confidences; both have the same length
                    # -----------------------------------------#
                    good_boxes = boxes_to_process[idx]
                    confs = confs_to_process[idx][:, None]
                    # [:, None] turns a 1-D array into a column vector,
                    # e.g. [0.9922133  0.9003193  0.81056666] becomes [[0.9922133 ]
                    #                                                 [0.9003193 ]
                    #                                                 [0.81056666]]
                    # print('decode_box good_boxes ' + str(good_boxes))
                    # print('decode_box confs ' + str(confs))
                    # print('decode_box confs ' + str(confs_to_process[idx]))


                    labels = (c - 1) * np.ones((len(idx), 1))  # a len(idx) × 1 column filled with the label c - 1
                    # -----------------------------------------#
                    #   Stack box position, label and confidence.
                    # -----------------------------------------#
                    c_pred = np.concatenate((good_boxes, labels, confs), axis=1)
                    # print('decode_box c_pred ' + str(c_pred))
                    # Append to results
                    results[-1].extend(c_pred)

            if len(results[-1]) > 0:
                results[-1] = np.array(results[-1])
                box_xy = (results[-1][:, 0:2] + results[-1][:, 2:4]) / 2
                box_wh = results[-1][:, 2:4] - results[-1][:, 0:2]
                results[-1][:, :4] = self.ssd_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)

        return results
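The listing delegates non-maximum suppression to tf.image.non_max_suppression. To show the idea behind that call, here is a greedy NMS sketch in plain Python. It is an illustration of the technique, not TensorFlow's implementation; the names `iou` and `nms` and the sample boxes are invented for this example, and the defaults mirror nms_thresh=0.45 and top_k=300 from BBoxUtility.

```python
def iou(a, b):
    # Boxes are (xmin, ymin, xmax, ymax); returns intersection over union
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45, top_k=300):
    # Visit boxes from highest to lowest score; keep a box only if it does not
    # overlap an already kept box by more than iou_threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if len(keep) >= top_k:
            break
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0.10, 0.10, 0.50, 0.50),   # two heavily overlapping boxes on one object
         (0.12, 0.12, 0.52, 0.52),
         (0.60, 0.60, 0.90, 0.90)]   # a separate object
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the third box does not overlap either and is kept.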

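In my view, that is roughly the core code of the SSD algorithm; the rest is comparatively easy to write. I am not posting the GitHub link for now because some parts are unfinished; I will open-source it once everything is done.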

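III. Testing

The SSD model has been trained on a self-made dataset, annotated with labelImg.

Terminal output

Image output, with the detected objects marked on the image.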

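IV. Summary

SSD does reasonably well on both speed and accuracy. Put simply, the algorithm lays out 8732 prior boxes over a [300,300] image, each responsible for detecting one region, and regresses their outputs to obtain the final result.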
