Training the SSD model builds on the prediction pipeline, so the predict part needs to be understood before looking at train.

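    if mode == "predict":
        '''
        1. To save the image after detection, call r_image.save("img.jpg"); this can be changed directly in predict.py.
        2. To get the coordinates of the prediction boxes, go into ssd.detect_image and read the four values top, left, bottom, right in the drawing section.
        3. To crop a target out with its prediction box, go into ssd.detect_image and, in the drawing section, use the top, left, bottom, right values
        to slice the original image as an array.
        4. To write extra text on the result image, such as the number of detected objects of a given class, go into ssd.detect_image and test predicted_class in the drawing section,
        e.g. if predicted_class == 'car': tells you whether the current object is a car, so you can count them. Use draw.text to write the text.
        '''
        while True:
            # img = input('Input image filename:')
            # Hard-coded path for debugging; a raw string keeps the backslashes of the Windows path literal.
            img = r"E:\deeplearnning-project\ssd-pytorch-master\img\street.jpg"
            try:
                image = Image.open(img)
            except Exception:
                print('Open Error! Try again!')
                continue
            else:
                r_image = ssd.detect_image(image, crop = crop, count=count)
                r_image.show()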

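As the code above shows, the image is read and passed to ssd.detect_image, which returns the image with the predicted boxes drawn on it.

The ssd.detect_image function expands as follows: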

    def detect_image(self, image, crop = False, count = False):
        #---------------------------------------------------#
        #   Get the height and width of the input image
        #---------------------------------------------------#
        image_shape = np.array(np.shape(image)[0:2])
        #---------------------------------------------------------#
        #   Convert the image to RGB so that grayscale images do not fail at prediction time.
        #   The code only supports RGB prediction; every other image type is converted to RGB.
        #---------------------------------------------------------#
        image       = cvtColor(image)
        #---------------------------------------------------------#
        #   Pad the image with gray bars to resize without distortion.
        #   A plain resize can also be used for detection.
        #---------------------------------------------------------#
        image_data  = resize_image(image, (self.input_shape[1], self.input_shape[0]), self.letterbox_image)
        #---------------------------------------------------------#
        #   Preprocess (normalize) the image, move channels first, and add the batch_size dimension.
        #---------------------------------------------------------#
        image_data = np.expand_dims(np.transpose(preprocess_input(np.array(image_data, dtype='float32')), (2, 0, 1)), 0)

        with torch.no_grad():
            #---------------------------------------------------#
            #   Convert to a torch tensor
            #---------------------------------------------------#
            images = torch.from_numpy(image_data).type(torch.FloatTensor)
            if self.cuda:
                images = images.cuda()
            #---------------------------------------------------------#
            #   Feed the image through the network.
            #   outputs holds, for every anchor, the box adjustment parameters and the class confidences.
            #---------------------------------------------------------#
            outputs     = self.net(images)
            #-----------------------------------------------------------#
            #   Decode the prediction results
            #-----------------------------------------------------------#
            results     = self.bbox_util.decode_box(outputs, self.anchors, image_shape, self.input_shape, self.letterbox_image, 
                                                    nms_iou = self.nms_iou, confidence = self.confidence)
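
The gray-bar resize mentioned in the comments can be illustrated with a small sketch. The body of resize_image is not shown in this article, so the helper below (letterbox_geometry, a hypothetical name) only assumes the common approach: scale the image by the smaller of the two width/height ratios, then center it on a canvas of the target size.

```python
def letterbox_geometry(image_size, target_size):
    """Return the scaled image size and the top-left padding for a letterbox resize.

    image_size and target_size are (width, height) tuples.
    This is an illustrative assumption, not the actual resize_image code.
    """
    iw, ih = image_size
    w, h = target_size
    # Use the smaller ratio so the whole image fits inside the target.
    scale = min(w / iw, h / ih)
    nw, nh = int(iw * scale), int(ih * scale)
    # Center the scaled image; the remaining area is filled with gray.
    pad_left, pad_top = (w - nw) // 2, (h - nh) // 2
    return (nw, nh), (pad_left, pad_top)


new_size, padding = letterbox_geometry((1280, 720), (300, 300))
print(new_size, padding)  # (300, 168) (0, 66)
```

For a 1280x720 image and a 300x300 input, the image is scaled to 300x168 and 66 rows of gray are added above it, with the rest of the padding below.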

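Passing the image through outputs = self.net(images) yields the model's raw box predictions and class confidences. These outputs are then handed to decode_box for processing, and note that the anchors are passed in along with them.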

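To follow this step we first need to understand how three kinds of boxes relate to each other: the anchor box, the prediction box, and the ground-truth box.

First, the coordinate format of the three kinds of boxes

The conclusion up front: anchors, decoded prediction boxes, and ground-truth boxes all use the format [x1, y1, x2, y2], because intersection and union (IoU) can be computed directly in this format.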
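
To make the point about the corner format concrete, here is a minimal IoU sketch for two [x1, y1, x2, y2] boxes. The function name iou_xyxy is made up for this illustration and is not taken from the repository.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    # The intersection rectangle uses the larger top-left and the smaller bottom-right corner.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


example_iou = iou_xyxy([0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0])
print(example_iou)  # intersection 1, union 7
```

With corner coordinates the intersection needs only max and min on the corners; with [x, y, w, h] the corners would have to be computed first.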

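Start with the anchor box. An anchor has the format [x1, y1, x2, y2], that is, the top-left and bottom-right corners. SSD predicts from 6 feature layers. Take one of them as an example: a 3*3 feature layer has 9 cells, and each cell generates a fixed number of anchors that serve as initial guesses for the boxes. The number per cell is 4 or 6 depending on the layer; in the standard SSD300 configuration the 3*3 layer uses 4, so its 9 cells produce 36 anchors. The other feature layers work the same way. Finally, the anchors of all feature layers are joined with np.concatenate(anchors, axis=0), giving an anchor array of shape (8732, 4), so SSD always works with a fixed set of 8732 anchors. Note that the anchors are independent of the image being predicted: whatever image is fed in, the same anchors are generated, each in the format [x1, y1, x2, y2]. How the anchors are generated in detail is covered in other articles.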
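
The figure 8732 can be checked with a few lines of arithmetic. The feature map sizes and anchors per cell below are the standard SSD300 values; they are not listed in this article, so treat them as an assumption about the configuration in use.

```python
# Standard SSD300 configuration (assumed, not stated in the article).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
anchors_per_cell = [4, 6, 6, 6, 4, 4]

# Each layer contributes (cells per side)^2 * (anchors per cell) anchors.
per_layer = [s * s * k for s, k in zip(feature_map_sizes, anchors_per_cell)]
total_anchors = sum(per_layer)

print(per_layer)      # [5776, 2166, 600, 150, 36, 4]
print(total_anchors)  # 8732
```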

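The prediction box is the outputs mentioned above: the raw numerical result produced by the model. It is called a numerical result rather than a coordinate because it has to be decoded before it becomes a coordinate, and decoding uses the anchors described above. Pay close attention to the coordinate conversions that follow.

A prediction box also ends up as the coordinates of a rectangle, but only after decoding. Decoding first converts each anchor to the form [x, y, w, h], where (x, y) is the center of the rectangle. Reading the code shows that the four predicted values are offsets relative to the anchor's [x, y, w, h]. Applying the offsets gives the prediction box as [x, y, w, h], which is converted back to [x1, y1, x2, y2] before being returned. The result also has shape (8732, 4).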


    def decode_boxes(self, mbox_loc, anchors, variances):
        # Width and height of the anchors
        anchor_width     = anchors[:, 2] - anchors[:, 0]
        anchor_height    = anchors[:, 3] - anchors[:, 1]
        # Center points of the anchors
        anchor_center_x  = 0.5 * (anchors[:, 2] + anchors[:, 0])
        anchor_center_y  = 0.5 * (anchors[:, 3] + anchors[:, 1])

        # Center of the prediction box: offset from the anchor center along x and y
        decode_bbox_center_x = mbox_loc[:, 0] * anchor_width * variances[0]
        decode_bbox_center_x += anchor_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * anchor_height * variances[0]
        decode_bbox_center_y += anchor_center_y
        
        # Width and height of the prediction box
        decode_bbox_width   = torch.exp(mbox_loc[:, 2] * variances[1])
        decode_bbox_width   *= anchor_width
        decode_bbox_height  = torch.exp(mbox_loc[:, 3] * variances[1])
        decode_bbox_height  *= anchor_height

        # Top-left and bottom-right corners of the prediction box
        decode_bbox_xmin = decode_bbox_center_x - 0.5 * decode_bbox_width
        decode_bbox_ymin = decode_bbox_center_y - 0.5 * decode_bbox_height
        decode_bbox_xmax = decode_bbox_center_x + 0.5 * decode_bbox_width
        decode_bbox_ymax = decode_bbox_center_y + 0.5 * decode_bbox_height

        # Stack the top-left and bottom-right corners into [x1, y1, x2, y2]
        decode_bbox = torch.cat((decode_bbox_xmin[:, None],
                                      decode_bbox_ymin[:, None],
                                      decode_bbox_xmax[:, None],
                                      decode_bbox_ymax[:, None]), dim=-1)
        # Clamp the result to [0, 1]:
        # torch.max keeps the larger value element-wise (lower bound 0), torch.min keeps the smaller (upper bound 1)
        decode_bbox = torch.min(torch.max(decode_bbox, torch.zeros_like(decode_bbox)), torch.ones_like(decode_bbox))
        return decode_bbox
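
To see the decoding arithmetic on concrete numbers, here is a single-box version written with plain Python floats. The name decode_one is hypothetical, and the variances (0.1, 0.2) are the values commonly used with SSD; the article does not state which values the repository passes in, so they are an assumption.

```python
import math


def decode_one(loc, anchor, variances=(0.1, 0.2)):
    """Decode one prediction (4 offsets) against one anchor [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = anchor
    # Anchor in center/size form.
    aw, ah = x2 - x1, y2 - y1
    acx, acy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # Shift the center, scaled by the anchor size and the first variance.
    cx = loc[0] * aw * variances[0] + acx
    cy = loc[1] * ah * variances[0] + acy
    # Scale width and height exponentially, using the second variance.
    w = math.exp(loc[2] * variances[1]) * aw
    h = math.exp(loc[3] * variances[1]) * ah
    box = [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]
    # Clamp to the normalized image range [0, 1].
    return [min(max(v, 0.0), 1.0) for v in box]


anchor = [0.4, 0.4, 0.6, 0.6]
unchanged = decode_one([0.0, 0.0, 0.0, 0.0], anchor)  # zero offsets return the anchor itself
shifted = decode_one([1.0, 0.0, 0.0, 0.0], anchor)    # center moves right by 0.2 * 0.1 = 0.02
print(unchanged, shifted)
```

All-zero offsets reproduce the anchor, which shows that the network output is a correction to the anchor rather than an absolute position.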

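Finally, the ground-truth box. This is the manually annotated box. It is only used in the train stage and plays no role in prediction, so it is not covered here.

Processing the prediction boxes with NMS

After the steps above we have the prediction boxes in the format [x1, y1, x2, y2]. The network output also contains the class confidences of every prediction box, with shape (8732, 21).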

For how NMS works, see https://blog.csdn.net/qq_38316300/article/details/120174900?spm=1001.2014.3001.5506
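
As a rough illustration of the idea, the sketch below implements greedy NMS for a single class in plain Python. It is not the repository's code; the function names and the 0.45 threshold are assumptions made for this example, and in SSD the procedure is typically run once per class after filtering by confidence.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Only boxes that do not overlap the kept box too much stay as candidates.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [[0.10, 0.10, 0.50, 0.50], [0.12, 0.12, 0.52, 0.52], [0.60, 0.60, 0.90, 0.90]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the third box does not overlap and is kept.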

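The boxes that remain after NMS are still normalized, so they have to be mapped back to the size of the original image. This gives the real pixel coordinates of the two corners of each predicted box (ssd_correct_boxes returns them ordered as top, left, bottom, right). The boxes are then drawn on the original image, and displaying it gives the final detection result.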

            if len(results[-1]) > 0:
                results[-1] = np.array(results[-1])
                # ssd_correct_boxes takes box centers and sizes, so convert [x1, y1, x2, y2] to (center xy, wh) first
                box_xy, box_wh = (results[-1][:, 0:2] + results[-1][:, 2:4])/2, results[-1][:, 2:4] - results[-1][:, 0:2]
                # The values above are still normalized
                # ssd_correct_boxes maps them to actual sizes in the original image
                results[-1][:, :4] = self.ssd_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
                # The corrected boxes are corner coordinates ordered [y1, x1, y2, x2] (top, left, bottom, right)

    def ssd_correct_boxes(self, box_xy, box_wh, input_shape, image_shape, letterbox_image):
        #-----------------------------------------------------------------#
        #   The y axis is put first so that the boxes can be multiplied
        #   directly by the image (height, width)
        #-----------------------------------------------------------------#
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        input_shape = np.array(input_shape)
        image_shape = np.array(image_shape)

        if letterbox_image:
            #-----------------------------------------------------------------#
            #   offset is the offset of the valid image region relative to the
            #   top-left corner of the padded input
            #   new_shape is the height and width of the image after scaling
            #-----------------------------------------------------------------#
            new_shape = np.round(image_shape * np.min(input_shape/image_shape))
            offset  = (input_shape - new_shape)/2./input_shape
            scale   = input_shape/new_shape

            box_yx  = (box_yx - offset) * scale
            box_hw *= scale

        box_mins    = box_yx - (box_hw / 2.)
        box_maxes   = box_yx + (box_hw / 2.)
        boxes  = np.concatenate([box_mins[..., 0:1], box_mins[..., 1:2], box_maxes[..., 0:1], box_maxes[..., 1:2]], axis=-1)
        boxes *= np.concatenate([image_shape, image_shape], axis=-1)
        return boxes
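
The same correction can be traced for a single box with plain Python. The sketch below (correct_box is a hypothetical name) mirrors the steps of ssd_correct_boxes without NumPy: it removes the gray-bar offset, rescales to the valid region, and multiplies by the original image size. Shapes are (height, width), and the result is ordered [top, left, bottom, right].

```python
def correct_box(box_xyxy, input_shape, image_shape, letterbox=True):
    """Map one normalized [x1, y1, x2, y2] box back to pixels in the original image."""
    x1, y1, x2, y2 = box_xyxy
    cy, cx = (y1 + y2) / 2, (x1 + x2) / 2
    bh, bw = y2 - y1, x2 - x1
    in_h, in_w = input_shape
    im_h, im_w = image_shape
    if letterbox:
        # Size of the image after scaling it to fit inside the network input.
        ratio = min(in_h / im_h, in_w / im_w)
        new_h, new_w = round(im_h * ratio), round(im_w * ratio)
        # Normalized offset of the valid region, and the factor that stretches it back to [0, 1].
        off_y, off_x = (in_h - new_h) / 2 / in_h, (in_w - new_w) / 2 / in_w
        sy, sx = in_h / new_h, in_w / new_w
        cy, cx = (cy - off_y) * sy, (cx - off_x) * sx
        bh, bw = bh * sy, bw * sx
    return [(cy - bh / 2) * im_h, (cx - bw / 2) * im_w,
            (cy + bh / 2) * im_h, (cx + bw / 2) * im_w]


# A box covering exactly the valid (non-gray) region of a 720x1280 image in a 300x300 input.
full = correct_box([0.0, 65.5 / 300, 1.0, 234.5 / 300], (300, 300), (720, 1280))
plain = correct_box([0.25, 0.25, 0.75, 0.75], (300, 300), (720, 1280), letterbox=False)
print(full, plain)
```

A box that covers the whole valid region maps back to the full 720x1280 image, which is a quick sanity check that the offset and scale are applied in the right order.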

Reference:

睿智的目标检测23——Pytorch搭建SSD目标检测平台 (Building an SSD object detection platform with PyTorch)
