[Object detection series] A detailed explanation of YOLOv2's loss function (with PyTorch code)

0. Speechless

1. First, a recap of the theory

 

2. Process:

        a. Combine the predicted box values (x, y, w, h) with each of the 5 anchors to obtain the actual predicted box (x0, y0, w0, h0). Implemented by the yolo_to_bbox function.

        b. Compute the IoU between each ground-truth box and each predicted box. Implemented by the bbox_ious function, in pixel coordinates of the input image.

        c. Apply a threshold to these IoU values to decide which predictions contain no object, and prepare the loss terms for them. This gives the no-object part of the confidence loss.

        d. Compute the IoU between each ground-truth box and each anchor box. Implemented by the anchor_intersections function, in feature-map coordinates.

        e. From the result of d, find the best anchor for each ground-truth box and treat that anchor in that cell as containing the object. Then compute the 3 loss components for it: the confidence loss, the classification loss, and the box loss.
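
Steps b and c can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's bbox_ious: the helper name pairwise_iou, the sample boxes, and the plain (x2 - x1) * (y2 - y1) area convention are all assumptions made for the example. Only the 0.6 threshold comes from the code discussed later.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU of every box in boxes_a (N, 4) against every box in boxes_b (K, 4).

    Boxes are (x1, y1, x2, y2). Returns an (N, K) array.
    Hypothetical helper: a plain-NumPy stand-in for the idea behind bbox_ious.
    """
    a = boxes_a[:, None, :]  # (N, 1, 4)
    b = boxes_b[None, :, :]  # (1, K, 4)
    # Width and height of the overlap, clipped at 0 when the boxes are disjoint.
    iw = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter)


# Made-up predicted boxes and one ground-truth box, in pixels.
pred = np.array([[0., 0., 10., 10.],         # identical to the ground truth
                 [5., 0., 15., 10.],         # overlaps half of it
                 [100., 100., 110., 110.]])  # far away
gt = np.array([[0., 0., 10., 10.]])

ious = pairwise_iou(pred, gt)    # step b: shape (3, 1)
best_ious = ious.max(axis=1)     # best ground truth for each prediction
no_object = best_ious < 0.6      # step c: predictions treated as background
print(best_ious, no_object)
```

Only the first prediction clears the threshold here; the other two would contribute to the no-object confidence loss.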

1. Initial processing

# tx, ty, tw, th, to -> sig(tx), sig(ty), exp(tw), exp(th), sig(to)
        # xy_pred holds the offsets of the box centre from the top-left corner of its cell; sigmoid keeps them between 0 and 1.
        xy_pred = torch.sigmoid(global_average_pool_reshaped[:, :, :, 0:2])
        # wh_pred holds the predicted width and height, which are decoded with the exponential function.
        wh_pred = torch.exp(global_average_pool_reshaped[:, :, :, 2:4])
        bbox_pred = torch.cat([xy_pred, wh_pred], 3)  # bbox_pred: [batchsize, w * h, num_anchors, 4]
        # iou_pred holds the IoU (objectness) confidence; sigmoid keeps it between 0 and 1. Binary question: object or no object.
        iou_pred = torch.sigmoid(global_average_pool_reshaped[:, :, :, 4:5])
        # score_pred holds the class scores; softmax maps them to values between 0 and 1. Multi-class question: which class.
        score_pred = global_average_pool_reshaped[:, :, :, 5:].contiguous()
        prob_pred = F.softmax(score_pred.view(-1, score_pred.size()[-1]), dim=-1).view_as(score_pred)  # noqa
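
To see what this decoding does to actual numbers, here is a small NumPy mirror of the same four steps. It is only a sketch: the sigmoid and softmax helpers, the 3-class setting, and the raw values are made up for illustration, and NumPy stands in for PyTorch so the snippet runs on its own.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


num_classes = 3  # hypothetical; the comments in the article mention 20 classes
# Raw network output laid out as [batch, h*w, num_anchors, 5 + num_classes].
raw = np.zeros((1, 4, 2, 5 + num_classes))
# One made-up prediction: tx, ty, tw, th, to, then three class scores.
raw[0, 0, 0] = [0.0, 2.0, 0.0, np.log(2.0), -1.0, 1.0, 1.0, 1.0]

xy_pred = sigmoid(raw[..., 0:2])   # centre offsets inside the cell, in (0, 1)
wh_pred = np.exp(raw[..., 2:4])    # positive width / height factors
iou_pred = sigmoid(raw[..., 4:5])  # objectness confidence, in (0, 1)
prob_pred = softmax(raw[..., 5:])  # class probabilities, summing to 1

print(xy_pred[0, 0, 0], wh_pred[0, 0, 0], iou_pred[0, 0, 0], prob_pred[0, 0, 0])
```

A raw tx of 0 lands exactly in the middle of the cell (0.5), and a raw tw of 0 gives a factor of 1, which leaves the anchor width unchanged once the anchor is applied.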

2. How does the anchor correspond to the predicted value? The code below makes this clear. For every cell, this function computes the predicted box under each of the 5 anchors. (It is a bit of a mouthful: the centre values in bbox_pred_np are offsets between 0 and 1 relative to the top-left corner of the cell, and the width and height values are scale factors. Only after multiplying by the anchor do we get the predicted box for the current anchor in the current cell, and that box can then be mapped back to the original image to obtain pixel coordinates.) In other words, the predicted bounding boxes only exist after this function has run. Its output is normalised to the range 0 to 1. The corresponding statement in the loss code is bbox_np = bbox_np[0]

cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64  # np.float was an alias of the builtin float and has been removed from recent NumPy
ctypedef np.float_t DTYPE_t

cdef extern from "math.h":
    double abs(double m)
    double log(double x)


def yolo_to_bbox(
        np.ndarray[DTYPE_t, ndim=4] bbox_pred,
        np.ndarray[DTYPE_t, ndim=2] anchors, int H, int W):
    return yolo_to_bbox_c(bbox_pred, anchors, H, W)

cdef yolo_to_bbox_c(
        np.ndarray[DTYPE_t, ndim=4] bbox_pred,
        np.ndarray[DTYPE_t, ndim=2] anchors, int H, int W):
    """
    Parameters
    ----------
    bbox_pred: (bsize, HxW, num_anchors, 4) ndarray of float (sig(tx), sig(ty), exp(tw), exp(th))
    anchors: (num_anchors, 2) (pw, ph)
    Returns
    -------
    bbox_out: (bsize, HxW, num_anchors, 4) ndarray of bbox (x1, y1, x2, y2) rescaled to (0, 1)
    """
    cdef unsigned int bsize = bbox_pred.shape[0]
    cdef unsigned int num_anchors = anchors.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=4] bbox_out = np.zeros((bsize, H*W, num_anchors, 4), dtype=DTYPE)

    cdef DTYPE_t cx, cy, bw, bh
    cdef unsigned int row, col, a, ind
    for b in range(bsize):  # images in the batch
        for row in range(H):  # grid rows
            for col in range(W):  # grid columns
                ind = row * W + col  # unique cell index, e.g. 0 ~ 168 for a 13 x 13 grid
                for a in range(num_anchors):  # anchors
                    # The first two values are the offset of the box centre from the top-left corner
                    # of its cell. They went through sigmoid, so they lie in (0, 1) (each cell has
                    # size 1), which keeps the centre inside the current cell.
                    cx = (bbox_pred[b, ind, a, 0] + col) / W
                    cy = (bbox_pred[b, ind, a, 1] + row) / H
                    bw = bbox_pred[b, ind, a, 2] * anchors[a][0] / W * 0.5  # half width: predicted factor * anchor width
                    bh = bbox_pred[b, ind, a, 3] * anchors[a][1] / H * 0.5  # half height: predicted factor * anchor height

                    bbox_out[b, ind, a, 0] = cx - bw
                    bbox_out[b, ind, a, 1] = cy - bh
                    bbox_out[b, ind, a, 2] = cx + bw
                    bbox_out[b, ind, a, 3] = cy + bh

    return bbox_out
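
For readers without a Cython toolchain, the same arithmetic can be written with NumPy broadcasting. This is a sketch that follows the loop above line by line, not code from the repository; the name yolo_to_bbox_np and the 2 x 2 grid with a single 1 x 1 anchor are made up for the example.

```python
import numpy as np


def yolo_to_bbox_np(bbox_pred, anchors, H, W):
    """Vectorised sketch of the loop above (hypothetical helper).

    bbox_pred: (bsize, H*W, num_anchors, 4) with (sig(tx), sig(ty), exp(tw), exp(th))
    anchors:   (num_anchors, 2) with (pw, ph) in feature-map units
    Returns (bsize, H*W, num_anchors, 4) boxes (x1, y1, x2, y2) in the range 0 to 1.
    """
    ind = np.arange(H * W)
    col = (ind % W).reshape(1, -1, 1)   # column of each cell, broadcast over batch and anchors
    row = (ind // W).reshape(1, -1, 1)  # row of each cell
    cx = (bbox_pred[..., 0] + col) / W
    cy = (bbox_pred[..., 1] + row) / H
    bw = bbox_pred[..., 2] * anchors[:, 0] / W * 0.5  # half width
    bh = bbox_pred[..., 3] * anchors[:, 1] / H * 0.5  # half height
    return np.stack([cx - bw, cy - bh, cx + bw, cy + bh], axis=-1)


# 2 x 2 grid, one anchor of size 1 x 1; every prediction sits in the middle
# of its cell (offset 0.5) with width and height factors of 1.
bbox_pred = np.tile(np.array([0.5, 0.5, 1.0, 1.0]), (1, 4, 1, 1))
anchors = np.array([[1.0, 1.0]])
boxes = yolo_to_bbox_np(bbox_pred, anchors, H=2, W=2)
print(boxes[0, :, 0])
```

With these inputs each predicted box covers exactly its own cell, so the four boxes tile the unit square.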

3. Core code:

Code: https://github.com/longcw/yolo2-pytorch

The core code of the loss function is shown below. This part computes, for each ground-truth box and its best-matching anchor, the target IoU value, the target class, the target box, and the corresponding masks. _iou_mask is also filled in for the anchors of cells that contain no object; the other masks are filled in only where a cell contains an object. All of this should line up with the loss formula, but I have not mapped it onto the formula yet.

def _process_batch(data, size_index):
    W, H = cfg.multi_scale_out_size[size_index]  # feature-map size, selected by size_index
    inp_size = cfg.multi_scale_inp_size[size_index]  # input image size, e.g. 320 * 320
    out_size = cfg.multi_scale_out_size[size_index]  # output size, i.e. the feature-map size, e.g. 10 * 10
    # bbox_pred_np: predicted boxes (relative values)   gt_boxes: ground-truth boxes   gt_classes: ground-truth classes
    # dontcares: empty   iou_pred_np: predicted IoU
    bbox_pred_np, gt_boxes, gt_classes, dontcares, iou_pred_np = data  # predictions and labels for one image


    # net output
    # bbox_pred_np has shape [h*w, num_anchors, 4]
    hw, num_anchors, _ = bbox_pred_np.shape  # e.g. hw = 13*13, num_anchors = 5

    # gt
    _classes = np.zeros([hw, num_anchors, cfg.num_classes], dtype=np.float64)  # _classes: [13*13, 5, 20]
    _class_mask = np.zeros([hw, num_anchors, 1], dtype=np.float64)

    _ious = np.zeros([hw, num_anchors, 1], dtype=np.float64)  # _ious: [13*13, 5, 1], IoU targets
    _iou_mask = np.zeros([hw, num_anchors, 1], dtype=np.float64)

    _boxes = np.zeros([hw, num_anchors, 4], dtype=np.float64)  # _boxes: [13*13, 5, 4], box targets
    _boxes[:, :, 0:2] = 0.5  # default target: centre of the cell, with w and h set to 1
    _boxes[:, :, 2:4] = 1.0
    _box_mask = np.zeros([hw, num_anchors, 1], dtype=np.float64) + 0.01
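    # 1. Convert the relative predictions into predicted boxes, in preparation for the IoU
    #    computation that decides which predictions are background and which are objects
    # scale pred_bbox
    anchors = np.ascontiguousarray(cfg.anchors, dtype=np.float64)  # make the array contiguous in memory, which runs faster
    bbox_pred_np = np.expand_dims(bbox_pred_np, 0)  # add a batch axis: [h*w, num_anchors, 4] -> [1, h*w, num_anchors, 4]
    # The implementation is in yolo.pyx in the utils folder. The output has shape [1, h*w, num_anchors, 4]
    # and holds, for each anchor, bbox_np = (x1, y1, x2, y2): top-left and bottom-right corners.
    # For every cell it computes the predicted box under each of the 5 anchors. bbox_pred_np holds offsets in 0 ~ 1
    # relative to the top-left corner of the cell, so only after multiplying by the anchor do we get the predicted
    # box for the current anchor in the current cell. The corners are floating-point values normalised to 0 ~ 1.
    bbox_np = yolo_to_bbox(
        np.ascontiguousarray(bbox_pred_np, dtype=np.float64),
        anchors,
        H, W)
    # bbox_np[0] = (hw, num_anchors, (x1, y1, x2, y2))   range: 0 ~ 1
    # drop the batch axis: bbox_np goes from [1, hw, num_anchors, 4] to [hw, num_anchors, 4]
    bbox_np = bbox_np[0]
    # map the boxes back to the input image; they are now in pixel coordinates
    bbox_np[:, :, 0::2] *= float(inp_size[0])  # rescale x
    bbox_np[:, :, 1::2] *= float(inp_size[1])  # rescale y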

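    # ground-truth boxes, in pixel coordinates of the input image
    gt_boxes_b = np.asarray(gt_boxes, dtype=np.float64)
    # for each cell, compare predicted_bbox and gt_bbox
    bbox_np_b = np.reshape(bbox_np, [-1, 4])  # shape [h*w*num_anchors, 4]
    # 2. Use the predicted boxes and the ground-truth boxes to decide what is background and what is an object.
    #    This is the core of the loss function, and arguably of YOLO: it defines which predictions count as
    #    background (only the IoU confidence error is computed) and which count as objects (all errors are computed).
    # IoU of every predicted box against every ground-truth box, shape [h*w*num_anchors, num_gt]; computed on the input image
    ious = bbox_ious(
        np.ascontiguousarray(bbox_np_b, dtype=np.float64),
        np.ascontiguousarray(gt_boxes_b, dtype=np.float64)
    )
    # Think of a ring toss onto the ground-truth boxes: several predicted boxes may land on the same one (many-to-one).
    # best_ious is reshaped to [h*w, num_anchors, 1]; the last axis holds the largest IoU of each predicted box,
    # one value for each of the h*w*num_anchors boxes.
    best_ious = np.max(ious, axis=1).reshape(_iou_mask.shape)
    # This is the part I understand least. best_ious < cfg.iou_thresh gives a boolean array of shape [h*w, num_anchors, 1].
    # iou_pred_np[best_ious < cfg.iou_thresh] then gives a 1-D array of length N, where N is the number of entries of
    # best_ious below cfg.iou_thresh. The stored value is 0 - iou_pred; I would have expected 1, meaning "no object".
    iou_penalty = 0 - iou_pred_np[best_ious < cfg.iou_thresh]  # cfg.iou_thresh = 0.6
    # Negative samples: entries below iou_thresh get the iou_penalty value in _iou_mask, entries above it stay 0.
    # The positions that actually hold a ground-truth box are filled in later.
    # Note: `<` above and `<=` here select the same entries unless an IoU equals the threshold exactly.
    _iou_mask[best_ious <= cfg.iou_thresh] = cfg.noobject_scale * iou_penalty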

    # locate the cell of each gt_box
    cell_w = float(inp_size[0]) / W  # width of one cell in pixels
    cell_h = float(inp_size[1]) / H  # height of one cell in pixels
    cx = (gt_boxes_b[:, 0] + gt_boxes_b[:, 2]) * 0.5 / cell_w  # gt centre in cell units on the feature map, a float in 0 ~ W
    cy = (gt_boxes_b[:, 1] + gt_boxes_b[:, 3]) * 0.5 / cell_h
    cell_inds = np.floor(cy) * W + np.floor(cx)  # one entry per gt box: the index of the cell that contains its centre
    cell_inds = cell_inds.astype(int)
    # centre of each gt box relative to the top-left corner of its cell, plus its width and height on the feature map
    target_boxes = np.empty(gt_boxes_b.shape, dtype=np.float64)
    target_boxes[:, 0] = cx - np.floor(cx)  # cx inside the cell, relative to its top-left corner, in 0 ~ 1
    target_boxes[:, 1] = cy - np.floor(cy)  # cy
    target_boxes[:, 2] = \
        (gt_boxes_b[:, 2] - gt_boxes_b[:, 0]) / inp_size[0] * out_size[0]  # tw: gt width on the feature map
    target_boxes[:, 3] = \
        (gt_boxes_b[:, 3] - gt_boxes_b[:, 1]) / inp_size[1] * out_size[1]  # th: gt height on the feature map

    # for each gt box, match the best anchor
    gt_boxes_resize = np.copy(gt_boxes_b)
    gt_boxes_resize[:, 0::2] *= (out_size[0] / float(inp_size[0]))  # gt x1 and x2 on the feature map
    gt_boxes_resize[:, 1::2] *= (out_size[1] / float(inp_size[1]))  # gt y1 and y2 on the feature map
    # IoU of each anchor box against each gt box; the predicted boxes are not involved here.
    # Shape [num_anchors, num_gt]; this time the IoU is computed on the feature map.
    anchor_ious = anchor_intersections(
        anchors,
        np.ascontiguousarray(gt_boxes_resize, dtype=np.float64)
    )
    # index of the best anchor for each gt box, shape [num_gt], values in 0 ~ 4
    # e.g. with 10 gt boxes, anchor_inds could be [4 3 1 0 2 2 4 3 2 1]
    anchor_inds = np.argmax(anchor_ious, axis=0)

    ious_reshaped = np.reshape(ious, [hw, num_anchors, len(cell_inds)])
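    # cell_inds holds the cell index of each gt box on the feature map, from which its grid coordinates follow.
    # i is the gt box number: with 3 gt boxes, cell_inds might be 4 7 25 and i takes the values 0 1 2
    for i, cell_ind in enumerate(cell_inds):
        if cell_ind >= hw or cell_ind < 0:  # skip gt boxes whose centre falls outside the grid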
            print('cell inds size {}'.format(len(cell_inds)))
            print('cell over {} hw {}'.format(cell_ind, hw))
            continue
        a = anchor_inds[i]  # best anchor for this gt box

        # 0 ~ 1, should be close to 1         iou_pred_np has shape [w*h, num_anchors, 1]
        iou_pred_cell_anchor = iou_pred_np[cell_ind, a, :]  # predicted IoU at the gt box's cell and best anchor
        # for the anchor responsible for an object, _iou_mask is set to object_scale * (1 - predicted IoU)
        _iou_mask[cell_ind, a, :] = cfg.object_scale * (1 - iou_pred_cell_anchor)
        # _ious[cell_ind, a, :] = anchor_ious[a, i]
        # IoU between the predicted box at (cell, best anchor) and the gt box. Example: cell 10 contains a gt box
        # whose best anchor is number 3, so _ious[10, 3] takes the value at the matching position of ious.
        _ious[cell_ind, a, :] = ious_reshaped[cell_ind, a, i]  # IoU of the gt box with the prediction at its cell and best anchor

        _box_mask[cell_ind, a, :] = cfg.coord_scale  # box-loss weight of the responsible anchor, set to 1
        target_boxes[i, 2:4] /= anchors[a]
        # _boxes[..., 0:2] holds the centre relative to the top-left corner of the cell, in 0 ~ 1
        # _boxes[..., 2:4] holds (gt width on the feature map / best anchor width, gt height on the feature map / best anchor height)
        _boxes[cell_ind, a, :] = target_boxes[i]  # the ground-truth side of the box loss

        _class_mask[cell_ind, a, :] = cfg.class_scale  # set to 1
        _classes[cell_ind, a, gt_classes[i]] = 1.  # class target of the responsible anchor, set to 1; the ground-truth side of the class loss
    return _boxes, _ious, _classes, _box_mask, _iou_mask, _class_mask
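
The matching logic in the second half of the function is easier to follow on a single ground-truth box. The sketch below walks one made-up box through the cell lookup, the best-anchor search, and the target encoding. It assumes that anchor_intersections compares shapes only (width and height, with both boxes centred at the same point); the helper shape_iou, the three anchors, and the 320-pixel input with a 10 x 10 grid are hypothetical values chosen for the example.

```python
import numpy as np


def shape_iou(anchors, gt_wh):
    """IoU of anchor shapes (A, 2) against gt shapes (K, 2), ignoring position.

    Hypothetical stand-in for anchor_intersections. Returns an (A, K) array.
    """
    inter = (np.minimum(anchors[:, None, 0], gt_wh[None, :, 0]) *
             np.minimum(anchors[:, None, 1], gt_wh[None, :, 1]))
    area_a = (anchors[:, 0] * anchors[:, 1])[:, None]
    area_g = (gt_wh[:, 0] * gt_wh[:, 1])[None, :]
    return inter / (area_a + area_g - inter)


inp_size = (320, 320)  # input image size in pixels (hypothetical)
W = H = 10             # feature-map size (hypothetical)
anchors = np.array([[1.0, 1.0], [3.0, 3.0], [6.0, 2.0]])  # made-up anchors, in cell units
gt = np.array([[100., 120., 196., 216.]])                 # one gt box (x1, y1, x2, y2) in pixels

cell_w, cell_h = inp_size[0] / W, inp_size[1] / H
cx = (gt[:, 0] + gt[:, 2]) * 0.5 / cell_w  # gt centre in cell units
cy = (gt[:, 1] + gt[:, 3]) * 0.5 / cell_h
cell_inds = (np.floor(cy) * W + np.floor(cx)).astype(int)

# Width and height of the gt box on the feature map.
gt_wh = np.stack([(gt[:, 2] - gt[:, 0]) / cell_w,
                  (gt[:, 3] - gt[:, 1]) / cell_h], axis=1)
anchor_inds = np.argmax(shape_iou(anchors, gt_wh), axis=0)

i = 0
a = anchor_inds[i]
# Target for the responsible anchor: offset inside the cell, then size relative to the anchor.
target = np.array([cx[i] - np.floor(cx[i]),
                   cy[i] - np.floor(cy[i]),
                   gt_wh[i, 0] / anchors[a, 0],
                   gt_wh[i, 1] / anchors[a, 1]])
print(cell_inds[i], a, target)
```

The box is 96 pixels wide and high, which is 3 x 3 cells, so the 3 x 3 anchor wins and the width and height targets are both 1. These are the values the network's exp(tw) and exp(th) outputs are trained towards.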

The next step takes the targets and the predictions obtained above and computes the squared errors between them. Added together, these errors form the loss function.

# box (coordinate) loss: bbox_pred is the prediction, _boxes the target
self.bbox_loss = nn.MSELoss(reduction='sum')(bbox_pred * box_mask, _boxes * box_mask) / num_boxes  # noqa
# confidence loss, squared difference of the two sides:
#   cell/anchor with a ground truth:    iou_pred * (1 - iou_pred)  vs  IoU(predicted box, gt box) * (1 - iou_pred)
#   cell/anchor without a ground truth: iou_pred * (0 - iou_pred)  vs  0 * (0 - iou_pred)
# What I don't understand: multiplying by iou_mask seems to make no difference to the squared error,
# so it could just as well be set to 1... or am I misunderstanding something?
self.iou_loss = nn.MSELoss(reduction='sum')(iou_pred * iou_mask, _ious * iou_mask) / num_boxes  # noqa
class_mask = class_mask.expand_as(prob_pred)
# class loss: prob_pred is the prediction, _classes the target
self.cls_loss = nn.MSELoss(reduction='sum')(prob_pred * class_mask, _classes * class_mask) / num_boxes  # noqa
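
On the question raised in the comments above, a quick numerical check suggests the mask does matter. Multiplying both the prediction and the target by the same mask before a summed squared error equals weighting each squared difference by the mask squared, so a mask of 0 removes an entry and a large mask emphasises it. The sketch uses NumPy in place of nn.MSELoss so that it runs on its own; the helper masked_sse and all the numbers are made up.

```python
import numpy as np


def masked_sse(pred, target, mask):
    """Sum of squared errors after multiplying both sides by the mask.

    Hypothetical helper mirroring a summed MSELoss applied to (pred * mask, target * mask).
    """
    return np.sum((pred * mask - target * mask) ** 2)


pred = np.array([0.9, 0.2, 0.7])    # made-up predictions
target = np.array([0.5, 0.0, 0.0])  # made-up targets
mask = np.array([5.0, 1.0, 0.0])    # made-up weights: emphasise, keep, ignore

with_mask = masked_sse(pred, target, mask)
all_ones = masked_sse(pred, target, np.ones_like(mask))
weighted = np.sum(mask ** 2 * (pred - target) ** 2)

print(with_mask, all_ones, weighted)
```

The first and third numbers agree and differ from the second, so replacing the mask with 1 would change the loss: the third entry would start to count, and the first would lose its extra weight.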

       Take the IoU loss as an example. It has two parts. Where there is no object, the target is 0, so the error depends only on the predicted value iou_pred (squared, and weighted by the mask), and it should obviously be as small as possible. Where there is an object, the loss is the square of (iou_pred minus the IoU between the predicted box and the ground-truth box) multiplied by the square of (1 - iou_pred). This value should also be as small as possible: iou_pred should move towards 1, and the difference between iou_pred and the IoU of the predicted box with the ground-truth box should move towards 0. The other losses follow the same pattern and are not explained here. The construction of the loss function is one of the core ideas of YOLOv2. Backpropagating the loss repeatedly updates the network parameters, the updated parameters in turn reduce the loss, and training eventually settles on a good model.
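
Written out, the IoU loss from the code reads as follows. This is a reconstruction from the code shown in this article rather than a formula taken from the paper. The symbols are shorthand introduced here: p for iou_pred, t for the value in _ious, m for the value in _iou_mask, and the lambdas for cfg.noobject_scale and cfg.object_scale. It assumes the masks are built from NumPy copies of the predictions (as the _np suffix suggests), so they act as constants during backpropagation.

```latex
% General form: the mask multiplies both sides, so it becomes a squared weight
L_{\text{iou}}
  = \frac{1}{N_{\text{boxes}}} \sum_{c,a} \left( p_{c,a}\, m_{c,a} - t_{c,a}\, m_{c,a} \right)^2
  = \frac{1}{N_{\text{boxes}}} \sum_{c,a} m_{c,a}^{2} \left( p_{c,a} - t_{c,a} \right)^2

% No object (best IoU below cfg.iou_thresh): m = \lambda_{\text{noobj}} (0 - p), \quad t = 0
m^{2} (p - t)^2 = \lambda_{\text{noobj}}^{2}\, p^{2} \cdot p^{2}

% Anchor responsible for a ground-truth box: m = \lambda_{\text{obj}} (1 - p), \quad t = \mathrm{IoU}(\text{pred}, \text{gt})
m^{2} (p - t)^2 = \lambda_{\text{obj}}^{2}\, (1 - p)^{2} \left( p - \mathrm{IoU}(\text{pred}, \text{gt}) \right)^2

% All other entries keep m = 0 and drop out of the sum
```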


Origin blog.csdn.net/gbz3300255/article/details/109224190