YOLOv1 source code explanation: loss.py

Structure

1. I don't think the names lt and rb are very fitting. Strictly speaking they should be lb and rt, because what is being compared are the lower-left and upper-right coordinates.

For example, the first call uses max: it keeps the larger of the two boxes' lower-left coordinates. The second uses min and keeps the smaller of the two upper-right coordinates. Together these two corners define the intersection area.

The code keeps the names lt and rb, though, so I will use them as they are.

Once lt and rb are known, their difference gives the width and height of the intersection. Whenever the two boxes do not intersect, w or h must be negative; you can sketch the boxes (or run the snippet below) to verify this.
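
A tiny sketch of my own (made-up boxes, not from the post) to verify the negative width/height claim:

    import torch

    box_a = torch.tensor([[0.0, 0.0, 1.0, 1.0]])  # [x1, y1, x2, y2]
    box_b = torch.tensor([[2.0, 2.0, 3.0, 3.0]])  # lies entirely outside box_a

    lt = torch.max(box_a[:, :2], box_b[:, :2])  # larger of the two "first" corners -> (2, 2)
    rb = torch.min(box_a[:, 2:], box_b[:, 2:])  # smaller of the two "second" corners -> (1, 1)
    wh = rb - lt                                # (-1, -1): negative, so the intersection is empty
    print(wh)  # tensor([[-1., -1.]])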

What follows is the standard IoU computation.

    def compute_iou(self, bbox1, bbox2):
        """ Compute the IoU (Intersection over Union) of two set of bboxes, each bbox format: [x1, y1, x2, y2].
        Args:
            bbox1: (Tensor) bounding bboxes, sized [N, 4].
            bbox2: (Tensor) bounding bboxes, sized [M, 4].
        Returns:
            (Tensor) IoU, sized [N, M].
        """
        N = bbox1.size(0)
        M = bbox2.size(0)

        # Compute left-top coordinate of the intersections
        lt = torch.max(
            bbox1[:, :2].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, :2].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute right-bottom coordinate of the intersections
        rb = torch.min(
            bbox1[:, 2:].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, 2:].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute area of the intersections from the coordinates
        wh = rb - lt   # width and height of the intersection, [N, M, 2]
        wh[wh < 0] = 0 # clip at 0
        inter = wh[:, :, 0] * wh[:, :, 1] # [N, M]

        # Compute area of the bboxes
        area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N, ]
        area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M, ]
        area1 = area1.unsqueeze(1).expand_as(inter) # [N, ] -> [N, 1] -> [N, M]
        area2 = area2.unsqueeze(0).expand_as(inter) # [M, ] -> [1, M] -> [N, M]

        # Compute IoU from the areas
        union = area1 + area2 - inter # [N, M]
        iou = inter / union           # [N, M]

        return iou
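
A small usage sketch of my own (the boxes are made up; Loss is the class given in full at the end of the post):

    import torch

    boxes1 = torch.tensor([[0.0, 0.0, 2.0, 2.0],
                           [1.0, 1.0, 3.0, 3.0]])   # N = 2
    boxes2 = torch.tensor([[1.0, 1.0, 2.0, 2.0]])   # M = 1

    loss_fn = Loss()                                # defaults: S=7, B=2, C=20
    iou = loss_fn.compute_iou(boxes1, boxes2)
    print(iou.shape)  # torch.Size([2, 1])
    print(iou)        # both rows are 0.25: intersection area 1, union area 4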

2. The harder part, and the real highlight

        coord_mask = target_tensor[..., 4] > 0 # the ellipsis covers the leading dims; index 4 picks the fifth element of the last dim, i.e. the confidence (why N=30 and what the second box looks like is covered later)
        # cells without an object, [n_batch, S, S]
        noobj_mask = target_tensor[..., 4] == 0 
        # expand the boolean mask along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)  
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)  

The target tensor stores the ground-truth confidence: 1 where an object exists and 0 where it does not. This is not the value used as the prediction during training; the network output is a fuzzy value between 0 and 1.

This confidence is assigned in the encode method of the VOC dataset code.

Here the cells with an object and the cells without one are selected as coord_mask and noobj_mask respectively, each of size (batch_size, S, S). Each entry says whether the corresponding grid cell in that batch element contains an object, with values True or False.

The masks are then expanded to (batch_size, S, S, 30) to match the 30-dimensional last axis.

For a cell that is responsible for an object, all 30 entries of its mask are True; for a cell that is not, all 30 are False.
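
To see the shapes concretely, a toy check of my own (S=2 and a one-image batch are made-up values, N=30 as in the repo):

    import torch

    target_tensor = torch.zeros(1, 2, 2, 30)   # [n_batch, S, S, N]
    target_tensor[0, 0, 1, 4] = 1.0            # pretend cell (0, 1) contains an object

    coord_mask = target_tensor[..., 4] > 0     # [1, 2, 2], True only at (0, 0, 1)
    noobj_mask = target_tensor[..., 4] == 0    # [1, 2, 2], True everywhere else
    coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)  # [1, 2, 2, 30]
    print(coord_mask.shape, coord_mask[0, 0, 1].all().item())       # torch.Size([1, 2, 2, 30]) True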

Next, the part that extracts the cells containing objects:

        # take out the predictions for cells that contain an object, [n_coord, N]; view behaves like reshape here
        coord_pred = pred_tensor[coord_mask].view(-1, N)        
        
        # extract the bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)   # contiguous() avoids the error view raises on non-contiguous memory
        # predicted class scores, [n_coord, C]
        class_pred = coord_pred[:, 5*B:]                            

        # targets for cells that contain an object, [n_coord, N]
        coord_target = target_tensor[coord_mask].view(-1, N)        
        
        # extract the target bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5) 
        # target class probabilities
        class_target = coord_target[:, 5*B:]                        

From the prediction tensor output by the network (I call it the predicted value here, but it is only the training-time output, not the prediction produced by detect), we take out every cell that contains an object according to the target; this is coord_pred.

coord_pred is the tensor for those ground-truth cells, and below it is split into a bbox part and a class part.

The target is split into bbox and class in the same way, cut into lengths of 10 and 20, for the later comparison.

PS: coord_pred flattens the first three dimensions after the extraction, leaving only the last dimension N, which is 30.

Overall, pred_tensor.view(-1, N) would have shape (batch_size*S*S, N); coord_pred has the analogous shape after the extraction, namely (number of object-containing cells across all batches, N). In plain terms, its first dimension is the total count of ground-truth cells over the whole batch.
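
A follow-on shape check of my own (random tensors; S=7, B=2, C=20 assumed as in the repo):

    import torch

    B, C = 2, 20
    N = 5 * B + C
    target_tensor = torch.zeros(2, 7, 7, N)
    target_tensor[0, 3, 3, 4] = 1.0            # one object cell in image 0
    target_tensor[1, 0, 5, 4] = 1.0            # one object cell in image 1
    pred_tensor = torch.rand(2, 7, 7, N)

    coord_mask = (target_tensor[..., 4] > 0).unsqueeze(-1).expand_as(target_tensor)
    coord_pred = pred_tensor[coord_mask].view(-1, N)          # [n_coord, 30], here n_coord = 2
    bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)  # [n_coord x B, 5] = [4, 5]
    class_pred = coord_pred[:, 5*B:]                          # [n_coord, C]  = [2, 20]
    print(coord_pred.shape, bbox_pred.shape, class_pred.shape)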

        # handle the cells without objects
        # predictions for cells without objects, [n_noobj, N], n_noobj = SxS - n_coord
        noobj_pred = pred_tensor[noobj_mask].view(-1, N)         
        # targets for cells without objects, [n_noobj, N]
        noobj_target = target_tensor[noobj_mask].view(-1, N)            
        
        noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
        for b in range(B):
            noobj_conf_mask[:, 4 + b*5] = 1 # mark the confidence positions: noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1, so the confidences can be pulled out side by side below
        

        noobj_pred_conf = noobj_pred[noobj_conf_mask]       # [n_noobj x 2=len([conf1, conf2])]
        noobj_target_conf = noobj_target[noobj_conf_mask]   # [n_noobj x 2=len([conf1, conf2])]
        # confidence loss for the no-object cells; if reduction is not given, the default 'mean' averages over all elements, so 'sum' is used here
        #loss_noobj=F.mse_loss(noobj_pred_conf, noobj_target_conf,)*len(noobj_pred_conf)
        loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')

This part takes out all the predicted cells that do not contain objects. The for loop sets the confidence positions to 1 in advance, marking them so they can be extracted.

After extracting the two sets of confidences, MSE is applied; this squared-error loss is the confidence loss of the cells not responsible for any object, which in effect is (0 - predicted confidence)^2. The paper also applies a weight here: there are far too many such cells, so for balance they get a relatively low weight.
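
A small sanity check of my own: since the target confidences of these cells are all 0, the summed MSE is just the sum of the squared predicted confidences at columns 4 and 9 (toy sizes below).

    import torch
    import torch.nn.functional as F

    N = 30
    noobj_pred = torch.rand(3, N)        # pretend 3 cells have no object
    noobj_target = torch.zeros(3, N)

    noobj_conf_mask = torch.zeros(3, N, dtype=torch.bool)
    noobj_conf_mask[:, 4] = True
    noobj_conf_mask[:, 9] = True

    pred_conf = noobj_pred[noobj_conf_mask]       # 6 confidences
    target_conf = noobj_target[noobj_conf_mask]   # 6 zeros
    loss_noobj = F.mse_loss(pred_conf, target_conf, reduction='sum')
    print(torch.allclose(loss_noobj, (pred_conf ** 2).sum()))  # True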

        coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0)    # [n_coord x B, 5]
        coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1)# [n_coord x B, 5]
        bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                    # [n_coord x B, 5], only the last 1=(conf,) is used

Initialize the variables needed for the following loop

        for i in range(0, bbox_target.size(0), B):
            pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
            pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            pred_xyxy[:,  :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
            pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]

            target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
            target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
            target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            target_xyxy[:,  :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
            target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]

            iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
            max_iou, max_index = iou.max(0)
            max_index = max_index.data.cuda()

            coord_response_mask[i+max_index] = 1
            coord_not_response_mask[i+max_index] = 0

            # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
            # from the original paper of YOLO.
            bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()

Training loops over the predicted boxes two at a time, i.e. one cell per iteration.

The two predicted boxes of a cell are independent and generally different, while in the target the VOC encoding gives both box slots of a cell the same values.

pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]

I did not understand the division by S at first and skipped it; most likely it is because x and y are stored normalized to the cell size, so dividing by S rescales them to the same image scale as w and h. The cell's own offset is never added back, but since it is missing from both pred and target of the same cell, the IoU is unaffected.

Within one iteration, the predicted boxes and the target box belong to the same cell, so they correspond one to one.

The coordinates in pred and target are the center x, y plus w and h. For the target these four values are normalized (as can be seen from the VOC encoding), while pred is whatever the network happens to output.

compute_iou expects corner coordinates (the two opposite corners), which is why the conversion above is needed; a small numeric sketch of it follows.
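
A concrete conversion of my own with made-up numbers (S = 7). If the VOC encode stores x, y as within-cell offsets, as the code comments suggest, the resulting corners can even be negative because the cell offset is never added back; the target is converted the same way, and IoU is translation-invariant, so the result is unaffected.

    import torch

    S = 7
    box = torch.tensor([[0.5, 0.5, 0.2, 0.4, 1.0]])   # [cx, cy, w, h, conf]
    xyxy = torch.zeros(1, 5)
    xyxy[:, :2]  = box[:, :2] / float(S) - 0.5 * box[:, 2:4]   # first corner
    xyxy[:, 2:4] = box[:, :2] / float(S) + 0.5 * box[:, 2:4]   # opposite corner
    print(xyxy[:, :4])  # tensor([[-0.0286, -0.1286,  0.1714,  0.2714]])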

So in each iteration we take the maximum IoU and its index, i.e. which of the two boxes has the larger IoU, and record it in coord_response_mask and coord_not_response_mask. These two masks have exactly the same shape as bbox_target.

bbox_target_iou stores the result on the GPU; it is used for the confidence loss below. Then the loop ends.
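
To make the selection step concrete, a tiny sketch of my own with made-up IoU values:

    import torch

    iou = torch.tensor([[0.3], [0.7]])     # IoU of the B=2 predicted boxes vs the target, [B, 1]
    max_iou, max_index = iou.max(0)        # max_iou = 0.7, max_index = 1 (the second box)

    coord_response_mask = torch.zeros(2, 5, dtype=torch.bool)
    coord_response_mask[max_index] = True  # only the responsible box contributes to loss_xy/wh/obj
    print(max_iou.item(), max_index.item())
    print(coord_response_mask)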

The following is the code after the loop:

        bbox_target_iou = Variable(bbox_target_iou).cuda()

        # BBox location/size and objectness loss for the response bboxes.
        bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)      # [n_response, 5]
        bbox_target_response = bbox_target[coord_response_mask].view(-1, 5)  # [n_response, 5], only the first 4=(x, y, w, h) are used
        target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)        # [n_response, 5], only the last 1=(conf,) is used
        loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
        loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
        loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')

        ################################################################################


        # Class probability loss for the cells which contain objects.
        loss_class = F.mse_loss(class_pred, class_target, reduction='sum')

        # Total loss
        loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
        loss = loss / float(batch_size)

Loss for the cells responsible for objects:

The x, y coordinates contribute a loss term; because cells containing objects are few, the paper gives this term a weight of 5 to emphasize it.

The w, h loss: the raw difference for a large box can be several or even dozens of times that of a small one, which would dominate the loss, so the paper takes the square root; the same absolute error then costs more for a small box than for a large one (a quick numeric check follows). As with x, y, this term also gets the weight of 5 because object cells are few.
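
A quick numeric check of my own of the square-root effect (made-up widths):

    import torch

    # an absolute width error of 0.05, once for a large box and once for a small one
    large = (torch.sqrt(torch.tensor(0.80)) - torch.sqrt(torch.tensor(0.85))) ** 2
    small = (torch.sqrt(torch.tensor(0.05)) - torch.sqrt(torch.tensor(0.10))) ** 2
    print(large.item(), small.item())   # ~0.00076 vs ~0.0086: the small box is penalized ~11x more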

Confidence loss. The confidence the network outputs starts out arbitrary; the idea in the paper is to push it toward the value it should have, the "real" confidence, which for the responsible box is the IoU between the predicted box and the ground-truth box. (In YOLOv3 this target is replaced with 1 instead of the IoU; if that is confusing, do not worry about it here.)

The loss itself is then a simple calculation that follows the requirements of the paper. Note that mse_loss with default arguments averages over all elements; passing reduction='sum' makes it simply sum the squared differences.

Finally the returned loss is divided by batch_size, i.e. averaged over the batch.
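
For reference, the final combination the code computes can be written as below; it matches the multi-part objective of the YOLOv1 paper with the default weights lambda_coord = 5 and lambda_noobj = 0.5, plus the extra division by the batch size that this implementation adds.

$$
\text{loss} = \frac{1}{n_\text{batch}}\Big(\lambda_\text{coord}\,(\text{loss}_{xy} + \text{loss}_{wh}) + \text{loss}_\text{obj} + \lambda_\text{noobj}\,\text{loss}_\text{noobj} + \text{loss}_\text{class}\Big)
$$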

    def forward(self, pred_tensor, target_tensor):#target_tensor[2,0,0,:]
        """ Compute loss for YOLO training. #
        Args:
            pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
            target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
        Returns:
            (Tensor): loss, sized [1, ].
        """
        # TODO: Remove redundant dimensions for some Tensors.
        # grid size S=7, number of boxes predicted per cell B=2, number of classes C=20
        S, B, C = self.S, self.B, self.C
        N = 5 * B + C    # 5=len([x, y, w, h, conf]), N=30

        # batch size
        batch_size = pred_tensor.size(0)
        # cells that contain an object, [n_batch, S, S]
        coord_mask = target_tensor[..., 4] > 0 # the ellipsis covers the leading dims; index 4 picks the fifth element of the last dim, i.e. the confidence (why N=30 and what the second box looks like is covered later)
        # cells without an object, [n_batch, S, S]
        noobj_mask = target_tensor[..., 4] == 0 
        # expand the boolean mask along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)  
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)  

        # uint8 --> bool
        noobj_mask = noobj_mask.bool()  # isn't this already bool?
        coord_mask = coord_mask.bool()  

        ##################################################
        # take out the predictions for cells that contain an object, [n_coord, N]
        coord_pred = pred_tensor[coord_mask].view(-1, N)        
        
        # extract the bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)   # contiguous() avoids the error view raises on non-contiguous memory
        # predicted class scores, [n_coord, C]
        class_pred = coord_pred[:, 5*B:]                            

        # targets for cells that contain an object, [n_coord, N]
        coord_target = target_tensor[coord_mask].view(-1, N)        
        
        # extract the target bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5) 
        # target class probabilities
        class_target = coord_target[:, 5*B:]                         
        ######################################################

        # ##################################################
        # handle the cells without objects
        # predictions for cells without objects, [n_noobj, N], n_noobj = SxS - n_coord
        noobj_pred = pred_tensor[noobj_mask].view(-1, N)         
        # targets for cells without objects, [n_noobj, N]
        noobj_target = target_tensor[noobj_mask].view(-1, N)            
        
        noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
        for b in range(B):
            noobj_conf_mask[:, 4 + b*5] = 1 # mark the confidence positions: noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1, so the confidences can be pulled out side by side below
        

        noobj_pred_conf = noobj_pred[noobj_conf_mask]       # [n_noobj x 2=len([conf1, conf2])]
        noobj_target_conf = noobj_target[noobj_conf_mask]   # [n_noobj x 2=len([conf1, conf2])]
        # confidence loss for the no-object cells; if reduction is not given, the default 'mean' averages over all elements, so 'sum' is used here
        #loss_noobj=F.mse_loss(noobj_pred_conf, noobj_target_conf,)*len(noobj_pred_conf)
        loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')
        #################################################################################

        #################################################################################
        # Compute loss for the cells with objects.
        coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0)    # [n_coord x B, 5]
        coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1)# [n_coord x B, 5]
        bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                    # [n_coord x B, 5], only the last 1=(conf,) is used

        # Choose the predicted bbox having the highest IoU for each target bbox.
        for i in range(0, bbox_target.size(0), B):
            pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
            pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            pred_xyxy[:,  :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
            pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]

            target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
            target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
            target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            target_xyxy[:,  :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
            target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]

            iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
            max_iou, max_index = iou.max(0)
            max_index = max_index.data.cuda()

            coord_response_mask[i+max_index] = 1
            coord_not_response_mask[i+max_index] = 0

            # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
            # from the original paper of YOLO.
            bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
        bbox_target_iou = Variable(bbox_target_iou).cuda()

        # BBox location/size and objectness loss for the response bboxes.
        bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)      # [n_response, 5]
        bbox_target_response = bbox_target[coord_response_mask].view(-1, 5)  # [n_response, 5], only the first 4=(x, y, w, h) are used
        target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)        # [n_response, 5], only the last 1=(conf,) is used
        loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
        loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
        loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')

        ################################################################################


        # Class probability loss for the cells which contain objects.
        loss_class = F.mse_loss(class_pred, class_target, reduction='sum')

        # Total loss
        loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
        loss = loss / float(batch_size)

        return loss

 

The complete loss.py

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class Loss(nn.Module):

    def __init__(self, feature_size=7, num_bboxes=2, num_classes=20, lambda_coord=5.0, lambda_noobj=0.5):
        """ Constructor.
        Args:
            feature_size: (int) size of input feature map.
            num_bboxes: (int) number of bboxes per each cell.
            num_classes: (int) number of the object classes.
            lambda_coord: (float) weight for bbox location/size losses.
            lambda_noobj: (float) weight for no-objectness loss.
        """
        super(Loss, self).__init__()

        self.S = feature_size
        self.B = num_bboxes
        self.C = num_classes
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj


    def compute_iou(self, bbox1, bbox2):
        """ Compute the IoU (Intersection over Union) of two set of bboxes, each bbox format: [x1, y1, x2, y2].
        Args:
            bbox1: (Tensor) bounding bboxes, sized [N, 4].
            bbox2: (Tensor) bounding bboxes, sized [M, 4].
        Returns:
            (Tensor) IoU, sized [N, M].
        """
        N = bbox1.size(0)
        M = bbox2.size(0)

        # Compute left-top coordinate of the intersections
        lt = torch.max(
            bbox1[:, :2].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, :2].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute right-bottom coordinate of the intersections
        rb = torch.min(
            bbox1[:, 2:].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, 2:].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute area of the intersections from the coordinates
        wh = rb - lt   # width and height of the intersection, [N, M, 2]
        wh[wh < 0] = 0 # clip at 0
        inter = wh[:, :, 0] * wh[:, :, 1] # [N, M]

        # Compute area of the bboxes
        area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N, ]
        area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M, ]
        area1 = area1.unsqueeze(1).expand_as(inter) # [N, ] -> [N, 1] -> [N, M]
        area2 = area2.unsqueeze(0).expand_as(inter) # [M, ] -> [1, M] -> [N, M]

        # Compute IoU from the areas
        union = area1 + area2 - inter # [N, M]
        iou = inter / union           # [N, M]

        return iou

    def forward(self, pred_tensor, target_tensor):#target_tensor[2,0,0,:]
        """ Compute loss for YOLO training. #
        Args:
            pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
            target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
        Returns:
            (Tensor): loss, sized [1, ].
        """
        # TODO: Remove redundant dimensions for some Tensors.
        # grid size S=7, number of boxes predicted per cell B=2, number of classes C=20
        S, B, C = self.S, self.B, self.C
        N = 5 * B + C    # 5=len([x, y, w, h, conf]), N=30

        # batch size
        batch_size = pred_tensor.size(0)
        # cells that contain an object, [n_batch, S, S]
        coord_mask = target_tensor[..., 4] > 0 # the ellipsis covers the leading dims; index 4 picks the fifth element of the last dim, i.e. the confidence (why N=30 and what the second box looks like is covered later)
        # cells without an object, [n_batch, S, S]
        noobj_mask = target_tensor[..., 4] == 0 
        # expand the boolean mask along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)  
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)  

        # uint8 --> bool
        noobj_mask = noobj_mask.bool()  # isn't this already bool?
        coord_mask = coord_mask.bool()  

        ##################################################
        # take out the predictions for cells that contain an object, [n_coord, N]
        coord_pred = pred_tensor[coord_mask].view(-1, N)        
        
        # extract the bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)   # contiguous() avoids the error view raises on non-contiguous memory
        # predicted class scores, [n_coord, C]
        class_pred = coord_pred[:, 5*B:]                            

        # targets for cells that contain an object, [n_coord, N]
        coord_target = target_tensor[coord_mask].view(-1, N)        
        
        # extract the target bbox part, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5) 
        # target class probabilities
        class_target = coord_target[:, 5*B:]                         
        ######################################################

        # ##################################################
        # handle the cells without objects
        # predictions for cells without objects, [n_noobj, N], n_noobj = SxS - n_coord
        noobj_pred = pred_tensor[noobj_mask].view(-1, N)         
        # targets for cells without objects, [n_noobj, N]
        noobj_target = target_tensor[noobj_mask].view(-1, N)            
        
        noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
        for b in range(B):
            noobj_conf_mask[:, 4 + b*5] = 1 # mark the confidence positions: noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1, so the confidences can be pulled out side by side below
        

        noobj_pred_conf = noobj_pred[noobj_conf_mask]       # [n_noobj x 2=len([conf1, conf2])]
        noobj_target_conf = noobj_target[noobj_conf_mask]   # [n_noobj x 2=len([conf1, conf2])]
        # confidence loss for the no-object cells; if reduction is not given, the default 'mean' averages over all elements, so 'sum' is used here
        #loss_noobj=F.mse_loss(noobj_pred_conf, noobj_target_conf,)*len(noobj_pred_conf)
        loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')
        #################################################################################

        #################################################################################
        # Compute loss for the cells with objects.
        coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0)    # [n_coord x B, 5]
        coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1)# [n_coord x B, 5]
        bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                    # [n_coord x B, 5], only the last 1=(conf,) is used

        # Choose the predicted bbox having the highest IoU for each target bbox.
        for i in range(0, bbox_target.size(0), B):
            pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
            pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            pred_xyxy[:,  :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
            pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]

            target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
            target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
            target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:, 2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            target_xyxy[:,  :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
            target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]

            iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
            max_iou, max_index = iou.max(0)
            max_index = max_index.data.cuda()

            coord_response_mask[i+max_index] = 1
            coord_not_response_mask[i+max_index] = 0

            # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
            # from the original paper of YOLO.
            bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
        bbox_target_iou = Variable(bbox_target_iou).cuda()

        # BBox location/size and objectness loss for the response bboxes.
        bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)      # [n_response, 5]
        bbox_target_response = bbox_target[coord_response_mask].view(-1, 5)  # [n_response, 5], only the first 4=(x, y, w, h) are used
        target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)        # [n_response, 5], only the last 1=(conf,) is used
        loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
        loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
        loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')

        ################################################################################


        # Class probability loss for the cells which contain objects.
        loss_class = F.mse_loss(class_pred, class_target, reduction='sum')

        # Total loss
        loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
        loss = loss / float(batch_size)

        return loss
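
To close, a minimal usage sketch of my own (not from the repo). It assumes a CUDA device, because the loss uses torch.cuda.BoolTensor and .cuda() internally, and a PyTorch version that still tolerates the legacy Variable API; the tensors and the object placement are made up.

    import torch

    criterion = Loss(feature_size=7, num_bboxes=2, num_classes=20)

    pred = torch.rand(4, 7, 7, 30).cuda()        # fake network output, [n_batch, S, S, B*5+C]
    target = torch.zeros(4, 7, 7, 30).cuda()     # fake encoded labels
    target[:, 3, 3, :5] = torch.tensor([0.5, 0.5, 0.2, 0.3, 1.0]).cuda()  # one object per image
    target[:, 3, 3, 5:10] = target[:, 3, 3, :5]  # the second box slot gets the same values, like the VOC encode
    target[:, 3, 3, 10] = 1.0                    # class 0

    loss = criterion(pred, target)
    print(loss.item())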


Source: blog.csdn.net/qq_36632604/article/details/130492629