structure
1. I don't find the names lt and rb very intuitive. If you draw the boxes on an ordinary axis with y pointing up, (x1, y1) is the lower-left corner and (x2, y2) the upper-right one, so lb and rt would fit better. In image coordinates, where y grows downward, the same two points are the left-top and right-bottom corners, which is where the code's names come from.
For example, the first call takes the element-wise max of the two boxes' (x1, y1) corners, and the second takes the element-wise min of their (x2, y2) corners. Together these two points bound the intersection area.
The code says lt and rb, so I will keep using those names.
Subtracting lt from rb gives the width and height of the intersection. When the boxes do not intersect, w or h (or both) comes out negative, which is why the code clips at 0. You can draw a picture to verify it.
What follows is the ordinary IoU computation.
def compute_iou(self, bbox1, bbox2):
    """ Compute the IoU (Intersection over Union) of two sets of bboxes, each bbox format: [x1, y1, x2, y2].
    Args:
        bbox1: (Tensor) bounding bboxes, sized [N, 4].
        bbox2: (Tensor) bounding bboxes, sized [M, 4].
    Returns:
        (Tensor) IoU, sized [N, M].
    """
    N = bbox1.size(0)
    M = bbox2.size(0)
    # Compute left-top coordinate of the intersections
    lt = torch.max(
        bbox1[:, :2].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
        bbox2[:, :2].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
    )
    # Compute right-bottom coordinate of the intersections
    rb = torch.min(
        bbox1[:, 2:].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
        bbox2[:, 2:].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
    )
    # Compute area of the intersections from the coordinates
    wh = rb - lt   # width and height of the intersection, [N, M, 2]
    wh[wh < 0] = 0 # clip at 0
    inter = wh[:, :, 0] * wh[:, :, 1] # [N, M]
    # Compute area of the bboxes
    area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N, ]
    area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M, ]
    area1 = area1.unsqueeze(1).expand_as(inter) # [N, ] -> [N, 1] -> [N, M]
    area2 = area2.unsqueeze(0).expand_as(inter) # [M, ] -> [1, M] -> [N, M]
    # Compute IoU from the areas
    union = area1 + area2 - inter # [N, M]
    iou = inter / union           # [N, M]
    return iou
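To check the claim about negative width and height by hand, here is a minimal pure-Python sketch for a single pair of boxes. It is not the tensor code above, only the same arithmetic on plain lists; the helper names `intersection_wh` and `iou` and the box values are made up for illustration.

```python
def intersection_wh(box_a, box_b):
    """Width and height of the overlap of two [x1, y1, x2, y2] boxes, before clipping."""
    lt_x = max(box_a[0], box_b[0])
    lt_y = max(box_a[1], box_b[1])
    rb_x = min(box_a[2], box_b[2])
    rb_y = min(box_a[3], box_b[3])
    return rb_x - lt_x, rb_y - lt_y


def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    w, h = intersection_wh(box_a, box_b)
    # Clip at 0, like wh[wh < 0] = 0. Without clipping, two negative
    # values would multiply into a positive, spurious intersection.
    inter = max(w, 0.0) * max(h, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Two 2x2 boxes that share a 1x1 corner region.
overlapping = iou([0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0])

# Two unit boxes that do not touch at all.
disjoint_wh = intersection_wh([0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0])
disjoint = iou([0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0])
```

Here `overlapping` is 1/7 (intersection 1, union 4 + 4 - 1), `disjoint_wh` is (-1.0, -1.0), and `disjoint` is 0.0 thanks to the clipping.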
2. The harder part is also the most important one
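# Mask of cells that contain an object, [n_batch, S, S].
# `...` keeps all leading dims; index 4 on the last dim picks the 5th entry, i.e. the first box's confidence.
# (Why 30 dims, and what the second box looks like, comes later.)
coord_mask = target_tensor[..., 4] > 0
# Mask of cells without an object, [n_batch, S, S]
noobj_mask = target_tensor[..., 4] == 0
# Repeat each cell's boolean along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)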
The target stores a ground-truth confidence: 1 if the cell contains an object, 0 if it does not. This is not the value the network produces during training; the network output is a fuzzy number that is supposed to lie between 0 and 1.
The confidence in the target is assigned by the encode method of the VOC dataset code.
Here the cells where an object exists and the cells where none exists are selected as coord_mask and noobj_mask. Each has shape (batch_size, S, S) and holds True/False: whether that grid cell (what I loosely call a pixel) of that image in the batch contains an object.
The masks are then expanded to (batch_size, S, S, 30) to match the 30-dimensional last axis.
If a cell is responsible for an object, all 30 entries are True; if not, all 30 are False.
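As a plain-Python analogue of the two masks (nested lists instead of tensors, one image, a 2x2 grid instead of 7x7), the sketch below may help. The helper `make_cell` and the grid contents are hypothetical; it assumes the target layout described above, with the first box's confidence at index 4.

```python
N = 30  # assumed layout: 2 boxes x [x, y, w, h, conf] + 20 class scores


def make_cell(has_object):
    """Hypothetical target vector for one cell: confidences are 1 only if an object is present."""
    row = [0.0] * N
    if has_object:
        row[4] = 1.0  # confidence of box 1
        row[9] = 1.0  # confidence of box 2 (same ground truth in both slots)
    return row


# One image with a 2x2 grid; only the cell at row 0, column 1 holds an object.
target = [[make_cell(False), make_cell(True)],
          [make_cell(False), make_cell(False)]]

# Analogue of target_tensor[..., 4] > 0 and target_tensor[..., 4] == 0.
coord_mask = [[cell[4] > 0 for cell in row] for row in target]
noobj_mask = [[cell[4] == 0 for cell in row] for row in target]

# Analogue of unsqueeze(-1).expand_as(...): each cell's boolean is
# repeated across all N entries of that cell.
coord_mask_expanded = [[[flag] * N for flag in row] for row in coord_mask]
```

After this, `coord_mask` is `[[False, True], [False, False]]`, and the expanded mask has 30 identical booleans per cell.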
Next part
# Predictions at cells that contain an object, [n_coord, N]. view works like reshape here.
coord_pred = pred_tensor[coord_mask].view(-1, N)
# Boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)  # contiguous() avoids a view error on non-contiguous memory
# Predicted class scores, [n_coord, C]
class_pred = coord_pred[:, 5*B:]
# Targets at cells that contain an object, [n_coord, N]
coord_target = target_tensor[coord_mask].view(-1, N)
# Target boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5)
# Target class labels, [n_coord, C]
class_target = coord_target[:, 5*B:]
The prediction tensor is the output of the network. I call it the predicted value here, but it is only the raw output of the network during training, not the final result of detection. From it we take every cell that contains an object according to the target and call the result coord_pred.
coord_pred holds the predictions at the cells that really contain objects; below it is split into a bbox part and a class part.
The target is split into bbox and class parts in the same way for later comparison. The two cuts have lengths 10 (5*B) and 20 (C).
P.S. Indexing with the mask flattens the first three dimensions, and view(-1, N) leaves only the last dimension N, which is 30.
On the whole, pred_tensor.view(-1, N) would have shape (batch_size*S*S, N), and coord_pred has the analogous shape (number of cells containing an object across all images in the batch, N). In plain words, the first dimension is roughly the total number of ground-truth boxes in the batch (exactly that when no two boxes share a cell).
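The shapes are easier to follow on a toy example. The sketch below mimics the selection and the split with plain lists; the three cells, their values and the helper `split_cell` are invented for illustration, and `has_object` stands in for the target's confidence check.

```python
B, C = 2, 20
N = 5 * B + C  # 30


def split_cell(row):
    """Split one N-long cell vector into B boxes of 5 values and C class scores."""
    boxes = [row[5 * b:5 * b + 5] for b in range(B)]
    classes = row[5 * B:]
    return boxes, classes


# Three hypothetical cells. The value 100*cell + position makes it easy
# to see where each number came from.
cells = [[100.0 * c + k for k in range(N)] for c in range(3)]
has_object = [True, False, True]

# Analogue of pred_tensor[coord_mask].view(-1, N): keep whole rows.
coord_rows = [row for row, keep in zip(cells, has_object) if keep]  # [n_coord, N]

bbox_rows, class_rows = [], []
for row in coord_rows:
    boxes, classes = split_cell(row)
    bbox_rows.extend(boxes)     # grows to [n_coord x B, 5]
    class_rows.append(classes)  # grows to [n_coord, C]
```

With two object cells, `bbox_rows` ends up with 4 rows of 5 values (two consecutive rows per cell) and `class_rows` with 2 rows of 20 values. The fact that the two boxes of one cell sit next to each other is what the later loop with step B relies on.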
# Cells without an object.
# Predictions at cells without an object, [n_noobj, N], n_noobj = n_batch x S x S - n_coord
noobj_pred = pred_tensor[noobj_mask].view(-1, N)
# Targets at cells without an object, [n_noobj, N]
noobj_target = target_tensor[noobj_mask].view(-1, N)
noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N], requires a CUDA device
for b in range(B):
    # Mark the confidence columns (4 and 9 when B=2) so they can be pulled out side by side below.
    noobj_conf_mask[:, 4 + b*5] = 1
noobj_pred_conf = noobj_pred[noobj_conf_mask]     # [n_noobj x 2=len([conf1, conf2])]
noobj_target_conf = noobj_target[noobj_conf_mask] # [n_noobj x 2=len([conf1, conf2])]
# Confidence loss for cells without an object. If reduction is not given, mse_loss defaults to 'mean',
# which averages the error over all elements; 'sum' adds the squared errors instead.
# Equivalent: loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf) * len(noobj_pred_conf)
loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')
This part takes out the predictions at all cells that contain no object. The for loop sets the confidence positions of the mask to 1 in advance, so that exactly those entries can be extracted.
After extracting the two confidences per cell we compute the MSE, and the sum of squared differences is the confidence loss of the cells that are not responsible for any object. In effect it is (0 - predicted confidence)^2. The paper also puts a weight on this term: there are far more cells without objects than with them, so to keep the balance the weight is relatively low (lambda_noobj = 0.5 in the constructor).
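A small plain-Python sketch of this step, with made-up confidence values for three empty cells, shows both the column selection and the relation between the 'sum' and 'mean' reductions. It only mirrors the arithmetic; it does not call mse_loss.

```python
B = 2
conf_columns = [4 + 5 * b for b in range(B)]  # confidence positions: [4, 9]

# Hypothetical predictions at three cells without an object.
# Only the confidence columns matter for this loss term.
noobj_pred = [[0.0] * 30 for _ in range(3)]
noobj_pred[0][4], noobj_pred[0][9] = 0.5, 0.0
noobj_pred[1][4], noobj_pred[1][9] = 0.0, 1.0
noobj_pred[2][4], noobj_pred[2][9] = 0.5, 0.5

# The target confidence of an empty cell is 0 everywhere.
noobj_target = [[0.0] * 30 for _ in range(3)]

# Pull out the confidences, two per cell, flattened into one list.
pred_conf = [row[c] for row in noobj_pred for c in conf_columns]
target_conf = [row[c] for row in noobj_target for c in conf_columns]

squared_errors = [(p - t) ** 2 for p, t in zip(pred_conf, target_conf)]
loss_sum = sum(squared_errors)               # what reduction='sum' computes
loss_mean = loss_sum / len(squared_errors)   # what the default 'mean' computes
```

Here `loss_sum` is 1.75 over 6 confidences, and multiplying `loss_mean` by the number of elements gives the same value, which is why the commented-out line in the code is equivalent to the one actually used.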
coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0) # [n_coord x B, 5]
coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1)# [n_coord x B, 5]
bbox_target_iou = torch.zeros(bbox_target.size()).cuda() # [n_coord x B, 5], only the last 1=(conf,) is used
These lines initialize the variables needed by the loop below.
for i in range(0, bbox_target.size(0), B):
    pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
    pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
    # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:,2:4] are normalized for cell-size and image-size respectively,
    # rescale (center_x,center_y) for the image-size to compute IoU correctly.
    pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
    pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
    target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
    target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
    target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
    # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:,2:4] are normalized for cell-size and image-size respectively,
    # rescale (center_x,center_y) for the image-size to compute IoU correctly.
    target_xyxy[:, :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
    target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]
    iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
    max_iou, max_index = iou.max(0)
    max_index = max_index.data.cuda()
    coord_response_mask[i+max_index] = 1
    coord_not_response_mask[i+max_index] = 0
    # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
    # from the original paper of YOLO.
    bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
Each iteration of the loop handles one cell: the index steps by B (= 2), so pred holds the two boxes predicted for that cell.
The two predicted boxes are independent and differ from each other, while in the target the VOC encoding writes the same ground-truth box into both slots.
pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
I did not understand the division by S at first and skipped it. The comment in the code gives the reason: the center (x, y) is normalized to the cell size while (w, h) is normalized to the image size, so the center is divided by S to put it on the image scale before the corners are computed.
The fields of pred and target correspond one to one.
Both store the center x, y and the size w, h. In the target these values come from the VOC encoding (per the code comments, the center relative to the cell and w, h relative to the image), while pred is whatever the network predicts, which is essentially random at the start of training.
compute_iou expects corner coordinates (x1, y1, x2, y2), so they have to be computed here first.
So each iteration finds the maximum IoU and its index, that is, which of the two boxes has the larger IoU, and marks that box in coord_response_mask and coord_not_response_mask. These two masks have exactly the same shape as bbox_target.
bbox_target_iou stores the maximum IoU on the GPU so it can be used for the loss below, and the loop ends.
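To make one iteration of the loop concrete, here is a minimal plain-Python sketch for a single cell. The box values are hypothetical, and `to_corners` and `iou` are illustrative helpers, not the methods of the class; the conversion follows the same center/S ± 0.5 * size formula as the code.

```python
S = 7  # grid size, as in the code


def to_corners(box):
    """[x, y, w, h] -> [x1, y1, x2, y2], with the center divided by S as in the loop."""
    x, y, w, h = box
    return [x / S - 0.5 * w, y / S - 0.5 * h,
            x / S + 0.5 * w, y / S + 0.5 * h]


def iou(a, b):
    """IoU of two corner-format boxes, clipped at 0 for disjoint boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Hypothetical values for one cell: two predicted boxes and the single target box.
preds = [[0.5, 0.5, 0.10, 0.10],
         [0.5, 0.5, 0.20, 0.40]]
target = [0.5, 0.5, 0.20, 0.30]

ious = [iou(to_corners(p), to_corners(target)) for p in preds]

# The box with the larger IoU is the one "responsible" for the object.
max_index = max(range(len(ious)), key=lambda k: ious[k])
max_iou = ious[max_index]  # this value becomes the confidence target
```

In this example the second box wins with an IoU of 0.75 against roughly 0.167 for the first. Note that the conversion leaves out the position of the cell itself; since both predicted boxes and the target belong to the same cell, they are all shifted by the same amount and the IoU is unaffected.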
The following is the content outside the loop
bbox_target_iou = Variable(bbox_target_iou).cuda()
# BBox location/size and objectness loss for the response bboxes.
bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5) # [n_response, 5]
bbox_target_response = bbox_target[coord_response_mask].view(-1, 5) # [n_response, 5], only the first 4=(x, y, w, h) are used
target_iou = bbox_target_iou[coord_response_mask].view(-1, 5) # [n_response, 5], only the last 1=(conf,) is used
loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')
################################################################################
# Class probability loss for the cells which contain objects.
loss_class = F.mse_loss(class_pred, class_target, reduction='sum')
# Total loss
loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
loss = loss / float(batch_size)
This is the loss for the boxes that are responsible for an object.
xy loss: because only a few cells contain objects, the paper gives this term a weight of 5 (lambda_coord) to increase its share of the total loss.
wh loss: the sizes of large and small boxes can differ by a factor of several or even dozens, so the same absolute error would count equally for both although it matters far more for a small box, and large boxes would dominate the loss. The paper therefore takes the square root of w and h so that the gap is not so large. Because few cells contain objects, this term also gets a weight of 5.
Confidence loss: the value the network outputs is random at first. The idea of the paper is to push it toward the value it should have, the true confidence. Wherever the predicted box lies, its confidence should reflect how well it overlaps the ground-truth box, which is measured by the IoU with that box. In YOLOv3, however, the label is 1 instead of this IoU. If this is unclear, do not worry about it for now.
The loss itself is computed in a simple way that follows the paper. With default arguments mse_loss averages over the elements; passing reduction='sum' makes it return the plain sum of the squared differences.
Finally, the returned loss is divided by batch_size to average over the batch.
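The effect of the square root can be checked with two numbers. The sketch below uses made-up widths and plain Python floats; `total_loss` simply restates the weighted sum from the code with the constructor's default weights, and all names here are illustrative.

```python
import math


def sq_err(pred, target):
    """Squared error on the raw value."""
    return (pred - target) ** 2


def sqrt_sq_err(pred, target):
    """Squared error on the square roots, as used for w and h."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2


# The same absolute width error (0.05) on a small box and on a large box.
small_raw = sq_err(0.15, 0.10)
large_raw = sq_err(0.95, 0.90)
small_sqrt = sqrt_sq_err(0.15, 0.10)
large_sqrt = sqrt_sq_err(0.95, 0.90)


def total_loss(loss_xy, loss_wh, loss_obj, loss_noobj, loss_class,
               batch_size, lambda_coord=5.0, lambda_noobj=0.5):
    """Same combination as the last two lines of forward."""
    loss = (lambda_coord * (loss_xy + loss_wh) + loss_obj
            + lambda_noobj * loss_noobj + loss_class)
    return loss / float(batch_size)
```

On the raw values both errors are the same (about 0.0025), while on the square roots the small box is penalized much more (roughly 0.005 versus 0.0007). One caveat worth keeping in mind: the square root is only defined for non-negative inputs, so the loss relies on the predicted w and h not being negative.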
def forward(self, pred_tensor, target_tensor):  # target_tensor[2,0,0,:]
    """ Compute loss for YOLO training.
    Args:
        pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
        target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
    Returns:
        (Tensor): loss, sized [1, ].
    """
    # TODO: Remove redundant dimensions for some Tensors.
    # Grid size S=7, boxes predicted per cell B=2, number of classes C=20
    S, B, C = self.S, self.B, self.C
    N = 5 * B + C    # 5=len([x, y, w, h, conf]), N=30
    # Batch size
    batch_size = pred_tensor.size(0)
    # Mask of cells that contain an object, [n_batch, S, S].
    # `...` keeps all leading dims; index 4 on the last dim picks the 5th entry, i.e. the first box's confidence.
    coord_mask = target_tensor[..., 4] > 0
    # Mask of cells without an object, [n_batch, S, S]
    noobj_mask = target_tensor[..., 4] == 0
    # Repeat each cell's boolean along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
    coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
    noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)
    # uint8 -> bool. Comparisons already return bool in PyTorch >= 1.2, so this is a no-op there.
    noobj_mask = noobj_mask.bool()
    coord_mask = coord_mask.bool()
    ##################################################
    # Predictions at cells that contain an object, [n_coord, N]
    coord_pred = pred_tensor[coord_mask].view(-1, N)
    # Boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
    bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)  # contiguous() avoids a view error on non-contiguous memory
    # Predicted class scores, [n_coord, C]
    class_pred = coord_pred[:, 5*B:]
    # Targets at cells that contain an object, [n_coord, N]
    coord_target = target_tensor[coord_mask].view(-1, N)
    # Target boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
    bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5)
    # Target class labels, [n_coord, C]
    class_target = coord_target[:, 5*B:]
    ##################################################
    # Cells without an object.
    # Predictions at cells without an object, [n_noobj, N], n_noobj = n_batch x S x S - n_coord
    noobj_pred = pred_tensor[noobj_mask].view(-1, N)
    # Targets at cells without an object, [n_noobj, N]
    noobj_target = target_tensor[noobj_mask].view(-1, N)
    noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N], requires a CUDA device
    for b in range(B):
        # Mark the confidence columns (4 and 9 when B=2) so they can be pulled out side by side below.
        noobj_conf_mask[:, 4 + b*5] = 1
    noobj_pred_conf = noobj_pred[noobj_conf_mask]     # [n_noobj x 2=len([conf1, conf2])]
    noobj_target_conf = noobj_target[noobj_conf_mask] # [n_noobj x 2=len([conf1, conf2])]
    # Confidence loss for cells without an object. If reduction is not given, mse_loss defaults to 'mean',
    # which averages the error over all elements; 'sum' adds the squared errors instead.
    # Equivalent: loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf) * len(noobj_pred_conf)
    loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')
    #################################################################################
    # Compute loss for the cells with objects.
    coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0)     # [n_coord x B, 5]
    coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1) # [n_coord x B, 5]
    bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                     # [n_coord x B, 5], only the last 1=(conf,) is used
    # Choose the predicted bbox having the highest IoU for each target bbox.
    for i in range(0, bbox_target.size(0), B):
        pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
        pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:,2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x,center_y) for the image-size to compute IoU correctly.
        pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
        pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
        target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
        target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
        target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:,2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x,center_y) for the image-size to compute IoU correctly.
        target_xyxy[:, :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
        target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]
        iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
        max_iou, max_index = iou.max(0)
        max_index = max_index.data.cuda()
        coord_response_mask[i+max_index] = 1
        coord_not_response_mask[i+max_index] = 0
        # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
        # from the original paper of YOLO.
        bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
    bbox_target_iou = Variable(bbox_target_iou).cuda()
    # BBox location/size and objectness loss for the response bboxes.
    bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)     # [n_response, 5]
    bbox_target_response = bbox_target[coord_response_mask].view(-1, 5) # [n_response, 5], only the first 4=(x, y, w, h) are used
    target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)       # [n_response, 5], only the last 1=(conf,) is used
    loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
    loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
    loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')
    ################################################################################
    # Class probability loss for the cells which contain objects.
    loss_class = F.mse_loss(class_pred, class_target, reduction='sum')
    # Total loss
    loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
    loss = loss / float(batch_size)
    return loss
Full code
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class Loss(nn.Module):
    def __init__(self, feature_size=7, num_bboxes=2, num_classes=20, lambda_coord=5.0, lambda_noobj=0.5):
        """ Constructor.
        Args:
            feature_size: (int) size of input feature map.
            num_bboxes: (int) number of bboxes per each cell.
            num_classes: (int) number of the object classes.
            lambda_coord: (float) weight for bbox location/size losses.
            lambda_noobj: (float) weight for no-objectness loss.
        """
        super(Loss, self).__init__()
        self.S = feature_size
        self.B = num_bboxes
        self.C = num_classes
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj

    def compute_iou(self, bbox1, bbox2):
        """ Compute the IoU (Intersection over Union) of two sets of bboxes, each bbox format: [x1, y1, x2, y2].
        Args:
            bbox1: (Tensor) bounding bboxes, sized [N, 4].
            bbox2: (Tensor) bounding bboxes, sized [M, 4].
        Returns:
            (Tensor) IoU, sized [N, M].
        """
        N = bbox1.size(0)
        M = bbox2.size(0)
        # Compute left-top coordinate of the intersections
        lt = torch.max(
            bbox1[:, :2].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, :2].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute right-bottom coordinate of the intersections
        rb = torch.min(
            bbox1[:, 2:].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
            bbox2[:, 2:].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
        )
        # Compute area of the intersections from the coordinates
        wh = rb - lt   # width and height of the intersection, [N, M, 2]
        wh[wh < 0] = 0 # clip at 0
        inter = wh[:, :, 0] * wh[:, :, 1] # [N, M]
        # Compute area of the bboxes
        area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N, ]
        area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M, ]
        area1 = area1.unsqueeze(1).expand_as(inter) # [N, ] -> [N, 1] -> [N, M]
        area2 = area2.unsqueeze(0).expand_as(inter) # [M, ] -> [1, M] -> [N, M]
        # Compute IoU from the areas
        union = area1 + area2 - inter # [N, M]
        iou = inter / union           # [N, M]
        return iou

    def forward(self, pred_tensor, target_tensor):  # target_tensor[2,0,0,:]
        """ Compute loss for YOLO training.
        Args:
            pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
            target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
        Returns:
            (Tensor): loss, sized [1, ].
        """
        # TODO: Remove redundant dimensions for some Tensors.
        # Grid size S=7, boxes predicted per cell B=2, number of classes C=20
        S, B, C = self.S, self.B, self.C
        N = 5 * B + C    # 5=len([x, y, w, h, conf]), N=30
        # Batch size
        batch_size = pred_tensor.size(0)
        # Mask of cells that contain an object, [n_batch, S, S].
        # `...` keeps all leading dims; index 4 on the last dim picks the 5th entry, i.e. the first box's confidence.
        coord_mask = target_tensor[..., 4] > 0
        # Mask of cells without an object, [n_batch, S, S]
        noobj_mask = target_tensor[..., 4] == 0
        # Repeat each cell's boolean along a new last dim, [n_batch, S, S] -> [n_batch, S, S, N]
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)
        # uint8 -> bool. Comparisons already return bool in PyTorch >= 1.2, so this is a no-op there.
        noobj_mask = noobj_mask.bool()
        coord_mask = coord_mask.bool()
        ##################################################
        # Predictions at cells that contain an object, [n_coord, N]
        coord_pred = pred_tensor[coord_mask].view(-1, N)
        # Boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)  # contiguous() avoids a view error on non-contiguous memory
        # Predicted class scores, [n_coord, C]
        class_pred = coord_pred[:, 5*B:]
        # Targets at cells that contain an object, [n_coord, N]
        coord_target = target_tensor[coord_mask].view(-1, N)
        # Target boxes and confidences, [n_coord x B, 5=len([x, y, w, h, conf])]
        bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5)
        # Target class labels, [n_coord, C]
        class_target = coord_target[:, 5*B:]
        ##################################################
        # Cells without an object.
        # Predictions at cells without an object, [n_noobj, N], n_noobj = n_batch x S x S - n_coord
        noobj_pred = pred_tensor[noobj_mask].view(-1, N)
        # Targets at cells without an object, [n_noobj, N]
        noobj_target = target_tensor[noobj_mask].view(-1, N)
        noobj_conf_mask = torch.cuda.BoolTensor(noobj_pred.size()).fill_(0) # [n_noobj, N], requires a CUDA device
        for b in range(B):
            # Mark the confidence columns (4 and 9 when B=2) so they can be pulled out side by side below.
            noobj_conf_mask[:, 4 + b*5] = 1
        noobj_pred_conf = noobj_pred[noobj_conf_mask]     # [n_noobj x 2=len([conf1, conf2])]
        noobj_target_conf = noobj_target[noobj_conf_mask] # [n_noobj x 2=len([conf1, conf2])]
        # Confidence loss for cells without an object. If reduction is not given, mse_loss defaults to 'mean',
        # which averages the error over all elements; 'sum' adds the squared errors instead.
        # Equivalent: loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf) * len(noobj_pred_conf)
        loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')
        #################################################################################
        # Compute loss for the cells with objects.
        coord_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(0)     # [n_coord x B, 5]
        coord_not_response_mask = torch.cuda.BoolTensor(bbox_target.size()).fill_(1) # [n_coord x B, 5]
        bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                     # [n_coord x B, 5], only the last 1=(conf,) is used
        # Choose the predicted bbox having the highest IoU for each target bbox.
        for i in range(0, bbox_target.size(0), B):
            pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
            pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=pred[:, :2] and (w,h)=pred[:,2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            pred_xyxy[:, :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
            pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]
            target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
            target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
            target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
            # Because (center_x,center_y)=target[:, :2] and (w,h)=target[:,2:4] are normalized for cell-size and image-size respectively,
            # rescale (center_x,center_y) for the image-size to compute IoU correctly.
            target_xyxy[:, :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
            target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]
            iou = self.compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
            max_iou, max_index = iou.max(0)
            max_index = max_index.data.cuda()
            coord_response_mask[i+max_index] = 1
            coord_not_response_mask[i+max_index] = 0
            # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
            # from the original paper of YOLO.
            bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
        bbox_target_iou = Variable(bbox_target_iou).cuda()
        # BBox location/size and objectness loss for the response bboxes.
        bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)     # [n_response, 5]
        bbox_target_response = bbox_target[coord_response_mask].view(-1, 5) # [n_response, 5], only the first 4=(x, y, w, h) are used
        target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)       # [n_response, 5], only the last 1=(conf,) is used
        loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
        loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
        loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')
        ################################################################################
        # Class probability loss for the cells which contain objects.
        loss_class = F.mse_loss(class_pred, class_target, reduction='sum')
        # Total loss
        loss = self.lambda_coord * (loss_xy + loss_wh) + loss_obj + self.lambda_noobj * loss_noobj + loss_class
        loss = loss / float(batch_size)
        return loss