Implementing the YOLO v3 object detection algorithm from scratch in PyTorch (4)
This is Part 4 of the tutorial on implementing a YOLO v3 detector from scratch. In the previous part, we implemented the forward pass of the network. In this part, we threshold the detections by objectness confidence and then apply non-maximum suppression.
The code in this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in full in this GitHub repository.
This tutorial is divided into 5 parts:
- Part 1: Understanding How YOLO Works
- Part 2: Creating the Network Structure
- Part 3: Implementing forward propagation of the network
- Part 4: Object Confidence Thresholding and Non-Maximum Suppression
- Part 5: Designing Input and Output Pipelines
Prerequisites
1. The first 3 parts of the tutorial
2. Basic knowledge of PyTorch, including how to create custom architectures with the nn.Module, nn.Sequential and torch.nn.parameter classes.
3. Basic knowledge about Numpy
In the previous 3 parts, we built a model that outputs multiple object detections for a given input image. Specifically, our output is a tensor of shape B x 10647 x 85, where B is the number of images in a batch, 10647 is the number of bounding boxes predicted per image, and 85 is the number of bounding box attributes.
However, as described in Part 1, we must subject our output to objectness score thresholding and non-maximum suppression (NMS) to obtain what I will call the "true" detections for the rest of this post. To do this, we will create a function called write_results in the util.py file.
def write_results(prediction, confidence, num_classes, nms=True, nms_conf=0.4):
The inputs to this function are prediction, confidence (the objectness score threshold), num_classes (80 in our case) and nms_conf (the NMS IoU threshold).
Object confidence thresholding
Our prediction tensor contains information about B x 10647 bounding boxes. For each bounding box whose objectness score is below the threshold, we set every one of its attributes (the entire row representing the bounding box) to zero.
conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)
prediction = prediction * conf_mask
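To see what the mask does, here is a minimal sketch on a made-up tensor with one image, three boxes, and only 6 attributes per box instead of 85. The values and the 0.5 threshold are illustrative assumptions, not outputs of the model.

```python
import torch

# Toy prediction: batch of 1 image, 3 boxes, 6 attributes each
# (x, y, w, h, objectness, one class score). All values are made up.
prediction = torch.tensor([[
    [10.0, 10.0, 4.0, 4.0, 0.9, 0.8],
    [20.0, 20.0, 6.0, 6.0, 0.2, 0.7],
    [30.0, 30.0, 8.0, 8.0, 0.6, 0.1],
]])
confidence = 0.5  # assumed objectness threshold

# 1.0 where objectness > threshold, else 0.0; shape (1, 3, 1) so it broadcasts
conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)
masked = prediction * conf_mask

# Boxes that survive are the rows whose objectness is still non-zero
kept = int((masked[0, :, 4] != 0).sum())
```

The second box has an objectness of 0.2, so its whole row becomes zero, while the other two rows are left unchanged.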
Perform non-maximum suppression
Note: I assume you already understand what IoU (Intersection over Union) and non-maximum suppression are. If not, please refer to the links provided at the end of the article.
The bounding box attributes we have now are described by the center coordinates and the width and height of the bounding box. However, it is easier to compute the IoU of two boxes using the coordinates of a pair of diagonal corners of each box. So, we convert the (center x, center y, width, height) attributes of our boxes to (upper left x, upper left y, lower right x, lower right y).
box_a = prediction.new(prediction.shape)
box_a[:, :, 0] = (prediction[:, :, 0] - prediction[:, :, 2] / 2)
box_a[:, :, 1] = (prediction[:, :, 1] - prediction[:, :, 3] / 2)
box_a[:, :, 2] = (prediction[:, :, 0] + prediction[:, :, 2] / 2)
box_a[:, :, 3] = (prediction[:, :, 1] + prediction[:, :, 3] / 2)
prediction[:, :, :4] = box_a[:, :, :4]
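As a quick check of the arithmetic, the same conversion can be sketched for a single box in plain Python. The helper name center_to_corners and the numbers are hypothetical and only serve as an illustration.

```python
def center_to_corners(cx, cy, w, h):
    """Convert (center x, center y, width, height) to corner coordinates."""
    x1 = cx - w / 2  # upper left x
    y1 = cy - h / 2  # upper left y
    x2 = cx + w / 2  # lower right x
    y2 = cy + h / 2  # lower right y
    return x1, y1, x2, y2


# A box centered at (100, 50) that is 40 wide and 20 high
corners = center_to_corners(100.0, 50.0, 40.0, 20.0)
```

For this box the corners come out as (80, 40) and (120, 60): half the width is subtracted from and added to the center x, and half the height is subtracted from and added to the center y.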
The number of "true" detections may differ from image to image. For example, a batch of size 3 with images 1, 2 and 3 may have 5, 2 and 4 "true" detections respectively. Therefore, confidence thresholding and NMS have to be done one image at a time. This means we cannot vectorize the operations involved and must loop over the first dimension of prediction (which holds the indices of the images in a batch).
batch_size = prediction.size(0)
write = False
for ind in range(batch_size):
    # select the image from the batch
    image_pred = prediction[ind]
    # confidence thresholding
    # NMS
The write flag indicates whether we have initialized output yet, the tensor we use to collect the "true" detections across the entire batch.
Once inside the loop, let's clean things up a bit. Note that each bounding box row has 85 attributes, 80 of which are class scores. At this point, we only care about the class score with the largest value. So, we remove the 80 class scores from each row and instead add the score of the class with the largest value and the index of that class.
# Get the highest class score and the index of that class for every row.
# torch.max returns (values, indices): max_conf is the score, max_conf_score is the class index.
# Replace the num_classes class scores with these two columns (score, then index).
max_conf, max_conf_score = torch.max(image_pred[:, 5:5 + num_classes], 1)
max_conf = max_conf.float().unsqueeze(1)
max_conf_score = max_conf_score.float().unsqueeze(1)
seq = (image_pred[:, :5], max_conf, max_conf_score)
image_pred = torch.cat(seq, 1)
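The following sketch shows the same reduction on a toy tensor with 3 classes instead of 80. The values are made up, and the names best_score and best_index are assumptions chosen here to make it clear which output of torch.max is which.

```python
import torch

# Toy rows: 5 box attributes followed by 3 class scores (made-up values)
num_classes = 3
image_pred = torch.tensor([
    [0.0, 0.0, 10.0, 10.0, 0.9, 0.1, 0.7, 0.2],
    [5.0, 5.0, 20.0, 20.0, 0.8, 0.6, 0.3, 0.1],
])

# torch.max along dim 1 returns (max values, indices of the max values)
best_score, best_index = torch.max(image_pred[:, 5:5 + num_classes], 1)

# Replace the class scores with two columns: best score, then class index
compact = torch.cat(
    (image_pred[:, :5],
     best_score.unsqueeze(1),
     best_index.float().unsqueeze(1)),
    1,
)
```

Each row shrinks from 5 + num_classes columns to 7: the four box coordinates, the objectness score, the best class score, and the index of that class.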
Remember that we set the bounding box rows with an object confidence below the threshold to zero? Let's get rid of them.
# Get rid of the zero entries
non_zero_ind = (torch.nonzero(image_pred[:, 4]))
try:
    image_pred_ = image_pred[non_zero_ind.squeeze(), :].view(-1, 7)
except:
    continue
The try-except block is there to handle the case where we get no detections. In that case, we use continue to skip the rest of the loop body for this image.
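As an alternative sketch, not the code used in this tutorial, the zeroed rows can also be dropped with a boolean mask. With a mask, an image without detections simply yields an empty tensor, so the empty case can be tested explicitly instead of relying on an exception. The tensors below are made up for illustration.

```python
import torch

# columns: x1, y1, x2, y2, objectness, class score, class index
image_pred = torch.tensor([
    [1.0, 1.0, 5.0, 5.0, 0.9, 0.8, 2.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # zeroed by the objectness threshold
    [2.0, 2.0, 6.0, 6.0, 0.7, 0.6, 0.0],
])

# Keep only the rows whose objectness score is non-zero
keep = image_pred[:, 4] != 0
image_pred_ = image_pred[keep]
has_detections = image_pred_.size(0) > 0

# An image where every row was zeroed gives an empty (0, 7) tensor
all_zero = torch.zeros(3, 7)
empty = all_zero[all_zero[:, 4] != 0]
```

Inside the batch loop, a check such as image_pred_.size(0) == 0 followed by continue would then play the role of the except branch.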
Now, let's get the detected classes in an image.
# Get the various classes detected in the image
img_classes = unique(image_pred_[:, -1])
Because there can be multiple "true" detections of the same class, we use a function called unique to get the classes present in any given image.
def unique(tensor):
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)
    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res
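The helper above goes through NumPy. As a hedged alternative, PyTorch also ships a built-in torch.unique; the sketch below assumes a PyTorch version that provides it and uses a made-up class column. The result is sorted explicitly so the example does not depend on the order in which the values are returned.

```python
import torch

# Class indices of the detections in one image (made-up values)
class_column = torch.tensor([2.0, 0.0, 2.0, 16.0, 0.0])

# Distinct class indices present in the image
img_classes = torch.unique(class_column)
found = sorted(img_classes.tolist())
```

Here three distinct classes (0, 2 and 16) are found among the five detections.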
Then, we perform NMS by category.
# We will do NMS class-wise
for cls in img_classes:
Once we enter the loop, the first thing we do is extract detections for a particular class (represented by the variable cls).
# get the detections with one particular class
cls_mask = image_pred_ * (image_pred_[:, -1] == cls).float().unsqueeze(1)
class_mask_ind = torch.nonzero(cls_mask[:, -2]).squeeze()
image_pred_class = image_pred_[class_mask_ind].view(-1, 7)
# sort the detections such that the entry with the maximum objectness
# confidence is at the top
conf_sort_index = torch.sort(image_pred_class[:, 4], descending=True)[1]
image_pred_class = image_pred_class[conf_sort_index]
idx = image_pred_class.size(0)
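A small sketch of this selection and sorting step on made-up detections follows. It uses a boolean mask rather than the multiply-and-nonzero approach above, which is an assumption made here for brevity; the result is the same set of rows.

```python
import torch

# columns: x1, y1, x2, y2, objectness, class score, class index
image_pred_ = torch.tensor([
    [0.0, 0.0, 10.0, 10.0, 0.6, 0.9, 1.0],
    [1.0, 1.0, 11.0, 11.0, 0.9, 0.8, 1.0],
    [50.0, 50.0, 60.0, 60.0, 0.7, 0.7, 3.0],
])
cls = 1.0

# Keep only the detections of class cls
image_pred_class = image_pred_[image_pred_[:, -1] == cls]

# Sort them by objectness score (column 4), highest first
order = torch.sort(image_pred_class[:, 4], descending=True)[1]
image_pred_class = image_pred_class[order]
```

After sorting, the row with objectness 0.9 comes first even though it appeared second in the input.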
Now, we perform NMS.
# For each detection
for i in range(idx):
    # Get the IoUs of all boxes that come after the one we are looking at
    # in the loop
    try:
        ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i + 1:])
    except ValueError:
        break
    except IndexError:
        break
    # Zero out all the detections that have IoU > threshold
    iou_mask = (ious < nms_conf).float().unsqueeze(1)
    image_pred_class[i + 1:] *= iou_mask
    # Remove the zeroed-out entries (keep only rows with non-zero objectness)
    non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
    image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)
Here, we use the function bbox_iou. Its first input is the bounding box row indexed by the loop variable i. Its second input is a tensor of multiple bounding box rows. The output of bbox_iou is a tensor containing the IoU of the box given by the first input with each of the boxes in the second input.
If two bounding boxes of the same class have an IoU greater than the threshold, the one with the lower confidence is removed. Because we have already sorted the boxes by objectness score (column 4), the ones with higher confidence come first.
In the body of the loop, the following line gives the IoU of the box indexed by i with every bounding box that has an index higher than i.
ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i + 1:])
At each iteration, if any bounding box with an index greater than i has an IoU (with the box indexed by i) larger than the threshold nms_conf, that box is removed.
# Zero out all the detections that have IoU > threshold
iou_mask = (ious < nms_conf).float().unsqueeze(1)
image_pred_class[i + 1:] *= iou_mask
# Remove the zeroed-out entries (keep only rows with non-zero objectness)
non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)
Also notice that we've put the line that computes ious in a try-except block. This is because the loop is designed to run idx iterations (the number of rows in image_pred_class). However, as the loop proceeds, some bounding boxes may be removed from image_pred_class, so even a single removal means there are no longer idx rows to iterate over. We might therefore index a value that is out of bounds (IndexError), or the slice image_pred_class[i+1:] might return an empty tensor, which triggers a ValueError when it is used. At that point, we can be sure that NMS cannot remove any more bounding boxes, and we break out of the loop.
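The greedy logic of the loop can also be sketched in plain Python without exceptions, by re-checking what is left after every step. This is a simplified illustration under stated assumptions: the helpers iou and greedy_nms are hypothetical names, the boxes and scores are made up, and the IoU here treats coordinates as continuous (no +1 pixel term).

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, treating coordinates as continuous."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep only the remaining boxes that overlap the chosen box less than the threshold
        order = [k for k in order if iou(boxes[best], boxes[k]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.8, 0.9, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.4)
```

The first two boxes overlap heavily (IoU of 81 / 119, about 0.68), so only the higher scoring of the two survives, while the third box does not overlap anything and is kept.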
Calculating the IoU
def bbox_iou(box1, box2):
    """
    Returns the IoU of box1 with each box in box2.
    Boxes are rows of (x1, y1, x2, y2) corner coordinates.
    """
    # Get the coordinates of the bounding boxes
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]
    # Get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # Intersection area, clamped at zero so that non-overlapping boxes give 0.
    # torch.clamp runs on whatever device the inputs live on, so no CUDA branch is needed.
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * \
        torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)
    # Union area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)
    iou = inter_area / (b1_area + b2_area - inter_area)
    return iou
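To make the formula concrete, here is the same arithmetic worked through in plain Python for two made-up boxes. The +1 terms follow the convention used in bbox_iou, where a box spanning pixels 0 through 9 is counted as 10 pixels wide.

```python
# Two boxes as (x1, y1, x2, y2) in pixel coordinates (made-up values)
box1 = (0, 0, 9, 9)
box2 = (5, 5, 14, 14)

# Corners of the intersection rectangle
ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])

# The +1 counts both end pixels; max(..., 0) handles boxes that do not overlap
inter_area = max(ix2 - ix1 + 1, 0) * max(iy2 - iy1 + 1, 0)
area1 = (box1[2] - box1[0] + 1) * (box1[3] - box1[1] + 1)
area2 = (box2[2] - box2[0] + 1) * (box2[3] - box2[1] + 1)
iou = inter_area / (area1 + area2 - inter_area)
```

Each box covers 100 pixels and they share a 5 x 5 region of 25 pixels, so the IoU is 25 / (100 + 100 - 25) = 25 / 175, roughly 0.14.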
Writing the predictions
The write_results function outputs a tensor of shape D x 8, where D is the number of "true" detections across all images, each represented by a row. Each detection has 8 attributes: the index of the image in the batch to which the detection belongs, the 4 corner coordinates, the objectness score, the score of the class with the highest confidence, and the index of that class.
As before, we do not initialize our output tensor unless we have a detection to assign to it. Once it has been initialized, we concatenate subsequent detections to it. We use the write flag to indicate whether the tensor has been initialized. At the end of the loop that iterates over the classes, we add the resulting detections to the tensor output.
batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)
seq = batch_ind, image_pred_class
if not write:
    output = torch.cat(seq, 1)
    write = True
else:
    out = torch.cat(seq, 1)
    output = torch.cat((output, out))
At the end of the function, we check whether output has been initialized at all. If it has not, there is not a single detection in any image of the batch. In that case, we return 0.
try:
    return output
except:
    return 0
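An alternative to the write flag, shown here only as a sketch and not as the tutorial's implementation, is to collect the per-image results in a Python list and concatenate once at the end. The per-image tensors below are made-up placeholders standing in for the detections that survive NMS.

```python
import torch

# Hypothetical per-image detections after NMS, 7 columns each
per_image = [
    torch.ones(2, 7),      # image 0: two detections
    torch.zeros(0, 7),     # image 1: no detections
    torch.ones(1, 7) * 2,  # image 2: one detection
]

rows = []
for ind, dets in enumerate(per_image):
    if dets.size(0) == 0:
        continue
    # Prepend the batch index as the first column
    batch_ind = torch.full((dets.size(0), 1), float(ind))
    rows.append(torch.cat((batch_ind, dets), 1))

# Same convention as write_results: 0 when nothing was detected
output = torch.cat(rows) if rows else 0
```

The result has one row per detection and 8 columns, with the first column identifying the image each detection came from.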
That's it for this part. We finally have our predictions in the form of a tensor with one detection per row. All that's left is to create an input pipeline that reads images from disk, computes the predictions, draws bounding boxes on the images, and then displays or writes those images. That's what we'll cover in the next part.