Using PyTorch to implement the YOLO-V3 target detection algorithm from scratch (4)


This is Part 4 of the tutorial on implementing the YOLO v3 detector from scratch. In the previous part, we implemented the forward pass of the network. In this part, we threshold our detections by their objectness score and then perform non-maximum suppression (NMS) on what remains.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in full in this GitHub repository.

This tutorial is divided into 5 parts; this is the fourth.

Prerequisites:

1. The first 3 parts of the tutorial
2. Basic working knowledge of PyTorch, including how to create custom architectures with classes such as nn.Module, nn.Sequential, and torch.nn.Parameter
3. Basic knowledge of NumPy


In the previous 3 parts, we built a model that outputs multiple object detections for a given input image. Specifically, our output is a tensor of shape B x 10647 x 85, where B is the number of images in a batch, 10647 is the number of bounding boxes predicted per image, and 85 is the number of bounding box attributes.
However, as described in Part 1, we must subject our output to an objectness score threshold and to non-maximum suppression (NMS) to obtain what we will call the "true" detections in the rest of this post. To do this, we will create a function called write_results in the file util.py.

def write_results(prediction, confidence, num_classes, nms=True, nms_conf=0.4):

The function takes as input the prediction tensor, confidence (the objectness score threshold), num_classes (80, in our case) and nms_conf (the NMS IoU threshold). The nms flag simply toggles whether NMS is applied at all.

Object confidence thresholding

Our prediction tensor contains information about B x 10647 bounding boxes. For each bounding box whose objectness score is below the threshold, we set the values of all of its attributes (the entire row representing the bounding box) to zero.

    conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)
    prediction = prediction * conf_mask
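
To see what this does, here is a minimal sketch (not part of util.py) on a made-up 1 x 2 x 6 tensor, with the 85 attributes shrunk to 6 for readability; index 4 is still the objectness score.

import torch

# Hypothetical mini-batch: 1 image, 2 boxes, 6 attributes (index 4 = objectness)
prediction = torch.tensor([[[10., 10., 20., 20., 0.9, 0.7],
                            [30., 30., 10., 10., 0.2, 0.6]]])
confidence = 0.5

conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)
print(prediction * conf_mask)
# The second row (objectness 0.2 < 0.5) becomes all zeros;
# the first row passes through unchanged.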

Perform non-maximum suppression

Note: I assume you already understand what IoU (Intersection over Union) and non-maximum suppression are. If not, please refer to the links provided at the end of the article.
The bounding box attributes we have are described by the center coordinates, as well as the height and width of the bounding box. However, it is easier to compute the IoU of two boxes using the coordinates of a pair of diagonal corners of each box. So, we transform the (center x, center y, height, width) attributes of our boxes to (top-left x, top-left y, bottom-right x, bottom-right y).

    box_a = prediction.new(prediction.shape)
    box_a[:, :, 0] = (prediction[:, :, 0] - prediction[:, :, 2] / 2)
    box_a[:, :, 1] = (prediction[:, :, 1] - prediction[:, :, 3] / 2)
    box_a[:, :, 2] = (prediction[:, :, 0] + prediction[:, :, 2] / 2)
    box_a[:, :, 3] = (prediction[:, :, 1] + prediction[:, :, 3] / 2)
    prediction[:, :, :4] = box_a[:, :, :4]
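
As a quick sanity check of this transform (a toy example, not part of the tutorial code): a box with center (100, 100), width 40 and height 60 becomes (80, 70, 120, 130).

import torch

box = torch.tensor([[[100., 100., 40., 60.]]])       # (center x, center y, w, h)
corners = box.new(box.shape)
corners[:, :, 0] = box[:, :, 0] - box[:, :, 2] / 2   # top-left x  = 100 - 20 = 80
corners[:, :, 1] = box[:, :, 1] - box[:, :, 3] / 2   # top-left y  = 100 - 30 = 70
corners[:, :, 2] = box[:, :, 0] + box[:, :, 2] / 2   # bottom-right x = 120
corners[:, :, 3] = box[:, :, 1] + box[:, :, 3] / 2   # bottom-right y = 130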

The number of "true" detections in each image may vary. For example, a batch of size 3 has 3 images 1, 2, and 3, each of which has 5, 2, and 4 "true" detections. Therefore, confidence thresholding and NMS can only be done for one image at a time. That is, we cannot vectorize the operations involved and must loop over the first dimension of the prediction (which contains the indices of the images in a batch).

    batch_size = prediction.size(0)
    write = False

    for ind in range(batch_size):
        # select the image from the batch
        image_pred = prediction[ind]
        # confidence thresholding
        # NMS

As mentioned before, the write flag is used to indicate that we haven't yet initialized output, the tensor we will use to collect "true" detections across the entire batch.
Once inside the loop, let's clean things up a bit. Note that each bounding box row has 85 attributes, 80 of which are class scores. At this point, we only care about the class score with the maximum value. So, we remove the 80 class scores from each row, and instead add the index of the class with the maximum value, along with the class score of that class.

        # Get the class having maximum score, and the index of that class
        # Get rid of num_classes softmax scores
        # Add the class index and the class score of class having maximum score
        # NB: torch.max returns (values, indices), so despite the names,
        # max_conf holds the top class score and max_conf_score its index
        max_conf, max_conf_score = torch.max(image_pred[:, 5:5 + num_classes], 1)
        max_conf = max_conf.float().unsqueeze(1)
        max_conf_score = max_conf_score.float().unsqueeze(1)
        seq = (image_pred[:, :5], max_conf, max_conf_score)
        image_pred = torch.cat(seq, 1)
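
Here is a hedged illustration of the resulting 7-column layout, using a made-up single row and num_classes = 3 instead of 80.

import torch

image_pred = torch.tensor([[80., 70., 120., 130., 0.9,   # corners + objectness
                            0.1, 0.7, 0.2]])             # 3 class scores
max_conf, max_conf_score = torch.max(image_pred[:, 5:5 + 3], 1)
row = torch.cat((image_pred[:, :5],
                 max_conf.float().unsqueeze(1),
                 max_conf_score.float().unsqueeze(1)), 1)
print(row)   # tensor([[ 80., 70., 120., 130., 0.9000, 0.7000, 1.0000]])
# 7 columns: the best class score (0.7) and its index (1) sit at the end.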

Remember we set the bounding box rows whose object confidence is less than the threshold to zero? Let's get rid of them.

        # Get rid of the zero entries
        non_zero_ind = (torch.nonzero(image_pred[:, 4]))

        try:
            image_pred_ = image_pred[non_zero_ind.squeeze(), :].view(-1, 7)
        except:
            continue

The try-except block is there to handle situations where we get no detections. In that case, we use continue to skip the rest of the loop body for this image.

Now, let's get the detected classes in an image.

        # Get the various classes detected in the image
        img_classes = unique(image_pred_[:, -1])

Because there can be multiple "true" detections of the same class, we use a function called unique to get the classes present in any given image.

def unique(tensor):
    # Deduplicate via np.unique, then copy the result back
    # into a tensor of the same type as the input
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)

    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res
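
A quick usage sketch on a hypothetical class-index column (this assumes the import torch and import numpy as np lines at the top of util.py from the earlier parts):

import torch

classes = torch.tensor([0., 0., 16., 16., 16., 39.])
print(unique(classes))   # tensor([ 0., 16., 39.])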

Then, we perform NMS by category.

        # We will do NMS classwise
        for cls in img_classes:

Once inside the loop, the first thing we do is extract the detections of the particular class (denoted by the variable cls).

            # get the detections with one particular class
            cls_mask = image_pred_ * (image_pred_[:, -1] == cls).float().unsqueeze(1)
            class_mask_ind = torch.nonzero(cls_mask[:, -2]).squeeze()

            image_pred_class = image_pred_[class_mask_ind].view(-1, 7)

            # sort the detections such that the entry with the maximum objectness
            # confidence is at the top
            conf_sort_index = torch.sort(image_pred_class[:, 4], descending=True)[1]
            image_pred_class = image_pred_class[conf_sort_index]
            idx = image_pred_class.size(0)

Now, we perform NMS.

                # For each detection
                for i in range(idx):
                    # Get the IOUs of all boxes that come after the one we are looking at
                    # in the loop
                    try:
                        ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i + 1:])
                    except ValueError:
                        break

                    except IndexError:
                        break

                    # Zero out all the detections that have IoU > threshold
                    iou_mask = (ious < nms_conf).float().unsqueeze(1)
                    image_pred_class[i + 1:] *= iou_mask

                    # Remove the zero entries
                    non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
                    image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)

Here, we use the function bbox_iou. Its first input is the bounding box row indexed by the variable i in the loop. Its second input is a tensor containing several rows of bounding boxes. The output of bbox_iou is a tensor holding the IoU of the bounding box from the first input with each of the bounding boxes in the second input.

If we have two bounding boxes of the same class with an IoU greater than the threshold, the one with the lower class confidence is eliminated. We have already sorted the boxes so that the ones with higher confidence are at the top.

In the loop body, the following line gives the IoU of the box indexed by i with all the bounding boxes that have indices higher than i.

ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i + 1:])

At each iteration, any bounding box with an index greater than i whose IoU with the box indexed by i is larger than the threshold nms_conf is eliminated.

                    # Zero out all the detections that have IoU > threshold
                    iou_mask = (ious < nms_conf).float().unsqueeze(1)
                    image_pred_class[i + 1:] *= iou_mask

                    # Remove the zero entries
                    non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
                    image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)

Also notice that we have put the line of code that computes ious in a try-except block. This is because the loop is designed to run idx iterations (the number of rows in image_pred_class). However, as the loop proceeds, bounding boxes may be removed from image_pred_class. This means that if even one value is removed from image_pred_class, we cannot run for idx iterations. Hence, we might try to index a value that is out of bounds (IndexError), or the slice image_pred_class[i+1:] may return an empty tensor, assigning which triggers a ValueError. At that point, we can ascertain that NMS can remove no further bounding boxes, and we break out of the loop.
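
To make the masking step concrete, here is a toy trace under assumed values: suppose the boxes after index i have IoUs [0.80, 0.10, 0.55] with box i, and nms_conf is 0.4.

import torch

ious = torch.tensor([0.80, 0.10, 0.55])
iou_mask = (ious < 0.4).float().unsqueeze(1)
print(iou_mask)   # tensor([[0.], [1.], [0.]])
# Multiplying image_pred_class[i+1:] by this mask zeroes the first and third
# of those rows; torch.nonzero on the objectness column then drops them, so
# image_pred_class shrinks and a later index i can fall out of bounds --
# hence the try-except around the bbox_iou call.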

Calculating the IoU

def bbox_iou(box1, box2):
    """
    Returns the IoU of two bounding boxes


    """
    # Get the coordinates of bounding boxes
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    # get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    # Intersection area
    if torch.cuda.is_available():
        inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1, torch.zeros(inter_rect_x2.shape).cuda()) * torch.max(
            inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape).cuda())
    else:
        inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1, torch.zeros(inter_rect_x2.shape)) * torch.max(
            inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape))

    # Union Area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area)

    return iou
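
A hedged sanity check of bbox_iou on hypothetical corner-format boxes, run on a CPU-only setup (given the .cuda() branch above). Note the +1 pixel convention used in the area computations.

import torch

box1 = torch.tensor([[0., 0., 9., 9.]])
box2 = torch.tensor([[0., 0., 9., 9.],
                     [5., 5., 14., 14.]])
print(bbox_iou(box1, box2))
# Identical boxes -> IoU = 1.0
# Two 10 x 10 boxes overlapping in a 5 x 5 patch -> 25 / (100 + 100 - 25) = 0.1429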

Writing the predictions

The write_results function outputs a tensor of shape D x 8, where D is the number of "true" detections across all the images, each represented by a row. Each detection has 8 attributes: the index of the image in the batch to which the detection belongs, the 4 corner coordinates, the objectness score, the score of the class with the maximum confidence, and the index of that class.

As before, we do not initialize our output tensor until we have a detection to assign to it. Once it has been initialized, we concatenate subsequent detections to it. We use the write flag to indicate whether the tensor has been initialized or not. At the end of the loop over the classes, we add the resulting detections to the tensor output.

            batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)
            seq = batch_ind, image_pred_class
            if not write:
                output = torch.cat(seq, 1)
                write = True
            else:
                out = torch.cat(seq, 1)
                output = torch.cat((output, out))

At the end of the function, we check whether output has been initialized at all. If it hasn't, there hasn't been a single detection in any image of the batch. In that case, we return 0.

    try:
        return output
    except:
        return 0
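
Putting it together, a minimal usage sketch, assuming model is the Darknet object and inp the test input from the previous parts of the tutorial:

prediction = model(inp, torch.cuda.is_available())   # shape B x 10647 x 85
output = write_results(prediction, 0.5, 80, nms_conf=0.4)
if type(output) == int:   # write_results returned 0: no detections at all
    print("No detections were made")
else:
    print(output.shape)   # D x 8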

That's it for this part. At the end of it, we finally have our predictions in the form of a tensor that lists each detection as a row. All that remains now is to create an input pipeline that reads images from disk, computes the predictions, draws bounding boxes on the images, and then displays/writes those images. That is what we will do in the next part.
