PyTorch Deep Learning Practice (18) - Object Detection

0. Preface

Object detection is an important task in the field of computer vision. It aims to identify objects of specific categories in images or videos and determine their locations. Unlike the image classification task, which only needs to determine which category the entire image belongs to, object detection must also mark the bounding box of each object in the image. For example, in an autonomous driving scenario, it is not only necessary to detect whether a road image contains vehicles, sidewalks, and pedestrians, but also to determine their locations in the image. Object detection has a wide range of applications, including intelligent surveillance, autonomous driving, face recognition, object tracking, image search, etc. In this section, the relevant basics of object detection will be introduced: using ybat to mark target object bounding boxes, using selective search to extract region proposals, and using Intersection over Union (IoU) and mean Average Precision (mAP) to measure the accuracy of bounding box predictions.

1. Object detection

1.1 Basic concepts

The purpose of object detection (Object Detection) is to find all targets (objects) of interest in an image and determine their categories and locations; it is one of the core problems in the field of computer vision. With the rise of applications such as self-driving cars, face detection, and intelligent video surveillance, people are paying more and more attention to faster and more accurate object detection systems. These systems not only need to identify and classify the objects in an image, but also need to locate each object by drawing an appropriate rectangular box around it. The output of object detection is therefore more complex than that of image classification. The difference between the two can be clearly seen in the following figure:

Target Detection

As can be seen in the above figure, image classification simply describes the categories of objects present in the image, object localization draws a bounding box to mark the object present in the image, and object detection requires drawing bounding boxes around every object in the image and determining the category of each object.

1.2 Object detection applications

Object detection can draw bounding boxes around various objects present in the image. It is widely used in various fields and provides important basic technical support for various visual tasks, including:

  • Intelligent security: Object detection can be used in security monitoring systems to detect the presence of people, vehicles, or other suspicious objects in real time, supporting security alarms, behavior analysis, etc.
  • Autonomous driving: Object detection is an important part of autonomous driving systems. It is used to identify and track vehicles, pedestrians, traffic lights, etc. on the road, helping the vehicle make decisions and plan driving paths.
  • Face recognition: Object detection can be used in face recognition systems to detect faces and determine their locations, after which features are extracted for identity verification or comparison.
  • Object tracking: Object detection enables real-time tracking of specific objects in video, such as moving target detection and motion trajectory analysis.
  • Image search: Through object detection, objects in images can be identified and used for object recognition, similar picture recommendation, and other functions in image search engines.
  • Video analysis: Object detection can be applied to video analysis tasks, such as behavior recognition, activity monitoring, and traffic flow statistics.
  • Medical imaging: Object detection can be used for disease diagnosis in medical imaging, such as tumor detection and organ segmentation.
  • Industrial quality inspection: Object detection can be used for quality control and defect detection in industrial production, such as product surface defect detection and product counting.

1.3 Model training process

Typically, training an object detection model requires the following steps:

  1. Data collection and annotation: Collect image or video data containing target objects and annotate it to mark the location and category of each target. The location is typically annotated with a rectangular bounding box, together with the category label of the target.
  2. Data preprocessing: Preprocess the collected data, including image resizing, color space conversion, image enhancement, etc. Data augmentation operations such as mirror flipping, rotation, and scaling can also be performed to expand the dataset and improve the generalization ability of the model.
  3. Build a model: Choose an appropriate object detection model architecture. Based on task requirements and dataset characteristics, you can use a pre-trained model as the basis and fine-tune it, or customize the model structure.
  4. Define the loss function: According to the nature of the object detection task, define an appropriate loss function. Common loss functions include the target classification loss and the bounding box regression loss. By minimizing the loss function, the model parameters are optimized so that the model can more accurately predict the location and category of the target object (a minimal sketch of such a combined loss follows this list).
  5. Model training: Use the annotated dataset to train the object detection model, updating the model parameters through the backpropagation algorithm to continuously optimize performance. During training, appropriate hyperparameters such as the learning rate and batch size need to be selected to balance the convergence speed and generalization ability of the model.
  6. Model evaluation and tuning: Use the validation set or test set to evaluate the trained model, and tune the model based on the evaluation results. Common evaluation metrics include accuracy, recall, precision, and mean Average Precision (mAP).
  7. Model deployment and application: After the model has been trained and tuned, it can be deployed to a production environment for use in real object detection tasks. During deployment, factors such as performance optimization, hardware limitations, and real-time requirements may need to be considered.
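To make steps 4 and 5 concrete, the following is a minimal sketch of a combined detection loss and one training step. Everything here (the SimpleDetectorHead class, the 512-dimensional feature size, the random stand-in data) is an illustrative assumption, not a specific model from this series:

import torch
import torch.nn as nn

# Minimal sketch: a detection head predicting class scores and box
# coordinates from pooled region features; real detectors are far more involved.
class SimpleDetectorHead(nn.Module):
    def __init__(self, in_features=512, num_classes=3):
        super().__init__()
        self.cls_head = nn.Linear(in_features, num_classes)  # class scores
        self.box_head = nn.Linear(in_features, 4)            # (x1, y1, x2, y2)

    def forward(self, feats):
        return self.cls_head(feats), self.box_head(feats)

cls_criterion = nn.CrossEntropyLoss()   # target classification loss
box_criterion = nn.SmoothL1Loss()       # bounding box regression loss

model = SimpleDetectorHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One hypothetical training step on random stand-in data
feats = torch.randn(8, 512)             # pooled region features (assumed)
labels = torch.randint(0, 3, (8,))      # class targets
boxes = torch.randn(8, 4)               # box regression targets

cls_scores, box_preds = model(feats)
loss = cls_criterion(cls_scores, labels) + box_criterion(box_preds, boxes)
optimizer.zero_grad()
loss.backward()
optimizer.step()

In practice, the two losses are computed over many region proposals per image rather than a single feature vector per region, but the combined classification-plus-regression objective is the same.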

2. Create a custom object detection dataset

Object detection outputs the category of each object in the image together with its bounding box. In order to train the model, we must create input-output pairs, where the input is the image and the output comprises the bounding boxes around the objects in the given image and the categories corresponding to those objects. Note that a bounding box is specified by the pixel coordinates of its corners.
In order to train an object detection model, we need to label the bounding box coordinates of all objects in the images. In this section, we will learn how to create a training dataset, using images as input and XML files that store the object categories and their corresponding bounding boxes as output. We will use the ybat tool to annotate the bounding boxes and corresponding object categories in the images, and then examine the resulting XML files containing the annotated class and bounding box information.

2.1 Install the image annotation tool

ybat (ybat-master.zip) can be downloaded from GitHub and unzipped. Open the decompressed folder and open ybat.html in a browser; you will see an initially blank page:

Data set annotation

2.2 Dataset annotation

Before starting to create the annotation data corresponding to the images, you first need to store all possible categories in a classes.txt file, as follows:
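For example, for a two-class task, classes.txt simply lists the class names, assuming the usual one-class-per-line layout (the class names below are illustrative):

person
car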

classes.txt
Next, annotate the training dataset for the object detection model, including annotating bounding boxes around objects and assigning class labels to objects in the bounding boxes:

  • Upload all images that need to be annotated
  • Upload the classes.txt file
  • Select the image file to annotate, then draw a bounding box around each object you want to label; be sure to select the correct category for the object in the Classes area before drawing its bounding box.
  • Save data in desired format

Data annotation

For example, if you save in the Pascal VOC format, an XML file will be downloaded as a compressed (.zip) archive; the XML content after drawing a rectangular bounding box is as follows:

XML file content
As can be seen from the above image, the bndbox field contains the minimum and maximum x and y coordinates of the object of interest in the image: xmin and ymin correspond to the x and y coordinates of the upper-left corner, and xmax and ymax to the x and y coordinates of the lower-right corner. Through the name field, we can extract the category label corresponding to the object in the bounding box, such as "person" or "car". This information will be used to train an object detection model to accurately identify and classify these objects. Now that we have seen how to annotate objects in images (including class labels and bounding boxes), we will delve into the key techniques for identifying objects in images, starting with region proposals (regions of an image that are most likely to contain objects).
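For reference, a minimal Pascal VOC annotation has roughly the following structure (the filename, image size, and coordinate values below are illustrative, not taken from the figure above):

<annotation>
    <filename>4.jpeg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>person</name>
        <bndbox>
            <xmin>120</xmin>
            <ymin>60</ymin>
            <xmax>310</xmax>
            <ymax>420</ymax>
        </bndbox>
    </object>
</annotation>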

3. Region proposals

3.1 Basic concepts

Region proposal (Region Proposal) is an important technique in object detection, used to generate candidate regions that may contain target objects. Assume that in an image the objects we are interested in include the sky and people, and that the pixel intensity of the background (sky) does not change much, while the pixel intensity of the foreground (person) changes greatly. From this description we can conclude that the image contains two types of regions: person and sky. Furthermore, within the person region, the pixels corresponding to the hair and the pixels corresponding to the face have different intensities, which means that multiple subregions can exist within a single region.
Region proposal algorithms identify regions of similar pixels. The position and size of these regions are usually uncertain, so a region proposal algorithm is needed to propose possible positions and sizes for further processing. Region proposals help us identify the locations of objects present in an image. Furthermore, region proposal algorithms facilitate target localization, i.e., determining a bounding box that tightly fits the object in the image.
Region proposal algorithms are usually based on region segmentation and merging: they segment the input image, identify areas where targets may exist, and then pass these areas as input to subsequent detection algorithms. Region proposals can also speed up detection, because the detection algorithm only needs to be executed on the regions that may contain objects instead of scanning the entire image. Next, we look at how to generate region proposals from images.

3.2 Use SelectiveSearch to generate region proposals

Selective search (Selective Search) is a classic region proposal algorithm used to generate candidate regions that may contain target objects. Based on the ideas of image segmentation and region merging, it generates candidate regions by gradually merging similar regions. Its basic idea is to hierarchically group image regions of different scales and sizes; these regions are treated as candidate detection regions so that subsequent detectors can process them further. The Selective Search algorithm generates candidate regions through the following operations:

  1. Segment an image into regions, each region consisting of similar color, texture, and structural features
  2. Use image segmentation results to form an initial set of candidate regions
  3. Calculate the similarity between candidate areas, merge the candidate areas with the highest similarity into larger areas, and update the candidate area set
  4. Merge regions repeatedly based on similarities at different scales to obtain a more refined set of candidate regions.
  5. Repeat steps 3 and 4 to obtain a set of non-overlapping and reliable candidate regions.

The advantage of the SelectiveSearch algorithm is that it generates high-quality candidate regions, is robust, and is suitable for a variety of object detection tasks. Next, we implement the selective search process using Python.
In Python, selectivesearch (Selective Search for Object Recognition) is a commonly used library that makes it easy to generate candidate regions with the selective search algorithm.

(1) Install the required libraries:

pip install selectivesearch

(2) Load images and required libraries:

import selectivesearch
from skimage.segmentation import felzenszwalb
import cv2
from matplotlib import pyplot as plt
import numpy as np

img_r = cv2.imread('4.jpeg')
img = cv2.cvtColor(img_r, cv2.COLOR_BGR2GRAY)

(3) Use felzenszwalb to extract segments from the image based on color, texture, size, and shape:

segments_fz = felzenszwalb(img, scale=200)

In the felzenszwalb method, the scale parameter controls the granularity of the segmentation: higher values of scale produce fewer, larger segments, while lower values yield a finer segmentation with more regions.
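As a quick check of this behavior (a small experiment, not part of the original workflow), we can count the segments produced at two different scale values:

# The larger scale value should yield fewer, larger segments
for s in (100, 400):
    seg = felzenszwalb(img, scale=s)
    print(f'scale={s}: {len(np.unique(seg))} segments')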

(4) Draw the original image and the segmented image:

plt.figure(figsize=(10,10))
plt.subplot(121)
plt.imshow(cv2.cvtColor(img_r, cv2.COLOR_BGR2RGB))
plt.title('Original Image')
plt.subplot(122)
plt.imshow(segments_fz)
plt.title('Image post \nfelzenszwalb segmentation')
plt.show()

selective search

As can be seen from the above output, pixels belonging to the same group have similar values in the segmentation result map, and pixels with similar values form a region proposal. Region proposals help object detection because we can pass each region proposal to a network and predict whether it is background or a target object. Furthermore, if a region proposal contains a target object, it can be used to predict the offsets needed to obtain a tight bounding box as well as the class corresponding to its content. Having understood the principle of the SelectiveSearch algorithm, we now use the selective search function to obtain region proposals for a given image.

3.3 Generate region proposals

In this section, we will use selective search to define the extract_candidates function, laying the foundation for subsequent object detection model training.

(1) Define the extract_candidates() function for extracting region proposals from an image.

It takes an image as its input parameter:

def extract_candidates(img):

Use the selective_search method provided by the selectivesearch library to obtain candidate regions in the image:

    img_lbl, regions = selectivesearch.selective_search(img, scale=200, min_size=2000)

Compute the image area and initialize a list to store the candidate regions that pass a defined threshold:

    img_area = np.prod(img.shape[:2])
    candidates = []

Keep only regions whose size exceeds 5% of the total image area and does not exceed 100% of the image area, and return them as candidate regions:

    for r in regions:
        if list(r['rect']) in candidates:   # skip duplicate proposals
            continue
        if r['size'] < (0.05*img_area):
            continue
        if r['size'] > (1*img_area):
            continue
        candidates.append(list(r['rect']))
    return candidates

(2) Load the image and extract the candidate regions:

img = cv2.imread('4.jpeg')
candidates = extract_candidates(img)

(3) Visualize the extracted candidate regions on the image:

import matplotlib.patches as mpatches
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
for x, y, w, h in candidates:
    rect = mpatches.Rectangle(
            (x, y), w, h,
            fill=False,
            edgecolor='red',
            linewidth=1)
    ax.add_patch(rect)
plt.show()

area proposal
The rectangles in the figure above represent the region proposals (candidate regions) obtained using the selective_search method.
Now that we have seen how to generate region proposals, let's continue with how to use them for object detection and localization. If a region proposal has a high overlap with any ground-truth object location in the image, it will be marked as a proposal containing that object, while a region proposal with a small overlap will be marked as background. In the next section, we introduce how to calculate the overlap of a candidate region with the ground-truth bounding box.

4. Intersection over Union

4.1 The concept of IoU

Intersection over Union (IoU) is a commonly used metric to evaluate the performance of object detection and image segmentation algorithms; it measures the degree of overlap between two regions. In object detection, IoU measures the similarity between the predicted box and the ground-truth box by computing the ratio of the intersection to the union of the two bounding boxes.
The Intersection is the overlapping area of the predicted and ground-truth bounding boxes, while the Union is their combined area. IoU is the ratio of the overlapping area between the two bounding boxes to their combined area.
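Expressed as a formula, with $A$ and $B$ denoting the two bounding boxes:

$IoU = \frac{|A \cap B|}{|A \cup B|} = \frac{Area\ of\ Overlap}{Area\ of\ Union}$

This is illustrated in the figure below: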

cross-over ratio

In the above figure, the blue bounding box is the ground-truth bounding box and the red bounding box is the predicted bounding box. IoU is the ratio of the overlapping area to the combined area of the two bounding boxes. In the image below, you can observe how the IoU metric changes as the degree of overlap between the bounding boxes changes:

IoU value changes

From the image above, you can see that as the overlapping area decreases, so does the IoU value; when the two bounding boxes do not overlap, the IoU value is 0. Having understood how IoU is calculated, we now use Python to create a function that computes IoU for use in the object detection model.

4.2 Implement IoU calculation function

Define a function that takes two bounding boxes as input and returns their IoU as output.

(1) Define the get_iou() function, taking boxA and boxB as input, where boxA and boxB are two different bounding boxes (boxA can be regarded as the ground-truth bounding box and boxB as the region proposal):

def get_iou(boxA, boxB, epsilon=1e-5):

We define an additional epsilon parameter to handle the case when the union of the two bounding boxes is 0, avoiding division-by-zero errors.

(2) Calculate the coordinates of the intersection box:

    x1 = max(boxA[0], boxB[0])
    y1 = max(boxA[1], boxB[1])
    x2 = min(boxA[2], boxB[2])
    y2 = min(boxA[3], boxB[3])

x1 stores the maximum of the leftmost x coordinates of the two bounding boxes, and y1 stores the maximum of the topmost y coordinates; x2 and y2 store the minimum of the rightmost x coordinates and the bottommost y coordinates, respectively. Together they define the intersection of the two bounding boxes.

(3) Calculate the width and height corresponding to the intersection area (overlapping area):

    width = (x2 - x1)
    height = (y2 - y1)

(4) Calculate the overlapping area ( area_overlap):

    if (width<0) or (height <0):
        return 0.0
    area_overlap = width * height

In the above code, if the width or height of the overlapping region is negative, the intersection area is 0. Otherwise, the overlapping (intersection) area equals the width times the height of the intersection region.

(5) Calculate the combined area corresponding to the two bounding boxes:

    area_a = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    area_b = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    area_combined = area_a + area_b - area_overlap

In the above code, the combined area of the two bounding boxes is calculated as area_a + area_b minus the overlapping area area_overlap, because the overlapping region is counted twice (once in area_a and once in area_b).

(6) Calculate IoUand return:

    iou = area_overlap / (area_combined+epsilon)
    return iou

In the above code, iou is calculated as the ratio of the overlapping area (area_overlap) to the combined area (area_combined) and returned.
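A quick sanity check of get_iou (the boxes below are made-up values in (x1, y1, x2, y2) format):

boxA = [100, 100, 200, 200]      # ground-truth box
boxB = [150, 150, 250, 250]      # predicted box / region proposal
print(get_iou(boxA, boxB))       # overlap 50*50=2500, union 17500 -> ~0.143
print(get_iou(boxA, boxA))       # identical boxes -> ~1.0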
We have now seen how to create a training dataset and how to compute IoU. Next, we will learn about non-maximum suppression, which helps remove the redundant prediction boxes the model produces around an object and keep only the most representative candidate boxes.

5. Non-maximum suppression

In object detection, multiple prediction boxes (such as region proposals) are often obtained, and these prediction boxes may overlap each other. For example, in the image below, multiple region proposals are generated around the people in the image:

area proposal

Non-maximum suppression (non-maximum suppression, NMS) identifies the bounding box most likely to contain the target object among multiple candidate regions and discards the other bounding boxes. "Non-maximum" refers to the boxes that do not have the highest probability of containing the object, and "suppression" refers to discarding those boxes. In non-maximum suppression, we identify the bounding box with the highest probability and discard all other bounding boxes whose IoU with that box exceeds a certain threshold, since they have lower probabilities of containing an object.
In PyTorch, non-maximum suppression can be performed using the nms function in the torchvision.ops module. The nms function decides which bounding boxes to keep based on the bounding box coordinates, the confidence that an object is within each bounding box, and an IoU threshold. Using non-maximum suppression avoids excessive redundant detections, improves detection efficiency, and also reduces false and missed detections.
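A minimal example of calling nms (the boxes and scores below are made up; nms returns the indices of the boxes to keep, sorted by decreasing score):

import torch
from torchvision.ops import nms

boxes = torch.tensor([[100., 100., 200., 200.],
                      [105., 105., 205., 205.],   # heavily overlaps the first box
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.9, 0.8, 0.75])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)   # tensor([0, 2]): the second box is suppressed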

6. Mean Average Precision

mAP (mean Average Precision) is a commonly used metric for evaluating the performance of object detection algorithms. It takes into account the precision across different categories, detection-result rankings, and thresholds. Next, we first explain precision, then introduce average precision, and finally explain how mAP is calculated.
The formula for precision (Precision) is as follows:
$Precision = \frac{True\ Positive}{True\ Positive + False\ Positive}$
A true positive (True Positive) means that the predicted bounding box correctly predicts the target category and its IoU with the ground-truth bounding box is greater than a given threshold; a false positive (False Positive) means that the predicted bounding box predicts the wrong target category, or that its IoU with the ground-truth bounding box is below the defined threshold. For example, if 8 out of 10 predicted boxes are true positives and 2 are false positives, the precision is 8 / (8 + 2) = 0.8. In addition, if there are multiple predicted bounding boxes for the same ground-truth bounding box, only one of them can be counted as a true positive; the rest are classified as false positives.
Average Precision (AP) is the average of the precision values calculated at different IoU thresholds. mAP is the mean of the AP values over all target categories in the dataset.
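Following this simplified definition, a sketch of the computation might look as follows (the per-class precision values are hypothetical stand-ins, not the output of a real evaluation pipeline):

import numpy as np

# Hypothetical precision per class, computed at IoU thresholds 0.50, 0.55, ..., 0.95
per_class_precisions = {
    'person': [0.90, 0.88, 0.85, 0.80, 0.72, 0.65, 0.55, 0.40, 0.25, 0.10],
    'car':    [0.95, 0.93, 0.90, 0.86, 0.80, 0.71, 0.60, 0.45, 0.30, 0.12],
}

ap_per_class = {c: np.mean(p) for c, p in per_class_precisions.items()}  # AP per class
map_score = np.mean(list(ap_per_class.values()))                         # mAP over all classes
print(ap_per_class, map_score)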

Summary

Object detection is an important task in the field of computer vision. It aims to accurately locate and identify target objects of interest in images or videos: the goal is to draw a box around each target region in the input image and provide the correct category label for each target. Object detection is widely used in many application fields, including intelligent monitoring, autonomous driving, face recognition, etc. In this section, you learned how to prepare a training dataset using ybat, implement the region proposal algorithm using the selectivesearch library, perform non-maximum suppression on the model's predictions, and measure the accuracy of bounding box predictions.

Series links

PyTorch Deep Learning Practice (1) - Detailed explanation of the neural network and model training process
PyTorch Deep Learning Practice (2) - PyTorch basics
PyTorch Deep Learning Practice (3) - Use PyTorch to build neural networks
PyTorch Deep Learning Practice (4) - Detailed explanation of commonly used activation functions and loss functions
PyTorch deep learning practice (5) - Computer vision basics
PyTorch deep learning practice (6) - Neural network performance optimization technology
PyTorch deep learning practice (7) - The impact of batch size on neural network training
PyTorch Deep Learning Practice (8) - Batch Normalization
PyTorch Deep Learning Practice (9) - Learning Rate Optimization
PyTorch Deep Learning Practice (10) - Overfitting and its solution
PyTorch Deep Learning Practice (11) - Convolutional Neural Network
PyTorch Deep Learning Practice (12) - Data Enhancement
PyTorch Deep Learning Practice (13) - Visualizing the output of the middle layer of the neural network
PyTorch Deep Learning Practice (14) - Class Activation Map
PyTorch Deep Learning Practice (15) - Transfer learning
PyTorch deep learning practice (16) - Facial key point detection
PyTorch deep learning practice (17) - Multi-task learning


Origin blog.csdn.net/LOVEmy134611/article/details/133398319