PyTorch Deep Learning Practice (18) - Target Detection
0. Preface
Object detection is an important task in computer vision that aims to identify objects of specific categories in images or videos and determine their locations. Unlike image classification, which only needs to determine which category an entire image belongs to, object detection must also mark the bounding box of each object in the image. For example, in an autonomous driving scenario, it is not enough to detect whether a road image contains vehicles, sidewalks, and pedestrians; their locations in the image must also be determined. Object detection has a wide range of applications, including intelligent surveillance, autonomous driving, face recognition, object tracking, and image search. This section introduces the basics of object detection: using ybat to annotate object bounding boxes, using selective search to extract region proposals, and using Intersection over Union (IoU) and mean average precision (mAP) to measure the accuracy of bounding box predictions.
1. Target detection
1.1 Basic concepts
The purpose of object detection (Object Detection) is to find all the objects of interest in an image and determine their categories and locations. It is one of the core problems in computer vision. With the rise of applications such as self-driving cars, face detection, and intelligent video surveillance, faster and more accurate object detection systems are receiving more and more attention. These systems need not only to identify and classify the objects in an image, but also to locate each object by drawing an appropriate rectangular box around it. The output of object detection is therefore more complex than that of image classification. The difference between the two can be clearly seen in the following figure:
As can be seen in the above figure, image classification simply describes the categories of objects present in the image, object localization draws bounding boxes to mark the objects present in the image, and object detection requires both drawing a bounding box around each object in the image and determining the object's category.
1.2 Target detection application
Object detection can draw bounding boxes around various objects present in the image. It is widely used in various fields and provides important basic technical support for various visual tasks, including:
- Intelligent security: Target detection can be used in security monitoring systems to detect the presence of people, vehicles or other suspicious objects in real time for security alarms, behavior analysis, etc.
- Autonomous driving: Object detection is an important part of the autonomous driving system. It is used to identify and track vehicles, pedestrians, traffic lights, etc. on the road to help the vehicle make decisions and plan driving paths.
- Face recognition: Target detection can be used in face recognition systems to detect faces and determine their locations, and then extract features for identity verification or comparison.
- Object tracking: Target detection can realize real-time tracking of specific objects in the video, such as moving target detection, motion trajectory analysis, etc.
- Image search: Through target detection, objects in images can be identified and used for object recognition, similar picture recommendation and other functions in image search engines.
- Video analysis: Object detection can be applied to video analysis tasks, such as behavior recognition, activity monitoring, traffic flow statistics, etc.
- Medical imaging: Target detection can be used for disease diagnosis in medical imaging, such as tumor detection, organ segmentation, etc.
- Industrial quality inspection: Target detection can be used for quality control and defect detection in industrial production, such as product surface defect detection, product counting, etc.
1.3 Model training process
Typically, training an object detection model requires the following steps:
- Data collection and annotation: Collect image or video data containing target objects and annotate the location and category of each target object. Locations can be annotated with rectangular bounding boxes (bounding box), together with the category label of each target.
- Data preprocessing: Preprocess the collected data, including image resizing, color space conversion, and image enhancement. Data augmentation operations such as mirror flipping, rotation, and scaling can also be applied to expand the dataset and improve the generalization capability of the model.
- Build a model: Choose an appropriate target detection model architecture. Based on task requirements and data set characteristics, you can choose to use a pre-trained model as the basis and fine-tune it or customize the model structure.
- Define the loss function: According to the nature of the target detection task, define an appropriate loss function. Common loss functions include target classification loss and bounding box regression loss. By minimizing the loss function, the model parameters are optimized so that it can more accurately predict the location and category of the target object.
- Model training: Use the annotated dataset to train the object detection model, updating the model parameters through the backpropagation algorithm to continuously improve the model's performance. During training, appropriate hyperparameters such as the learning rate and batch size need to be selected to balance the convergence speed and the generalization ability of the model.
- Model evaluation and tuning: Use the validation set or test set to evaluate the trained model, and tune the model based on the evaluation results. Common evaluation metrics include accuracy, recall, precision, and mean average precision (mAP).
- Model deployment and application: After training and tuning, the model can be deployed to a production environment for use in actual object detection tasks. During deployment, factors such as performance optimization, hardware limitations, and real-time requirements may need to be considered.
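One detail of the preprocessing step above is easy to overlook: when an image is mirror-flipped for data augmentation, its bounding box annotations must be flipped as well. The following is a minimal sketch, assuming boxes are stored as (xmin, ymin, xmax, ymax) in pixel coordinates; the helper name hflip_box is hypothetical and not part of any library.

```python
def hflip_box(box, image_width):
    """Mirror a bounding box horizontally inside an image of the given width.

    Assumes box = (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    xmin, ymin, xmax, ymax = box
    # The old right edge becomes the new left edge, and vice versa;
    # the y coordinates are unaffected by a horizontal flip.
    return (image_width - xmax, ymin, image_width - xmin, ymax)


# A box near the left edge of a 100-pixel-wide image moves to the right edge.
flipped = hflip_box((10, 20, 30, 40), image_width=100)
print(flipped)
```

Flipping the same box twice returns the original coordinates, which is a convenient sanity check when writing augmentation code.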
2. Create a custom target detection data set
Object detection outputs the category of each object in the image together with its bounding box. To train a model, we must create input-output data, where the input is the image and the output is the bounding box around each object in the image and the category of that object. It should be noted that detecting bounding boxes requires the pixel positions of the four corners of the bounding box around each object in the image.
To train an object detection model, we need to label the bounding box coordinates of all objects in the image. In this section, we will learn how to create a training dataset that uses images as input and, as output, the object categories in each image and their corresponding bounding boxes stored in an XML file. We will use the ybat tool to annotate the bounding boxes and categories of objects in the image, and then examine the XML files that contain the annotated class and bounding box information.
2.1 Install the image annotation tool
ybat can be downloaded from GitHub as ybat-master.zip and unzipped. Open the unzipped folder and open ybat.html in a browser. You will see the initial blank page:
2.2 Dataset annotation
Before starting to create the annotations for the images, first store all possible categories in a classes.txt file, as follows:
Next, annotate the training dataset for the object detection model, including annotating bounding boxes around objects and assigning class labels to objects in the bounding boxes:
- Upload all images that need to be annotated
- Upload the classes.txt file
- Select the image to annotate, then draw a bounding box around each object that you want to label; be sure to select the correct category in the Classes area before drawing each bounding box
- Save the data in the desired format
For example, if you save the annotations in PascalVOC format, they are downloaded as a compressed .zip file containing XML files. After drawing rectangular bounding boxes, the content of an XML file is as follows:
As can be seen from the above image, the bndbox field contains the minimum and maximum values of the x and y coordinates of each object of interest in the image: the upper left corner (xmin and ymin, the x and y coordinates respectively) and the lower right corner (xmax and ymax, the x and y coordinates respectively). The name field gives the category label of the object in the bounding box, such as person or car. This information is used to train an object detection model to accurately identify and classify these objects.
Now that we've seen how to annotate objects in images (including class labels and bounding boxes), we'll delve into the key techniques for identifying objects in images. First, we will introduce region proposals (regions of an image that are most likely to contain objects).
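To make the annotation format described above concrete, here is a minimal sketch of reading the name and bndbox fields back out of a PascalVOC-style XML annotation with Python's standard library. The sample XML, its coordinate values, and the helper name parse_voc are illustrative assumptions, not the actual output of ybat for any particular image.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for illustration only; real files exported by an
# annotation tool typically contain additional fields (image size, etc.).
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>person</name>
    <bndbox><xmin>48</xmin><ymin>30</ymin><xmax>210</xmax><ymax>380</ymax></bndbox>
  </object>
  <object>
    <name>car</name>
    <bndbox><xmin>230</xmin><ymin>120</ymin><xmax>400</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are stored as text; go through float in case a tool
        # writes values such as "48.0".
        box = tuple(int(float(bndbox.findtext(tag)))
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, box))
    return objects


for label, box in parse_voc(SAMPLE_XML):
    print(label, box)
```

Each returned tuple pairs a class label with the upper left and lower right corners of its box, which is the input-output structure needed to train a detector.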
3. Regional Proposal
3.1 Basic concepts
Region proposal (Region Proposal) is an important technique in object detection, used to generate candidate regions that may contain target objects. Suppose that the objects we are interested in within an image are the sky and a person, that the pixel intensity of the background (sky) does not change much, and that the pixel intensity of the foreground (person) changes greatly. From this description we can conclude that the image contains two types of regions: person and sky. Within the person region, the pixels corresponding to the hair and the pixels corresponding to the face have different intensities, which means that a region can contain multiple subregions.
Region proposals can be used to identify regions of similar pixels. The position and size of these regions are not known in advance, so a region proposal algorithm is needed to propose possible positions and sizes for further processing. In object detection, region proposals help us identify the location of objects present in an image. Furthermore, region proposal algorithms facilitate object localization, that is, determining a bounding box that tightly fits the object in the image.
Region proposal algorithms are usually based on region segmentation and merging methods to segment the input image, identify areas where targets may exist, and then use these areas as input to subsequent target detection algorithms. Region proposal can speed up the object detection process because the object detection algorithm can only be executed in these regions that may contain objects instead of scanning the entire image, thereby increasing the detection speed. Next, we first look at how to generate region proposals from images.
3.2 Use SelectiveSearch to generate region proposals
Selective search (Selective Search) is a classic region proposal algorithm used to generate candidate regions that may contain target objects. It is based on the ideas of image segmentation and region merging: candidate regions are generated by gradually merging similar regions. Its basic idea is to generate image regions of different scales and sizes by hierarchically grouping the image. These regions are treated as candidate detection regions so that subsequent detectors can process them further. The Selective Search algorithm generates candidate regions through the following operations:
- Segment the image into regions, each consisting of similar color, texture, and structural features
- Use the segmentation results to form an initial set of candidate regions
- Calculate the similarity between candidate regions, merge the most similar candidate regions into larger regions, and update the candidate region set
- Merge regions multiple times based on similarities at different scales to obtain a more refined set of candidate regions
- Repeat the two merging steps above to obtain a reliable set of candidate regions
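The merging steps above can be illustrated with a toy sketch. It is not the selectivesearch library's implementation: regions are reduced to a bounding box, a pixel count, and a mean intensity, the similarity measure is simply the closeness of mean intensities, and adjacency between regions is ignored. All names (hierarchical_grouping, similarity, merge_boxes) and the example values are hypothetical.

```python
from itertools import combinations


def merge_boxes(a, b):
    """Smallest (x1, y1, x2, y2) box enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r1, r2):
    """Toy similarity: regions with closer mean intensity are more similar."""
    return -abs(r1["mean"] - r2["mean"])


def hierarchical_grouping(regions):
    """Greedily merge the most similar pair until a single region remains.

    Every initial and intermediate region is recorded as a proposal, which is
    how hierarchical grouping yields candidates at several scales.
    """
    regions = [dict(r) for r in regions]
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        # Pick the pair of regions with the highest similarity.
        i, j = max(combinations(range(len(regions)), 2),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        a, b = regions[i], regions[j]
        size = a["size"] + b["size"]
        merged = {
            "box": merge_boxes(a["box"], b["box"]),
            "size": size,
            # Size-weighted mean intensity of the merged region.
            "mean": (a["mean"] * a["size"] + b["mean"] * b["size"]) / size,
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])
    return proposals


initial = [
    {"box": (0, 0, 100, 40), "size": 4000, "mean": 200.0},  # sky
    {"box": (40, 40, 60, 50), "size": 200, "mean": 30.0},   # hair
    {"box": (40, 50, 60, 70), "size": 400, "mean": 90.0},   # face
]
proposals = hierarchical_grouping(initial)
print(proposals)
```

In this example the hair and face regions are merged first, producing a person-sized proposal, before everything is merged into one region covering the whole scene.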
The advantage of the Selective Search algorithm is that it generates high-quality candidate regions, is robust, and is suitable for a variety of object detection tasks. Next, we implement selective search in Python.
In Python, selectivesearch (Selective Search for Object Recognition) is a commonly used library that makes it easy to generate candidate regions with the selective search algorithm.
(1) Install the required libraries:
pip install selectivesearch
(2) Load images and required libraries:
import selectivesearch
from skimage.segmentation import felzenszwalb
import cv2
from matplotlib import pyplot as plt
import numpy as np
img_r = cv2.imread('4.jpeg')
img = cv2.cvtColor(img_r, cv2.COLOR_BGR2GRAY)
(3) Use felzenszwalb to segment the image based on its color, texture, size, and shape:
segments_fz = felzenszwalb(img, scale=200)
In the felzenszwalb method, scale controls the granularity of the segmentation: a higher value favors larger (and therefore fewer) segments, while a lower value produces smaller segments and preserves more of the original image detail.
(4) Draw the original image and the segmented image:
plt.figure(figsize=(10,10))
plt.subplot(121)
plt.imshow(cv2.cvtColor(img_r, cv2.COLOR_BGR2RGB))
plt.title('Original Image')
plt.subplot(122)
plt.imshow(segments_fz)
plt.title('Image post \nfelzenszwalb segmentation')
plt.show()
As can be seen from the above output, pixels belonging to the same group have similar pixel values in the segmentation result. Pixels with similar values form a region proposal. Region proposals help object detection because we can pass each region proposal to the network and predict whether it is background or a target object. Furthermore, if the region proposal is a target object, we can also predict the offsets needed to obtain a tight bounding box around the object, as well as the class of the content in the region proposal. Having understood the principle of the Selective Search algorithm, we now use the selective search function to obtain region proposals for a given image.
3.3 Generate region proposals
In this section, we use selective search to define the extract_candidates function, laying the foundation for training object detection models later.
(1) Define the extract_candidates() function for extracting region proposals from an image.
Take the image as the input parameter:
def extract_candidates(img):
Use the selective_search method provided by the selectivesearch library to obtain candidate regions in the image:
    img_lbl, regions = selectivesearch.selective_search(img, scale=200, min_size=2000)
Compute the image area and initialize a list (candidates) to store the candidate regions that pass the defined thresholds:
    img_area = np.prod(img.shape[:2])
    candidates = []
Keep only the regions that are at least 5% of the total image area and no larger than 100% of it as candidate regions, and return them:
    for r in regions:
        # r['rect'] is (x, y, w, h); convert it to a list so the duplicate
        # check compares like with like
        rect = list(r['rect'])
        if rect in candidates:
            continue
        if r['size'] < (0.05*img_area):
            continue
        if r['size'] > (1*img_area):
            continue
        candidates.append(rect)
    return candidates
(2) Load the image and extract its region proposals:
img = cv2.imread('4.jpeg')
candidates = extract_candidates(img)
(3) Extract candidate areas and visualize them on the image:
import matplotlib.patches as mpatches
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
for x, y, w, h in candidates:
    rect = mpatches.Rectangle(
        (x, y), w, h,
        fill=False,
        edgecolor='red',
        linewidth=1)
    ax.add_patch(rect)
plt.show()
The boxes in the figure above are the region proposals (candidate regions) obtained using the selective_search method.
Now that we have seen how to generate region proposals, let’s continue learning how to use region proposals for object detection and localization. If a region proposal overlaps heavily with the position of any target object in the image, it is marked as a proposal containing that object, while a region proposal with only a small overlap is marked as background. In the next section, we will introduce how to calculate the Intersection over Union of a candidate region with the ground-truth bounding box.
4. Intersection over Union
4.1 The concept of Intersection over Union
Intersection over Union (IoU) is a commonly used metric for evaluating the performance of object detection and image segmentation algorithms. It measures the degree of overlap between two regions. In object detection, IoU measures the similarity between the predicted box and the ground-truth box by calculating the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box.
In Intersection over Union, Intersection is the overlapping area of the predicted and ground-truth bounding boxes, while Union is their combined area. IoU is the ratio of the overlapping area between the two bounding boxes to the combined area of the two bounding boxes, as shown below:
In the above figure, the blue bounding box is the ground-truth bounding box and the red bounding box is the predicted bounding box. IoU is the ratio of the overlapping area to the combined area of the two bounding boxes. In the image below, you can observe how the IoU metric changes as the degree of overlap between bounding boxes changes:
From the image above, you can see that as the overlapping area decreases, so does the IoU value; when the two bounding boxes do not overlap, the IoU is 0. Having understood how IoU is calculated, we use Python to create a function that calculates IoU for use in object detection models.
4.2 Implement IoU calculation function
Define a function that takes two bounding boxes as input and returns their IoU as output.
(1) Define the get_iou() function, taking boxA and boxB as input, where boxA and boxB are two different bounding boxes given by their upper left and lower right corner coordinates (boxA can be regarded as the ground-truth bounding box and boxB as the region proposal):
def get_iou(boxA, boxB, epsilon=1e-5):
We define an additional epsilon parameter to handle the case in which the union of the two bounding boxes is 0, avoiding a division-by-zero error.
(2) Calculate the coordinates of the intersection box:
    x1 = max(boxA[0], boxB[0])
    y1 = max(boxA[1], boxB[1])
    x2 = min(boxA[2], boxB[2])
    y2 = min(boxA[3], boxB[3])
x1 stores the larger of the two leftmost x coordinates and y1 stores the larger of the two topmost y coordinates, while x2 and y2 store the smaller of the two rightmost x coordinates and the smaller of the two bottommost y coordinates respectively. Together they describe the intersection of the two bounding boxes.
(3) Calculate the width and height of the intersection (overlapping) area:
    width = (x2 - x1)
    height = (y2 - y1)
(4) Calculate the overlapping area (area_overlap):
    if (width<0) or (height <0):
        return 0.0
    area_overlap = width * height
In the above code, if the width or height of the overlapping area is less than 0, the two boxes do not intersect and the function returns an IoU of 0. Otherwise, the overlapping (intersection) area equals the width times the height of the intersection box.
(5) Calculate the combined area of the two bounding boxes:
    area_a = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    area_b = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    area_combined = area_a + area_b - area_overlap
In the above code, the areas of the two bounding boxes, area_a and area_b, are calculated first. When calculating area_combined, the overlapping area area_overlap is subtracted from their sum, because area_overlap is counted twice (once in area_a and once in area_b).
(6) Calculate the IoU and return it:
    iou = area_overlap / (area_combined+epsilon)
    return iou
In the above code, iou is calculated as the ratio of the overlapping area (area_overlap) to the combined area (area_combined) and returned.
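As a quick check of the steps above, the following self-contained sketch restates get_iou in compact form and applies it to a few hand-picked boxes. It also includes a small helper, xywh_to_xyxy (a hypothetical name), because the region proposals returned by selective search are in (x, y, w, h) form while get_iou expects corner coordinates (x1, y1, x2, y2). The box values are illustrative.

```python
def get_iou(boxA, boxB, epsilon=1e-5):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(boxA[0], boxB[0])
    y1 = max(boxA[1], boxB[1])
    x2 = min(boxA[2], boxB[2])
    y2 = min(boxA[3], boxB[3])
    width, height = x2 - x1, y2 - y1
    if width < 0 or height < 0:
        return 0.0  # the boxes do not intersect
    area_overlap = width * height
    area_a = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    area_b = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    area_combined = area_a + area_b - area_overlap
    return area_overlap / (area_combined + epsilon)


def xywh_to_xyxy(rect):
    """Convert (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = rect
    return (x, y, x + w, y + h)


ground_truth = (0, 0, 10, 10)
proposal = xywh_to_xyxy((5, 5, 10, 10))      # becomes (5, 5, 15, 15)

# Overlap is 5 * 5 = 25 and union is 100 + 100 - 25 = 175.
print(get_iou(ground_truth, proposal))        # close to 25 / 175
print(get_iou(ground_truth, (20, 20, 30, 30)))  # disjoint boxes
print(get_iou(ground_truth, ground_truth))    # identical boxes, close to 1
```

Because of epsilon, the value for identical boxes is marginally below 1 rather than exactly 1, which is harmless when IoU is only compared against a threshold.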
We have seen how to create a training dataset and how to calculate IoU. Next, we will learn about non-maximum suppression, which removes the redundant prediction boxes produced when a model predicts several possible bounding boxes around the same object and keeps only the most representative candidate boxes.
5. Non-maximum suppression
In object detection, multiple prediction boxes (such as region proposals) are often obtained, and these prediction boxes may overlap each other. For example, in the image below, multiple region proposals are generated around the people in the image:
Non-maximum suppression (NMS) identifies the bounding box that contains the target object among multiple candidate regions and discards the other bounding boxes. Non-maximum refers to the boxes that contain the target object but do not have the highest probability, while suppression refers to discarding those boxes. In non-maximum suppression, we identify the bounding box with the highest probability and discard all other bounding boxes whose IoU with that bounding box exceeds a certain threshold; these boxes have a lower probability and may not contain objects.
In PyTorch, non-maximum suppression can be performed using the nms function in the torchvision.ops module. The nms function identifies the bounding boxes to keep based on the bounding box coordinates, the confidence that an object is within each bounding box, and an IoU threshold. Using non-maximum suppression avoids excessive redundant detection results, improves detection efficiency, and reduces duplicate detections of the same object.
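The greedy procedure described above can be sketched in plain Python. This is an illustration of the idea rather than the torchvision implementation; the function names greedy_nms and box_iou and the example boxes are hypothetical. In practice, torchvision.ops.nms(boxes, scores, iou_threshold) performs the same kind of filtering on tensors and returns the indices of the boxes to keep.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.5))
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box is far away and is kept.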
6. Mean average precision
mAP (mean Average Precision) is a commonly used metric for evaluating the performance of object detection algorithms. It takes into account the precision for different categories, with the detection results ranked and thresholded. Next, we first explain precision, then introduce average precision, and finally explain how mAP is calculated.
Precision (Precision) is calculated as follows:
$$Precision=\frac{True\ Positive}{True\ Positive+False\ Positive}$$
A true positive (True Positive) is a predicted bounding box that predicts the correct target category and whose IoU with the ground-truth bounding box is greater than the given threshold. A false positive (False Positive) is a predicted bounding box that either predicts the wrong target category or whose IoU with the ground-truth bounding box is lower than the defined threshold. In addition, if there are multiple predicted bounding boxes for the same ground-truth bounding box, only one of them can be counted as a true positive, and the remaining bounding boxes are counted as false positives.
Average precision (Average Precision, AP) is the average of the precision values calculated at different IoU thresholds. mAP is the average of the precision values calculated at different IoU thresholds over all target categories in the dataset.
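The definitions above can be sketched in code as follows. This is a simplified illustration that follows the description in this section: predictions are matched greedily to ground-truth boxes in order of confidence, and precision is averaged over a few IoU thresholds. Benchmark implementations such as those for Pascal VOC and COCO additionally integrate the precision-recall curve, which this sketch does not attempt. All function names, thresholds, and example values are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predictions, ground_truths, iou_threshold):
    """predictions: (box, label, score) tuples; ground_truths: (box, label)."""
    matched = set()  # each ground-truth box may yield only one true positive
    tp = fp = 0
    for box, label, _ in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_label) in enumerate(ground_truths):
            if gt_label != label or idx in matched:
                continue
            value = box_iou(box, gt_box)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # wrong class, low overlap, or a duplicate detection
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_precision_over_thresholds(predictions, ground_truths,
                                   thresholds=(0.5, 0.75)):
    """Average the precision obtained at each IoU threshold."""
    values = [precision_at_iou(predictions, ground_truths, t) for t in thresholds]
    return sum(values) / len(values)


ground_truths = [((0, 0, 10, 10), "person"), ((20, 20, 40, 40), "car")]
predictions = [
    ((0, 0, 10, 10), "person", 0.9),   # IoU 1.0 with the person box
    ((25, 20, 45, 40), "car", 0.8),    # IoU 0.6 with the car box
]
print(precision_at_iou(predictions, ground_truths, 0.5))
print(precision_at_iou(predictions, ground_truths, 0.75))
print(mean_precision_over_thresholds(predictions, ground_truths))
```

The car prediction counts as a true positive at a threshold of 0.5 but as a false positive at 0.75, which is why stricter thresholds lower the precision.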
Summary
Object detection is an important task in computer vision that aims to accurately locate and identify target objects of interest in images or videos. The goal is to frame each target region in the input image and assign it the correct category label. Object detection is widely used in many application fields, including intelligent surveillance, autonomous driving, and face recognition. In this section, we learned how to prepare a training dataset using ybat, generate region proposals using the selectivesearch library, perform non-maximum suppression on a model's predictions, and measure model performance.