[Object Detection] YOLO Series - YOLOv1 Detailed Explanation

This article is a study note on the object detection algorithm YOLOv1. Most blogs online either translate and excerpt the key parts of the paper with the author's own commentary, or summarize the paper's general ideas; if you have not read the original text, some points may always remain unclear. It is therefore strongly recommended to read this blog alongside the original paper.
Original link: You Only Look Once: Unified, Real-Time Object Detection

1. Innovations and advantages of YOLOv1

YOLOv1 was published at CVPR 2016. Compared with the leading object detection algorithms of the time (such as R-CNN and DPM), YOLO offers the following innovations and advantages:

1. Main innovations:

  • Unlike other object detection algorithms (such as R-CNN) that split detection into region proposal generation plus classification, YOLO treats object detection as a single regression problem.
  • Using a single network, one forward pass over an entire input image yields the detection boxes and classes of all objects in the image. The network can also be trained and optimized directly end-to-end.

2. Algorithm advantages:

  • The primary advantage of YOLO is that it is fast while maintaining high accuracy. The standard model runs in real time at 45 fps on a Titan X GPU. YOLO also provides a fast version (Fast YOLO) that reaches 155 fps while still achieving roughly double the mAP of other real-time detection systems.
  • YOLO uses context information from the whole image, so it makes fewer background errors (mistakenly detecting an object in a background region that contains none) than Fast R-CNN.
  • It has better generalization ability and versatility: when facing new domains or unexpected inputs, its performance is more stable than R-CNN's.

2. Object detection logic and model output representation

The basic process of YOLO object detection is as follows:
Figure 1: Detection process
a. Resize the input image to 448*448 (YOLOv1 only accepts fixed-size input);
b. Perform feature extraction and prediction with a single CNN;
c. Filter the detection boxes by setting a confidence threshold.

1. Object detection logic (the key and difficult part)

Figure 2: Detection logic
YOLO first divides the input image into an S*S grid:
(1) Each grid cell predicts B bounding boxes (Bbox) and a confidence value for each bounding box. Confidence reflects both how sure the model is that an object lies in the corresponding bounding box and how accurate the predicted box position and size are.
Each bounding box corresponds to 5 parameters: x, y, w, h, confidence. Among them,

  • (x, y) are the coordinates of the bounding box center relative to its grid cell;
  • w, h are the width and height of the bounding box relative to the entire image (meaning the box center must lie inside the grid cell, but the box width and height are not limited by the cell and may exceed the cell size);
  • confidence is defined as follows: when there is no object in the bounding box, confidence should equal 0; when there is an object, confidence should equal the IoU between this bounding box and the object's ground-truth box. This can be expressed as:

    $$\text{confidence} = \Pr(\text{Object}) \cdot \text{IoU}_{pred}^{truth}$$

    [Note: the phrase "an object exists in the bounding box" here means that the object's center falls into the grid cell that produced the box. For example, in Figure 2 above, the center of the "dog" falls into the cell at row 5, column 2, so only the B bounding boxes generated by that cell can be matched to the "dog". In the end, YOLO assigns whichever of those B boxes has the highest IoU with the "dog"'s ground-truth box to be responsible for detecting it; the paper calls this specialization of the bounding box predictors.]

(2) Each grid cell also predicts C conditional class probabilities, i.e., given that an object's center "falls" into the cell, the probability distribution of that object over the C classes. A minimal encoding sketch of this grid logic follows below.
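To make the grid assignment and coordinate normalization concrete, here is a minimal Python sketch (my own illustration under the paper's S=7 setting, not the authors' code) that encodes a ground-truth box into its responsible grid cell and normalized targets:

```python
S = 7  # grid size from the paper

def encode_box(cx, cy, w, h, img_w, img_h, S=S):
    """Encode an absolute box (center cx, cy and size w, h, in pixels)
    into its responsible grid cell and normalized (x, y, w, h)."""
    col = int(cx / img_w * S)        # grid column the object's center falls into
    row = int(cy / img_h * S)        # grid row
    x = cx / img_w * S - col         # center offset within the cell, in [0, 1]
    y = cy / img_h * S - row
    w_n, h_n = w / img_w, h / img_h  # size relative to the whole image, in [0, 1]
    return row, col, (x, y, w_n, h_n)

# Example: an object centered at (150, 320) in a 448*448 image falls into
# the cell at row 5, column 2 (like the "dog" in Figure 2).
print(encode_box(150, 320, 200, 180, 448, 448))
```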

2. Model output representation

According to the detection logic above, the model's final output is a tensor of size S*S*(B*5+C). The YOLOv1 paper chooses S=7 and B=2; since the PASCAL VOC dataset used has 20 class labels, C=20, so the model's final output is a 7*7*30 tensor.
The Graphical YOLO article visualizes this output tensor, which helps with understanding.
Figure: visualization of the 7*7*30 output tensor
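As a concrete illustration, the following sketch (my own helper, not code from the paper) decodes such a tensor into absolute boxes and class-specific confidence scores, where the class-specific score is Pr(Class_i | Object) * confidence as defined in the paper:

```python
import numpy as np

S, B, C = 7, 2, 20  # paper settings for PASCAL VOC

def decode(output, img_w=448, img_h=448):
    """output: array of shape (S, S, B*5 + C); returns one entry per box."""
    detections = []
    cell_w, cell_h = img_w / S, img_h / S
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]        # C conditional class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                scores = class_probs * conf   # class-specific confidence
                cx = (col + x) * cell_w       # absolute box center
                cy = (row + y) * cell_h
                detections.append((cx, cy, w * img_w, h * img_h, scores))
    return detections

dets = decode(np.random.rand(S, S, B * 5 + C))
print(len(dets))  # 98 boxes, matching S*S*B = 7*7*2
```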

3. Network design and training

The YOLO network borrows from the GoogLeNet image classification model (i.e., Inception v1). The difference is that YOLO does not use Inception modules; instead it simply uses 1*1 reduction layers followed by 3*3 convolutional layers. The standard version of YOLO has a total of 24 convolutional layers followed by 2 fully connected layers (the network structure is shown below; a simplified sketch follows the figure). Fast YOLO has only 9 convolutional layers; its other parameters are the same as the standard version's.
network structure
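For intuition, here is a heavily simplified PyTorch sketch of this design pattern (1*1 reduction layers followed by 3*3 convolutions, leaky ReLU everywhere except the final layer). The depth and channel counts are placeholders, not the paper's exact 24-layer configuration:

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, s=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
        nn.LeakyReLU(0.1),  # leaky ReLU used by all layers except the last FC
    )

class TinyYOLOv1(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.features = nn.Sequential(
            conv(3, 64, 7, s=2), nn.MaxPool2d(2),
            conv(64, 192, 3),    nn.MaxPool2d(2),
            conv(192, 128, 1),   # 1*1 "reduction" layer ...
            conv(128, 256, 3),   # ... followed by a 3*3 conv, as in the paper
            nn.AdaptiveAvgPool2d(S),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * S * S, 1024), nn.LeakyReLU(0.1), nn.Dropout(0.5),
            nn.Linear(1024, S * S * (B * 5 + C)),  # linear activation at the end
        )
        self.S, self.B, self.C = S, B, C

    def forward(self, x):
        out = self.head(self.features(x))
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)

print(TinyYOLOv1()(torch.zeros(1, 3, 448, 448)).shape)  # (1, 7, 7, 30)
```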
The relevant settings for model training are as follows:

  1. Pre-training: the model is first pre-trained on the ImageNet classification dataset. The network used for pre-training is not the complete structure shown above, but only the first 20 convolutional layers + 1 average pooling layer + 1 fully connected layer, and the pre-training input size is 224*224.

  2. Model fine-tuning: after pre-training, the image classification model is fine-tuned for object detection. Specifically: (1) 4 convolutional layers and 2 fully connected layers are added (my understanding is that this replaces the last fully connected layer of the pre-trained model), with the newly added layers randomly initialized; (2) to extract finer visual information, the input resolution is increased from 224*224 to 448*448.

  3. Normalization: w and h in the output are normalized to the [0,1] interval by dividing by the image width and height, respectively; the coordinates (x, y) are likewise normalized to [0,1] by expressing them as offsets within their grid cell (as in the encoding sketch in Section 2).

  4. Activation function: the final fully connected layer uses a linear activation function; all other layers use the leaky ReLU activation:

    $$\phi(x)=\begin{cases}x, & \text{if } x>0\\ 0.1x, & \text{otherwise}\end{cases}$$

  5. Loss function: YOLO uses sum-squared error as the loss function because it is easy to optimize. However, it does not align perfectly with the goal of maximizing mean average precision (mAP), so it is adjusted for the following problems (a simplified loss sketch appears after this list):

    1. Position error and classification error should not contribute equally to the loss, so $\lambda_{coord} = 5$ is introduced to up-weight the coordinate loss.
    2. Many grid cells in each image contain no object at all (i.e., no object center falls into them), which pushes the confidence values of the bounding boxes in most cells toward 0 and, in effect, drowns out the gradient contribution of the cells that do contain objects. YOLO introduces $\lambda_{noobj} = 0.5$ to correct for this effect.
    3. The same positional deviation has far less impact on the IoU of a large object than on a small one, so YOLO takes the square root of the object's size terms (w, h) to balance the influence of the two.

    The expression of the final corrected loss function is:

    $$
    \begin{aligned}
    \mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
    &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
    &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
    &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
    \end{aligned}
    $$

    where $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th box predictor in cell $i$ is responsible for an object, and $\mathbb{1}_{i}^{obj}$ indicates that an object appears in cell $i$.

  6. Training hyperparameters

    • num_epochs (number of training epochs): about 135
    • batch_size = 64
    • Optimizer: SGD with momentum = 0.9 and weight decay = 0.0005
    • Learning rate: warmed up at the start of training, then gradually decayed as training progresses
  7. Regularization (suppresses overfitting)

    • dropout(rate=0.5)
    • Data augmentation with random scaling and translation
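Following the loss corrections described in item 5, here is a simplified PyTorch sketch of the weighted loss (my own illustration; it assumes the responsibility masks and targets were produced by an encoding step like the one in Section 2):

```python
import torch

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights from the paper

def yolo_loss(pred_box, true_box, pred_conf, true_conf,
              pred_cls, true_cls, obj_mask, cell_obj_mask):
    """All tensors are batched over cells/boxes; masks are 0/1 floats.
    pred_box/true_box: (..., 4) tensors holding (x, y, w, h) in [0, 1]."""
    # 1) coordinate loss, only for "responsible" boxes, up-weighted by lambda_coord
    xy_loss = ((pred_box[..., :2] - true_box[..., :2]) ** 2).sum(-1)
    # square root of w, h balances large vs. small objects; clamp guards
    # against negative raw predictions from the linear output layer
    wh_loss = ((pred_box[..., 2:].clamp(min=0).sqrt()
                - true_box[..., 2:].sqrt()) ** 2).sum(-1)
    coord = LAMBDA_COORD * (obj_mask * (xy_loss + wh_loss)).sum()
    # 2) confidence loss: full weight with an object, lambda_noobj without
    conf_err = (pred_conf - true_conf) ** 2
    conf = (obj_mask * conf_err).sum() + LAMBDA_NOOBJ * ((1 - obj_mask) * conf_err).sum()
    # 3) classification loss, only for cells that contain an object
    cls = (cell_obj_mask * ((pred_cls - true_cls) ** 2).sum(-1)).sum()
    return coord + conf + cls
```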

4. Model inference

Inference uses the same network as training: a single forward pass of one model over an image is all that is needed. For the PASCAL VOC dataset, the model predicts 98 detection boxes (S*S*B = 7*7*2 = 98).
YOLO detects via the grid division described above. In most cases the model can clearly determine which cell an object's center "falls into", and only one bounding box is used to predict that object's position. However, some relatively large objects, or objects lying near the boundary of multiple cells, may be localized well by several cells at once. In that case non-maximum suppression (NMS) is applied: among heavily overlapping boxes, only the one with the highest confidence is kept (a sketch follows below).
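A minimal greedy NMS sketch (my own illustration of the standard procedure, not YOLO-specific code) looks like this:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) corner coordinates; scores: (N,) confidence values."""
    order = np.argsort(scores)[::-1]   # process highest-confidence boxes first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        # drop any remaining box that overlaps the kept one too heavily
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```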

5. Defects of the YOLOv1 algorithm

This article does not cover the experimental results of YOLOv1; interested readers can consult the original paper directly. The paper itself mentions several defects of the YOLOv1 algorithm:

  • Since each grid cell can only predict 2 bounding boxes and 1 class, the number of nearby objects that can be detected is limited; this especially hurts detection of small objects that appear in groups, such as flocks of birds.
  • Since the model learns bounding boxes from data, it generalizes well to common data but struggles with objects in new or unusual aspect ratios or configurations, i.e., its generalization ability is limited.
  • The loss function design is not entirely reasonable: localization error is the main source of error, in particular because deviations on large and small objects are treated the same.

References

  1. Detailed Explanation of YOLO - Zhihu
  2. Graphical YOLO
  3. Overview of YOLO series: from V1 to V4
