Applications of Computer Vision 14: The Model Architectures and Improvements of the Classic Object Detection Algorithms YOLOv1-YOLOv5, Explained in Detail for Easy Memorization

Hello everyone, I am Wei Xue AI. Today I will introduce Applications of Computer Vision 14: the classic object detection algorithms YOLOv1-YOLOv5, with their model architectures and improvements explained in detail to make them easy to remember. YOLO (You Only Look Once) is a deep learning model for object detection. Imagine that traditional object detection methods are like detectives who must carefully survey the entire scene and examine every detail one by one to find their targets. YOLO, by contrast, is like a superhero: it scans the whole image in a single glance and captures every target at once. YOLO achieves such efficient detection because it recasts object detection as a regression problem: a single neural network predicts bounding boxes and class probabilities directly from image pixels in one forward pass. This makes YOLO not only fast but also accurate. It can capture targets of different sizes and positions simultaneously, and it can also tell you which category each one belongs to.

Whether you are tracking a moving vehicle or looking for a pedestrian, YOLO can give you an accurate answer almost instantly. This combination of speed and accuracy has made YOLO the method of choice in many computer vision applications and a superstar in the field of object detection!

1. Detailed Introduction to the YOLO Network

1. Design idea:
Traditional object detection algorithms, such as the R-CNN series, work in two stages: first generate candidate regions, then classify each region. YOLO instead adopts a "look only once" strategy: it treats object detection as a regression problem and completes both bounding-box prediction and classification in a single network.
2. Network structure:
YOLO uses a convolutional network with fully connected layers at the end for prediction. The input image is divided into an S×S grid; if the center of an object falls inside a grid cell, that cell is responsible for predicting the object. Each cell predicts B bounding boxes with a confidence score that each box contains an object, plus C conditional class probabilities.
3. Loss function:
Since the task involves both coordinate regression and classification, the loss function has corresponding parts. In YOLOv1, coordinate, confidence, and class errors are all computed with squared-error loss; later versions replace the class term with cross-entropy. A simplified sketch of this loss appears after this list.
4. Advantages:
Fast: a single forward pass produces the final result.
Strong generalization: it handles changes in appearance, scale, and viewing angle well.
5. Disadvantages:
Weak on small objects, and localization accuracy is relatively low.
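To make the loss in point 3 concrete, here is a minimal sketch of a YOLOv1-style sum-squared loss in PyTorch. It is illustrative only: the (x, y, w, h, conf) tensor layout, the λ weights (5 and 0.5, as in the YOLOv1 paper), and the simplified handling of "responsible" boxes are assumptions made for readability, not the exact original implementation.

```python
import torch

# Assumed layout: pred and target are (N, S, S, 5*B + C) tensors, each box
# stored as (x, y, w, h, conf), and the target marks responsible boxes with
# conf = 1. This is a simplification of the original YOLOv1 loss.
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_v1_loss(pred, target, B=2, C=20):
    pred_boxes = pred[..., :5 * B].reshape(*pred.shape[:-1], B, 5)
    true_boxes = target[..., :5 * B].reshape(*target.shape[:-1], B, 5)
    pred_cls, true_cls = pred[..., 5 * B:], target[..., 5 * B:]

    obj = true_boxes[..., 4]        # 1 where a box is responsible for an object
    noobj = 1.0 - obj

    # Coordinate loss: squared error on (x, y) and on sqrt(w), sqrt(h),
    # counted only for responsible boxes.
    xy_loss = (obj.unsqueeze(-1) * (pred_boxes[..., :2] - true_boxes[..., :2]) ** 2).sum()
    wh_loss = (obj.unsqueeze(-1) * (pred_boxes[..., 2:4].clamp(min=0).sqrt()
                                    - true_boxes[..., 2:4].sqrt()) ** 2).sum()

    # Confidence loss: squared error, down-weighted for boxes without objects.
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    conf_loss = (obj * conf_err).sum() + LAMBDA_NOOBJ * (noobj * conf_err).sum()

    # Class loss: squared error over the C class probabilities (YOLOv1 uses
    # squared error here too), counted once per cell containing an object.
    cell_obj = obj.amax(dim=-1)
    cls_loss = (cell_obj.unsqueeze(-1) * (pred_cls - true_cls) ** 2).sum()

    return LAMBDA_COORD * (xy_loss + wh_loss) + conf_loss + cls_loss
```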

Below I will introduce the network structures from YOLOv1 to YOLOv5 in detail and explain the improvements each version makes over its predecessor:

YOLOv1 model

YOLOv1 was the first model to achieve end-to-end object detection, producing bounding-box predictions and class probabilities from a single neural network.
Network structure: a single convolutional network is used, followed by 2 fully connected layers whose final outputs are linear regression values. The input is a 448×448 image, and the output is a 7×7 grid in which each cell predicts 2 bounding boxes and 20 class probabilities.
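As a quick sanity check on this output layout, the sketch below shows how the 7×7×30 tensor (7×7 cells × (2 boxes × 5 values + 20 classes)) decomposes; the per-box (x, y, w, h, conf) ordering is an assumed convention for illustration:

```python
import torch

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
out = torch.randn(1, S, S, B * 5 + C)    # stand-in for the network output
print(out.shape)                         # torch.Size([1, 7, 7, 30])

boxes = out[..., :B * 5].reshape(1, S, S, B, 5)  # (x, y, w, h, conf) per box
class_probs = out[..., B * 5:]                   # 20 conditional class scores
# Final score per box and class: box confidence * conditional class probability.
scores = boxes[..., 4:5] * class_probs.unsqueeze(3)   # shape (1, 7, 7, B, 20)
```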

Model improvements

Compared with earlier methods that complete object detection in multiple stages (such as the R-CNN series), YOLOv1 dramatically improved speed, and because it sees the whole image at once it makes fewer background errors; its main remaining weaknesses are small and densely packed objects.

YOLOv2 model

YOLOv2 improves accuracy while maintaining high speed. It introduces Darknet-19 as the backbone architecture and adds "anchor boxes" to better handle objects of different shapes and sizes.
Network structure: Darknet-19 contains 19 convolutional layers and 5 max-pooling layers. The fully connected layer at the end is removed and replaced with convolutional layers, so the network directly outputs predictions on a 13×13 grid.

Model improvements

1. Batch Normalization: after each convolution and before the activation function, the outputs are normalized (see the sketch after this list). This greatly speeds up training and improves results.
2. "Anchor boxes" are introduced to tackle the difficulty of detecting objects of different shapes and sizes.
3. "Multi-scale training" is added so the model adapts to various input resolutions.
4. The "Darknet-19" architecture makes the model deeper while keeping it computationally efficient.
[Figure: Darknet-19 overall network architecture]

YOLOv3 model

YOLOv3 improves the detection of small objects by predicting at three different scales and by using three anchor-box sizes per scale to better match real object sizes.
Network structure: Darknet-53 is used, containing 53 convolutional layers, with residual connections added to ease training. The network outputs predictions at 3 different scales (13×13, 26×26, 52×52).

Model improvements

1. Prediction at three different scales:
YOLOv3 introduces prediction at three different scales (13×13, 26×26, 52×52), each producing its own set of bounding boxes. This is achieved by adding more feature layers to the network and making predictions at different levels, which helps the model detect objects of different sizes. Specifically, the coarsest grid (13×13) is mainly used to detect large objects, the medium grid (26×26) detects medium-sized objects, and the finest grid (52×52) is mainly used to detect small objects.
2. Use three different sizes of "anchor boxes":
At each prediction scale, YOLOv3 uses three fixed-size "anchor boxes" for each grid cell. These anchors are obtained by clustering the width-height distribution of the ground-truth bounding boxes in the training set, and they help the model better match and predict real object sizes.
3. Darknet-53 network structure:
YOLOv3 uses Darknet-53 as its backbone. Darknet-53 contains 53 convolutional layers and achieves accuracy comparable to ResNet-152 while being roughly twice as fast. It extracts features by alternating 1×1 and 3×3 convolutions, stacking these blocks to deepen the network and extract more complex, high-level abstract features.
4. Residual connections:
Darknet-53 also introduces residual connections (also called shortcut or skip connections, similar to ResNet). This is a standard technique for mitigating the vanishing-gradient and representation-bottleneck problems when training deep networks. Residual connections let the input flow directly to the output of a block, ensuring information propagates effectively and helping capture both low-level and high-level features; a sketch of such a block appears after this list.
5. Multi-label classification:
In the classifier, YOLOv3 uses independent sigmoid functions instead of a softmax, enabling multi-label classification. This lets the model predict multiple categories for one object, which helps in complex scenarios where an object belongs to several categories at once.
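Below is a minimal sketch of the Darknet-53 residual block described in points 3 and 4: a 1×1 convolution halves the channels, a 3×3 convolution restores them, and the block's input is added back through a skip connection. Channel counts and layer details are simplified for illustration.

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Darknet-53 style residual block: 1x1 reduce -> 3x3 expand -> skip add."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(mid, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)   # residual connection: input flows to output
```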

YOLOv4 model

YOLOv4 further improves accuracy while maintaining the speed advantage. It introduces new techniques such as CSPDarknet53, PANet, and the SAM block.
Network structure: CSPDarknet53 serves as the backbone, PANet is used for feature fusion, and the SAM block provides an attention mechanism. It also adopts the Mish activation function and the CIoU loss to improve performance.
The improvements introduce several new techniques:
1. CSPDarknet53:
As the backbone network, CSPDarknet53 is the main component of YOLOv4. It is based on the Darknet-53 structure and adopts the Cross Stage Partial (CSP) connection strategy, which reduces computation and memory consumption during forward propagation while enriching gradient flow, improving model performance.
2. PANet:
PANet (Path Aggregation Network) is a feature-fusion module. By exchanging and fusing information along bottom-up and top-down paths, PANet makes better use of the semantic information in feature maps at every level, improving detection accuracy.
3. SAM block:
The SAM (Spatial Attention Module) block is an attention mechanism. By applying spatial attention to the input feature map, it amplifies valuable regions (i.e., target locations that deserve attention) and suppresses the influence of unimportant regions.
4. Mish activation function:
Mish is a newer nonlinear activation function that outperforms ReLU and other common activations on certain tasks. Mish is smooth and increases monotonically for positive inputs, while for negative inputs it allows small non-zero outputs instead of hard-zeroing them, which makes neurons more likely to stay active during backpropagation.

The Mish activation function is defined as:

$$\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x))$$

where $\text{softplus}(x) = \log(1 + e^x)$ is the softplus function.
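A one-line PyTorch sketch of this formula (recent PyTorch versions also ship a built-in torch.nn.Mish):

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x))
    return x * torch.tanh(F.softplus(x))
```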

5. CIoU loss:
CIoU loss is a newer bounding-box regression loss. Compared with the plain IoU loss, GIoU loss, and similar methods, it evaluates the difference between boxes more comprehensively, accounting for overlap, center distance, and aspect ratio, and it is more stable during training.

The CIoU loss function can be written as:

$$\text{CIoU} = 1 - \text{IoU} + \frac{d^2(g, p)}{c^2} + \alpha v$$

where $g$ is the ground-truth box, $p$ is the predicted box, $\text{IoU}$ is the Intersection over Union, $d(g, p)$ is the distance between the two box centers, $c$ is the diagonal length of the smallest box enclosing both, $\alpha$ is a balancing weight, and $v$ is an auxiliary term that penalizes the aspect-ratio difference between the predicted box and the ground-truth box.
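Following the formula above, here is a hedged PyTorch sketch of the CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format. The definitions of α and v follow the CIoU paper the text summarizes; the box format and the eps stabilizer are implementation assumptions.

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """CIoU loss between boxes in (x1, y1, x2, y2) format; shapes (..., 4)."""
    # Intersection and union -> IoU.
    ix1 = torch.max(pred[..., 0], gt[..., 0])
    iy1 = torch.max(pred[..., 1], gt[..., 1])
    ix2 = torch.min(pred[..., 2], gt[..., 2])
    iy2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Squared center distance d^2(g, p).
    d2 = ((pred[..., 0] + pred[..., 2]) / 2 - (gt[..., 0] + gt[..., 2]) / 2) ** 2 \
       + ((pred[..., 1] + pred[..., 3]) / 2 - (gt[..., 1] + gt[..., 3]) / 2) ** 2

    # Squared diagonal c^2 of the smallest enclosing box.
    ex1 = torch.min(pred[..., 0], gt[..., 0])
    ey1 = torch.min(pred[..., 1], gt[..., 1])
    ex2 = torch.max(pred[..., 2], gt[..., 2])
    ey2 = torch.max(pred[..., 3], gt[..., 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio term v and balancing weight alpha.
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + d2 / c2 + alpha * v
```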

YOLOv5 model

YOLOv5 is a real-time object detection algorithm. Although its name contains "YOLO" (You Only Look Once), it was not developed by Joseph Redmon, the original author of YOLO, but is a project driven by the open-source community. Despite the name, "v5" does not actually offer a significant algorithmic breakthrough: it mainly adjusts the model structure to optimize speed and accuracy, and it provides a complete toolchain for training, detection, and deployment.
In terms of network structure, YOLOv5 adopts a design similar to YOLOv3/v4 and makes several adjustments on top of it:
1. Modified convolution block configuration:
This change targets the convolutional layers in the network. Adjusting each layer's configuration parameters (such as kernel size and stride) changes the network structure, and with it the model's performance and computational cost.
2. Adding PANet:
PANet (Path Aggregation Network) is a feature-pyramid-style network that effectively aggregates multi-scale, multi-level feature information. Introducing PANet improves the model's ability to recognize targets of different sizes.
3. Fine-tuning for speed and accuracy:
The developers performed extensive fine-tuning to optimize running speed and prediction accuracy, including (but not limited to) selecting activation and loss functions better suited to the task, modifying the learning-rate schedule, and applying data augmentation.
4. A comprehensive, easy-to-use toolchain:
This includes automatic hyperparameter search, where users only set the search range and the target evaluation metric to find the best hyperparameter combination, and model pruning, which removes redundant or insignificant parts of the model to reduce its size and speed up inference.
5. Focus module:
YOLOv5 introduces the Focus module, a lightweight structure that slices the input into four pixel-interleaved sub-images and concatenates them along the channel dimension before a convolution, replacing an ordinary strided downsampling at the start of the network. This performs a 2× downsampling without discarding pixel information, which helps preserve detail and improves the detection of small targets; a sketch appears below.
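Here is the promised sketch of the Focus slicing. It mirrors the structure of the module in the YOLOv5 codebase, but the plain Conv2d (the real module uses a conv-BN-activation block) and the channel counts are simplifications.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into 4 pixel-interleaved sub-images, concatenate them
    along the channel axis (2x downsample, no information loss), then convolve."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        return self.conv(torch.cat([
            x[..., ::2, ::2], x[..., 1::2, ::2],
            x[..., ::2, 1::2], x[..., 1::2, 1::2],
        ], dim=1))

# (1, 3, 640, 640) -> (1, 64, 320, 320): half the resolution, 4x the channels
# before the conv projects them to out_ch.
print(Focus(3, 64)(torch.zeros(1, 3, 640, 640)).shape)
```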

Source: blog.csdn.net/weixin_42878111/article/details/132856697