YOLOv4 study notes


Preface

In recent years, the emergence of YOLOv4 marked an important technological breakthrough in the field of object detection. YOLOv4 not only inherits the speed and efficiency of the YOLO series, but also introduces a series of innovative techniques and strategies that significantly improve detection performance. This article briefly introduces YOLOv4's main contributions and improvements, its core concepts, its network architecture, and its innovations in data augmentation and loss functions. Through this analysis, we can better understand the importance of YOLOv4 in the field of object detection and its application potential.
[Figure: YOLOv4 detection results]


1. YOLOv4 contributions and improvements

The contributions and improvements of YOLOv4 can be summarized from the following aspects:

  1. Efficient and powerful target detection model:

    • YOLOv4 is an efficient and powerful object detection model that allows anyone with a single 1080 Ti or 2080 Ti GPU to train a fast and accurate object detector.
  2. Selection and optimization of network architecture:

    • YOLOv4 uses CSPDarknet53 as its backbone network (Backbone), SPP (Spatial Pyramid Pooling) and PAN (Path Aggregation Network) as its neck (Neck), and YOLOv3 as its detection head (Head).
    • To meet the particular requirements of object detection, such as detecting many small objects, a larger input resolution, a larger receptive field, and more parameters to detect multiple objects of different sizes in a single image, CSPDarknet53 proved to be the optimal choice.
  3. Verification of the impact of training improvement techniques:

    • YOLOv4 tested the impact of various training improvement techniques on the accuracy of the classifier on the ImageNet dataset and the accuracy of the object detector on the MS COCO dataset.
  4. Key technologies used (BoF and BoS):

    • YOLOv4 employs a series of "Bag of Freebies (BoF)" and "Bag of Specials (BoS)" methods to improve performance. These include CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing, the Mish activation function, Cross-Stage Partial connections (CSP), Multi-input Weighted Residual Connections (MiWRC), CIoU loss, Self-Adversarial Training (SAT), elimination of grid sensitivity, using multiple anchors for a single ground-truth box, a cosine annealing scheduler, optimal hyperparameters, random training shapes (multi-scale training), the SPP block, the SAM block, the PAN path-aggregation block, and DIoU-NMS.

These improvements and innovations enable YOLOv4 to achieve significant performance gains in object detection, particularly in the balance between speed and accuracy, making it an important milestone in the field.
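Class label smoothing, one of the BoF techniques listed above, is simple enough to sketch directly. The following is a minimal NumPy illustration of the idea (not YOLOv4's actual implementation): the hard one-hot target is softened so the model is less confident about its labels.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Soften one-hot targets: the true class keeps 1 - eps,
    and eps is spread uniformly over all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

# Example: 3-class one-hot label for class 1
label = np.array([0.0, 1.0, 0.0])
print(smooth_labels(label, eps=0.1))  # true class ≈ 0.933, others ≈ 0.033
```

The smoothed distribution still sums to 1, so it can be used directly as the target of a cross-entropy loss.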

2. YOLOv4 core concepts

  1. CSPDarknet53 Backbone:

    • CSPDarknet53 is the backbone network of YOLOv4, specially designed to improve the network's learning ability and speed. It combines the structure of Darknet53 with the optimization strategy of the Cross Stage Partial Network (CSPNet). CSPNet reduces computation and improves the propagation efficiency of feature maps by splitting feature maps and merging them across stages.
  2. SPP and PAN Neck:

    • The SPP (Spatial Pyramid Pooling) block is used to increase the receptive field and isolate the most important contextual features, with little impact on network speed.
    • PAN (Path Aggregation Network) improves the transmission of feature information and boosts detection performance through multi-level feature fusion, especially for small objects.
  3. YOLOv3 detection head (Head):

    • YOLOv4 follows the detection head of YOLOv3. This head generates prediction boxes (bounding boxes) and computes the class probabilities and object confidence for each box.
  4. Bag of Freebies (BoF) and Bag of Specials (BoS):

    • BoF methods improve the effectiveness of the training process without increasing inference cost, for example Mosaic data augmentation, DropBlock regularization, and CIoU loss.
    • BoS methods add a small amount of computational cost during inference to significantly improve detection performance. These include the Mish activation function, cross-stage partial connections (CSP), and multi-input weighted residual connections (MiWRC).
  5. Data augmentation and regularization techniques:

    • YOLOv4 introduces new data augmentation methods such as Mosaic and Self-Adversarial Training (SAT), as well as DropBlock as a regularization method. Mosaic blends four training images into one, while SAT alters the original image in two forward-backward stages.
  6. Hyperparameter optimization and training strategy:

    • YOLOv4 was designed with the adaptability of single-GPU training in mind, including using genetic algorithms to select optimal hyperparameters and improving some existing methods to make them more suitable for efficient training and detection.
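The Mish activation mentioned among these concepts has a simple closed form, x · tanh(softplus(x)). A minimal NumPy sketch:

```python
import numpy as np

def mish(x: np.ndarray) -> np.ndarray:
    """Mish activation: x * tanh(ln(1 + e^x)).
    Smooth, non-monotonic, and unbounded above."""
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.array([-2.0, 0.0, 2.0])
print(mish(x))  # mish(0) = 0; large positive x stays ≈ x; negatives are softly suppressed
```

Unlike ReLU, Mish allows a small gradient to flow for negative inputs, which is one reason it is cited as improving accuracy at a modest compute cost.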

3. YOLOv4 network architecture

The network architecture of YOLOv4 is mainly divided into three parts: Backbone, Neck, and Head. The following is a detailed description of these three parts:

  1. Backbone: CSPDarknet53

    • CSPDarknet53 is the backbone network of YOLOv4. It is built on Darknet53 and introduces the concepts of CSPNet. This structure is designed to improve the network's learning capability and operating speed.
    • It reduces computation and improves feature-map propagation efficiency by splitting feature maps and merging them across stages. In addition, CSPDarknet53 contains 29 3×3 convolutional layers, provides a large 725×725 receptive field, and has 27.6M parameters, making it well suited as the detector's backbone.
  2. Neck: SPP and PAN

    • The SPP (Spatial Pyramid Pooling) block is located after the backbone network and is used to increase the receptive field, isolate the most important contextual features, and has less impact on the network operation speed. SPP uses pooling operations to aggregate features of different scales and enhance the model's adaptability to targets of different sizes.
    • PAN (Path Aggregation Network) is used to improve the delivery of feature information. The PAN structure improves detection performance by fusing features at different levels, especially in small-size target detection. It enhances feature richness and diversity by aggregating feature maps at different levels.
  3. Detection head (Head): YOLOv3

    • The detection head of YOLOv4 follows the design of YOLOv3. This head generates prediction boxes (bounding boxes) and computes the class probabilities and object confidence for each box. It consists of a series of convolutional layers for final object detection and classification.
    • The advantage of the YOLOv3 head is its simple and efficient design, which can handle both object detection and classification in a single network.

Overall, YOLOv4's network architecture improves object-detection accuracy and speed through these innovative designs while remaining highly efficient, especially for small objects.
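The cross-stage partial idea at the heart of CSPDarknet53 can be illustrated with a toy NumPy sketch. This is a structural illustration only, with a hypothetical `dense_stage` standing in for a stack of convolutions: the feature map is split along the channel axis, only one half passes through the heavy stage, and the two halves are concatenated again.

```python
import numpy as np

def dense_stage(x: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a stage of convolutions."""
    return x * 2.0

def csp_block(feature_map: np.ndarray) -> np.ndarray:
    """Cross-Stage Partial connection: split channels, send only one
    half through the heavy stage, then concatenate both halves."""
    c = feature_map.shape[0] // 2
    part1, part2 = feature_map[:c], feature_map[c:]   # channel split
    part2 = dense_stage(part2)                        # only half is processed
    return np.concatenate([part1, part2], axis=0)     # cross-stage merge

x = np.ones((8, 4, 4))           # (channels, H, W)
y = csp_block(x)
print(y.shape)                   # same shape out, roughly half the stage compute
```

Because only half the channels traverse the expensive stage, the block cuts computation while the untouched half preserves a clean gradient path to earlier layers.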

4. YOLOv4 data augmentation

YOLOv4 introduces several innovative data augmentation techniques that significantly improve the model's generalization ability and accuracy across different environments and conditions. The main methods include:

  1. Mosaic data augmentation:

    • Mosaic is a novel data augmentation method that blends four training images together to form a single synthetic image. This approach not only increases the diversity of the training data, but also allows the model to learn to detect objects in different contexts.
    • With Mosaic augmentation, the model is able to process activation statistics from four different images at each layer, which helps reduce the need for large mini-batches.
  2. Self-Adversarial Training (SAT):

    • Self-Adversarial Training (SAT) is another novel data augmentation technique that operates in two forward-backward stages. In the first stage, the neural network modifies the original image rather than the network weights: it effectively performs an adversarial attack on itself, altering the image to create the illusion that the target object is not present.
    • In the second stage, the neural network is trained to detect objects on this modified image. This approach enhances the model's robustness to adversarial attacks and abnormal conditions.
  3. CutMix and MixUp:

    • Although the YOLOv4 paper focuses on Mosaic, CutMix and MixUp are also commonly used data augmentation techniques in object detection training. They generate new training samples by combining parts of different images, enhancing the model's ability to learn varied scene and object combinations.
  4. Random Training Shapes:

    • YOLOv4 also uses random training shapes (multi-scale training): the size of the input image changes continually during training. This helps the model adapt to inputs of different sizes and resolutions.

The common goal of these data augmentation techniques is to improve the performance and robustness of models in complex and changing real-world environments, especially when dealing with object detection tasks of different sizes, different backgrounds, and different environments. Through these methods, YOLOv4 can effectively improve its adaptability to various scenarios and detection accuracy.
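The core of the Mosaic idea above can be sketched with NumPy. This is a deliberately minimal version: real implementations also pick a random center point, randomly scale and crop each source image, and remap the bounding-box labels, all of which are omitted here.

```python
import numpy as np

def mosaic(imgs, out_size=4):
    """Tile four images of shape (H, W, C) into one 2x2 mosaic.
    Each source image contributes one quadrant of the output."""
    h = w = out_size // 2
    canvas = np.zeros((out_size, out_size, imgs[0].shape[2]), dtype=imgs[0].dtype)
    canvas[:h, :w] = imgs[0][:h, :w]   # top-left
    canvas[:h, w:] = imgs[1][:h, :w]   # top-right
    canvas[h:, :w] = imgs[2][:h, :w]   # bottom-left
    canvas[h:, w:] = imgs[3][:h, :w]   # bottom-right
    return canvas

# Four tiny dummy "images", each a constant color 0..3
imgs = [np.full((4, 4, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic(imgs, out_size=4)
print(m[:, :, 0])  # four quadrants filled with 0, 1, 2 and 3
```

Even this toy version shows why Mosaic helps: a single training sample now contains objects from four different contexts, so each batch exposes the network to far more visual variety.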

5. Loss function of YOLOv4

The loss function of YOLOv4 is a key component of its detection performance. It consists of three parts: confidence loss, class loss, and bounding-box coordinate loss. Their principles and formulas are introduced in detail below.

  1. Confidence Loss:

    • The confidence loss is used to evaluate whether the bounding box predicted by the model contains an object and measure the accuracy of its prediction. YOLOv4 uses cross-entropy loss to perform this task.
    • The formula is typically expressed as:

      $$\text{Confidence Loss} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B}\left[\mathbb{1}_{ij}^{obj}\log(\hat{C}_{ij}) + \lambda_{noobj}\,\mathbb{1}_{ij}^{noobj}\log(1 - \hat{C}_{ij})\right]$$

      where $S^2$ is the number of grid cells, $B$ is the number of bounding boxes predicted per grid cell, $\mathbb{1}_{ij}^{obj}$ is an indicator that equals 1 if bounding box $j$ in grid cell $i$ contains an object and 0 otherwise ($\mathbb{1}_{ij}^{noobj}$ is its complement), $\hat{C}_{ij}$ is the model's predicted confidence that the box contains an object, and $\lambda_{noobj}$ is the weight applied to boxes that contain no object.
  2. Class Loss:

    • Class loss is used to evaluate the accuracy of the model in classification predictions. YOLOv4 also uses cross-entropy loss to calculate category loss.
    • The formula is typically expressed as:

      $$\text{Class Loss} = -\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c \in classes} p_{ij}(c)\log(\hat{p}_{ij}(c))$$

      where $p_{ij}(c)$ is the true probability that class $c$ is present in bounding box $j$ of grid cell $i$, and $\hat{p}_{ij}(c)$ is the corresponding probability predicted by the model.
  3. Bounding Box Loss:

    • YOLOv4 introduces CIoU loss (Complete Intersection over Union Loss) to replace the traditional IoU loss to more accurately optimize the coordinates of the prediction box.
    • The CIoU loss takes into account the bounding box overlap area, center point distance and aspect ratio, providing a more comprehensive box coordinate regression.
    • Definition:
      $$\text{CIoU Loss} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$

      where IoU is the ratio of intersection to union, $\rho(b, b^{gt})$ is the Euclidean distance between the center points of the predicted box $b$ and the ground-truth box $b^{gt}$, $c$ is the diagonal length of the smallest enclosing box covering both boxes, $v$ measures the consistency of the two aspect ratios, and $\alpha$ is a weight coefficient that balances the terms.

These loss functions together constitute the loss function of YOLOv4, which enables the model to take into account accuracy, confidence, and category prediction at the same time when performing target detection. Through such a design, YOLOv4 can improve detection accuracy and robustness while maintaining high-speed processing.
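The CIoU loss described above can be computed directly from box coordinates. The following is a minimal NumPy/stdlib sketch for boxes in (x1, y1, x2, y2) format, using the common choices $v = \frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h})^2$ and $\alpha = \frac{v}{1 - \text{IoU} + v}$:

```python
import math

def ciou_loss(box, gt):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format."""
    # intersection and union -> IoU
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w1, h1 = box[2] - box[0], box[3] - box[1]
    w2, h2 = gt[2] - gt[0], gt[3] - gt[1]
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union

    # squared distance between box centers (rho^2)
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box (c^2)
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss([0, 0, 2, 2], [0, 0, 2, 2]))  # 0.0 for identical boxes
```

Note that unlike plain IoU loss, the distance term keeps the gradient informative even when the two boxes do not overlap at all, which is exactly the case early in training.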


Summary

After this in-depth look at YOLOv4, we can see that it represents significant progress in object detection technology. YOLOv4 not only improves detection speed and accuracy, but also greatly improves generalization through its unique network architecture and innovative training strategies. In data augmentation and loss-function design in particular, YOLOv4 demonstrates a powerful ability to handle complex and diverse scenarios. Overall, YOLOv4 sets a new standard for real-time object detection and offers rich inspiration and possibilities for future research and applications.


Origin blog.csdn.net/qq_31463571/article/details/134812347