YOLOv4 Structure

1 YOLOv4 Structure

CSPNet + Darknet53 (layer 0 to layer 104).
The first CSP structure, shown in the figure, corresponds to layer 1 to layer 10. The feature map of layer 1 is split into two paths: one path is saved as layer 2 after a 1×1 convolution without dimensionality reduction, while the other path applies a series of residual operations to layer 1 to obtain layer 8. Layer 2 and layer 8 are concatenated to obtain layer 9, and a 1×1 convolution is applied to layer 9 for feature fusion. This completes the CSP construction at this resolution.


The SPP block is implemented by applying 5×5, 9×9, and 13×13 max pooling to layer 107, producing layer 108, layer 110, and layer 112 respectively. After pooling, layer 107, layer 108, layer 110, and layer 112 are concatenated into one feature map, layer 114, which is then reduced to 512 channels by a 1×1 convolution.


The method the original PANet uses for fusion is addition.
Here, YOLOv4 changes the fusion method from addition to concatenation, without explaining the reason in detail; in yolov4.cfg, a route layer is used to link the two parts of the features.
The layers corresponding to PANet's upsampling are layer 105 to layer 128; starting from layer 132 come the downsampling path and the YOLOv3 head.

1.1 Backbone: CSPDarknet53(CSPNet + Darknet53)

1.1.1 CSPNet


CSPNet: a new backbone that enhances the learning capability of CNNs. Simply put, CSPNet not only reduces the amount of computation, but also improves inference speed and accuracy.


Background: neural networks become especially powerful as they grow deeper and wider. However, scaling up a network architecture usually brings much more computation, which makes computationally heavy tasks such as object detection unaffordable for most people. Lightweight computing has therefore gradually received more and more attention, because real-world applications usually need short inference times on small devices, which poses a severe challenge to computer vision algorithms.


What problem it mainly solves: from the perspective of network structure design, it reduces the heavy computation that previous work required during inference.
The authors believe that the excessive inference computation is caused by duplicated gradient information during network optimization. CSPNet integrates the gradient changes into the feature map from beginning to end, which reduces computation while preserving accuracy. CSPNet is a design idea that can be combined with ResNet, ResNeXt, and DenseNet.


Generally speaking, CSPNet is proposed to solve three problems:
1. enhance the learning ability of CNNs, maintaining accuracy while making the network lightweight;
2. remove the computational bottleneck;
3. reduce memory cost.


The figure above shows a schematic of DenseNet and the CSPDenseNet improvement. The improvement is that CSPNet splits the shallow feature map into two parts: one part passes through the dense module (the Partial Dense Block in the figure), and the other part is directly concatenated with the Partial Dense Block's output.
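
As a concrete illustration, here is a minimal PyTorch sketch of this split-and-merge idea. The module name, the half-and-half channel split, and the contents of the computational branch are assumptions for illustration, not the exact CSPDenseNet layers:

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Sketch of a CSP stage: split, process one part, concatenate, fuse."""
    def __init__(self, channels, num_blocks=1):
        super().__init__()
        half = channels // 2
        # Part 1: a 1x1 conv whose output bypasses the heavy computation.
        self.part1 = nn.Conv2d(channels, half, kernel_size=1)
        # Part 2: a 1x1 conv followed by a stack of conv blocks
        # (standing in for the Partial Dense Block).
        self.part2 = nn.Conv2d(channels, half, kernel_size=1)
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, kernel_size=3, padding=1),
                nn.LeakyReLU(0.1),
            )
            for _ in range(num_blocks)
        ])
        # Transition after concatenation: 1x1 conv for feature fusion.
        self.fuse = nn.Conv2d(2 * half, channels, kernel_size=1)

    def forward(self, x):
        p1 = self.part1(x)               # shortcut branch
        p2 = self.blocks(self.part2(x))  # computational branch
        return self.fuse(torch.cat([p1, p2], dim=1))
```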


Applying CSPNet to ResNeXt:
As with CSPDenseNet, the input feature map is divided into two parts. Part 1 goes directly to the concatenation without any operation, while Part 2 passes through the convolution operations.


The CSPNet authors also designed several feature fusion strategies, as shown in the figure. The transition layer mainly consists of a bottleneck layer (1×1 convolution) and an optional pooling layer.
(a) the feature fusion method of the original DenseNet;
(b) the feature fusion method of CSPDenseNet (transition → concatenation → transition);
(c) the Fusion First method (concatenation → transition);
(d) the Fusion Last method (transition → concatenation).
Fusion First concatenates the feature maps of the two branches before the transition, so the gradient information can be reused. Fusion Last applies the transition to the Dense Block branch first and then concatenates, so the gradient information is truncated and not reused.
Using Fusion First helps reduce the computational cost, but accuracy drops significantly.
Using Fusion Last also greatly reduces the computational cost, while top-1 accuracy drops by only 0.1%.
Using both fusion methods at once, as the CSP design in (b) does, reduces the computational cost while improving accuracy.
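
To make the two orderings concrete, here is a hedged sketch reusing the split from the previous snippet: `x1` is the shortcut half, `x2` the half fed to the dense block, and `dense` and `transition` stand in for the Partial Dense Block and the 1×1 transition convolution:

```python
import torch

def fusion_first(x1, x2, dense, transition):
    # concatenation -> transition: gradients flow back through both
    # branches together, so gradient information is reused.
    return transition(torch.cat([x1, dense(x2)], dim=1))

def fusion_last(x1, x2, dense, transition):
    # transition -> concatenation: the dense branch gets its own
    # transition first, so its gradient path is truncated.
    return torch.cat([x1, transition(dense(x2))], dim=1)
```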

1.2 Neck: SPP, PAN

1.2.1 SPPNet

Background: SPP comes from the 2014 paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" by Kaiming He, an improvement on R-CNN, which had been published in 2013. At the time R-CNN's detection speed was slow; OverFeat had proposed a multi-scale training method, but it was difficult for R-CNN to use.
R-CNN's biggest bottleneck is that all ~2k candidate regions must each pass through the CNN, which is very slow. Kaiming He was the first to improve this, proposing SPP-net: its biggest improvement is that the original image only needs to be fed through the network once to obtain the features of every candidate region.
Spatial pyramid pooling, proposed in 2014, brought large gains in speed and accuracy, and the idea has been carried on in many subsequent networks.


Contributions:
Regardless of the input size, the SPP layer produces a fixed-size output, which can be used for multi-scale training.
Features are extracted from each image only once.


What problem it solves: the deep convolutional neural networks of the time required a fixed input image size (e.g. 224×224). This requirement is "artificial", and recognition accuracy drops when the network faces images or sub-images of arbitrary size or scale. In this paper, the authors equip the network with a "spatial pyramid pooling" strategy to eliminate this limitation. The SPP-net structure produces a fixed-length representation regardless of the size or aspect ratio of the input image, and pyramid pooling is also robust to object deformations.


Why does a CNN need a fixed input size? The output sizes of the convolutional and pooling layers depend on the input size, but those layers themselves do not require a fixed image size; what really needs a fixed size is the final fully connected layer.
Because of the FC layer, ordinary CNNs fix the input to the fully connected layer by fixing the size of the input image. The authors took a different view: since the convolutional layers can adapt to any size, one only needs to add some structure at the end of the convolutional layers so that the input to the subsequent fully connected layers has a fixed length.
That structure is SPP.


A simple way to understand it: the weights live in the convolutional and fully connected layers, while pooling layers have no weights to store. SPP adjusts the size and stride of the pooling layers so that the output stays the same size, which then makes multi-scale training possible.


The purpose of the SPP block used in YOLOv4 is to increase the receptive field of the network.


The implementation applies 5×5, 9×9, and 13×13 max pooling to layer 107, producing layer 108, layer 110, and layer 112 respectively. After pooling, layer 107, layer 108, layer 110, and layer 112 are concatenated into one feature map, layer 114, which is reduced to 512 channels by a 1×1 convolution.
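
A minimal PyTorch sketch of this block follows. Stride-1 max pooling with matching padding keeps the spatial size, which is what makes the concatenation possible; the 512-channel counts come from the description above, and the module name is an assumption:

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLOv4-style SPP: parallel 5/9/13 max pools, concat, 1x1 reduce."""
    def __init__(self, channels=512):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in (5, 9, 13)
        ])
        # Input + three pooled maps -> 4x channels, reduced back by 1x1 conv.
        self.reduce = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, x):
        feats = [x] + [pool(x) for pool in self.pools]
        return self.reduce(torch.cat(feats, dim=1))
```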


In the figure, the first row is the processing method of a CNN that requires fixed-size input: the image is cropped or warped to the fixed size first.
The second row is the processing flow of a CNN such as R-CNN: the image is first processed as in the first row, then fed through the convolutional and fully connected layers to produce the output.
The third row is the SPP-net processing method: the image size is not fixed, and the image is fed directly into the convolutional layers. The convolutional features are not passed straight to the fully connected layers; they first go through the SPP layer, which passes a fixed-length output to the fully connected layers before the final result is produced.


For one image, R-CNN first uses selective search to extract about 2000 candidate regions, and then sends each of these regions through the network separately; that is, one image requires about 2000 forward passes, which creates a great deal of redundancy.
SPP-net proposes a mapping from each candidate region to the feature map of the whole image. Through this mapping, the feature vector of a candidate region can be obtained directly, without re-running the CNN to extract features, which greatly shortens training time.
Each image only needs one forward pass.


Explaining the improvement in detail:
1. The SPP layer (spatial pyramid pooling).
First, be clear about where this layer sits: it is added between the last convolutional layer and the fully connected layers, and its purpose is to output a fixed-length vector for the fully connected layers, which require fixed-size input. The structure of the SPP layer is shown in the figure.


2. The input of the SPP layer:
As shown by the gray box in the figure, the input is the features output by the last convolutional layer (the feature map, the black part of the figure below); more precisely, the SPP layer's input is the region of the feature map corresponding to each candidate region.
That sentence may sound convoluted. It can be understood as follows: an image has about 2000 candidate regions, and after the image is convolved we get a feature map. On this feature map there are likewise about 2000 regions, each corresponding to a candidate region.


3. The output of the SPP layer:
The SPP layer is divided into 1×1, 2×2, and 4×4 pooling structures (shown in the figure). Max pooling (the choice used in the paper) is applied to each input, even though each input has a different size, and the resulting features are concatenated into a (16+4+1)×256 feature vector.
Regardless of the size of the input image, the feature is fixed to (16+4+1)×256 dimensions. In this way, no matter how large the candidate region in the image is, the output of the SPP layer is always a (16+4+1)×256 feature vector.
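
This output behaviour is easy to reproduce with adaptive pooling; the following sketch assumes a PyTorch feature map and the 4×4 / 2×2 / 1×1 pyramid described above:

```python
import torch
import torch.nn.functional as F

def spp_layer(feature_map, bins=(4, 2, 1)):
    """Pool a (N, C, H, W) map into fixed 4x4, 2x2 and 1x1 grids and
    flatten, giving a (16+4+1)*C vector regardless of H and W."""
    n = feature_map.shape[0]
    pooled = [
        F.adaptive_max_pool2d(feature_map, output_size=b).reshape(n, -1)
        for b in bins
    ]
    return torch.cat(pooled, dim=1)

# A 256-channel map of any spatial size becomes a fixed
# (16+4+1)*256 = 5376-dimensional vector:
vec = spp_layer(torch.randn(1, 256, 13, 9))
print(vec.shape)  # torch.Size([1, 5376])
```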


The mapping between a candidate region in the original image and the feature map is in fact a receptive-field computation.
In a CNN, the receptive field is the area of a previous layer that corresponds to one element in the output of a given layer; we will not expand on it further here.
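
As a rough illustration of that mapping, a hypothetical helper: with a total stride S from the input image to the feature map, a box is mapped by dividing its coordinates by S (the paper adds small per-corner offsets, omitted here):

```python
def map_roi_to_feature(box, total_stride=16):
    """Map an (x1, y1, x2, y2) box from image coordinates to
    feature-map coordinates, given the network's total stride."""
    return tuple(round(coord / total_stride) for coord in box)

print(map_roi_to_feature((64, 32, 320, 240)))  # (4, 2, 20, 15)
```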


Disadvantages:
The shortcomings of SPP-net are also obvious. The convolutional layers cannot be updated during fine-tuning, and it still follows the R-CNN framework, far from the end-to-end detection we want.
Since end-to-end was so difficult, why not first unify the later modules: remove the SVM and the separate box regressor, and get the class and box directly from the CNN? Hence Fast R-CNN.

1.2.2 PANet


Background:
PANet is an instance segmentation paper from CVPR 2018; the authors are from the Chinese University of Hong Kong, Peking University, SenseTime, and Tencent YouTu.
The full title of the paper is "Path Aggregation Network for Instance Segmentation".
PANet makes many improvements on top of Mask R-CNN; it won the COCO 2017 instance segmentation competition and also ranked second in the object detection competition.
Mask R-CNN is a very simple and effective instance segmentation framework.


The authors point out that information propagation in Mask R-CNN, the state of the art at the time, can be further optimized.
Specifically, low-level features are useful for recognizing large instances, but the path between the highest-level features and the low-level features is very long, which makes it harder to access accurate localization information.
In addition, each proposal is predicted from features pooled at a single feature level, and this assignment is heuristic. Since information discarded at the other levels may be useful for the final prediction, this process has room for optimization.
Finally, mask prediction is performed on only a single field of view, so more diverse information cannot be obtained.


In other words, the network can greatly improve the quality of the predicted masks by accelerating information flow and integrating features from different levels.


Contributions:
PANet as a whole can be seen as a set of improvements to Mask R-CNN that fuse features more fully. Specifically, PANet's contributions can be summarized as follows:
FPN (already existing; not a contribution of the paper)
Bottom-Up Path Augmentation
Adaptive Feature Pooling
Fully-Connected Fusion


Framework diagram:
(a) FPN backbone network
(b) bottom-up path augmentation
(c) adaptive feature pooling
(d) box branch
(e) fully-connected fusion layer
Note: for brevity, the channel dimension of the feature maps in (a) and (b) is omitted.


FPN is top-down, passing the strong semantic features of the high levels downward to enhance the whole pyramid, but it only enhances semantic information and does not transmit localization information. PANet adds a bottom-up pyramid behind FPN, a supplement to FPN that transfers the strong localization features of the low levels upward.
Bottom-up path augmentation shortens the information propagation path and exploits the precise localization information in low-level features.
Adaptive feature pooling lets each proposal use the features of all pyramid levels, avoiding an arbitrary level assignment for proposals, which benefits both classification and localization.
Fully-connected fusion increases the sources of information for mask prediction.


The red arrow in the figure indicates that in FPN's bottom-up process, shallow features must pass through dozens or even hundreds of network layers to reach the top (depending on the backbone used), so a lot of shallow feature information is lost along the way.
The green arrow indicates the Bottom-up Path Augmentation structure the authors added. This structure itself has fewer than 10 layers: shallow features reach P2 through the lateral connections of the original FPN, and then travel from P2 to the top level along the Bottom-up Path Augmentation, passing through fewer than 10 layers in total, which preserves shallow feature information much better. Note that N2 and P2 here are the same feature map, but N3, N4, N5 are not the same as P3, P4, P5; in fact N3, N4, N5 are obtained by fusing P3, P4, P5.


The detailed structure of Bottom-up Path Augmentation is shown in the figure; it is a conventional feature fusion operation. Ni passes through a 3×3 convolution with stride 2, which halves the feature map size; the result is added to the feature map Pi+1; and the sum then passes through a 3×3 convolution with stride 1 to obtain Ni+1.
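
One step of this fusion can be sketched in PyTorch as follows (the channel count and module name are assumptions, and PANet's actual blocks also apply ReLU after each convolution):

```python
import torch.nn as nn

class BottomUpStep(nn.Module):
    """N_{i+1} = conv3x3/s1( conv3x3/s2(N_i) + P_{i+1} )."""
    def __init__(self, channels=256):
        super().__init__()
        # 3x3, stride 2: halves the spatial size of N_i to match P_{i+1}.
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # 3x3, stride 1: refines the fused map into N_{i+1}.
        self.refine = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, n_i, p_next):
        return self.refine(self.down(n_i) + p_next)
```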


In YOLOv4:


The network structure of PANet is shown in the figure. Compared with FPN, a downsampling path is added after the upsampling path, and then the YOLO heads are attached.


The layers corresponding to PANet's upsampling are layer 105 to layer 128; starting from layer 132 come the downsampling path and the YOLOv3 head.


The method the original PANet uses for fusion is addition.
Here, YOLOv4 changes the fusion method from addition to concatenation, without explaining the reason in detail; in yolov4.cfg, a route layer is used to link the two parts of the features.
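
The difference between the two fusion styles is small in code; an illustrative comparison, assuming two feature maps `a` and `b` with the same spatial size:

```python
import torch

a = torch.randn(1, 256, 52, 52)
b = torch.randn(1, 256, 52, 52)

fused_add = a + b                     # original PANet shortcut: channels unchanged
fused_cat = torch.cat([a, b], dim=1)  # darknet route layer: channels stacked (512),
                                      # typically followed by a conv that fuses them
```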

1.3 Head: YOLOv3

The head is unchanged from YOLOv3's head.

2 YOLOv4 Uses

2.1 Bag of Freebies (BoF)

2.1.1 Backbone:

  1. CutMix
  2. Mosaic data augmentation
  3. DropBlock regularization
  4. Class label smoothing (a minimal sketch follows this list)
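
A minimal sketch of class label smoothing: each one-hot target is softened so that the true class keeps most of the probability mass and the rest is spread over all classes:

```python
import torch

def smooth_labels(one_hot, eps=0.1):
    """Soften one-hot targets: true class -> 1 - eps + eps/K, rest -> eps/K."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

print(smooth_labels(torch.tensor([0.0, 0.0, 1.0, 0.0])))
# tensor([0.0250, 0.0250, 0.9250, 0.0250])
```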

2.1.2 Detector:

  1. CIoU-loss
  2. CmBN
  3. DropBlock regularization
  4. Mosaic data augmentation
  5. Self-Adversarial Training
  6. Eliminate grid sensitivity
  7. Using multiple anchors for a single ground truth
  8. Cosine annealing scheduler (usage sketch after this list)
  9. Optimal hyperparameters
  10. Random training shapes
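
A usage sketch of cosine annealing in PyTorch; the stand-in model, learning rate, and epoch count are placeholders, not YOLOv4's actual training settings:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training would go here ...
    scheduler.step()  # lr follows a cosine curve from 0.01 toward 0
```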

2.2 Bag of Specials (BoS)

2.2.1 Backbone:

  1. Mish activation (see the sketch after this list)
  2. Cross-stage partial connections (CSP)
  3. Multi-input weighted residual connections (MiWRC)
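
Mish has a simple closed form, x · tanh(softplus(x)); a one-line sketch:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-1.0, 0.0, 1.0])))
```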

2.2.2 Detector:

  1. Mish activation
  2. SPP-block
  3. PAN path-aggregation block
  4. DIoU-NMS

3 References

[1] YOLOv4: Optimal Speed and Accuracy of Object Detection. https://arxiv.org/abs/2004.10934
[2] https://github.com/AlexeyAB/darknet
[3] CSPNet: A New Backbone that can Enhance Learning Capability of CNN. https://arxiv.org/pdf/1911.11929.pdf
[4] https://github.com/WongKinYiu/CrossStagePartialNetworks
[5] Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. https://arxiv.org/pdf/1406.4729.pdf
[6] Path Aggregation Network for Instance Segmentation. https://arxiv.org/pdf/1803.01534.pdf
