OpenCV example (9): Moving target detection based on deep learning (2): YOLOv2 overview

1. Overview of YOLOv2

To address the shortcomings of YOLO, YOLOv2 was introduced. It optimizes the model mainly through the following methods:

(1) Use Batch Normalization to normalize the input of each convolutional layer in the model. This alleviates vanishing gradients, speeds up convergence, shortens training time, and improves the mean average precision (mAP) of detection.

(2) Add the anchor mechanism. k-means clustering over the bounding-box labels of the training set produces several anchors of different sizes. YOLOv2 removes the fully connected layers and the last pooling layer of the YOLO network to raise the resolution of the feature map, and applies the anchors to the output of the last convolutional layer to improve IoU. During training, anchors are preset on each grid cell, and the loss function is computed relative to these anchors.

(3) Propose a new backbone network: Darknet-19. Darknet-19 is a fully convolutional network. Compared with the main structure of YOLO, it replaces the fully connected layer with an average pooling layer, which helps retain the spatial position information of the target.

(4) Use an optimized direct location prediction method. Based on the preset anchors, the network predicts the boxes for each grid cell on the feature map output by its last convolutional layer. It first predicts five values, tx, ty, tw, th, and to, and then computes the position and confidence of the predicted box from them.
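The decoding step in item (4) can be sketched as follows. This is a minimal illustration of the formulas given in the YOLOv2 paper (sigmoid offsets for the centre, exponential scaling of the anchor size, sigmoid for the confidence); the function and parameter names (`decode_box`, `cell_xy`, `anchor_wh`) are hypothetical and chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t, cell_xy, anchor_wh):
    """Turn raw outputs (tx, ty, tw, th, to) into a box and a confidence.

    cell_xy   : (cx, cy), top-left offset of the grid cell, in cell units
    anchor_wh : (pw, ph), anchor width and height, in cell units
    """
    tx, ty, tw, th, to = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    bx = sigmoid(tx) + cx          # centre is constrained to lie inside the cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)           # anchor size scaled exponentially
    bh = ph * np.exp(th)
    conf = sigmoid(to)             # confidence squashed into (0, 1)
    return bx, by, bw, bh, conf

# All-zero outputs give a box centred in the cell with the anchor's own size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell_xy=(6, 4), anchor_wh=(3.0, 2.0))
```

Because the sigmoid bounds the centre offset to the owning cell, each cell can only predict boxes centred inside itself, which is what makes this parameterisation more stable to train than an unconstrained offset.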

With these improvements, YOLOv2 is significantly better than YOLO in average detection accuracy and in training and detection speed. Since it is an intermediate version, a general understanding of it is enough here.
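The anchor clustering in item (2) can be sketched as below. This is only an illustration under stated assumptions: it uses 1 − IoU between box sizes as the distance, as described in the YOLOv2 paper, and runs on synthetic (width, height) labels rather than a real training set. The helper names `iou_wh` and `kmeans_anchors` are hypothetical.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster label box sizes into k anchors using d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Smallest distance 1 - IoU is the same as largest IoU.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Sort by area so anchors run from small to large.
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]

# Synthetic (width, height) labels: illustrative data only.
rng = np.random.default_rng(1)
boxes = np.vstack([
    rng.normal([20, 30], 2, size=(50, 2)),
    rng.normal([80, 60], 4, size=(50, 2)),
    rng.normal([200, 180], 8, size=(50, 2)),
])
anchors = kmeans_anchors(boxes, k=3)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since the error is measured relative to box size.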

2. Overview of YOLOv3

To improve performance further, YOLOv3 was proposed. Compared with the previous two versions, YOLOv3 substantially changes the classification method and the network structure, as follows:

2.1 New basic network structure: Darknet-53

Darknet-53 has a total of 75 layers built from a series of 3×3 and 1×1 convolutions. It includes 53 convolutional layers, and the rest are res layers, which borrow the idea of ResNet (Residual Network) and use skip connections to further optimize network performance. The network structure of Darknet-53 is shown in the figure.

[Figure: network structure of Darknet-53]

In deep learning, the deeper the network, the more easily gradients vanish, which leads to network degradation. Even with methods such as Batch Normalization, the result is still not ideal. In 2015, Kaiming He and colleagues proposed ResNet, which won the ILSVRC competition that year. The main idea of ResNet is to add a "shortcut connection" to the network structure that passes the original output of one layer directly to a later layer. This skip-connection structure reduces the loss of original information during propagation and, to a certain extent, alleviates the vanishing-gradient problem in deep neural networks. The principle of ResNet is shown in the figure.

[Figure: principle of the ResNet residual connection]

In ResNet, let $x_l$ and $x_{l+1}$ denote the input and output of layer $l$, $W_l$ the weights of layer $l$, and $F$ the residual function of that layer. The relationship between $x_l$ and $x_{l+1}$ is then $x_{l+1} = x_l + F(x_l, W_l)$. Applying this relation recursively up to a deeper layer $L$, the input $x_L$ of layer $L$ can be expressed in terms of $x_l$ as:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

Thus, by the chain rule, the gradient of the loss function during the backward pass is:

$$\frac{\partial\,\text{loss}}{\partial x_l} = \frac{\partial\,\text{loss}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial\,\text{loss}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)$$

Of the two terms in the parentheses, the 1 ensures that the gradient can propagate back without loss, while the size of the second term is determined by the network weights. However small the second term is, the sum does not vanish, so the vanishing-gradient problem does not arise. ResNet therefore makes it easier to learn from the original input information accurately.

By introducing res layers, Darknet-53 divides the whole network into several small ResNet structural units and controls gradient propagation by learning the residuals step by step, which alleviates vanishing gradients during training.
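The effect of the shortcut on the gradient can be checked numerically with a toy example. The sketch below is an assumption-laden illustration, not Darknet-53 itself: it stacks scalar layers with a hypothetical residual function F(x) = w·tanh(x) and compares the end-to-end derivative with and without the shortcut. With the shortcut, each layer contributes a factor (1 + dF/dx), and the product of these factors is the same quantity as the parenthesized term 1 + ∂/∂x_l ΣF in the ResNet gradient.

```python
import numpy as np

def chain_gradient(depth, w, x0, residual):
    """Return d x_depth / d x_0 for a stack of scalar layers.

    Each layer computes F(x) = w * tanh(x); with residual=True the layer
    output is x + F(x), otherwise it is F(x) alone.
    """
    x, grad = x0, 1.0
    for _ in range(depth):
        f = w * np.tanh(x)
        df = w * (1.0 - np.tanh(x) ** 2)    # dF/dx at the current input
        if residual:
            grad *= 1.0 + df                 # the "1" comes from the shortcut path
            x = x + f
        else:
            grad *= df
            x = f
    return grad

plain = chain_gradient(depth=50, w=0.5, x0=1.0, residual=False)
skip = chain_gradient(depth=50, w=0.5, x0=1.0, residual=True)
```

In this toy setting every plain-layer factor is at most 0.5, so `plain` shrinks geometrically with depth, while every residual factor is at least 1, so `skip` cannot fall below 1.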

2.2 Multi-scale prediction mechanism

YOLOv3 retains the anchor mechanism of YOLOv2 and uses k-means to cluster 9 anchors of different sizes. To make full use of these anchors, YOLOv3 refines the grid division further and distributes the anchors evenly across 3 scales according to their size.

· scale1: Six convolutional layers are added after Darknet-53 to produce the feature map used for detection directly. For a 416×416 input its dimension is 13×13×(B×(5+C)), where B is the number of anchors per grid cell at this scale and C is the number of classes. It corresponds to the 3 largest anchors and suits large-target detection.

· scale2: The output of layer 79 of the network is upsampled to 26×26 and merged with the feature map output by layer 61. After a series of convolutions, the resulting 26×26×(B×(5+C)) feature map corresponds to the 3 medium-sized anchors and suits medium-target detection.

· scale3: The output of layer 91 of the network is upsampled to 52×52 and merged with the feature map output by layer 36. After a series of convolutions, the resulting 52×52×(B×(5+C)) feature map corresponds to the 3 smallest anchors and suits small-target detection.

Through these improvements, YOLOv3 detects small targets significantly better than YOLOv2.
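The arithmetic behind the three scales can be sketched as follows. The sketch assumes a 416×416 input, 80 classes (as in COCO), strides of 32, 16 and 8 for scale1 to scale3, and a channel depth of B×(5+C) with B = 3 anchors per scale; the anchor sizes are those listed for COCO in the YOLOv3 paper. The function names are hypothetical.

```python
def yolo_v3_output_shapes(input_size=416, num_anchors_per_scale=3, num_classes=80):
    """Grid size and channel depth of the three YOLOv3 detection outputs."""
    depth = num_anchors_per_scale * (5 + num_classes)   # B x (5 + C)
    strides = {"scale1": 32, "scale2": 16, "scale3": 8}
    return {name: (input_size // s, input_size // s, depth)
            for name, s in strides.items()}

def split_anchors(anchors):
    """Assign 9 anchors to the scales: largest to scale1, smallest to scale3."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    return {"scale3": ordered[0:3], "scale2": ordered[3:6], "scale1": ordered[6:9]}

# Anchor sizes in pixels (width, height), as listed for COCO in the YOLOv3 paper.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

shapes = yolo_v3_output_shapes()
groups = split_anchors(ANCHORS)
```

The coarsest grid (13×13) receives the largest anchors because each of its cells covers the largest image region, while the finest grid (52×52) receives the smallest anchors.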

2.3 Classification using simple logistic regression

The classification loss function uses binary cross-entropy loss, and softmax is no longer used for classification. Softmax assumes that each predicted bounding box belongs to exactly one class, the one with the highest score, but in many cases (especially when detecting multiple occluded or overlapping objects) this assumption does not hold, so softmax is not suitable.
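The difference between the two approaches can be seen in a small numeric sketch. The logits and the class names below are hypothetical, chosen so that two labels are true for the same box; the code is an illustration of independent logistic classifiers with binary cross-entropy, not the Darknet implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean of independent per-class logistic losses."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

# Hypothetical logits for one box over classes [person, woman, car].
logits = np.array([4.0, 3.5, -5.0])
targets = np.array([1.0, 1.0, 0.0])     # two labels are true at once

p_sigmoid = sigmoid(logits)    # each class is scored independently
p_softmax = softmax(logits)    # scores compete and must sum to 1
loss = binary_cross_entropy(logits, targets)
```

With independent sigmoids both true classes can receive a high score at the same time, whereas softmax forces them to share a total probability of 1, so at most one of them can exceed 0.5.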

Through continuous improvement and innovation, YOLOv3 brought the performance of the regression-based YOLO series to a peak, balancing real-time speed and detection accuracy as far as possible. For application fields that place high demands on both, such as real-time detection and tracking of dangerous objects and the acquisition of environmental information for autonomous driving, it provides a reliable model with considerable reference and research value.

Origin blog.csdn.net/qq_41600018/article/details/132395272