YOLOv3 Summary

Network structure

Compared with the backbone network of YOLOv2, YOLOv3 makes major improvements. Borrowing the idea of residual networks, YOLOv3 upgrades the original Darknet-19 to Darknet-53. The paper gives the overall structure as follows:

Darknet-53 is mainly composed of 1×1 and 3×3 convolutional layers. Each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation; batch normalization regularizes the network and helps prevent overfitting, while Leaky ReLU supplies the nonlinearity. Together, the convolutional layer, the batch normalization layer, and the Leaky ReLU form the basic convolution unit of Darknet-53, called DBL. The network is named Darknet-53 because it contains 53 such convolutional layers.
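As a concrete illustration, here is a minimal PyTorch sketch of the DBL unit. The class name and the 0.1 negative slope follow common open-source YOLOv3 implementations, not the paper itself:

```python
import torch.nn as nn

class DBL(nn.Module):
    """Basic Darknet-53 unit: Conv2d + BatchNorm2d + LeakyReLU."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)  # bias is redundant before BN
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)  # darknet uses a 0.1 negative slope

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```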

To see the network structure of Darknet-53 more clearly, look at the following diagram:

To make this diagram easier to read, the main units are explained below:

  • DBL: the basic convolution unit, composed of a convolutional layer, a batch normalization layer, and a Leaky ReLU.
  • res unit: the input passes through two DBLs and is then added to the original input; this is a standard residual unit. Its purpose is to let the network extract deeper features while avoiding vanishing or exploding gradients (see the sketch after this list).
  • resn: n denotes the number of res units, so resn = Zero Padding + DBL + n × res unit.
  • concat: tensor concatenation of a middle layer of Darknet-53 with the upsampled output of a later layer, achieving multi-scale feature fusion. This differs from the add operation of the residual layer: concatenation expands the channel dimension of the tensor, while add leaves the tensor shape unchanged.
  • Y1, Y2, Y3: the outputs of YOLOv3's three detection scales.
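A minimal sketch of the res unit, reusing the DBL class above; the 1×1-then-3×3 channel pattern follows common Darknet-53 implementations:

```python
class ResUnit(nn.Module):
    """Residual unit: two DBLs (1x1 halves the channels, 3x3 restores them)
    plus a skip connection that adds the original input."""
    def __init__(self, channels):
        super().__init__()
        self.dbl1 = DBL(channels, channels // 2, kernel_size=1)
        self.dbl2 = DBL(channels // 2, channels, kernel_size=3)

    def forward(self, x):
        return x + self.dbl2(self.dbl1(x))  # add: the tensor shape is unchanged
```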

Compared with Darknet-19, Darknet-53 mainly makes the following improvements:

  • Downsampling uses convolutional layers with a stride of 2 instead of max pooling layers.
  • Each convolutional layer is followed by a BN layer and a Leaky ReLU; batch normalization regularizes the network and helps prevent overfitting.
  • The idea of residual networks is introduced so that the network can extract deeper features while avoiding vanishing or exploding gradients.
  • A middle layer of the network is concatenated with the upsampled output of a later layer to achieve multi-scale feature fusion (see the sketch after this list).
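The following sketch illustrates the first and last points with made-up tensor sizes: concatenation grows the channel dimension, add does not, and a stride-2 convolution halves the spatial resolution just as max pooling would:

```python
import torch
import torch.nn as nn

deep = torch.randn(1, 256, 13, 13)          # a deep, coarse feature map
up = nn.Upsample(scale_factor=2)(deep)      # -> (1, 256, 26, 26)
mid = torch.randn(1, 512, 26, 26)           # a mid-level feature map

fused = torch.cat([mid, up], dim=1)         # concat: channel counts add up
print(fused.shape)                          # torch.Size([1, 768, 26, 26])

skip = mid + torch.randn(1, 512, 26, 26)    # add: shapes must match and stay unchanged
print(skip.shape)                           # torch.Size([1, 512, 26, 26])

downsample = nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1)
print(downsample(mid).shape)                # torch.Size([1, 1024, 13, 13])
```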

Improvements

The biggest improvement in YOLOv3 is the network structure, which has already been covered above. The following therefore focuses on the other improvements:

(1) Multi-scale prediction
To predict targets at multiple scales, YOLOv3 uses three output scales, each with three anchor sizes, for a total of 9 anchors of different sizes, obtained by k-means clustering as in YOLOv2. The 9 anchor sizes selected on the COCO data set are listed below.
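For reference, these are the 9 anchor sizes (width × height, in pixels) reported in the YOLOv3 paper for COCO; the grouping by output scale follows common open-source implementations:

```python
# 9 anchors from the YOLOv3 paper, obtained by k-means clustering on COCO
anchors = {
    "Y1 (13x13 grid, large objects)":  [(116, 90), (156, 198), (373, 326)],
    "Y2 (26x26 grid, medium objects)": [(30, 61), (62, 45), (59, 119)],
    "Y3 (52x52 grid, small objects)":  [(10, 13), (16, 30), (33, 23)],
}
```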

Drawing on the idea of the feature pyramid network (FPN), YOLOv3 designs three network outputs of different scales, Y1, Y2, and Y3, in order to predict targets of different sizes. Each grid cell at each scale is responsible for predicting 3 bounding boxes, and the COCO data set has 80 classes, so the tensor output by the network is N × N × [3 × (4 + 1 + 80)]. The three branches downsample the input by different factors, giving different values of N: with a 416 × 416 input, the final shapes of Y1, Y2, and Y3 are [13, 13, 255], [26, 26, 255], and [52, 52, 255].
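A quick check of the shape arithmetic, assuming the standard 416×416 input:

```python
num_anchors = 3          # boxes predicted per grid cell at each scale
num_classes = 80         # COCO
channels = num_anchors * (4 + 1 + num_classes)   # 4 box coords + 1 objectness + classes
print(channels)          # 255

for name, stride in [("Y1", 32), ("Y2", 16), ("Y3", 8)]:
    n = 416 // stride    # grid size for a 416x416 input
    print(f"{name}: [{n}, {n}, {channels}]")
# Y1: [13, 13, 255]
# Y2: [26, 26, 255]
# Y3: [52, 52, 255]
```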

(2) Loss function
For a neural network, the design of the loss function is also very important. The YOLOv3 paper, however, does not give the loss function explicitly; it has to be reconstructed from an analysis of the source code.
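One commonly cited reconstruction is shown below. The notation follows YOLOv1 ($S^2$ grid cells, $B$ boxes per cell, $\mathbb{1}_{ij}^{\text{obj}}$ selects the box responsible for an object); treat it as a sketch of the source-code logic rather than an official formula:

$$
\begin{aligned}
\text{Loss} = {} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \\
& - \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right] \\
& - \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right] \\
& - \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left[ \hat{p}_i(c) \log p_i(c) + (1 - \hat{p}_i(c)) \log (1 - p_i(c)) \right]
\end{aligned}
$$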

Compared with the loss function of YOLOv1, the localization loss is unchanged: it still uses sum of squared errors. The confidence and class-prediction losses, however, change from sum of squared errors to cross entropy. For class and confidence prediction, cross entropy should give better results.
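A minimal sketch of the difference for the confidence term (the tensors here are made up; `F.mse_loss` and `F.binary_cross_entropy` are standard PyTorch functions):

```python
import torch
import torch.nn.functional as F

pred_conf = torch.sigmoid(torch.randn(8))               # predicted objectness in (0, 1)
target = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.]) # ground-truth objectness

mse = F.mse_loss(pred_conf, target)              # YOLOv1-style squared error (mean form)
bce = F.binary_cross_entropy(pred_conf, target)  # YOLOv3-style cross entropy
```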

(3) Multi-label classification

In class prediction, YOLOv3 upgrades YOLOv2's single-label classification to multi-label classification. In the network structure, the softmax layer that YOLOv2 used for classification is replaced with independent logistic classifiers. In YOLOv2, the algorithm assumes that a target belongs to exactly one category and assigns it to the class with the highest output score. In some complex scenes, however, a single target may belong to multiple categories.

For example, in a traffic scene a target may belong to both the car and the truck categories. With softmax classification, softmax assumes the target belongs to only one category, so the target is identified as either a car or a truck; this is called single-label classification. If the network output can mark the target as both a car and a truck, this is called multi-label classification.

To achieve multi-label classification, a logistic classifier performs an independent binary classification for each category. The logistic classifier uses the sigmoid function, which constrains each output to the range 0 to 1. If the sigmoid output for a class exceeds a set threshold, the target in the corresponding bounding box is judged to belong to that class.
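A minimal sketch of this thresholding, with made-up logits for three classes:

```python
import torch

labels = ["car", "truck", "person"]
class_logits = torch.tensor([2.2, 1.5, -3.0])  # raw network scores for one box (made up)
probs = torch.sigmoid(class_logits)            # independent per-class probabilities

threshold = 0.5
predicted = [l for l, p in zip(labels, probs) if p > threshold]
print(predicted)  # ['car', 'truck'] -- one box can carry multiple labels
```

A softmax over the same logits would force a single winner; the independent sigmoids are what let both "car" and "truck" pass the threshold.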

