Introduction to YOLOv3

1. Prediction part

1. Darknet-53

The backbone feature-extraction network of YOLOv3 is Darknet-53. Compared with Darknet-19 from the YOLOv2 era, it deepens the network and introduces a residual structure: the network is built up by repeatedly stacking 1×1 convolutions, 3×3 convolutions, and residual (shortcut) connections. Residual networks are easy to optimize and can gain accuracy from considerably increased depth.
As shown in the figure, a Conv block consists of a 2D convolution layer, a Batch Normalization layer, and a LeakyReLU activation layer.
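The residual structure mentioned above can be sketched in a few lines of NumPy. This is a toy illustration, not Darknet-53 itself: the lambda stands in for the 1×1/3×3 convolution stack, and the point is only that the block computes $f(x) + x$, so the layers need to learn just the residual on top of an identity path.

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: output = f(x) + x, so f only learns the residual."""
    return f(x) + x

x = np.random.randn(8)
f = lambda v: 0.1 * v          # stand-in for the 1x1 + 3x3 conv stack
y = residual_block(x, f)
print(np.allclose(y, 1.1 * x))  # the identity path passes x through unchanged
```

Because the identity path always passes gradients through unchanged, stacking many such blocks stays optimizable even at great depth.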
BN layer:
The problem BN solves: as a neural network gets deeper, training becomes harder and convergence slows down. BN uses a standardization step to pull the distribution of each layer's inputs back toward a standard normal distribution with mean 0 and variance 1. Intuitively, it forces an increasingly skewed distribution back toward a standard one, so neuron outputs do not blow up and the activation function's inputs fall in its more sensitive region. The resulting larger gradients avoid the vanishing-gradient problem, and larger gradients mean faster convergence, which greatly speeds up training.
Formally, let $\mathbf{x} \in \mathcal{B}$ denote an input drawn from a mini-batch $\mathcal{B}$. Batch normalization $\mathrm{BN}$ transforms $\mathbf{x}$ as:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.$$

Here $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean of the mini-batch $\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is its sample standard deviation. Since forcing unit variance is a somewhat arbitrary choice, a learned scale parameter $\boldsymbol{\gamma}$ and shift parameter $\boldsymbol{\beta}$ are included; both have the same shape as $\mathbf{x}$.

Note that $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are parameters learned jointly with the other model parameters. $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ are given by:

$$\begin{aligned} \hat{\boldsymbol{\mu}}_\mathcal{B} &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x}\in \mathcal{B}} \mathbf{x},\\ \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon.\end{aligned}$$

A small constant $\epsilon > 0$ is added to the variance estimate to guarantee that division by zero never occurs, even when the empirical variance vanishes. The estimates $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ counteract the scaling issue by using noisy estimates of the mean and variance.
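The training-time computation in the equations above can be sketched in NumPy as follows (a minimal sketch for a 2D (batch, features) input; real BN layers also track running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization (training mode), per the equations above.

    x: (batch, features); gamma, beta: learned scale/shift of shape (features,).
    """
    mu = x.mean(axis=0)                 # sample mean over the mini-batch
    var = x.var(axis=0) + eps           # sample variance, plus eps for stability
    x_hat = (x - mu) / np.sqrt(var)     # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta         # learned stretch (gamma) and shift (beta)

x = np.random.randn(32, 4) * 5.0 + 3.0  # inputs with shifted, spread-out distribution
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # close to 0 in every feature
print(y.std(axis=0))   # close to 1 in every feature
```

With $\boldsymbol{\gamma} = 1$ and $\boldsymbol{\beta} = 0$, the output is pulled back to approximately zero mean and unit variance, exactly the "forcing back to a standard distribution" described above.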

LeakyReLU:

LeakyReLU is similar to ReLU; the difference is that ReLU has zero gradient when the input is negative, while LeakyReLU keeps a small nonzero gradient there.
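The difference is easy to see numerically (the slope 0.1 below is the value commonly used in YOLOv3 implementations; other implementations may differ):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # negative inputs clamp to 0 (zero gradient)

def leaky_relu(x, alpha=0.1):             # alpha: small slope for negative inputs
    return np.where(x > 0, x, alpha * x)  # negative inputs keep a nonzero gradient

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [ 0.    0.    0.    1.5 ]
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]
```

Because the negative side is not flat, a neuron pushed into the negative region can still receive gradient and recover, avoiding the "dying ReLU" failure mode.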

2. FPN:


YOLOv3 uses an FPN (feature pyramid network) for feature extraction, which yields enhanced features at three scales: $(13, 13, 1024)$, $(26, 26, 256)$, and $(52, 52, 128)$. The first feature map is downsampled 32×, the second 16×, and the third 8× relative to the input. These three enhanced features are then passed into the YOLO Head to obtain the prediction results.

Up2D is the upsampling layer; it generates larger feature maps from smaller ones via interpolation (or similar methods) for feature fusion.

The concat operation comes from the design idea of the DenseNet network: feature maps are spliced directly along the channel dimension. As shown in the figure, the channel count of a concat layer's output is the sum of the channel counts of the input to the left of the horizontal arrow and the input below the vertical arrow. This is completely different from the residual sum operation $y = f(x) + x$.
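The upsample-then-concat step can be sketched in NumPy. The channel counts below are illustrative, not the exact ones from the figure; the point is that concat sums channels while leaving spatial size unchanged:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling: (H, W, C) -> (2H, 2W, C)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

deep = np.random.randn(13, 13, 256)      # small, semantically deep feature map
shallow = np.random.randn(26, 26, 512)   # larger, shallower feature map

up = upsample2x(deep)                            # (26, 26, 256)
fused = np.concatenate([up, shallow], axis=-1)   # channel-wise concat
print(fused.shape)                               # (26, 26, 256 + 512) = (26, 26, 768)
```

A residual sum $y = f(x) + x$ would instead require both inputs to have identical shapes and would keep the channel count fixed.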

The FPN feature pyramid uses small feature maps to detect large objects and large feature maps to detect small objects (intuitively: the smaller the feature map, the deeper it sits in the network and the larger the image region each position sees, so the larger the objects it can detect).

Feature pyramid can fuse feature layers of different shapes, which is beneficial to extracting better features.

3. Decoding prediction results

The YOLO Head is essentially a $3 \times 3$ convolution followed by a $1 \times 1$ convolution. The former performs feature extraction, and the latter adjusts the number of channels to 75, where $75 = 3 \times (1 + 4 + 20)$: 3 means each position carries three anchor boxes, 1 indicates whether the anchor box contains an object, 4 are the anchor box adjustment parameters, and 20 is the number of categories in the VOC dataset. (This assumes YOLOv3 is trained on the VOC dataset.)

After passing through the YOLO Head, prediction results are obtained for the three enhanced feature layers:

  • $(batch\_size, 13, 13, 75)$
  • $(batch\_size, 26, 26, 75)$
  • $(batch\_size, 52, 52, 75)$

Each feature layer divides the input image into a grid matching its own size. For example, $(13, 13, 75)$ divides the original image into a $13 \times 13$ grid, and 3 anchor boxes are built at each grid cell. These boxes are preset by the network; the network's predictions determine whether each box contains an object and which class the object belongs to. The results above can therefore be reshaped as:

  • $(batch\_size, 13, 13, 3, 25)$
  • $(batch\_size, 26, 26, 3, 25)$
  • $(batch\_size, 52, 52, 3, 25)$

Among them, the 25 values represent x_offset, y_offset, h, w, confidence, and the 20 class scores.
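The reshape is a pure view change; no values are modified. A minimal NumPy sketch for the $13 \times 13$ layer (batch size 2 chosen arbitrarily):

```python
import numpy as np

batch_size, num_anchors, num_classes = 2, 3, 20
raw = np.random.randn(batch_size, 13, 13, num_anchors * (5 + num_classes))  # (2, 13, 13, 75)
pred = raw.reshape(batch_size, 13, 13, num_anchors, 5 + num_classes)        # (2, 13, 13, 3, 25)
# last axis per anchor: [x_offset, y_offset, h, w, confidence, 20 class scores]
print(pred.shape)
```

Splitting the 75 channels into an explicit anchor axis makes the per-anchor decoding that follows straightforward to index.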

However, this predicted value is not the final prediction result and needs to be decoded.

Anchor box decoding :

  1. Each grid point's coordinates are added to the corresponding x_offset and y_offset; the result is the center of the prediction box.
  2. The anchor box dimensions are then combined with h and w to compute the prediction box's width and height, giving the full position of the prediction box.
    As shown in the figure, $\sigma(t_x)$ and $\sigma(t_y)$ are the offsets of the box center relative to the grid point at the top-left corner of its cell; $\sigma$ is the sigmoid activation function; $p_w$ and $p_h$ are the width and height of the anchor box.
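The two decoding steps can be written out directly from those symbols; this is a sketch of the standard YOLOv3 decoding equations for a single box:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw outputs (tx, ty, tw, th) into a box in grid units.

    (cx, cy): top-left coordinates of the grid cell; (pw, ph): anchor width/height.
    """
    bx = cx + sigmoid(tx)   # sigmoid keeps the center offset inside the cell, in (0, 1)
    by = cy + sigmoid(ty)
    bw = pw * np.exp(tw)    # anchor size scaled by exp of the raw prediction
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# all-zero raw outputs land at the cell's center with the unscaled anchor size:
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=3.0, ph=4.0))  # bx=6.5, by=6.5, bw=3.0, bh=4.0
```

The sigmoid on the center offsets is what pins each prediction to its own grid cell, while the exponential lets the width and height grow or shrink relative to the anchor without ever going negative.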

Confidence decoding :

The confidence occupies one fixed slot of the 25 dimensions and is mapped through a sigmoid so that the result lies in [0, 1].

Category decoding :

The class scores occupy 20 of the 25 dimensions, and each dimension independently represents one category.

  1. Sigmoid is used instead of softmax, removing the mutual exclusion between categories and making the network more flexible.

  2. The three feature maps can decode a total of $13 \times 13 \times 3 + 26 \times 26 \times 3 + 52 \times 52 \times 3 = 10647$ boxes, with their corresponding categories and confidences. These 10647 boxes are used differently during training and inference:

    1. During training, all 10647 boxes are sent to the label-assignment function to compute labels and, from them, the loss.

    2. At inference time, a confidence threshold is chosen, low-confidence boxes are filtered out, and NMS (non-maximum suppression) is applied to produce the network's final predictions.

      Non-maximum suppression, simply understood: when there are many anchor boxes, the network may output many heavily overlapping predicted bounding boxes that all surround the same object. To simplify the output, NMS keeps only the most confident of the similar predicted boxes belonging to the same object and suppresses the rest.

After decoding, the final predicted box position and category can be obtained and drawn in the original image.

2. Training part

1. Training strategy:

  1. There are three types of prediction boxes: positive, negative and ignore.

  2. Positive example: for each ground truth (label), compute the IOU with all 10647 boxes; the box with the largest IOU is the positive example. A box can be assigned to only one ground truth: if the first ground truth has already matched a box, the next ground truth looks for the largest-IOU box among the remaining 10646. The order in which ground truths are processed can be ignored. Positive examples produce confidence loss, box loss, and class loss. The box label is the corresponding ground-truth box (reverse encoding is required: use the real x, y, w, h to compute $t_x, t_y, t_w, t_h$); the label for the matching category is 1 and the rest are 0; the confidence label is 1.

  3. Ignored samples: apart from the positive examples, any box whose IOU with some ground truth exceeds the threshold (0.5 in the paper) is ignored. Ignored samples produce no loss.

    Rationale: since YOLOv3 uses multi-scale feature maps, detections from feature maps of different scales can overlap. For example, suppose a real object is assigned during training to the third box of feature map 1, with an IOU of 0.98, while the first box of feature map 2 also reaches an IOU of 0.95 with the same ground truth; it, too, has detected the object. If its confidence were forcibly labeled 0, the network would learn poorly.

  4. Negative example: a box whose IOU with every ground truth is below the threshold (0.5) is a negative example. (The exception is a box that has the largest IOU with some ground truth: it remains a positive example even if that IOU is below the threshold.) Negative examples produce only confidence loss, with confidence label 0.
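The three-way assignment above can be sketched as follows. This is a simplified illustration with a hypothetical helper `assign_labels` operating on a precomputed IOU matrix; real implementations work per feature map and handle ties and encoding as well:

```python
import numpy as np

IOU_THRESH = 0.5

def assign_labels(iou_matrix):
    """Label each predicted box: positive (1), ignored (-1), or negative (0).

    iou_matrix: (num_gt, num_boxes) IOUs between ground truths and predictions.
    """
    num_gt, num_boxes = iou_matrix.shape
    labels = np.zeros(num_boxes, dtype=int)          # default: negative
    # ignored: high overlap with some ground truth, but not the best match
    labels[(iou_matrix > IOU_THRESH).any(axis=0)] = -1
    assigned = set()
    for g in range(num_gt):
        order = np.argsort(iou_matrix[g])[::-1]      # boxes by descending IOU
        best = next(b for b in order if b not in assigned)
        labels[best] = 1                             # positive even if IOU < threshold
        assigned.add(best)
    return labels

ious = np.array([[0.9, 0.7, 0.1, 0.0],    # ground truth 0 vs 4 boxes
                 [0.1, 0.2, 0.6, 0.3]])   # ground truth 1 vs 4 boxes
print(assign_labels(ious))  # [ 1 -1  1  0]
```

Box 1 overlaps ground truth 0 above the threshold but is not its best match, so it is ignored rather than punished as a negative, which is exactly the rationale given above.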

2. Loss function

The final loss consists of three components:

  1. For positive examples: the difference between the predicted box and the encoded ground truth in width, height, and x/y offsets.
  2. For positive examples, the predicted confidence is compared against 1; for negative examples, against 0.
  3. For boxes that actually contain objects, the predicted class scores are compared against the true classes.

The final loss is the sum of the above three losses.

Among them, x, y, w, and h use MSE (or smooth L1 loss) as the loss function; the confidence and class labels use cross-entropy, since they are 0/1 binary-classification targets.
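A minimal NumPy sketch of how the three components combine, using MSE for the box terms and binary cross-entropy for confidence and classes. The sample values are made up for illustration, and the per-term weighting used in real YOLOv3 implementations is omitted:

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def bce(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# positive example: box regression + confidence-vs-1 + class loss
box_loss  = mse(np.array([0.5, 0.5, 0.2, 0.3]), np.array([0.6, 0.4, 0.25, 0.3]))
conf_loss = bce(np.array([0.8]), np.array([1.0]))
cls_loss  = bce(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
# negative example: only a confidence loss, with label 0
neg_conf  = bce(np.array([0.2]), np.array([0.0]))

total = box_loss + conf_loss + cls_loss + neg_conf   # final loss: sum of the parts
print(total)
```

Ignored samples simply contribute nothing to any of the terms.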

Origin blog.csdn.net/qq_44733706/article/details/129061237