YOLOv3 Network Model

Table of contents

Materials

Network Model Principles

Network Framework

Prior Box

Prior Box Calculation

Training

Loss Function

Confidence

Object Conditional Class Probability

Discussion

Materials

Paper address: https://arxiv.org/abs/1804.02767

Code: https://github.com/ultralytics/yolov3

 

Network Model Principles

Network Framework

 

As shown in the upper-left figure, Darknet-53 is the network structure proposed in the paper, and it can serve as the backbone of the detection model. Compared with Darknet-19, the network is deeper and introduces the cross-layer addition (residual) operation from ResNet. As shown in the upper-right figure, the accuracy of Darknet-53 on ImageNet is nearly 3 points higher than that of Darknet-19 and comparable to ResNet-101 and ResNet-152, while its computational cost and FPS are significantly better than both.

 

 

The Backbone of YOLOv3 in the figure above adopts the Darknet-53 structure. The Neck fuses high-level features with low-level features: during fusion, upsampling doubles the length and width of the high-level feature map so that it matches the size of the low-level feature map, allowing the two to be concatenated. The output layer produces features at three scales; the small feature map is used to detect large objects, and the large feature map detects small objects. The output dimension of each feature map is NxNx[3x(4+1+80)]: NxN is the number of grid cells of the feature map, each cell predicts 3 boxes, and each box carries 4 prediction-box values t_{x}, t_{y}, t_{w}, t_{h}, a 1-dimensional prediction-box confidence, and 80 class probabilities (the COCO dataset has 80 categories).
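As a concrete check of the NxNx[3x(4+1+80)] dimension, here is a minimal Python sketch; the 416x416 input size and the strides 32/16/8 are standard YOLOv3 settings assumed here rather than stated above.

```python
# A minimal sketch (assuming a 416x416 input) of the output shapes described
# above: each output layer has 3 anchors x (4 box values + 1 confidence + 80
# class probabilities) channels.
num_anchors, num_classes = 3, 80
channels = num_anchors * (4 + 1 + num_classes)   # 3 * 85 = 255

input_size = 416                                  # assumed input resolution
for stride in (32, 16, 8):                        # YOLOv3's three output strides
    n = input_size // stride                      # N: grid cells per side
    print(f"stride {stride:2d}: {n}x{n}x{channels}")
# stride 32: 13x13x255  (small map -> large objects)
# stride 16: 26x26x255
# stride  8: 52x52x255  (large map -> small objects)
```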

Prior Box

The anchor boxes are the most frequently occurring box shapes and sizes in the training set, obtained by clustering (using k-means) all of its ground truth boxes. For example, suppose the most frequent box shapes in a certain training set are flat-and-wide, thin-and-tall, and roughly square. We can feed this statistical (or human) prior experience to the model in advance, so that during learning it searches less blindly, which of course helps the model converge quickly. Taking those three common ground-truth shapes as an example: during training we can tell the model that the object to be found near grid cell 1 is either flat and wide, or thin and tall, or roughly square, so it does not blindly try other shapes. The anchor box thus constrains the range of predicted shapes and injects size priors, achieving multi-scale learning.
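A sketch of the k-means clustering described above, using 1 - IoU as the distance as the YOLO papers do; this is a hypothetical helper (not the repo's implementation), and it updates clusters with the mean where some implementations use the median.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor shapes,
    using 1 - IoU as the distance instead of Euclidean distance."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)].copy()
    for _ in range(iters):
        # Shape-only IoU between every box and every anchor (centers aligned)
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 anchors[None, :, 0] * anchors[None, :, 1] - inter)
        assign = (inter / union).argmax(axis=1)   # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):               # update cluster centers
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area

# Usage: anchors = kmeans_anchors(all_gt_wh, k=9)   # 9 anchors, 3 per scale
```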

Prior Box Calculation

As a premise, note that the grid-cell offsets c_{x} and c_{y} take the coordinate values (0,0), (0,1), (0,2), (0,3) ... (0,13), (1,0), (1,1), (1,2), (1,3) ... (1,13), and so on.

The raw outputs of the bounding box are t_{x}, t_{y}, t_{w}, t_{h};

the real prediction box is b_{x}, b_{y} (center coordinates) and b_{w}, b_{h} (width and height);

each grid cell spans a range of 1, i.e., the actual size of each grid cell is 1 × 1.

The relational expressions in the figure above can be broken down and analyzed one by one:

b_{x}=\sigma (t_{x}) + c_{x}

b_{y}=\sigma (t_{y}) + c_{y} 

where \sigma is the sigmoid function, and c_{x} and c_{y} are the coordinates of the upper-left corner of the grid cell relative to the whole feature map.

The main purpose of the sigmoid in the formula is to compress t_{x} and t_{y} into the (0,1) interval, which accelerates convergence in the early stage of training. In addition, squashing t_{x} and t_{y} into (0,1) guarantees that the predicted center stays inside the grid cell making the prediction, preventing excessive offset. Adding the sigmoid outputs to c_{x} and c_{y}, the coordinates of the grid cell in which the center falls, gives the center coordinates of the object on the feature map.

Similarly, for the width and height:

b_{w}=p_{w}e^{t_{w}}

b_{h} = p_{h}e^{t_{h}}

where p_{w} and p_{h} are the width and height of the anchor box;
t_{w} and t_{h} are the width and height directly predicted by the bounding box;
b_{w} and b_{h} are the actual predicted width and height after conversion.

These are also the width and height output in the final prediction. The formulas contain the terms e^{t_{w}} and e^{t_{h}}. First, unlike the center coordinates, they are not limited to 0-1: the value multiplied with the anchor box's width and height is relatively free, so the predicted box's width and height are unconstrained. This matches reality, since the width and height can be large or small. Second, the exponential terms e^{t_{w}} and e^{t_{h}} are convenient to differentiate when computing gradients.
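Putting the four formulas together, here is a minimal NumPy sketch of the decoding; the cell coordinates and anchor size in the example are illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw outputs (t_x, t_y, t_w, t_h) into a real box (b_x, b_y, b_w, b_h)
    on the feature map, per the four formulas above."""
    bx = sigmoid(tx) + cx        # center is confined to grid cell (c_x, c_y)
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)         # width/height rescale the anchor freely
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Example: cell (3, 4) on a 13x13 map, anchor 3.6 x 5.2 (in grid-cell units)
print(decode_box(0.2, -0.1, 0.0, 0.3, 3, 4, 3.6, 5.2))
```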

Training

1. Prediction boxes are divided into three cases: positive examples, negative examples, and ignored examples.

2. Positive example: take any ground truth and compute its IOU with all 4032 boxes; the prediction box with the largest IOU is a positive example. A prediction box can be assigned to only one ground truth: if the first ground truth has already matched a positive box, the next ground truth searches among the remaining 4031 boxes for the one with the largest IOU. The order of the ground truths can be ignored. Positive examples generate confidence loss, box loss, and class loss. The box label is the corresponding ground truth box (it requires inverse encoding: \hat{t}_{x}, \hat{t}_{y}, \hat{t}_{w}, \hat{t}_{h} are computed from the real x, y, w, h); the matching class label is 1 and the rest are 0; the confidence label is 1.

3. Ignored example: apart from the positive examples, any box whose IOU with some ground truth is greater than the threshold (0.5 in the paper) is an ignored example. Ignored examples generate no loss.

4. Negative example: apart from the positive examples (the box with the largest IOU for a ground truth is still a positive example even if that IOU is below the threshold), any box whose IOU with every ground truth is below the threshold (0.5) is a negative example. Negative examples produce only confidence loss, with a confidence label of 0. A sketch of this assignment rule is given below.
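A compact sketch of the positive/negative/ignore assignment, assuming the IoU matrix between ground truths and all prediction boxes is already computed; function and variable names are illustrative, not from the paper or repo.

```python
import numpy as np

def assign_examples(iou, thresh=0.5):
    """Label each prediction box: 1 = positive, 0 = negative, -1 = ignored.
    iou is a [num_gt, num_pred] IoU matrix."""
    num_gt, num_pred = iou.shape
    labels = np.zeros(num_pred, dtype=int)        # default: negative example
    labels[(iou > thresh).any(axis=0)] = -1       # above threshold: ignored
    taken = np.zeros(num_pred, dtype=bool)
    for g in range(num_gt):                       # order does not matter
        row = np.where(taken, -1.0, iou[g])       # each box matches one gt only
        best = int(row.argmax())
        labels[best] = 1                          # largest IoU: positive example
        taken[best] = True
    return labels
```

Note that the positives are written last, so a box that is the best match for some ground truth ends up positive even if it first fell into the ignored band, matching item 4 above.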

Another point worth mentioning concerns the conversion of box parameters. During training, the author does not convert t_{x}, t_{y}, t_{w}, t_{h} into b_{x}, b_{y}, b_{w}, b_{h} and then compute the error against the corresponding ground-truth box parameters; instead, the inverse of the formulas above is used to convert the ground-truth box parameters g_{x}, g_{y}, g_{w}, g_{h} into the corresponding \hat{t}_{x}, \hat{t}_{y}, \hat{t}_{w}, \hat{t}_{h}, and the error is computed in that space.

That is to say, since the training outputs are t_{x}, t_{y}, t_{w}, t_{h}, the error is computed against the \hat{t}_{x}, \hat{t}_{y}, \hat{t}_{w}, \hat{t}_{h} values of the real box.

From the prediction-box formulas, we can write the corresponding formulas for the real box:

g_{x} = \sigma (\hat{t}_{x}) + c_{x}

g_{y} = \sigma (\hat{t}_{y}) + c_{y}

g_{w} = p_{w}e^{\hat{t}_{w}}

g_{h} = p_{h}e^{\hat{t}_{h}}

In the calculation, to avoid evaluating the inverse of the sigmoid function, we do not solve for \hat{t}_{x} and \hat{t}_{y} themselves; instead the corresponding sigmoid values are computed and compared directly. This gives \sigma (\hat{t}_{x}), \sigma (\hat{t}_{y}), \hat{t}_{w}, \hat{t}_{h}:

\sigma (\hat{t}_{x}) = g_{x} - c_{x}

\sigma (\hat{t}_{y}) = g_{y} - c_{y}

\hat{t}_{w} = \log(g_{w}/p_{w})

\hat{t}_{h} = \log(g_{h}/p_{h})

In this way, we can compute the error between the training outputs \sigma (t_{x}), \sigma (t_{y}), t_{w}, t_{h} and the real-box targets \sigma (\hat{t}_{x}), \sigma (\hat{t}_{y}), \hat{t}_{w}, \hat{t}_{h}.
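The inverse encoding above is simple enough to sketch directly; a minimal helper, with an illustrative name:

```python
import numpy as np

def encode_target(gx, gy, gw, gh, cx, cy, pw, ph):
    """Inverse-encode a ground-truth box (g_x, g_y, g_w, g_h) into the targets
    the loss compares against, without inverting the sigmoid."""
    sig_tx_hat = gx - cx              # equals sigmoid(t_x_hat), lies in (0, 1)
    sig_ty_hat = gy - cy
    tw_hat = np.log(gw / pw)          # inverse of g_w = p_w * exp(t_w_hat)
    th_hat = np.log(gh / ph)
    return sig_tx_hat, sig_ty_hat, tw_hat, th_hat
```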

 

Loss Function

The loss of one feature map can be written abstractly as follows; the total loss is the sum of the losses of the three feature maps.
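A reconstruction of the commonly cited YOLOv3 loss for one N×N feature map (B boxes per cell), assuming MSE box terms and binary cross-entropy confidence and class terms, consistent with notes 1-3 below:

loss_{N} = \lambda_{coord}\sum_{i=0}^{N\times N}\sum_{j=0}^{B} l_{ij}^{obj}\left[(\sigma(t_{x})-\sigma(\hat{t}_{x}))^{2}+(\sigma(t_{y})-\sigma(\hat{t}_{y}))^{2}+(t_{w}-\hat{t}_{w})^{2}+(t_{h}-\hat{t}_{h})^{2}\right]

 - \sum_{i=0}^{N\times N}\sum_{j=0}^{B} l_{ij}^{obj}\left[\hat{C}_{i}^{j}\log C_{i}^{j}+(1-\hat{C}_{i}^{j})\log(1-C_{i}^{j})\right]

 - \lambda_{noobj}\sum_{i=0}^{N\times N}\sum_{j=0}^{B} l_{ij}^{noobj}\left[\hat{C}_{i}^{j}\log C_{i}^{j}+(1-\hat{C}_{i}^{j})\log(1-C_{i}^{j})\right]

 - \sum_{i=0}^{N\times N}\sum_{j=0}^{B} l_{ij}^{obj}\sum_{c\in classes}\left[\hat{p}_{i}^{j}(c)\log p_{i}^{j}(c)+(1-\hat{p}_{i}^{j}(c))\log(1-p_{i}^{j}(c))\right]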

1. \lambda is a weight constant that controls the ratio between the box loss, the obj confidence loss, and the noobj confidence loss. The number of negative examples is usually dozens of times that of positive examples, so the detection behavior can be tuned through these weight hyperparameters.

2. l_{ij}^{obj} is 1 if the box is a positive example and 0 otherwise; l_{ij}^{noobj} is 1 if the box is a negative example and 0 otherwise; for ignored examples both are 0.

3. x, y, w, and h use MSE as the loss function; smooth L1 loss (from Faster R-CNN) can also be used and makes training smoother. Since the confidence and class labels are 0/1 binary targets, cross-entropy is used as their loss function. A sketch of one layer's loss under these choices follows.
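A minimal PyTorch-style sketch of one output layer's loss; the pred/target layout, the mask names, and the lambda values are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def yolo_layer_loss(pred, target, pos, neg, lambda_coord=1.0, lambda_noobj=0.5):
    """pred, target: [n, 85] tensors laid out as (sig_tx, sig_ty, tw, th,
    conf_logit, 80 class logits); pos/neg: boolean masks for positive and
    negative examples (ignored boxes appear in neither mask)."""
    box_loss = F.mse_loss(pred[pos, :4], target[pos, :4], reduction="sum")
    obj_loss = F.binary_cross_entropy_with_logits(
        pred[pos, 4], target[pos, 4], reduction="sum")     # confidence label 1
    noobj_loss = F.binary_cross_entropy_with_logits(
        pred[neg, 4], target[neg, 4], reduction="sum")     # confidence label 0
    cls_loss = F.binary_cross_entropy_with_logits(
        pred[pos, 5:], target[pos, 5:], reduction="sum")   # one-hot class labels
    return lambda_coord * box_loss + obj_loss + lambda_noobj * noobj_loss + cls_loss
```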

 

Confidence

There is one more critical problem. During training, the criterion for selecting the responsible bounding box is the largest IOU between the predicted box and the ground truth box; but at prediction time there is no ground truth box, so how do we choose the best bounding box? This requires another quantity: the confidence discussed below.

Confidence expresses two degrees of belief: how sure we are that there really is an object inside the box, and how well the box covers the entire object. With that explanation, the definition of confidence can be written mathematically:

C_{i}^{j} = Pr(Object) \cdot IOU_{pred}^{truth}

Here Pr(Object) indicates whether there is an object in the current box, where "object" means any of the categories to be detected, i.e., everything except background. The IOU term expresses the expected IOU between the predicted box and the real box, i.e., how well the box frames the object.

How to train C_{i}^{j}

In training, the true value \hat{C}_{i}^{j} is determined by whether the grid cell's bounding box is responsible for predicting an object: if it is responsible, then \hat{C}_{i}^{j} = 1; otherwise \hat{C}_{i}^{j} = 0.

An anchor box is responsible for predicting an object if its IOU with that object's ground truth box is the largest among all anchor boxes' IOUs with that ground truth box.

Object Conditional Class Probability

The object conditional class probability is an array of probabilities whose length equals the number of classes the current model detects. It means: given that the bounding box believes there is an object in the current box, the probability of each class among all the classes to be detected.
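Combining this with the confidence above gives per-class detection scores; a tiny sketch of the usual YOLO-style combination (the function name and example values are illustrative):

```python
import numpy as np

def class_scores(confidence, class_probs):
    """Per-class detection scores: the conditional class probabilities
    scaled by the box confidence."""
    return confidence * np.asarray(class_probs)   # broadcasts over the array

print(class_scores(0.9, [0.8, 0.1, 0.1]))         # -> [0.72 0.09 0.09]
```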

Discussion

Everyone is welcome to join the group discussion.

 


 

 
