"Target Detection" YOLO V3

1. YOLO network structure description

The network structure consists of three main components:

1. Backbone

A convolutional neural network that aggregates and forms image features at different granularities.

2. Neck

A series of network layers that mix and combine image features and pass them on to the prediction layer.

3. Head

Takes the image features, generates bounding boxes, and predicts categories.

2. Introduction to YOLO V3

Detection systems prior to yolov3 repurpose classifiers or localizers to perform detection: the model is applied to an image at multiple locations and scales, and high-scoring regions are taken as detections. yolov3 uses a completely different approach.

yolov3 applies a single neural network to the entire image. The network divides the image into regions and predicts bounding boxes and probabilities for each region; these bounding boxes are weighted by the predicted probabilities. This has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions use the global context of the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes YOLOv3 very fast, generally 1000x faster than R-CNN and 100x faster than Fast R-CNN.

3. YOLO V3 network structure

[Figure: YOLO V3 network structure diagram]

The three basic components of Yolov3

1. CBL: the smallest component in the Yolov3 network structure, composed of a convolution layer, batch normalization, and a Leaky ReLU activation (Conv + Bn + Leaky_relu); see the sketch below.
2. Res unit: borrowed from the residual structure of the ResNet network, allowing the network to be built deeper.
3. ResX: consists of one CBL followed by X residual units; it is the large-scale component of yolov3. The CBL in front of each Res module performs downsampling, so after 5 Res modules the feature map shrinks 608->304->152->76->38->19.
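Concretely, a CBL block maps directly onto standard framework layers. A minimal PyTorch sketch (illustrative only; the 0.1 negative slope follows the usual Darknet convention):

```python
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU: the smallest Yolov3 building block."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)  # Darknet's customary negative slope

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```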

Other basic operations:

  1. Concat: tensor concatenation, which expands the channel dimension. For example, concatenating a 26*26*256 tensor with a 26*26*512 tensor yields 26*26*768. Concat has the same function as route in the cfg file.
  2. add: element-wise tensor addition; the dimensions do not expand. For example, adding 104*104*128 to 104*104*128 yields 104*104*128. Add has the same function as shortcut in the cfg file. Both are illustrated in the snippet below.
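Both operations are one-liners in a framework such as PyTorch; a quick check with the shapes quoted above:

```python
import torch

a = torch.randn(1, 256, 26, 26)    # a 26*26*256 feature map (NCHW layout)
b = torch.randn(1, 512, 26, 26)    # a 26*26*512 feature map
cat = torch.cat([a, b], dim=1)     # concat expands channels -> (1, 768, 26, 26)

c = torch.randn(1, 128, 104, 104)
d = torch.randn(1, 128, 104, 104)
s = c + d                          # add keeps the shape -> (1, 128, 104, 104)
```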

[Figure: detailed YOLO V3 network structure]
Yolov3's network structure description:

  1. In the feature-extraction part, yolo3 uses darknet53 to extract multiple feature layers for target detection: three feature layers in total, taken from different depths of the darknet53 backbone (the middle, middle-lower, and bottom layers). Their shapes are (52,52,256), (26,26,512), and (13,13,1024) respectively.
  2. Each of these three initial feature layers goes through 5 convolutions. After processing, one branch outputs the prediction result for that feature layer, and the other branch is upsampled (UpSampling2d) and then combined with the next feature layer.
  3. The shapes of the output layers are (13,13,75), (26,26,75), (52,52,75). The last dimension is 75 because this figure is based on the voc data set, which has 20 classes; yolo3 has only 3 prior boxes per feature layer, so the last dimension is 3x25 = 3x(20+4+1). With the coco training set there are 80 classes, so the last dimension becomes 255 = 3x85 and the three feature layers have shapes (13,13,255), (26,26,255), (52,52,255); the arithmetic is shown in the helper below.
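The channel arithmetic can be captured in one helper (a sketch; the function name is mine):

```python
def head_channels(num_classes, anchors_per_layer=3):
    # each anchor predicts 4 box offsets + 1 confidence + per-class scores
    return anchors_per_layer * (5 + num_classes)

print(head_channels(20))  # voc:  3 * 25 = 75
print(head_channels(80))  # coco: 3 * 85 = 255
```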

Notes on Yolov3's output:

  1. The 9 anchors are divided equally among the three output tensors; each output picks its own anchors according to the three size groups: large, medium, and small (see the split below).
  2. Each output y produces 3 prediction boxes in each of its grid cells; these 3 boxes come from dividing the 9 anchors by 3. In terms of the output tensor dimensions, 13x13x255 with 255 = 3*(5+80): 80 is the number of categories, 5 is the location information plus the confidence, and 3 means 3 predictions are output per cell.
  3. Logistic regression assigns an objectness score to the content enclosed by each anchor. The anchor priors used for prediction are selected according to this score; not all anchor priors will produce output.
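For reference, the 9 clustered anchors published in the YOLOv3 paper (widths and heights on a 416 input) and their split across the three heads:

```python
# the smallest three go to the 52x52 head (small objects), the largest to 13x13
anchors = [(10, 13), (16, 30), (33, 23),        # 52x52 head
           (30, 61), (62, 45), (59, 119),       # 26x26 head
           (116, 90), (156, 198), (373, 326)]   # 13x13 head
anchors_per_scale = [anchors[i:i + 3] for i in range(0, 9, 3)]
```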

4. Decoding of prediction results

From the previous step, the prediction results of the three feature layers are obtained, with shapes (N,255,13,13), (N,255,26,26), (N,255,52,52). They correspond to the 3 prediction boxes at each position of the 13x13, 26x26, and 52x52 grids into which each picture is divided. However, this raw output does not yet correspond to the final positions of the prediction boxes on the picture; it must first be decoded.

The prediction result of each feature layer encodes the positions of its three prediction boxes. First reshape the outputs to (N,3,85,13,13), (N,3,85,26,26), (N,3,85,52,52).

The 85 in the dimension contains 4+1+80, representing x_offset and y_offset, w and h, the confidence, and the classification result, respectively.

yolo3's specific decoding process: in the code, a grid is first generated at the size of each feature layer, and the prior boxes that were preset on the original 416*416 image are resized to the scale of that feature layer. From the yolov3 network prediction we obtain the center adjustment parameters x_offset and y_offset of the prior box and the width-height adjustment parameters w and h. Adding x_offset and y_offset to each grid point gives the center of the adjusted prior box, i.e. the center of the prediction box; combining the prior box with w and h then gives the adjusted width and height, i.e. the width and height of the prediction box. This yields the position of every prediction box on the feature layer, and finally those positions are rescaled from the feature layer back to the original 416*416 image.
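A condensed PyTorch sketch of this decoding, using the standard YOLOv3 conventions (sigmoid on the center offsets, exponential on the width-height factors). Names and tensor layout are illustrative, not the blog's original code:

```python
import torch

def decode(pred, anchors, num_classes, img_size=416):
    """pred: raw head output (N, 3*(5+num_classes), S, S) for one feature layer;
    anchors: the 3 (w, h) priors for this layer, in pixels on the 416 image."""
    N, _, S, _ = pred.shape
    stride = img_size // S
    pred = pred.view(N, 3, 5 + num_classes, S, S).permute(0, 1, 3, 4, 2)

    gy, gx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    pw = torch.tensor([a[0] for a in anchors]).view(1, 3, 1, 1)
    ph = torch.tensor([a[1] for a in anchors]).view(1, 3, 1, 1)

    x = (torch.sigmoid(pred[..., 0]) + gx) * stride   # grid point + x_offset
    y = (torch.sigmoid(pred[..., 1]) + gy) * stride   # grid point + y_offset
    w = pw * torch.exp(pred[..., 2])                  # prior width adjusted by w
    h = ph * torch.exp(pred[..., 3])                  # prior height adjusted by h
    conf = torch.sigmoid(pred[..., 4])                # objectness confidence
    cls = torch.sigmoid(pred[..., 5:])                # per-class probabilities
    return x, y, w, h, conf, cls                      # all on the 416*416 image
```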

Of course, after obtaining the final prediction results, score ranking and non-maximum suppression are still required. This part is common to essentially all target detection methods, but this project handles it slightly differently from others: it runs the filtering per category.

  1. For each category, take out the boxes and scores whose score is greater than self.obj_threshold.
  2. Use the positions and scores of those boxes for non-maximum suppression (a sketch follows this list).
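A sketch of that per-class filtering with torchvision's NMS (score_thr plays the role of self.obj_threshold; names are illustrative):

```python
import torch
from torchvision.ops import nms

def per_class_nms(boxes, scores, labels, score_thr=0.5, iou_thr=0.45):
    """boxes: (M, 4) in xyxy; scores: (M,); labels: (M,) integer class ids."""
    kept = []
    for c in labels.unique():
        m = (labels == c) & (scores > score_thr)          # step 1: threshold per class
        idx = m.nonzero(as_tuple=True)[0]
        if idx.numel():
            keep = nms(boxes[idx], scores[idx], iou_thr)  # step 2: per-class NMS
            kept.append(idx[keep])
    return torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)
```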

Summary: decoding is the process of adjusting the prior boxes using the prediction results of the yolov3 network (its 3 effective feature layers); the adjusted boxes are the prediction boxes.

Draw on the original image

Through decoding, the positions of the prediction boxes on the original image are obtained, and these prediction boxes are filtered. The boxes that survive the filtering can be drawn directly on the picture to produce the final result.

5. Improvements in YOLO V3

1. Multi-scale prediction (FPN-like)

Each scale predicts 3 boxes. The anchor design still uses clustering to obtain 9 cluster centers, which are divided equally among the 3 scales according to their sizes.

  • Scale 1: add some convolutional layers after the base network, then output box information.
  • Scale 2: upsample (x2) from the penultimate convolutional layer of scale 1, add it to the last 16x16 feature map, and output box information after several more convolutions; this scale is twice as large as scale 1.
  • Scale 3: analogous to scale 2, using a 32x32 feature map.

2. Better basic classification network (ResNet-like) and classifier: darknet-53

[Figure: darknet-53 architecture]
The backbone network is changed to darknet53. Its key feature is the use of the residual structure (Residual). A residual stage in darknet53 first performs a 3*3 convolution with stride 2, saves that layer's output, then applies a 1*1 convolution followed by a 3*3 convolution, and adds the saved layer to the result as the final output (sketched below). Residual networks are easy to optimize and can gain accuracy from considerably increased depth; the skip connections inside the residual blocks alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
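A sketch of that residual stage in PyTorch, following the description above (a stride-2 3*3 downsampling convolution, then units of 1*1 + 3*3 convolutions added back to their input); this is illustrative, not darknet's own code:

```python
import torch.nn as nn

def cbl(in_ch, out_ch, k, stride=1):
    """Conv + BN + LeakyReLU, as in the CBL sketch earlier."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class ResUnit(nn.Module):
    """1*1 conv then 3*3 conv, with the input added back to the result."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            cbl(channels, channels // 2, 1),  # 1*1 conv halves the channels
            cbl(channels // 2, channels, 3),  # 3*3 conv restores them
        )

    def forward(self, x):
        return x + self.block(x)              # the saved input joins the output

def stage(in_ch, out_ch, n_units):
    """One darknet53 stage: 3*3 stride-2 conv (downsampling) + n residual units."""
    return nn.Sequential(cbl(in_ch, out_ch, 3, stride=2),
                         *[ResUnit(out_ch) for _ in range(n_units)])
```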

Each convolution part of darknet53 uses the distinctive DarknetConv2D structure: L2 regularization is applied in every convolution, followed by BatchNormalization and LeakyReLU. Ordinary ReLU sets all negative values to zero, whereas Leaky ReLU assigns a small non-zero slope to negative values. Mathematically, it can be expressed as:
$f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$, where $\alpha$ is a small positive constant (darknet uses 0.1).

3. Classifier-category prediction

YOLOv3 does not use Softmax to classify each box. There are two main considerations:

(1) Softmax forces each box to be assigned exactly one category (the one with the highest score), but in data sets such as Open Images a target may have overlapping category labels, so Softmax is not suitable for multi-label classification.
(2) Softmax can be replaced by multiple independent logistic classifiers without a drop in accuracy.

The classification loss uses binary cross-entropy loss.
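A minimal example of those independent logistic classifiers, using binary cross-entropy on raw logits:

```python
import torch
import torch.nn as nn

# one sigmoid + BCE per class instead of a softmax over all classes,
# so a single box can score high on several overlapping labels at once
bce = nn.BCEWithLogitsLoss()
logits = torch.randn(4, 80)       # raw class scores for 4 boxes, 80 classes
targets = torch.zeros(4, 80)
targets[0, [0, 2]] = 1.0          # box 0 carries two labels simultaneously
loss = bce(logits, targets)
```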

6. Training

1. What is pred?

For the yolo3 model, the final output of the network is, for each grid point of the three feature layers, the corresponding prediction boxes and their classes. That is, the three feature layers correspond to the picture divided into grids of different sizes, and each grid point carries the positions, confidences, and classes of its three prior boxes.

The shapes of the output layers are (13,13,75), (26,26,75), (52,52,75). As before, the last dimension is 75 because the voc data set has 20 classes and yolo3 has only 3 prior boxes per feature layer, giving 3x25; with the coco training set there are 80 classes, so the last dimension is 255 = 3x85 and the three feature layers have shapes (13,13,255), (26,26,255), (52,52,255).

Note: at this point y_pre is still not decoded; only after decoding does it describe positions on the real image.

2. What is the target?

The target is the set of ground-truth boxes in a real image. Its first dimension is batch_size, its second dimension is the number of ground-truth boxes in each picture, and its third dimension is the information of each ground-truth box, including its location and class; a plausible example is shown below.
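For illustration, one batch element might look like this (an assumption about the layout; the exact coordinate convention depends on the data loader):

```python
# two ground-truth boxes for one picture, each
# [x_center, y_center, width, height, class_id],
# coordinates normalized to the image size (assumed convention)
target = [
    [[0.48, 0.63, 0.21, 0.34, 11],
     [0.72, 0.18, 0.10, 0.08, 6]],
]
```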

3. Loss calculation process

After getting the pred and the target, they cannot simply be subtracted and compared; the following steps are needed:

Step 1: Decode the prediction results of the yolov3 network to obtain the adjustments that the network predicts for the prior boxes.

Step 2: Process the ground-truth boxes to obtain the prior-box adjustments the network should really produce, i.e. the prediction results the network should have, and then compare them with the network's actual predictions. This is the get_target function in the code:

  1. Determine the position of the ground-truth box in the picture and which grid point is responsible for detecting it.

  2. Determine which prior box overlaps the ground-truth box the most.

  3. Calculate the prediction result that grid point should produce to recover the ground-truth box (use the ground-truth box to work backwards from the preset prior box to the adjustment parameters the grid point should predict).

  4. Process all ground-truth boxes as above.

  5. Obtain the prediction results the network should have and compare them with its actual predictions (a simplified sketch of this assignment follows).
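A simplified sketch of that assignment for a single feature layer (not the repo's actual get_target; real implementations usually match each box against all 9 anchors across the three layers, while this sketch only considers the current layer's 3 priors):

```python
import math
import torch

def iou_wh(wh, anchor):
    """IoU of two boxes laid over the same center, compared by shape only."""
    iw, ih = min(wh[0], anchor[0]), min(wh[1], anchor[1])
    inter = iw * ih
    return inter / (wh[0] * wh[1] + anchor[0] * anchor[1] - inter)

def build_target(gt_boxes, anchors, S, num_classes, img_size=416):
    """gt_boxes: list of (x, y, w, h, class_id) in pixels for one picture;
    anchors: the 3 (w, h) priors of this S x S feature layer, in pixels."""
    stride = img_size // S
    t = torch.zeros(len(anchors), S, S, 5 + num_classes)
    for x, y, w, h, c in gt_boxes:
        gi, gj = int(x // stride), int(y // stride)      # 1. responsible grid point
        k = max(range(len(anchors)),                     # 2. best-overlapping prior
                key=lambda i: iou_wh((w, h), anchors[i]))
        t[k, gj, gi, 0] = x / stride - gi                # 3. x_offset to predict
        t[k, gj, gi, 1] = y / stride - gj                #    y_offset to predict
        t[k, gj, gi, 2] = math.log(w / anchors[k][0])    #    w adjustment
        t[k, gj, gi, 3] = math.log(h / anchors[k][1])    #    h adjustment
        t[k, gj, gi, 4] = 1.0                            #    objectness target
        t[k, gj, gi, 5 + int(c)] = 1.0                   #    one-hot class
    return t
```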

Step 3: Ignore the prior boxes that are not assigned to any ground-truth box but whose corresponding network predictions overlap a ground-truth box heavily. Such a box carries no target, so its position and class information is meaningless, but because it overlaps a real object strongly it should not be punished as pure background either; these adjusted prior boxes are therefore ignored, and the network is only supervised on boxes that genuinely carry target information. This is the get_ignore function in the code; a toy example follows.
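A toy example of the resulting masks (values invented for illustration):

```python
import torch

ignore_thresh = 0.5
# best IoU of each predicted box with any ground-truth box (toy values)
best_iou = torch.tensor([0.1, 0.7, 0.9, 0.2])
obj_target = torch.tensor([0., 0., 1., 0.])  # 1 where a prior owns a real box

ignore = (best_iou > ignore_thresh) & (obj_target == 0)  # overlaps a lot, not responsible
noobj = (obj_target == 0) & ~ignore                      # punished as background
# -> prior 1 is ignored; priors 0 and 3 contribute to the no-object loss
```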

Step 4: With the true adjustments derived from the ground-truth boxes and the adjustments predicted by the network, compare the two and compute the loss.

Origin: blog.csdn.net/libo1004/article/details/111035582