[Object Detection] (5) YOLOv1 Object Detection Principle Analysis

Hello everyone, today I will share with you the principles of YOLOv1 object detection.

1. Prediction stage - forward propagation

In the prediction stage, a successfully trained model takes unseen images as input and produces predictions for them. At this point, only a forward pass through the model is required.

The process is as follows. The input image has shape [448, 448, 3]. After several convolution and pooling layers, the output feature map has shape [7, 7, 1024]. This feature map is flattened and fed into a fully connected layer with 4096 neurons, producing a 4096-dimensional vector; that vector is fed into a fully connected layer with 1470 neurons, producing a 1470-dimensional vector. Finally, the vector is reshaped into a [7, 7, 30] feature map.

In the prediction stage, the YOLOv1 model is effectively a black box: it takes a [448, 448, 3] image as input and outputs a [7, 7, 30] feature map. This output tensor contains the coordinates, confidences, and class results of all prediction boxes.
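As a quick sanity check of these shapes, here is a minimal NumPy sketch; the backbone and fully connected weights are stand-ins, and only the final reshape follows the text above:

```python
import numpy as np

# The last fully connected layer emits a 1470-dimensional vector per image
fc_out = np.random.rand(1470).astype(np.float32)

# 1470 = 7 * 7 * 30, so the vector reshapes into the [7, 7, 30] output tensor
S, B, C = 7, 2, 20                    # grid size, boxes per cell, VOC classes
assert S * S * (B * 5 + C) == 1470
output = fc_out.reshape(S, S, B * 5 + C)
print(output.shape)  # (7, 7, 30)
```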


The [7, 7, 30] shape of the output feature map can be understood as follows:

(1) First, the network divides the image into SxS grid cells. In YOLOv1, S=7, so each image is divided into a 7x7 grid.

(2) Each grid cell predicts B bounding boxes. In YOLOv1, B=2, so each cell predicts 2 prediction boxes, which may differ greatly in size and shape. A prediction box is considered to be generated by the grid cell that contains its center point, so the center points of the two prediction boxes generated by a cell must fall inside that cell.

(3) Each prediction box contains four localization values, the center coordinates (x, y) and the width and height (w, h), which determine its position; it also contains a confidence c indicating whether the box contains a target object. Each grid cell additionally holds the conditional probabilities of all categories: assuming a prediction box already contains a target object, the probability that the object belongs to each class. For example, the probability that the object is a dog, given that an object is present.

(4) Multiplying each prediction box's confidence by the conditional class probabilities gives the probability that the box belongs to each class.

(5) The output feature map has 30 channels. Each grid cell generates 2 prediction boxes, and each prediction box has 5 parameters (x, y, w, h, c), so the two boxes contribute 10 parameters; the VOC dataset has 20 categories, so each cell also holds 20 conditional class probabilities.

Therefore, each grid cell holds 5+5+20 = 30 parameters; each image is divided into a 7x7 grid, so one image yields 7x7x30 parameters.
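One common way to lay out the 30 channels of a single grid cell is sketched below; note the exact channel order varies between implementations, so this layout is an assumption for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output

cell = output[3, 4]            # the 30 numbers of one grid cell
box1 = cell[0:5]               # (x, y, w, h, c) of the first prediction box
box2 = cell[5:10]              # (x, y, w, h, c) of the second prediction box
class_probs = cell[10:30]      # 20 conditional class probabilities

# Class-specific score of each box: confidence times conditional probability
scores1 = box1[4] * class_probs
scores2 = box2[4] * class_probs
print(scores1.shape, scores2.shape)  # (20,) (20,)
```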


As shown in the left figure below, each grid cell predicts two prediction boxes: boxes with high confidence are drawn with thick lines and those with low confidence with thin lines. The high-confidence boxes are kept.

Each grid cell also produces conditional probabilities for the 20 classes. The figure on the right shows the cells occupied by classes with high conditional probability; for example, green marks areas where the conditional probability of "dog" is high. Each cell is assigned only one class: the one with the highest of the 20 conditional probabilities.

Each grid cell can recognize only one target object, so a 7x7 grid can predict at most 49 objects. This is why YOLOv1 performs poorly on small, densely packed objects.


2. Prediction stage - post-processing

The 98 prediction boxes produced by the network are filtered with NMS (non-maximum suppression): boxes with low confidence are removed, and only one of each set of duplicate boxes is kept, yielding the final object detection result.

Post-processing turns the 7x7x30 feature map output by the network into the final object detection result.

Take one cell from the 7x7 grid. As described in Section 1, each cell contains the parameters (x, y, w, h, c) of 2 prediction boxes and 20 conditional class probabilities (the probability of each class, assuming the prediction box contains a target object), i.e., 5+5+20 parameters.

Next, multiply each prediction box's confidence by the cell's 20 conditional class probabilities to obtain the probability that the box actually belongs to each class. Multiplying the first box's confidence by the 20 conditional probabilities yields the full probability that the first box belongs to each of the 20 classes: a 20-dimensional vector of per-class probabilities for that box.

Each grid cell therefore has 2 probability vectors of 20 elements each, and the 7x7 grid has 98 such vectors.
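Vectorized over the whole grid, the 98 probability vectors can be computed in one step. This sketch assumes the 30 channels are ordered as the two boxes' (x, y, w, h, c) followed by the 20 class probabilities, which varies between implementations:

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)         # stand-in network output

# Assumed layout: B boxes of (x, y, w, h, c), then C class probabilities
boxes = output[..., :B * 5].reshape(S, S, B, 5)  # (7, 7, 2, 5)
class_probs = output[..., B * 5:]                # (7, 7, 20)

conf = boxes[..., 4:5]                           # (7, 7, 2, 1) confidences
scores = conf * class_probs[:, :, None, :]       # (7, 7, 2, 20) via broadcasting
scores = scores.reshape(-1, C)                   # (98, 20): one row per box
print(scores.shape)  # (98, 20)
```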

Now 98 probability vectors have been obtained. Taking the dog class as an example: some boxes compute a very small dog probability. Set a threshold such as 0.2 and set to 0 the dog probability of every box whose value is below 0.2; then sort all prediction boxes by their dog probability.
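That thresholding and sorting step might look like the following sketch; the class index 11 for "dog" is a made-up example:

```python
import numpy as np

scores = np.random.rand(98, 20)      # stand-in (98, 20) class-score matrix
DOG = 11                             # hypothetical index of the dog class

dog_scores = scores[:, DOG].copy()
dog_scores[dog_scores < 0.2] = 0.0   # zero out scores below the threshold

order = np.argsort(-dog_scores)      # box indices, highest dog score first
```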


Apply NMS (non-maximum suppression) to the sorted prediction boxes.

First take the box with the highest probability, then compare each remaining box with it one by one. If the IoU (intersection over union) of the two boxes exceeds a threshold, they are considered to detect the same target object redundantly, and the lower-probability box is filtered out. Boxes whose IoU stays below the threshold are kept.
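The IoU used for this comparison can be sketched as follows; boxes are given as corner coordinates here, while YOLOv1 itself parameterizes boxes by center and size, so a conversion would be needed first:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```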

As shown in the figure below, the IoU of the orange and green prediction boxes exceeds 0.5, indicating that the two boxes predict the same target object, so the probability of the green box, the lower-probability one, is set to 0.

Next, take the box with the second-highest probability and compute its IoU with each remaining box. The IoU of the blue and purple boxes exceeds the threshold, indicating they coincide, so the probability of the purple box, the lower one, is set to 0. All prediction boxes are compared in turn in the same way.

After all comparisons, the remaining orange and blue boxes are retained as dog predictions. NMS is then applied to each of the 20 classes in the same way. The final result is a sparse matrix with many entries set to 0. Among the 98 prediction boxes, find those with a nonzero probability, look up their class index and probability value, and obtain the final detection result.
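The whole per-class suppression loop described above can be sketched as a simple O(N^2) version, again with boxes as corner coordinates:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Set to 0 the score of any box overlapping a higher-scoring box.
    boxes: (N, 4) corner coordinates; scores: (N,) probabilities for ONE class."""
    scores = scores.copy()
    order = np.argsort(-scores)              # highest score first
    for i, a in enumerate(order):
        if scores[a] == 0:
            continue                         # already suppressed
        for b in order[i + 1:]:
            if scores[b] > 0 and iou(boxes[a], boxes[b]) > iou_thresh:
                scores[b] = 0.0              # duplicate detection, suppress it
    return scores

boxes = np.array([[0, 0, 2, 2], [0, 0, 2, 2.2], [10, 10, 12, 12]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0.9 0.  0.7]
```

The middle box heavily overlaps the first and scores lower, so its score drops to 0; the disjoint third box survives.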


3. Training stage

Supervised training iteratively fine-tunes the network weights via gradient descent and backpropagation to minimize the loss function.

Object detection is a typical supervised learning problem: ground-truth boxes have been drawn by hand on the training set, and the algorithm should make its prediction boxes fit the ground-truth boxes as closely as possible.

Whichever grid cell contains the center point of a ground-truth box is responsible for fitting it with one of its prediction boxes. Each cell generates two prediction boxes, so one of the two must fit the ground-truth box, and the class output by that cell must also be the class of the ground-truth box.

As shown in the figure below, the solid blue box is the ground-truth box; its center falls in the red grid cell, which generates two prediction boxes, the yellow and orange dashed boxes. The prediction box with the larger IoU against the ground truth is responsible for fitting it; here, that is the yellow dashed box. Training adjusts that box to be as close to the ground-truth box as possible, while the confidence of the box with the smaller IoU is driven as low as possible.
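Selecting the responsible box can be sketched as below; boxes are corner coordinates and the helper names are illustrative:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def responsible_box(pred_boxes, gt_box):
    """Index of the predicted box with the highest IoU against the ground truth."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    return max(range(len(ious)), key=lambda j: ious[j])

# The second box overlaps the ground truth far more, so it is responsible
preds = [(0, 0, 1, 1), (2, 2, 6, 6)]
print(responsible_box(preds, (3, 3, 7, 7)))  # 1
```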


Loss function

The loss function contains five terms, all of them regression losses: each predicts a continuous value and compares it with the label value; the closer, the better.

(1) The first term is the center-point localization error of the prediction boxes responsible for detecting objects: the predicted and ground-truth boxes should agree as closely as possible in both center coordinates.

(2) The second term is the width/height localization error of the prediction boxes responsible for detecting objects: the predicted width and height should match the ground-truth box's as closely as possible. Taking square roots makes the loss more sensitive to errors on small boxes.

(3) The third term is the confidence error of the prediction boxes responsible for detecting objects. The label value is the IoU between the predicted box and the ground-truth box, and the predicted confidence should be as close to it as possible.

(4) The fourth term is the confidence error of the prediction boxes not responsible for detecting objects. The confidence label of every box not used to fit a ground-truth box is 0, so their predicted confidences should be pushed toward 0.

(5) The fifth term is the classification error of the grid cells responsible for detecting objects. If a cell is responsible for predicting a dog, then among its 20 class probabilities, the closer the dog probability is to 1, the better.
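Putting the five terms together, the YOLOv1 loss from the original paper reads:

```latex
L = \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}
      \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]
  + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}
      \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2
           +\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]
  + \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{I}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{I}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2
  + \sum_{i=0}^{S^2}\mathbb{I}_{i}^{obj}\sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
```

The hatted symbols are the label values; the indicator functions and weights are defined below.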

$\mathbb{I}_{i}^{obj}$ indicates whether the i-th grid cell contains an object, i.e., whether the center point of a ground-truth box falls in this cell: 1 if so, 0 otherwise.

$\mathbb{I}_{ij}^{obj}$ is 1 if the j-th prediction box of the i-th grid cell is responsible for predicting an object, 0 otherwise.

$\mathbb{I}_{ij}^{noobj}$ is 1 if the j-th prediction box of the i-th grid cell is not responsible for predicting an object, 0 otherwise.

$\lambda_{coord}$ gives greater weight to the localization error of the prediction boxes that are actually responsible for detecting objects.

$\lambda_{noobj}$ gives a small weight to the confidence error of the prediction boxes not responsible for detecting objects.


Origin: blog.csdn.net/dgvv4/article/details/123767854