Object detection -> YOLO v1

YOLO v1 is the pioneering one-stage detector. YOLO stands for "You Only Look Once"; it was published at CVPR 2016.

1. Main idea:

1) Divide the image into S*S grid cells. If the center of an object falls inside a grid cell, that cell is responsible for predicting the object.

2) Each grid cell predicts B bounding boxes. In addition to its position, each bbox carries a confidence value, and each grid cell also predicts scores for C categories.

So the number of channels of the output feature map is: 5B+C

Here, 5 covers the four position parameters plus the confidence of each bounding box; B is the number of bboxes predicted per cell, and C is the number of categories. The network output is an S*S*(5B+C)-dimensional tensor that is a function of the input image; the YOLO network in between is just the tool that learns this mapping.
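A minimal sketch of this bookkeeping (the per-cell layout of boxes and class scores is an illustrative assumption; the total size is what the paper fixes):

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

# Each cell predicts B boxes * (x, y, w, h, confidence) plus C class scores.
channels = 5 * B + C             # 30
output_size = S * S * channels   # 7 * 7 * 30 = 1470

print(channels, output_size)     # 30 1470
```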

If S=7, B=2, C=20, the prediction vector of each grid cell is 30-dimensional: two boxes, each contributing (x, y, w, h, confidence), followed by 20 class scores.

Each bbox has a position (x, y, w, h). x and y are the coordinates of the predicted object's center. Since the object's center falls inside this grid cell (which is why the cell is responsible for it), the center cannot leave the cell; x and y are therefore offsets relative to the top-left corner of the current cell, each a number between 0 and 1.

w and h are the width and height of the object, which may be larger than the cell, so they are normalized relative to the width and height of the entire image. (There is no anchor concept in YOLO v1; the bboxes in SSD and Faster R-CNN are regressed relative to anchors.)
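A sketch of decoding one cell's box back to absolute image coordinates under this parameterization (function and variable names are mine, not from the paper):

```python
def decode_box(tx, ty, tw, th, col, row, S=7, img_w=448, img_h=448):
    """Convert a YOLOv1-style prediction (all values in [0, 1]) from grid
    cell (col, row) into an absolute (cx, cy, w, h) box in pixels."""
    cell_w, cell_h = img_w / S, img_h / S
    cx = (col + tx) * cell_w       # x, y are offsets inside the cell
    cy = (row + ty) * cell_h
    w = tw * img_w                 # w, h are relative to the whole image
    h = th * img_h
    return cx, cy, w, h

# Example: cell (3, 3), box centered in the cell, half the image wide and tall
print(decode_box(0.5, 0.5, 0.5, 0.5, col=3, row=3))  # (224.0, 224.0, 224.0, 224.0)
```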

YOLOv1 therefore makes a prediction at every grid cell. Ideally, the cell containing an object's center outputs a high confidence, while cells that contain no center output a confidence very close to 0.

In summary, YOLOv1's output has three parts: confidence, class, and bbox:

Confidence is the box confidence; it encodes both whether an object is present in the cell and the IOU between the predicted box and the ground truth;

class is the category prediction;

bbox is the bounding box itself.

Confidence calculation method:

If no object falls in the current grid cell, the confidence label is 0; otherwise it is the IOU between the predicted box and the ground-truth box.
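In the paper's notation:

$$C = \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$

where $\Pr(\text{Object})$ is 1 when an object's center falls in the cell and 0 otherwise.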

The final predicted target probability is:
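$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$

At test time, each cell's conditional class probabilities are multiplied by each box's confidence, yielding a class-specific confidence score for every box.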

2. Network structure:

YOLOv1 uses a GoogLeNet-inspired backbone of 24 convolutional layers followed by 2 fully connected layers; the final fully connected layer is reshaped into the S*S*(5B+C) = 7*7*30 output tensor. The input resolution is 448*448.
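A heavily condensed PyTorch-style sketch of that overall shape (the real model has 24 convolutional layers; the layer widths here are illustrative only, not the paper's):

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

class TinyYOLOv1(nn.Module):
    """Illustrative stand-in for the 24-conv-layer backbone: a few
    stride-2 convs take 448x448 down to a 7x7 grid, then two FC
    layers produce the 7*7*30 prediction tensor."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (16, 32, 64, 128, 256, 512):  # 448 -> 7 via six /2 steps
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.1)]
            in_ch = out_ch
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * S * S, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (5 * B + C)),
        )

    def forward(self, x):
        return self.head(self.backbone(x)).view(-1, S, S, 5 * B + C)

print(TinyYOLOv1()(torch.zeros(1, 3, 448, 448)).shape)  # torch.Size([1, 7, 7, 30])
```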

3. YOLOv1 loss function

The loss function has three parts: the first is the coordinate prediction, i.e. the x, y, w, and h of the box; the second is the confidence prediction for the object; the third is the category prediction. The loss is computed over the 7*7*30-dimensional output tensor and is the "mathematical expression" that pins down the mapping between input and output.

To balance the contribution of large and small objects to the bbox loss, the square roots of w and h are used rather than w and h themselves.
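For reference, the full loss from the paper (which sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$):

$$
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$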

1) Coordinate loss function

The square root is applied to the width and height so that the w/h loss of a large object becomes comparable to that of a small object, preventing large objects from dominating the whole loss. Without the square root, the loss from large objects is much greater than from small ones, so the loss function would fit large objects accurately while effectively ignoring small ones. The coefficient in front of this term, λ_coord (5 in the paper), is a hyperparameter; since the objects we want to detect occupy far less of the image than background, it is added to rebalance the influence of "non-objects" on the result.
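A quick numeric check of the effect (numbers chosen for illustration):

```python
from math import sqrt

# Large box: true w = 100 px, predicted 90 (10% relative error).
# Small box: true w = 10 px, predicted 5 (50% relative error).
big_raw   = (100 - 90) ** 2                # 100  -> large box dominates
small_raw = (10 - 5) ** 2                  # 25

big_sqrt   = (sqrt(100) - sqrt(90)) ** 2   # ~0.26
small_sqrt = (sqrt(10) - sqrt(5)) ** 2     # ~0.86 -> worse relative error now costs more

print(big_raw, small_raw, round(big_sqrt, 2), round(small_sqrt, 2))
```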

2) Confidence loss function

Why include a confidence term for "non-object" cells? If the network is to learn n object classes, it effectively has to learn n + 1 concepts, the extra one being the background, i.e. true non-objects. This category takes up a large share of the image, so it must be learned for the network to be accurate. Why then put a hyperparameter (λ_noobj = 0.5 in the paper) in front of the non-object confidence term? Again because the target objects are few compared with non-objects: without this down-weighting, the non-object confidence loss would be large and carry too much weight, and the network would learn only background features while ignoring the objects.

3) Category loss function

The category loss is a crude sum-squared error taken directly between the predicted and true class probabilities, which is an unsatisfying part of YOLOv1; later versions change it (YOLOv3, for example, uses cross-entropy).

4. Existing problems:

1) Poor on crowded scenes: YOLOv1 predicts only 2 bboxes per grid cell, and they share a single class, so performance suffers when small objects cluster together.

2) It regresses bbox positions directly, without anchor-relative parameterization, so its recall is lower than that of region-proposal-based methods.

Source: blog.csdn.net/wanchengkai/article/details/124384605