The basic idea of YOLOv5 object detection

During detection, YOLO divides the input image into an S×S grid. Each grid cell is responsible for detecting the objects whose centers fall inside it. In a single pass, the network predicts the bounding boxes, the localization confidence, and the probabilities of all classes for every cell.
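As a small illustration of the grid idea, the sketch below finds which cell is responsible for an object, given the object's center. The function name `grid_cell` and the 640×640 image with a 20×20 grid are assumptions chosen for the example, not values from the text.

```python
def grid_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell that contains the point (cx, cy)."""
    # Clamp so that a point lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered at (320, 100) in a 640 x 640 image with a 20 x 20 grid.
print(grid_cell(320, 100, 640, 640, 20))  # (3, 10)
```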

The overall pipeline is image preprocessing (resizing, augmentation, etc.) --> convolutional network --> post-processing (usually non-maximum suppression), after which the objects in the image are detected. In the example from the original paper, a person, a dog, and a horse are detected, each with its confidence.
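The post-processing step can be sketched as a greedy non-maximum suppression in NumPy. This is a minimal illustration, not the code YOLOv5 ships: the function names, the (x1, y1, x2, y2) box format, and the IoU threshold of 0.45 are assumptions made for the example.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thres]  # survivors go to the next round
    return keep


boxes = np.array([[10, 10, 110, 110],
                  [12, 12, 112, 112],      # near-duplicate of the first box
                  [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first almost completely, so only the higher-scoring one survives; the third box does not overlap either and is kept.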

 

The figure above shows the detection process:

1. Divide the image into an S×S grid.

2. Predict the bounding boxes and their confidence, together with the class probability map.

3. Combine the two outputs of step 2 to obtain the final result.

As an aside, the final YOLOv5 prediction result P has the following format (the order may differ between versions, but the idea is the same):

From the detection process above, we can see that we need to predict not only the category of each object, but also its position in the image and a confidence.

Suppose there are 10 object classes; then the Class Scores part in the picture above has 10 entries.

The position of each object in the image is represented by the center of its box plus the box width and height: tx and ty are the x and y coordinates of the box center (predicted as offsets within the grid cell), tw is the width of the object, and th is the height of the object.

Po is the confidence (objectness) of the box.

The final B is the number of bounding boxes predicted at different scales. The multi-scale handling follows the FPN (Feature Pyramid Network) approach to multi-scale fusion, as used with Faster R-CNN: the feature pyramid outputs feature maps of different sizes at different scales so that features are extracted better. This helps with input images of different sizes and improves prediction accuracy in the large-image, small-object case.
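To make the layout of P concrete, the sketch below computes the shape of the raw prediction at each scale. It assumes a 640×640 input, strides of 8, 16 and 32, and 3 boxes per scale, which are the commonly used YOLOv5 settings but are not stated in the text; `head_shapes` is a hypothetical helper.

```python
def head_shapes(img_size=640, num_classes=10, boxes_per_scale=3, strides=(8, 16, 32)):
    """Shape of the raw prediction at each scale: (grid_h, grid_w, B, 5 + C)."""
    per_box = 4 + 1 + num_classes  # tx, ty, tw, th + Po + class scores
    return [(img_size // s, img_size // s, boxes_per_scale, per_box) for s in strides]


for shape in head_shapes():
    print(shape)
# (80, 80, 3, 15)
# (40, 40, 3, 15)
# (20, 20, 3, 15)
```

With 10 classes, each box carries 15 numbers: four for position, one for confidence, and ten class scores.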

YOLO still uses the anchor box mechanism: at each scale, several rectangular boxes of different sizes are preset, and training and prediction are carried out relative to them.
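The relation between the raw outputs (tx, ty, tw, th) and an anchor can be sketched as follows. The formulas are the decoding commonly described for YOLOv5, under the assumption that tx and ty encode the box center relative to its grid cell; `decode_box` and the example anchor (30, 61) at stride 16 are illustrative choices, not taken from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, stride):
    """Map raw outputs to (center x, center y, width, height) in input-image pixels."""
    # Center: offset in (-0.5, 1.5) around the cell's top-left corner, scaled by stride.
    bx = (2.0 * sigmoid(tx) - 0.5 + col) * stride
    by = (2.0 * sigmoid(ty) - 0.5 + row) * stride
    # Size: the anchor scaled by a factor in (0, 4).
    bw = anchor_w * (2.0 * sigmoid(tw)) ** 2
    bh = anchor_h * (2.0 * sigmoid(th)) ** 2
    return bx, by, bw, bh


# All-zero raw outputs give the cell center and exactly the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, col=10, row=3,
                 anchor_w=30, anchor_h=61, stride=16))
# (168.0, 56.0, 30.0, 61.0)
```

Because the size factor is bounded, a prediction can only stretch or shrink its anchor by a limited amount, which is why several anchors of different sizes are preset.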

In YOLOv3/v4, when an image is fed to the network, a grid is laid out at each scale, and an object is predicted by the scale whose grid cell its center point is assigned to. Each match uses 3 anchor boxes.

YOLOv5 is different: it supports cross-layer prediction, that is, every scale can take part in the prediction, and the detected object can count as a positive sample at each layer. In this case, each match involves 3 to 9 anchor boxes.
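One way an object can end up matched at several scales is a shape-ratio test between the object and every anchor. The sketch below is a simplified illustration under stated assumptions: the anchor sizes are the widely used COCO defaults, the ratio threshold of 4.0 is assumed, and `matching_anchors` is a hypothetical helper rather than the actual training code.

```python
# Anchor sizes (w, h) in pixels, keyed by stride. Assumed COCO defaults.
ANCHORS = {
    8: [(10, 13), (16, 30), (33, 23)],
    16: [(30, 61), (62, 45), (59, 119)],
    32: [(116, 90), (156, 198), (373, 326)],
}


def matching_anchors(box_w, box_h, ratio_thres=4.0):
    """Return (stride, anchor) pairs whose width and height are both within
    a factor of ratio_thres of the box."""
    matches = []
    for stride, anchors in ANCHORS.items():
        for aw, ah in anchors:
            rw, rh = box_w / aw, box_h / ah
            worst = max(rw, 1 / rw, rh, 1 / rh)  # largest mismatch in either direction
            if worst < ratio_thres:
                matches.append((stride, (aw, ah)))
    return matches


for stride, anchor in matching_anchors(60, 80):
    print(stride, anchor)
```

For a 60×80 object, this test accepts 7 of the 9 anchors, spread over all three strides, so the object counts as a positive sample at every scale.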

Confidence score formula mentioned above:

Total Confidence = Position Confidence × Category Confidence
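A minimal sketch of this product for a single box is shown below, reading "position confidence" as Po and "category confidence" as the class scores. The helper names and the example numbers are made up for illustration.

```python
def class_confidences(objectness, class_probs):
    """Total confidence per class = Po x class probability."""
    return [objectness * p for p in class_probs]


def best_class(objectness, class_probs):
    """Return (class index, total confidence) of the highest-scoring class."""
    scores = class_confidences(objectness, class_probs)
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


idx, conf = best_class(0.9, [0.1, 0.8, 0.1])
print(idx, round(conf, 2))  # 1 0.72
```

A box is kept only if this total confidence is high enough, so a confident class score cannot rescue a box that the network considers unlikely to contain an object.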

Loss function calculation formula:

Total loss = classification loss + localization loss (error between predicted bounding box and ground truth) + confidence loss
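The sum can be sketched for a single matched prediction as below. This is a simplification under stated assumptions: the localization term is a plain 1 - IoU (YOLOv5 itself uses a CIoU-based term), the classification and confidence terms are binary cross-entropy, and the weights default to 1.0 to mirror the formula above, whereas real training configurations usually weight the three terms differently.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one target y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_loss(iou, cls_probs, cls_targets, obj_prob, obj_target,
               w_box=1.0, w_cls=1.0, w_obj=1.0):
    """Classification + localization + confidence loss for one matched prediction."""
    loc = 1.0 - iou  # simplified localization loss
    cls = sum(bce(p, y) for p, y in zip(cls_probs, cls_targets)) / len(cls_probs)
    obj = bce(obj_prob, obj_target)
    return w_cls * cls + w_box * loc + w_obj * obj


good = total_loss(0.95, [0.02, 0.96, 0.02], [0, 1, 0], obj_prob=0.97, obj_target=1)
poor = total_loss(0.50, [0.30, 0.40, 0.30], [0, 1, 0], obj_prob=0.60, obj_target=1)
print(good < poor)  # True
```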


Origin blog.csdn.net/qq_35326529/article/details/128168716