Introduction to Object Detection - YOLO

        Object detection, also called object extraction, is a form of image segmentation based on the geometric and statistical characteristics of objects. It combines object localization and recognition into a single task, and its accuracy and real-time performance are important capabilities of the whole system.

        We require the detector to output five values for each object: the object category (class), the coordinates x1 and y1 of the upper-left corner of the bounding box, and the coordinates x2 and y2 of the lower-right corner of the bounding box.
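        As a minimal illustration of this output format (the class index and coordinates below are made up), one detection could be represented as:

from dataclasses import dataclass

@dataclass
class Detection:
    # one detector output: a category plus an axis-aligned bounding box
    class_id: int   # object category (e.g. an index into a label list)
    x1: float       # upper-left corner x
    y1: float       # upper-left corner y
    x2: float       # lower-right corner x
    y2: float       # lower-right corner y

# a hypothetical "dog" detection (class index 11) in pixel coordinates
det = Detection(class_id=11, x1=48.0, y1=60.5, x2=210.0, y2=305.0)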

Traditional Object Detection

        Traditional algorithms are usually divided into three stages: region selection, feature extraction, and feature classification.

        1) Region selection. Select the positions in the image where an object may appear. Because the position and size of the object are unknown in advance, traditional algorithms usually use a sliding window (Sliding Windows) approach (see the sketch after this list). This produces a large number of redundant candidate boxes, so the computational cost is high.


        2) Feature extraction. After obtaining candidate object positions, features are extracted with hand-designed extractors such as SIFT and HOG. Because hand-designed extractors have few parameters and low robustness, the quality of the extracted features is limited.

        3) Feature classification. Classify the features obtained in the previous step, usually with an SVM or AdaBoost classifier.
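        A minimal sketch of sliding-window region selection (the window size, stride, and dummy image are illustrative assumptions):

import numpy as np

def sliding_windows(image, win_h=64, win_w=64, stride=32):
    # yield (x1, y1, x2, y2) candidate regions by sliding a fixed-size window
    h, w = image.shape[:2]
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)

image = np.zeros((448, 448, 3), dtype=np.uint8)    # dummy image
print(sum(1 for _ in sliding_windows(image)))      # 169 candidates at a single scale

        Scanning multiple scales and aspect ratios multiplies this count, which is exactly the redundancy and computational cost mentioned in step 1).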

Deep Learning Object Detection

        Classic deep-learning detection methods:

        1) two-stage: the Faster R-CNN, Mask R-CNN series

        2) one-stage (single-stage): the YOLO series

        Two-stage methods are more accurate but slower; single-stage methods are faster but less accurate.

1. Evaluation Metrics

1) IoU (Intersection over Union): the ratio of the intersection to the union of two boxes.
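        A minimal sketch of computing IoU for two boxes in (x1, y1, x2, y2) format (the helper name iou is an assumption; it is reused in the NMS sketch later):

def iou(box_a, box_b):
    # intersection rectangle of the two boxes
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # 2500 / 17500 ≈ 0.143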

 2) mAP (mean Average Precision): the average of the per-class AP; it comprehensively measures detection performance.

        First, precision and recall are calculated as follows:

        Precision=\frac{TP}{TP+FP}

        Recall=\frac{TP}{TP+FN}

        Here TP, FP, and FN are defined as follows: TP (true positives) are positive predictions that are actually positive, FP (false positives) are negative samples wrongly predicted as positive, and FN (false negatives) are positive samples that were missed.

        Precision: the proportion of predicted positives that are truly positive.

        Recall: the proportion of all actual positives that are correctly predicted.
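        As a quick hypothetical example: suppose a test set contains 8 dogs and the detector outputs 10 boxes labeled "dog", 6 of which are correct. Then TP=6, FP=4, FN=2, so Precision=\frac{6}{10}=0.6 and Recall=\frac{6}{8}=0.75.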

        Each prediction result in object detection contains two parts: the predicted bounding box and a confidence score (Pc). The confidence carries two meanings: which category the predicted box belongs to, and how confident the model is in that category. Predicted boxes whose confidence exceeds the threshold become detection boxes.

        In general, as the confidence threshold decreases, more prediction boxes are judged as positive and recall increases, but this inevitably lets in negative examples that are falsely detected as positive, so precision decreases.

        As the confidence threshold is adjusted, recall increases steadily while precision decreases overall, jumping up and down locally. The PR curve is as follows:

         The calculation of AP (Average Precision) is essentially the area under the PR curve, with one difference: the PR curve is first smoothed. The method is to take the precision p corresponding to recall r as the maximum precision over all recalls greater than or equal to r. That is:

        p(r)=\max_{\bar{r} \geq r} p(\bar{r})

        After smoothing, it looks like this:

         The AP can then be defined as the area enclosed between the interpolated precision-recall curve and the X axis, i.e. the area under the smoothed PR curve.

        mAP is computed by calculating the AP of every category and taking the average:

        mAP=\frac{\sum_{i=1}^{K}AP_{i}}{K}
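        A minimal sketch of the interpolated AP described above, given precision/recall points collected while lowering the confidence threshold (array names and the helper name are assumptions):

import numpy as np

def average_precision(recall, precision):
    # recall/precision: one point per confidence threshold, recall sorted ascending
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # smoothing: p(r) becomes the maximum precision at any recall >= r
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # area under the resulting step curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of the per-class APs:
# mAP = sum(ap_per_class) / num_classes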

2. Introduction to YOLO-V1

        YOLO (You Only Look Once) is the classic one-stage method, authored by Joseph Redmon. It converts the detection problem into a regression problem that a single CNN can solve, it can perform real-time detection on video, and its range of applications is very wide.

1) Prediction phase (forward inference)

        The network structure of YOLO-V1 is as follows:

         The input is a 448\times 448\times 3 image: 448\times 448 is the spatial size, and \times 3 is the three RGB channels of a color image. Image features are extracted through 24 convolutional layers, a 1470-dimensional vector is obtained by regression through 2 fully connected layers, and a 7\times 7\times 30 tensor is obtained by a reshape operation. The output tensor contains the coordinates, confidences, and category results of all prediction boxes, and the target detection results are obtained by post-processing it.
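        A minimal sketch (using NumPy, with the network itself omitted) of how the 1470-dimensional output maps onto the 7\times 7\times 30 tensor:

import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, number of classes
fc_out = np.random.rand(1470)             # stand-in for the second fully connected layer's output
assert S * S * (B * 5 + C) == 1470        # 7 * 7 * 30 = 1470
output = fc_out.reshape(S, S, B * 5 + C)  # the (7, 7, 30) output tensor
print(output.shape)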

Introduction to detection steps:

        1) First, YOLO divides the image into S\times S grid cells (grids); in YOLO-V1, S=7.

        2) Each grid cell predicts B bounding boxes; in YOLO-V1, B=2. Each bounding box contains five parameters (x, y, h, w, c), where x, y are the center coordinates, h, w are the height and width, and c is the confidence. The shape and size of a bounding box are variable, and the position of its center point determines which grid cell it belongs to.

        3) Each grid cell also outputs the conditional probability of belonging to each category.

        4) Multiply the confidence of each bounding box by the conditional category probabilities of its grid cell to obtain a score for each category. After post-processing, the target detection result is obtained.

The 7\times 7\times 30 output tensor is composed as follows:

        First, each image has 7\times 7 grid cells, which gives the 7\times 7 part of the output tensor.

        Each grid cell has two bounding boxes, and each bounding box has 5 parameters (x, y, h, w, c). That is, each grid cell has 10 bounding-box parameters.

        In YOLO-V1 there are 20 categories, so each grid cell also has 20 conditional class probabilities.

        Taken together, this gives the 7\times 7\times (2\times 5+20)=7\times 7\times 30 tensor.
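        A minimal sketch of splitting one grid cell's 30 values and forming per-class scores (the ordering of the 30 values here is an assumption; it continues the NumPy snippet above):

cell = output[3, 4]                     # the 30 values of one grid cell
boxes = cell[:B * 5].reshape(B, 5)      # two boxes, each (x, y, h, w, c)
class_probs = cell[B * 5:]              # 20 conditional class probabilities

# class score = box confidence c * conditional class probability
for j in range(B):
    scores = boxes[j, 4] * class_probs  # shape (20,)
    print(f"box {j}: best class {scores.argmax()}, score {scores.max():.3f}")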

        

2) Prediction phase (post-processing)

        After forward inference, a 30-dimensional vector is obtained for each grid cell.

         Multiply the confidence of each bounding box by the conditional class probabilities of the grid cell it belongs to, obtaining a confidence score for each category.

        For each class, first set a confidence threshold, set all confidences below the threshold to 0, and then sort the remaining confidences; the result is visualized in the figure below:

        In the figure, the thickness of the box outline represents the confidence, and its color represents the category it belongs to.

        Then perform NMS (Non-Maximum Suppression) on the prediction boxes of the same category to obtain the final detection result.

NMS (Non-Maximum Suppression, non-maximum suppression)

       Taking the detection of dogs as an example, after threshold screening and sorting, the confidence data of all bboxes (bounding boxes) for dogs are obtained as follows:

The NMS steps are as follows:

        1) Compute the IoU between the bbox with the highest confidence and every other bbox. If the IoU is greater than a certain threshold, the box is considered a duplicate detection of the highest-confidence bbox and its confidence is set to 0; if it is below the threshold, the box is kept.

         2) Move on to the bbox with the next-highest non-zero confidence and compute its IoU with every remaining bbox, repeating step 1) until all bboxes have been processed. This yields the detection result for the dog class.

        Performing steps 1) and 2) for each category gives the detection results for every category.
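        A minimal sketch of the class-wise NMS procedure above (it reuses the iou() helper sketched earlier; all other names are assumptions):

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: their confidences for one class
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        keep.append(i)                   # current highest remaining confidence
        for j in order:
            if j != i and j not in suppressed and iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed.add(j)        # duplicate detection of box i: drop it
    return keep

boxes = [(50, 50, 200, 200), (55, 60, 210, 210), (300, 300, 400, 400)]
scores = [0.90, 0.80, 0.75]
print(nms(boxes, scores))                # [0, 2]: the second box overlaps the first and is suppressed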

3) Training phase (backpropagation)

        In the training phase, the main goal is to make the prediction boxes fit the ground truth (the correctly labeled data) as closely as possible, i.e. to minimize the loss function.

         The green box in the figure is the ground truth. The grid cell in which the center point of the green box falls is responsible for fitting that box, and the category output by that grid cell should also be the category of the ground truth (this limits the number of objects YOLO-V1 can detect: at most 7\times 7=49).

        For each grid cell, two bounding boxes are generated. The bbox with the larger IoU with the ground truth is responsible for fitting the ground truth, and the other box should reduce its confidence as much as possible.

         In the figure, the solid green box fitting the ground truth is obtained by adjusting the outer dashed green predicted box.

        For grid cells in which no ground-truth center point falls, the confidences of both bboxes should be as small as possible.
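        A minimal sketch of the assignment described above: which grid cell, and which of its boxes, is responsible for a ground-truth box (helper names are assumptions; iou() is the helper sketched earlier):

def responsible_cell(gt_box, img_w, img_h, S=7):
    # grid cell containing the ground-truth box center
    cx = (gt_box[0] + gt_box[2]) / 2.0
    cy = (gt_box[1] + gt_box[3]) / 2.0
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

def responsible_box(pred_boxes, gt_box):
    # among the cell's B predicted boxes, the one with the largest IoU fits the ground truth
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], gt_box))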

Loss Function

       Based on the above goals, the loss function is designed as follows:
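        For reference (the original figure of the loss is not reproduced here), this is the YOLO-V1 loss as given in the paper, with \lambda_{coord}=5 and \lambda_{noobj}=0.5; its five terms correspond to the five items listed below:

        L=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right]
        +\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right]
        +\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_{i}-\hat{C}_{i})^{2}
        +\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_{i}-\hat{C}_{i})^{2}
        +\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_{i}(c)-\hat{p}_{i}(c))^{2}

        Here \mathbb{1}_{ij}^{obj} is 1 if the j-th box of grid cell i is responsible for an object (and 0 otherwise), \mathbb{1}_{ij}^{noobj} is its complement, and \mathbb{1}_{i}^{obj} is 1 if an object's center falls in grid cell i.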

         Among them, all parameters marked with a hat are labeled values, and those without are predicted values.

The loss consists of five terms:

        1) The localization error of the bbox center point for the box responsible for detecting an object.

        2) The width and height localization error of the bbox responsible for detecting an object. (Taking the square root makes the loss more sensitive to errors on small boxes.)

        3) The confidence error of the bbox responsible for detecting an object.

                Confidence predicted value: the confidence of this bbox taken from the model's forward-inference result.

                Confidence label value: the IoU between the bbox responsible for detecting the object and the ground truth.

        4) The confidence error of bboxes not responsible for detecting an object.

                The confidence label value is set to 0.

        5) The classification error of the grid cell responsible for detecting an object.
