In-depth understanding of object detection: YOLOv1

References:

Blog post: https://blog.csdn.net/c20081052/article/details/80236015
Paper: http://arxiv.org/abs/1506.02640
Darknet code: https://github.com/pjreddie/darknet
TensorFlow code: https://github.com/hizhangp/yolo_tensorflow
Source code analysis: https://zhuanlan.zhihu.com/p/25053311

YOLOv1 is another masterpiece co-authored by Ross Girshick (rbg) after R-CNN, Fast R-CNN, and Faster R-CNN, with a very catchy name: YOLO (You Only Look Once). YOLOv1 treats object detection as a regression problem: a single end-to-end network goes from the raw input image directly to the output of object positions and categories. Although this version has some flaws, it solves a major pain point of deep-learning-based detection at the time, namely speed: the full version runs at 45 fps on a GPU, and the simplified version at 155 fps.

The main features of YOLO are:

  • It is fast and meets real-time requirements: it reaches 45 frames per second on a Titan X GPU.
  • It uses the whole image as context, so it makes fewer background errors (mistaking background for an object).
  • It has strong generalization ability.

1. The core idea of YOLOv1

  •  The core idea of YOLO is to use the entire image as the input of the network and directly regress the bounding box positions and their categories in the output layer.
  • YOLO treats object detection as a regression problem: after one inference pass over the input image, the positions of all objects in the image, their categories, and the corresponding confidence scores are obtained. In contrast, R-CNN/Fast R-CNN/Faster R-CNN split detection into two parts: the object category (a classification problem) and the object position, i.e., the bounding box (a regression problem).

2. The network structure of YOLOv1

The YOLOv1 detection network contains 24 convolutional layers (for feature extraction) and 2 fully connected layers (for predicting box positions and class confidences), and uses many 1x1 convolutions to reduce the feature space coming from the preceding layers. The paper also presents a Fast YOLO architecture with 9 convolutional layers and 2 fully connected layers. On a Titan X GPU, Fast YOLO reaches a detection speed of 155 fps; its mAP drops from YOLO's 63.4% to 52.7%, which is still much higher than the mAP of earlier real-time detectors such as DPM. A condensed sketch of the architecture is given below.
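To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of a YOLOv1-style network. It is not the exact 24-layer architecture from the paper (layer counts and channel widths are abbreviated here), but it shows the pattern: stacked convolutions with 1x1 reduction layers, then two fully connected layers emitting the S x S x (B*5 + C) output.

```python
# Minimal YOLOv1-style network sketch (abbreviated, for illustration only).
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

class TinyYOLOv1(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout, k, s=1):
            # conv + leaky ReLU, the activation used throughout YOLOv1
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
                nn.LeakyReLU(0.1),
            )
        self.features = nn.Sequential(
            block(3, 64, 7, s=2), nn.MaxPool2d(2),
            block(64, 192, 3), nn.MaxPool2d(2),
            block(192, 128, 1),   # 1x1 reduction layer
            block(128, 256, 3), nn.MaxPool2d(2),
            block(256, 256, 1),   # 1x1 reduction layer
            block(256, 512, 3), nn.MaxPool2d(2),
            block(512, 512, 3, s=2),
            block(512, 1024, 3),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * 7 * 7, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),  # 7*7*30 = 1470 outputs
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, S, S, B * 5 + C)

net = TinyYOLOv1()
out = net(torch.zeros(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 7, 7, 30])
```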

  •  YOLOv1 divides an image into S x S grid cells. If the center of an object falls inside a grid cell, that cell is responsible for predicting the object. As shown in the figure below, the center point (red dot) of the dog falls into the cell in the fifth row and second column, so this cell is responsible for predicting the dog.

  •  Each grid cell predicts B bounding boxes, and each bounding box carries five values: x, y, w, h, and confidence. Here x, y are the coordinates of the center of the bounding box predicted by the current cell, and w, h are its width and height (note: during training, w and h are normalized to [0, 1] by the image width and height, while x, y are the offsets of the box center relative to the current grid cell's position, also normalized to [0, 1]). Each box must also predict a confidence value, which carries two pieces of information: how confident the model is that the box contains an object, and how accurate it thinks the box is. It is defined as:

        confidence = Pr(Object) * IOU_pred^truth

      If an object falls in the grid cell, the first term is 1, otherwise it is 0. The second term is the IoU between the predicted bounding box and the ground truth box.

  • Each grid cell also predicts class information, namely C class probabilities. An image is divided into S x S cells, each cell predicts B bounding boxes as well as the C class probabilities, so the output is a tensor of shape S x S x (5*B + C). In the paper the author chose S=7, B=2, C=20, so the output dimension is 7 x 7 x (20 + 2*5) = 1470, which is why the fc26 layer in the network code is 1470-dimensional.
    Note: the class information is per grid cell, while the confidence information is per bounding box.

  • At test time, the class probability predicted by each grid cell is multiplied by the confidence predicted by each bounding box to obtain the class-specific confidence score of each bounding box (a sketch of this computation follows below):

        Pr(Class_i | Object) * Pr(Object) * IOU_pred^truth = Pr(Class_i) * IOU_pred^truth

 The first term on the left is the class probability predicted by the grid cell, and the second and third terms together are the confidence predicted by the bounding box. The product encodes both the probability that the predicted box belongs to a certain class and how well the box fits the object. After obtaining the class-specific confidence score of each box, a threshold is set to filter out low-scoring boxes, and NMS is applied to the remaining boxes to obtain the final detection result.
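Here is a minimal NumPy sketch of that computation. The channel layout assumed here (boxes first, then class probabilities) is an illustration only and varies between implementations.

```python
# Splitting the 7x7x30 output tensor and computing class-specific scores.
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)  # stand-in for a network output

boxes = output[..., :B * 5].reshape(S, S, B, 5)  # per-box (x, y, w, h, conf)
box_conf = boxes[..., 4]                         # Pr(Object)*IOU, shape (7, 7, 2)
class_probs = output[..., B * 5:]                # Pr(Class_i | Object), shape (7, 7, 20)

# class-specific confidence score for every box and class:
# Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
scores = box_conf[..., None] * class_probs[:, :, None, :]
print(scores.shape)  # (7, 7, 2, 20) -> 98 boxes x 20 classes
```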

3. The implementation details of YOLOv1

  1. Each grid cell's prediction is 30-dimensional: 8 dimensions are the coordinates of the two regression boxes (2 x 4), 2 dimensions are the confidences of the two boxes, and 20 dimensions are the class probabilities. The x, y coordinates are normalized to [0, 1] as offsets relative to the corresponding grid cell, and w, h are normalized to [0, 1] by the image width and height.

The 448x448 input is processed by a network that adds several convolutional layers following the idea of Google's Inception (which works well). As shown in the figure above, the center of the object falls in the red grid cell, so Pr(Object) = 1 there. Why does the author include Pr(Object) in the confidence? Because it is needed for the conditional-probability computation at test time.
In the YOLO paper, the author mentions that the network does not predict w and h directly, but their square roots, sqrt(w) and sqrt(h).
The (x, y, w, h) parameters are interpreted as follows:
(x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
(w, h) are the width and height predicted relative to the whole image.
Explanations of x, y, w, h found online mostly copy each other, and this point is easy to misunderstand, so here is the meaning of the four values predicted for each bbox in detail:
x, y are the offsets of the bbox center relative to the top-left corner of its grid cell, and the width and height are normalized by the width and height of the entire image. For an S x S grid over an image of size (img_w, img_h), the offsets are computed as:

        x = x_center * S / img_w - col,    y = y_center * S / img_h - row

where (col, row) are the indices of the grid cell containing the center.
After this encoding, the values of x, y, w, h all lie between 0 and 1, which avoids convergence difficulties.
Note that YOLOv1 has no concept of anchors, so there is no anchor initialization during training. Only the network weights need to be initialized; the network then outputs the 7 x 7 x 30 result, whose values include the x, y, w, h predictions. A sketch of this encoding and its inverse is given below.
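Below is a small Python sketch of this encoding and the corresponding decoding; encode_box and decode_box are hypothetical helper names for illustration, not functions from the YOLO source.

```python
# Encoding/decoding of a bbox as described above: x, y are offsets within
# the responsible grid cell, w, h are normalized by the image size.
S = 7

def encode_box(xc, yc, w, h, img_w, img_h):
    """Map an absolute-pixel box center/size to YOLOv1-style targets."""
    col = int(xc * S / img_w)          # grid cell containing the center
    row = int(yc * S / img_h)
    x = xc * S / img_w - col           # offset within the cell, in [0, 1]
    y = yc * S / img_h - row
    return (row, col), (x, y, w / img_w, h / img_h)

def decode_box(row, col, x, y, w, h, img_w, img_h):
    """Invert encode_box back to absolute pixel coordinates."""
    xc = (col + x) * img_w / S
    yc = (row + y) * img_h / S
    return xc, yc, w * img_w, h * img_h

cell, target = encode_box(248, 320, 120, 180, img_w=448, img_h=448)
print(cell, target)                          # cell (5, 3), values in [0, 1]
print(decode_box(*cell, *target, 448, 448))  # recovers (248, 320, 120, 180)
```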

  2. Design of loss function

     In the implementation, the most important question is how to design the loss function so that the coordinate, confidence, and class terms are well balanced. The author simply uses sum-squared error loss for everything. This approach has the following problems:
          First, it is clearly unreasonable for the 8-dimensional localization error and the 20-dimensional classification error to be weighted equally;
          second, if a grid cell contains no object (and most cells in an image do not), the confidences of its boxes are pushed toward 0. Since such cells far outnumber the cells that do contain objects, this overwhelms the gradient and can make the network unstable or even diverge.
     Solution:

  • Pay more attention to the 8-dimensional coordinate predictions by giving their loss a larger weight, denoted lambda_coord and set to 5 for Pascal VOC training.
  • Give the confidence loss of boxes that contain no object a small weight, denoted lambda_noobj and set to 0.5 for Pascal VOC training.
  • The confidence loss of boxes that do contain an object and the classification loss keep the normal weight of 1.
  • A grid cell predicts multiple bounding boxes, but during training we want each object (ground truth box) to have only one responsible bounding box (one object, one bbox). Concretely, the bounding box with the largest IoU against the ground truth box is made responsible for predicting it. This is called specialization of the bounding box predictors: each predictor becomes better and better at predicting ground truth boxes of particular sizes, aspect ratios, or classes.
  • When predicting bboxes of different sizes, the same absolute offset is far less tolerable for a small box than for a big box, yet sum-squared error penalizes them equally. To alleviate this, the author uses a trick: the loss is computed on the square roots of the box width and height instead of the width and height themselves. As shown in the figure below, the small bbox sits where the horizontal-axis value of the sqrt curve is small, so the same offset produces a larger change on the y-axis (green in the figure) than it does for the big box (red in the figure).



         The final loss function is as follows:
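Written out (reconstructed from the paper, with lambda_coord = 5 and lambda_noobj = 0.5, where 1_ij^obj indicates that predictor j in cell i is responsible for an object):

```latex
L = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
  + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
  + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
  + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2
  + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
```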

 In this loss function:

  • Classification error is penalized only when a grid cell contains an object.
  • A box's coordinate error is penalized only when that box predictor is responsible for the ground truth box, and responsibility is assigned to the predictor whose box has the largest IoU with the ground truth box among all boxes in that cell. A simplified sketch of the whole loss computation follows.
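The following is a simplified NumPy sketch of this loss for a single image, under stated assumptions: the prediction/target layouts and the helper name are mine, the responsible predictor index and its target confidence (the IoU) are assumed to be precomputed, and predicted w, h are assumed nonnegative so the square roots are valid.

```python
import numpy as np

S, B, C = 7, 2, 20
l_coord, l_noobj = 5.0, 0.5

def yolo_loss(pred_boxes, pred_cls, obj_mask, resp, tbox, tcls, tconf):
    """pred_boxes: (S, S, B, 5) as (x, y, w, h, conf); pred_cls: (S, S, C).
    obj_mask: (S, S) bool, True where a cell contains an object.
    resp: (S, S) int, index of the responsible predictor (largest IoU).
    tbox: (S, S, 4) target (x, y, w, h); tcls: (S, S, C) one-hot classes.
    tconf: (S, S) target confidence (the IoU) for the responsible box."""
    ii, jj = np.where(obj_mask)
    r = resp[ii, jj]
    rb = pred_boxes[ii, jj, r]                 # responsible boxes, (N, 5)
    tb = tbox[ii, jj]                          # their targets, (N, 4)

    # coordinate loss: (x, y) directly, (w, h) through square roots
    xy = np.sum((rb[:, :2] - tb[:, :2]) ** 2)
    wh = np.sum((np.sqrt(rb[:, 2:4]) - np.sqrt(tb[:, 2:4])) ** 2)

    # confidence loss: responsible boxes vs. all remaining boxes
    conf_obj = np.sum((rb[:, 4] - tconf[ii, jj]) ** 2)
    noobj = np.ones((S, S, B), dtype=bool)
    noobj[ii, jj, r] = False                   # everything but responsible boxes
    conf_noobj = np.sum(pred_boxes[..., 4][noobj] ** 2)  # target conf is 0

    # classification loss, only for cells that contain an object
    cls = np.sum((pred_cls[ii, jj] - tcls[ii, jj]) ** 2)

    return l_coord * (xy + wh) + conf_obj + l_noobj * conf_noobj + cls

loss = yolo_loss(np.random.rand(S, S, B, 5), np.random.rand(S, S, C),
                 np.zeros((S, S), bool), np.zeros((S, S), int),
                 np.zeros((S, S, 4)), np.zeros((S, S, C)), np.zeros((S, S)))
```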

 

 

4. YOLOv1 training and testing

1. Training phase

 

  • Pre-train a classification network: pre-train a classification network on the ImageNet 1000-class dataset. This network consists of the first 20 convolutional layers in Figure 3, plus an average-pooling layer and a fully connected layer (the network input is 224*224 at this stage).
  • Train the detection network: convert the model to perform the detection task. "Object detection networks on convolutional feature maps" showed that adding convolutional and fully connected layers to a pre-trained network can improve performance. Following their example, 4 convolutional layers and 2 fully connected layers are added, with randomly initialized weights. Detection requires fine-grained visual information, so the network input is increased from 224*224 to 448*448. See Figure 3.
  • The image is divided into 7x7 grid cells; the cell containing an object's center is responsible for predicting that object.

  • The output of the last layer is (7*7)*30-dimensional. Each 1*1*30 slice corresponds to one of the 7*7 cells of the original image and contains both the class predictions and the bbox coordinate predictions. Roughly speaking, the grid cell is responsible for the class information, while the bounding boxes are mainly responsible for the coordinate information (and partly for class information too, since confidence also reflects whether an object is present).
  • Each grid cell (one 1*1*30 slice) predicts the coordinates (x, y, w, h) of two bounding boxes (the yellow solid boxes in the figure), where the center coordinates x, y are normalized to [0, 1] relative to the corresponding cell, and w, h are normalized to [0, 1] by the image width and height. Besides regressing its own position, each bounding box also predicts a confidence value, which combines how confident the model is that the box contains an object with how accurate the box is:

        confidence = Pr(Object) * IOU_pred^truth

If an object falls in the grid cell, the first term is 1, otherwise it is 0. The second term is the IoU between the predicted bounding box and the ground truth box.

 

  • Each grid cell also predicts class probabilities; there are 20 classes in the paper. For a 7x7 grid, each cell predicts 2 bounding boxes and 20 class probabilities, so the output is 7x7x(5x2 + 20). (General formula: S x S cells, each predicting B bounding boxes and C class probabilities, gives an output tensor of S x S x (5*B + C). Note: the class information is per grid cell, while the confidence information is per bounding box.) A sketch of how the training targets for this tensor might be built appears below.
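As a sketch of how ground-truth annotations could be turned into the 7x7x30 training target described above (build_target is a hypothetical helper, not from the YOLO source; the channel layout mirrors the earlier examples):

```python
# For each object: fill the responsible cell with the encoded box,
# confidence 1, and the one-hot class vector.
import numpy as np

S, B, C = 7, 2, 20

def build_target(objects, img_w, img_h):
    """objects: list of (xc, yc, w, h, class_id) in absolute pixels."""
    target = np.zeros((S, S, B * 5 + C), dtype=np.float32)
    for xc, yc, w, h, cls in objects:
        col, row = int(xc * S / img_w), int(yc * S / img_h)
        x, y = xc * S / img_w - col, yc * S / img_h - row
        for b in range(B):
            # same target for both boxes; at train time only the box with
            # the higher IoU against its prediction is made responsible
            target[row, col, b * 5:b * 5 + 5] = [x, y, w / img_w, h / img_h, 1.0]
        target[row, col, B * 5 + cls] = 1.0  # one-hot class
    return target

t = build_target([(248, 320, 120, 180, 11)], 448, 448)
print(t.shape, t[5, 3, :5])  # (7, 7, 30) and the filled box target
```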

 2. Test phase

At test time, the class information predicted by each grid cell is multiplied by the confidence predicted by each bounding box to obtain the class-specific confidence score of each bounding box:

        Pr(Class_i | Object) * Pr(Object) * IOU_pred^truth = Pr(Class_i) * IOU_pred^truth

The first term on the left side of the equation is the category information predicted by each grid, and the second and third terms are the confidence predicted by each bounding box. This product encodes the probability that the predicted box belongs to a certain category, and also has information about the accuracy of the box.

  •  The same operation is performed for every bbox of every grid cell: 7x7x2 = 98 bboxes, each with both class information and coordinate information.

  • After obtaining the class-specific confidence score of each bbox, a threshold is set to filter out low-scoring boxes, and NMS is applied to the remaining boxes to get the final detection result.

  •  In other words, once the bounding boxes, confidences, and class probabilities are obtained, the non-maximum suppression algorithm selects the target boxes to keep; a minimal NMS sketch follows.
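Below is a minimal NumPy sketch of greedy NMS (a generic implementation for illustration, not the routine from the YOLO source); boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box (4,) and many boxes (N, 4)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[10, 10, 100, 100], [12, 12, 98, 96], [200, 200, 300, 320]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first
```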

5. The disadvantages of YOLOv1

  • YOLO performs poorly on objects that are very close together and on small objects that appear in groups, because each grid cell predicts only two boxes and only one class.

  • Generalization is weak when objects of a known class appear at new or unusual aspect ratios, or in other uncommon configurations.

  • Because of the loss function design, localization error is the main factor limiting detection performance; in particular, the handling of large versus small objects needs improvement.
