Deep Learning (4) - An introduction to the basic ideas behind the training and detection processes of the object detection algorithm YOLO

      A solid grasp of the fundamentals determines how far your research can go. When we first encounter deep learning, we usually read other people's summaries. That is a good way to get started quickly, but it has a big drawback: our understanding stays shallow, and we end up confused when it comes to optimizing an algorithm. I started these notes as a way to summarize what I have learned while digging into the essentials of deep learning, and I hope they help others too. If anything in the article is unclear, I hope fellow researchers (friends who study deep learning) will point it out, and I will work to improve it.

      Training process:

       There are two main differences between YOLO and the R-CNN series:

            The first is how candidate boxes are selected. YOLO does not need a region proposal step (such as the RPN in Faster R-CNN) to select candidate boxes. Instead, it divides the input image into an SxS grid, giving SxS grid cells, and then finds the grid cell that contains the center of each ground-truth bounding box. Each grid cell predicts B bounding boxes (B is a hyperparameter you set yourself, generally 2).
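The grid assignment described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `responsible_cell` and its parameters are hypothetical, and it assumes the box center is given in pixel coordinates of the resized image.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell that contains the box center.

    cx, cy: center of the ground-truth box in pixels (assumed convention).
    img_w, img_h: size of the resized input image in pixels.
    S: number of grid cells along each side.
    """
    # Scale the center into grid units and truncate to get the cell index.
    col = int(cx * S / img_w)
    row = int(cy * S / img_h)
    # A center lying exactly on the right or bottom edge is clamped
    # into the last cell.
    col = min(S - 1, col)
    row = min(S - 1, row)
    return row, col


# Example: a 448x448 image with a 7x7 grid has cells of 64x64 pixels.
print(responsible_cell(224, 100, 448, 448, S=7))
```

With S = 7 and a 448x448 input, each cell covers 64x64 pixels, so a center at (224, 100) falls in column 3 and row 1.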

               The second is the network structure. The R-CNN series was optimized step by step: R-CNN ran the convolutional layers separately for every candidate region, Fast R-CNN then introduced shared convolution and ROI pooling to handle regions of different sizes, and finally Faster R-CNN proposed the RPN network. The detection accuracy of these networks is very high, and their speed has gradually improved. YOLO, by contrast, is trained end to end: there is no RPN layer, and classification and box regression are not separated. The YOLO network is therefore simpler and faster, and its accuracy is gradually improving.

        Basic steps of YOLO training process:

            When training, YOLO first resizes the input images to the same resolution and then divides each image into a grid. For each object in the label, it finds the grid cell that contains the center point of the object's bounding box; that grid cell is responsible for detecting the object. For each grid cell, B bounding boxes are predicted according to the setting, and each bounding box is given a confidence score. The confidence is the product of the probability that the box contains an object and the IOU between the predicted box and the ground-truth box. The bounding boxes and class information produced this way are stored in a tensor of size SxSx(B*5+C), where C is the number of classes. Then we calculate the loss, which has three main parts. The first is the box position error. When calculating it, we only use the box that is responsible for the object, that is, the predicted box with the highest IOU with the ground truth. The second is the IOU (confidence) error, which covers both the box chosen for each object and the remaining boxes. Since there are far more remaining boxes, a weighting parameter is placed in front of their IOU loss to keep the two balanced. The third is the classification error.
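To make the IOU and the balancing parameter more concrete, here is a minimal sketch in plain Python. The function names are hypothetical, boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and the default weight of 0.5 for the remaining boxes is the value used in the original YOLO paper; treat it as an illustration of the idea rather than the actual training code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clipped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_loss(pred_conf, target_conf, responsible, lambda_noobj=0.5):
    """Sum of squared confidence errors over all predicted boxes.

    responsible[i] is True when box i is the one assigned to an object;
    every other box has its error scaled down by lambda_noobj.
    """
    loss = 0.0
    for p, t, r in zip(pred_conf, target_conf, responsible):
        weight = 1.0 if r else lambda_noobj
        loss += weight * (p - t) ** 2
    return loss


# Two 2x2 boxes offset by one unit share a 1x1 intersection.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
# One responsible box (target 1.0) and one remaining box (target 0.0).
print(confidence_loss([0.8, 0.3], [1.0, 0.0], [True, False]))
```

The lower weight on the remaining boxes keeps the many boxes that contain no object from dominating the gradient.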

       YOLO detection process:

           During detection, YOLO traverses every grid cell of the image. Each grid cell has B bounding boxes; YOLO scores the detected objects and deletes the bounding boxes whose score is below 0.2. Then, for a given class, it sorts the detections by score from high to low and uses non-maximum suppression (NMS) to delete redundant boxes. First, we select the box with the highest score and find the other boxes whose overlap with it exceeds the specified IOU threshold. These boxes are similar to the selected box, so we can discard them. Boxes whose IOU with it is below the threshold (0.5) are kept, because objects may overlap one another. Then we take the box with the next highest score as the comparison box, and so on, until a full round of screening has been carried out. We then follow the same steps for the next class. When everything is done, each class may still correspond to several boxes, but each object should have only one box, so we select the box with the maximum score for each object as the final detection box.
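The screening for a single class can be sketched as follows. This is a minimal, unoptimized illustration: the function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) corner coordinates, and the thresholds 0.2 and 0.5 are simply the values mentioned above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Return the indices of the boxes kept for one class."""
    # Step 1: drop low-scoring boxes.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Step 2: sort the rest by score, highest first.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        # Step 3: the highest-scoring remaining box is kept...
        best = candidates.pop(0)
        keep.append(best)
        # ...and every box that overlaps it too much is discarded.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))
```

In this example the second box overlaps the first with an IOU of about 0.68 and is suppressed, the third box does not overlap anything and is kept, and the fourth is removed by the score threshold, so the kept indices are 0 and 2.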

Recommended reading: https://www.jianshu.com/p/7cc2e8a465e3

 This blog post reflects my own independent thinking and understanding; please cite the source when quoting it. If I have misunderstood anything, I hope fellow bloggers will point it out.

     


Origin blog.csdn.net/qq_37100442/article/details/81706992