Deep Learning Object Detection Series: YOLO9000

1. Gossip

            Before getting into the formal material, I like to loosen up a little. To me, technology is something to chat about: find a tavern, round up a few masters, put on some music and talk it over together. So I especially hope I can put things in my own colloquial words, as if we were just playing around, the way the old storytellers did: Xiao Yuanshan and Murong Bo smiling at each other, imperial ambitions, blood feuds and deep grudges all settling into dust. That is the style of expression I aspire to, but I cannot reach that level yet, so I can only do my best.

2. YOLOv2

      1. Ten improvements

            YOLOv1 raised the speed of object detection, but its mAP fell behind. Masters cast in iron, models flowing like water: naturally they would try every possible way to solve this problem. In my view it is like a programmer writing a bug, which sooner or later has to be fixed. So YOLOv2 can be divided into two parts: the first is the effort to raise mAP, the second is optimization of the original model, all under the premise of preserving detection speed.

           The following 10 points are the efforts made by the V2 authors. What do they amount to: a gain in speed? A gain in accuracy? Better generalization? Yes, but more importantly, I think they are also a record of workload for the year-end review. When we read papers we sometimes feel these masters are saints, that all their effort is for the benefit of society and the further development of AI vision. In fact they are human too: they face constraints from every direction, they have their own self-interest, their small stubbornness, and all the helplessness that comes with being human. So reading a paper is a conversation with the masters: you tell them how awesome they are while quietly thinking that you must surpass them.


           Let me explain. If the model's predictions are unsatisfactory, we generally look for the cause in three places: the data, the model, and the training strategy. If that still does not work, look more carefully!

           1. Data    

                        1) Batch norm: after data passes through a convolution layer, its mean and distribution shift. Batch normalization pulls the intermediate-layer activations back toward the same distribution as the original data. To put it bluntly, it keeps the data from drifting too far.
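As a rough illustration (a PyTorch sketch of my own, not the Darknet source), this is the kind of conv + BN block used throughout networks like Darknet-19; the BatchNorm2d layer is what re-normalizes each layer's outputs:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, k):
    """Convolution block in the spirit of Darknet-19: conv, batch norm, leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),  # bias is folded into BN
        nn.BatchNorm2d(out_ch),           # pulls activations back to a stable distribution
        nn.LeakyReLU(0.1, inplace=True),
    )
```

In the paper, adding batch norm to every convolution gives roughly a 2% mAP gain and lets dropout be removed.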

           2. Model       

                         1) Convolutional: use a fully convolutional network. The two fully connected layers at the end of YOLOv1 are removed and replaced with convolutions. The advantage of a fully convolutional network is that the input size can be arbitrary.

                         2) New network: a lightweight Darknet-19 backbone was designed, which greatly reduces the amount of computation and the number of parameters while increasing accuracy by 0.4%.

                         3) Anchor boxes: borrows the anchor idea from Faster R-CNN and adds the notion of prior boxes; each cell predicts 5 boxes.

                         4) Passthrough: concatenates the 26*26 feature map taken before the last pooling layer with the final 13*13 feature map for detection (see the sketch right after this list). There is a bit of FPN thinking here: shallow features carry more shape information and are better suited to detecting small objects, while deep features carry more semantic information and suit large objects. After fusion, the model's ability to predict small objects improves.
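Here is a minimal sketch of the passthrough idea (my own PyTorch code with assumed channel counts; the real Darknet reorg layer orders channels differently and is preceded by a 1x1 convolution): a space-to-depth rearrangement turns the 26*26 map into a 13*13 map with four times the channels, which can then be concatenated with the deep 13*13 features.

```python
import torch
import torch.nn.functional as F

def passthrough(fine, coarse, block=2):
    # fine:   (N, C1, 26, 26) feature map taken before the last pooling layer
    # coarse: (N, C2, 13, 13) final backbone feature map
    reorg = F.pixel_unshuffle(fine, block)    # space-to-depth -> (N, C1*4, 13, 13)
    return torch.cat([reorg, coarse], dim=1)  # fused map: (N, C1*4 + C2, 13, 13)

fused = passthrough(torch.randn(1, 512, 26, 26), torch.randn(1, 1024, 13, 13))
print(fused.shape)  # torch.Size([1, 3072, 13, 13])
```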


            3. Training strategy

                         1) Hi-res classifier: a high-resolution classifier. The ImageNet classification network is fine-tuned on 448*448 images for ten epochs before detection training, whereas YOLOv1 trained the classifier directly at 224 and then had to extract features at 448. This improves the model's classification ability at the detection resolution.

                         2) Dimension priors: unlike the hand-picked prior-box sizes and aspect ratios in Faster R-CNN, YOLOv2 runs k-means on the training boxes before training, using 1 - IOU as the distance between boxes, and takes the width and height of each cluster center as the width and height of a prior box (a clustering sketch follows this list).

                         3) Location prediction: unlike Faster R-CNN, the x and y of a YOLOv2 prediction box are offsets from the top-left corner of the cell, and a sigmoid is applied so that the predicted center cannot leave its cell (see the decoding sketch after this list). Faster R-CNN's parameterization can make the model unstable, especially in the first few epochs, because the predicted center often lands far away from the cell responsible for it.


                            4) Multi-scale: training with multiple input sizes. Every 10 batches the input image size is changed at random (this is the luxury that full convolution affords); the sizes range over 320, 352, ..., 608, all multiples of 32, because the backbone downsamples its input by a factor of 32 (a sketch of the schedule follows this list). A model trained this way generalizes better and predicts more accurately across images of different sizes.

                            5) Hi-res detector: this last point is more like a by-product: the model predicts more accurately on high-resolution images. Because training already happens at high resolution, high-resolution prediction is more accurate, and high-resolution images themselves carry more detail and richer semantic information.
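For point 2), here is a rough sketch of anchor clustering (my own NumPy implementation; function names and defaults are assumptions, not the authors' code): k-means over ground-truth widths and heights, with 1 - IOU as the distance, where the IOU treats the two boxes as if they shared the same center.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, as if every box were centered at the same point."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    """Cluster box sizes with distance d = 1 - IoU; returns k prior (w, h) pairs."""
    centers = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centers), axis=1)  # min(1 - IoU) == max IoU
        centers = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
    return centers
```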
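For point 3), the location prediction boils down to the standard YOLOv2 decoding below (a small sketch; variable names are my own): the center is sigmoid(t) plus the cell's top-left corner, and the size is the prior scaled by exp(t).

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into a box, in grid-cell units."""
    bx = torch.sigmoid(tx) + cx   # sigmoid keeps the center inside cell (cx, cy)
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)       # width/height rescale the prior box
    bh = ph * torch.exp(th)
    return bx, by, bw, bh
```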
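And for point 4), the multi-scale schedule is simple enough to sketch directly (the loop below is only a placeholder for a real training loop):

```python
import random

sizes = list(range(320, 609, 32))  # 320, 352, ..., 608: all multiples of 32

input_size = 416
for batch_idx in range(1000):      # stand-in for the real training loop
    if batch_idx % 10 == 0:        # every 10 batches, draw a new input size
        input_size = random.choice(sizes)
    # the batch would be resized to (input_size, input_size) before the forward pass
```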

 

3. YOLO9000

             Why such a bold name? It sounds as if YOLO has been evolving for a very long time. The 9000 means that YOLO can predict 9000 categories. How is that done? We all know that the ImageNet classification dataset has 14,197,122 images divided into 21,841 categories, while detection datasets take far more labeling work, so they have fewer images and fewer classes; for example, the COCO detection dataset has about 330,000 images and 80 categories.

             In essence, box prediction and classification are two different tasks, so we can try to use the classification dataset for classification and the detection dataset for detection + classification (a detection dataset necessarily carries category information). That way the model can localize and label many more kinds of objects.

            The ambition is grand, but how do we realize it? We control it through the backpropagation of the loss. That sounds fancy; in fact, when a sample from the classification dataset comes in, only the classification loss participates in the backward pass.
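A schematic sketch of that rule (not the actual Darknet loss, which has more terms; the function and argument names here are my own): detection samples contribute every term, classification samples only the classification term.

```python
import torch.nn.functional as F

def joint_loss(cls_logits, cls_target, box_pred, box_target, from_detection_set):
    cls_loss = F.cross_entropy(cls_logits, cls_target)
    if not from_detection_set:
        return cls_loss                                 # classification image: box terms masked out
    return cls_loss + F.mse_loss(box_pred, box_target)  # detection image: full loss
```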

           But there is another problem: some category labels in the two datasets are not mutually exclusive. Take cat and Garfield: Garfield is a kind of cat. This is a real problem, because a flat softmax classifier assumes the categories are mutually exclusive. Then the authors found a tree. You see, great things always involve trees: WordTree. As shown in the figure below, all category labels are connected in a tree structure, and the path from each child node to the root is unique. This uniqueness removes the trouble caused by the overlap between cat and Garfield, and our game can continue. At prediction time, the probability assigned to a node equals the product of the outputs of all nodes along its path. Perfect!

                          (Figure: the WordTree category hierarchy)
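As a toy illustration of that product rule (the tree and the numbers below are made up): each node stores a probability conditioned on its parent, and the absolute probability of a label is the product of the conditionals along its path to the root.

```python
conditional = {          # P(node | parent), hypothetical values
    "animal": 0.9,       # P(animal | root)
    "cat": 0.6,          # P(cat | animal)
    "Garfield": 0.3,     # P(Garfield | cat)
}
parent = {"Garfield": "cat", "cat": "animal", "animal": None}

def absolute_prob(label):
    p = 1.0
    while label is not None:
        p *= conditional[label]
        label = parent[label]
    return p

print(absolute_prob("Garfield"))  # 0.9 * 0.6 * 0.3 = 0.162
```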

4. Performance

       1.VOC2007


       2.VOC2012

       3.COCO

 We can see that, compared with v1, YOLOv2 has indeed improved in both speed and accuracy, but its accuracy still falls short of SSD in places. Its speed, though, is very fast indeed. So YOLO, which never admits defeat, still has plenty of hard work ahead, and we will present it piece by piece in follow-up posts.

5. Summary

        In this article we covered the improvements YOLOv2 makes over v1. They fall into three areas. Data: BN. Model: Darknet-19, full convolution, passthrough, anchor boxes. Training strategy: high-resolution classification, high-resolution detection, clustering, multi-scale training, loss adjustment. We then introduced how YOLO9000 works: samples from the classification dataset backpropagate only the classification loss, and WordTree was introduced to handle category labels that are not mutually exclusive; the resulting model can detect 9000 kinds of objects. Finally we summarized YOLOv2's speed and accuracy: the accuracy improves a great deal, but on the COCO dataset there is still a gap compared with SSD. YOLO never gives up! See you next time.

6. Dessert moment

           I value one ability very much: the ability to believe. A person who can believe in something that looks impossible carries a faith that is great in itself. A child who wants to be Spider-Man strikes us as adorable; but if you still believe it as an adult, is something wrong with your head? Not necessarily, really not necessarily. How much do we actually know about the world? Pitifully little. And how accurate is that little knowledge? Pitiful again. The first step toward discovering a truth is to believe it. So I have always believed that we on Earth are all imperfect people, and that we were originally perfect: in looks, in character, infinitely bright, and in many other ways I cannot imagine. I do not know exactly why I came here, but I deeply believe that one day I will go back, back to the original, perfect form. Maybe not everyone can go back. The world may be a casino, and if you lose the bet you may not be able to return. So what is the bet? Perhaps it is the bet that, once we become imperfect, we will do some bad things.


[Heavyweight] Justin Bieber's new electronic single "Cold Water" is out ~ I cried all the way through listening to it @油兔不二分词组
