Deep Learning Object Detection Series: YOLOv1

1. Introduction

          Following Faster R-CNN, the great RGB once again shook up the object detection field with near-artistic performance, successfully demonstrating that famous saying: everyone who keeps improving sees yesterday's self as rubbish. We all know that Faster R-CNN learns in two stages, an RPN network plus an R-CNN head. But the great one's artistic instinct said: no, no, you only look once. On a dark, moonless, windy night this inspiration kept him awake with excitement. Life doesn't last hundreds of years, so why not get up and do it now? And so we have this masterpiece of several thousand words.

2. You Only Look Once

       1. Overview

                 In layman's terms: the computer is essentially an extension of the human, so we want it to get as close to human perception as possible, or even beyond it. Two-stage detection in the style of Faster R-CNN works like this: if you want to know where the objects you see are, you first have to imagine in your mind what shapes and sizes they might have. But common sense tells us that when people see an object, they don't stop to think about its possible shapes and sizes first. So: you only look once.

                               

        2. Detection process

                    The entire detection process is very simple, and as the figure shows it takes three steps. First, the image is resized to a size the network accepts (448*448); second, a convolutional network extracts features and fully connected layers output the predicted boxes; third, non-maximum suppression removes the redundant boxes.
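
The last step, non-maximum suppression, can be sketched as follows. This is a minimal generic NMS in Python/numpy, not the exact code used by YOLO; the IOU threshold of 0.5 is an assumption chosen only for illustration.

import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all in (x1, y1, x2, y2) form."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area1 + area2 - inter + 1e-10)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]
    return keep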

                      

        3. Network structure

                   The following is a schematic diagram of the whole network: 24 convolutional layers and 2 fully connected layers, followed by the output layer. Borrowing the idea of GoogLeNet / Network-in-Network, 1*1 convolutional layers are used to reduce the channel dimension of the preceding feature maps. The authors also trained a Fast YOLO, which replaces the 24 convolutional layers with 9.
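
A minimal PyTorch sketch of these two ideas, not the real 24-layer network; the layer sizes here are placeholders chosen only for illustration.

import torch.nn as nn

# Toy stand-in for the backbone: a 1x1 "reduction" convolution squeezes the
# channel dimension before the next 3x3 convolution, and a fully connected
# head produces the S*S*(B*5+C) = 7*7*30 prediction vector.
class TinyYoloSketch(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.depth = S, B * 5 + C
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 32, 1), nn.LeakyReLU(0.1),            # 1x1 reduction layer
            nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(S),                            # stands in for the remaining layers
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * S * S, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * self.depth),
        )

    def forward(self, x):                 # x: (N, 3, 448, 448)
        out = self.head(self.features(x))
        return out.view(-1, self.S, self.S, self.depth)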

              The final output is a 7*7*30 tensor. To unpack this: yolo divides the original image into an S*S (7*7) grid of cells. Each cell predicts only one set of class probabilities, so a cell is responsible for at most one object; each cell also predicts B = 2 candidate boxes; and each candidate box outputs five values: x, y, w, h and a confidence Pr(Object)*IOU. These are the center of the box (x, y), given relative to the cell, the width and height w, h, and the confidence (the probability that an object is present times the IOU with the ground truth). Because the PASCAL dataset has 20 classes, each cell additionally carries a probability for each class. So the final output is S*S*(B*5+C) = 7*7*(2*5+20) = 7*7*30.
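
To make the layout of that 7*7*30 tensor concrete, here is a small numpy sketch of how one prediction could be sliced out of it. The ordering of box/confidence/class channels is an assumption; different implementations order them differently.

import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for a network output: 7*7*30

row, col = 3, 3                           # pick one grid cell
cell = pred[row, col]                     # the 30 values for this cell
boxes = cell[:B * 5].reshape(B, 5)        # two boxes: (x, y, w, h, confidence) each
class_probs = cell[B * 5:]                # 20 conditional class probabilities

# Class-specific confidence = Pr(class | object) * Pr(object) * IOU
scores = boxes[:, 4:5] * class_probs      # shape (B, C): one score per box per class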

       

          4. Loss function

                   This seemingly troublesome loss function actually splits into three parts: candidate boxes, confidence, and classification, all using a sum-of-squared-errors loss. The weights λcoord (5) and λnoobj (0.5) balance the effect of having too few positive examples and too many negative examples. The square roots of w and h balance boxes of different sizes: the same absolute error matters more for a small box than for a large one, and taking the square root alleviates this inconsistency. Positive examples contribute all three losses, while negative examples contribute only the confidence loss.
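
For reference, the loss from the paper written out in full, where the indicator 1_{ij}^{obj} is 1 when box j in cell i is responsible for an object, and 1_i^{obj} is 1 when cell i contains an object:

\[
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
 &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\]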

                

           5. Activation function

                        yolov1 uses leaky ReLU instead of the traditional ReLU; that is, values less than 0 are not clamped to 0 but are multiplied by 0.1.
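
In code this is just a one-liner; here is a numpy sketch:

import numpy as np

def leaky_relu(x, slope=0.1):
    # values greater than 0 pass through; values below 0 are multiplied by 0.1
    return np.where(x > 0, x, slope * x)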

                     

           6. Training

                         1. During training, the data is augmented with random scaling and translation, and the exposure and saturation are randomly adjusted after converting the image to the HSV color space.

                          2. The first 20 convolutional layers are pre-trained on the ImageNet dataset; the detection training and testing are then carried out in the Darknet framework.

                          3. The learning rate is first raised slowly from 0.001 to 0.01, the network is trained at 0.01 for 75 epochs, then at 0.001 for 30 epochs, and finally at 0.0001 for 30 epochs (a small sketch of this schedule is given after this list).

                          4. A dropout rate of 0.5 is used after the first fully connected layer.

                          5. In the end yolo reaches 63.4 mAP (mean average precision, computed from precision and recall) on the VOC dataset at 45 FPS. If VGG16 is used as the backbone, the mAP rises to 66.4 but the speed drops to 21 FPS, which is barely real-time. So although the great yolo has lower accuracy than Faster R-CNN, its speed is a qualitative leap and can meet the real-time requirements of most scenarios. Its recall, however, really does need improvement. Don't worry: we still have yolov2, yolov3, yolov4 which just came out this year, and yolov5, which also just came out but which some have judged unworthy of the name; in any case, 4 and 5 were not done by the great god himself. RGB is gone, just like Jay Chou no longer sings.
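
Point 3 above can be written as a simple schedule function. This is only a sketch: the paper says the rate is raised "slowly" at the start, and the 5-epoch linear warm-up here is my own assumption.

def learning_rate(epoch, warmup_epochs=5):
    """Sketch of the schedule in point 3: warm-up, then 0.01 / 0.001 / 0.0001."""
    if epoch < warmup_epochs:                        # slowly raise 0.001 -> 0.01
        return 0.001 + (0.01 - 0.001) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 0.01                                  # 75 epochs at 0.01
    if epoch < warmup_epochs + 105:
        return 0.001                                 # then 30 epochs at 0.001
    return 0.0001                                    # final 30 epochs at 0.0001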

                           

3. Related issues

             1. The paper says that the cell containing the center of an object is responsible for predicting that object. How is this actually implemented?

              This is a concrete engineering implementation question, and a real headache, but once you figure it out you have basically figured out the internal processing logic of the yolo code.

              From the source code we can see that yolo first organizes the ground-truth information of the image into an S*S*25 label tensor (S*S*(confidence + ground-truth box (x, y, w, h) + 20 classes)). Compare this with the prediction tensor S*S*(2*5+20): because a cell in the label only needs to describe one ground-truth box, there is no need to multiply by 2. It then computes which cell the center of each ground-truth box falls into, using int(x * cell_size / image_size). For example, if the center of a ground-truth box is at (224, 224), then the cell responsible for predicting it is (int(224*7/448), int(224*7/448)) = (3, 3), i.e. the cell in row 3, column 3 counting from zero.

    # Requires: import os, cv2, numpy as np, xml.etree.ElementTree as ET
    def load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.
        """

        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        im = cv2.imread(imname)
        h_ratio = 1.0 * self.image_size / im.shape[0]
        w_ratio = 1.0 * self.image_size / im.shape[1]
        # im = cv2.resize(im, [self.image_size, self.image_size])

        label = np.zeros((self.cell_size, self.cell_size, 25))
        filename = os.path.join(self.data_path, 'Annotations', index + '.xml')
        tree = ET.parse(filename)
        objs = tree.findall('object')

        for obj in objs:
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            # If the image was resized, the box coordinates must be scaled accordingly
            x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
            y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
            x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
            y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
            # The 'name' field gives the object class
            cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]
            # Center point coordinates, width, height
            boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1]
            # Compute which grid cell the center point falls into
            x_ind = int(boxes[0] * self.cell_size / self.image_size)
            y_ind = int(boxes[1] * self.cell_size / self.image_size)
            if label[y_ind, x_ind, 0] == 1:
                continue
            # Confidence set to 1, box coordinates stored, class marked with a one-hot 1
            label[y_ind, x_ind, 0] = 1
            label[y_ind, x_ind, 1:5] = boxes
            label[y_ind, x_ind, 5 + cls_ind] = 1

        return label, len(objs)

                           

          Finally, when computing the loss, the "response" mask marks the cells that contain a ground-truth box. Within such a cell, the predicted box with the largest IOU against the ground truth gets a confidence target of 1. The response mask filters out all cells that are not responsible for any prediction, and taking the IOU maximum filters out the less accurate of the two predicted boxes; multiplying the two gives exactly the box we need to adjust.
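
A small numpy sketch of that "multiply the two masks" step; the shapes and variable names are my own, not taken from the original repository.

import numpy as np

S, B = 7, 2
response = np.zeros((S, S, 1))        # 1 where the cell contains a ground-truth object
iou = np.random.rand(S, S, B)         # IOU of each predicted box with the ground truth

# For each cell, mark the predicted box with the highest IOU.
best = (iou >= iou.max(axis=2, keepdims=True)).astype(np.float32)

# object_mask is 1 only for the single box, in each responsible cell, that gets
# the full coordinate + confidence + class loss; every other box contributes
# only the "no object" confidence term, weighted by lambda_noobj.
object_mask = response * best
noobject_mask = 1.0 - object_mask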

 

   2. How is the original image divided into S*S cells?

          From the first question we can see that the ground truth is organized into an S*S*25 tensor, so the ground-truth boxes are expressed in this per-cell form, and the loss is then computed cell by cell in the same way. This is what actually implements the "divide into regions and predict per region" behaviour.

4. Summary

        In this article we introduced the once very impressive yoloV1, which uses a single unified network to do object detection end to end. The darknet network first extracts features, and fully connected layers then make the predictions. The original image is divided into S*S cells, and the cell containing the center of an object is the one whose prediction gets adjusted. Where earlier methods kept boxes by thresholding, here the box with the largest IOU is the one directly chosen for adjustment, and non-maximum suppression removes the remaining redundant boxes. We also walked through the three-part loss function of yolo. With such a concise design, yolo achieves real-time prediction speed. The only drawback is that yolo still needs to work harder on accuracy; the later versions did work harder and succeeded, as we will introduce later. Stay tuned!

         Damn, it's late again. I find that going to bed late and getting up late really is my most comfortable routine. These days I particularly enjoy being free of the guilt of not studying and just messing around playing games. For so many years that guilt tortured me, and now I've finally shaken it off. In a word: great. I am no longer a slave to learning; learning is my little brother now. Tomorrow I will challenge myself to do nothing at all: watch movies until I drop, then go straight to bed and zone out. Anything but studying. Amazing, really amazing!

  
