YOLOv3 paper translation (corrected)

This article is part of a series of translated papers on object detection: a sentence-by-sentence corrected translation of YOLOv3: An Incremental Improvement, to make it easier for everyone to study.

 YOLOv3: An Incremental Improvement

  Abstract

       We present some updates to YOLO! We implemented a bunch of small design changes that improve the system's performance and trained a really cool new network. The new network is a little bigger than the previous version, but also more accurate. Don't worry, though, it is still fast. YOLOv3 processes a 320×320 input image in 22 milliseconds at 28.2 mAP, which is about as accurate as SSD321 but three times faster. On the old 0.5 IOU mAP detection metric, YOLOv3 does very well: it achieves 57.9 AP50 in 51 ms on a Titan X, compared with 57.5 AP50 in 198 ms for RetinaNet. In other words, similar performance, but YOLOv3 is 3.8 times faster. As usual, all code is open source at: https://pjreddie.com/yolo/.
 

1. Introduction

      Sometimes a whole year just gets eaten up by "trifles", you know the feeling? So I didn't do much research last year; I spent a lot of time on Twitter and played around with GANs a little. Then, with the little momentum [1][12] left over from last year, I made some improvements to YOLO. To be honest, though, it's nothing major, just a bunch of small updates that make YOLO better. I also did a little research to help others out.

        In fact, that's why we're here. We have an article about to be submitted [4] that cites some of the updates I made to YOLO, but no such reference exists yet, so let's write a technical report first!

        One nice thing about a technical report is that it doesn't need a long introduction; I think by this point you all know why we are writing this. So the end of this introduction will just outline the structure of the article: first we will introduce the updates in YOLOv3, then we will show you how we are doing, then we will tell you about some attempts that failed, and finally we will discuss what this update means.

 Figure 1. We adapted this figure from the data in the Focal Loss paper [9]. It shows that YOLOv3 runs significantly faster than other detection methods with comparable performance. All models run on an M40 or a Titan X, which are very similar GPUs.

2. Updates

        So here's the deal with YOLOv3: we mostly took good ideas from other people. We also trained a new classification network that is better than the others. We'll take you through the whole system from the beginning so you can understand it all.

2.1 Bounding box prediction

       Like YOLO9000, YOLOv3 uses bounding-box clusters on the dataset as prior (anchor) boxes [15]. The network predicts 4 coordinates for each bounding box: tx, ty, tw, and th. If the grid cell containing the object center is offset from the top-left corner of the image by (cx, cy), and the corresponding prior box has width pw and height ph, then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

       During training, we use a sum of squared error loss. If the ground-truth value for some coordinate prediction is t̂*, then the gradient is the ground-truth value (computed from the ground-truth box) minus our prediction, t̂* − t*. The ground-truth value is easily obtained by inverting the equations above.

 Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids, and the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure is blatantly self-plagiarized from [15].
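To make the decoding step above concrete, here is a minimal NumPy sketch (an illustration under our own function and variable names, not code from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into box center/width/height on the
    feature-map grid, following b_x = sigma(t_x) + c_x, etc. (illustrative)."""
    bx = sigmoid(tx) + cx   # center x, offset from the grid cell corner
    by = sigmoid(ty) + cy   # center y
    bw = pw * np.exp(tw)    # width, scaled from the prior's width
    bh = ph * np.exp(th)    # height, scaled from the prior's height
    return bx, by, bw, bh

# example: a cell at (3, 5) with a prior of width 3.6 and height 2.8
# (all numbers here are purely illustrative)
print(decode_box(0.2, -0.1, 0.05, 0.3, cx=3, cy=5, pw=3.6, ph=2.8))
```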

       YOLOv3 predicts an objectness score for each bounding box using logistic regression. The score should be 1 if the current prior bounding box overlaps a ground-truth object more than any other prior. If a prior is not the best but its overlap with a ground-truth object exceeds a threshold, we ignore the prediction, following [17]. We use 0.5 as the threshold. Unlike [17], our system assigns only one prior bounding box to each ground-truth object. If a prior is not assigned to a ground-truth object, it incurs no loss for the coordinate or class predictions, only for objectness.
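As a rough illustration of this assignment rule, here is a simplified sketch assuming we already have the IoUs between one ground-truth box and every prior (our own simplification, not the reference implementation):

```python
import numpy as np

def assign_priors(ious, ignore_thresh=0.5):
    """Per-prior objectness targets for one ground-truth box:
    1 for the single best-overlapping prior, -1 for non-best priors above the
    ignore threshold (they contribute no objectness loss), 0 for the rest."""
    targets = np.zeros(len(ious))        # default: negative (objectness 0)
    targets[ious > ignore_thresh] = -1   # overlapping enough: ignored
    targets[int(np.argmax(ious))] = 1    # the best prior is the positive match
    return targets

print(assign_priors(np.array([0.2, 0.55, 0.8, 0.4])))  # -> [ 0. -1.  1.  0.]
```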

2.2 Classification Prediction

        Each box predicts the classes the bounding box may contain using multi-label classification. We no longer use a softmax, as we found it is not necessary for good performance; instead we use independent logistic classifiers. During training, we use a binary cross-entropy loss for the class predictions.
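For example, a minimal PyTorch sketch of independent logistic classifiers trained with binary cross-entropy might look like this (the class indices and tensor shapes are purely illustrative):

```python
import torch
import torch.nn as nn

# 80 COCO classes, a batch of 4 predicted boxes; raw logits from the network
logits = torch.randn(4, 80)

# multi-label targets: a box may belong to several overlapping classes at once
targets = torch.zeros(4, 80)
targets[0, [0, 14]] = 1.0   # e.g. a generic label plus a more specific one

# independent logistic classifiers + binary cross-entropy, instead of softmax
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
print(loss.item())
```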

        This formulation helps a lot when we move YOLOv3 to more complex datasets such as Open Images [7]. Open Images contains many overlapping labels (e.g., Woman and Person). Using a softmax imposes the assumption that each box contains exactly one class, which is often not the case. A multi-label approach models the data better.

2.3 Multi-scale prediction

        YOLOv3 predicts boxes at 3 different scales. Our system extracts features from these scales using a concept similar to feature pyramid networks (FPN) [8]. We add several convolutional layers on top of the base feature extractor. The last of these predicts a 3-D tensor encoding the bounding box, objectness, and class predictions. In our experiments on COCO [10], we predict 3 boxes at each scale, so the tensor is N × N × [3 ∗ (4 + 1 + 80)]: 4 bounding box offsets, 1 objectness prediction, and 80 class probabilities.

        Next, we take the feature maps from two layers back at the two coarsest scales (8×8 and 16×16) and upsample each by a factor of 2. We also take feature maps from earlier in the network (at 16×16 and 32×32) and merge them with the corresponding upsampled features by concatenation. This lets us obtain more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature maps. We then add a few more convolutional layers to process each combined feature map, and finally predict a similar tensor at twice the spatial size (16×16×[3∗(4+1+80)] and 32×32×[3∗(4+1+80)], compared with 8×8×[3∗(4+1+80)] at the coarsest scale).

         We use the same design once more to predict boxes at the third and finest scale. The predictions at this scale therefore benefit from all of the prior computation as well as fine-grained features from early in the network.
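The sketch below illustrates this upsample-and-concatenate scheme in PyTorch. It is a toy three-scale head under assumed channel counts (256/512/1024) and an assumed 256×256 input, not the actual YOLOv3 layer configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ANCHORS, NUM_CLASSES = 3, 80
OUT_CH = NUM_ANCHORS * (4 + 1 + NUM_CLASSES)   # 3 * (4 + 1 + 80) = 255

def conv(in_ch, out_ch, k):
    # conv + batch norm + leaky ReLU, the basic unit used throughout
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class MultiScaleHead(nn.Module):
    """Toy head: predict at the coarsest map, upsample, concatenate with a
    finer backbone map, predict again, and repeat once more."""
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.head5 = nn.Conv2d(c5, OUT_CH, 1)
        self.lat5 = conv(c5, 256, 1)
        self.head4 = nn.Conv2d(c4 + 256, OUT_CH, 1)
        self.lat4 = conv(c4 + 256, 128, 1)
        self.head3 = nn.Conv2d(c3 + 128, OUT_CH, 1)

    def forward(self, f3, f4, f5):
        p5 = self.head5(f5)                                # coarsest scale
        u5 = F.interpolate(self.lat5(f5), scale_factor=2)  # upsample by 2x
        m4 = torch.cat([f4, u5], dim=1)                    # concat with finer map
        p4 = self.head4(m4)
        u4 = F.interpolate(self.lat4(m4), scale_factor=2)
        m3 = torch.cat([f3, u4], dim=1)
        p3 = self.head3(m3)                                # finest scale
        return p3, p4, p5

# backbone feature maps at strides 8, 16, 32 for a 256x256 input
f3, f4, f5 = (torch.randn(1, c, s, s) for c, s in [(256, 32), (512, 16), (1024, 8)])
p3, p4, p5 = MultiScaleHead()(f3, f4, f5)
print(p3.shape, p4.shape, p5.shape)   # 1 x 255 x {32, 16, 8} squared
```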

        YOLOv3 still uses k-means clustering to determine the prior bounding boxes. We choose 9 clusters and 3 scales, and divide the 9 clusters evenly across the 3 scales. On the COCO dataset, the 9 cluster priors are: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
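A minimal sketch of such clustering, using the 1 − IOU distance popularized by YOLOv2 (the random boxes below are only a stand-in for real dataset annotations):

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between boxes and cluster centers, assuming a shared corner;
    wh is (N, 2), centers is (K, 2)."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(wh, k=9, iters=100, seed=0):
    """k-means on box (width, height) with 1 - IoU as the distance (sketch)."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)   # nearest = highest IoU
        centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort by area

boxes = np.abs(np.random.randn(1000, 2)) * 100 + 10   # stand-in for labeled boxes
print(kmeans_priors(boxes, k=9).round(1))
```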

2.4 Feature Extractor

        We use a new network for feature extraction. Our new network is a hybrid of the network used in YOLOv2 (Darknet-19) and residual networks. It uses successive 3×3 and 1×1 convolutional layers together with some shortcut (residual) connections. It is significantly larger, with 53 convolutional layers, so we call it... wait for it... Darknet-53!
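As an illustration of the building blocks described above, here is a hedged PyTorch sketch of one Darknet-style residual unit (a 1×1 bottleneck followed by a 3×3 convolution with a shortcut around both); the channel counts are examples, not the full Darknet-53 definition:

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, k, stride=1):
    # the 3x3 / 1x1 convolution unit: conv + batch norm + leaky ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class ResidualBlock(nn.Module):
    """A Darknet-style residual block: 1x1 bottleneck, then 3x3 convolution,
    with a shortcut connection (illustrative sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(ch, ch // 2, 1),
            conv_bn_leaky(ch // 2, ch, 3),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 256, 32, 32)
print(ResidualBlock(256)(x).shape)   # torch.Size([1, 256, 32, 32])
```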

 Our new network is far more powerful than Darknet-19 and still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

  Table 2. Comparison of backbone networks: accuracy, billions of operations, billions of floating-point operations per second, and FPS for each network.

      Each network is trained with identical settings and tested at 256×256 to obtain single-crop accuracy. Run times are measured on a Titan X at 256×256. Darknet-53 is thus on par with state-of-the-art classifiers in accuracy, but with fewer floating-point operations and more speed: it performs better than ResNet-101 and is 1.5 times faster, and it performs similarly to ResNet-152 while being 2 times faster.

        Darknet-53 also achieves the highest measured floating-point operations per second. This means our network structure makes better use of the GPU, which makes it more efficient to evaluate and therefore faster. The ResNets are slower, presumably because they have too many layers and are not as efficient.

2.5 Training

        We still train only on full images, without hard-negative mining or any other such tricks. We use multi-scale training, plenty of data augmentation, batch normalization, and all the other standard techniques. Training and testing are done in the Darknet framework [14].

3. How are we doing


        YOLOv3 is really good! Looking at Table 3, in terms of COCO's weird average-mAP metric, it performs comparably to DSSD but is 3 times faster. Admittedly, it is still quite a bit behind models like RetinaNet on this metric.

        However, when we evaluate YOLOv3 on the old detection metric of mAP at IOU = 0.5 (AP50 in the table), it is very strong. It is almost on par with RetinaNet and far above DSSD. This indicates that YOLOv3 is a very strong detector that excels at producing decent bounding boxes for objects. However, performance drops significantly as the IOU threshold increases, which indicates that YOLOv3 struggles to align its boxes perfectly with the objects.
 

 Table 3. I literally just stole all these tables from [9]; they take a really, really long time to make from scratch. OK, YOLOv3 is doing alright. Keep in mind that RetinaNet takes about 3.8 times longer than YOLOv3 to process an image. YOLOv3 is much better than SSD and is comparable to state-of-the-art models on the AP50 metric!

 Figure 3. Adapted again from [9], this time showing the speed/accuracy tradeoff on the mAP at 0.5 IOU metric. You can tell YOLOv3 is good because it is very high and far to the left. Can you cite your own article? Guess who is going to try it... this guy! [16]. Oh, and one more thing I forgot: we fixed a data-loading bug in YOLOv2, which seems to improve the model by about 2 mAP. Just a sneaky mention here; it's not the point.

       In the past, YOLO struggled with small objects. Now, however, we may have to shift the focus of our work: thanks to the new multi-scale predictions, YOLOv3 has relatively high performance on small objects, but its performance on medium and large objects is comparatively worse. It will take more investigation and experiments to figure out how to improve this.

        When we plot accuracy versus speed on the AP50 metric (see Figure 3), we see that YOLOv3 has significant advantages over other detection systems. In other words, it is faster and better.

4. Failed attempts


        We tried a lot of things while working on YOLOv3. A lot of it didn't work. Here are the failed attempts we can remember.

       "Anchor boxes" x, y offset predictions. We try to use the common "anchor box" mechanism by using linear activations to predict x,y offsets as multiples of the bounding box width or height. But we found that this approach destabilized the model and did not work well.

        Linear x, y predictions instead of logistic ones. We tried using a linear activation to predict the x, y offsets directly instead of a logistic activation. This lowered the mAP score.

        Focal loss. We tried using focal loss, but it cost us about 2 points of mAP. YOLOv3 may already be robust to the problem focal loss is trying to solve, because it has separate objectness predictions and conditional class predictions. Perhaps for most examples there is no loss from the class predictions? Or something else? We are not entirely sure.
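For reference, a minimal sketch of the kind of binary focal loss we mean (following the general formulation in [9]; the gamma and alpha values here are just common defaults, not the settings we used):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples relative to plain BCE."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)          # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.randn(8, 80)
targets = torch.zeros(8, 80)
targets[0, 3] = 1.0
print(focal_loss(logits, targets).item())
```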

        Dual IOU thresholds and ground-truth assignment. Faster R-CNN uses two IOU thresholds during training: if a prediction overlaps the ground truth by at least 0.7 it is a positive example; if the overlap is between 0.3 and 0.7 it is ignored; and below 0.3 it is a negative example. We tried a similar strategy, but the results were not good.
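A tiny sketch of that dual-threshold labeling rule (our own simplification for illustration, not Faster R-CNN's actual assignment code):

```python
def dual_threshold_label(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label an anchor from its best IoU with any ground-truth box:
    1 = positive, 0 = negative, None = ignored during training."""
    if iou >= pos_thresh:
        return 1
    if iou < neg_thresh:
        return 0
    return None   # between the thresholds: contributes no loss

print([dual_threshold_label(x) for x in (0.85, 0.5, 0.1)])   # [1, None, 0]
```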

        We really like our current formulation; it seems to be at least a local optimum. Some of the techniques above might eventually make the model better, but they probably still need some tuning.

5. Significance of the update


        YOLOv3 is a great detector: it is accurate and fast. Although its average AP between 0.5 and 0.95 IOU on the COCO dataset is not that good, it is still very good on the old 0.5 IOU detection metric.

        So why did we switch metrics? The original COCO paper contains only this cryptic sentence: a full discussion of the evaluation metrics will be added once the evaluation server is complete. Meanwhile, Russakovsky et al. have pointed out that it is hard for humans to tell a 0.3 IOU box from a 0.5 IOU box: "training humans to visually distinguish a bounding box with 0.3 IOU from one with 0.5 IOU is surprisingly difficult" [18]. If humans can hardly tell the difference, how much does this metric matter?

         Maybe a better question is: "What are we going to use it for?" A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and will definitely never be used to collect your personal information and sell it to... wait, that is exactly what you are going to do with it?! Oh.

        Also, the military spends a lot of money on computer vision research, and they have never done anything horrible like killing lots of people with new technology... oh wait.

        I very much hope that most people using computer vision are just doing happy, good things with it, such as counting the number of zebras in a national park [13] or tracking their cat as it wanders around the house [19]. But computer vision is already being put to questionable uses. As researchers, we have a responsibility to at least consider the harm our work might cause and to think about how it can be mitigated. We owe the world that much.

        Finally, don't @me, I've escaped Twitter.


Source: blog.csdn.net/m0_57787115/article/details/130325360