YOLOv2 paper translation (corrected)

This article is part of a series on the YOLO papers: a translation of the YOLOv2 paper (YOLO9000: Better, Faster, Stronger), corrected sentence by sentence for easier study.

YOLOv2 paper (original text): YOLO9000: Better, Faster, Stronger

YOLOv2 paper (interpretation/summary): YOLOv2 Paper Interpretation/Summary — Geng Gui Drinking Coconut Juice's CSDN blog


Abstract

      We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First, we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks such as PASCAL VOC and COCO. Using a novel multi-scale training method, the same YOLOv2 model can run at different sizes, offering an easy trade-off between speed and accuracy. At 67 FPS, YOLOv2 achieves 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods such as Faster R-CNN with ResNet and SSD while still running significantly faster. Finally, we propose a method to jointly train on object detection and classification. Using this method, we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object categories that have no labeled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for over 9000 different object categories, and it all runs in real time.

1. Introduction

       General-purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly faster and more accurate. However, most detection methods are still limited to a small set of objects.
Current object detection datasets are limited compared to datasets for other tasks such as classification and labeling. The most common detection datasets contain thousands to hundreds of thousands of images with tens to hundreds of labels [3][10][2]. Classification datasets have millions of images with tens or hundreds of thousands of categories [20][2].
      We want detection to scale to the level of object classification. However, labeling images for detection is far more expensive than labeling for classification or tagging (tags are often supplied by users for free). Therefore, we are unlikely to see detection datasets on the same scale as classification datasets in the near future.
       We propose a novel approach to exploit the large amount of classification data we already have and use it to expand the scope of current detection systems. Our approach uses a hierarchical view of object classification, allowing us to combine different datasets together.
We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method uses labeled detection images to learn to precisely localize objects, while using classification images to increase its vocabulary and robustness.
      Using this approach, we trained YOLO9000, a real-time object detector that can detect over 9000 different object categories. First, we improve on the base YOLO detection system to produce YOLOv2, a state-of-the-art real-time detector. We then train a model on over 9,000 categories from ImageNet and detection data from COCO using our dataset combination method and joint training algorithm.
      All our code and pretrained models are available online: http://pjreddie.com/yolo9000/.

2. Better

      Compared with state-of-the-art detection systems, YOLO suffers from a variety of shortcomings. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region-proposal-based methods. Therefore, we focus mainly on improving recall and localization while maintaining classification accuracy.
       Computer vision generally trends towards larger, deeper networks [6][18][17]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a detector that is more accurate but still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We combine various ideas from past work with our own novel concepts to improve YOLO's performance. A summary of the results can be found in Table 2.
      Batch normalization.  Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization to all of the convolutional layers in YOLO, we get more than a 2% improvement in mAP. Batch normalization also helps regularize the model. With batch normalization, we can remove dropout from the model without overfitting.

       High-resolution classifier.  All state-of-the-art detection methods use classifiers pre-trained on ImageNet [16]. Starting with AlexNet, most classifiers operate on input images smaller than 256×256 [8]. The original YOLO trains the classifier network at 224×224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.
       For YOLOv2, we first fine-tune the classification network at the full 448×448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher-resolution input. We then fine-tune the resulting network on detection. This high-resolution classification network increases our mAP by nearly 4%.

      Convolutional with anchor boxes.  YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly, Faster R-CNN predicts bounding boxes using hand-picked priors [15]. The region proposal network (RPN) in Faster R-CNN uses only convolutional layers to predict offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in the feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
       We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First, we eliminate one pooling layer to make the output of the network's convolutional layers higher resolution. We also shrink the network to operate on 416×416 input images instead of 448×448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image, so it is good to have a single location right at the center to predict these objects rather than four locations around the center. YOLO's convolutional layers downsample the image by a factor of 32, so with a 416 input image we get a 13×13 output feature map.
      With anchor boxes, we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following the original YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box, while the class predictions predict the conditional probability of that class given that there is an object.
      With anchor boxes, we get a small drop in accuracy. YOLO only predicts 98 boxes per image, but with anchor boxes, our model predicts over a thousand boxes. Without anchor boxes, our intermediate model has a mAP of 69.5 and a recall of 81%. With anchor boxes, our model has a mAP of 69.2 and a recall of 88%. Even though mAP drops, the increase in average recall means that our model has more room for improvement.

        Dimension clusters.  We encounter two issues when using anchor boxes with YOLO. The first is that the box dimensions are hand-picked. The network can learn to adjust the boxes appropriately, but if we pick better priors for the network to start with, we can make it easier for the network to learn to predict good detections.
        Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance, larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Therefore, for our distance metric we use: d(box, centroid) = 1 − IOU(box, centroid).
      We run k-means for various values of k and plot the average IOU with the closest centroid, see Figure 2. We choose k = 5 as a good trade-off between model complexity and high recall. The cluster centroids are significantly different from hand-picked anchor boxes: there are fewer short, wide boxes and more tall, thin boxes.
      We compare the average IOU to the closest prior of our clustering strategy and of the hand-picked anchor boxes in Table 1. At only 5 priors, the centroids perform similarly to 9 anchor boxes, with an average IOU of 61.0 compared to 60.9. If we use 9 centroids, we see a much higher average IOU. This indicates that using k-means to generate our bounding boxes starts the model off with a better representation and makes the task easier to learn.
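As an illustration of the clustering step above, here is a minimal sketch in Python/NumPy of k-means with the 1 − IOU distance. The function names and the assumption that `boxes_wh` is an (N, 2) float array of ground-truth box widths and heights are ours, not the paper's; the paper's own implementation lives in the Darknet code.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming boxes share a common corner so only
    their dimensions matter. boxes: (N, 2), centroids: (K, 2) -> (N, K)."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(boxes_wh, k=5, iters=100, seed=0):
    """k-means with d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # minimizing 1 - IOU is the same as maximizing IOU
        assign = np.argmax(iou_wh(boxes_wh, centroids), axis=1)
        new = centroids.copy()
        for i in range(k):
            members = boxes_wh[assign == i]
            if len(members):                 # keep the old centroid if a cluster empties
                new[i] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```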

      Direct location prediction.  When using anchor boxes with YOLO, we encounter a second problem: model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) location of the box. In a region proposal network, the network predicts the values tx and ty, and the (x, y) center coordinates are computed as:

x = (tx × wa) + xa
y = (ty × ha) + ya

     For example, a prediction with tx=1 will shift the box to the right by the width of the anchor box, and a prediction with tx=-1 will shift the box to the left by the same amount.
        This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which location predicted the box. With random initialization, the model takes a long time to stabilize to predicting sensible offsets.
        Instead of predicting offsets, we follow the approach of YOLO and predict location coordinates relative to the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network's predictions to fall in this range.
        The network predicts 5 bounding boxes at each cell in the output feature map. For each bounding box, the network predicts 5 coordinates: tx, ty, tw, th and to. If the cell is offset from the top-left corner of the image by (cx, cy), and the bounding box prior has width and height pw, ph, then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
Pr(object) · IOU(b, object) = σ(to)

        Since we constrain the location predictions, the parameterization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.
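To make the decoding above concrete, here is a small sketch (our own, in Python/NumPy) that maps the raw predictions of one anchor box at one cell to box coordinates; the function name and argument layout are illustrative assumptions, not the paper's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Apply the equations above for one anchor in the cell at (cx, cy) with
    prior size (pw, ph). All quantities are in feature-map (13x13) units."""
    bx = sigmoid(tx) + cx        # center x, constrained to lie inside the cell
    by = sigmoid(ty) + cy        # center y
    bw = pw * np.exp(tw)         # width, scaled from the prior
    bh = ph * np.exp(th)         # height
    conf = sigmoid(to)           # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, conf
```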

       Fine-grained features. This modified YOLO predicts detections on a 13×13 feature map. While this is adequate for large objects, it may benefit from finer-grained features to localize smaller objects. Both Faster R-CNN and SSD run their networks on various feature maps in the network to obtain multiple resolutions. We take a different approach and only need to add a pass-through layer to extract 26×26 resolution features from earlier layers.
        The passthrough layer concatenates the higher-resolution features with the low-resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26×26×512 feature map into a 13×13×2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine-grained features. This gives a modest 1% performance increase.
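The passthrough layer is essentially a space-to-depth rearrangement. The sketch below (our own NumPy code) shows the idea for a single feature map; the exact channel ordering used by Darknet's reorg layer may differ from this one.

```python
import numpy as np

def passthrough(x, stride=2):
    """Stack each stride x stride spatial neighborhood into the channel axis.
    x: (C, H, W) -> (C * stride**2, H // stride, W // stride), so a
    26x26x512 map becomes 13x13x2048."""
    c, h, w = x.shape
    x = x.reshape(c, h // stride, stride, w // stride, stride)
    x = x.transpose(0, 2, 4, 1, 3)               # (C, s, s, H/s, W/s)
    return x.reshape(c * stride * stride, h // stride, w // stride)
```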

      Multi-scale training.  The original YOLO uses an input resolution of 448×448. With the addition of anchor boxes, we change the resolution to 416×416. However, since our model only uses convolutional and pooling layers, it can be resized on the fly. We want YOLOv2 to run robustly on images of different sizes, so we train this into the model.
      Instead of fixing the input image size, we change the network every few iterations. Every 10 batches, our network randomly chooses a new image size. Since our model downsamples by a factor of 32, we draw from the following multiples of 32: {320, 352, …, 608}. Thus the smallest option is 320×320 and the largest is 608×608. We resize the network to that dimension and continue training.
      This regime forces the network to learn to make good predictions across various input dimensions. This means that the same network can predict detection results at different resolutions. Networks run faster at smaller sizes, so YOLOv2 provides an easy trade-off between speed and accuracy.
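A minimal sketch of the size-switching rule described above, in Python; the function name and the convention of checking the batch index are our illustrative assumptions.

```python
import random

SCALES = list(range(320, 608 + 1, 32))   # {320, 352, ..., 608}: multiples of 32

def pick_input_size(batch_idx, current=416):
    """Every 10 batches, draw a new square input resolution; otherwise keep the
    current one. The network is resized to this dimension and training continues."""
    if batch_idx % 10 == 0:
        return random.choice(SCALES)
    return current
```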
      At low resolutions, YOLOv2 operates as a cheap, reasonably accurate detector. At 288×288, it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high-framerate video, or multiple video streams.
     At high resolution, YOLOv2 is a state-of-the-art detector with a mAP of 78.6 on VOC 2007, while still running faster than real-time.

     For further experiments, we train YOLOv2 for detection on VOC 2012. Table 4 shows the performance of YOLOv2 compared with other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than the competing methods. We also train on COCO and compare with other methods in Table 5. On the VOC metric (IOU = 0.5), YOLOv2 gets 44.0 mAP, comparable to SSD and Faster R-CNN.

3. Faster

       We want detection to be accurate, but we also want it to be fast. Most detection applications, such as robotics or self-driving cars, rely on low-latency predictions. To maximize performance, we designed YOLOv2 to be fast from start to finish.
      Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful and accurate classification network, but it is unnecessarily complex. The convolutional layer of VGG-16 requires 30.69 billion floating-point operations to process a 224×224 resolution image.
        The YOLO framework uses a custom network based on the GoogLeNet architecture [19]. This network is faster than VGG-16, using only 8.52 billion operations for a forward pass. However, its accuracy is slightly worse than VGG-16: on ImageNet, for single-crop top-5 accuracy at 224×224, YOLO's custom model gets 88.0% compared to 90.0% for VGG-16.

      Darknet-19.  We propose a new classification model to be used as the base of YOLOv2. Our model builds on prior work on network design as well as common knowledge in the field. Similar to the VGG models, we use mostly 3×3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN), we use global average pooling to make predictions, as well as 1×1 filters to compress the feature representation between 3×3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7].
        Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. See Table 6 for a full description. Darknet-19 only requires 5.58 billion operations to process an image, yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.
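For reference, the layer layout of Darknet-19 as it is commonly reproduced from Table 6 is sketched below as a plain Python list; treat it as a reconstruction to be checked against the table rather than the paper's own config file. Each convolution is followed by batch normalization and a leaky-ReLU activation.

```python
# ("conv", filters, kernel_size) or ("maxpool",); 2x2 stride-2 maxpooling throughout.
DARKNET19 = [
    ("conv", 32, 3), ("maxpool",),
    ("conv", 64, 3), ("maxpool",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("maxpool",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("maxpool",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("maxpool",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),        # followed by global average pooling and softmax
]

assert sum(layer[0] == "conv" for layer in DARKNET19) == 19     # 19 conv layers
assert sum(layer[0] == "maxpool" for layer in DARKNET19) == 5   # 5 maxpool layers
```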

      Classification training.  We train the network on the standard ImageNet 1000-class classification dataset for 160 epochs using stochastic gradient descent with the Darknet neural network framework [13], using a starting learning rate of 0.1, polynomial rate decay with a power of 4, a weight decay of 0.0005, and a momentum of 0.9. During training we use standard data augmentation tricks, including random crops, rotations, and shifts in hue, saturation, and exposure.
      As discussed above, after our initial training on images at 224×224, we fine-tune our network at the larger size of 448. For this fine-tuning we train with the above parameters, but for only 10 epochs and starting at a learning rate of 10^-3. At this higher resolution, our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
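A minimal sketch of the learning-rate schedule named above (polynomial decay with a power of 4 from the 0.1 starting rate); whether the decay is stepped per batch or per epoch is an implementation detail of the Darknet framework that we assume here rather than quote.

```python
def poly_lr(base_lr, step, max_steps, power=4.0):
    """Polynomial learning-rate decay: starts at base_lr and decays to 0 over training.
    E.g. poly_lr(0.1, step, max_steps) for the classification schedule."""
    return base_lr * (1.0 - step / float(max_steps)) ** power
```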

     Detection training.  We modify this network for detection by removing the last convolutional layer and instead adding three 3×3 convolutional layers with 1024 filters each, followed by a final 1×1 convolutional layer with the number of outputs we need for detection. For VOC, we predict 5 boxes with 5 coordinates each and 20 classes per box, so 125 filters. We also add a passthrough layer from the final 3×3×512 layer to the second-to-last convolutional layer so that our model can use fine-grained features.
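The 125-filter count follows directly from the box layout; the small helper below (an illustrative sketch, not part of the paper) just makes the arithmetic explicit.

```python
def detection_output_filters(num_anchors=5, num_classes=20, num_box_params=5):
    """Filters in the final 1x1 conv: per anchor, 5 box parameters (tx, ty, tw, th, to)
    plus one score per class, at every cell. For VOC: 5 * (5 + 20) = 125."""
    return num_anchors * (num_box_params + num_classes)

assert detection_output_filters() == 125
```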
    We train the network for 160 epochs with a starting learning rate of 10−3, divided by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and a momentum of 0.9. We use data augmentation similar to YOLO and SSD, including random cropping, color transformation, etc. We use the same training strategy on COCO and VOC.

4. Stronger

        We propose a mechanism for jointly training on classification and detection data. Our method uses images labeled for detection to learn detection-specific information, such as bounding box coordinate prediction and objectness, as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect.
      During training, we mix images from both detection and classification datasets. When our network sees an image labeled for detection, we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image, we only backpropagate loss from the classification-specific parts of the architecture.
       This approach poses some challenges. The detection dataset has only common objects and general labels like "dog" or "boat". Classification datasets have a wider and deeper range of labels. ImageNet has over a hundred dog breeds, including "Norfolk terriers," "Yorkshire terriers," and "Bedlington terriers." If we want to train on these two datasets, we need a coherent way to combine these labels.
     Most classification methods use a softmax layer over all possible classes to compute the final probability distribution. When using softmax, the categories are assumed to be mutually exclusive. This poses a problem for merging datasets, for example, you wouldn't want to use this model to merge ImageNet and COCO, because the two classes "Norfolk terrier" and "dog" are not mutually exclusive.
     We can combine datasets using a multi-label model that does not assume mutual exclusion. This approach ignores everything we know about the structure of the data, e.g. all COCO classes are mutually exclusive.

       Hierarchical classification.  ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, "Norfolk terrier" and "Yorkshire terrier" are both hyponyms of "terrier", which is a type of "hunting dog", which is a type of "dog", which is a "canine", and so on. Most approaches to classification assume a flat structure to the labels; however, for combining datasets, structure is exactly what we need.
      The structure of WordNet is a directed graph, not a tree, because language is complex. For example, "dog" is both a type of "canine" and a type of "domestic animal", which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet.
        To build this tree, we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case "physical object". Many synsets only have one path through the graph, so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root, where one path would add three edges to our tree and the other would add only one, we choose the shorter path.
        The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree, we predict conditional probabilities at every node for the probability of each hyponym of that synset given that synset. If we want to compute the absolute probability of a particular node, we simply follow the path from that node to the root of the tree and multiply the conditional probabilities along the way. For classification purposes, we assume that the image contains an object: Pr(physical object) = 1.
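A tiny sketch of that path product, in Python; the dictionary-based representation (`cond_prob`, `parent`) is our illustrative assumption about how the tree might be stored.

```python
def absolute_probability(node, cond_prob, parent):
    """Pr(node) = product of conditional probabilities on the path from node to the root.
    cond_prob[n] = Pr(n | parent of n); the root ("physical object") contributes 1."""
    p = 1.0
    while node is not None:
        p *= cond_prob.get(node, 1.0)   # root has no conditional term
        node = parent.get(node)         # None once we step past the root
    return p

# e.g. Pr("Norfolk terrier") = Pr("Norfolk terrier" | "terrier") * Pr("terrier" | "hunting dog") * ...
```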
      To validate this approach, we train the Darknet-19 model on WordTree built using the 1000-class ImageNet. To build WordTree1k, we add in all of the intermediate nodes, which expands the label space from 1000 to 1369. During training, we propagate ground truth labels up the tree, so that if an image is labeled as a "Norfolk terrier" it also gets labeled as a "dog" and a "mammal", etc. To compute the conditional probabilities, our model predicts a vector of 1369 values, and we compute the softmax over all synsets that are hyponyms of the same concept, see Figure 5.
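The grouped softmax can be sketched as below (our own NumPy code): one independent softmax per set of co-hyponyms rather than a single softmax over all 1369 outputs. The `sibling_groups` index structure is an assumption about the bookkeeping, not the paper's data layout.

```python
import numpy as np

def wordtree_softmax(logits, sibling_groups):
    """Apply an independent softmax over each group of nodes that share a parent.
    logits: 1-D array of raw scores; sibling_groups: list of index arrays."""
    probs = np.empty_like(logits, dtype=float)
    for idx in sibling_groups:
        z = logits[idx] - np.max(logits[idx])   # subtract the max for numerical stability
        e = np.exp(z)
        probs[idx] = e / e.sum()
    return probs
```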

     Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite adding 369 additional concepts and having our network predict a tree structure, our accuracy drops only marginally. Performing classification this way also has some benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is unsure what type of dog it is, it will still predict "dog" with high confidence, with lower confidence spread out among the hyponyms.
      This formulation also works for detection. Now, instead of assuming every image has an object, we use YOLOv2's objectness predictor to give us the value of Pr(physical object). The detector predicts a bounding box and the tree of probabilities. We traverse the tree downward, taking the highest-confidence path at every split, until we reach some threshold, and we predict that object class.
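A sketch of that traversal in Python; the tree representation (`children`, `cond_prob`) and the threshold value are illustrative assumptions.

```python
def predict_class(cond_prob, children, objectness, threshold=0.6, root="physical object"):
    """Walk down the WordTree from the root, taking the most confident branch at every
    split, and stop once the accumulated probability would fall below the threshold."""
    node, prob = root, objectness            # Pr(physical object) comes from the objectness prediction
    while children.get(node):
        best = max(children[node], key=lambda c: cond_prob[c])
        if prob * cond_prob[best] < threshold:
            break
        node, prob = best, prob * cond_prob[best]
    return node, prob
```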

     Combining datasets with WordTree.  We can use WordTree to combine multiple datasets together in a sensible way: we simply map the categories in the datasets to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse, so we can use this technique with most datasets.

        Joint classification and detection.  Now that we can combine datasets using WordTree, we can train our joint model on classification and detection. We want to train an extremely large-scale detector, so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method, so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset, so we balance the dataset by oversampling COCO so that ImageNet is only larger by a factor of 4:1.
      Using this dataset, we train YOLO9000. We use the base YOLOv2 architecture, but with only 3 priors instead of 5 to limit the output size. When our network sees a detection image, we backpropagate the loss as normal. For the classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is "dog", we do not assign any error to predictions further down the tree, such as "German Shepherd" versus "Golden Retriever", because we do not have that information.
      When the network sees a classification image, we only backpropagate classification loss. To do this, we simply find the bounding box that predicts the highest probability for that class, and we compute the loss on just its predicted tree. We also backpropagate objectness loss assuming the predicted box overlaps what would be the ground truth label by at least 0.3 IOU.
       Through this joint training, YOLO9000 learned to use the detection data in COCO to find objects in the image, and learned to use the data in ImageNet to classify these objects.
        We evaluate YOLO9000 on the ImageNet detection task. The ImageNet detection task shares 44 object categories with COCO, which means that for the majority of test categories YOLO9000 has only seen classification data, not detection data. YOLO9000 achieves 19.7 mAP overall, and 16.0 mAP on the 156 disjoint object categories for which it has never seen any labeled detection data. This mAP is higher than the results achieved by DPM, but YOLO9000 is trained on different datasets with only partial supervision [4]. It is also simultaneously detecting 9000 other object categories, all in real time.
      When we analyze YOLO9000's performance on ImageNet, we see that it learns new animal species well, but struggles when learning categories like clothing and equipment. New animals are easier to learn because objectness predictions generalize well from animals in COCO. In contrast, COCO does not have bounding box labels for any type of clothing, only people, so YOLO9000 struggles to model classes like "sunglasses" or "swimming trunks".

5. Conclusion

      We introduce YOLOv2 and YOLO9000, real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems on various detection datasets. Furthermore, it can operate at various image sizes, providing a smooth trade-off between speed and accuracy.
      YOLO9000 is a real-time framework that detects more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification.
      Many of our techniques can be generalized beyond object detection. Our WordTree representation of ImageNet provides a richer and more detailed output space for image classification. Combining datasets using hierarchical classifications will be useful in the field of classification and segmentation. Training techniques like multi-scale training can provide benefits in various vision tasks.
      For future work, we hope to use similar techniques for weakly supervised image segmentation. We also plan to improve our detection results by using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labeled data. We will continue to look for ways to bring data from different sources and structures together to build stronger models of the visual world.


That concludes the study and summary of this paper. If you have any questions, feel free to leave a message in the comments~

