Deep learning (object detection) --- University of Washington launches YOLOv3: detection speed is three times faster than SSD and RetinaNet

Recently, Joseph Redmon and Ali Farhadi from the University of Washington proposed YOLOv3, the latest version of YOLO. By making a number of changes to YOLO's design details, the new model achieves a large improvement in detection speed while maintaining comparable accuracy: it is generally 1000 times faster than R-CNN and 100 times faster than Fast R-CNN. Machine Heart has compiled the paper; the code and a video demo are linked in the text.

Code address: https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kind of phone it in for a year without realizing it. For example, I didn't do a whole lot of research this year; I spent a lot of time on Twitter and played around with GANs a little. With a bit of momentum left over from last year, I managed to make some improvements to YOLO. But honestly, nothing super interesting, just a bunch of small changes that make it better. I also helped out a little with other people's research.

That is what brings us to today's paper. We have a camera-ready deadline, and we need to cite some of the random updates made to YOLO, but there is no source to cite. So get ready for a tech report!

The nice thing about a tech report is that it needs no introduction; you all know why we're here. So the end of this introduction serves as a roadmap for the rest of the paper. First we introduce the final design of YOLOv3, then how it is implemented. We also describe some things that didn't work. The last part gives the conclusions and reflections of this paper.

2. Solutions

This section introduces YOLOv3's design, which draws a lot of inspiration from other researchers' work. We also trained a new, very good classification network. Following the original paper, this section describes the whole system in detail in terms of bounding-box prediction, class prediction, and feature extraction.

In short, prior detection systems repurpose classifiers or localizers to perform detection: they apply the model to an image at multiple locations and scales, and the high-scoring regions are taken as detections.

YOLOv3 uses a completely different approach from such methods. We apply a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region, and the bounding boxes are weighted by the predicted probabilities, as sketched below.
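To make this concrete, here is a minimal sketch (not the authors' code) of how such a single-pass output can be turned into scored boxes. The grid size `S`, boxes per cell `B`, and the helper `decode_predictions` are illustrative assumptions, not the exact YOLOv3 configuration:

```python
import numpy as np

# Illustrative grid/anchor settings, not the exact YOLOv3 configuration.
S = 13            # grid cells per side
B = 3             # boxes predicted per cell
NUM_CLASSES = 80  # e.g. MS COCO

def decode_predictions(output, score_threshold=0.5):
    """Turn a (S, S, B, 5 + NUM_CLASSES) network output into scored boxes.

    Each box vector is [tx, ty, tw, th, objectness, class scores...].
    The final confidence is objectness * class probability, which is the
    "weighting by predicted probabilities" described above.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                box = output[row, col, b]
                objectness = box[4]
                class_probs = box[5:]
                class_id = int(np.argmax(class_probs))
                score = objectness * class_probs[class_id]
                if score >= score_threshold:
                    detections.append((row, col, b, class_id, float(score)))
    return detections

# One forward pass produces the whole tensor; decoding is cheap post-processing.
dummy_output = np.random.rand(S, S, B, 5 + NUM_CLASSES)
print(len(decode_predictions(dummy_output)))
```

The key point is that the expensive part, the network evaluation, happens exactly once per image; everything after it is lightweight bookkeeping.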

Our model has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by the global context of the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes YOLOv3 extremely fast, typically 1000 times faster than R-CNN and 100 times faster than Fast R-CNN.


Figure 1: We adapted this figure from the Focal Loss paper [7]. YOLOv3 runs significantly faster than other detection methods with comparable accuracy. All times are measured on an M40 or Titan X, which are essentially the same GPU.


Figure 2: Bounding boxes with dimension priors and location prediction. The width and height of the box are predicted as offsets from cluster centroids, and the center coordinates of the box are predicted relative to the location of filter application using a sigmoid function.
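For reference, the decoding illustrated in Figure 2 can be written out as the following equations from the paper, where $(t_x, t_y, t_w, t_h)$ are the network outputs, $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image, and $(p_w, p_h)$ are the width and height of the bounding-box prior:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \\
b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w\, e^{t_w}, \\
b_h &= p_h\, e^{t_h}.
\end{aligned}
$$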


Table 1: Darknet-53 network architecture.
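The architecture table is an image in the original article. As a rough illustration of the network's main building block, here is a minimal PyTorch-style sketch (our own, with assumed module names, not code from the paper) of the residual unit that Darknet-53 repeats: a 1x1 convolution that halves the channels, a 3x3 convolution that restores them, and a shortcut connection.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One Darknet-53-style residual unit: a 1x1 conv halving the channels,
    then a 3x3 conv restoring them, added to the input via a shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        hidden = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)

# Darknet-53 stacks such units 1, 2, 8, 8, and 4 times between
# stride-2 downsampling convolutions, for 53 conv layers in total.
x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)  # torch.Size([1, 256, 52, 52])
```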


Table 2: Comparison of backbone architectures: top-1 and top-5 accuracy, operations (in billions), floating-point operations per second (in billions), and FPS.


Table 3: This table is from [7]. It shows that YOLOv3 performs well: RetinaNet needs about 3.8 times longer to process an image, YOLOv3 is much better than the SSD variants, and it is comparable to the state-of-the-art models on the AP_50 metric.


Figure 3: Also adapted from [7], showing the speed/accuracy tradeoff (mAP vs. inference time) on the mAP at .5 IOU metric. The figure shows that YOLOv3 is both accurate and fast.

Finally, Machine Heart also tried running the pretrained YOLOv3 for object detection. At inference time, loading the model and weights takes about one second, and the time for the subsequent prediction depends largely on the pixel dimensions of the input image. In practice, YOLOv3 really is very fast.
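For readers who want to repeat that quick test, one convenient route (our suggestion, not necessarily the setup used above) is OpenCV's DNN module with the official cfg and weights files downloaded from https://pjreddie.com/yolo/. The file and image paths below are assumptions about where you saved things:

```python
import cv2
import numpy as np

# Paths are assumptions: download yolov3.cfg and yolov3.weights from
# the official site (https://pjreddie.com/yolo/) before running this.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

image = cv2.imread("dog.jpg")  # any local test image
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0,
                             size=(320, 320), swapRB=True, crop=False)
net.setInput(blob)

# A single forward pass returns the YOLO output layers.
outputs = net.forward(net.getUnconnectedOutLayersNames())
for out in outputs:
    for det in out:  # det: [cx, cy, w, h, objectness, 80 class scores]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = det[4] * scores[class_id]
        if confidence > 0.5:
            print("class %d with confidence %.2f" % (class_id, confidence))
```

Timing the `net.forward` call with different `size` values (e.g. 320 vs 608) makes the pixel-size dependence mentioned above easy to see.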


Paper: YOLOv3: An Incremental Improvement


Paper link: https://pjreddie.com/media/files/papers/YOLOv3.pdf

Abstract: We present YOLOv3, the latest version of YOLO. We made a number of changes to YOLO's design details to improve its performance. The new model is a little bigger than before, but more accurate. Don't worry, it is still fast. At 320x320, YOLOv3 runs in 22ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the legacy .5 IOU mAP detection metric, YOLOv3 is quite good: it achieves 57.9 AP_50 in 51ms on a Titan X, compared to RetinaNet's 57.5 AP_50 in 198ms; similar performance but 3.8 times faster.
