[YOLO series] A super detailed interpretation of the YOLOv1 paper (translation + study notes)

foreword

Starting from this article, we begin our study of YOLO. YOLO is currently a popular object detection algorithm known for its speed and simple structure. Other object detection algorithms such as the R-CNN series will be introduced later if time permits.

This article mainly introduces YOLOv1, a new object detection algorithm proposed by Joseph Redmon and his collaborators in 2015. It differs from earlier object detection algorithms such as R-CNN: the R-CNN family consists of two-stage algorithms, which first generate candidate boxes on the image and then use a classifier to judge each candidate box one by one. YOLOv1 is a one-stage, end-to-end algorithm. It treats object detection as a regression problem: the image is fed into a single neural network, which directly outputs the object bounding boxes and class probabilities. Let's start learning.


Here are some study materials:

Link to the paper: [1506.02640] You Only Look Once: Unified, Real-Time Object Detection (arxiv.org)

Project address: YOLO: Real-Time Object Detection (pjreddie.com)

Github source address: mirrors / alexeyab / darknet · GitCode 


Table of contents

Abstract—abstract 

1. Introduction—Preface 

2. Unified Detection—unified detection

2.1 Network Design—Network Design

2.2 Training—training 

2.3 Inference—inference

2.4 Limitations of YOLO-YOLO limitations 

3. Comparison to Other Detection Systems—Comparison with other target detection algorithms

4. Experiments—Experiments 

4.1 Comparison to Other Real-Time Systems—Comparison with other real-time systems

 4.2 VOC 2007 Error Analysis—VOC 2007 Error Analysis

 4.3 Combining Fast R-CNN and YOLO—the combination of Fast R-CNN and YOLO

4.4 VOC 2012 Results—VOC 2012 Results 

4.5 Generalizability: Person Detection in Artwork—Generalization: Person Detection in Artwork 

5. Real-Time Detection In The Wild—real-time detection in natural environment 

6. Conclusion—Conclusion


Abstract—abstract 

translate

Our proposed YOLO is a new object detection method. Previous object detection methods perform detection by repurposing classifiers. Unlike those schemes, we treat object detection as a regression problem of spatially separated bounding boxes and their associated class probabilities. We use a single neural network to predict bounding boxes and class probabilities directly from full images in a single evaluation. Since the entire detection pipeline is a single network, detection performance can be optimized end-to-end directly.

Our unified architecture is blazingly fast. Our base YOLO model processes images in real time at 45 fps (frames per second). A smaller version of the network, Fast YOLO, runs at an astonishing 155 fps while still achieving double the mAP of other real-time detectors. Compared with the most advanced (state-of-the-art, SOTA) detection systems, YOLO produces more localization errors, but it is far less likely to make the false positive mistake of predicting background as an object. Finally, YOLO can learn object representations with strong generalization. When a model learned from natural images is applied to other domains such as artwork, it outperforms other detection methods including DPM and R-CNN.

intensive reading

Previous methods (R-CNN series)

(1) Generate a large number of potential bounding boxes that may contain objects through region proposals

(2) Then use a classifier to judge whether each bounding box contains an object, and the probability or confidence of the category the object belongs to

(3) Finally, refine the bounding boxes by regression

Introduction to YOLO:

This paper turns detection into a regression problem. YOLO directly obtains bounding boxes and the class probability of each bounding box from the input image through a single neural network.

Because the entire detection process has only one network, it can be directly optimized end-to-end.

end-to-end:  the input is the raw data and the output is the final result. In earlier pipelines, the input was not the raw data itself but features extracted from it. By reducing manual pre-processing and post-processing, an end-to-end model maps as directly as possible from the original input to the final output, which gives the model more room to adjust itself to the data and increases its overall fit. In CV, a typical example is a neural network whose input is the raw image and whose output is the final control instruction (which can directly drive a machine).


1. Introduction—Preface 

translate

People just glance at the image and instantly know what the objects in the image are, where they are and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving without even realizing it. Fast and accurate object detection algorithms could allow computers to drive cars without specialized sensors, enable assistive devices to communicate real-time scene information to human users, and unlock the potential of general-purpose, responsive robotic systems.

Current detection systems perform detection by reusing classifiers . To detect an object, these systems provide a classifier for that object, which is evaluated at different locations and at different scales in the test image. Systems like deformable parts models ( DPM , Deformable Parts Model) use a sliding window approach where the classifier operates on evenly spaced locations across the image [10].

Recent methods such as R-CNN use a region proposal strategy that first generates potential bounding boxes in the image and then runs a classifier on these boxes. After classification, post-processing is performed to refine the bounding boxes, eliminate duplicate detections, and re-score the boxes with respect to other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We treat object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities. Using our system, You Only Look Once (YOLO), you can tell what objects are in an image and where they are.

YOLO is very simple (see Figure 1), and it can simultaneously predict multiple bounding boxes and their class probabilities with only a single convolutional network. YOLO is trained on the whole image and can directly optimize the detection performance . Compared with traditional object detection methods, some advantages of this unified model are listed below

First, YOLO is very fast . Since we treat detection as a regression problem , we don't need complex pipelines. At test time, we simply run our neural network on a new image to predict detections. Without batching on a Titan X GPU, the base version of YOLO runs at 45 frames per second, while the fast version runs at over 150fps. This means we can process streaming video in real time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the average accuracy of other real-time systems. For a demonstration of our system running in real time on a webcam, see our project webpage: YOLO: Real-Time Object Detection .

Second, YOLO reasons over the entire image when making predictions. Unlike techniques based on sliding windows and region proposals, YOLO sees the entire image during training and at test time, so it implicitly encodes contextual information about classes and their appearance. Fast R-CNN is a good detection method [14], but because it cannot see the larger context, it mistakes background patches for objects. Compared with Fast R-CNN, YOLO makes less than half as many background errors.

Third, YOLO can learn generalizable representations of objects. When a model trained on natural images is tested on artwork, YOLO greatly outperforms top detection methods such as DPM and R-CNN. Because YOLO is highly generalizable, it is less likely to fail when applied to new domains or when it encounters unexpected inputs.

YOLO still lags behind current state-of-the-art detection systems in accuracy . While it can quickly identify objects in images, it is less accurate at locating certain objects, especially small ones.

We further explore the accuracy/time trade-off in our experiments. All our training and testing code is open source, and various pretrained models are also available for download.


intensive reading 

Previous research:

DPM:  The system uses a classifier for detecting objects and evaluates it at different locations and scales in the test image

R-CNN: Selective Search (SS) extracts candidate boxes + CNN features + classification + regression.

YOLO processing steps:

(1) Adjust the size of the input image to 448×448, and divide it into a 7×7 grid;

(2) Feature extraction and prediction through CNN;

(3) Screening using non-maximum suppression (NMS)


Definition of YOLO:

YOLO redefines object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities. For an image, YOLO can predict which objects are present and where they are.

As in Figure 1: the image is fed into a single CNN, which predicts the bounding boxes and the class probability of each box.

YOLO is trained on whole images and can directly optimize detection performance.

Performance test comparison:


Advantages of YOLO:

(1) YOLO is very fast and can meet real-time requirements: it reaches 45 frames per second on a Titan X GPU.

(2) YOLO uses the global image when making predictions. Compared to Fast R-CNN, YOLO produces less than half the number of background errors.

(3) YOLO learns a more general feature representation of objects, so it is less likely to fail when applied to new domains or unexpected inputs.


2. Unified Detection—unified detection

grid unit 

translate

We integrate the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes for all classes of an image simultaneously. This means that our network reasons globally about the entire image and all objects in it. The YOLO design enables end-to-end training and real-time speed while maintaining high average precision.

Our system divides the input image into an S×S grid . If the center of the object falls into a certain grid cell, then that grid cell is responsible for detecting the object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object, and how accurate it thinks its prediction is. Formally, we define confidence as confidence = Pr(Object) * IOU^{truth}_{pred}. If no object exists in that cell (i.e. Pr(Object) = 0), the confidence score should be 0. Otherwise (i.e. Pr(Object) = 1), we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i|Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time, we multiply the conditional class probabilities and the individual box confidence predictions:

Pr(Class_{i}|Object) * Pr(Object) * IOU^{truth}_{pred} = Pr(Class_{i}) * IOU^{truth}_{pred}

This gives us a class-specific confidence score for each box. These scores encode both the probability that the class appears in the box and how well the predicted box matches the target.


intensive reading

Thought

YOLO treats the object detection problem as a regression problem. The input image is divided into an S×S grid. If the center point of an object falls into a cell, that cell is responsible for predicting the object. One grid cell can only predict one object, although it generates two prediction boxes.

For each grid cell:

(1) Predict B bounding boxes, each with a confidence score. The size of these boxes is unconstrained; the only requirement is that the center point of each generated box must lie inside the grid cell.

(2) Each bounding box contains 5 elements: (x, y, w, h) plus a confidence score

x, y:  the offset of the bounding box's center from the upper-left corner of the grid cell it belongs to, normalized to between 0 and 1.

In the above figure, the green dashed box represents the grid cell, and the green dot represents the upper-left corner of the grid cell, whose coordinates are (0, 0); the red and blue boxes represent the two bounding boxes contained in the grid cell, and the red and blue dots represent the center coordinates of the two bounding boxes. It is very important that the center coordinates of the bounding box must be inside the grid cell, so the coordinates of the red and blue points can be normalized to between 0 and 1. In the above figure, the coordinates of the red point are (0.5, 0.5), i.e. x=y=0.5, and the coordinates of the blue point are (0.9, 0.9), i.e. x=y=0.9.

w, h:  the width and height of the bounding box, also normalized to between 0 and 1, representing the width and height relative to the original image (i.e. 448 pixels). For example, if the bounding box predicts a width of 44.8 pixels and a height of 44.8 pixels, then w=0.1 and h=0.1.

For the red box, x=0.8, y=0.5, w=0.1, h=0.2.
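As a concrete illustration, here is a minimal Python sketch of this encoding, assuming a 448×448 input and S=7; the function name and the example numbers are only illustrative, not taken from the paper.

```python
def encode_box(cx, cy, bw, bh, img_size=448, S=7):
    """Convert an absolute box (center cx, cy and size bw, bh in pixels)
    into the grid-relative YOLO encoding described above."""
    cell = img_size / S                              # one grid cell is 64 px wide
    col, row = int(cx // cell), int(cy // cell)      # grid cell that owns the object center
    x = (cx - col * cell) / cell                     # center offset inside the cell, 0-1
    y = (cy - row * cell) / cell
    w = bw / img_size                                # width/height relative to the image, 0-1
    h = bh / img_size
    return row, col, (x, y, w, h)

# Example: a 44.8 x 89.6 px box centered at (275.2, 256.0)
print(encode_box(275.2, 256.0, 44.8, 89.6))  # ~ (4, 4, (0.3, 0.0, 0.1, 0.2))
```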

(3) Regardless of the number of boxes B, each grid cell is only responsible for predicting one object.

(4) Predict C conditional class probabilities (the probability that the object belongs to each category)

To sum up, there are S×S grid cells, and each cell predicts B bounding boxes (upper middle picture) and C class probabilities (lower middle picture). Merging the two, the network output is an S × S × (5×B + C) tensor. (S × S grid cells, each with B predicted boxes of 5 parameters each, plus C class predictions per cell.)

Q1: Why does each grid cell have a fixed number B of bounding boxes? (i.e. B=2)

During training, the IOU between each predictor's bounding box and the ground truth is computed on the fly. The predictor with the larger IOU is made responsible for predicting the object, while the other is not. What is the benefit of this? My understanding is that the two predictors effectively make predictions together, and the network then selects, online, the better one (the one with the larger IOU) to be responsible, as sketched below.
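Below is a minimal sketch of this "responsible predictor" selection, assuming boxes in corner format; the helper names are hypothetical, not from the paper or the official code.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def responsible_predictor(pred_boxes, gt_box):
    """Of the B predicted boxes in a cell, the one with the highest IOU
    against the ground truth is made 'responsible' for the object."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    best = max(range(len(ious)), key=lambda i: ious[i])
    return best, ious
```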

Q2: How are the two bounding boxes predicted by each grid obtained?

The two bounding boxes in YOLO are hand-picked boxes (with two different aspect ratios), whose initial information is supplied as a hyperparameter at the start of training. As training proceeds, the loss decreases and the bounding boxes become more and more accurate. Faster R-CNN's anchors are also hand-picked (9 different aspect ratios and scales), while YOLOv2 obtains its anchors (5) by statistical analysis of the ground truth boxes.


Predicted Feature Composition

The final prediction consists of the bounding box positions, the box confidence scores, and the class probabilities. The meanings of these three are as follows:

  • Box position:  for each box, we predict its center coordinates, width, and height, giving 8 predicted values in total for the two boxes. The width w and height h of the bounding box are normalized by the image width and height, so x, y, w, h are all between 0 and 1.
  • Confidence score (box confidence score) c:  how likely the box is to contain an object and how accurate the bounding box is. This is similar to the foreground/background score in Faster R-CNN. Since there are two bounding boxes, there are two confidence predictions.
  • Class probability:  since the PASCAL VOC dataset has 20 object categories in total, we predict here which category the box belongs to.

Notice

  • The two bounding boxes predicted by a cell share the same class prediction. During training, the box with the larger IoU against the label is selected to regress the real object box. During testing, the box with the higher confidence is selected and the other is discarded, so the 7×7=49 grid cells can predict at most 49 objects.
  • Because each grid cell has only one classification, it can only predict one object, which is why YOLO performs poorly on small objects. If the picture is extremely dense, a grid cell may contain multiple objects, but the YOLO model can only predict one of them, so the other objects in that grid cell are ignored.

2.1 Network Design—Network Design

 translate

We implement this model as a convolutional neural network and evaluate it on the Pascal VOC detection dataset [9]. The network's initial convolutional layers extract features from images, while fully connected layers are responsible for predicting output probabilities and coordinates.

Our network architecture is inspired by the image classification model GoogLeNet [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. We only use a 1×1 dimensionality reduction layer followed by a 3×3 convolutional layer, which is similar to Lin et al. [22], instead of the Inception module used by GoogLeNet.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24), using fewer convolutional kernels in those layers. Except for the network size, all training and testing parameters of the basic version of YOLO and fast YOLO are the same.

The final output of our network is a 7×7×30 prediction tensor.


intensive reading

network structure

The YOLO network structure draws on GoogLeNet (super-detailed interpretation of classic neural network papers (3) — GoogLeNet InceptionV1 study notes (translation + intensive reading + code reproduction)). The input image size is 448×448; after 24 convolutional layers, 2 fully connected layers (FC), and a final reshape operation, the output feature map size is 7×7×30.


Q: How did 7×7×30 come about?

Tensor Profile

 (Image source: YOLO v1 Detailed Interpretation_yolov1 Detailed Explanation_Diffie Herman's Blog-CSDN Blog )

  •  7×7:  A total of 7×7 grids are divided.
  •  30:  30 contains the parameters of two prediction boxes and the category parameters of Pascal VOC: each prediction box has 5 parameters: x, y, w, h, confidence. In addition, there are 20 categories in Pascal VOC; so the last 30 is actually composed of 5x2+20, which means that this 30-dimensional vector is the information of a grid cell.
  •  7×7×30:  with a total of 7 × 7 grid cells, this gives a 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30 tensor = 1470 outputs, exactly matching the paper (see the decoding sketch below).
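A small sketch of how this 7×7×30 tensor could be decoded into 98 boxes with class-specific confidence scores (Pr(Class_i|Object) × confidence). It assumes the per-cell layout [box1 (5 values), box2 (5 values), 20 class probabilities]; the actual memory layout in the official implementation may differ.

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network's 7x7x30 prediction

boxes, scores = [], []
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]                     # 20 conditional class probabilities
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]  # one predicted box + its confidence
            # class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU
            class_scores = class_probs * conf
            boxes.append((row, col, x, y, w, h))
            scores.append(class_scores)

scores = np.array(scores)        # shape (98, 20): 98 boxes, one score per class
print(scores.shape)
```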

Detailed network explanation

(1) YOLO mainly builds a CNN network that produces a 7×7×1024 feature tensor.

(2) Two fully connected layers then perform linear regression to produce the 7×7×2 bounding box predictions. Boxes with high confidence scores (greater than 0.25) are kept as the final predictions.

(3) A 3×3 convolution is usually preceded by a 1×1 convolution with a smaller number of channels. This not only reduces the amount of computation but also improves the nonlinear capacity of the model.

(4) Except for the linear activation function used in the last layer, the activation function of the remaining layers is Leaky ReLU .

(5) Dropout and data enhancement methods are used in training to prevent overfitting.

(6) The last convolutional layer outputs a tensor of shape (7, 7, 1024). The tensor is then flattened and passed through 2 fully connected layers acting as a form of linear regression; they output 1470 parameters, which are finally reshaped to (7, 7, 30), as the sketch below shows.
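A PyTorch-style sketch of just this detection head (flatten → FC 4096 → FC 1470 → reshape to 7×7×30); the 24-layer convolutional backbone is omitted, and the module name is made up for illustration.

```python
import torch
import torch.nn as nn

class YOLOv1Head(nn.Module):
    """Detection head only: takes the (N, 1024, 7, 7) feature map from the
    convolutional backbone and produces the (N, 7, 7, 30) prediction tensor."""
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        self.fc = nn.Sequential(
            nn.Flatten(),                            # (N, 1024*7*7) = (N, 50176)
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),                         # dropout after the first FC layer
            nn.Linear(4096, S * S * (B * 5 + C)),    # 1470 outputs, linear activation
        )

    def forward(self, features):
        out = self.fc(features)
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)

head = YOLOv1Head()
print(head(torch.randn(1, 1024, 7, 7)).shape)   # torch.Size([1, 7, 7, 30])
```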


2.2 Training—training 

translate

We pre-train our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pre-training, we use the first 20 convolutional layers in Figure 3, followed by an average pooling layer and a fully connected layer. We trained this network for about a week and achieved 88% top-5 accuracy on a single crop on the ImageNet 2012 validation set, comparable to the GoogLeNet model in the Caffe Model Zoo. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection training. Ren et al. showed that adding convolutional and connected layers to a pretrained network improves performance [29]. Following their approach, we add four convolutional layers and two fully connected layers, with all weights of these layers initialized randomly. Detection usually requires fine-grained visual information, so we change the input resolution of the network from 224×224 to 448×448.

The last layer of the model predicts class probabilities and bounding box coordinates . We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parameterize the bounding box x and y coordinates as offsets from specific grid cell locations , so their values ​​are clamped between 0 and 1.

The last layer of the model uses a linear activation function, while all other layers use the following Leaky-ReLU:

\phi(x)=\begin{cases} x, & \text{if } x>0 \\ 0.1x, & \text{otherwise} \end{cases}

We optimize the sum-squared error of the model output . We chose to use the sum of squared errors because it is easy to optimize, but it does not quite meet the goal of maximizing average precision. It weights the classification error as much as the localization error, which may not be ideal. Additionally , each image has many grid cells that do not contain any objects, which pushes the "confidence" scores of these cells towards zero, generally suppressing the gradient of cells containing objects. This can lead to model instability, causing training to diverge early on.

To compensate for the sum-of-squares error, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that do not contain objects. We do this using two parameters, λ_{coord} and λ_{noobj}. We set λ_{coord} = 5 and λ_{noobj} = 0.5.

The sum of squared errors also weights errors in large boxes and small boxes equally, whereas our error metric should reflect that a small deviation matters less for a large box than for a small box. To partially address this, we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes for each grid cell . At training time, we only need one bounding box predictor for each object. A predictor is designated as "responsible" for predicting the target if its predicted value has the highest IOU value to the target's actual value. This leads to specialization of the bounding box predictor. Each predictor can better predict a specific size, orientation, or class of object, thereby improving the overall recall.

During training, we optimize the following multipart loss function:

 Note that the loss function only penalizes misclassification if the target exists in that grid cell (the conditional class probability discussed earlier) . It also only penalizes bounding box coordinate errors if the predictor is "in charge" of the actual bounding box (i.e. the predictor with the highest IOU in that grid cell) .

We trained the network for about 135 epochs using the Pascal VOC 2007 and 2012 training and validation datasets. Since we only test on Pascal VOC 2012, our training set includes the test data of Pascal VOC 2007. Throughout training, we use: batch size=64,momentum=0.9,decay=0.0005.

Our learning rate schedule is as follows: in the first epoch, we slowly raise the learning rate from 10^{-3} to 10^{-2}. If we start training at a high learning rate, our model often diverges due to unstable gradients. We then continue training with 10^{-2} for 75 epochs, then 10^{-3} for 30 epochs, and finally 10^{-4} for 30 epochs.
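The schedule can be written as a simple function of the epoch index. This is only a sketch: the paper does not say the first-epoch warm-up is linear, so that part is an assumption.

```python
def learning_rate(epoch, iter_frac=0.0):
    """Learning rate for a given epoch, following the schedule described above.
    iter_frac in [0, 1) is the fraction of the current epoch already completed;
    the linear warm-up over the first epoch is an assumption (the paper just
    says the rate is raised 'slowly' from 1e-3 to 1e-2)."""
    if epoch == 0:
        return 1e-3 + (1e-2 - 1e-3) * iter_frac   # warm up during the first epoch
    if epoch <= 75:
        return 1e-2                               # 75 epochs at 1e-2
    if epoch <= 105:
        return 1e-3                               # 30 epochs at 1e-3
    return 1e-4                                   # final 30 epochs at 1e-4

print([learning_rate(e) for e in (0, 1, 75, 76, 105, 106, 135)])
```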

To avoid overfitting, we use dropout and extensive data augmentation . The dropout rate of dropout layers after the first connection layer is set to 0.5 to prevent mutual adaptation between layers [18]. For data augmentation, we introduce random scaling and translations up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by a factor of up to 1.5 in the HSV color space.


intensive reading

Pretrained classification network

We pre-train a classification network on the ImageNet 1000-class dataset. This network uses the first 20 convolutional layers in Figure 3, followed by an average pooling layer and a fully connected layer. (At this stage the network input is 224×224.)

Q: The detection network requires a fixed input size of 448×448. Why can 224×224 images be fed in during the pre-training stage?

The main reason is the added average pooling layer: no matter what the input size is, the number of neurons feeding the final fully connected layer stays the same, as the sketch below illustrates.
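A tiny PyTorch sketch illustrating the point (not the actual YOLO code): with global average pooling, 224×224 and 448×448 inputs yield feature vectors of the same length, so the same fully connected layer works for both.

```python
import torch
import torch.nn as nn

# Minimal sketch: a conv stack ending in global average pooling.
# Whatever the spatial size of the input, the pooled output is always
# (N, channels, 1, 1), so the following fully connected layer fits both
# the 224x224 pre-training images and the 448x448 detection images.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.LeakyReLU(0.1),
    nn.AdaptiveAvgPool2d(1),     # global average pooling
    nn.Flatten(),
)
classifier = nn.Linear(64, 1000)  # 1000-class ImageNet head

for size in (224, 448):
    x = torch.randn(1, 3, size, size)
    print(size, classifier(features(x)).shape)   # both give torch.Size([1, 1000])
```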


  Train the detection network

After the pre-training in the previous step, the first 20 convolutional layers of the backbone have been trained and their parameters have learned image features. The next step is essentially transfer learning: 4 convolutional layers and 2 fully connected layers are appended after the trained 20 convolutional layers, and the network is then trained on the object detection task.

During training of the entire network (24+2 layers), all layers use the Leaky ReLU activation function except the last layer, which uses a linear activation function. Compared with ReLU, Leaky ReLU avoids the zero-gradient problem when the input is negative. The expression of the Leaky ReLU function used in YOLOv1 is:

\phi(x)=\begin{cases} x, & \text{if } x>0 \\ 0.1x, & \text{otherwise} \end{cases}


NMS non-maximum suppression

Concept: the NMS algorithm mainly solves the problem of a target being detected multiple times; its purpose is to select the best box among many overlapping boxes in a region.

Specific operations in YOLO

(1) For the 98 columns of data above, first consider one category, i.e. only the 98 scores in that row. Take the box with the highest probability and compare each of the remaining boxes with it; if the IoU of the two exceeds a threshold, the two boxes are considered to have detected the same object twice, and the lower probability is reset to 0.

(2) After the highest-probability box has been compared with the others, find the highest-probability box among the remaining ones and compare it with the rest, and so on for every category. Note that we cannot simply keep only the single highest-probability box, because there may be multiple objects of this category in the picture; boxes whose IoU with the kept box is below the threshold are therefore retained.

(3) Finally we obtain a sparse matrix, since many entries have been reset to 0; the probabilities and categories of the non-zero entries are taken out as the final detection result.

Note:  NMS is used only in the prediction stage. It cannot be used during training, because during training every box, whether or not it is used to predict an object, contributes to the loss function and cannot be arbitrarily reset to 0. A minimal NMS sketch follows below.
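For reference, here is a minimal per-class NMS sketch following the steps above (it reuses the `iou` helper sketched earlier); the threshold values and the function name are illustrative, not taken from the official implementation.

```python
import numpy as np

def nms_per_class(boxes, scores, iou_threshold=0.5):
    """Per-class NMS: `boxes` holds (x1, y1, x2, y2) corner boxes, `scores`
    holds this class's score for each box. Returns indices of kept boxes."""
    order = list(np.argsort(scores)[::-1])      # indices sorted by score, highest first
    keep = []
    suppressed = set()
    for pos, i in enumerate(order):
        if i in suppressed or scores[i] <= 0:
            continue
        keep.append(i)                          # highest remaining score: keep it
        for j in order[pos + 1:]:               # compare against lower-scored boxes
            if j in suppressed:
                continue
            if iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed.add(j)               # same object detected twice: drop it
    return keep
```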


loss function

Loss functions include:

localization loss -> coordinate loss

confidence loss -> confidence loss

classification loss -> classification loss

Detailed explanation of the loss function:

(1) Coordinate loss

  • The first line: the localization error of the center point (x, y) of the box responsible for detecting the object.
  • The second line: the width and height (w, h) localization error of the box responsible for detecting the object. The square root corrects the flaw of treating boxes of different sizes the same, weakening the error contribution of large boxes.

Q: Why take the square root?

In the figure above, the bounding box and the ground truth differ by the same small amount for both the large box and the small box. In actual prediction, that difference may not matter much for the large box (a large target), but for the small box (a small target) it may put the predicted box far off the target. If we still compute the plain sum of squared errors as in the first term, we treat large and small boxes equally, which is clearly unreasonable. Using the square root improves this problem to some extent.

This way, for the same absolute deviation, the error produced by the small box is larger; that is, the penalty on small boxes is more severe.

(2) Confidence loss

 

  • First row: Confidence error for the box responsible for detecting the object.
  • Second row: Confidence error for the box that is not responsible for detecting the object.

(3) Classification loss

Classification error of the grid cell responsible for detecting the object (a combined sketch of the three loss components follows below).
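To make the three components concrete, here is a simplified per-predictor sketch of the loss with λ_coord=5 and λ_noobj=0.5. It is not the full vectorized loss from the paper: the dict-based interface is made up for illustration, and the classification term is added via the responsible predictor so it is counted once per cell.

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def cell_loss(pred, target, responsible, has_object):
    """Loss contribution of one predictor in one grid cell.
    pred/target: dicts with x, y, w, h, conf and a list of class probs.
    responsible: this predictor has the highest IOU with the ground truth.
    has_object: an object's center falls in this cell."""
    loss = 0.0
    if has_object and responsible:
        # coordinate loss (sqrt on w, h softens the penalty for large boxes)
        loss += LAMBDA_COORD * ((pred["x"] - target["x"]) ** 2 +
                                (pred["y"] - target["y"]) ** 2)
        loss += LAMBDA_COORD * ((math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 +
                                (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2)
        # confidence loss for the responsible box
        loss += (pred["conf"] - target["conf"]) ** 2
        # classification loss, added once per cell (through the responsible predictor)
        loss += sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    else:
        # confidence loss for boxes that are not responsible for any object
        loss += LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2
    return loss
```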


Meaning of special symbols:

 


2.3 Inference—inference

translate

Just like in training, predicting detections on test images requires only one network evaluation. On Pascal VOC, the network predicts 98 bounding boxes and the class probability of each box on each image. YOLO is very fast at test time because it requires only one network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in bounding box predictions . Usually it is obvious which grid cell an object falls in, and the network can only predict one bounding box for each object. However, some large objects or objects close to the boundaries of multiple grid cells can be located by multiple grid cells. Non-maximal suppression (NMS) can be used to correct for these multiple detections . Non-maximum suppression is not as important to the performance of YOLO as it is for R-CNN or DPM, but can also increase mAP by 2−3%.

intensive reading

(1) Predicting detections on test images requires only one network evaluation.

(2) Fast test time

(3) When an object in the image is large, or lies near the boundary of several grid cells, it may be localized by multiple cells.

(4) NMS is used to remove repeated detections and improve mAP, though the gain is smaller than for R-CNN or DPM.


2.4 Limitations of YOLO-YOLO limitations 

translate

Since each grid cell can only predict two boxes and only one class, YOLO imposes strong spatial constraints on bounding box predictions . This spatial constraint limits the number of nearby objects that our model can predict. Our model has difficulty predicting small objects that appear in groups (such as flocks of birds).

Since our model learns to predict bounding boxes from data, it has difficulty generalizing to targets of new, uncommon aspect ratios or configurations. Our model also uses relatively coarse features to predict bounding boxes because the input image goes through multiple downsampling layers in our architecture .

Finally, our training is based on a loss function that approximates the detection performance by indiscriminately treating small vs. large bounding box errors. Small errors for large bounding boxes are usually insignificant, but small errors for small bounding boxes have a much larger impact on IOU. Our main mistakes come from incorrect positioning .

intensive reading

(1) Detection of grouped small objects in an image is relatively poor. The later layers of the YOLOv1 network have a large receptive field, so the features of small objects can hardly be reflected in the final 7×7 grid. In view of this, YOLOv2 makes certain modifications and fuses in features from earlier layers (with small receptive fields).

(2) The original picture is only divided into a 7×7 grid. When two objects are very close (so close that their center points fall in the same grid cell), performance is relatively poor, because YOLOv1's design allows a grid cell to predict only one object, so one target is lost. In view of this, YOLOv2 introduces the concept of anchors: in theory a grid cell can predict as many targets as it has anchors.

(3) Each grid cell corresponds to only two bounding boxes. When an object has an uncommon aspect ratio (one not covered by the training data), performance is poor.

(4) Finally, each grid cell corresponds to only one category, which makes missed detections (unrecognized objects) likely.


3. Comparison to Other Detection Systems—Comparison with other target detection algorithms

translate

Object detection is a central problem in computer vision. The detection pipeline usually starts by extracting a set of robust features (Haar [25], SIFT [23], HOG [4], convolutional features [6]) from the input image. Then , classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers operate either over the entire image or over some sub-regions in the image in a sliding window fashion [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable Part Models (DPM) use a sliding window approach for object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high-scoring regions, and so on. Our system replaces all these disparate parts with a single convolutional neural network. The network simultaneously performs feature extraction, bounding box prediction, non-maximum suppression, and contextual reasoning. The network's features are trained in-line rather than being static, so they can be optimized for the specific detection task. Our unified architecture is faster and more accurate than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-maximum suppression eliminates duplicate detections. Each stage of this complex pipeline has to be precisely tuned independently, and the resulting system is very slow, requiring more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and uses convolutional features to score these boxes. However, our system imposes spatial constraints on the grid cell proposals , which helps alleviate the problem of multiple detections of the same object. Our system also generates far fewer bounding boxes , only 98 per image compared to ~2000 with Selective Search. Finally, our system combines these individual components into a single, co-optimized model.

Other rapid detectors . Fast R-CNN and Faster R-CNN accelerate the R-CNN framework by sharing computation and using neural networks instead of selective search [14], [28] to propose region proposals. Although they provide faster speed and higher accuracy than R-CNN, they still cannot achieve real-time performance.

Much research work has focused on speeding up the DPM process [31][38][5]. They accelerate HOG computations, use cascades, and push computations onto (multiple) GPUs. However, only 30Hz DPM [31] can actually run in real time.

YOLO does not try to optimize individual components of a large detection pipeline; instead, it throws the pipeline out entirely and is fast by design.

Detectors for single classes like faces or pedestrians can be highly optimized since they have to deal with much less variation [37]. YOLO is a general-purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox . Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest (ROI) [8] instead of using selective search. MultiBox can also perform single object detection by replacing confidence predictions with single class predictions. However, MultiBox cannot perform general object detection and is still only a part of a larger detection pipeline , requiring further image patch classification. Both YOLO and MultiBox use convolutional networks to predict bounding boxes in images, but YOLO is a complete detection system .

OverFeat. Sermanet et al. trained a convolutional neural network to perform localization and adapted the localizer to perform detection [32]. OverFeat efficiently performs sliding window detection, but it is still a disjoint system. OverFeat optimizes the localization function, not the detection performance. Like DPM, the localizer only sees local information when making predictions. OverFeat cannot infer global context, thus requiring extensive post-processing to produce coherent detections.

MultiGrasp . Our system is similar in design to the grasp detection of Redmon et al. [27]. Our grid bounding box prediction method is based on the MultiGrasp system for regression analysis . However, grasp detection is much simpler than object detection. MultiGrasp only needs to predict one graspable region for an image containing one object. It doesn't have to estimate the size, location or boundaries of the object or predict its class, it just needs to find the area suitable for grasping. YOLO, on the other hand, predicts bounding boxes and class probabilities for multiple objects of multiple classes in an image .


intensive reading

DPM

DPM uses traditional HOG features and a traditional support vector machine (SVM) classifier, together with hand-crafted templates, and then slides a window over the entire image in a brute-force search to match the templates. The big problems with this method are that designing and matching templates requires a huge amount of computation, and that the templates are static: they cannot adapt to varied appearances, so robustness is poor.

R-CNN

  • The first stage: Each image uses the selective search SS method to extract 2000 candidate boxes.
  • The second stage: each candidate box is sent to the CNN network for classification (using SVM).

YOLO is stronger than both. YOLO and R-CNN also have similarities: both produce candidate boxes (YOLO's candidates are the 98 bounding boxes mentioned above), both use non-maximum suppression (NMS), and both use a CNN to extract features.

Other Fast Detectors

Fast and Faster R-CNN: these two models are revisions of R-CNN; their speed and accuracy are greatly improved, but they still cannot achieve real-time detection, i.e. their FPS cannot reach 30. The author does not dwell on accuracy here; in fact YOLO's accuracy is not an advantage and is even lower than theirs.

Deep MultiBox

Szegedy et al. train a convolutional neural network to predict regions of interest instead of using selective search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. Both YOLO and MultiBox use convolutional networks to predict bounding boxes in images, but YOLO is a complete detection system.

OverFeat

OverFeat efficiently performs sliding window detection, but it optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making predictions, so OverFeat cannot reason about global context.

MultiGrasp

YOLO is similar in design to the grasp detection work of Redmon et al. Its grid approach to bounding box prediction is based on the MultiGrasp system's regression to grasps.

In short, the author surveys all of the predecessors' work to highlight the strengths of his own model (a technique worth learning!).


4. Experiments—Experiments 

4.1 Comparison to Other Real-Time Systems—Comparison with other real-time systems

translate

Much research work in object detection has focused on making standard detection pipelines faster [5], [38], [31], [14], [17], [28]. However, only Sadeghi et al. actually produced a detection system that runs in real time (30 frames per second or better) [31]. We compared YOLO to a GPU implementation of DPM, which runs at 30Hz or 100Hz. Although other algorithms do not meet the real-time standard, we also compare their mAP and speed relationships to explore the trade-off between accuracy and performance in object detection systems .

Fast YOLO is the fastest object detection method on PASCAL; to the best of our knowledge, it is the fastest existing object detector. With 52.7% mAP, the accuracy of real-time detection is more than double that of previous methods. The normal version of YOLO pushed mAP to 63.4% while maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate than the normal version of YOLO , but also slower. Its role is for comparison with other detection systems that rely on VGG-16, but since it is slower than real-time, the rest of this paper focuses on our faster model.

Fastest DPM speeds up DPM effectively without sacrificing much mAP, but it still misses real-time performance by a factor of 2 [38]. Compared with neural network approaches, DPM's relatively low detection accuracy also limits it.

R-CNN minus R replaces selective search with static bounding box proposals [20]. While faster than R-CNN, it is still not real-time, and accuracy suffers severely due to the method's inability to find good bounding boxes.

Fast R-CNN speeds up the classification phase of R-CNN, but it still relies on selective search, which takes about 2 seconds per image to generate bounding box proposals. So while it has a high mAP, at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to the approach of Szegedy et al. [8]. In our tests, their most accurate model achieved 7fps, while the smaller, less accurate model ran at 18fps. The VGG-16 version of Faster R-CNN is 10mAP higher than YOLO, but 6 times slower than YOLO. The Zeiler-Fergus version of Faster R-CNN is only 2.5 times slower than YOLO, but it is not as accurate as YOLO.

intensive reading

Table 1 Comparison with other detection methods on Pascal VOC 2007

Conclusion: for real-time object detection (FPS > 30), YOLO is the most accurate and Fast YOLO is the fastest.


 4.2 VOC 2007 Error Analysis—VOC 2007 Error Analysis

translate

To further investigate the differences between YOLO and state-of-the-art detectors, we analyze a detailed breakdown of results on VOC 2007. We compare YOLO with Fast R-CNN, because Fast R-CNN is one of the highest performing detectors on PASCAL and its detection code is publicly available.

We use the method and tools of Hoiem et al. [19], and for each class tested, we look at the top N predictions for that class. Each prediction is either correct or classified according to the type of error:

  • Correct: correct class and IOU>0.5
  • Localization: correct class, 0.1<IOU<0.5
  • Similar: class is similar, IOU>0.1
  • Other: class is wrong, IOU>0.1
  • Background: IOU<0.1 for any object (all targets have IOU<0.1)

YOLO has difficulty localizing objects correctly, so localization errors are greater than all other errors of YOLO combined. Fast R-CNN has fewer positioning errors, but more errors mistaking the background for the target. 13.6% of its top detections are false positives (background) that do not contain any objects. Fast R-CNN is 3 times more likely to misidentify the background as a target than YOLO. 

intensive reading

This paper uses the methodology and tools of Hoiem et al. For each class at test time, we look at the top N predictions for that class. Each prediction is either correct or classified according to the type of error:

Parameter meaning:

  • Correct: The classification is correct, and the IOU between the prediction frame and the ground truth is greater than 0.5, which means that the category is predicted correctly, and the position and size of the prediction frame are also very appropriate. 
  • Localization: The classification is correct, but the IOU between the prediction frame and the ground truth is greater than 0.1 and less than 0.5, that is, although the prediction is correct, the position of the prediction frame is not so tight, but it is acceptable.
  • Similar: Similar categories are predicted, and the IOU between the prediction box and the ground truth is greater than 0.1. That is, the predicted category is not correct but similar, and the position of the predicted frame is acceptable.
  • Other: The predicted category is wrong, and the IOU between the predicted frame and the ground truth is greater than 0.1. That is, the predicted category is incorrect, but the prediction frame still barely frames the target.
  • Background: The IOU between the predicted box and the ground truth is less than 0.1, i.e. the predicted box covers background and contains no target (a small classification sketch follows this list).
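A small sketch that maps one detection to the categories listed above, given its best IOU with the ground truth and flags for whether the predicted class is correct or merely similar; the handling of values exactly at 0.1/0.5 is an assumption.

```python
def error_type(iou_value, class_correct, class_similar):
    """Assign a detection to one of the Hoiem et al. error categories above."""
    if class_correct and iou_value > 0.5:
        return "Correct"
    if class_correct and 0.1 < iou_value <= 0.5:
        return "Localization"
    if class_similar and iou_value > 0.1:
        return "Similar"
    if iou_value > 0.1:
        return "Other"           # wrong (and not similar) class, but some overlap
    return "Background"          # IOU < 0.1 with every ground-truth object

print(error_type(0.6, True, True))     # Correct
print(error_type(0.3, True, True))     # Localization
print(error_type(0.3, False, True))    # Similar
print(error_type(0.3, False, False))   # Other
print(error_type(0.05, False, False))  # Background
```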

Figure 4 shows the average breakdown of each error type across all 20 classes 

Conclusion: YOLO's localization error rate is higher than Fast R-CNN's; Fast R-CNN's background error rate is higher than YOLO's.


 4.3 Combining Fast R-CNN and YOLO—the combination of Fast R-CNN and YOLO

 translate

YOLO mistakes the background for objects far less often than Fast R-CNN. We obtain a significant performance gain by using YOLO to remove background detections from Fast R-CNN. For each bounding box predicted by R-CNN, we check whether YOLO predicts a similar box. If it does, we boost that prediction based on the probability predicted by YOLO and the overlap between the two boxes.

The best Fast R-CNN model achieves 71.8% mAP on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced only small mAP increases of between 0.3% and 0.6%.

The performance gain obtained by combining with YOLO is not just a by-product of model ensembling, since there is little benefit in combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of errors at test time that it is so effective at improving the performance of Fast R-CNN.

Unfortunately, this combination does not benefit from the speed of YOLO, since we run each model separately and then combine the results. However, since YOLO is so fast, it does not add any significant computation time compared to Fast R-CNN.

intensive reading

Table 2: Comparison of model combination experiments on VOC 2007

Conclusion: Because YOLO makes different kinds of mistakes at test time, it is very effective at improving the performance of Fast R-CNN. This combination does not benefit from YOLO's speed, but since YOLO is so fast, it does not add any significant computation time compared to Fast R-CNN alone.

4.4 VOC 2012 Results—VOC 2012 Results 

 translate

On the VOC 2012 test set, YOLO's mAP score is 57.9%. This is lower than the existing state of the art and closer to the original R-CNN using VGG-16, see Table 3. Compared to its closest competitors, our system struggles with small objects. YOLO scores 8-10% lower than R-CNN and Feature Edit on categories like bottle, sheep, and TV/monitor. However, on other categories such as cat and train, YOLO achieves better performance.

Our Fast R-CNN + YOLO model combination is one of the highest performing detection methods. Combining Fast R-CNN with YOLO improves its mAP by 2.3%, a jump of 5 spots on the public leaderboard.

intensive reading

Table 3 mAP ranking on VOC2012

Conclusion: Fast R-CNN achieves a 2.3% improvement from the combination with YOLO, which moves it up 5 spots on the public leaderboard.


4.5 Generalizability: Person Detection in Artwork—Generalization: Person Detection in Artwork 

translate

The training and testing data of academic object detection datasets follow the same distribution. But in real-world applications it is hard to predict all possible use cases, and the test data may differ from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso dataset [12] and the People-Art dataset [3], two datasets used to test person detection on artwork.

As a reference (for reference), we provide the AP of VOC 2007 for human detection, where all models are trained only on VOC 2007 data. The model tested on the Picasso dataset was trained on VOC 2012, while the model on the People-Art dataset was trained on VOC 2010.

R-CNN has a high AP value on VOC 2007. However, when applied to artwork, R-CNN drops off significantly. R-CNN uses Selective Search for candidate bounding boxes, which is tuned for natural images, and the classifier stage of R-CNN only sees small regions, so it needs good proposals.

DPM does a good job of maintaining its AP when applied to artistic images. Previous studies have argued that DPM performs well because of its robust spatial model of object shape and layout. Although DPM does not degrade like R-CNN, its AP is inherently low.

YOLO performs well on VOC 2007, and its AP decreases less than other methods when applied to artistic images. Like DPM, YOLO models the size and shape of objects, as well as the relationship between objects and where they typically appear. Art images and natural images are very different at the pixel level, but they are similar in object size and shape, so YOLO can still predict good bounding boxes and detection results.

intensive reading

Figure 5 Generality (Picasso dataset and People-Art dataset) 

Conclusion: YOLO has very good detection results


5. Real-Time Detection In The Wild—real-time detection in natural environment 

 translate

YOLO is a fast and accurate object detector ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time it takes to acquire images from the camera and display detection results.

The resulting system is interactive and engaging. While YOLO processes images alone, when connected to a webcam it functions like a tracking system, detecting objects as they move and change in appearance. A system demo and source code are available on our project website: YOLO: Real-Time Object Detection .

intensive reading

Conclusion: We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to acquire images from the camera and display the detection results. It turns out to work very well: as shown in the picture above, apart from the second image in the second row, where a person is misjudged as an airplane, everything else is fine.


6. Conclusion—Conclusion

translate

We introduce YOLO - a unified model for object detection. Our model is simple in construction and can be directly trained on full images. Unlike classifier-based methods, YOLO is trained with a loss function that directly corresponds to detection performance, and the entire model is trained together.

Fast YOLO is the fastest general-purpose object detector in the literature, and YOLO advances the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains, making it ideal for fast, robust applications.

intensive reading

 What exactly is YOLO?

  • In the eyes of YOLO, object detection is a regression problem
  • Feed the picture in once, and it outputs the bounding boxes (bbox) and classification probabilities
  • Simply put, you only need to look once to know the category and position of the objects in the picture

Summary of the YOLO process:

Training phase:

First divide the image into S × S grid cells, then send it to the CNN to generate the S × S × (B × 5 + C) output, and finally compute the loss from that output and backpropagate with gradient descent.

Prediction and verification stage:

First divide the image into an S × S grid (grid cells), then send it to the CNN to generate the S × S × (B × 5 + C) output, and finally use NMS to select the appropriate boxes (a combined sketch follows below).
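Putting the prediction stage together, here is a hedged end-to-end sketch. The `model`, `decode`, and `nms_per_class` names refer to the illustrative helpers sketched earlier in these notes (they are not part of the official code), and the 0.25 score threshold echoes the one mentioned in the Section 2.1 notes.

```python
import numpy as np

def detect(image, model, S=7, B=2, C=20, score_threshold=0.25, iou_threshold=0.5):
    """End-to-end inference sketch: one forward pass, decode, per-class NMS.
    `decode` is assumed to turn the raw S x S x (B*5 + C) tensor into
    corner-format boxes plus a (98, C) array of class-specific scores."""
    output = model(image)                               # (S, S, B*5 + C) tensor
    boxes, scores = decode(output, S, B, C)             # 98 boxes, (98, C) scores
    detections = []
    for c in range(C):
        # zero out low scores for this class, then suppress duplicates
        cls_scores = np.where(scores[:, c] >= score_threshold, scores[:, c], 0.0)
        for i in nms_per_class(boxes, cls_scores, iou_threshold):
            detections.append((c, float(cls_scores[i]), boxes[i]))
    return detections
```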


This is the end of this article, see you at YOLOv2~ 

Origin blog.csdn.net/weixin_43334693/article/details/129011644