Object detection algorithm - YOLO [v1~v3]

Table of contents

0 Preface

1 YOLO v1

1.1 Network structure

1.2 Feature map

1.3 Loss function

1.4 Disadvantages of YOLO v1

2 YOLO v2

2.1 Improvement of network structure

2.2 Design of prior frame

2.3 Loss function

2.4 Training techniques

2.5 Features

3 YOLO v3

3.1 YOLO v3 network structure

3.2 Multi-scale prediction

3.3 Features


0 Preface

Classic two-stage detectors such as Faster RCNN first generate regions of interest and then perform fine-grained classification and regression. This design achieves excellent detection accuracy, but it also limits speed.

The YOLO v1 algorithm treats detection as a regression problem and uses a one-stage network to perform classification and localization directly, which makes it extremely fast.

YOLO v2 and v3 have further improved detection accuracy and speed.

1 YOLO v1

YOLO v1 (no anchor boxes) was introduced in 2015. It uses a one-stage structure to complete the object detection task, directly predicting the category and location of each object. It has no RPN and no pre-defined boxes similar to anchor boxes, so it is very fast.

1.1 Network structure

The network structure of YOLO v1 is shown in the figure and is similar to the GoogLeNet model. A convolutional neural network first extracts features from an input image whose size is fixed at 448×448. After 24 convolutional layers and two fully connected layers, the final output feature map has size 7×7×30.

Note:

(1) A 3×3 convolution is usually followed by a 1×1 convolution with fewer channels. This reduces the amount of computation and also increases the nonlinearity of the model.

(2) The last layer uses a linear activation function; all other layers use Leaky ReLU.

(3) Dropout and data augmentation are used during training to prevent overfitting.

1.2 Feature map

The essence of the YOLO v1 network structure lies in the final 7×7×30 feature map.

YOLO v1 divides the input image into 7×7 grid cells. Each cell corresponds to one point on the final feature map, and that point has 30 channels, representing 30 predicted values.

YOLO v1 predicts two bounding boxes in each cell. These boxes differ in size and position, and together they can basically cover the objects that may appear anywhere in the image.

Detection principle: If the center point of an object falls within a cell, that cell is responsible for detecting the object. Specifically, the two boxes of the cell are matched against the ground-truth box, and the box with the larger IoU is responsible for regressing the ground-truth box.

Note:

(1) YOLO v1 has no prior boxes. It directly predicts the size and position of the boxes in each cell, which is a regression problem. Detection succeeds because the cell itself carries some position information and the scale of the detected objects lies within a regressable range.

(2) YOLO v1 predicts the object category separately from the confidence: the 20 class probabilities are predicted once per cell, while each of the two boxes has its own 4 position values and 1 confidence (2×5 + 20 = 30).

(3) During training, the box with the larger IoU with the object is selected; during testing, the box with the higher confidence (the probability that the cell contains an object) is selected.
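The layout of one 30-value cell can be made concrete with a small sketch. The channel order used here (two boxes first, then the 20 class probabilities) is an assumption for illustration; real implementations may order the channels differently. Scoring the kept box by confidence times class probability follows the YOLO v1 paper, and the function names are hypothetical.

```python
# Assumed channel layout for one grid cell (illustrative only):
# [x1, y1, w1, h1, c1, x2, y2, w2, h2, c2, p_0 ... p_19]
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def split_cell(cell):
    """Split one 30-value cell vector into its boxes and class probabilities."""
    assert len(cell) == B * 5 + C
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(B)]
    class_probs = list(cell[5 * B:])
    return boxes, class_probs


def best_detection(cell):
    """At test time, keep the box with the higher confidence and score it
    by confidence times the largest class probability."""
    boxes, class_probs = split_cell(cell)
    x, y, w, h, conf = max(boxes, key=lambda box: box[4])
    class_id = max(range(C), key=lambda k: class_probs[k])
    return (x, y, w, h), class_id, conf * class_probs[class_id]


# A made-up cell: box 1 is confident, class 7 is the most likely class.
cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: x, y, w, h, confidence
        0.4, 0.6, 0.1, 0.1, 0.2]   # box 2: x, y, w, h, confidence
cell += [0.0] * C
cell[5 * B + 7] = 0.8
box, class_id, score = best_detection(cell)
```

The whole output therefore holds S × S × (B × 5 + C) = 7 × 7 × 30 = 1470 values.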

1.3 Loss function

The loss of YOLO v1 is a sum of squared errors made up of five terms.

In the loss, i indexes the grid cells, of which there are S² in total (49 here); j indexes the predicted boxes within a cell, of which there are B in total (2 here); obj marks a predicted box that is matched to a real object; noobj marks a box that is not matched to any real object.

The meaning of each term:

The first term is the loss on the center point coordinates of the positive samples. λcoord adjusts the weight of the position loss. YOLO v1 sets λcoord to 5, which increases the weight of the position loss.

The second term is the loss on the width and height of the positive samples. Because the width and height differences depend on the scale of the object, the square roots of the width and height are used instead. This reduces the sensitivity to scale to some extent and strengthens the loss weight of small objects.

The third and fourth terms are the confidence losses of the positive and negative samples respectively. The target confidence is 1 for positive samples and 0 for negative samples. λnoobj defaults to 0.5 in order to reduce the weight of the negative sample confidence loss.

The fifth term is the category loss of the positive samples.
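For reference, the five terms described above can be written out as follows. This is a reconstruction in the notation of the YOLO v1 paper, not a copy of the formula image from the original post; hatted symbols denote the targets, C is the box confidence and p(c) is the class probability.

```latex
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj}
       \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
            + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

Note that in the paper λcoord multiplies both the center point term and the width and height term.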

1.4 Disadvantages of YOLO v1

(1) The model performs poorly on small objects and on objects that are very close together.

(2) Without anchor boxes, detection is poor for objects with new or uncommon aspect ratios.

(3) Because of the large downsampling rate, the bounding boxes are not localized precisely.

(4) In the loss function, large objects and small objects have the same position loss weight, which leads to inaccurate localization.

2 YOLO v2

YOLO v2 (which relies on anchor boxes) was introduced in 2016. It improves the network structure, the design of the prior boxes and the training techniques, making prediction more accurate and faster while recognizing more object categories.

2.1 Improvement of network structure

YOLO v2 proposes a brand new network structure, DarkNet. The original DarkNet has 19 convolutional layers and 5 pooling layers; after adding a Passthrough layer, it has 22 convolutional layers in total. Its accuracy is comparable to that of VGGNet, but it needs only about 1/5 of VGGNet's floating-point operations, so it is extremely fast. The network structure is shown in the figure.

Compared with the backbone of v1, DarkNet makes the following improvements:

(1) BN layer: DarkNet uses BN (batch normalization) layers, which bring a performance improvement of more than 2%. BN helps with vanishing and exploding gradients in backpropagation, accelerates convergence, and also acts as a regularizer. Each BN layer is placed after a convolution and before the Leaky ReLU activation.

(2) The 7×7 convolution of v1 is replaced by consecutive 3×3 convolutions, which reduces the amount of computation and increases the depth of the network. In addition, DarkNet removes the fully connected layers and the Dropout layer.

(3) Passthrough layer: DarkNet also fuses deep and shallow features. The shallow 26×26×512 feature map is rearranged into 13×13×2048 so that it can be concatenated along the channel dimension with the deep 13×13×1024 feature map. This feature fusion helps the detection of small objects and brings a 1% performance improvement to the model.

(4) YOLO v2 predicts 5 bounding boxes in each cell, and each box has 25 predicted values, so the final output feature map has 5×25 = 125 channels. The 25 values of a box consist of 20 class predictions, 4 position predictions and 1 confidence prediction.
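The Passthrough rearrangement in item (3) can be sketched on a toy feature map stored as nested lists (height × width × channels). This is a minimal illustration rather than the DarkNet implementation: the order in which the four neighbouring pixels are stacked into channels is an assumption, and the function names are hypothetical.

```python
def passthrough(feature, block=2):
    """Rearrange an H x W x C map into (H/block) x (W/block) x (C*block*block)
    by stacking each block x block neighbourhood into the channel dimension."""
    height, width = len(feature), len(feature[0])
    assert height % block == 0 and width % block == 0
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            channels = []
            for di in range(block):
                for dj in range(block):
                    channels.extend(feature[i + di][j + dj])
            row.append(channels)
        out.append(row)
    return out


def concat_channels(a, b):
    """Concatenate two maps with the same height and width along channels."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy example: a 4x4x1 "shallow" map and a 2x2x3 "deep" map.
shallow = [[[r * 4 + c] for c in range(4)] for r in range(4)]
deep = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]
fused = concat_channels(passthrough(shallow), deep)  # 2x2x(4+3)
```

At full size the same rearrangement turns 26×26×512 into 13×13×2048 (512 × 4 = 2048), and concatenating with 13×13×1024 gives 3072 channels.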

2.2 Design of prior frame

YOLO v2 borrows from Faster RCNN and sets a certain number of prior boxes. The model no longer needs to predict the scale and coordinates of an object directly; it only needs to predict the offset from a prior box to the real object, which makes prediction easier.

YOLO v2 uses a clustering algorithm to determine the scales of the prior boxes and improves the subsequent offset calculation. The prior box design improves the recall rate.

(1) Extracting prior box scales by clustering

YOLO v2 obtains the prior boxes by clustering the boxes of the training set. Only the number k of prior boxes has to be set; the clustering algorithm then finds the k most suitable boxes.
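A minimal sketch of such a clustering step is shown below. It assumes k-means with the distance 1 − IoU between box shapes, which is the choice described in the YOLO v2 paper; the toy box list, the function names and the mean-based centroid update are illustrative assumptions.

```python
import random


def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height) and aligned at one corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_priors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k prior boxes using 1 - IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(k), key=lambda c: iou_wh(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


# Toy training boxes: three small and three large (width, height) pairs.
boxes = [(10, 12), (11, 13), (12, 11), (100, 90), (110, 95), (95, 105)]
priors = kmeans_priors(boxes, k=2)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering error.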

(2) Optimize the offset formula

With prior boxes, YOLO v2 no longer predicts the position coordinates of a box directly; it predicts the offset between the prior box and the real object.

In the figure, the solid box is the predicted box and the dashed box is the prior box.

pw and ph are the width and height of the current prior box; cx and cy are the coordinates of the upper-left corner of the cell that contains the center point; σ(tx) and σ(ty) are the distances between the center point of the predicted box and the upper-left corner of that cell; bx and by are the center coordinates of the predicted box; tw and th are the predicted width and height offsets.

σ is the Sigmoid function. It squashes the coordinate offsets into the interval (0, 1), so the predicted center coordinates bx and by are confined to the current cell. This ensures that a cell only predicts objects whose centers fall inside it, which helps the model converge.
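In symbols, the relations described above are usually written as follows. This is a reconstruction from the definitions of the symbols and from the YOLO v2 paper, since the formula image of the original post is not available; bw and bh denote the width and height of the predicted box.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```

For example, with c_x = 3, t_x = 0 and p_w = 2, t_w = 0, the predicted center lies at b_x = 3.5 (the middle of the cell) and the predicted width equals the prior width, b_w = 2.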

2.3 Loss function

Due to the use of the prior box, the loss function of YOLO v2 has also been changed accordingly.

2.4 Training techniques

(1) Multi-scale training

Because the fully connected layers are removed, YOLO v2 can accept input images of any size. To make the model robust to objects of different scales, YOLO v2 is trained on images of various scales. The trained model can then adapt to a variety of scene requirements.

(2) Multi-stage training

a. Pre-train the DarkNet network on the ImageNet classification task with an image size of 224×224.

b. Enlarge the ImageNet images to 448×448 and continue training the classification task, so that the model first adapts to the change in scale.

c. Remove the classification convolutional layer, add a Passthrough layer and 3 convolutional layers to DarkNet, and train object detection with 448×448 input images.
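The multi-scale training in item (1) can be sketched as a schedule of input sizes. The concrete numbers used here (sizes from 320 to 608 in steps of 32, a new size every 10 batches) are taken from the YOLO v2 paper and do not appear in this article; the step of 32 matches the overall downsampling rate of the network. The function names are hypothetical.

```python
import random


def pick_input_size(rng, low=320, high=608, stride=32):
    """Pick a random square input size that is a multiple of the stride."""
    return rng.choice(range(low, high + 1, stride))


def size_schedule(num_batches, every=10, seed=0):
    """Return the input size for each batch, re-drawn every `every` batches."""
    rng = random.Random(seed)
    sizes = []
    size = None
    for batch in range(num_batches):
        if batch % every == 0:
            size = pick_input_size(rng)
        sizes.append(size)
    return sizes


schedule = size_schedule(40)  # input sizes for 40 consecutive batches
```

In a real training loop, each batch would be resized to its scheduled size before being fed to the network.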

2.5 Features

(1) By using prior boxes and feature fusion together with a variety of training techniques, the model greatly improves detection accuracy while remaining extremely fast.

(2) The detection problem of small objects is not well solved.

3 YOLO v3

YOLO v3 (multi-scale prediction and feature fusion), released in 2018, improves the network structure, the network features and the subsequent calculations. While keeping the speed advantage, it further improves detection accuracy, especially for small objects. (Note: YOLO v3 is not as fast as the previous version; it pursues detection accuracy while still guaranteeing real-time performance.)

3.1 YOLO v3 network structure

YOLO v3 introduces ideas such as residual networks and feature fusion and proposes the DarkNet-53 network structure. The input image is 416×416×3.

The meaning of each module:

(1) DBL: The combination of three layers: convolution, BN and Leaky ReLU. In YOLO v3, every convolutional layer appears in this form, which makes DBL the basic unit of DarkNet.

(2) Res: The residual module.

(3) Upsampling: Upsampling is done by replicating elements (in the manner of unpooling), which enlarges the feature map and has no learnable parameters.

(4) Concat: After upsampling, the deep and shallow feature maps are concatenated along the channel dimension. This is similar to FPN, except that FPN uses element-wise addition.
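Items (3) and (4) can be illustrated on toy feature maps stored as nested lists (height × width × channels). This is a minimal sketch assuming 2× upsampling by element replication; the function names are hypothetical and the values are made up.

```python
def upsample2x(feature):
    """Enlarge an H x W x C map to 2H x 2W x C by copying each element
    into a 2 x 2 block. No parameters are learned."""
    out = []
    for row in feature:
        wide = []
        for pixel in row:
            wide.extend([list(pixel), list(pixel)])
        out.append(wide)
        out.append([list(pixel) for pixel in wide])
    return out


def concat_channels(a, b):
    """Concatenate two maps with the same height and width along channels."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


deep = [[[1], [2]],
        [[3], [4]]]                                        # 2x2x1 deep map
shallow = [[[0, 0] for _ in range(4)] for _ in range(4)]   # 4x4x2 shallow map
fused = concat_channels(upsample2x(deep), shallow)         # 4x4x3
```

With Concat the channel count grows (1 + 2 = 3 here), whereas an element-wise addition as in FPN would require both maps to have the same number of channels.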

Features of the DarkNet-53 structure:

(1) Residual idea: DarkNet-53 borrows the residual idea of ResNet and uses a large number of residual connections in the backbone. The network can therefore be made very deep while the vanishing gradient problem in training is alleviated, making the model easier to converge.

(2) Multi-layer feature maps: Through upsampling and Concat operations, deep and shallow features are fused, and feature maps of three sizes are output for subsequent prediction. Multi-layer feature maps benefit the detection of multi-scale objects and small objects.

(3) No pooling layers: The previous YOLO network used 5 max pooling layers to shrink the feature map, for a downsampling rate of 32. DarkNet-53 uses no pooling; instead, convolutions with a stride of 2 reduce the size. There are again 5 downsampling steps, so the overall downsampling rate is still 32.

3.2 Multi-scale prediction

(1) YOLO v3 outputs 3 feature maps of different sizes, corresponding to the deep, middle and shallow features from top to bottom. The deep feature map is small and has a large receptive field, which suits the detection of large objects. The shallow feature map is better suited to small objects. This is similar to the FPN structure.

(2) YOLO v3 continues to use prior boxes (anchors) and obtains 9 prior boxes of different widths and heights with a clustering algorithm. The 9 prior boxes are divided among the three feature maps, 3 per map: the largest prior boxes go to the deep feature map and the smallest to the shallow feature map.

(3) YOLO v3 uses the COCO dataset by default, which has 80 object categories. Each prior box therefore needs 80 class predictions, 4 position predictions and 1 confidence prediction. With 3 prior boxes per feature map, each feature map needs 3×(80+5) = 255 prediction channels.
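The arithmetic above can be checked with a short sketch that lists the three prediction heads for a 416×416 input. The strides 32, 16 and 8 follow from the downsampling described earlier. The nine prior box sizes are the COCO values listed in the YOLO v3 paper and are not given in this article; the function and field names are hypothetical.

```python
# Prior boxes (width, height) in pixels, as listed in the YOLO v3 paper for COCO.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]


def yolo_v3_heads(input_size=416, num_classes=80, boxes_per_scale=3):
    """Describe the three output feature maps from deep to shallow."""
    channels = boxes_per_scale * (num_classes + 5)  # 4 position values + 1 confidence
    heads = []
    for index, stride in enumerate((32, 16, 8)):
        grid = input_size // stride
        # The largest priors go to the coarsest (deep) grid.
        start = len(ANCHORS) - (index + 1) * boxes_per_scale
        heads.append({
            "stride": stride,
            "shape": (grid, grid, channels),
            "priors": ANCHORS[start:start + boxes_per_scale],
        })
    return heads


heads = yolo_v3_heads()
```

For the default arguments the three output shapes are 13×13×255, 26×26×255 and 52×52×255.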

3.3 Features

(1) The class scores produced by a Softmax function suppress one another, so only one category can be predicted, whereas Logistic classifiers are independent of each other and allow multi-category prediction. YOLO v3 therefore uses the Logistic function instead of Softmax to process the predicted class scores.

(2) Replacing Softmax with multiple independent Logistic classifiers enables multi-label classification of objects without reducing accuracy.

(3) Advantages: Speed is the most important characteristic of the YOLO series. The YOLO series is also very versatile, and because positive samples are generated under strict conditions, the false detection rate on background is low.

(4) Disadvantages: Localization accuracy is poor and the recall rate is not high. In difficult situations such as occlusion and crowding in particular, high accuracy is hard to achieve.
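The difference between Softmax and independent Logistic classifiers described in items (1) and (2) can be seen in a small numeric sketch. The raw scores, the class names in the comment and the 0.5 threshold are made-up assumptions for illustration.

```python
import math


def softmax(scores):
    """Softmax over a list of raw scores; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Logistic function applied to a single raw score."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores for the classes "person", "woman" and "car".
scores = [3.0, 2.5, -4.0]
soft = softmax(scores)                       # classes compete for one unit of probability
independent = [sigmoid(s) for s in scores]   # each class is judged on its own
labels = [i for i, p in enumerate(independent) if p > 0.5]
```

With Softmax the second class is pushed below 0.5 by the first one, while the independent Logistic outputs keep both overlapping labels.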

References:

1. "PyTorch Object Detection in Deep Learning"

2. J .., "YOLOV3 network structure study notes", CSDN blog

3. "YOLOv3: An Incremental Improvement" (the original YOLO v3 paper)

Note:

This article is a summary compiled while studying the references listed above and is intended only as a learning record. If anything here infringes your rights, please contact the author to have it removed. Corrections and discussion are welcome.
