Intensive Reading of YOLO Series Papers

Basic idea: use a set of predefined candidate regions to roughly cover the whole image, pick coarse candidates, and then use RCNN-style bounding box regression to adjust them closer to the true bounding box (one-stage)

Basic structure: [convolution + pooling] + [dense + dense]; the last layer uses a linear activation function (predicting bounding box coordinates requires numerical outputs)

YOLO and RCNN:

  • RCNN needs two steps: classification and object localization (regression) [RCNN traverses the image to generate ~2000 candidate boxes, classifies them, and then refines the bounding boxes after classification]

  • YOLO does it in one step: object localization and classification

YOLOV1

The core idea: take the entire image as the network input and regress the bounding box positions and categories directly at the output layer.

  • Paper: https://arxiv.org/abs/1506.02640
  • Code: https://github.com/pjreddie/darknet

Pros: YOLO sees the entire image during training and testing, implicitly encodes contextual information about classes and their appearance, and produces less than half the number of background errors compared to Fast R-CNN.

Cons: Difficult to precisely locate small targets.

1) Implementation

1. Divide the input image into an S*S grid. If the center of an object falls into a grid cell, that cell is responsible for detecting the object. Each cell predicts B bounding boxes and their confidence scores (computed via IoU).


The grid predictions are stored as an S x S x (B*5 + C) tensor: each grid cell predicts B bounding boxes, each bounding box regresses its own position and predicts a confidence, and the confidence reflects both whether the predicted box contains an object and how accurate the box is.

Concretely, confidence = Pr(Object) x IoU: the first term is 1 if an object falls in the grid cell and 0 otherwise, and the second term is the IoU between the predicted bounding box and the ground truth.

That is, each bounding box predicts 5 values in total, (x, y, w, h) and confidence, and each grid cell additionally predicts C class probabilities. With SxS grid cells, each predicting B bounding boxes and C class probabilities, the output is an S x S x (B*5 + C) tensor.

2. Network structure:


The class information predicted by each grid cell is multiplied by the confidence predicted by each bounding box to obtain the class-specific confidence score of that box. After obtaining these scores, set a threshold, filter out boxes with low scores, and apply NMS to the remaining boxes to obtain the final detection result.

Pr(Class_i | Object) x Pr(Object) x IOU = Pr(Class_i) x IOU

The first term on the left side of the equation is the class probability predicted by each grid cell, and the second and third terms together are the confidence predicted by each bounding box. The product therefore encodes both the probability that the predicted box belongs to a given class and how accurate the box is.
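A minimal sketch of this scoring step (the function name and the toy shapes are illustrative, not the paper's code):

```python
import numpy as np

def class_specific_scores(class_probs, box_conf, score_thresh=0.2):
    """Combine grid class probabilities with box confidences and threshold.

    class_probs: (S, S, C)   Pr(Class_i | Object) per grid cell
    box_conf:    (S, S, B)   Pr(Object) * IOU per predicted box
    returns:     (S, S, B, C) scores with low values zeroed out
    """
    # score[i, j, b, c] = Pr(Class_c | Object) * Pr(Object) * IOU
    scores = class_probs[:, :, None, :] * box_conf[:, :, :, None]
    scores[scores < score_thresh] = 0.0   # filter low-scoring boxes before NMS
    return scores

# Toy usage with YOLOv1 settings: S=7, B=2, C=20
scores = class_specific_scores(np.random.rand(7, 7, 20), np.random.rand(7, 7, 2))
print(scores.shape)  # (7, 7, 2, 20)
```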

3. Training parameters

Network input 448 x 448 (the output layer is a fully connected layer, so the input during detection must have the same resolution as the training images)

The network outputs a 7x7x30 tensor. For each bounding box the first five values are x, y, w, h and c: the center coordinates x, y of the bbox, its width and height w, h, and its confidence.

7x7x30 = 7x7x5 + 7x7x5 + 7x7x20

  • The two 7x7x5 blocks are the data of the two bboxes of each cell; one 7x7x5 block contains the x, y, w, h, c of one bbox, each occupying one channel
  • The 7x7x20 block is each cell's probability over the 20 classes
  • (x, y) is the center of the bbox, relative to its grid cell
  • (w, h) are the width and height of the bbox as a ratio of the whole image

The paper states: the last layer predicts class probabilities and bounding box coordinates. The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1. The bounding box x and y coordinates are parameterized as offsets from the grid cell location, so they are also bounded between 0 and 1.
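A sketch of how such normalized predictions could be decoded back to absolute pixel coordinates (a hypothetical helper, assuming the (x, y) cell offsets and (w, h) image ratios described above):

```python
def decode_box(pred, row, col, S=7, img_w=448, img_h=448):
    """pred = (x, y, w, h) with x, y in [0, 1] relative to the grid cell,
    and w, h in [0, 1] relative to the whole image."""
    x, y, w, h = pred
    cx = (col + x) / S * img_w        # absolute center x
    cy = (row + y) / S * img_h        # absolute center y
    bw, bh = w * img_w, h * img_h     # absolute width / height
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2  # (x1, y1, x2, y2)

print(decode_box((0.5, 0.5, 0.2, 0.3), row=3, col=3))  # box centered in the middle cell
```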

Loss function: the loss proposed by the authors penalizes classification error only when an object is present in the grid cell, and penalizes the coordinate error of a box only when that box predictor is responsible for a ground truth box; a predictor is responsible when its prediction has the highest IoU with the ground truth box among all boxes of that cell.

Note: each grid cell predicts multiple boxes, and the hope is that each box predictor specializes in predicting one kind of object. Concretely, whichever predicted box has the larger IoU with the ground truth box is the one made responsible for it.

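For reference, the sum-squared-error loss from the YOLOv1 paper (with λ_coord = 5 and λ_noobj = 0.5):

$$
\begin{aligned}
\lambda_{\text{coord}} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
+\ \lambda_{\text{coord}} &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
+\ &\sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2 \ +\ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
+\ &\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\in\text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

Here $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when the j-th box of cell i is responsible for an object; the square roots on w and h make a given error count more for small boxes than for large ones.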

2) Detailed interpretation

Because understanding YOLOv1 underpins understanding the whole YOLO series, I spent a lot of effort reading the paper repeatedly and consulting other material, and wrote down all the points I think are important.


Summary

  • YOLO imposes strong spatial constraints on bounding box prediction: each grid cell can only predict two boxes and only one class (this is improved in later versions). This spatial constraint limits the number of nearby objects the model can predict, so it does not work well for objects that are close together or for small groups.
  • Generalization is weak when objects of a known class appear with new or uncommon aspect ratios or configurations.
  • The loss function treats errors in small bounding boxes the same as in large ones. A small error in a large box is usually benign, but the same error in a small box has a much larger effect on IoU. Localization error is the main factor limiting detection performance.

YOLOV2/9000

  • Paper address: https://arxiv.org/abs/1612.08242

Core idea: the paper proposes YOLOv2 and YOLO9000. On the basis of YOLOv1, YOLOv2 improves in three respects: Better, Faster, and Stronger.

1) Better:

Purpose: To improve recall and localization while maintaining classification accuracy (Thus we focus mainly on improving recall and localization while maintaining classification accuracy)

Solution:

  • **Batch normalization**: add batch normalization to all convolutional layers. It regularizes the model (so dropout can be removed without overfitting — YOLOv2 no longer uses dropout), helps with vanishing and exploding gradients during backpropagation, and reduces sensitivity to some hyperparameters (learning rate, parameter scale, choice of activation). Effect: mAP increases by more than 2%, with better convergence speed and quality.

  • **High-resolution classifier**: YOLOv2 first pre-trains the classification model on 224x224 images, then fine-tunes it on 448x448 high-resolution samples for 10 epochs so the features gradually adapt to 448x448 resolution, and only then trains on 448x448 detection samples. This softens the impact of a sudden resolution switch. Effect: mAP increases by almost 4%.

  • **Anchor boxes**: the bounding box predictions of YOLOv1 start completely from scratch; v2 introduces anchor boxes, presetting in each grid cell a set of boxes with different sizes and aspect ratios. Effect: mAP goes from 69.5 to 69.2 while recall goes from 81% to 88%, a large improvement in recall.

  • **Dimension clusters**: run k-means clustering on the bounding boxes of the training set to automatically find good priors (a k-means sketch follows this list). Effect: priors that better represent the data and make the task easier to learn.

  • **Direct location prediction**: the anchor-box formulation can make the model unstable, and a randomly initialized RPN takes a long time to stabilize into predicting reasonable offsets. YOLOv2 therefore keeps YOLOv1's scheme of predicting location coordinates relative to the grid cell. Effect: compared with the anchor-box version, using dimension clusters and directly predicting the bounding box center improves YOLO by nearly 5%.

  • **Fine-grained features**: in YOLOv2 the 416x416 input is downsampled by the convolutional network to a final 13x13 output, so the features of smaller objects may be weak or even lost. To better detect relatively small objects, the final feature map needs to retain more detail, so YOLOv2 introduces a passthrough layer that preserves some fine-grained features. Effect: a slight performance improvement of about 1%.

  • **Multi-scale training**: YOLOv1's input size must be 448x448, while YOLOv2 is meant to run at different sizes. Since YOLOv2 only has convolution and pooling layers, the input size can be changed freely, so the network parameters are changed every few iterations: every 10 batches a new image size is chosen at random. Because the downsampling factor is 32, the sizes are multiples of 32 in {320, 352, ..., 608}, from 320x320 up to 608x608. The network resizes itself and continues training, learning to predict well across a variety of input dimensions: the same network can detect at different resolutions, running faster on small inputs and more accurately on large ones.
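A minimal sketch of the dimension-cluster idea, k-means with a 1 − IoU distance on (width, height) pairs (the function names are illustrative, not from the Darknet code):

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between boxes and centers assuming they share the same center point.
    boxes: (N, 2) widths/heights, centers: (k, 2)."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)  # d = 1 - IoU
        new_centers = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# Toy usage: cluster 1000 random (w, h) pairs into 5 priors
anchors = kmeans_anchors(np.random.rand(1000, 2), k=5)
print(anchors)
```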

2) Faster:

Purpose: detection should be accurate, but it should also be fast. (VGG-16 is powerful and accurate as a base feature extractor but cumbersome; YOLOv1 uses a network based on the GoogLeNet architecture that is faster than VGG-16 but slightly less accurate.)

Solution:

Darknet-19:

  • YOLOv2 uses 3x3 filters, doubling the number of channels after each pooling layer.
  • Global average pooling is used for prediction.
  • Batch Normalization makes training more stable, accelerates convergence, and regularizes the model.
  • The final Darknet-19 has 19 convolutional layers and 5 max-pooling layers. It needs only 5.58 billion operations to process an image, reaching 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.


Training for classification:

Train the network on the standard ImageNet 1000-class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9, using the Darknet neural network framework.

This achieved 76.5% top-1 accuracy and 93.3% top-5 accuracy.
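The polynomial decay schedule quoted above could look something like this (a sketch of one common form, applied per epoch for simplicity; the exact stepping in Darknet may differ):

```python
def poly_lr(epoch, base_lr=0.1, max_epochs=160, power=4):
    """Polynomial rate decay: lr falls from base_lr toward 0 over max_epochs."""
    return base_lr * (1 - epoch / max_epochs) ** power

for e in (0, 40, 80, 159):
    print(e, round(poly_lr(e), 5))
```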

Training for detection:

Remove the last convolutional layer and replace it with three 3×3 convolutional layers (1024 filters per layer), each followed by a 1×1 convolutional layer.

On VOC, each grid cell predicts 5 bounding boxes over 20 categories, so the detection layer has 5 * (5 + 20) = 125 filters.

Added a passthrough layer from the last 3×3×512 layer to the second to last convolutional layer
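A sketch of the passthrough idea as a space-to-depth reshape (shapes follow the 26x26x512 to 13x13x2048 description; the function name is illustrative and the exact channel ordering of Darknet's reorg layer may differ):

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride),
    so fine-grained spatial detail is preserved as extra channels."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // stride, w // stride, c * stride * stride)

print(passthrough(np.zeros((26, 26, 512))).shape)  # (13, 13, 2048)
```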

3) Stronger:

**Purpose:** propose a mechanism for joint training on classification and detection data: images labeled for detection are used to learn detection-specific information such as bounding box coordinate prediction, objectness, and how to classify common objects, while images with only class labels are used to expand the number of classes the detector can recognize.

Detection datasets use general categories and labels (cat, dog), while classification datasets use fine-grained labels (specific dog breeds). The labels are not mutually exclusive across datasets, so during joint training they need to be merged. The authors propose a multi-label model to combine the datasets without assuming mutual exclusion. This approach ignores the structure we do know about the data (e.g. all COCO classes are mutually exclusive).

Solution:

Hierarchical classification: WordTree

Construct a hierarchy of label synsets according to the hyponym relationships in the language database.

For example, ImageNet labels are drawn from WordNet, a language database that structures concepts and how they relate. In WordNet, "Norfolk terrier" and "Yorkshire terrier" are both hyponyms of "terrier", which is a type of "hunting dog", which is a type of "dog", which is a "canine", and so on. Most classification methods assume a flat label structure; for combining datasets, structure is exactly what we need.

The structure of WordNet is a directed graph, not a tree (for example, "dog" is both a type of "canine" and a type of "domestic animal", which are both synsets in WordNet). Instead of using the full graph structure, the problem is simplified by building a hierarchical tree from the concepts in ImageNet: WordNet is converted to a tree, using the length of the path to the root node as the distance. Many synsets have only one path through the graph, so all of those paths are added to the tree first. The remaining concepts are then added iteratively, choosing paths that grow the tree as little as possible (if a concept has two paths to the root and one would add three edges while the other adds only one, the shorter path is chosen).

The result is WordTree, a hierarchical model of visual concepts. To classify with WordTree, we predict the conditional probability of each node, i.e. the probability of each hyponym of a synset given that synset. To compute the absolute probability of a particular node, we simply follow the path from that node to the root, multiplying the conditional probabilities. Thus, if an image is labeled "Norfolk terrier", it is also labeled "dog" and "mammal". Performance degrades gracefully on new or unknown object classes: if the network sees a picture of a dog but is unsure of the breed, it still predicts "dog" with high confidence, with lower confidences spread among the hyponyms.
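A small sketch of that conditional-probability chain (the `parent` map and the probabilities are toy values, not the real WordTree):

```python
# Toy WordTree fragment: child -> parent
parent = {"Norfolk terrier": "terrier", "terrier": "hunting dog",
          "hunting dog": "dog", "dog": "physical object"}

# Conditional probability of each node given its parent (toy numbers)
cond_prob = {"Norfolk terrier": 0.6, "terrier": 0.8, "hunting dog": 0.9,
             "dog": 0.95, "physical object": 1.0}

def absolute_prob(node):
    """Multiply conditional probabilities along the path up to the root."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent.get(node)   # becomes None once we pass the root
    return p

print(absolute_prob("Norfolk terrier"))  # 0.6 * 0.8 * 0.9 * 0.95 * 1.0
```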

Dataset combination with WordTree:

WordTree lets us combine multiple datasets in a reasonable way: we simply map the categories in each dataset to synsets in the tree.

Joint classification and detection (YOLO9000):

Now that we can combine datasets using WordTree, we can train our joint model on both classification and detection. We wanted to train a very large-scale detector, so we created our combined dataset using the COCO detection dataset and the first 9000 classes from the full ImageNet release.

Using this dataset, YOLO9000 is trained. It uses the basic YOLOv2 architecture but only 3 priors instead of 5 to limit the output size. When the network sees a detection image, the loss is backpropagated as normal.

For classification loss: we only backpropagate the loss at or above the corresponding level of the label.

For example, if the label is "dog", we do not assign any error to predictions further down the tree ("German Shepherd" vs. "Golden Retriever"), since we don't have that information.

Summary

YOLOv2 is faster than other detection systems on a variety of detection datasets, and can run at various image sizes for a smooth trade-off between speed and accuracy.

YOLO9000 is a real-time framework that detects more than 9000 object categories by jointly optimizing detection and classification, using WordTree to combine data from different sources and a joint optimization technique to train on ImageNet and COCO simultaneously.

YOLO V3

The ideas behind YOLOv3 are wonderful!

  • **Feature extraction**: the backbone changes from Darknet-19 to Darknet-53 (with residual connections), and downsampling uses convolutions with stride 2 instead of pooling.

  • **Multi-scale prediction**: anchors of three different shapes are chosen for each scale, and an FPN is used to predict at three scales simultaneously (each scale predicts 3 anchors). Effect: fixes the problem of small targets going undetected.

  • **Class prediction**: the softmax layer used for classification is replaced by independent logistic classifiers, turning each class into a binary prediction and supporting multi-label classification.

  • **Loss function**: the confidence loss and class loss change from sum-of-squared-error to cross-entropy.

In v3, for each ground truth only the anchor with the largest IoU is selected to predict the position, and the grid cell corresponding to that anchor is responsible for predicting the object (positive sample). Anchors whose IoU exceeds the threshold but is not the largest are ignored; the remaining anchors participate only in the confidence (objectness) prediction as negative samples.

YOLOv3 has no pooling layers or fully connected layers. During forward propagation, tensor size changes are achieved by changing the stride of the convolution kernels; for example, stride=(2, 2) halves the image side length (i.e. the area shrinks to 1/4 of the original).


YOLO V4

The paper first surveys common detection models, then discusses improvements in terms of training and inference tricks, grouped into Bag of Freebies and Bag of Specials.

1) Bag of freebies:

The purpose of this part is to enable the object detector to achieve better accuracy without increasing the cost of inference

There are three directions:

1. Data augmentation

2. Handling semantic distribution bias, in particular the data imbalance between different categories

3. The objective function for BBox regression

The GIoU loss considers the shape and orientation of the object in addition to the coverage area: it finds the smallest box that encloses both the predicted BBox and the ground truth BBox, and uses its area as the denominator in place of the one used in the IoU loss. DIoU loss additionally considers the distance between object centers, while CIoU loss simultaneously considers the overlap area, the center point distance, and the aspect ratio. On the BBox regression problem, CIoU achieves better convergence speed and accuracy.
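A sketch of the IoU family for axis-aligned boxes in (x1, y1, x2, y2) form (GIoU and DIoU only; CIoU adds an aspect-ratio term on top of DIoU, and the corresponding losses are 1 − GIoU and 1 − DIoU):

```python
def giou_diou(box_a, box_b):
    """Return (IoU, GIoU, DIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection and union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest enclosing box (the extra denominator used by GIoU)
    cx1, cy1, cx2, cy2 = min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    # DIoU: also penalize the normalized distance between box centers
    center_dist2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    c_diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - center_dist2 / c_diag2
    return iou, giou, diou

print(giou_diou((0, 0, 4, 4), (2, 2, 6, 6)))
```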

2) Bag of specials:

Plug-in modules and post-processing methods that increase the inference cost only slightly but can significantly improve object detection accuracy.

1. Modules that enlarge the receptive field: SPP, ASPP, and RFB

2. Feature integration

3. Activation functions

4. Post-processing

YOLOv4 model selection: CSPDarknet53 as the backbone network, an SPP additional module, a PANet path-aggregation neck, and the YOLOv3 (anchor-based) head.

  • Mosaic data augmentation: a data augmentation method that mixes 4 training images

Four images are randomly scaled, randomly cropped, and randomly arranged, then stitched together. This enriches the dataset, and the random scaling adds many small objects.
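A rough sketch of the stitching step (pure NumPy; the random scaling and bounding-box adjustment that real mosaic augmentation also performs are omitted, and all names here are illustrative):

```python
import numpy as np

def mosaic(imgs, out_size=416, seed=0):
    """Stitch 4 images (H, W, 3 uint8 arrays) into one out_size x out_size canvas.
    Each image is randomly cropped to fill one quadrant around a random center."""
    rng = np.random.default_rng(seed)
    cx = rng.integers(out_size // 4, 3 * out_size // 4)  # random split point
    cy = rng.integers(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),                 # top-left, top-right
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]   # bottom-left, bottom-right
    for img, (y1, y2, x1, x2) in zip(imgs, regions):
        h, w = y2 - y1, x2 - x1
        ys = rng.integers(0, img.shape[0] - h + 1)   # random crop inside the source image
        xs = rng.integers(0, img.shape[1] - w + 1)
        canvas[y1:y2, x1:x2] = img[ys:ys + h, xs:xs + w]
    return canvas

# Toy usage with 4 random "images"
imgs = [np.random.randint(0, 255, (448, 448, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic(imgs).shape)  # (416, 416, 3)
```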


  • Backbone network: CSPDarknet53

On top of the Darknet-53 used by YOLOv3, CSPNet is integrated. The convolution in front of each CSP module has a 3×3 kernel with stride 2 (downsampling). The CSP module first splits the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy, which reduces computation while maintaining accuracy.

  • Activation function: Mish

The backbone of v4 uses Mish; the other parts continue to use Leaky ReLU.


  • Neck: SPP, FPN+PAN

    • SPP: apply max pooling with k = {1×1, 5×5, 9×9, 13×13} and then concatenate the feature maps of the different scales (a small sketch follows this list)

    • Using the SPP module increases the receptive field of the backbone features more effectively than a single k×k max pooling, and it significantly separates out the most important contextual features

    • FPN+PAN: on the basis of FPN, a bottom-up path from the lower layers to the upper layers is added

    • The FPN layers convey strong semantic features top-down, while the PAN feature pyramid conveys strong localization features bottom-up; the different detection layers then aggregate features from both paths
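A sketch of the SPP block described above, written with PyTorch (the original is implemented in Darknet, so this is only to show the shapes): fixed-size max pooling with stride 1 and padding keeps the spatial size, and only the channels grow.

```python
import torch
import torch.nn.functional as F

def spp(x, kernel_sizes=(5, 9, 13)):
    """YOLOv4-style SPP: concatenate the input with max-pooled copies of itself.
    Stride 1 and padding k//2 preserve the spatial size; the 1x1 branch is the identity."""
    pooled = [F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
    return torch.cat([x] + pooled, dim=1)

x = torch.randn(1, 512, 13, 13)
print(spp(x).shape)  # torch.Size([1, 2048, 13, 13])
```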


  • DropBlock regularization: drops contiguous local regions of the feature map instead of individual activations


  • Cross mini-Batch Normalization (CmBN)

  • DIoU NMS: NMS filters out BBoxes that predict the same target poorly, keeping only the candidate BBoxes with higher responses. The original NMS does not consider contextual information. Soft-NMS addresses the fact that occlusion can cause the confidence score to drop under greedy NMS based on IoU. DIoU NMS builds on soft-NMS by adding center-point distance information to the BBox screening process.


Supplement: NMS non-maximum suppression

NMS: directly discards low-scoring boxes, regardless of overlap
Soft-NMS: considers both score and overlap, useful in scenes with many dense, similar, overlapping objects

NMS (Non-Maximum Suppression):

In a nutshell, NMS removes duplicate boxes in detection results: in a given area, only the highest-scoring box of each category is kept.


Implementation:

Non-maximum suppression is performed as follows:

1. Loop over all images.

2. Find the boxes in the image whose scores exceed the threshold. Filtering by score before filtering overlapping boxes greatly reduces the number of boxes to process.

3. Determine the class and score of the boxes obtained in step 2, take the box positions out of the prediction result, and stack them together. The content of the last dimension changes from 5 + num_classes to 4 + 1 + 2: four values for the box position, one for whether the box contains an object, and two for the class confidence and class.

4. Loop over the classes. The role of non-maximum suppression is to keep, in a given area, only the highest-scoring box of each class, so looping over the classes lets us perform non-maximum suppression for each class separately.

5. Within each class, sort the boxes by score from highest to lowest.

6. Each time, take the box with the highest score and compute its overlap with all remaining predicted boxes; boxes whose overlap is too large are removed.
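A per-class greedy NMS sketch matching the steps above (pure NumPy, boxes in (x1, y1, x2, y2) form; the 0.5 IoU threshold is an illustrative choice):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for one class. boxes: (N, 4) as (x1, y1, x2, y2), scores: (N,).
    Returns the indices of the boxes that are kept."""
    order = scores.argsort()[::-1]            # step 5: sort by score, descending
    keep = []
    while order.size > 0:
        i = order[0]                          # step 6: take the highest-scoring box
        keep.append(i)
        # IoU of box i with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # keeps box 0 and box 2
```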

Soft-NMS (soft non-maximum suppression):

Soft-NMS holds that both the score and the degree of overlap should be considered during non-maximum suppression.

Implementation:

On the basis of NMS, Soft-NMS takes a Gaussian function of the computed IoU as a weight, multiplies it by the original score, re-sorts the boxes, and continues the loop.
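A sketch of the rescoring step that replaces the hard removal in the greedy loop above (Gaussian form, with sigma as a free parameter):

```python
import numpy as np

def soft_nms_rescore(scores, ious, sigma=0.5):
    """Instead of dropping overlapping boxes, decay their scores by a Gaussian of the IoU.
    scores, ious: arrays for the remaining boxes, relative to the current best box."""
    return scores * np.exp(-(ious ** 2) / sigma)

print(soft_nms_rescore(np.array([0.9, 0.8]), np.array([0.1, 0.7])))
```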

Finally: this article is just a backup of my own study notes. If any picture infringes copyright, message me privately and I will delete it.


Origin blog.csdn.net/qq_43842886/article/details/125305678