
Adaptive anchor box calculation

An anchor is a set of preset boxes. During training, the training samples are constructed from the offsets of the ground-truth box positions relative to these preset boxes (that is, the labels we assign).

This is equivalent to roughly "framing" targets at their possible positions with the preset boxes, and then making adjustments on the basis of those preset boxes.

An anchor box can be defined by two things: the aspect ratio of the box and its area (scale). This amounts to a set of rules for generating preset boxes, from which a series of boxes can be generated at any position in the image.

Since anchor boxes are usually generated centered on points of the feature map extracted by the CNN, an anchor box does not need to specify a center position.

Faster R-CNN defines three aspect ratios, ratio = [0.5, 1, 2], and three scales, scale = [8, 16, 32], which can be combined into 9 boxes of different shapes and sizes.

In other words, ratio defines the aspect ratio of the box, and scale defines its area.
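
As a quick sketch of the arithmetic (writing $b$ for the base cell side, here 16, $s$ for the scale, and $r = h/w$ for the aspect ratio, all my own labels), fixing the area at $(bs)^2$ gives

$$ w = \frac{bs}{\sqrt{r}}, \qquad h = bs\sqrt{r}, \qquad \text{so that } wh = (bs)^2 \ \text{and}\ h/w = r. $$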

Anchor boxes are generated centered on the points of the feature map produced by the CNN (mapped back to coordinates on the original image). Take Faster R-CNN as an example: the VGG network downsamples the input image 16 times, so one point on the feature map corresponds to a 16×16 square region (receptive field) on the input image. According to the predefined anchors, one point on the feature map generates 9 boxes of different shapes and sizes on the original image, as shown in the following figure:

[figure: the 9 anchor boxes generated at one point of the feature map]

The figure above also shows why anchors are needed. By the CNN's receptive field, one point on the feature map corresponds to a 16×16 square region of the original image. Using only that region as the bounding box would give very poor localization accuracy, and it might not "frame" the target at all. With anchors, one point on the feature map generates 9 boxes of different shapes and sizes, so the probability of "boxing" the target becomes much larger, which greatly improves detection recall; the subsequent network then adjusts these boxes, so accuracy can also be greatly improved.

The feature map on which Faster R-CNN generates anchor boxes is obtained by downsampling the original image 16 times. The different aspect ratios thus stretch the 16×16 region into different shapes, as shown in the following figure:

[figure: a 16×16 region stretched to aspect ratios 0.5, 1, and 2]

In other words, the boxes share the same center point and the same area; only the shape changes.

Boxes generated with different ratios at the same scale all have the same area. The three scales then enlarge or shrink that 16×16 region: 128×128 is 16×16 magnified 8 times, 256×256 is magnified 16 times, and 512×512 is magnified 32 times, as shown below:

[figure: the 16×16 region scaled to 128×128, 256×256, and 512×512]
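
A minimal sketch of this generation rule in Python (the function name and the corner-form output are my own choices; the original py-faster-rcnn code rounds slightly differently):

```python
import numpy as np

def generate_anchors(base=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Generate the 9 Faster R-CNN-style anchors around one center.

    Each anchor keeps the area (base * scale)^2 while the ratio h/w varies.
    Returns boxes as (x1, y1, x2, y2) centered at the origin.
    """
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:
            w = np.sqrt(area / r)   # w*h = area and h/w = r
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# Shifting to a feature-map point: point (i, j) with stride 16 maps to
# center (16*j + 8, 16*i + 8) on the input image.
anchors = generate_anchors()
print(anchors.shape)  # (9, 4)
```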

Anchor box calculation

In the YOLO algorithm, there are anchor boxes with preset initial widths and heights for different datasets.

During network training, the network outputs predicted boxes on top of these initial anchor boxes, compares them with the ground-truth boxes, computes the gap between the two, and then updates the network parameters by backpropagation.

[figure]

In YOLOv3 and YOLOv4, when training on different datasets, the initial anchor box values are computed by running a separate program.

But in YOLOv5 this function is embedded in the code: each time training starts, the best anchor box values for that training set are computed adaptively.

Of course, if you feel the automatically computed anchor boxes are not working well, you can turn off the adaptive anchor calculation in the code.

[figure]
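
A rough sketch of the idea behind such adaptive anchor calculation, using k-means with an IoU-style distance on the labeled box sizes (the function name is illustrative; YOLOv5's actual autoanchor additionally refines the clusters with a genetic algorithm and checks a best-possible-recall metric):

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box sizes (w, h) into k anchor shapes.

    Boxes are compared as if sharing a corner, and the distance is
    1 - IoU (the metric used by the YOLO papers), not Euclidean.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 centers[None, :, 0] * centers[None, :, 1] - inter)
        assign = (inter / union).argmax(axis=1)   # cluster with highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by area

# wh: an (N, 2) array of ground-truth (width, height) pairs from the labels
```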

Question 1: Why do I need an anchor box?

To understand why anchor boxes are needed, you first need to understand some earlier object detection methods.

1. Sliding window

This is a relatively primitive object detection method: given a fixed-size window, slide it from left to right and top to bottom in set steps, and feed each window into a convolutional neural network for prediction and classification (a toy sketch follows the list below). This has two disadvantages:

  • Because the window size is fixed, it is not suitable for objects with large shape variation
  • Many windows must be evaluated, which requires a large amount of computation
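
As a toy illustration of the second point (window size, stride, and image shape here are arbitrary choices of mine):

```python
def sliding_windows(image, win=64, stride=16):
    """Yield every fixed-size crop; each crop would be classified separately."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

# Even a 640x480 image gives ((640-64)//16 + 1) * ((480-64)//16 + 1)
# = 37 * 27 = 999 windows -- per window size, before any multi-scale search.
```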

2. Region proposals

[figure]

This is the core idea of the R-CNN series. Taking Faster R-CNN as an example, the model uses two neural networks: a CNN and an RPN (Region Proposal Network). The region proposal network is not responsible for image classification; it is only responsible for selecting candidate regions in the image that may contain one of the dataset's categories. The candidate regions generated by the RPN are then fed into the classification network for final classification.

3. Anchor box

The anchor box first appeared in the Faster R-CNN paper. To understand anchor boxes, we must first understand two questions.

*Why propose the anchor box?*

There are two main reasons:

  • A window can only detect one target
  • Unable to solve the multi-scale problem.

[figure]

Earlier models could predict only one target per window: the window is fed into the classification network, which outputs a prediction probability, and the target in the window is assigned to whichever category that probability favors. For instance, in the red box in the figure, if the pedestrian probability is higher, the target is taken to be a pedestrian. As for the multi-scale problem, the main earlier idea was pyramids, such as the classic feature pyramid in the DPM model: detect targets of different sizes on feature maps of different resolutions. The problem is that this greatly increases the amount of computation.

*Why use different sizes and different aspect ratios?*

[figure]

In order to get a larger intersection over union (IoU).

Take the training phase as an example.

For computer vision, the ground truth is easy to understand: the human-made label for each target. But once the idea of the anchor box is added, in the training set we treat each anchor box as a training sample. Therefore, in order to train the model, we need to mark labels for each anchor box, where the label consists of two parts:

  • Category label
  • Offset

There are multiple anchor boxes; which one should be chosen? The selection is made through intersection over union. Imagine that if a single fixed-size anchor were used, the anchor labeling would not be targeted.

[figure: ground-truth labels for a pedestrian and a vehicle with a single fixed anchor box]

For example, in the figure the brown box is the ground-truth label for the pedestrian, the yellow box is the ground-truth label for the vehicle, and the red box is the anchor box mapped from the feature map. With a single fixed anchor, it is difficult to obtain a suitable label for each unit of the feature map via IoU.

[figure]

[figure]

In this case, anchor box 1 has a relatively large IoU with the pedestrian and can be used to train and predict pedestrians, while anchor box 2 has a relatively large IoU with the car and can be used to train and predict cars. Using anchor boxes of different aspect ratios and sizes is therefore more targeted.

How to choose the size of the anchor box?

There are currently three main ways to choose the anchor box:

  • Human experience selection
  • k-means clustering
  • Learn as a hyperparameter

Question 2: At what stage is the anchor box used?

Of course, it is used in both the training phase and the prediction phase. But then the question becomes: how exactly is it used in each phase?

If what you have read so far has left you in a fog, let's go through it in detail here.

1. Training phase

The main thing to understand about the training phase is that the anchor boxes are set before training. In other words, multiple anchor boxes are generated, and then, through iteration, the loss function is driven to its minimum so that the predicted boxes are as consistent as possible with the labels built from the input anchor boxes.

Label

In the training phase, the anchor boxes serve as training samples. Each anchor box must be given two types of labels: one is the target category contained in the anchor box, called the category; the other is the deviation of the ground-truth bounding box relative to the anchor box, called the offset.
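
One common parameterization of this offset is the one in the Faster R-CNN paper: with anchor center $(x_a, y_a)$ and size $(w_a, h_a)$, and ground-truth box $(x^*, y^*, w^*, h^*)$, the regression targets are

$$ t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}. $$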

During target detection, we first generate multiple anchor boxes, then predict the category and offset for each anchor box, then adjust the position of each anchor box according to its predicted offset to obtain a predicted bounding box, and finally filter the predicted bounding boxes that need to be output.

Knowing the ground truth of each target, how do we label the anchor boxes?

By using the maximum intersection over union (IoU).

[figure: matching anchor boxes to ground-truth bounding boxes]

The picture comes from Dive into Deep Learning.
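
IoU itself is just intersection area over union area. A minimal vectorized sketch (boxes in (x1, y1, x2, y2) form; the function name is my own):

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU of two box sets in (x1, y1, x2, y2) form.

    Returns a matrix of shape (len(boxes_a), len(boxes_b)).
    """
    tl = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # top-left
    br = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # bottom-right
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(boxes_a[:, 2:] - boxes_a[:, :2], axis=1)
    area_b = np.prod(boxes_b[:, 2:] - boxes_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)
```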

Suppose there are $n_a$ anchor boxes and $n_b$ ground-truth bounding boxes in the image. This forms a correspondence matrix $X \in \mathbb{R}^{n_a \times n_b}$ between the anchor boxes and the ground-truth boxes, whose entries are their IoUs. According to this correspondence, find the ground-truth bounding box with the largest IoU for each anchor box, take that box's label as the anchor box's label, and then compute the anchor box's offset relative to that ground-truth bounding box.

In this way, each anchor box is given its two marks: a category label and an offset.
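
A sketch of that labeling step, reusing the iou() helper above (the 0.3 negative threshold follows Faster R-CNN's convention; the rest is illustrative):

```python
def label_anchors(anchors, gt_boxes, gt_labels, neg_thresh=0.3):
    """Give each anchor the category of its best-overlapping ground truth.

    Anchors whose best IoU is below neg_thresh become background (-1).
    Returns per-anchor labels and the index of the matched ground truth.
    """
    x = iou(anchors, gt_boxes)       # the n_a x n_b matrix from the text
    best_gt = x.argmax(axis=1)       # best ground-truth box per anchor
    labels = gt_labels[best_gt].copy()
    labels[x.max(axis=1) < neg_thresh] = -1   # negative / background sample
    return labels, best_gt

# Offsets are then computed between each anchor and gt_boxes[best_gt],
# e.g. with the (t_x, t_y, t_w, t_h) parameterization given earlier.
```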

Training

[figure]

*When does the anchor box come into play during the training phase?*

After a series of convolutions and pooling, the anchor boxes are used at the feature map layer. As shown in the figure above, after feature extraction, a $3\times3$ grid yields a $3\times3\times2\times8$ feature layer. Here 2 is the number of anchor boxes (following the Deep Learning course example of choosing two anchor boxes), and 8 is the number of variables contained in each anchor box: 4 position offsets, 3 categories (one-hot encoding), and 1 anchor box label (1 if this anchor box has the largest IoU with a ground-truth box, 0 otherwise).

After reaching the feature layer, each cell is mapped back to the original image to find the pre-labeled anchor boxes, and the loss between each anchor box and the ground truth is computed. The main purpose of training is to fit the model parameters that regress the anchor boxes toward the ground-truth boxes.

Looking at the loss function helps in understanding this concept. The loss function used in the original Faster R-CNN paper is:

$$ L\left(\{p_i\}, \{t_i\}\right) = \frac{1}{N_{cls}} \sum_i L_{cls}\left(p_i, p_i^*\right) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}\left(t_i, t_i^*\right) $$

Here $L_{cls}(p_i, p_i^*)$ is the category loss, where $p_i^*$ is the true label: 1 for a positive sample, 0 for a negative sample. Similarly, $L_{reg}(t_i, t_i^*)$ is the position offset loss, where $t_i^*$ is the vector of the 4 parameterized coordinates of the ground-truth box for a positive anchor. The ultimate goal of training is to make this loss function as small as possible.
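
In code, the structure of this loss (a sketch, not the exact Faster R-CNN implementation; anchor sampling and normalization details are omitted) looks roughly like:

```python
import torch
import torch.nn.functional as F

def rpn_style_loss(cls_logits, pred_offsets, p_star, t_star, lam=1.0):
    """Two-term loss: objectness over all anchors, regression over positives.

    p_star: 0/1 tensor marking negative/positive anchors (p_i^*).
    t_star: target offsets, only meaningful where p_star == 1.
    """
    cls = F.binary_cross_entropy_with_logits(cls_logits, p_star)
    pos = p_star.bool()
    if pos.any():
        reg = F.smooth_l1_loss(pred_offsets[pos], t_star[pos])
    else:
        reg = pred_offsets.sum() * 0.0   # keep the graph valid with no positives
    return cls + lam * reg
```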

2. Prediction phase

*In the prediction stage of the model, how do we get the predicted bounding boxes?*

First, multiple anchor boxes are generated in the image; then the categories and offsets of these anchor boxes are predicted with the trained model parameters, giving the predicted bounding boxes. Due to the choice of threshold and the number of anchor boxes, the same target may yield multiple similar predicted bounding boxes, which is not only redundant but also increases the amount of computation. The common measure to solve this problem is non-maximum suppression (NMS).
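
The "adjust the anchor box by the predicted offset" step is simply the inverse of the training-time parameterization. A sketch, matching the $(t_x, t_y, t_w, t_h)$ form given earlier:

```python
import numpy as np

def decode(anchors, offsets):
    """Apply predicted (t_x, t_y, t_w, t_h) to anchors in (x1, y1, x2, y2) form."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + wa / 2
    ya = anchors[:, 1] + ha / 2
    x = offsets[:, 0] * wa + xa          # undo t_x = (x - x_a) / w_a
    y = offsets[:, 1] * ha + ya
    w = np.exp(offsets[:, 2]) * wa       # undo t_w = log(w / w_a)
    h = np.exp(offsets[:, 3]) * ha
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)
```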

*How to understand NMS?*

[figure]

NMS is an iterative traversal process that suppresses redundant boxes.

For a predicted bounding box $B$, the model computes the probability that it belongs to each category, and the category with the largest probability becomes the class of the predicted box. On the same image, sort the predicted probabilities of all predicted bounding boxes (without distinguishing categories) from largest to smallest, and take the box with the largest probability, $B_1$, as the benchmark. Then compute the IoU between each remaining predicted bounding box and $B_1$; if it is greater than a given threshold, remove that box. In this way, the box with the highest probability is kept and other similar boxes are removed. Next, select the box with the largest probability among the remaining predicted bounding boxes, $B_2$, and repeat the above process.
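
A compact sketch of this greedy procedure, reusing the iou() helper from earlier (the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it beyond
    iou_thresh, and repeat on whatever remains."""
    order = np.argsort(scores)[::-1]     # indices by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        overlaps = iou(boxes[i:i + 1], boxes[order[1:]])[0]
        order = order[1:][overlaps <= iou_thresh]
    return keep
```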



Source: blog.csdn.net/ahelloyou/article/details/111409090