Anchor-Based and Anchor-Free

1 Concepts

1.1 What Is an Anchor?

An anchor (also called an anchor box) is one of a set of preset bounding boxes; during training, the network learns the offsets of the ground-truth boxes relative to these presets. In plain terms, anchors pre-set the approximate locations and shapes where targets may appear, and the network then makes fine adjustments from those preset boxes. In essence, anchors are a mechanism for solving the label-assignment problem.
Anchors encode a series of prior-box information, and generating them involves the following parts:

  • Points of the feature map extracted by the network locate the position of each box;
  • The anchor scale sets the size of each box;
  • The anchor aspect ratio sets the shape of each box.
    By presetting prior boxes at different scales and aspect ratios, there is a higher probability that some prior box matches the target object well.
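As a concrete illustration of these three parts, here is a minimal sketch that generates anchors at every feature-map point from a set of scales and aspect ratios. The function name, stride, and scale/ratio values are illustrative assumptions, not taken from any particular detector:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Generate (cx, cy, w, h) anchors at every feature-map location.

    scales: anchor side lengths in input-image pixels (illustrative values);
    ratios: height/width aspect ratios.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # map the feature-map point back to input-image coordinates
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # keep area roughly s*s while varying the shape
                    w = s / np.sqrt(r)
                    h = s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# e.g. a 2x2 feature map with stride 16, 3 scales x 3 ratios = 9 anchors/point
A = generate_anchors(2, 2, 16, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0))
print(A.shape)  # (36, 4)
```

Each feature-map point contributes 9 anchors here, which is exactly the 3-scale, 3-ratio configuration Faster R-CNN popularized.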

Object detection algorithms can broadly be divided into anchor-based, anchor-free, and fusions of the two.
The difference between the three lies in whether anchors are used to extract candidate target boxes.

1.2 How Anchors Are Used

In the era of traditional image processing, detecting targets in an image usually meant first extracting image features, encoding them into a series of feature descriptors, and feeding those to a machine-learning classifier. For example, a HOG feature extractor scans the image patch by patch using a sliding window plus an image pyramid. This region-based approach carried over into deep learning.

An anchor (also known as an anchor box) is one of a group of rectangular boxes obtained before training by clustering the training set with methods such as k-means; they represent the width and height scales around which the targets in the dataset are mainly distributed.
In the two-stage branch, the anchor concept gradually evolved from the brute-force sliding-window idea of region extraction. Starting with Faster R-CNN, coordinates are formally regressed relative to anchors, and candidate boxes are generated by the RPN.
In the single-stage branch, from SSD through YOLOv2, v3, v4, and v5, methods continue the anchor-based regression route.
You may be wondering why YOLOv1 is missing here.
YOLOv1 only has the idea of region division, without anchor boxes. Although the image is divided into grid cells and each cell predicts targets, the scale regression spans the entire image, so its detection accuracy was much lower than its contemporaries. That is why later YOLO versions added anchor boxes; under their constraints, the precision and recall of the model improved qualitatively.
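The clustering step mentioned above can be sketched with a minimal k-means in the YOLOv2 style, using 1 - IoU between (width, height) pairs as the distance rather than Euclidean distance. The helper names, deterministic initialization, and toy dataset below are assumptions for illustration:

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, assuming all boxes share a common corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100):
    """Cluster ground-truth (w, h) pairs into k anchors with 1 - IoU distance."""
    # deterministic init: k boxes spread evenly through the dataset
    centers = wh[np.linspace(0, len(wh) - 1, k).astype(int)].copy()
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centers), axis=1)   # nearest = max IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort by area

# toy dataset: a cluster of small boxes and a cluster of large ones
wh = np.array([[10, 12], [12, 10], [11, 11],
               [90, 100], [100, 90], [95, 95]], dtype=float)
print(kmeans_anchors(wh, k=2))
```

On this toy data, the two returned anchors land on the means of the small and large clusters, which is exactly the "main width/height scales of the dataset" that anchors are meant to capture.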

2 Anchor-Free vs. Anchor-Based

2.1 Anchor-Free

Representative algorithms: CornerNet, ExtremeNet, CenterNet, FCOS, etc.
Anchor-free means there is no prior anchor box; boxes are obtained directly by predicting specific points.
Anchor-free detection algorithms work in one of two ways:

  1. Keypoint-based methods limit the search space by locating several key points of the target object;
  2. Center-based methods locate the center point of the target object and then predict the distances from the center to the boundaries.
    The most direct benefit is that there is no need to cluster width/height anchor parameters on the training data before training.

2.1.1 Keypoint-based detection algorithms

This class of methods converts object detection into a combination of keypoint-localization problems. Several keypoint-based detectors are introduced below:
CornerNet directly predicts, for each point, the probability of being a top-left or bottom-right corner, and extracts target boxes by pairing top-left with bottom-right corners.
The whole network is shown in the figure below: the input image is passed through multiple Hourglass modules in series for feature extraction, then split into two branches, a top-left corner prediction branch and a bottom-right corner prediction branch; after Corner Pooling, each branch outputs three parts:

  • Heatmaps: predict corner positions;
  • Embeddings: group the predicted corners;
  • Offsets: fine-tune the predicted boxes.
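To make the role of the three outputs concrete, here is a hedged sketch of the pairing step: corner peaks taken from the heatmaps are combined into boxes only when the geometry is valid and their embeddings are close. This simplifies CornerNet's actual post-processing (scalar embeddings, no score filtering); the function name and threshold are assumptions:

```python
def pair_corners(tl, br, emb_thresh=0.5):
    """Pair top-left and bottom-right corner peaks into boxes.

    tl, br: lists of (x, y, score, embedding) tuples from heatmap peaks.
    Returns (x1, y1, x2, y2, score) boxes. Illustrative, not CornerNet's
    exact post-processing.
    """
    boxes = []
    for x1, y1, s1, e1 in tl:
        for x2, y2, s2, e2 in br:
            if x2 <= x1 or y2 <= y1:           # geometry must be valid
                continue
            if abs(e1 - e2) > emb_thresh:      # embeddings must match
                continue
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2))
    return boxes

# two objects: embeddings near 1.0 belong together, those near 3.0 likewise
tl = [(10, 10, 0.9, 1.0), (50, 40, 0.8, 3.0)]
br = [(30, 30, 0.85, 1.1), (90, 80, 0.7, 3.2)]
print(pair_corners(tl, br))
```

Note how the embedding check prevents the top-left corner of one object from being paired with the bottom-right corner of another, which is the whole point of the Embeddings head.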

Grid R-CNN finds candidate regions with an RPN and extracts a feature map independently for each RoI. Each feature map is passed through fully convolutional layers that output probability heatmaps, which locate the grid points of the bounding box aligned with the target object; by fusing the grid points at the feature-map level, the precise bounding box of the target is finally determined.

ExtremeNet connects multiple Hourglass modules in series and predicts 5 key points for each target (the top, bottom, left, and right extreme points plus a center point). If five key points are geometrically aligned, i.e., combining extreme points from the different heatmaps yields a geometric center whose value on the center-point heatmap is high enough, they are grouped into one bounding box.

2.1.2 Center-based detection algorithms

These methods model the object as a single point, the center point of its bounding box; the detector regresses the center point together with its associated attributes. Several center-based detectors are introduced below:

As an early anchor-free algorithm, YOLO treats object detection as regression of spatially separated bounding boxes and associated class probabilities, predicting bounding boxes and classification scores directly from the whole image.

However, it uses fully connected layers to predict bounding boxes directly. Since images contain objects of different scales and aspect ratios, it is difficult for YOLO to learn to adapt to the varied object shapes during training.

As mentioned above, YOLO does not pre-assume box sizes or aspect ratios in the network, so during training it only knows that each grid cell outputs a certain number of detection boxes; it has no other information about preset boxes.

CenterNet only needs to extract the target's center point, with no keypoint grouping or post-processing. Its network structure is quite clear: from the open-source code, an encoder-decoder backbone (ResNet/DLA/Hourglass) extracts features, and the output splits into three parts:

  • Heatmap: predicts the center-point positions;
  • wh: the width and height at each center point;
  • reg: the sub-pixel offset of each center point.
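Putting the three heads together, a simplified single-class decoding step might look like the following; the function name, the stride, and the toy maps are illustrative, not CenterNet's exact implementation:

```python
import numpy as np

def decode_centernet(heatmap, wh, reg, k=2, stride=4):
    """Decode the top-k peaks of CenterNet-style outputs into boxes.

    heatmap: (H, W) center scores; wh, reg: (H, W, 2) size and offset maps.
    Simplified single-class sketch of the decoding step.
    """
    H, W = heatmap.shape
    idx = np.argsort(heatmap.ravel())[::-1][:k]   # top-k peak indices
    ys, xs = idx // W, idx % W
    boxes = []
    for y, x in zip(ys, xs):
        cx = (x + reg[y, x, 0]) * stride          # refine center with offset
        cy = (y + reg[y, x, 1]) * stride
        w, h = wh[y, x] * stride                  # size back in input pixels
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      heatmap[y, x]))
    return boxes

# toy 8x8 output maps with two synthetic peaks
hm = np.zeros((8, 8)); hm[2, 3] = 0.9; hm[6, 6] = 0.7
wh_map = np.full((8, 8, 2), 4.0)                  # every box 16x16 on input
reg_map = np.zeros((8, 8, 2))
print(decode_centernet(hm, wh_map, reg_map))
```

The real pipeline applies a sigmoid and a 3x3 max-pool as NMS before taking peaks, but the heatmap + wh + reg composition shown here is the essence.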

So why would the object detection field consider removing anchors?

  1. Preset anchor sizes must be changed for each dataset, either set by hand or obtained by clustering the dataset;
  2. The number of anchors far exceeds the number of targets, causing an imbalance between positive and negative samples.
    Of course, to handle target scale variation and class imbalance once anchors are removed, techniques such as FPN, PAN, and focal loss play an important role. FPN and PAN fuse features from different levels, so the feature maps used for prediction contain target features at multiple scales; there is then no need for anchors of different scales to lock onto the target size before regression. Focal loss's weighting of positive and negative samples likewise alleviates the imbalance problem to some extent.
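For reference, the focal loss weighting mentioned above can be sketched in its binary form from the RetinaNet paper; the alpha/gamma defaults below are the commonly cited values:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probabilities, y: 0/1 labels. alpha balances positives
    against negatives; gamma down-weights easy examples.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# an easy negative (p=0.01) contributes far less than a hard positive (p=0.1)
easy = focal_loss(np.array([0.01]), np.array([0]))[0]
hard = focal_loss(np.array([0.10]), np.array([1]))[0]
print(easy, hard)
```

The (1 - p_t)^gamma factor is what keeps the huge number of easy negatives from dominating the loss, which is exactly the imbalance problem described above.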

Anchor-free algorithm summary:

  1. The method based on the joint expression of multiple key points:
    1. CornerNet/CornerNet-lite: upper left corner point + lower right corner point
    2. ExtremeNet: up, down, left, and right 4 extreme points + center point
    3. CenterNet: Keypoint Triplets for Object Detection: upper left corner + lower right corner + center point
    4. RepPoints: 9 adaptively learned sample points
    5. FoveaBox: center point + upper left corner point + lower right corner point
    6. PLN: 4 corner points + center point
  2. Methods based on single center point prediction:
    1. CenterNet: Objects as Points: center point + width + height
    2. CSP: center point + height (the author presets a fixed target aspect ratio, so the width is computed from the height)
    3. FCOS: center point + 4 distances to the box sides

2.2 Anchor-Based

Representative algorithms: Faster R-CNN, SSD, YOLO (v2-v5), etc.
Anchor-based methods use prior anchor boxes: first obtain a set of anchor scales and sizes (set by hand or by clustering a certain amount of data), then combine the prior anchor boxes with the predicted offsets to get the predicted boxes.
For the past several years, the object detection field has been dominated by anchor-based detectors. Such algorithms proceed in three steps:

  1. Preset a large number of anchors (2D/3D) over the image or point-cloud space;
  2. Regress the four offsets of each target relative to its anchors;
  3. Refine the anchors with the corresponding regressed offsets to obtain precise target positions.
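Steps 2 and 3 can be sketched with the standard Faster R-CNN-style box parameterization, where the center is shifted by a fraction of the anchor size and the size is scaled exponentially. The function name and sample numbers are illustrative:

```python
import numpy as np

def apply_deltas(anchors, deltas):
    """Faster R-CNN-style box decoding: anchors (cx, cy, w, h) plus regressed
    offsets (tx, ty, tw, th) -> predicted boxes (cx, cy, w, h)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]   # shift center by tx*w
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]   # shift center by ty*h
    w = anchors[:, 2] * np.exp(deltas[:, 2])            # scale width by e^tw
    h = anchors[:, 3] * np.exp(deltas[:, 3])            # scale height by e^th
    return np.stack([cx, cy, w, h], axis=1)

anchors = np.array([[50.0, 50.0, 20.0, 40.0]])
deltas = np.array([[0.1, -0.2, np.log(2.0), 0.0]])      # doubles the width
print(apply_deltas(anchors, deltas))  # [[52. 42. 40. 40.]]
```

Because the offsets are relative to the anchor's own size, the network only has to learn small, normalized corrections, which is precisely why a well-matched prior box makes regression easier.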

2.2.1 Single-stage detection algorithms

Possible anchors are slid over the image and the boxes are classified directly. In the past two years, single-stage detection methods have continuously expanded the YOLO family; common members include YOLOv1-v5, Complex-YOLO, YOLO3D, YOLO-Lite, Poly-YOLO, YOLOP, and more. A search on GitHub turns up dozens of variants.

The overall framework is generally divided into three parts: backbone, neck, and head. For the backbone, ResNet-50 works well, or modules such as CSP/C3 can be cascaded to extract features. The neck is similar across networks, using FPN or PAN to fuse high- and low-level feature-map information. The head is mainly determined by the task: 2D box detection regresses the center point, width, height, class, and so on; 3D box detection can add an orientation angle, a mask map, etc.

2.2.2 Two-stage detection algorithms

Image features are recomputed for each potential box, and these features are then classified. Relatively few new two-stage methods have appeared in the past two years; the two-stage detectors seen in 2D, 3D, or early-fusion settings still rely on the Faster R-CNN concept.

Two-stage detection is mainly the R-CNN series, including R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, etc. SPPNet served as a transition between R-CNN and Fast R-CNN. Later, the feature pyramid appeared on top of the Faster R-CNN framework. Mask R-CNN then combined the Faster R-CNN architecture, ResNet and FPN, and the segmentation approach from FCN to improve mAP.

The figure below (found online) shows the development of two-stage detection networks, described in detail by sub-area, for reference:

Faster R-CNN sets 3 scales and 3 aspect ratios, 9 anchors in total, to extract candidate boxes.

2.3 Fusion methods

Fusion methods combine anchor-based and anchor-free branches.
Representative algorithms: FSAF, SFace, GA-RPN, etc.

FSAF has both an anchor-based branch with prior settings and an anchor-free branch, which enhances detection of targets with unusual aspect ratios.

3 Extensions

3.1 What Are the Differences Between Anchor-Free and Anchor-Based?

This question first requires answering why anchors exist at all. In the deep-learning era, object detection is usually modeled as classification and regression over a set of candidate regions. In single-stage detectors, these candidate regions are the anchors generated in sliding-window fashion; in two-stage detectors, the candidate regions are the proposals generated by the RPN, but the RPN itself still classifies and regresses anchors generated in sliding-window fashion.

  1. Feature Selective Anchor-Free Module for Single-Shot Object Detection
  2. FCOS: Fully Convolutional One-Stage Object Detection
  3. FoveaBox: Beyond Anchor-based Object Detector
  4. High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection

The anchor-free methods in the papers above solve detection in a different way, still split into two sub-problems: determining the center of the object and predicting the four borders. For the center prediction, the implementation can define a hard center region, as in papers 1 and 3, folding center prediction into the classification target, or predict a soft centerness score, as in papers 2 and 4. The border prediction is more consistent: all of them predict the distance from a pixel to the four sides of the ground-truth box, with some tricks to limit the regression range.
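The border-prediction scheme shared by these papers (FCOS-style distances to the four sides) can be sketched as follows; the function name, stride, and sample values are assumptions for illustration:

```python
import numpy as np

def decode_fcos(points, ltrb, stride=8):
    """FCOS-style decoding: each feature-map point plus its predicted
    distances (l, t, r, b) to the four sides gives a box (x1, y1, x2, y2)."""
    xs = (points[:, 0] + 0.5) * stride   # point location on the input image
    ys = (points[:, 1] + 0.5) * stride
    x1 = xs - ltrb[:, 0]                 # left side: move left by l
    y1 = ys - ltrb[:, 1]                 # top side: move up by t
    x2 = xs + ltrb[:, 2]                 # right side: move right by r
    y2 = ys + ltrb[:, 3]                 # bottom side: move down by b
    return np.stack([x1, y1, x2, y2], axis=1)

points = np.array([[4, 4]])                    # a feature-map coordinate
ltrb = np.array([[10.0, 20.0, 30.0, 5.0]])     # distances to l, t, r, b
print(decode_fcos(points, ltrb))  # [[26. 16. 66. 41.]]
```

Unlike the anchor-delta decoding shown earlier, nothing here depends on a preset box shape; the regression range per FPN level is what substitutes for anchor scales.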

3.2 Why Anchor-Free Could Make a Comeback

Anchor-free methods are now comparable to anchor-based methods in accuracy. I think the biggest credit goes to FPN, followed by focal loss (aside: RetinaNet is the best). When only one box is predicted per location, the FPN structure compensates well for scale, and focal loss greatly helps prediction in the center region. Of course, getting such methods to work well is not easy; I believe certain details have a large impact, such as handling overlapping regions, limiting the regression range, how targets are assigned to different FPN levels, and whether the head shares parameters.

3.3 Anchor-Free vs. a Single Anchor

The anchor-free setup above and a design with one square anchor at each position are formally equivalent: both use an FCN structure to predict a box (position and class) for each position of the feature map. But anchor-free is still meaningful; we could also call it anchor-prior-free. In addition, although the two are formally equivalent, they differ in practice: in anchor-based methods, even when each position has only one anchor, the predicted object is matched based on that anchor, while in anchor-free methods matching is usually based on the point itself.

3.4 Limitations of Anchor-Free

Although the accuracy of the methods above is comparable to RetinaNet's, they show no obvious advantage (except perhaps in speed) and still fall well short of two-stage and cascade methods. Like anchor-based single-stage detectors, their instance-level feature representation is not as good as two-stage detectors', and fewer tricks are available on the head. Incidentally, to present better-looking results, a few of the papers above hide some details or make somewhat unfair comparisons in their experiments.

3.5 Other Anchor-Free Routes

Besides determining the center point and borders separately, as above, anchor-free has another, bottom-up route, represented by CornerNet. If the anchor-free methods above still retain the idea of region classification and regression, this route abandons that idea and instead solves a problem of keypoint localization and grouping.

4 Outlook

Anchor-free methods may be friendlier to industrial applications thanks to their simple network structure. As for the development of the method itself, I see one direction in new instance-segmentation pipelines, since anchor-free is naturally closer to segmentation, and another in moving toward two-stage or cascade detectors to further improve performance; it would be especially interesting to solve the feature-alignment problem without RoI Pooling. New post-processing methods are also possible, and I look forward to the flexibility of anchor-free bringing new methods and ideas.



Source: blog.csdn.net/Lc_001/article/details/129436513