Deep Learning (11): Anchor-Free Detection Explained

1. Overview of anchor free
1. First, understand what an anchor is (this requires familiarity with two-stage detectors such as Faster R-CNN and one-stage detectors such as YOLOv2 and later, or SSD).
In the past, object detection was usually modeled as classification and regression over candidate boxes. Depending on how the candidates are generated, detectors fall into two-stage and one-stage methods: the former generates proposals through an RPN (Region Proposal Network), while the latter places prior boxes (anchors) of different sizes and aspect ratios at every position of specific feature maps. As shown below:
2. Why abandon anchors and go anchor free
1) Anchor settings must be designed by hand (aspect ratios, scales, and the number of anchors), and different datasets need different designs, which is quite troublesome.
2) The anchor matching mechanism means objects at extreme scales (very large and very small) get matched less often than moderately sized objects, so the network has a hard time learning these extreme samples well.
3) The huge number of anchors causes a severe imbalance problem, which requires a sampling step; in practice, Focal-loss-style strategies are not always stable, and sampling has many pitfalls.
4) The huge number of anchors also requires an IOU computation for each one, which consumes a lot of compute and reduces efficiency.
3. Directions of anchor-free detection
Anchor-free detection can be traced back to the YOLO algorithm, arguably the earliest anchor-free model; recent anchor-free methods mainly fall into two families: dense prediction and keypoint estimation.
4. Limitations of anchor free
At present, in order to report better-looking results, some papers hide details or make unfair comparisons in their experiments (for example, using an Hourglass backbone while comparing against other methods' ResNet backbones).
5. Anchor-free project recommendations
Since YOLOv5 has already been covered, here we focus mainly on the ideas behind anchor free; for engineering applications, the following are worth trying:
1) CenterNet (the "Objects as Points" version)
2) ExtremeNet (replaces bounding-box regression with extreme-point prediction)
2. Interpretation of anchor free
Each method is introduced according to the following structure:
X、XXXXX
1. Main contributions
2. Main ideas
3. Specific details
1)input
2)backbone
3)neck & head
4)loss function
5)tricks
4. Results
    
A. DenseBox (Baidu IDL & Horizon, September 2015)
1. Main contributions
1) Introduced FCN into object detection, achieving good efficiency and accuracy with an end-to-end multi-task detection framework.
2) Introduced landmark localization into DenseBox's multi-task learning, which further improves detection accuracy;
3) For face detection, DenseBox achieved SOTA on MALF and KITTI;
2. Main ideas
As shown above:
1) Use an image pyramid as input to the CNN and directly predict the bbox + classification score on the feature map (i.e., a 5-channel output map: 4 channels for bbox regression and 1 channel for the binary face classification score);
2) Apart from the NMS step, the whole DenseBox model is fully convolutional: there are no fully connected layers and no region-proposal generation step;
3) The image is down-sampled and then up-sampled (bilinear interpolation), similar to the segmentation network FCN;
3. Specific details
1)input
a. Training phase
The original image contains too much background, and training would spend a lot of time computing and judging background regions, which is of little value; therefore cropped image patches are used for DenseBox training, as long as each cropped patch contains a face plus enough background context.
Specific operation: DenseBox training is somewhat like an image segmentation setup. First crop a patch from the original image and resize it to 240 x 240, so that the center of the patch contains a face gt bbox about 50 pix high. This is easy to implement: it amounts to cropping a patch from the original image + gt, but the patch must keep the face at its center, and the gt bbox must become roughly 50 x 50 after the resize, so the size of the sub-region to crop is computed proportionally. For example: if a face gt bbox in the original image is 80 x 80 and we need the face to be 50 x 50 after resizing, the sub-region to crop from the original image is 240 x 80 / 50 = 384, i.e., a 384 x 384 sub-region.
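To make the proportion concrete, here is a tiny sketch of that crop-size calculation (the constants and function name are illustrative, not from the paper's code):

```python
PATCH_SIZE = 240        # network input patch after resize
TARGET_FACE_SIZE = 50   # desired face height inside the resized patch

def crop_patch_size(gt_face_size: float) -> float:
    """Side length of the square region to crop from the original image so that, after
    resizing to PATCH_SIZE x PATCH_SIZE, the face ends up about TARGET_FACE_SIZE tall."""
    return PATCH_SIZE * gt_face_size / TARGET_FACE_SIZE

print(crop_patch_size(80))  # -> 384.0, matching the example above
```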
b. Testing phase
Use an image pyramid
2)backbone
16 convolutional layers in total, reusing the parameters of the first 12 convolutional layers of VGG19.
3)neck & head
Conv 4-4 is followed by four 1 x 1 conv layers; as fig 3 shows, these actually form two branches: two 1 x 1 conv layers output a 1-channel feature map for the face classification score, and the other two output a 4-channel feature map for face bbox regression. The feature maps of conv 3-4 and conv 4-4 are fused by concatenation. The authors argue that features on the lower-level feature map contain more local detail of the target, making it easier to judge local regions, while features on the higher-level map, with their larger receptive field, contain more semantic and contextual information and are better at recognizing the target's global information. The receptive field of conv 3-4 is 48 x 48, almost the same as the face gt bbox scale, while the receptive field of conv 4-4 is 118 x 118, so it can integrate more global and contextual information. In addition, the conv 4-4 feature map is half the resolution of conv 3-4 (one more pooling), so before concatenation conv 4-4 must be bilinearly upsampled so the two spatial sizes match; because the fusion is a concatenation rather than an element-wise addition, the channel counts do not need to match.
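A rough PyTorch sketch of this fusion and the two pairs of 1 x 1 conv layers may help; channel counts and module names are assumptions for illustration, not the paper's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBoxHead(nn.Module):
    def __init__(self, c34=256, c44=512, mid=512):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv2d(c34 + c44, mid, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, 1, 1))   # 1-channel face score map
        self.reg = nn.Sequential(nn.Conv2d(c34 + c44, mid, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, 4, 1))   # 4-channel distance map for bbox regression

    def forward(self, conv3_4, conv4_4):
        # bilinearly upsample conv4-4 to conv3-4's resolution, then concatenate
        up = F.interpolate(conv4_4, size=conv3_4.shape[-2:], mode="bilinear", align_corners=False)
        feat = torch.cat([conv3_4, up], dim=1)           # concat, so channel counts need not match
        return self.cls(feat), self.reg(feat)

score, bbox = DenseBoxHead()(torch.randn(1, 256, 60, 60), torch.randn(1, 512, 30, 30))
```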
4)loss
a. Classification loss, pixel-wise L2-loss on feature map.
The authors report that L2 loss works well in DenseBox; they did not try alternatives such as hinge loss or cross-entropy loss.
b. The bbox regression loss is also the pixel-wise L2-loss on the feature map.
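In compact form (my notation, consistent with the description above rather than copied from the paper), with $\hat{y}$ the predicted score, $y^*$ the label, and $\hat{d}_i$ / $d_i^*$ the predicted / target distances to the four box boundaries:

$$\mathcal{L}_{cls} = \lVert \hat{y} - y^* \rVert_2^2, \qquad \mathcal{L}_{loc} = \sum_{i=1}^{4} \lVert \hat{d}_i - d_i^* \rVert_2^2$$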
5)tricks
a. Define positive and negative samples
In each mini-batch iteration, the numbers of positive and negative samples differ greatly, with negatives making up the vast majority; if all of these negatives were used for training, the final loss would be dominated by the massive number of negatives. At the same time, if the ambiguous positive/negative samples near the classification boundary are used for training, the model cannot learn anything valuable and performance drops.
On the final feature map:
Positive samples: the center point of the annotated box (my understanding: points within a circle centered at the center point, with radius equal to a fixed fraction of the box height, e.g. 0.3);
Negative samples: points outside the positive samples above;
Ignored samples: a negative pixel (x, y) is treated as ignored if there is a positive pixel within a distance of two pixels.
b. Hard negative mining
Within a mini-batch, all samples first go through a forward pass; the outputs of all pixels are sorted in descending order by the loss of formula (1), and the top 1% are selected as hard negatives. All positive pixels (positive labeled pixels) are kept, and the positive:negative ratio is controlled at 1:1. Of the selected negatives, half are sampled randomly from the non-hard negatives (the remaining 99%) and half from the top 1% hard negatives. In each mini-batch, Fsel = 1 marks whether a sample is selected (positives + hard negatives + randomly chosen negatives); a sketch is given below.
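A minimal numpy sketch of this sampling scheme, assuming the per-pixel classification losses and the positive mask have already been computed (all names are illustrative):

```python
import numpy as np

def select_training_mask(losses: np.ndarray, pos_mask: np.ndarray, rng=np.random) -> np.ndarray:
    """losses: per-pixel classification loss; pos_mask: boolean mask of positive pixels.
    Returns a boolean mask of pixels that contribute to the loss (Fsel = 1)."""
    neg_idx = np.flatnonzero(~pos_mask)
    order = np.argsort(-losses.ravel()[neg_idx])          # negatives sorted by loss, descending
    n_hard = max(1, int(0.01 * neg_idx.size))             # top 1% are hard negatives
    hard, easy = neg_idx[order[:n_hard]], neg_idx[order[n_hard:]]

    n_neg = int(pos_mask.sum())                           # keep the pos:neg ratio at 1:1
    half = n_neg // 2
    chosen = np.concatenate([
        rng.choice(hard, size=min(half, hard.size), replace=False),
        rng.choice(easy, size=min(n_neg - half, easy.size), replace=False),
    ])
    mask = pos_mask.ravel().copy()
    mask[chosen] = True
    return mask.reshape(pos_mask.shape)
```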
Positive patch: an input 240 x 240 pix patch whose center contains a positive sample within a specific scale range (around 50 x 50 pix); each such patch may also contain several negative samples near the positive one. Negative patch (random patch): patches cropped from training images at random positions and random scales, then resized to 240 x 240 pix and fed to the network. During training, the ratio of positive patches to random negative patches is 1:1;
c. Data Augmentation
Random horizontal flips of positive and negative patches, translations of up to 25 pix, and scale jitter in [0.5, 1.25];
B. YOLO V1
1. Main contributions
One-stage, fast, end-to-end training
2. Main ideas
Through a convolutional network, the input image is mapped to a 7*7*30 tensor. 7*7 means the whole image is divided into a 7*7 grid, and each grid cell is given 2 candidate boxes (training can be understood as a kind of evolutionary selection between them), as shown in the left figure below; 30 is the encoding of each cell, as shown in the right figure (a decoding sketch follows the list):
Probabilities of 20 object classes: YOLO recognizes 20 different kinds of objects (person, bird, cat, car, chair, etc.);
Positions of 2 bounding boxes: each bounding box needs 4 values to represent its position (center_x, center_y, width, height);
Confidences of the 2 bounding boxes: the probability that an object is in the box * the IOU between the box and the object's actual box. When the cell contains no object, the confidence label is 0; when it does, the label is the IOU between the predicted box and the ground-truth box.
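A small sketch of decoding one 30-dimensional cell of the 7*7*30 output described above (20 class probabilities + 2 boxes x 4 coordinates + 2 confidences); the channel ordering is an assumption for illustration, not tied to a specific implementation:

```python
import numpy as np

def decode_cell(cell: np.ndarray):
    """cell: shape (30,). Returns (class_probs, boxes, confidences)."""
    class_probs = cell[:20]                 # 20 class probabilities
    boxes = cell[20:28].reshape(2, 4)       # (cx, cy, w, h) for each of the 2 boxes
    confidences = cell[28:30]               # one confidence per box
    return class_probs, boxes, confidences

output = np.random.rand(7, 7, 30)           # stand-in for a network output
probs, boxes, confs = decode_cell(output[3, 4])
score = probs.max() * confs.max()           # detection confidence used at inference (see tricks below)
```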
3. Specific details
1)input
The input is the original image, and the only requirement is to scale it to 448*448. The main reason is that in YOLO's network, the convolutional layer is connected to two fully connected layers at the end, and the fully connected layer requires a fixed-size vector as input.
2)backbone
Darknet
3) neck & head
As shown in the two figures above, there are 49*2 = 98 candidate boxes in total. The size and shape of the two boxes are not preset, and each box does not get its own object prediction: a cell only predicts two boxes for an object and keeps whichever is more accurate, which is less like a fully supervised assignment and more like an evolutionary selection. One image can therefore detect at most 49 objects. For each object, compute the center of its bounding box and find which grid cell the center falls into; in that cell's output vector the object's class probability is set to 1, and that cell is responsible for predicting the object, while the prediction probability of all other cells for this object is set to 0 (not responsible).
4) loss function
The object detection problem faced by YOLO is a typical class-imbalance problem: of the 49 grid cells, usually only 3 or 4 contain objects and the rest do not. If nothing is done about this, detection mAP will not be high, because the model leans toward the cells that contain no object. The weights λ_coord and λ_noobj give object-containing cells a larger weight in the loss function, so the model pays more attention to the loss they produce; in the paper, their values are 5 and 0.5 respectively.
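For reference, the full loss from the YOLOv1 paper (S = 7, B = 2; $\mathbb{1}_{ij}^{obj}$ indicates that box j of cell i is responsible for an object, $\mathbb{1}_{i}^{obj}$ that cell i contains an object):

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big]
+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2
+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
$$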
5)tricks
a. Do not directly return the coordinate value of the center point, but return the displacement value relative to the coordinate of the upper left corner of the grid point
b. Each grid cell predicts two (or more) boxes. In the loss computation, only the box closest to the ground-truth object is corrected; the other boxes are left alone. The authors found that the two boxes of a cell gradually specialize in size, aspect ratio, or category, which improves overall recall.
c. During inference, the maximum class probability p of a cell multiplied by the maximum box confidence c is used as the confidence of the output prediction.
C. FCOS (Fully Convolutional One-Stage Object Detection, open source)
1. Main contributions
1) How to handle locations covered by overlapping objects: FPN level assignment, plus, for overlaps within the same level, assigning the location to the smallest box containing it.
2) How to suppress low-quality per-pixel boxes: the center-ness strategy is proposed (personally, I think it is essentially the same as YOLOv1's "responsible for the object" idea).
2. Main ideas
Use FPN and, at every point of each feature map, predict classification + box regression + whether the point is near the center (personally, essentially the same idea as YOLOv1's object responsibility).
3. Specific details
1)input
no special requirements
2)backbone
Nothing special; experiments tried ResNeXt-32x8d-101-FPN, etc.
3)neck & head
a. As shown in figure 2 above, it should be noted that it is slightly different from the FPN native implementation:
FCOS uses the C3, C4, and C5 feature maps output by the backbone and produces P3, P4, and P5 via lateral connections; P6 and P7 are obtained from P5 and P6 respectively through a convolution layer with stride 2. There are 5 final feature maps in total, with downsampling factors of 8, 16, 32, 64, and 128.
b.label consists of classification + regression.
Given a location (x, y), if it falls inside any GT bounding box it is labeled a positive example and given the class label c*; otherwise c* = 0 (background). In addition, the location (x, y) gets a 4-dimensional vector t* = (l*, t*, r*, b*) as its regression target, where l*, t*, r*, b* are the distances from the point to the left, top, right, and bottom sides of the bounding box, as shown in the figure below.
The regression targets are always positive, so the network's regression output is mapped with exp() to (0, ∞), which helps distinguish the bounding-box scales.
At inference, given an image, FCOS produces a classification score p_{x,y} and a regression prediction t_{x,y} at every location, and locations with p_{x,y} > 0.05 are kept as candidates.
Anchor-based methods select positive anchors by their IOU with the GT, keeping only those above a threshold; in contrast, FCOS can use as many foreground locations as possible to train the regressor.
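A minimal sketch of the per-location regression target t* = (l*, t*, r*, b*) described above (single ground-truth box; the function name is illustrative):

```python
def fcos_regression_target(x, y, box):
    """box = (x1, y1, x2, y2). Returns (l*, t*, r*, b*) for location (x, y),
    or None if the location lies outside the box (background, c* = 0)."""
    x1, y1, x2, y2 = box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
    if min(l, t, r, b) < 0:          # point not inside the box
        return None
    return l, t, r, b
```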
c. Overlapping area processing
The first case (in the same layer of feature maps): Select the bounding box with a small area as the regression target, which can greatly reduce blurred samples.
The second case (feature maps in different layers due to FPN): Use feature maps of different layers to detect objects of different sizes.
specific methods:
If a location satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i-1}, it is set as a negative sample and no longer needs to regress a bounding box. Here m_i is the maximum distance that feature level i needs to regress; in this work, m_2, m_3, m_4, m_5, m_6 and m_7 are set to 0, 64, 128, 256, 512 and ∞.
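A sketch of that assignment rule (thresholds m_2..m_7 = 0, 64, 128, 256, 512, ∞; boundary ties go to the lower level in this simplified version):

```python
M = [0, 64, 128, 256, 512, float("inf")]   # m_2 .. m_7

def assign_level(l, t, r, b):
    """Return the FPN level (3..7) responsible for this regression target,
    or None if the location is a negative sample on every level."""
    m = max(l, t, r, b)
    for level, (lo, hi) in enumerate(zip(M[:-1], M[1:]), start=3):
        if lo <= m <= hi:            # level i handles targets with m_{i-1} <= max extent <= m_i
            return level
    return None
```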
d.center-ness
After adding FPN there is still a gap between FCOS and state-of-the-art detectors; analysis shows the gap is caused by low-quality detection boxes.
Since a point on the feature map corresponds to the center of a receptive field, the orange point in Figure 5 is clearly better suited than the two green points to predict the current person; the green points are called low-quality prediction points, and center-ness is used to suppress their weight. The center-ness of a location is computed as:

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

According to the formula, the closer a point is to the center of the ground truth, the higher its center-ness. A new question then arises: there are two ways to obtain center-ness. One: compute it directly from the predicted values l, r, t, b; two: regress a center-ness value with a separate branch, trained against the value computed from the label values l*, r*, t*, b*. As the network structure in Figure 4 shows, the authors adopted method two, and they compared the two options:

The results speak for themselves: the separate center-ness regression branch improves mAP.
4)loss function

$$L(\{p_{x,y}\},\{t_{x,y}\}) = \frac{1}{N_{pos}}\sum_{x,y} L_{cls}\big(p_{x,y},\, c^*_{x,y}\big) + \frac{\lambda}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c^*_{x,y}>0\}}\, L_{reg}\big(t_{x,y},\, t^*_{x,y}\big)$$

The classification loss L_cls is focal loss, and the regression loss L_reg is the IOU loss from the UnitBox paper.
Looking at the code, there is actually also a center-ness loss branch, which uses BCE loss; its target is the center-ness value computed above, not a one-hot binary classification label.
5)tricks
none
4. Results
Comparative experimental results in the paper:
Figure 8 FCOS accuracy comparison
At least it can be seen that the accuracy of FCOS exceeds the old classic algorithm Faster R-CNN.
FCOS performance
The paper does not give detailed speed numbers; the figure above is taken from the paper's official code page, and the values may differ slightly on different machines. Inference is not faster than two-stage Faster R-CNN, but it is basically real-time, and the mAP does surpass Faster R-CNN by a large margin.
D. FSAF (Feature Selective Anchor-Free Module for Single-Shot Object Detection)
1. Main contributions
Current FPN-based detectors match ground truth to anchors or pyramid levels mainly by IOU or by hand-crafted size heuristics; FSAF's main contribution is an automatic (online) feature-level selection mechanism that replaces these rules.
2. Main ideas
Let every FPN level try to detect the instance, see which level produces the smallest loss for this instance, and treat that level as the most suitable one for detecting it.
3. Specific details
1)input
No particular resolution is required.
2)backbone
The author experimented with resnext-101
3) neck & head
a. Branch structure
The FSAF module lets each instance automatically choose the most suitable FPN feature level. In this module, the basis for feature selection changes from the original instance size to the instance content, so the model learns to pick the most suitable FPN level by itself.
FSAF is built on RetinaNet: an FSAF branch is added in parallel with the original classification subnet and regression subnet, so fully end-to-end training is possible without changing the original structure. The FSAF branch likewise contains a classification branch (using sigmoid) and a box regression branch, which predict the target's class and coordinates, as shown in the figure.
b. Label definition
First: consider an instance whose class label is c and whose bounding box is b = (x, y, w, h), where (x, y) is the center of the target and (w, h) are its width and height.
Second: the projection of this box onto the l-th feature level of the FPN is obtained by dividing its coordinates by the stride of that level.
Third: the effective (positive) box of the projection is a region shrunk from the projected box by a constant factor, i.e., the white region of the "car" class in Figure 3.
Fourth: the ignoring box of the projection is a larger region, also shrunk from the projected box by a constant factor, i.e., the gray area of the "car" class in Figure 3.
Class output: the classification output is a structure parallel to the anchor-based branches with dimension W×H×K, where K is the total number of object categories (with the sigmoid-based focal loss there is no separate background channel), i.e., K feature maps. Assuming the target's category is c (the car in the figure), the class label is a W×H×K tensor whose c-th feature map is defined as the "car" map in Figure 3: the white area is the positive region with value 1, the gray area is the ignoring region where no gradient is back-propagated, and the black area is the negative region with value 0. The loss used is Focal Loss.
Box output: the box output is a structure parallel to the anchor-based branches with dimension W×H×4, where the 4 channels are offsets. To illustrate: for pixels inside the effective region, the four label values are the distances from the pixel to the top, left, bottom, and right sides of the projected box, normalized by the constant S = 4.0, as shown in Figure 4. The loss used is IoU Loss.
At a pixel, given the predicted offsets, the distances to the four sides are recovered by multiplying by S, which gives the predicted top-left and bottom-right corners; the predicted bounding box is then mapped back to the image scale by multiplying by the stride of the level.
4)loss function
a. In anchor-based algorithms, a target is usually assigned to a specified feature level according to its size, whereas the FSAF module selects the optimal feature level based on the target's content. The classification loss and localization loss of an instance assigned to the l-th feature level are the focal loss and the IoU loss averaged over the effective region, where N is the number of pixels in the effective area.
The feature level best suited to predict the target is then the one that minimizes the joint (classification + localization) loss, as sketched below.
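A minimal sketch of this online feature selection (names are illustrative):

```python
def select_level(per_level_losses):
    """per_level_losses: list of (cls_loss, reg_loss) tuples, one per FPN level.
    Returns the index of the level whose joint loss is smallest."""
    joint = [c + r for c, r in per_level_losses]
    return min(range(len(joint)), key=joint.__getitem__)

best = select_level([(0.9, 0.4), (0.5, 0.3), (0.7, 0.6)])   # -> 1 (second level)
```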
b. How are the anchor-free branches and the anchor-based branches combined?
At inference, FSAF can output predictions on its own as a separate branch, or together with the original anchor-based branches. When both are used, the outputs of the two branches are merged and NMS gives the final predictions. At training time a multi-task loss is used, i.e., the joint loss of the two branches given by the formula below.
The weight factor of the anchor-free loss is 0.5.
5)tricks
none
4. Results
E. FoveaBox
1. Main contributions
Many ideas are similar to FCOS and FSAF; the main difference is that the regression does not directly learn the distances from a point to the four sides, but instead learns a mapping between a predicted coordinate and the ground-truth coordinates.
2. Main ideas
Directly learn the probability that an object exists and the coordinates of the ground-truth box (no prior boxes are generated), mainly through two branches:
predict a class-sensitive semantic map giving the probability that an object exists at each position;
predict a mapping between each center-region point and the box coordinates.
3. Specific details
1) input
There is no special requirement for image resolution.
2)backbone
Mainly compared with RetinaNet; for the backbone, ResNet101 and ResNeXt101 were tried.
3)neck & head
a. In order to make a fair comparison with RetinaNet, the author uses exactly the same network structure, i.e., ResNet + FPN, with the same pyramid levels and the same input image resolution as RetinaNet.
b. Match Bbox
Each FPN level is assumed to predict bounding boxes within a certain range. Each pyramid level l has a basic area S_l ranging from 32*32 to 512*512, expressed as S_l = 4^l · S_0 with S_0 = 16; for l = 3 this gives S_3 = 4^3 × 16 = 1024 = 32*32. However, to make each level respond to a specific object scale, FoveaBox computes a valid scale range [S_l / η^2, S_l · η^2] for each pyramid level to control the range it handles (my understanding: matching is done by area).
Experiments found that the control factor η = 2 is optimal.
c. Positive and negative area determination
First, look at the sub-network for classification. Its output is a set of pyramid heatmaps, each of size H×W with K channels, where K is the number of categories. Given a ground-truth box (x1, y1, x2, y2), it is first mapped onto the target pyramid level by dividing by the stride of that level.
The positive region (fovea) is then designed as a shrunken version of the projected region, the same setting as in DenseBox; the reason is to prevent the semantic regions of adjacent objects from overlapping. σ1 is the shrink factor, and every cell inside the positive region is assigned the corresponding class label. For negative samples a second shrink factor σ2 is set to generate the negative region. If a cell is assigned to neither, it belongs to the ignore region and does not take part in back-propagation. This setting is very similar to FSAF. Because of the imbalance between samples, focal loss is used for optimization.
As the figure below shows, the shrink factors divide the map into positive, negative, and ignore regions. The positive region occupies only a small fraction of the whole feature map, which would unbalance positive and negative samples during training; this is why the classification loss is focal loss.
4)loss function
Classification loss + regression (box prediction) loss.
a. Classification loss
Classification uses focal loss; the paper does not give implementation details. My guess is that each class (each feature map of the classification branch, taking the positive/negative/ignore regions into account) is trained as a separate binary classifier with sigmoid.
b. Regression branch
Unlike DenseBox and UnitBox, FoveaBox does not directly learn the distances from the target center to the four sides; instead it learns a mapping between predicted coordinates and the ground-truth coordinates. Given a ground-truth box, the goal is to learn a transform that relates the center-region points to the box boundary, as follows.
A simple L1 loss is then used for optimization, where a normalization factor maps the output space to a space centered around 1 to make training stable; finally a log-space function is used for regularization.
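For reference, the mapping in the FoveaBox paper has roughly the following form (reproduced here from the paper's description, so treat it as a sketch): (x, y) is a positive cell on level l, (x1, y1, x2, y2) the ground-truth box, and z = √S_l the normalization factor mentioned above:

$$
t_{x_1} = \log\frac{2^{l}(x+0.5)-x_1}{z},\quad
t_{y_1} = \log\frac{2^{l}(y+0.5)-y_1}{z},\quad
t_{x_2} = \log\frac{x_2-2^{l}(x+0.5)}{z},\quad
t_{y_2} = \log\frac{y_2-2^{l}(y+0.5)}{z}
$$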
5)tricks
4. Results
From the above figure, we can see that compared to anchor-based, the anchor-free algorithm is more robust to the scale of the target, and there is no need to laboriously design the size of the anchor.
Next is the comparison of AP and AR:
1) Comparison of FoveaBox and RetinaNet
2) Comparison with FSAF (CVPR 2019)
3) Comparison with other SOTA methods (coco test-dev)
F. CenterNet (Objects as Points, open source)
1. Main contributions
1) The algorithm removes inefficient and complex Anchors operations, further improving the performance of the detection algorithm;
2) The algorithm performs filtering operations directly on the heatmap, eliminating the time-consuming NMS post-processing operations and further improving the running speed of the entire algorithm;
3) The algorithm can not only be applied to 2D target detection, but also can be applied to other tasks such as 3D target detection and human key point detection after simple changes, that is, it has good versatility.
2. Main ideas
CenterNet converts object detection into a center-point prediction problem: the target is represented by its center point, and the target's rectangular box is obtained by predicting the center point's offset together with the target's width and height.
The module contains 3 branches, including the center point heatmap branch, the center point offset branch, and the target size branch:
1) The heatmap branch contains C channels, each channel contains a category, and the white bright area in the heatmap indicates the center point position of the target;
2) The center point offset branch compensates for the pixel error introduced when points on the downsampled heatmap are mapped back to the original image;
3) The target size branch predicts the width and height (w, h) of the target rectangle.
The implementation steps of the inference phase of the CenterNet network are as follows:
Step 1 - Input an image, process the image size into 512*512 and use it as the input of the network;
Step 2 - Executing the forward calculation of the network will result in 3 outputs: heatmap of size [1,80,128,128]; size prediction of size [1,2,128,128]; offset prediction of size [1,2,128,128];
Step 3 - The heatmap is passed through a sigmoid so its values lie between 0 and 1, then a max-pooling operation is applied to it (kernel 3, stride 1, pad 1). This step effectively filters repeated boxes and is an important reason why NMS is no longer needed: a 3x3 kernel, combined with the stride of 4 between the feature map and the input image, means that roughly every 12x12 region of the input image yields no repeated center point; the idea is simple and effective. Then the top K points with the highest scores are selected from the heatmap (default K = 100), which determines the center points of the 100 most confident prediction boxes and also removes some duplicates. A confidence threshold can be set so that only outputs above the threshold are kept;
Step 4 - Determine the size of each prediction box from the predicted size values and offsets. The resulting box information lives on the 128x128 feature map, so it is finally mapped back to the input image to obtain the final prediction result.
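A minimal PyTorch-style sketch of these four steps (tensor layouts and names are assumptions for illustration, not taken from the official repository):

```python
import torch
import torch.nn.functional as F

def decode_centernet(heatmap, wh, offset, k=100, down_ratio=4):
    """heatmap: [1, C, H, W] raw logits; wh, offset: [1, 2, H, W].
    Returns boxes [k, 4] in input-image coordinates, plus scores and classes."""
    heat = torch.sigmoid(heatmap)
    peaks = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    heat = heat * (peaks == heat).float()            # keep only local maxima ("pseudo-NMS")

    c, h, w = heat.shape[1:]
    scores, idx = heat.flatten(1).topk(k)            # top-K over all classes and positions
    cls = idx[0] // (h * w)
    ys = (idx[0] % (h * w)) // w
    xs = idx[0] % w

    off = offset[0, :, ys, xs].T                     # [k, 2] sub-pixel offsets
    sz = wh[0, :, ys, xs].T                          # [k, 2] predicted (w, h) on the heatmap
    cx, cy = xs.float() + off[:, 0], ys.float() + off[:, 1]
    boxes = torch.stack([cx - sz[:, 0] / 2, cy - sz[:, 1] / 2,
                         cx + sz[:, 0] / 2, cy + sz[:, 1] / 2], dim=1) * down_ratio
    return boxes, scores[0], cls
```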
3. Specific details
1)input
512*512
The input image is adjusted to 512*512: the long side is scaled to 512 and the short side is zero-padded. The effect is shown in the figure below: since the original image's W > 512 it is directly scaled to 512, and since the original image's H < 512 it is zero-padded;

2)backbone
In the paper, three network architectures of Hourglass, ResNet and DLA were tried. The accuracy and frame rate of each network architecture are as follows:
ResNet-18 with up-convolutional layers: 28.1% COCO AP and 142 FPS
DLA-34: 37.4% COCO AP and 52 FPS
Hourglass-104: 45.1% COCO AP and 1.4 FPS
3)neck & head
This module contains 3 branches, including the center point heatmap branch, the center point offset branch, and the target size branch.
The heatmap graph branch contains C channels, each channel contains a category, and the white bright area in the heatmap indicates the center point position of the target;
The center point offset branch compensates for the pixel error introduced when points on the downsampled heatmap are mapped back to the original image;
The target size branch predicts the width and height (w, h) of the target rectangle.
a. The heatmap carries the classification information, and each category gets its own heatmap. For a given heatmap, when some coordinate contains the center of a target, a keypoint is generated there, and a Gaussian circle is used to represent the whole keypoint: as long as a predicted point lies within a certain radius r of the center, the box it forms still has an IoU greater than 0.7 with the gt_bbox, so the values at these points are set to a Gaussian distribution value instead of 0. The figure below shows the specific details.

The specific steps to generate a Heatmap are as follows:
Step 1 - Scale the input image to a size of 512*512, and perform a downsampling operation of R=4 on the image to obtain a Heatmap image of a size of 128*128;
Step 2 - Scale the Box in the input picture to the Heatmap of 128*128 size, calculate the coordinates of the center point of the Box, perform the rounding down operation, and define it as point;
Step 3 - Compute the radius R of the Gaussian circle from the size of the target box. The radius mainly depends on the width and height of the box; in practice IoU = 0.7 is used (overlap = 0.7 in the figure below) as the critical value, the radii of the three cases are computed, and the minimum is taken as the radius R of the Gaussian kernel. The implementation details are shown in the figure below:
(1) Case 1 - The predicted box pred_bbox contains the gt_bbox, corresponding to the first case in the figure below; after expanding the IoU formula, this becomes a univariate quadratic equation to solve.
(2) Case 2 - The gt_bbox contains the predicted box pred_bbox, corresponding to the second case in the figure below; after expanding the IoU formula, this becomes a univariate quadratic equation to solve.
(3) Case 3 - The gt_bbox and the predicted box pred_bbox partially overlap, corresponding to the third case in the figure below; after expanding the IoU formula, this becomes a univariate quadratic equation to solve.
Step 4 - On the Heatmap of 128*128 size, use point as the center point and radius R to calculate the Gaussian value. The value at the point point is the largest, and the value decreases with the increase of the radius R;
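For reference, a sketch following the gaussian_radius routine commonly shipped with the CornerNet / CenterNet code (the three branches correspond to the three cases above, min_overlap = 0.7; the exact root form used there is discussed in the Gaussian-radius reference link at the end):

```python
import math

def gaussian_radius(det_size, min_overlap=0.7):
    """det_size = (height, width) of the target box on the heatmap."""
    height, width = det_size

    # each case reduces to a quadratic equation in the radius r after expanding the IoU formula
    a1, b1 = 1, height + width
    c1 = width * height * (1 - min_overlap) / (1 + min_overlap)
    r1 = (b1 + math.sqrt(b1 ** 2 - 4 * a1 * c1)) / 2

    a2, b2 = 4, 2 * (height + width)
    c2 = (1 - min_overlap) * width * height
    r2 = (b2 + math.sqrt(b2 ** 2 - 4 * a2 * c2)) / 2

    a3, b3 = 4 * min_overlap, -2 * min_overlap * (height + width)
    c3 = (min_overlap - 1) * width * height
    r3 = (b3 + math.sqrt(b3 ** 2 - 4 * a3 * c3)) / 2
    return min(r1, r2, r3)
```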

The figure above shows an example. The left side is the cropped 512x512 input image, and the right side is the 128x128 heatmap generated after the Gaussian operation. Since the picture contains two cats of the same class, two Gaussian circles are generated on the same heatmap, and the size of each Gaussian circle is related to the size of its box.
4)loss function
The overall loss consists of three parts: L_k is the heatmap center-point loss, L_off the target center-point offset loss, and L_size the target width/height loss.
a. Heatmap loss function
N is the number of targets in the input image, i.e., the number of keypoints in the heatmap. The heatmap loss is a pixel-wise focal-loss variant of cross entropy (α = 2, β = 4 in the paper):

$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}(1-\hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1\\[2pt] (1-Y_{xyc})^{\beta}\,(\hat{Y}_{xyc})^{\alpha}\log(1-\hat{Y}_{xyc}) & \text{otherwise}\end{cases}$$

The log terms are the cross-entropy part, the (1-Ŷ)^α / Ŷ^α factors are the focal-loss part, and the (1-Y)^β factor suppresses the loss within the Gaussian radius, because those positions are very close to an object's bounding-box center.
b. Offset loss function
Here Ô_p̃ is the offset predicted by the network, p is the center-point coordinate in the image, R is the downsampling factor of the heatmap, and p̃ is the integer (floored) coordinate of the scaled center point; the offset loss of positive locations is computed with L1 loss. Because the feature map output by the backbone has one quarter of the spatial resolution of the input image, each pixel of the output map corresponds to a 4x4 region of the original image, which introduces a noticeable error; that is why the offset loss is added.
For example, suppose the target center p is (125, 63). With an input size of 512*512 and scaling factor R = 4, the center on the 128x128 map is (31.25, 15.75); relative to the integer coordinate (31, 15), the offset is therefore (0.25, 0.75).
c. Target width/height loss function
Here N is the number of keypoints, s_k is the true size of the target and ŝ_pk the predicted size; the width/height loss of positive locations is computed with L1 loss.
5)tricks
a. If the heatmaps (Gaussian circles) of two objects overlap, the maximum value is taken at the overlapping locations.
b. If the center points of two objects coincide, the algorithm does not resolve it; in the paper's evaluation set (COCO), this case accounts for only 0.1%.
4. Results
The table above shows the accuracy and speed of CenterNet detection on the COCO validation set. Row 1 shows that Hourglass-104 as the backbone obtains 40.4 AP at 14 FPS; row 2 shows the AP and FPS with DLA-34; rows 3 and 4 show the ResNet-101 and ResNet-18 backbones on the COCO validation set. The DLA-34 backbone achieves a good trade-off between accuracy and speed.
Summary of dense-prediction-based methods
Similarities and differences between FSAF, FCOS, and FoveaBox:
1. All three use FPN for multi-scale detection.
2. All three decouple classification and regression into two sub-networks.
3. All three perform classification and regression via dense prediction.
4. The regression of FSAF and FCOS predicts the distance to the four boundaries, while the regression of FoveaBox predicts a coordinate transformation.
5. FSAF selects more appropriate features to improve performance through online feature selection. FCOS improves performance by eliminating low-quality bboxes through the center-ness branch. FoveaBox improves performance by only predicting the center area of ​​the target.
Similarities and differences between (DenseBox, YOLO) and (FSAF, FCOS, FoveaBox):
1. Both classification and regression are performed through dense prediction.
2. (FSAF, FCOS, FoveaBox) use FPN for multi-scale target detection, while (DenseBox, YOLO) only have single-scale target detection.
3. (DenseBox, FSAF, FCOS, FoveaBox) are obtained by decoupling classification and regression into two sub-networks, while (YOLO) classification and positioning are obtained uniformly.
The following are anchor-free detectors based on keypoint estimation.
A. CornerNet (open source)
1. Main contributions
1) Designed heatmaps for top-left and bottom-right corners to find the points most likely to be top-left and bottom-right corners, plus a branch that outputs an embedding vector to help match top-left with bottom-right corners.
2) Proposed Corner Pooling, because with the change of the detection task, conventional pooling is not well suited to this framework (introduced later).
2. Main ideas
The network has two branches: one predicts top-left corners and the other predicts bottom-right corners.
Each branch has three outputs: heatmaps predict which points are most likely to be corners; embeddings predict which object each corner belongs to (solving how to match an object's top-left and bottom-right corners); and offsets refine the corner positions.
For corner detection, CornerNet produces two sets of heatmaps, for the top-left and bottom-right corners respectively. The heatmaps give the positions of corners of each class in the image together with a confidence score for every corner; in addition, an embedding vector and an offset are produced for each corner. The embedding vector is used to decide whether a top-left corner and a bottom-right corner belong to the same object, and the offset fine-tunes the corner position. In the proposal-generation stage, the top-k top-left and bottom-right corners are selected from the heatmaps; then, the distance between the embedding vectors of each corner pair is computed, and if the distance is smaller than a preset threshold, the two corners are considered to belong to the same object and a bounding box is generated from them, with the average of the two corners' scores as the box score.
3. Specific details
1)input
511 × 511 (the output is 128*128 after 4× downsampling)
2)backbone
Hourglass network (modified, and without pre-training)
3)neck & head
It is divided into four parts: the heatmap branch (class prediction), the offset branch (corner-offset refinement), the embedding branch (matching top-left and bottom-right corners), and Corner Pooling.
a. Heatmaps and the penalty-reduction strategy
Each set of heatmaps has C channels, where C is the number of categories (no background channel). A heatmap takes its largest value at the gt corner position, and the closer a location is to the gt position, the larger its value; this is implemented with a Gaussian function centered at the gt position, decaying with distance from that center, i.e.:
Sigma is one third of the radius, and the radius is chosen so that the minimum IoU between the boxes formed by points within the radius and the GT is greater than 0.3.
b. offset branch
The feature map the network predicts on is smaller than the original image. Assuming the downsampling factor is n, a position (x, y) in the original image maps to ([x/n], [y/n]) on the feature map, where [ ] denotes rounding down. When mapping back to the original image position there is inevitably some error; the offsets branch corrects the position on this basis, with the regression target:
c. embedding branch
The main function of this branch is to group corners. The author uses 1-dimensional embeddings, i.e., different targets are assigned different ids such as 1, 2, 3, ... At prediction time, if the embedding values of a top-left corner and a bottom-right corner are very close, for example 1.2 and 1.3, the two corners are likely to belong to the same target; if the gap is large, say 1.2 versus 2.3, they are generally two different targets. To achieve such a prediction, two things must hold:
1. The embedding values predicted for the two corners of the same target should be as close as possible;
2. The embedding values predicted for different targets should be as far apart as possible.
d.  Corner Pooling
Our usual max pooling is generally centered on the current location with something like a 3x3 kernel, so the receptive field is naturally centered on that location. Corner detection, however, cares more about single directions than about such a square receptive field: taking the top-left corner as an example, it cares about the information extending horizontally to the right and vertically downward. With this in mind, the author proposes Corner Pooling, whose principle is as follows:
For example, with a 10x10 feature map and a point at (2,1), top-left corner pooling computes the maximum along the line from (2,1) to (2,10) and the maximum along the line from (2,1) to (10,1), and adds the two. In practice it can be computed with a reverse scan, as the diagram shows:
Take 2,1,3,0,2 in the figure above as an example: the last 2 stays unchanged; the second-to-last becomes max(0,2) = 2, giving ...2,2; the one before that becomes max(3,2) = 3, and so on, yielding 3,3,3,2,2.
Personally, I feel this amounts to a one-dimensional (reverse cumulative) max operation, sketched below.
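A minimal numpy sketch of top-left corner pooling implemented as reverse cumulative maxima (my reading of the operation; the other corner/direction variants are symmetric):

```python
import numpy as np

def top_left_corner_pool(x: np.ndarray) -> np.ndarray:
    """x: feature map [H, W]. For each location, max over everything to its right
    plus max over everything below it."""
    right_max = np.maximum.accumulate(x[:, ::-1], axis=1)[:, ::-1]   # max over columns j..W-1
    bottom_max = np.maximum.accumulate(x[::-1, :], axis=0)[::-1, :]  # max over rows i..H-1
    return right_max + bottom_max

row = np.array([[2, 1, 3, 0, 2]], dtype=float)
print(np.maximum.accumulate(row[:, ::-1], axis=1)[:, ::-1])          # [[3. 3. 3. 2. 2.]]
```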
In addition, the author modified the prediction module based on the ResNet residual structure, changing the first 3x3 convolution; the final prediction structure is:

4)loss function
It is divided into three parts: the heatmap branch (class prediction), the embedding branch (matching top-left and bottom-right corners), and the offset branch (corner-offset refinement).
a. Heatmaps

Where p is the predicted value and y is the real value. This function is modified on the basis of focal loss.
b. offset branch
c. embedding branch
Here e_tk and e_bk are the embeddings predicted by the top-left and bottom-right branches, and e_k is their average. L_pull pulls the predicted values of the same target closer together, while L_push pushes the embedding values of different targets farther apart.
The overall loss is as follows:
Here the hyperparameters alpha and beta are 0.1, and gamma is 1.
5)tricks
4. Results
During testing, 3x3 max pooling is used to suppress non-maxima on the heatmaps, the top 100 results are taken from each heatmap, and corners are then matched by the L1 distance between their embeddings. Note that matching is done within each category; pairs from different categories, or with a large L1 distance, are not considered. The author's final experimental results are as follows:
These are very good results among one-stage methods, and they are competitive with many two-stage methods.
B. CenterNet (open source)
Note that this CenterNet is an improved version of CornerNet; the original paper is called CenterNet: Keypoint Triplets for Object Detection.
1. Main contributions
Two methods are proposed to enrich the information of center points and corner points:
Center pooling: used in the branch that predicts center keypoints. Center pooling helps center keypoints capture more recognizable information inside an object, which helps filter candidate boxes.
Cascade corner pooling: builds on CornerNet's corner pooling and gives corners the ability to perceive information inside the object.
2. Main ideas
1) Use Hourglass as the backbone to extract the feature of the image;
2) Use the Cascade Corner Pooling module to extract the Corner heatmaps of the image, and use the same method as in CornerNet to obtain the bounding box of the object according to the upper left corner and the lower right corner. All bounding boxes define a central area (details later - neck&head introduction)

3) Use the Center Pooling module to extract the Center heatmap of the image, and get all the object center points according to the Center map.
Specifically: In the center heatmap, according to the response value of the heatmap, select top-k center keypoints, use the corresponding offset map to fine-tune these keypoints, and obtain more accurate keypoint positions.
4) Use the object's center points to further filter the bounding boxes extracted in 2): if no center point falls inside a box's central area, the box is considered unreliable; if a center point does fall in the central area, the box is kept and its score becomes the average of the three points (the two corners and the center).
3. Specific details
1)input
511*511 (the feature map is 128*128 after 4x downsampling)
2)backbone
hourglass
3)neck & head
Here you need to pay attention to the following concepts: center pooling, cascade Corner pooling, and center area settings.
a.Center pooling
The absolute geometric center of an object does not necessarily convey the most recognizable information. For example, for person recognition the face is an important cue, but the center point of the whole person is usually not on the face. Center pooling is proposed to capture richer visual information for center points.
The figure below shows the process of center pooling. Specifically: the backbone produces a feature map, and for each pixel we check whether it is a center point by taking the maximum value along its horizontal direction and the maximum along its vertical direction and adding them. Through this operation, center pooling helps locate center points better.
Note: this branch is similar to CornerNet's branches, except that the embedding branch is removed; the heatmap has as many channels as there are categories, and the offset map has two channels.

Method to realize:
It can be implemented easily by combining corner pooling modules: as shown in figure (a) below, to take the maximum in the horizontal direction we only need to connect a left pooling and a right pooling in series.

b. Cascade corner pooling
A corner is just that, a corner, and it usually lies away from the real object. Corner pooling finds the maxima along the vertical and horizontal directions to locate corners, but it is very sensitive to the boundary. To let corners "see" the information inside the object, cascade corner pooling first finds a maximum along the boundary, then "looks inward" along the direction of that maximum to find an internal maximum, and combines the two. See figure (b) above.
c. Central area
The size of the central area is critical. For a small bounding box, the smaller the central area, the lower the recall, because many true positives would be judged negative; for a large bounding box, the larger the central area, the lower the precision, because many negatives would be judged positive. This paper therefore proposes a scale-aware central area that adapts to the size of the bounding box: a relatively small central area is generated inside a large bounding box, which effectively improves precision, while a relatively large central area inside a small bounding box effectively improves recall.

Here tlx, tly are the coordinates of the top-left corner; brx, bry of the bottom-right corner; ctlx, ctly of the central area's top-left corner; and cbrx, cbry of the central area's bottom-right corner. n is an odd number that determines the size of the central area; the bbox is effectively divided into 9 parts (n=3) or 25 parts (n=5). The authors use n=5 when the bbox size is greater than 150 and n=3 when it is smaller. The final effect is shown below: the solid rectangle is the bounding box and the shaded part is the central area.
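A hedged sketch of that scale-aware central region (interpreting the "size greater than 150" threshold as the box scale is my assumption): for n = 3 the central region is the middle third of the box, for n = 5 the middle fifth:

```python
def central_region(tlx, tly, brx, bry, scale_threshold=150):
    n = 5 if max(brx - tlx, bry - tly) > scale_threshold else 3   # assumption: "size > 150" = box scale
    ctlx = ((n + 1) * tlx + (n - 1) * brx) / (2 * n)
    ctly = ((n + 1) * tly + (n - 1) * bry) / (2 * n)
    cbrx = ((n - 1) * tlx + (n + 1) * brx) / (2 * n)
    cbry = ((n - 1) * tly + (n + 1) * bry) / (2 * n)
    return ctlx, ctly, cbrx, cbry

print(central_region(0, 0, 90, 90))    # -> (30.0, 30.0, 60.0, 60.0): the middle third
```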

4)loss function
Of these terms, the first two are the focal losses for detecting corner points and center points respectively; the next pair, similar to CornerNet, uses a pull loss and a push loss to group corners correctly; and the last two use L1 loss to fine-tune the corner and center positions. The exact meaning of each loss is not elaborated in the paper, as it is mainly an improvement on CornerNet; refer to CornerNet.
Hyperparameter settings: α, β, γ=0.1,0.1,1.
5)tricks
4. Results
The effect is very good, and it is the best among the anchor free and one-stage methods.
C. ExtremeNet (open source)
1. Main contributions
It is proposed to detect extreme points instead of corner points, and obtain the detection results directly through the combination of geometric relations.
Solve the following problems:
1) In bottom-up methods such as CornerNet, the detected corners usually lie outside the target, which makes them harder to detect; extreme points lie on the object itself.
2) Labeling cost: manually labeling extreme points is easier and less time-consuming than labeling bounding boxes.
2. Main ideas
An improvement on CornerNet: predict the 4 extreme points and 1 center point of an object, combine them according to their geometric distribution, construct the prediction box from the extreme points, and obtain the detection result.
3. Specific details
1)input
511*511
2)backbone
Hourglass-104
(In fact the network is computationally very heavy: about 140T for a 256x256 input, while ResNet-50 is about 8T under the same conditions, which also makes the overall speed very slow.)
3)neck & head
The prediction head outputs 5 heatmaps of size H*W, corresponding to the 4 extreme points and the center point; each heatmap has C channels, corresponding to the number of object categories.
In addition, 4 offset maps with 2 channels each are predicted, corresponding to the 4 extreme points; they fine-tune the loss of localization accuracy caused by downsampling of the extreme-point coordinates. The offsets are class-agnostic, and the center point needs no fine-tuning.
A Center Grouping procedure groups the 4 extreme points together with the center point; a sketch is given below.
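A brute-force sketch of Center Grouping under simplifying assumptions (single class, peaks already extracted as integer (x, y, score) tuples, and the final score taken as the average of the five responses, which is an illustrative choice):

```python
def center_grouping(tops, lefts, bottoms, rights, center_heat, tc=0.1):
    """Enumerate extreme-point combinations and keep those whose geometric center
    has a high enough response on the center heatmap."""
    boxes = []
    for t in tops:
        for l in lefts:
            for b in bottoms:
                for r in rights:
                    if t[1] > b[1] or l[0] > r[0]:        # geometrically invalid combination
                        continue
                    cx, cy = (l[0] + r[0]) // 2, (t[1] + b[1]) // 2
                    if center_heat[cy, cx] >= tc:         # center response must be high enough
                        score = (t[2] + l[2] + b[2] + r[2] + center_heat[cy, cx]) / 5
                        boxes.append((l[0], t[1], r[0], b[1], score))
    return boxes
```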

4)loss function
The loss calculation of ExtremeNet also follows some practices in CornerNet, calculating the classification loss of focal loss deformation and the loss of positioning accuracy caused by downsampling, without the Embedding process in CornerNet.
5) tricks
a. Training sample generation
The COCO dataset is used for training. Since there are no direct extreme-point labels, the extreme points are computed from the segmentation mask labels and then used for training.
(ExtremeNet's training data are therefore not quite the same as ordinary detection datasets; the training data carry more information, i.e., extreme points have more semantic information than corners. From a practical standpoint, manually labeling extreme points is more efficient than labeling corners, so this is still an advantage in use.)
b.  Ghost box suppression
Since the combination of extreme points is an exhaustive process determined only by the coordinate relationship between the extreme points and the center point, false-positive results can be introduced. For example, when three small objects of the same category appear side by side, the left extreme point of the left object and the right extreme point of the right object produce, after the coordinate calculation, a geometric center that lands on the middle object and therefore gets a high response, yielding a long box spanning the three small objects.
The suppression rule is: if the sum of the scores of all boxes contained inside a box exceeds 3 times the score of the containing box itself, the containing box's score is divided by 2.
  • The factor of 3 here is a lower bound; if a box contains more objects (say 5), it is also suppressed.
  • Dividing by 2 lowers the score of the large prediction box so that it can be filtered out in the subsequent NMS step.
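A small numpy sketch of this suppression rule (the box format [x1, y1, x2, y2, score] and the contains() helper are illustrative assumptions):

```python
import numpy as np

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def suppress_ghost_boxes(boxes):
    boxes = np.asarray(boxes, dtype=float)
    for i, big in enumerate(boxes):
        contained_scores = [b[4] for j, b in enumerate(boxes) if j != i and contains(big, b)]
        if sum(contained_scores) > 3 * big[4]:   # contained scores exceed 3x the box's own score
            boxes[i, 4] /= 2                     # halve the ghost box's score
    return boxes
```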
c.  Edge aggregation (currently not understood)
Another problem with predicting extreme points: if an object has an edge parallel to a coordinate axis, then by design every point along that edge should produce an extreme-point response, but each individual response may be fairly weak, possibly weaker than the single extreme point obtained after rotating the object by some angle. Edge aggregation is intended to strengthen the response of such parallel edges and improve prediction.
 

4. Results
The results of the ablation experiment:
  • The first row is the standard ExtremeNet; the multi-scale entry in the second row refers to multi-scale augmentation of the input image.
  • In the second part, when Center Grouping is removed and the combination step is replaced with a CornerNet-style embedding, performance drops by 2.1% mAP.
  • Edge aggregation and ghost-box removal help mainly for large objects and have little effect on small ones.
  • The third part is error analysis, replacing predictions with ground truth. Replacing the center brings only a modest improvement, which indicates that center-feature extraction is already decent; replacing the extreme points, and replacing both at once, brings a large improvement, showing there is still much room to improve extreme-point extraction and the extreme-center grouping.
Comparison with other target detection frameworks:
Reference links:
18. CenterNet (Objects as Points) details: https://blog.csdn.net/WZZ18191171661/article/details/113753991
19. CenterNet (Objects as Points) details: https://mp.weixin.qq.com/s/hlc1IKhKLh7Zmr5k_NAykw
20. CenterNet (Objects as Points) details: https://zhuanlan.zhihu.com/p/66048276
21. CenterNet (Objects as Points), Gaussian kernel radius explained: https://zhuanlan.zhihu.com/p/96856635
23. CornerNet: https://arxiv.org/abs/1808.01244
1. Overview of anchor free
1. You must first know what the anchor is (this requires you to understand the second stage such as faster rcnn, the first-order detector such as YOLO V2 or later or SSD, etc.).
In the past, target detection was usually modeled as the classification and regression of candidate frames. However, according to the different generation methods of candidate regions, it was divided into two-stage (two-step) detection and single-stage (one-step) detection. The former's The candidate box generates a proposal through the RPN (regional recommendation network) network, which generates a priori box (anchor) of different sizes and aspect ratios at each position on a specific feature map. As shown below:
2. Why abandon the anchor and make anchor free
1) Anchor settings need to be manually designed (aspect ratio, scale size, and the number of anchors), and different designs are required for different data sets, which is quite troublesome.
2) The matching mechanism of Anchor makes the matching frequency of extreme scales (extremely large and extremely small objects) lower than that of moderately sized objects. It is not easy for DNN to learn these extreme samples well during learning.
3) The huge number of Anchors causes a serious imbalance problem, which involves a sampling process. In fact, the strategy similar to Focal loss is not stable, and there are many pits in the sampling.
4) The number of Anchors is huge, and IOU calculations are required for each one, which consumes a huge amount of computing power and reduces efficiency.
3. The direction of anchor free
It can be traced back to the YOLO algorithm, which should be the earliest anchor-free model, and the recent anchor-free method is mainly divided into two types based on dense prediction and key point estimation .
4. Limitations of anchor free
 At present, in order to achieve better-looking results, the paper hides some details or has some unfair comparisons in the experiment (such as using hourglass in the backbone network to compare other people's resnet, etc.).
5. Anchor free project recommendation
Due to the introduction of YOLOV5, we mainly understand the idea of ​​anchor free, and engineering applications can mainly try:
1) centerNet (object as point version)
2) extremeNet (change the regression bounding box to extreme points)
2. Interpretation of anchor free
Introduce according to the following structure
X、XXXXX
1. Main contributions
2. Main ideas
3. Specific details
1)input
2)backbone
3)neck & head
4)loss function
5)trics
4. Results
    
A. dense box (Baidu IDL & Horizon in September 2015)
1. Main contributions
1) Introducing FCN into target detection can achieve good efficiency and accuracy; end2end multi-task target detection framework.
2) Introduce the target landmark positioning into the multi-task learning of DenseBox, and the detection accuracy can be further improved;
3) In face detection, DenseBox achieved sota on MALF and KITTI;
2. Main ideas
As shown above:
1) Use the image pyramid, input CNN, directly predict the bbox + classification score on the feature map (that is, a 5-layer feature map: 4 layers represent Bbox regression, and one layer represents the binary classification face score)
2) Except for the NMS step of DenseBox, the entire model is a full convolution operation, there is no fully connected layer, and no region proposal generation step is required;
3) The image is down-sampled + up-sampled (bilinear interpolation), which is similar to the segmentation network FCN;
3. Specific details
1)input
a. Training phase
The original image contains too much background, and the training takes a lot of time in the calculation and judgment of the background area, which is of little significance; therefore, using the cropped image patch to participate in the DenseBox training, as long as the cropped image patch contains a face and has Enough background information will do.
Specific operation: DenseBox training is a bit of an image segmentation method. First, crop the image patch from the original image, and then resize it to 240 x 240. The center of the image contains a face gt bbox with a height of about 50 pix; this operation is easy to implement , which is equivalent to cropping a patch from the original image + gt image, but this patch must ensure that the face is in the center of the patch, and the size of the gt bbox becomes 50 x 50 after resize, so the specific patch sub-area of ​​how large the crop is needs to be done Calculate in equal proportions; take a chestnut: a face gt bbox in the original image is 80 x 80, we need to ensure that the scale of the face after resize is 50 x 50, then the patch sub-area that needs to be cropped in the original image is 240 x 80 / 50 = 384, that is, 384 x 384 sub-area.
b. Testing phase
Use an image pyramid
2)backbone
A total of 16 layers of convolution operations, multiplexing the first 12 layers of network parameters of VGG19.
3)neck & head
Conv 4-4 is followed by four 1 x 1 conv layers; as can be seen from fig 3, it is actually divided into two branches: two 1 x 1 conv layers finally output a 1-channel feature map for face classification scores calculation; the other two 1 x 1 conv layers output a 4-channel feature map, which is used for the calculation of face bbox regression; conv 3-4 and conv 4-4 feature map features are fused, and the concatenation is done; the author believes that the lower layer The features on the feature map contain more local details of the target, making it easier to judge the local area of ​​the target; the features on the high-level feature map, because they have a larger receptive field, contain more voice information and context information, are easier to Recognize the global information of the target; the receptive field of conv 3-4 is 48 x 48, which is almost the same as the face gt bbox scale, and the receptive field of conv 4-4 is 118 x 118, so more global information + context information can be integrated ;In addition, the scale of conv 4-4 feature map is half of the scale of conv3-4 (more pool3 operation), so before doing concate, you need to do a bilinear upsampling of conv 4-4 to ensure that both The scale is the same; because only concate, not element-wise addition, there is no need to ensure that the number of channels is the same.
4)loss
a. Classification loss, pixel-wise L2-loss on feature map.
The author said that the L2-loss performance in DenseBox is very good, and he did not try such as hinge loss, cross-entropy loss, etc.
b. The bbox regression loss is also the pixel-wise L2-loss on the feature map.
4)trick
a. Define positive and negative samples
In each mini-batch iteration, the number of positive and negative samples varies greatly, and negative samples occupy the vast majority; if these negative samples are used for training, then the final loss will be biased towards a large number of negative samples; at the same time, if those If the ambiguous positive and negative samples on the classification boundary are trained , the model will not learn valuable information, and the performance will also decline.
On the final feature map:
Positive sample: the center point of the label box (personal understanding and the center point is the center of the circle, and the height of the label box has a certain ratio, such as a point in a circle with a radius of 0.3);
Negative samples: points outside the above positive samples;
Ignore sample: The pixel (x, y) of the negative sample, if there is a positive sample pixel within two pixel distances near it, the pixel is regarded as an ignored sample.
b. Difficult Sample Mining
In mini-batch, all samples are first performed forward operation, and all pixel outputs are sorted in descending order according to the loss of formula (1), and top 1% are selected as difficult negative samples; and all positive samples (positive labeled pixels) are retained, Control the ratio of positive and negative samples to 1:1; half of all negative samples come from random sampling from non-hard negative (that is, the remaining top 99% negative samples), and half come from sampling from top 1% hard-negative; In a mini-batch, set Fsel = 1 to identify whether the sample is selected (positive sample + difficult negative sample + randomly selected negative sample)
Positive patch: a 240 x 240 px input patch whose center contains a positive sample in a specific scale range (around 50 x 50 px); each such patch may also contain several negative samples near the positive one. Negative patch (random patch): patches randomly cropped from training images at random scales and then resized to 240 x 240 px before being fed to the network. During training, the ratio of positive patches to random negative patches is 1:1.
c. Data Augmentation
Randomly flip positive and negative patches horizontally, translate by up to 25 px, and scale by a factor in [0.5, 1.25];
B. YOLO V1
1. Main contributions
One-stage, fast, end-to-end training
2. Main ideas
Through the network, the input image is mapped to a 7*7*30 tensor. 7*7 means the whole image is divided into a 7*7 grid, and each grid cell is given 2 candidate boxes (training can be understood as letting these boxes evolve), as shown in the left figure below; 30 is the encoding of each position, as shown in the right figure.
Probability of 20 object classifications: YOLO supports the recognition of 20 different objects (people, birds, cats, cars, chairs, etc.)
The position of 2 bounding boxes: each bounding box needs 4 values ​​to represent its position, (Center_x, Center_y, width, height)
Confidence of the 2 bounding boxes: the probability that the box contains an object * the IOU between the predicted box and the object's actual box. When the grid cell contains no object, the confidence label is 0; when it does, the confidence label is the IOU between the predicted box and the ground-truth box.
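To make the 7*7*30 encoding concrete, here is a hedged decoding sketch, assuming the layout is 20 class probabilities followed by two (x, y, w, h, confidence) groups (the exact channel order in a given implementation may differ):

```python
import torch

def decode_yolov1(output, conf_thresh=0.2):
    """output: (7, 7, 30) tensor = 20 class probs + 2 * (cx, cy, w, h, confidence).
    Returns a list of (class_id, score, cx, cy, w, h) in image-relative units."""
    detections = []
    for row in range(7):
        for col in range(7):
            cell = output[row, col]
            class_probs = cell[:20]
            boxes = cell[20:].reshape(2, 5)              # two boxes: (cx, cy, w, h, conf)
            for cx, cy, w, h, conf in boxes.tolist():
                cls_id = int(torch.argmax(class_probs))
                score = float(class_probs[cls_id]) * conf  # class prob * box confidence
                if score > conf_thresh:
                    # cx, cy are offsets inside the cell; convert to image-relative coords
                    detections.append((cls_id, score, (col + cx) / 7, (row + cy) / 7, w, h))
    return detections
```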
3. Specific details
1)input
The input is the original image; the only requirement is scaling it to 448*448. The main reason is that YOLO's convolutional layers are followed by two fully connected layers, which require a fixed-size vector as input.
2)backbone
Darknet
3) neck & head
As shown in the two figures above, there are 49*2=98 candidate regions in total; the sizes and shapes of the two bounding boxes are not preset, and there is no separate objectness output per bounding box: each grid cell simply predicts two bounding boxes for one object and keeps the more accurate one. It is not exactly standard supervision; it behaves more like an evolutionary process. One image can detect at most 49 objects. For each object, compute the center of its bounding box and find which grid cell the center falls into; in that cell's output vector the class probability of the object is set to 1 (that cell is responsible for predicting the object), while the prediction probability for that object in all other cells is set to 0 (they are not responsible).
4) loss function
λ_coord and λ_noobj: the object detection problem YOLO faces is a typical class-imbalance problem. Of the 49 grid cells, usually only 3 or 4 contain objects, and the rest do not. If nothing is done about this, the detection mAP will not be high, because the model leans toward the cells that contain no objects. The role of λ_coord is to give the cells that contain objects a larger weight in the loss function, so the model pays more attention to the loss they produce. In the paper, the values of λ_coord and λ_noobj are 5 and 0.5 respectively.
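For reference, the full YOLO v1 loss from the original paper (reconstructed here; S² = 49 grid cells, B = 2 boxes, and 1ᵒᵇʲ indicates the box responsible for an object):

$$
\begin{aligned}
\mathcal{L} = \; & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
+ \; & \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$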
5)tricks
a. Do not directly return the coordinate value of the center point, but return the displacement value relative to the coordinate of the upper left corner of the grid point
b. Each grid cell predicts two (or more) rectangular boxes. In the loss computation, only the box closest to the ground-truth object has its loss computed; the other boxes are not corrected. The author found that the two boxes of a grid cell gradually specialize in size, aspect ratio, or even category, which improves the overall recall.
c. At inference time, the maximum class probability p of a cell multiplied by the confidence c of the corresponding predicted box is used as the confidence of the output detection.
C. FCOS (Fully Convolutional One-Stage Object Detection, open source)
1. Main contributions
1) How to handle overlapping points: use FPN level assignment; within a level, an overlapped point is assigned to the box with the smallest area.
2) How to handle low-quality per-pixel boxes: propose the center-ness strategy (personally, I feel this is essentially the same as YOLO v1's objectness prediction).
2. Main ideas
Using FPN, at every point of each feature map predict: classification + box regression + whether the point is near a center (again, essentially similar to YOLO v1's objectness prediction).
3、具体细节
1)input
No special requirement.
2)backbone
Nothing special; the experiments tried ResNeXt-32x8d-101-FPN among others.
3)neck & head
a. As shown in figure 2 above, note that it differs slightly from the original FPN implementation:
FCOS takes the backbone feature maps C3, C4, C5 and builds P3, P4, P5 through lateral connections; P6 and P7 are obtained by applying a stride-2 convolution to P5 and P6 respectively. There are 5 final feature maps in total, with downsampling factors of 8, 16, 32, 64, and 128.
b. The label consists of a classification part and a regression part.
Given a location (x, y), if it falls inside any GT bounding box it is labeled as a positive example and given the class label c*; otherwise c* = 0, meaning background. In addition, the location (x, y) is given a 4-dimensional vector t* = (l*, t*, r*, b*) as its regression target, where l*, t*, r*, b* are the distances from the point to the left, top, right, and bottom sides of the bounding box, as shown in the figure below.
The regression targets are always positive, so the exp() function is used to map the regression output to (0, ∞), which makes the bounding boxes easier to distinguish.
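A small sketch of how the (l*, t*, r*, b*) targets can be computed for one GT box (illustrative code, not the official FCOS implementation):

```python
import torch

def fcos_regression_targets(points, gt_box):
    """points: (N, 2) tensor of (x, y) locations on the input image;
    gt_box: (x1, y1, x2, y2). Returns (N, 4) targets (l*, t*, r*, b*) and a positive mask."""
    x, y = points[:, 0], points[:, 1]
    x1, y1, x2, y2 = gt_box
    targets = torch.stack([x - x1, y - y1, x2 - x, y2 - y], dim=1)
    inside = targets.min(dim=1).values > 0   # a point is positive only if it lies inside the box
    return targets, inside
```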
Given an image, FCOS produces a class score p_{x,y} and a regression prediction t_{x,y} for every location; at inference, locations with p_{x,y} > 0.05 are selected as positives.
Anchor-based methods pick anchor boxes by their IOU with the GT, keeping those above a threshold as positives; FCOS, by contrast, can use as many foreground samples as possible to train the regressor.
c. Handling overlapping regions
Case 1 (same feature level): the bounding box with the smaller area is chosen as the regression target, which greatly reduces the number of ambiguous samples.
Case 2 (different feature levels, thanks to FPN): feature maps of different levels detect objects of different sizes.
Specifically:
if a location satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i-1}, it is set as a negative sample and is not required to regress a bounding box. Here m_i is the maximum distance that feature level i needs to regress; in this work, m2, m3, m4, m5, m6 and m7 are set to 0, 64, 128, 256, 512 and ∞.
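A toy version of this scale-based level assignment, using the thresholds quoted above:

```python
def assign_fpn_level(l, t, r, b, m=(0, 64, 128, 256, 512, float("inf"))):
    """Returns the index of the responsible level (0 -> P3, ..., 4 -> P7),
    or None if max(l, t, r, b) falls outside every level's range."""
    max_reg = max(l, t, r, b)
    for i in range(1, len(m)):
        if m[i - 1] < max_reg <= m[i]:
            return i - 1
    return None
```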
d.center-ness
After adopting FPN, there is still a gap between the network and state-of-the-art detectors; analysis showed the gap is caused by low-quality detection boxes.
In a CNN, a point on the feature map corresponds to the center of its receptive field, so the orange point in figure 5 necessarily predicts the current person better than the two green points; the two green points are therefore called low-quality prediction points, and center-ness is designed to suppress their weight. The center-ness is computed as:
centerness* = sqrt( min(l*, r*) / max(l*, r*) × min(t*, b*) / max(t*, b*) )
According to the formula, the closer a point is to the center of the ground truth, the higher its center-ness. A new question then arises: there are two ways to obtain center-ness. One: compute it directly from the predicted l*, r*, t*, b*; two: regress a center-ness value with a separate branch, trained against the label computed from l, r, t, b. The network structure in figure 4 shows that the author adopted the second way; in fact the author compared both:

The results speak for themselves: the separate center-ness regression branch improves mAP.
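The center-ness target itself is easy to compute from the regression targets; a minimal sketch (not the official code):

```python
import torch

def centerness_target(l, t, r, b):
    """Center-ness label computed from the regression targets (tensors of identical shape)."""
    lr_min, lr_max = torch.min(l, r), torch.max(l, r)
    tb_min, tb_max = torch.min(t, b), torch.max(t, b)
    return torch.sqrt((lr_min / lr_max) * (tb_min / tb_max))
```

As noted below, the predicted center-ness from the separate branch is trained against this target with a BCE loss.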
4)loss function
L({p}, {t}) = (1 / N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ / N_pos) Σ_{x,y} 1{c*_{x,y} > 0} · L_reg(t_{x,y}, t*_{x,y})
The category loss function uses focal loss, and the regression loss function uses IOU loss in the UnitBox paper.
Looking at the code, there is actually also a center-ness loss branch, which uses BCE loss; its label is the center-ness computed above, not a one-hot binary classification label.
5)tricks
none
4. Results
Comparative experimental results in the paper:
Figure 8 FCOS accuracy comparison
At least it can be seen that the accuracy of FCOS exceeds the old classic algorithm Faster R-CNN.
FCOS performance
The paper does not give exact speed numbers; the figure above is taken from the paper's official code page, and the values may differ slightly across machines. The inference speed is not faster than the two-stage Faster R-CNN, but it is basically real-time, and the mAP does exceed Faster R-CNN by a large margin.
D. FSAF (Feature Selective Anchor-Free Module for Single-Shot Object Detection)
1. Main contributions
Current FPN-based detectors match ground truth to anchors and feature levels mainly by IOU or by scale heuristics; FSAF contributes an automatic, learned matching mechanism instead.
2. Main ideas
Let every FPN level try to detect each instance, then see which level gives the smallest loss for that instance; that level is considered the most suitable one for detecting it.
3. Specific details
1)input
No special resolution requirement.
2)backbone
The author experimented with resnext-101
3) neck & head
a. Branch structure
The FSAF module lets each instance automatically select the most suitable feature level in FPN. In this module, the basis of feature selection changes from instance size to instance content, so the model learns by itself to pick the most suitable FPN level.
FSAF uses RetinaNet as the base structure and adds an FSAF branch in parallel with the original classification and regression subnets, so full end-to-end training is possible without changing the original structure. The FSAF branch likewise contains a classification branch (using sigmoid) and a box regression branch, used to predict the category and the coordinates of the target, as shown in the figure.
b. Label definition
First: for a target (instance), suppose its class label is c and its bounding box is b = (x, y, w, h), where (x, y) is the center of the target and (w, h) are its width and height.
Second: the projection of this box onto feature level l of the FPN is obtained by dividing its coordinates by the stride of that level, giving the projected box.
Third: the effective (positive) box is the projected box shrunk by a constant factor ε_e (0.2 in the paper), i.e. the white region of the "car" class in figure 3.
Fourth: the ignore box is the projected box shrunk by a constant factor ε_i (0.5 in the paper), i.e. the gray region of the "car" class in figure 3; pixels inside the ignore box but outside the effective box are excluded from back-propagation.
Class output: the class output is parallel to the anchor-based branches; its dimension is W×H×K, where K is the total number of classes (presumably including a background class). The class output has K feature maps; assuming the target's class is c (the car in the figure), the label is a W×H×K tensor whose c-th feature map is defined as the "car" class map in figure 3: the white region is the positive target region with value 1, the gray region is ignored (no gradient back-propagation), and the black region is the negative region with value 0. The loss used is Focal Loss.
Box output: the box output is parallel to the anchor-based branches; its dimension is W×H×4, where the 4 channels are offsets. To illustrate: for each pixel inside the effective area, the four values are the normalized distances from the pixel to the top, left, bottom, and right sides of the projected box, as shown in Figure 4; the normalization constant is S = 4.0. The loss used is IoU Loss.
At a pixel, the predicted 4 offsets multiplied by S give the distances to the four sides; from these the predicted top-left and bottom-right corners on the feature map are obtained, and multiplying by the stride of the level maps the predicted bounding box back to the image.
4)loss function

a. In anchor-based methods, a target is usually assigned to a feature level according to its size; the FSAF module instead selects the optimal level according to the target's content. The classification loss and localization loss of an instance on feature level l are defined as the focal loss and the IoU loss averaged over the effective area:
where the normalizer is the number of pixels inside the effective area.
The feature level best suited to predict the target is then obtained by the following formula, i.e. the level with the smallest joint loss.
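Schematically, the online feature selection can be sketched as follows (the loss functions here are placeholders standing in for the averaged focal and IoU losses described above):

```python
def select_best_level(instance, pyramid_levels, focal_loss_fn, iou_loss_fn):
    """Return the index of the FPN level whose anchor-free head gives the smallest
    joint loss for this instance (a schematic of FSAF's online feature selection)."""
    losses = []
    for level in pyramid_levels:
        cls_loss = focal_loss_fn(level, instance)   # averaged over the effective area
        reg_loss = iou_loss_fn(level, instance)     # averaged over the effective area
        losses.append(cls_loss + reg_loss)
    return min(range(len(losses)), key=lambda i: losses[i])
```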
b. How are the anchor-free branches combined with the anchor-based branches?
At inference, FSAF can output predictions as a standalone branch or jointly with the original anchor-based branches; when both are used, the outputs of the two branches are merged and NMS produces the final predictions. During training, a multi-task loss is used, as in the formula below.
The weighting factor λ of the anchor-free loss is 0.5.
5)tricks
none
4. Results
E. FoveaBox
1. Main contributions
Many ideas are similar to FCOS and FSAF; the main difference is that the regression does not directly learn the distances from the target's center to its four sides, but instead learns a mapping between a predicted coordinate and the ground-truth coordinates.
2. Main ideas
Directly learn the probability that an object exists and the coordinates of its ground-truth box (no pre-selected boxes are generated), mainly through two branches:
predict a class-sensitive semantic map giving the probability that an object exists;
predict a mapping between a center point and the box coordinates.
3. Specific details
1) input
There is no special requirement for image resolution.
2)backbone
Mainly compared with RetinaNet; for the backbone, ResNet-101 and ResNeXt-101 were tried.
3)neck & head
a. To make a fair comparison with RetinaNet, the author uses exactly the same network structure, i.e. ResNet + FPN, with pyramid levels P3 to P7 and the same input image resolution.
b. Match Bbox
Assume each FPN level predicts bounding boxes within a certain range; each pyramid level l has a basic area S_l ranging from 32*32 to 512*512, expressed as S_l = 4^l · S_0 with S_0 = 16. For l = 3 the basic area is 4³ · 16 = 1024 = 32*32. However, to let each level respond to a particular object scale, FoveaBox computes a valid range for each pyramid level as follows, controlled by a scale factor (my understanding: matching by area):
The author found experimentally that a control factor of 2 is optimal.
c. Positive and negative area determination
First, look at the classification subnetwork. Its output is a set of pyramid heatmaps, each of size H x W with K channels, where K is the number of categories. Given a ground-truth box (x1, y1, x2, y2), it is first mapped onto the target pyramid level:
The positive sample area (fovea) is designed as a shrunk version of the projected area, the same idea as in DenseBox; the reason is to prevent overlap between semantic regions. Here σ1 is the shrink factor, and each cell inside the positive area is assigned the corresponding class label. For negatives, another shrink factor σ2 is used to generate the negative region; cells assigned to neither region are ignored and do not take part in back-propagation, which is very similar to FSAF. Because of the sample imbalance, focal loss is used for optimization. The fovea area is computed as follows:

As shown in the figure below, the zoom factors divide the box into positive, negative, and ignored areas. The positive area occupies a relatively small fraction of the feature map, so positives and negatives are imbalanced during training, which is why the classification loss is focal loss.
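A hedged sketch of how the positive (fovea) and ignore regions could be produced by shrinking the projected gt box; the shrink factors here are illustrative defaults, not necessarily the paper's exact σ1 and σ2:

```python
def fovea_regions(gt_box, stride, sigma_pos=0.3, sigma_neg=0.4):
    """gt_box: (x1, y1, x2, y2) in image coordinates; stride: feature-map stride.
    Returns the positive (fovea) box and the larger box used to delimit the ignore band,
    both in feature-map coordinates. sigma_pos / sigma_neg are assumed values."""
    x1, y1, x2, y2 = (v / stride for v in gt_box)   # project onto the target pyramid level
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1

    def shrink(sigma):
        return (cx - 0.5 * sigma * w, cy - 0.5 * sigma * h,
                cx + 0.5 * sigma * w, cy + 0.5 * sigma * h)

    positive_box = shrink(sigma_pos)   # cells inside: label = class c
    ignore_box = shrink(sigma_neg)     # cells inside ignore_box but outside positive_box: ignored
    return positive_box, ignore_box
```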
4)loss function
Using classification + regression loss (border prediction)
a. Classification loss
Classification uses focal loss; the paper does not describe the exact implementation. My guess is that each class (each feature map of the classification branch, with its positive, negative, and ignored regions) is treated as an independent binary classification with a sigmoid.
b. Regression branch
Unlike DenseBox and UnitBox, FoveaBox does not directly learn the distances from the target center to the four sides, but instead learns a mapping between a predicted coordinate and the ground-truth coordinates. Given the ground-truth box, the goal is to learn a transformation from a positive cell (x, y) to the box coordinates, i.e. a relation between the center point and the box boundary, as follows:
A simple L1 loss is then used for optimization, where the normalization factor maps the output space to a space centered around 1, which stabilizes training; finally a log-space function is used to regularize the targets.
5)tricks
4. Results
From the figure above we can see that, compared with anchor-based detectors, the anchor-free algorithm is more robust to object scale and requires no effort at all to design anchor sizes.
Next, the AP and AR comparisons:
1) FoveaBox vs. RetinaNet
2) Comparison with FSAF (CVPR 2019)
3) Comparison with other SOTA methods (COCO test-dev)
F. CenterNet (Objects as Points, open source)
1. Main contributions
1) The algorithm removes the inefficient and complicated anchor operations, further improving detection performance;
2) It performs the filtering directly on the heatmap, removing the time-consuming NMS post-processing and further speeding up the whole pipeline;
3) Besides 2D object detection, with simple changes it can also be applied to 3D detection, human keypoint estimation, and other tasks, i.e. it generalizes well.
2. Main ideas
CenterNet turns object detection into center-point prediction: an object is represented by its center point, and its rectangular box is obtained by predicting the center-point offset together with the width and height.
The module contains 3 branches: a center-point heatmap branch, a center-point offset branch, and an object size branch:
1) the heatmap branch has C channels, one per category; the bright white regions of the heatmap mark object center points;
2) the center-point offset branch compensates for the pixel error introduced when mapping points from the downsampled heatmap back to the original image;
3) the object size branch predicts the w and h of the object's rectangular box.
The inference procedure of CenterNet is as follows:
Step 1 - take an input image and process it to 512*512 as the network input;
Step 2 - a forward pass produces 3 outputs: a heatmap of size [1,80,128,128]; a size prediction of size [1,2,128,128]; an offset prediction of size [1,2,128,128];
Step 3 - the heatmap passes through a sigmoid so its values lie in (0,1); then a max pooling operation is applied to it (kernel 3, stride 1, pad 1). This step effectively filters duplicate boxes and is an important reason why no NMS is needed afterwards: with a 3×3 kernel and a stride of 4 between the feature map and the input image, no duplicate center points can occur within any 12×12 region of the input, a very simple and effective idea. The top K highest-scoring points are then selected from the heatmap (K=100 by default), which fixes the centers of the 100 most confident boxes and removes more duplicates. A confidence threshold can also be set so that only boxes above it are output;
Step 4 - determine the size of each predicted box from the size and offset outputs. The resulting box coordinates all live on the 128×128 feature map, so they are finally mapped back onto the input image to obtain the final predictions.
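The four inference steps can be condensed into a short decoding routine; the sketch below assumes PyTorch tensors and that the heatmap has already passed through the sigmoid (it mirrors the steps above, not the official implementation):

```python
import torch
import torch.nn.functional as F

def centernet_decode(heatmap, wh, offset, k=100, down_ratio=4):
    """heatmap: (1, C, 128, 128) after sigmoid; wh, offset: (1, 2, 128, 128).
    Returns boxes (1, k, 4) on the input image, plus scores and classes."""
    # Step 3: 3x3 max pooling acts as a pseudo-NMS that keeps only local peaks.
    peaks = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    heatmap = heatmap * (peaks == heatmap).float()

    b, c, h, w = heatmap.shape
    scores, inds = torch.topk(heatmap.view(b, -1), k)            # top-k peaks over all classes
    classes = torch.div(inds, h * w, rounding_mode="floor")
    pix = inds % (h * w)                                         # flat position inside a channel
    ys = torch.div(pix, w, rounding_mode="floor").float()
    xs = (pix % w).float()

    # Step 4: refine the centers with the offset head and read out the box size.
    offset = offset.view(b, 2, -1)
    wh = wh.view(b, 2, -1)
    cx = xs + offset[:, 0].gather(1, pix)
    cy = ys + offset[:, 1].gather(1, pix)
    bw = wh[:, 0].gather(1, pix)
    bh = wh[:, 1].gather(1, pix)

    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=2)
    return boxes * down_ratio, scores, classes                   # map back onto the 512x512 input
```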
3. Specific details
1)input
512*512
The input image is processed to 512*512: the long side is scaled to 512 and the short side is zero-padded. The effect is shown below: the original image's width is larger, so it is scaled directly to 512, while its height, being smaller than 512, is padded with zeros.

2)backbone
In the paper, three network architectures were tried: Hourglass, ResNet, and DLA. Their accuracy and frame rates are:
ResNet-18 with up-convolutional layers: 28.1% COCO AP at 142 FPS
DLA-34: 37.4% COCO AP at 52 FPS
Hourglass-104: 45.1% COCO AP at 1.4 FPS
3)neck & head
This module contains 3 branches, including the center point heatmap branch, the center point offset branch, and the target size branch.
The heatmap graph branch contains C channels, each channel contains a category, and the white bright area in the heatmap indicates the center point position of the target;
The center point offset branch is used to make up for the pixel error caused by mapping the points on the pooled low heatmap to the original image;
The target size branch is used to predict the w and h deviation values ​​of the target rectangle.
a. The heatmap carries the classification information, with one heatmap per category. For each heatmap, when some coordinate contains the center of an object, a keypoint is generated there, and a Gaussian circle is used to represent it: as long as a predicted point lies within a certain radius r of the center (chosen so that the resulting box still has IoU > 0.7 with the gt_bbox), the value at that point is set to a Gaussian value rather than 0. The figure below shows the details.

The specific steps to generate a Heatmap are as follows:
Step 1 - scale the input image to 512*512 and downsample by R=4 to obtain a 128*128 heatmap;
Step 2 - map the boxes of the input image onto the 128*128 heatmap, compute the center of each box, round it down, and call the result "point";
Step 3 - compute the Gaussian radius R from the size of the box. The radius depends mainly on the box's width and height; in practice IoU=0.7 (overlap=0.7 in the figure below) is taken as the critical value, the radii of the three cases are computed, and the minimum is used as the Gaussian kernel radius R. The details are shown below:
(1) Case 1 - the predicted box pred_bbox contains the gt_bbox, corresponding to the first case below; expanding the IoU formula turns it into solving a quadratic equation in one unknown.
(2) Case 2 - the gt_bbox contains the predicted box pred_bbox, corresponding to the second case below; expanding the IoU formula again yields a quadratic equation in one unknown.
(3) Case 3 - the gt_bbox and the predicted box pred_bbox partially overlap, corresponding to the third case below; expanding the IoU formula once more yields a quadratic equation in one unknown.
Step 4 - on the 128*128 heatmap, compute Gaussian values around "point" with radius R; the value is largest at the point itself and decays as the distance grows;
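A small sketch of step 4, splatting a Gaussian onto one class channel of the heatmap. The radius R is assumed to have been computed from the three quadratic cases above; sigma ≈ diameter / 6 (about one third of the radius) follows the convention of the public CornerNet/CenterNet code:

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """heatmap: (H, W) array for one class; center: integer (cx, cy) on the heatmap;
    radius: integer R. Writes the Gaussian in place, keeping the element-wise maximum
    so that overlapping objects of the same class do not erase each other."""
    radius = int(radius)
    diameter = 2 * radius + 1
    sigma = diameter / 6.0
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = center
    h, w = heatmap.shape
    # clip the Gaussian patch at the heatmap border
    left, right = min(cx, radius), min(w - cx, radius + 1)
    top, bottom = min(cy, radius), min(h - cy, radius + 1)
    region = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    patch = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    np.maximum(region, patch, out=region)   # max rule for overlapping Gaussians
    return heatmap
```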

The figure above shows an example: on the left is the cropped 512×512 input image, and on the right the 128×128 heatmap produced by the Gaussian operation. Because the picture contains two cats of the same category, two Gaussian circles are generated on the same heatmap, and the size of each circle is related to the size of its box.
4)loss function
The overall loss consists of three parts: L_k is the heatmap (center point) loss, L_off the center-point offset loss, and L_size the width/height loss.
a. Heatmap loss function
N is the number of objects in the input image, i.e. the number of keypoints in the heatmap. In the formula, the log terms are the cross-entropy part, while the (1 − Ŷ)^α and (1 − Y)^β factors are the focal-loss terms; the latter suppresses the loss within the Gaussian radius, because those positions are already very close to the true object center.
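For reference, the penalty-reduced focal loss from the Objects as Points paper can be written as (α = 2 and β = 4 in the paper):

$$
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
(1 - \hat{Y}_{xyc})^{\alpha}\,\log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\
(1 - Y_{xyc})^{\beta}\,(\hat{Y}_{xyc})^{\alpha}\,\log(1 - \hat{Y}_{xyc}) & \text{otherwise}
\end{cases}
$$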
b.offset loss function
Here Ô_p̃ is the offset predicted by the network, p is the center-point coordinate on the input image, R is the heatmap downsampling factor, and p̃ is the (floored) integer center coordinate after scaling; L1 loss is used over the positive locations. Because the spatial resolution of the backbone's output feature map is one quarter of the input image, each output pixel corresponds to a 4x4 region of the input, which introduces a noticeable error; the offset loss is introduced to compensate for it.
For example, suppose the target center p is (125, 63). With an input size of 512*512 and scaling factor R=4, the center on the 128x128 map is (31.25, 15.75); relative to the integer coordinate (31, 15), the offset is (0.25, 0.75).
c. Target length and width loss function
Here N is the number of keypoints, s_k the true size of object k, and Ŝ_pk the predicted size; L1 loss is used to compute the width/height loss over the positive locations.
5)tricks
a. If the heatmaps of two objects (of the same class) overlap, the value at each overlapping location takes the larger of the two.
b. If the center points of two objects coincide exactly, the algorithm does not resolve it; on the paper's evaluation set (COCO) this happens in only about 0.1% of cases.
4. Results
The table above shows the accuracy and speed of CenterNet on the COCO validation set. Row 1 shows that with Hourglass-104 as the backbone it reaches 40.4 AP at 14 FPS; row 2 shows the AP and FPS obtained with DLA-34; rows 3 and 4 show ResNet-101 and ResNet-18 on the COCO validation set. We can see that the DLA-34 backbone strikes a good balance between accuracy and speed.
Summary of dense-prediction-based methods
Similarities and differences among FSAF, FCOS, and FoveaBox:
1. All use FPN for multi-scale object detection.
2. All decouple classification and regression into two subnetworks.
3. All perform classification and regression via dense prediction.
4. FSAF and FCOS regress the distances to the four box boundaries, while FoveaBox regresses a coordinate transformation.
5. FSAF improves performance through online feature selection, FCOS through the center-ness branch that removes low-quality boxes, and FoveaBox by predicting only the central region of the object.
Similarities and differences between (DenseBox, YOLO) and (FSAF, FCOS, FoveaBox):
1. All perform classification and regression via dense prediction.
2. (FSAF, FCOS, FoveaBox) use FPN for multi-scale detection, while (DenseBox, YOLO) detect at a single scale.
3. (DenseBox, FSAF, FCOS, FoveaBox) decouple classification and regression into two subnetworks, while YOLO obtains classification and localization jointly.
The following are the keypoint-estimation-based anchor-free detectors.
A. CornerNet (open source)
1. Main contributions
1) Design heatmaps for top-left and bottom-right corners to find the points most likely to be such corners, plus a branch that outputs an embedding vector used to decide which top-left corner matches which bottom-right corner.
2) Propose Corner Pooling: because of the nature of corner detection, conventional pooling is not well suited to this framework (introduced later).
2. Main ideas
The network has two branches: one predicts top-left corners, the other bottom-right corners.
Each branch has three outputs: heatmaps predict which points are most likely to be corners; embeddings predict which object each corner belongs to (solving how to match an object's top-left corner with its bottom-right corner); and offsets refine the corner positions.
In the corner detection part, CornerNet produces two heatmaps, one for top-left and one for bottom-right corners. A heatmap gives the positions of corners of every category and attaches a confidence score to each corner; in addition, an embedding vector and an offset are produced. The embedding decides whether a top-left and a bottom-right corner belong to the same object, and the offset fine-tunes the corner positions. In the proposal generation stage, the top-k top-left and bottom-right corners are selected from the heatmaps; then the distance between the embeddings of each candidate pair is computed, and if it is below a preset threshold the two corners are considered to belong to the same object, a bounding box is generated from them, and the average of the two corner scores is used as the box score.
3. Specific details
1)input
511 × 511 (the output is 128*128 after quadruple downsampling)
2)backbone
hourglass network (modified, and no pre train)
3)neck & head
It is divided into four parts: heatmap branch (prediction category), offset branch (refined corner offset), embedding branch (matching upper and lower corners), Corner Pooling
a. Heatmaps and Reduce penalty strategy
Each heatmap has C channels, where C is the number of categories (no background channel). The heatmap takes its largest value at the gt corner position, and values grow the closer one gets to that position; this is implemented with a Gaussian centered at the gt position that decays with distance, i.e.:
Sigma is one third of the radius, and the radius is chosen so that any box formed by corners within it still has an IoU of at least 0.3 with the GT.
b. offset branch
The resolution of the predicted feature map differs from that of the input image. With a downsampling factor n, an original position (x, y) maps to ([x/n], [y/n]) on the feature map, where [ ] denotes rounding down; mapping back to the original image therefore introduces some error. The offsets branch corrects the position on this basis, and its regression target is:
c. embedding branch
The main function of this branch is to group corners. The author uses 1-dimensional embeddings: in effect, different objects receive different ids, such as 1, 2, 3... At prediction time, if the embedding values of a top-left corner and a bottom-right corner are very close (say 1.2 and 1.3), the two corners probably belong to the same object; if they differ a lot (say 1.2 versus 2.3), they generally belong to two different objects. To achieve this, two things must hold:
1. the embedding values predicted for the two corners of the same object should be as close as possible;
2. the embedding values predicted for different objects should be as far apart as possible.
d.  Corner Pooling
Our usual max pooling is centered on the current location, say with a 3x3 kernel, so the receptive field is a square around that location. Corner detection, however, cares more about single directions than about such a square receptive field: taking the top-left corner as an example, it cares about the information horizontally to the right and vertically below. With this in mind, the author proposes Corner Pooling, whose principle is as follows:
For example, if the feature map is 10x10 and the current point is (2,1), top-left corner pooling computes the maximum over the segment from (2,1) to (2,10), and the maximum over the segment from (2,1) to (10,1), and adds the two. In practice this can be computed by scanning in reverse, as illustrated below:
Taking the sequence 2,1,3,0,2 in the figure above as an example: the last 2 stays unchanged, the second-to-last becomes max(0,2)=2, giving ...,2,2; the next one becomes max(3,2)=3, and so on.
Personally, I think of it as a one-dimensional max pooling (a running maximum). The formula is as follows:
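This running-max view translates directly into code; a minimal sketch of top-left corner pooling using reversed cumulative maxima (the bottom-right version simply flips the directions):

```python
import torch

def top_left_corner_pool(x):
    """x: (N, C, H, W) feature map. For every location, take the max over everything to its
    right (same row) plus the max over everything below it (same column); this reproduces
    the 2,1,3,0,2 -> 3,3,3,2,2 scan in the example above."""
    right_to_left = torch.flip(torch.cummax(torch.flip(x, dims=[3]), dim=3).values, dims=[3])
    bottom_to_top = torch.flip(torch.cummax(torch.flip(x, dims=[2]), dim=2).values, dims=[2])
    return right_to_left + bottom_to_top
```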
In addition, the author modifies the backbone's prediction module and the final predictor based on the ResNet residual block, replacing the first 3x3 convolution; the final prediction structure is:

4)loss function
It is divided into three parts: the heatmap branch (category prediction), the embedding branch (matching top-left and bottom-right corners), and the offset branch (corner refinement).
a. Heatmaps

Where p is the predicted value and y is the real value. This function is modified on the basis of focal loss.
b. offset branch
c. embedding branch
Here e_tk and e_bk are the embeddings predicted by the top-left and bottom-right branches for object k, and e_k is their average. L_pull pulls the predictions of the same object together, and L_push pushes the embeddings of different objects apart.
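A compact sketch of the pull/push losses as described above (the margin Δ = 1 is the commonly used value; this is illustrative code, not the official implementation):

```python
import torch

def pull_push_loss(e_tl, e_br, delta=1.0):
    """e_tl, e_br: (N,) embeddings of the top-left / bottom-right corner of each of N objects."""
    e_k = (e_tl + e_br) / 2
    pull = ((e_tl - e_k) ** 2 + (e_br - e_k) ** 2).mean()

    diff = torch.abs(e_k.unsqueeze(0) - e_k.unsqueeze(1))     # pairwise |e_k - e_j|
    push = torch.clamp(delta - diff, min=0)
    n = e_k.numel()
    mask = 1 - torch.eye(n, device=e_k.device)                # exclude the j == k terms
    push = (push * mask).sum() / max(n * (n - 1), 1)
    return pull, push
```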
The overall loss is as follows:
Hyperparameters: the weights of the pull and push terms (alpha and beta) are 0.1, and the weight of the offset term (gamma) is 1.
5)tricks
4. Results
At test time, 3x3 max pooling is used to perform non-maximum suppression on the heatmaps, the top 100 results are taken from each heatmap, and corners are then matched by the L1 distance between their embeddings; note that matching is done within each category, and corners of different categories, or pairs whose embedding distance is too large, are not considered. The author's final experimental results are as follows:
This is a very good result among single-stage methods, and it is competitive with many two-stage methods.
B. CenterNet (open source)
It should be noted that this centerNet is an improved version of CornerNet. The original paper is called: CenterNet: Keypoint Triplets for Object Detection
1. Main contributions
Two methods are proposed to enrich the information of center points and corner points:
Center pooling: used in the branch that predicts center keypoints. Center pooling helps the center keypoint perceive more of the object, making it more recognizable and helping filter candidate boxes.
Cascade corner pooling: built on CornerNet's corner pooling, it gives the corners the ability to perceive information inside the object.
2. Main ideas
1) Use Hourglass as the backbone to extract the feature of the image;
2) Use the cascade corner pooling module to extract the corner heatmaps of the image, and obtain object bounding boxes from the top-left and bottom-right corners in the same way as CornerNet. Each bounding box then defines a central region (details later, in the neck & head part).

3) Use the Center Pooling module to extract the Center heatmap of the image, and get all the object center points according to the Center map.
Specifically: In the center heatmap, according to the response value of the heatmap, select top-k center keypoints, use the corresponding offset map to fine-tune these keypoints, and obtain more accurate keypoint positions.
4) Use the object center points to further filter the bounding boxes from 2): if no center point falls inside a box's central region, the box is considered unreliable; if a center point does fall inside the central region, the box is kept and its score becomes the average of the scores of the three keypoints.
3. Specific details
1)input
544*511 (the feature map is 128*128 after quadruple downsampling)
2)backbone
hourglass
3)neck & head
Here you need to pay attention to the following concepts: center pooling, cascade Corner pooling, and center area settings.
a.Center pooling
The absolute geometric center of an object does not necessarily convey the most useful information. For person recognition, for example, the face is an important cue, yet the center point of the whole person is not on the face. Center pooling is proposed to capture richer visual information and solve this.
The figure below shows the process of center pooling. Concretely: the backbone produces a feature map, and every pixel is checked for being a center point by finding the maximum values along its horizontal and vertical directions and adding them together. Through this operation, center pooling helps locate the center point better.
Remark: this branch is similar to CornerNet's, except that the embedding branch is removed; the heatmap has as many channels as there are categories, and the offset map has two channels.

Implementation:
It can be built easily from corner pooling, as shown in figure (a) below; for example, to take the maximum in the horizontal direction, simply connect a left pooling and a right pooling in series.

b.Cascade corner pooling
The corner is still a corner point and usually lies outside the actual object. Corner pooling looks for the maxima along the vertical and horizontal directions to locate the corner, but it is very sensitive to the boundary. To let the corner "see" the information inside the object, cascade corner pooling works as follows: first find a maximum along the boundary, then "look inward" from the position of that maximum to find an internal maximum, as shown in figure (b) above.
c. Central area
The size of the central region is critical. For a small bounding box, a smaller central region lowers recall, because many true positives would be judged negative; for a large bounding box, a larger central region lowers precision, because many negatives would be judged positive. The paper therefore proposes a scale-aware central region that adapts to the box size: generating a relatively small central region inside a large bounding box effectively improves precision, and generating a relatively large central region inside a small bounding box effectively improves recall.

Here tlx, tly are the coordinates of the top-left corner; brx, bry those of the bottom-right corner; ctlx, ctly those of the top-left corner of the central region; and cbrx, cbry those of its bottom-right corner. n is an odd number that determines the size of the central region; in effect the bbox is divided into 9 parts (n=3) or 25 parts (n=5). The author uses n=5 when the bbox scale is greater than 150 and n=3 otherwise. The final effect is shown below: the solid rectangle is the bounding box and the shaded part is the central region.
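The central region computation amounts to keeping the center cell of an n x n partition of the box; a sketch (the "150" threshold follows the text above, and which side it is measured on is an assumption here):

```python
def central_region(tlx, tly, brx, bry, n=None):
    """Scale-aware central region: divide the box into an n x n grid and keep the center cell.
    n = 3 for small boxes, n = 5 for large ones (scale > 150, measured here on the longer side)."""
    if n is None:
        n = 5 if max(brx - tlx, bry - tly) > 150 else 3
    ctlx = ((n + 1) * tlx + (n - 1) * brx) / (2 * n)
    ctly = ((n + 1) * tly + (n - 1) * bry) / (2 * n)
    cbrx = ((n - 1) * tlx + (n + 1) * brx) / (2 * n)
    cbry = ((n - 1) * tly + (n + 1) * bry) / (2 * n)
    return ctlx, ctly, cbrx, cbry
```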

4)loss function
Of the terms above, the first two are the losses for detecting the corner points and the center points, both using focal loss; the following terms, as in CornerNet, use a pull term and a push term to group the corners correctly; the last two use L1 loss to fine-tune the corner and center positions. The precise meaning of each loss is not spelled out in the paper; it is mainly an adaptation of CornerNet, so refer to CornerNet.
Hyperparameter settings: α, β, γ=0.1,0.1,1.
5)tricks
4. Results
The results are very good: it is the best among the anchor-free, one-stage methods.
C. ExtremeNet (open source)
1. Main contributions
It is proposed to detect extreme points instead of corner points, and obtain the detection results directly through the combination of geometric relations.
Solve the following problems:
1) In bottom-up detection, the corner points used by CornerNet often lie outside the object, which makes them harder to detect; extreme points lie on the object itself.
2) Labeling effort: manually labeling extreme points is easier and faster than labeling bounding boxes.
2. Main ideas
An improvement on CornerNet: predict the 4 extreme points and the 1 center point of each object, combine them according to their geometric layout, build the prediction box from the extreme points, and output the result.
3. Specific details
1)input
511*511
2)backbone
Hourglass-104
(In fact, the calculation of the network is very large, about 140T under the input of 256x256, under the same conditions, ResNet50 is about 8T, which also leads to a very slow overall speed.)
3)neck & head
The prediction head outputs 5 heatmaps of size H*W, corresponding to the 4 extreme points and the center point; each heatmap has C channels, one per object category.
In addition, 4 offset outputs with 2 channels each are predicted, one per extreme point, to compensate for the precision lost to downsampling in the extreme-point coordinates; they are class-agnostic, and the center point needs no offset.
The Center Grouping procedure then groups the 4 extreme points together with the center point.
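A brute-force sketch of center grouping for one class (the threshold name `tau_c` and the five-way score average are assumptions for illustration, not taken from the paper):

```python
from itertools import product

def center_grouping(t_peaks, l_peaks, b_peaks, r_peaks, center_heatmap, tau_c=0.1):
    """Each *_peaks entry is (score, x, y) for the top/left/bottom/right extreme heatmaps;
    center_heatmap is an HxW array. Enumerate all combinations, check geometric consistency,
    and keep a box only if its implied center has a high enough center-heatmap response."""
    detections = []
    for (st, tx, ty), (sl, lx, ly), (sb, bx, by), (sr, rx, ry) in product(
            t_peaks, l_peaks, b_peaks, r_peaks):
        if ty > by or lx > rx:               # top must be above bottom, left must be left of right
            continue
        cx, cy = (lx + rx) / 2, (ty + by) / 2   # center implied by the four extreme points
        center_score = center_heatmap[int(cy), int(cx)]
        if center_score >= tau_c:
            score = (st + sl + sb + sr + center_score) / 5
            detections.append((lx, ty, rx, by, score))
    return detections
```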

4)loss function
The loss calculation of ExtremeNet also follows some practices in CornerNet, calculating the classification loss of focal loss deformation and the loss of positioning accuracy caused by downsampling, without the Embedding process in CornerNet.
5) tricks
a. Training sample generation
The COCO data set is used for training. Since there is no direct pole label, the segmented mask label is used, the pole is calculated, and then the pole is used for training.
(ExtremeNet's training data differs from ordinary detection datasets, or rather contains more information: extreme points carry more semantic information than corner points. From a practical point of view, annotating extreme points is also more efficient than annotating corners, so the approach remains advantageous in use.)
b.  Ghost box suppression
Since the combination of extreme points is an exhaustive process decided only by the coordinate relations between the extreme points and the center point, false positives can be introduced. For example, when three small objects of the same category sit side by side, the left extreme point of the left object and the right extreme point of the right object produce a computed center that lands on the middle object and therefore gets a high response, yielding one long spurious box that spans the three small objects.
The suppression rule is: if the sum of the scores of all boxes contained inside a box exceeds 3 times the score of that box itself, the box's own score is divided by 2.
  • The factor of 3 here is a lower bound; a box containing even more objects (say 5) is also suppressed.
  • Dividing by 2 lowers the score of the large prediction box so it can be filtered out in the subsequent NMS step.
c.  Edge aggregation (currently not understood)
Another issue with extreme-point prediction: if an object has a side parallel to a coordinate axis, then by design every point on that side should respond as an extreme point, but the response of each individual point may be relatively low, possibly lower than the single extreme point obtained after rotating the object by some angle. Edge aggregation is meant to strengthen the response of such parallel edges and improve the prediction.
 

4. Results
The results of the ablation experiment:
  • Among them, the first line is the standard ExtremeNet, and the multi-scale in the second line is multi-scale augmentation of the input image.
  • In the second part, when the Center grouping is removed, the combination process is replaced by Embedding similar to CornerNet, and the performance drops by 2.1% mAP.
  • Edge aggregation and Ghost removal are more obvious for large objects, and have little effect on small objects.
  • The third part is error analysis, replacing the predictions with ground truth. Replacing the center brings only a modest gain, indicating that the center features are extracted reasonably well; replacing the extreme points, and replacing both at once, brings a large gain, showing there is still much room for improvement in extracting the extreme points and in combining the extreme and center points.
Comparison with other target detection frameworks:
Reference links:
18. CenterNet (Objects as Points) details: https://blog.csdn.net/WZZ18191171661/article/details/113753991
19. CenterNet (Objects as Points) details: https://mp.weixin.qq.com/s/hlc1IKhKLh7Zmr5k_NAykw
20. CenterNet (Objects as Points) details: https://zhuanlan.zhihu.com/p/66048276
21. CenterNet (Objects as Points), Gaussian kernel radius: https://zhuanlan.zhihu.com/p/96856635
23. CornerNet: https://arxiv.org/abs/1808.01244
