CenterNet: a detailed analysis of the Objects as Points paper

1. Paper information

1. Title: Objects as Points

2. Publication year: 2019

3. Paper link: https://arxiv.org/pdf/1904.07850.pdf

4. Authors: Xingyi Zhou, Dequan Wang, Philipp Krähenbühl

5. Source code: https://github.com/xingyizhou/CenterNet

2. Paper details

CenterNet represents each object by the center point of its bounding box. All other attributes, such as width and height in 2D detection, or depth, 3D extent, and orientation in 3D detection, or the joint locations in multi-person pose estimation, are regressed directly from features at the center point.
Object detection is thereby reduced to a keypoint estimation problem: the image is passed through a fully convolutional network that produces a heatmap whose peaks correspond to object centers, and the image features at each peak predict the width and height of that object's bounding box. Inference is a single forward pass of the network, with no post-processing such as NMS.
The method is versatile and can be adapted to other tasks with simple modifications, such as the aforementioned 3D detection and multi-person pose estimation.
Because the network is simple, it is also very fast: 142 FPS at 28.1% COCO bounding-box AP with the lightest backbone. Using the state-of-the-art keypoint estimation network Hourglass-104 with multi-scale testing improves accuracy considerably, but at a correspondingly lower speed.

Comparison with other methods

This method is closely related to anchor-based one-stage detectors: the center point can be regarded as a single, shape-agnostic anchor.
[Figure: an anchor-based detector (left) compared with the center-point-based detector (right)]
The differences between anchor-based and center-point-based detection:
1. CenterNet assigns its "anchor" based on location alone, not on IoU overlap, so no manual thresholds are needed to separate foreground from background.
2. Each object has exactly one positive sample, so non-maximum suppression (NMS) is not needed; local peaks are simply extracted from the heatmap.
3. Compared with traditional detectors, CenterNet's output resolution is higher: the output stride is only 4 (versus the typical 16), so multi-scale feature maps are unnecessary.

CornerNet also estimates objects with keypoint maps, but its keypoints are corners: it predicts a pair of corners per object, and after predicting candidate corners it must perform an additional grouping step to match up the corners that belong to the same object. CenterNet eliminates this step entirely: it simply extracts one center point per object, with no grouping and no post-processing such as NMS.

Network Architecture:

The backbone extracts the feature map. Three backbone options are offered, all encoder-decoder networks:
1. A stacked hourglass network (Hourglass-104), used without modification
2. Upconvolutional residual networks (ResNet)
3. Deep layer aggregation (DLA)
Three head branches are attached after the backbone, predicting the heatmap (C channels), the object size (2 channels), and the offset (2 channels), so each output position has C + 4 channels in total, as shown below.

[Figure: network architecture with heatmap, size, and offset head branches]

The heatmap branch is the keypoint prediction branch. The heatmap has C channels, one per class; the peaks in each channel are selected as the center points for that class.
The size branch predicts the width and height, while the offset branch compensates for the discretization error introduced by the network's downsampling. Both are class-agnostic regression branches shared by all classes.
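As a concrete illustration of this output layout, here is a minimal NumPy sketch (function and variable names are my own, not from the paper's code) that splits a C + 4 channel output map into the three heads:

```python
import numpy as np

def split_heads(output, num_classes):
    """Split a (C + 4, H, W) output map into heatmap, size, and offset heads.

    Channels 0..C-1 hold the per-class heatmap, the next 2 hold the
    predicted width/height, and the last 2 hold the sub-pixel offset.
    """
    heatmap = output[:num_classes]              # (C, H, W)
    size = output[num_classes:num_classes + 2]  # (2, H, W): width, height
    offset = output[num_classes + 2:]           # (2, H, W): x, y offset
    return heatmap, size, offset

# Example: 80 COCO classes on a 128 x 128 output map (stride 4 on a 512 x 512 input)
out = np.zeros((80 + 4, 128, 128), dtype=np.float32)
hm, wh, off = split_heads(out, 80)
# hm: (80, 128, 128), wh: (2, 128, 128), off: (2, 128, 128)
```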

Training

L_k is the keypoint detection loss, a penalty-reduced pixel-wise focal loss:

$$
L_k = \frac{-1}{N} \sum_{xyc}
\begin{cases}
(1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\
(1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise}
\end{cases}
$$

Y_xyc is the ground-truth confidence at point (x, y) on the heatmap for channel c. At a ground-truth center it is 1, but elsewhere it is not necessarily 0: each ground-truth keypoint is mapped from the input image down to the low-resolution heatmap, and a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$ is splatted around it, where σ_p is an object-size-adaptive standard deviation. So Y_xyc decays smoothly with distance from the center, something like the figure below:
[Figure: a 2D Gaussian bump, illustrating how Y_xyc decays around a ground-truth center]
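The Gaussian splatting step can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions: the paper derives σ_p from an object-size-adaptive radius, whereas here σ is passed in as a fixed value, and the function name is my own:

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat a 2D Gaussian, peaking at 1, around a ground-truth center.

    Overlapping Gaussians keep the element-wise maximum, as in the paper.
    heatmap: (H, W) array for one class channel; center: (x, y) in map coords.
    """
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

hm = np.zeros((64, 64), dtype=np.float32)
draw_gaussian(hm, center=(20, 30), sigma=3.0)
# hm[30, 20] is 1.0 at the center and decays smoothly with distance
```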

Ŷ_xyc is the predicted score at pixel (x, y) for a center of class c, with values in [0, 1]: 0 means background and 1 means a ground-truth center.
α and β are focal-loss hyperparameters; following CornerNet, α = 2 and β = 4. N is the total number of keypoints in the image.
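With α, β, and N defined as above, the penalty-reduced focal loss can be sketched in NumPy as follows (a simplified single-image version; the function name and the numerical-stability epsilon are my own additions):

```python
import numpy as np

def keypoint_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss over the heatmap.

    pred, gt: arrays of identical shape with values in [0, 1].
    N is the number of ground-truth keypoints (pixels where gt == 1).
    """
    pos = gt == 1.0
    neg = ~pos
    # Standard focal term at ground-truth centers
    pos_loss = ((1 - pred[pos]) ** alpha) * np.log(pred[pos] + eps)
    # The (1 - Y)^beta factor down-weights negatives close to a center
    neg_loss = ((1 - gt[neg]) ** beta) * (pred[neg] ** alpha) * np.log(1 - pred[neg] + eps)
    n = max(pos.sum(), 1)
    return -(pos_loss.sum() + neg_loss.sum()) / n

gt = np.zeros((4, 4), dtype=np.float32)
gt[1, 1] = 1.0
# A perfect prediction yields (near-)zero loss; a uniform 0.5 map does not
loss_perfect = keypoint_focal_loss(gt.copy(), gt)
loss_uniform = keypoint_focal_loss(np.full((4, 4), 0.5, dtype=np.float32), gt)
```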

L_off is the offset loss:

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right) \right|$$

where p is a center position on the original image, R is the output stride, and $\tilde{p} = \lfloor p / R \rfloor$ is the corresponding position on the low-resolution feature map.
L_size is the width/height loss:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|$$

where s_k is the ground-truth width and height of object k (measured at the original image scale) and $\hat{S}_{p_k}$ is the size predicted at its center point.
The total loss is:

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$$
where λ_size = 0.1 and λ_off = 1.
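Both regression losses are plain L1 losses evaluated only at the N center locations, and the total loss is their weighted sum. A minimal NumPy sketch (function names are my own; the paper's implementation gathers keypoint indices from tensors instead of looping):

```python
import numpy as np

def l1_at_keypoints(pred_map, targets, centers):
    """Mean L1 loss between a (2, H, W) prediction map and per-keypoint targets.

    Used for both L_size (targets = ground-truth w, h) and
    L_off (targets = p / R - floor(p / R)).
    targets: (N, 2) array; centers: list of N integer (x, y) map positions.
    """
    total = 0.0
    for (x, y), t in zip(centers, targets):
        total += np.abs(pred_map[:, y, x] - t).sum()
    return total / max(len(centers), 1)

def total_loss(l_k, l_size, l_off, lam_size=0.1, lam_off=1.0):
    """L_det = L_k + lambda_size * L_size + lambda_off * L_off."""
    return l_k + lam_size * l_size + lam_off * l_off
```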

Inference

First, the peaks of each class are extracted from the heatmap: a point whose value is greater than or equal to all eight of its neighbors (the surrounding 3 × 3 window) is treated as a peak, and the 100 highest-scoring peaks are kept. Let $\hat{P}_c$ be the set of n detected center points of class c, where each keypoint is given by integer coordinates (x_i, y_i). The predicted value $\hat{Y}_{x_i y_i c}$ at a keypoint serves as its detection confidence, and the final bounding box is obtained by adding the predicted offset to the center and expanding by the predicted width and height:
$$\left(\hat{x}_i + \delta\hat{x}_i - \hat{w}_i/2,\;\; \hat{y}_i + \delta\hat{y}_i - \hat{h}_i/2,\;\; \hat{x}_i + \delta\hat{x}_i + \hat{w}_i/2,\;\; \hat{y}_i + \delta\hat{y}_i + \hat{h}_i/2\right)$$

where $(\delta\hat{x}_i, \delta\hat{y}_i) = \hat{O}_{\hat{x}_i, \hat{y}_i}$ is the offset prediction and $(\hat{w}_i, \hat{h}_i) = \hat{S}_{\hat{x}_i, \hat{y}_i}$ is the size prediction.
All outputs are produced directly from the keypoint estimates, without any IoU-based NMS.
When extracting peaks, however, a 3 × 3 max-pooling operation is applied, which in effect serves as a substitute for NMS.
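As a sketch (NumPy, with hypothetical function names), the peak extraction and box decoding steps can be written as:

```python
import numpy as np

def extract_peaks(heatmap, k=100):
    """Keep points >= all 8 neighbors (a 3 x 3 max filter), then take the top k."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode='constant', constant_values=-np.inf)
    # 3 x 3 maximum filter built from the nine shifted views of the map
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    local_max = np.max(windows, axis=0)
    ys, xs = np.where(heatmap >= local_max)
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)[:k]
    return xs[order], ys[order], scores[order]

def decode_box(x, y, offset, size):
    """Center plus predicted offset, expanded by half the predicted width/height."""
    ox, oy = offset[:, y, x]
    bw, bh = size[:, y, x]
    cx, cy = x + ox, y + oy
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

hm = np.zeros((8, 8), dtype=np.float32)
hm[2, 3] = 0.9   # a peak
hm[5, 5] = 0.7   # another peak
hm[2, 4] = 0.5   # suppressed: its neighbor at (3, 2) is larger
xs, ys, scores = extract_peaks(hm, k=2)
```

The shifted-view maximum here plays the role of the 3 × 3 max pooling described above, and k = 100 matches the paper's setting.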

Speed vs. accuracy

[Figure: speed/accuracy trade-off of CenterNet and other detectors on COCO]
Comparison with state-of-the-art detectors on COCO test-dev: two-stage methods in the upper part of the table, one-stage methods in the lower part. For one-stage methods, both single-scale and multi-scale testing results are reported.
[Table: COCO test-dev results]

Origin blog.csdn.net/yanghao201607030101/article/details/110202400