Objects as Points Paper Intensive Reading

object as point

Summary

Object detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each one. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point: the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP at 1.4 FPS with multi-scale testing. We use the same approach to estimate 3D bounding boxes in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method is competitive with sophisticated multi-stage methods and runs in real-time.

Fig. 1 Speed-accuracy trade-off on COCO validation for real-time detectors. The proposed CenterNet outperforms a range of state-of-the-art algorithms.

1. Introduction

Object detection provides support for many vision tasks, such as instance segmentation [7, 21, 32], pose estimation [3, 15, 39], tracking [24, 27], and action recognition [5]. It also has many downstream applications, such as surveillance [57], autonomous driving [53], and visual question answering [1]. Current object detectors represent each object by an axis-aligned bounding box that tightly surrounds the object [18, 19, 33, 43, 46]. They then reduce object detection to image classification of a large number of potential object bounding boxes. For each bounding box, a classifier determines whether the image content is a specific object or background. One-stage detectors [33, 43] slide a complex arrangement of possible bounding boxes (called anchors) over the image and classify them directly without specifying the box content. Two-stage detectors [18, 19, 46] recompute image features for each potential box and then classify those features. Post-processing, namely non-maximum suppression, then removes duplicate detections of the same instance by computing bounding box IoU. This post-processing is hard to differentiate and train [23], hence most current detectors are not end-to-end trainable. Nonetheless, this idea has achieved good empirical success [12, 21, 25, 26, 31, 35, 47, 48, 56, 62, 63] over the past five years [19]. However, sliding window based object detectors are somewhat wasteful, as they need to enumerate all possible object locations and sizes.

In this paper, we provide a simpler and more efficient alternative. We represent objects by a single point at the center of their bounding box (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose, are then regressed directly from image features at the center location. Object detection then becomes a standard keypoint estimation problem [3, 39, 60]. We simply feed the input image to a fully convolutional network [37, 40] that generates a heatmap. Peaks in this heatmap correspond to object centers. The image features at each peak predict the height and width of the object bounding box. The model is trained with standard dense supervised learning [39, 60]. Inference is a single network forward pass, without non-maximum suppression for post-processing.

Our approach is general and can be easily extended to other tasks. We provide experiments on 3D object detection [17] and multi-person human pose estimation [4] by predicting additional outputs for each center point (see Figure 4). For 3D bounding box estimation, we regress object absolute depth, 3D bounding box dimensions, and object orientation [38]. For human pose estimation, we treat 2D joint positions as offsets from the center and regress directly to them at the center point position.

The simplicity of our approach allows CenterNet to run at very high speed (Figure 1). With a simple ResNet-18 and up-convolutional layers [55], our network runs at 142 FPS with a COCO bounding box AP of 28.1%. With the carefully designed keypoint detection network DLA-34 [58], our network achieves 37.4% COCO AP at 52 FPS. Equipped with the state-of-the-art keypoint estimation network Hourglass-104 [30, 40] and multi-scale testing, our network achieves 45.1% COCO AP at 1.4 FPS. On 3D bounding box estimation and human pose estimation, we compete with the state-of-the-art at a higher inference speed.

 Fig. 2 We model the object as the center point of its bounding box. Bounding box size and other object properties are inferred from keypoint features at the center. Best viewed in color.

2. Related work

Object Detection via Region Classification

One of the first successful deep object detectors, RCNN [19], enumerates object locations from a large number of proposals [52], crops them, and uses a deep network to classify each object. Fast-RCNN [18] crops image features instead to save computation. However, both methods rely on slow low-level region proposal methods.

Object detection using implicit anchors

Faster RCNN [46] generates region proposals within the detection network. It samples fixed-shape bounding boxes (anchors) around a low-resolution image grid and classifies each as either "foreground" or "background". An anchor whose IoU with any ground-truth box is greater than 0.7 is labeled a positive sample, one whose IoU is less than 0.3 is labeled a negative sample, and the rest are ignored. Each generated region proposal is then classified again [18]. Changing the proposal classifier to a multi-class classification forms the basis of one-stage detectors. Several improvements to one-stage detectors include anchor shape priors [44, 45], different feature resolutions [36], and loss re-weighting between different samples [33].

Our method is closely related to anchor-based one-stage methods [33, 36, 43]. A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences. First, our CenterNet assigns the "anchor" based solely on location, not box overlap [18]. We have no manual thresholds for foreground and background classification [18]. Second, we have only one positive "anchor" per object, and hence do not need non-maximum suppression (NMS) [2]. We simply extract local peaks in the keypoint heatmaps [4, 39]. Third, CenterNet uses a larger output resolution (output stride of 4) compared to traditional object detectors [21, 22] (output stride of 16). This eliminates the need for multiple anchors [47].

Object Detection via Keypoint Estimation

We are not the first to use keypoint estimation for object detection. CornerNet [30] detects two bounding box corners as keypoints, while ExtremeNet [61] detects the top-, left-, bottom-, right-most and center points of all objects. Both methods build on the same robust keypoint estimation network as our CenterNet. However, they require a combinatorial grouping stage after keypoint detection, which significantly slows down each algorithm. Our CenterNet, in contrast, only needs to extract a single center point per object, without grouping or post-processing.

Monocular 3D object detection

3D bounding box estimation powers autonomous driving [17]. Deep3Dbox [38] uses a slow-RCNN [19] style framework, first detecting 2D objects [46] and then feeding each object into a 3D estimation network. 3D RCNN [29] adds an additional head to Faster-RCNN [46] followed by a 3D projection. Deep Manta [6] uses a coarse-to-fine Faster-RCNN [46] trained on many tasks. Our method is similar to a one-stage version of Deep3Dbox [38] or 3D RCNN [29]. As such, CenterNet is much simpler and faster than competing methods.

a) Standard anchor-based detection. An anchor counts as a positive sample if its IoU with any object is > 0.7, as a negative sample if the IoU is < 0.3, and is ignored otherwise.

b) Center point based detection. The center pixel is assigned to the object, the negative loss at nearby points is reduced, and the object size is regressed.

Figure 3. Differences between anchor-based detectors (a) and our center point detector (b). Best viewed on screen.

3. Preliminary

Let I ∈ R^(W × H × 3) be the input image of width W and height H. Our goal is to produce a keypoint heatmap Ŷ ∈ [0, 1]^(W/R × H/R × C), where R is the output stride and C is the number of keypoint types. Keypoint types include the C = 17 human joints [4, 55] in human pose estimation, or the C = 80 object categories (COCO dataset) in object detection [30, 61]. We use the default output stride R = 4 from [4, 40, 42]. The output stride downsamples the output prediction by a factor of R. A prediction Ŷ_{x,y,c} = 1 corresponds to a detected keypoint, while Ŷ_{x,y,c} = 0 is background. We predict Ŷ from the image I using several different fully convolutional encoder-decoder networks: stacked Hourglass networks [30, 40], up-convolutional residual networks (ResNet) [22, 55], and Deep Layer Aggregation (DLA) [58].

We train the keypoint prediction network following Law and Deng [30]. For each ground-truth keypoint p ∈ R² of class c, we compute a low-resolution equivalent p̃ = ⌊p/R⌋. We then splat the ground-truth keypoints onto a heatmap Y using a Gaussian kernel

Y_xyc = exp( −((x − p̃_x)² + (y − p̃_y)²) / (2σ_p²) ),

where σ_p is an object-size-adaptive standard deviation [30]. If two Gaussians of the same class overlap, we take the element-wise maximum [4]. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss [33]:

L_k = −(1/N) Σ_xyc { (1 − Ŷ_xyc)^α log(Ŷ_xyc)                  if Y_xyc = 1,
                     (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc)     otherwise },     (1)
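To make the target construction concrete, below is a minimal numpy sketch of splatting one Gaussian peak onto a per-class heatmap. The helper name draw_gaussian, the 3σ truncation radius, and the way σ_p is chosen are my own assumptions; the released code derives the truncation radius from the box size.

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat a 2D Gaussian peak onto `heatmap` (H x W, one class) at `center`,
    given in low-resolution coordinates p~. Overlapping peaks of the same class
    keep the element-wise maximum, as described above."""
    h, w = heatmap.shape
    radius = max(1, int(3 * sigma))                  # truncation radius (my assumption)
    x0, y0 = int(center[0]), int(center[1])
    xs = np.arange(max(0, x0 - radius), min(w, x0 + radius + 1))
    ys = np.arange(max(0, y0 - radius), min(h, y0 + radius + 1))
    gx, gy = np.meshgrid(xs, ys)
    g = np.exp(-((gx - center[0]) ** 2 + (gy - center[1]) ** 2) / (2 * sigma ** 2))
    heatmap[gy, gx] = np.maximum(heatmap[gy, gx], g)
    return heatmap

# Usage sketch: for a ground-truth keypoint p of class c and output stride R,
# one would call draw_gaussian(target[c], np.floor(p / R), sigma_p).
```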

where α and β are hyper-parameters of the focal loss [33] and N is the number of keypoints in image I. The normalization by N is chosen so as to normalize all positive focal loss instances to 1. We use α = 2 and β = 4 in all our experiments, following Law and Deng [30].
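One possible PyTorch rendering of Equation (1), assuming the predicted heatmap has already passed through a sigmoid; the eps term and the clamp on N are numerical-stability choices of mine, not part of the paper.

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss (Eq. 1).
    pred: sigmoid heatmap Y_hat, B x C x H x W; gt: Gaussian target Y, same shape."""
    pos = gt.eq(1).float()                                   # exact ground-truth centers
    neg = gt.lt(1).float()                                   # all other locations
    pos_loss = (1 - pred) ** alpha * torch.log(pred + eps) * pos
    neg_loss = (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1)                         # N, the number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```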

To recover the discretization error caused by the output stride, we additionally predict a local offset Ô ∈ R^(W/R × H/R × 2) for each center point. All classes c share the same offset prediction. The offset is trained with an L1 loss:

L_off = (1/N) Σ_p | Ô_p̃ − (p/R − p̃) |.     (2)

Supervision acts only at keypoint locations p̃; all other locations are ignored.
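A sketch of the masked L1 loss of Equation (2), under the assumption that gt_off stores p/R − p̃ at each keypoint location (zeros elsewhere) and mask marks those locations; the exact normalization differs slightly in the released code.

```python
import torch
import torch.nn.functional as F

def offset_loss(pred_off, gt_off, mask):
    """L1 offset loss (Eq. 2). pred_off, gt_off: B x 2 x H x W;
    mask: B x 1 x H x W, set to 1 only at ground-truth keypoint locations p~."""
    mask = mask.float()
    loss = F.l1_loss(pred_off * mask, gt_off * mask, reduction='sum')
    return loss / mask.sum().clamp(min=1)
```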

In the next section, we show how to extend this keypoint estimator to a general purpose object detector.

Figure 4: Outputs of our network for different tasks: top: object detection, middle: 3D object detection, bottom: pose estimation. All modalities are produced from a common backbone, with a different 3 × 3 and 1 × 1 output convolution separated by a ReLU. Numbers in parentheses indicate output channels. See Section 4 for details.

4. Objects as Points

Let (x₁^(k), y₁^(k), x₂^(k), y₂^(k)) be the bounding box of object k with category c_k. Its center point lies at p_k = ((x₁^(k) + x₂^(k))/2, (y₁^(k) + y₂^(k))/2). We use our keypoint estimator Ŷ to predict all center points. In addition, we regress to the object size s_k = (x₂^(k) − x₁^(k), y₂^(k) − y₁^(k)) for each object k. To limit the computational burden, we use a single size prediction Ŝ ∈ R^(W/R × H/R × 2) for all object categories. We use an L1 loss at the center point, similar to Objective (2):

L_size = (1/N) Σ_{k=1..N} | Ŝ_{p_k} − s_k |.     (3)

We do not normalize the scale and directly use the raw pixel coordinates. We instead scale the loss by a constant λ_size. The overall training objective is

L_det = L_k + λ_size L_size + λ_off L_off.     (4)

Unless otherwise stated, we set λ_size = 0.1 and λ_off = 1 in all our experiments. We use a single network to predict the keypoints Ŷ, offsets Ô, and sizes Ŝ. The network predicts a total of C + 4 outputs at each location. All outputs share a common fully convolutional backbone. For each modality, the backbone features are passed through a separate 3 × 3 convolution, ReLU, and another 1 × 1 convolution. Figure 4 shows an overview of the network outputs. Section 5 and the supplementary material contain additional architectural details.
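The head layout described above (shared backbone features, then one 3 × 3 convolution, ReLU, and 1 × 1 convolution per modality, for C + 4 outputs in total) could look roughly like the following; the class name, attribute names, and input channel count are placeholders of mine, not the paper's.

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    """Detection heads on a shared backbone feature map: a C-channel heatmap,
    2 offset channels, and 2 size channels (C + 4 outputs per location)."""
    def __init__(self, in_ch=64, num_classes=80, head_ch=256):
        super().__init__()
        def make_head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, head_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(head_ch, out_ch, 1),
            )
        self.heatmap = make_head(num_classes)   # Y_hat
        self.offset = make_head(2)              # O_hat
        self.size = make_head(2)                # S_hat

    def forward(self, feat):
        return torch.sigmoid(self.heatmap(feat)), self.offset(feat), self.size(feat)
```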

From points to bounding boxes

At inference time, we first extract the peaks in the heatmap of each category independently. We detect all responses whose value is greater than or equal to its 8-connected neighbors and keep the top 100 peaks. Let P̂_c = {(x̂_i, ŷ_i)}_{i=1..n} be the set of n detected center points of class c. Each keypoint location is given by integer coordinates (x̂_i, ŷ_i). We use the keypoint values Ŷ_{x̂_i ŷ_i c} as a measure of detection confidence and produce a bounding box at location

(x̂_i + δx̂_i − ŵ_i/2,  ŷ_i + δŷ_i − ĥ_i/2,  x̂_i + δx̂_i + ŵ_i/2,  ŷ_i + δŷ_i + ĥ_i/2),

where (δx̂_i, δŷ_i) = Ô_{x̂_i, ŷ_i} is the offset prediction and (ŵ_i, ĥ_i) = Ŝ_{x̂_i, ŷ_i} is the size prediction. All outputs are produced directly from the keypoint estimates, without IoU-based non-maximum suppression (NMS) or other post-processing. Peak keypoint extraction serves as a sufficient NMS alternative and can be implemented efficiently on device using a 3 × 3 max-pooling operation.
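A rough PyTorch sketch of this decoding step: 3 × 3 max pooling keeps only local maxima, the top 100 peaks are gathered across all classes, and boxes are assembled from the offset and size maps. It assumes batch size 1, a (dx, dy)/(w, h) channel order, and sizes predicted in output-map units; the official decoder differs in its exact bookkeeping.

```python
import torch
import torch.nn.functional as F

def decode_boxes(heatmap, offset, size, k=100, stride=4):
    """heatmap: 1 x C x H x W (sigmoid scores); offset, size: 1 x 2 x H x W.
    Returns a list of (x1, y1, x2, y2, score, class) in input-image coordinates."""
    b, c, h, w = heatmap.shape
    peaks = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (peaks == heatmap).float()        # keep only local maxima (pseudo-NMS)
    scores, inds = heatmap.view(b, -1).topk(k)            # top-k peaks over all classes
    classes = inds // (h * w)
    ys = (inds % (h * w)) // w
    xs = inds % w
    boxes = []
    for i in range(k):
        x, y = xs[0, i].item(), ys[0, i].item()
        dx, dy = offset[0, :, y, x].tolist()              # assumed (dx, dy) channel order
        bw, bh = size[0, :, y, x].tolist()                # assumed (w, h) channel order
        cx, cy = x + dx, y + dy                           # refined center in output-map coords
        boxes.append(((cx - bw / 2) * stride, (cy - bh / 2) * stride,
                      (cx + bw / 2) * stride, (cy + bh / 2) * stride,
                      scores[0, i].item(), classes[0, i].item()))
    return boxes
```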

4.1 3D detection

3D detection estimates a three-dimensional bounding box per object and requires three additional attributes per center point: depth, 3D dimensions, and orientation. We add a separate head for each of them. The depth d is a single scalar per center point. However, depth is difficult to regress to directly. We instead use the output transform of Eigen et al. [13], d = 1/σ(d̂) − 1, where σ is the sigmoid function. We compute the depth as an additional output channel D̂ ∈ [0, 1]^(W/R × H/R) of the keypoint estimator. It again uses two convolutional layers separated by a ReLU. Unlike the previous modalities, it uses the inverse sigmoidal transformation at the output layer. We train the depth estimator with an L1 loss in the original depth domain, after the sigmoidal transformation.
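The output transform itself is a one-liner; this small sketch just applies d = 1/σ(d̂) − 1 to the raw head output to obtain metric depth.

```python
import torch

def decode_depth(depth_logits):
    """Map the raw depth head output d_hat back to depth via d = 1 / sigmoid(d_hat) - 1,
    the output transform of Eigen et al. [13] described above."""
    return 1.0 / torch.sigmoid(depth_logits) - 1.0
```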

The 3D dimensions of an object are three scalars. We directly regress to their absolute values (in meters) using a separate head Γ̂ ∈ R^(W/R × H/R × 3) and an L1 loss.

Orientation is a single scalar by default. However, it can be hard to regress to. We follow Mousavian et al. [38] and represent the orientation with two bins and in-bin regression. Specifically, the orientation is encoded using 8 scalars, with 4 scalars per bin. For each bin, two scalars are used for softmax classification and the remaining two scalars regress to an angle within the bin. See the supplementary material for details on these losses.
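A hedged sketch of how the 8-scalar orientation output might be decoded. The channel layout (per-bin classification logits followed by per-bin sin/cos) and the bin centers at ±π/2 reflect my reading of the released code and should be treated as assumptions, not as the paper's definition.

```python
import math
import torch

def decode_orientation(rot):
    """rot: N x 8 orientation head output, two bins with 4 scalars each
    (2 classification logits, then sin and cos of the in-bin angle)."""
    use_bin1 = rot[:, 1] > rot[:, 5]                            # pick the more confident bin
    alpha1 = torch.atan2(rot[:, 2], rot[:, 3]) - 0.5 * math.pi  # bin 1 centered at -pi/2 (assumed)
    alpha2 = torch.atan2(rot[:, 6], rot[:, 7]) + 0.5 * math.pi  # bin 2 centered at +pi/2 (assumed)
    return torch.where(use_bin1, alpha1, alpha2)
```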

4.2. Human Pose Estimation

Human pose estimation aims to estimate k 2D human joint locations for every human instance in the image (k = 17 for COCO). We treat the pose as a k × 2-dimensional property of the center point and parameterize each keypoint by an offset to the center point. We directly regress to the joint offsets (in pixels) Ĵ ∈ R^(W/R × H/R × k × 2) with an L1 loss, and ignore invisible keypoints by masking the loss. This results in a regression-based single-stage multi-person pose estimator, similar to the slow-RCNN-style counterparts of Toshev et al. [51] and Sun et al. [49].

To refine the keypoints, we further estimate k human joint heatmaps Φ̂ ∈ R^(W/R × H/R × k) using standard bottom-up multi-person pose estimation [4, 39, 41]. We train the human joint heatmaps with a focal loss and local pixel offsets, analogous to the center point detection discussed in Section 3.

We then snap our initial predictions to the closest detected keypoint on this heatmap. Here, our center offsets act as a grouping cue, assigning individual keypoint detections to their closest person instance. Specifically, let (x̂, ŷ) be a detected center point. We first regress to all joint locations l_j = (x̂, ŷ) + Ĵ_{x̂ŷj} for j ∈ 1...k. We also extract all keypoint locations L_j = {l̃_{ji}}_{i=1..n_j} with confidence > 0.1 from the corresponding heatmap for each joint type j. We then assign each regressed location l_j to its closest detected keypoint, arg min_{l ∈ L_j} (l − l_j)², considering only joint detections within the bounding box of the detected object.
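For a single joint type, the snapping step could be sketched as below. snap_to_heatmap_joints is a hypothetical helper of mine; restricting candidates to the detected object's bounding box, as the paper does, is omitted for brevity.

```python
import torch

def snap_to_heatmap_joints(regressed, detected, conf, thresh=0.1):
    """Replace each center-regressed joint location with its nearest
    heatmap-detected keypoint of confidence > thresh.
    regressed: M x 2, detected: N x 2, conf: N."""
    keep = conf > thresh
    if keep.sum() == 0:
        return regressed                                  # nothing detected: keep the regression
    cand = detected[keep]                                 # candidate keypoints L_j
    d = torch.cdist(regressed.float(), cand.float())      # M x N pairwise distances
    return cand[d.argmin(dim=1)]
```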

5. Implementation details

We experiment with 4 architectures: ResNet-18, ResNet-101 [55], DLA-34 [58], and Hourglass-104 [30]. We modify the ResNets and DLA-34 with deformable convolution layers [12] and use the Hourglass network as-is.

Hourglass

The Hourglass network [30, 40] downsamples the input by a factor of 4, followed by two sequential hourglass modules. Each hourglass module is a symmetric 5-layer down- and up-convolutional network with skip connections. This network is quite large, but generally yields the best keypoint estimation performance.

ResNet

Xiao et al. [55] augment a standard residual network [22] with three up-convolutional layers to allow for a higher-resolution output (output stride 4). We first change the channels of the three upsampling layers to 256, 128, and 64, respectively, to save computation. We then add a 3 × 3 deformable convolutional layer before each up-convolution with channels 256, 128, and 64, respectively. The up-convolutional kernels are initialized as bilinear interpolation. See the supplementary material for a detailed architecture diagram.

DLA

Deep Layer Aggregation (DLA) [58] is an image classification network with hierarchical skip connections. We utilize the fully convolutional upsampling version of DLA for dense prediction, which uses iterative deep aggregation to symmetrically increase feature map resolution. We augment the skip connections with deformable convolution [63] from lower layers to the output. Specifically, we replace the original convolution with a 3 × 3 deformable convolution at every upsampling layer. See the supplementary material for a detailed architecture diagram.

We add a 3 × 3 convolutional layer with 256 channels before each output head. The final 1 × 1 convolution then produces the desired output. We provide more details in the supplementary material.

Training

We train on an input resolution of 512 × 512. This yields an output resolution of 128 × 128 for all models. We use random flipping, random scaling (between 0.6 and 1.3), cropping, and color jittering as data augmentation, and Adam [28] to optimize the overall objective. We use no augmentation to train the 3D estimation branch, since cropping or scaling would change the 3D measurements. For the residual networks and DLA-34, we train with a batch size of 128 (on 8 GPUs) and a learning rate of 5e-4 for 140 epochs, with the learning rate dropped 10× at 90 and 120 epochs (following [55]). For Hourglass-104, we follow ExtremeNet [61] and use a batch size of 29 (on 5 GPUs, with batch size 4 on the master GPU) and a learning rate of 2.5e-4 for 50 epochs, with a 10× learning rate drop at epoch 40. For detection, we fine-tune the Hourglass-104 from ExtremeNet [61] to save computation. The downsampling layers of ResNet-101 and DLA-34 are initialized with ImageNet pretraining, and the upsampling layers are randomly initialized. ResNet-101 and DLA-34 train in 2.5 days on 8 TITAN-V GPUs, while Hourglass-104 requires 5 days.
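As a minimal sketch of the ResNet/DLA-34 schedule described above (Adam, learning rate 5e-4, 140 epochs, learning rate divided by 10 at epochs 90 and 120); the model here is only a stand-in, not the actual detector, and the training step itself is elided.

```python
import torch

model = torch.nn.Conv2d(3, 84, 3, padding=1)   # stand-in for the real detector (84 = C + 4 outputs)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)

for epoch in range(140):
    # ... one pass over the data: compute L_det (Eq. 4), optimizer.zero_grad(),
    # L_det.backward(), optimizer.step() ...
    scheduler.step()                           # 10x learning rate drop at epochs 90 and 120
```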

Inference

We use three levels of test augmentation: no augmentation, flip augmentation, and flip with multi-scale testing (0.5, 0.75, 1, 1.25, 1.5). For flipping, we average the network outputs before decoding the bounding boxes. For multi-scale testing, we use NMS to merge the results. These augmentations yield different speed-accuracy trade-offs, as shown in the next section.

Paper

https://arxiv.org/abs/1904.07850

Code

https://github.com/xingyizhou/CenterNet

Recommended reading

Throw away the anchors! The real CenterNet: an interpretation of the Objects as Points paper (知乎/Zhihu)

Personal understanding

This section summarizes some of my personal understanding of anchor-based and anchor-free detection.

With anchors

What are anchors? My personal understanding: an image is passed through a series of convolutions to obtain a set of feature maps. Each point on a feature map represents, to some extent, the features of a region of the original image, so as long as enough points on the feature maps are chosen, you can exhaustively enumerate the candidate boxes you want to recognize in the original image. This exhaustive enumeration of possible target boxes is the core idea behind anchors. Because targets come in different sizes, anchors of different sizes may be designed and placed on different feature maps, and the details differ from one detection method to another. Exhaustively laying out candidate boxes, classifying all of them, and then passing the classified boxes through non-maximum suppression to obtain the final result is the core of anchor-based deep learning object detection.

Take SSD as an example; personally, I think SSD is the first anchor-based paper in which all of the basic ideas are laid out.

In SSD, the input image goes through a series of convolutions, anchors are placed on the six feature layers 4, 7, 8, 9, 10, and 11, all anchors are then classified, and boxes of similar categories are passed through non-maximum suppression. The overall idea is: enumerate, classify, and pick the best.

This paper

CenterNet can do many things; take object detection as an example.

For object detection, three feature maps are produced at the same time: a heatmap to locate the center point of the target, an offset map to fine-tune the center point coordinates, and a size map to give the width and height of the target box. Taking the 80-category COCO dataset and a 512 × 512 input image as an example, the three final feature maps are all downsampled by a factor of 4. The heatmap therefore has size 1 × 80 × 128 × 128, where 80 corresponds to the categories and each location on the map corresponds to an approximate center point position; the offset map has size 1 × 2 × 128 × 128, the 2 channels being the (x, y) offsets shared by all categories; and the size map has size 1 × 2 × 128 × 128, the 2 channels being (w, h), also shared by all categories.
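A standalone shape check for the description above, assuming the paper's C + 4 output layout (80-class heatmap plus 2 shared offset and 2 shared size channels) at output stride 4; the 64 backbone channels are only a placeholder.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 128, 128)        # backbone feature for a 512 x 512 input at stride 4
heatmap_head = nn.Conv2d(64, 80, 1)        # one channel per COCO category
offset_head = nn.Conv2d(64, 2, 1)          # (x, y) center offset, shared by all categories
size_head = nn.Conv2d(64, 2, 1)            # (w, h) box size, shared by all categories
print(heatmap_head(feat).shape)            # torch.Size([1, 80, 128, 128])
print(offset_head(feat).shape)             # torch.Size([1, 2, 128, 128])
print(size_head(feat).shape)               # torch.Size([1, 2, 128, 128])
```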

You could say CenterNet has no anchors, and it is true that no anchors are used. But the first feature map combined with the third plays a role not unlike anchors, and the second feature map is similar to the usual regression-based position refinement. The idea is really clever.

Details

Loss function

In Equation 4, λ_size = 0.1 and λ_off = 1. When the losses of the three output maps are added, their coefficients differ: the weight on the width/height (size) loss is only one tenth of the other two. This point is discussed later in the paper; the authors found experimentally that setting this coefficient to 0.1 works better.

Network structure

resdcn34


Source: blog.csdn.net/XDH19910113/article/details/125077464