Dynamic Head study notes

Dynamic Head: Unifying Object Detection Heads with Attentions

Abstract

The complexity of combining localization and classification in object detection has led to a boom in methods. Previous works attempted to improve the performance of object detection heads in various ways, but failed to present a unified view. This paper proposes a novel dynamic head framework that unifies object detection heads with attentions. By coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness, the proposed approach significantly improves the representation ability of object detection heads without any additional computational overhead. Further experiments demonstrate the effectiveness and efficiency of the proposed dynamic head on the COCO benchmark. With a standard ResNeXt-101-DCN backbone, we largely improve the performance of popular object detectors and achieve a new state of the art at 54.0 AP. Furthermore, with the latest transformer backbone and extra data, we can push the current best COCO result to a new record of 60.6 AP. The code will be released at https://github.com/microsoft/DynamicHead.

1. Introduction

Object detection answers the question "what objects are where" in computer vision applications. In the deep learning era, almost all modern object detectors [11, 23, 12, 35, 28, 31, 33] share the same paradigm: a backbone for feature extraction and a head for the localization and classification tasks. How to improve the performance of the object detection head has therefore become a key issue in existing object detection work.

The challenges of developing a good object detection head can be summarized into three categories. First, the head should be scale-aware, because multiple objects with vastly different scales often coexist in an image. Second, the head should be spatial-aware, since objects usually appear in significantly different shapes, rotations and locations under different viewpoints. Third, the head needs to be task-aware, because objects may have various representations (e.g., bounding box [12], center [28] and corner points [33]), which have totally different objectives and constraints. We find that recent studies [12, 35, 28, 31, 33] each focus on solving only one of the above problems in various ways. How to develop a unified head that addresses all of these issues simultaneously remains an open question.

In this paper, we propose a novel detection head, called the dynamic head, which unifies scale-awareness, spatial-awareness and task-awareness. If we consider the output of a backbone (i.e., the input to the detection head) as a three-dimensional tensor with dimensions level × space × channel, such a unified head can be viewed as an attention learning problem. An intuitive solution would be to build a full self-attention mechanism over this tensor. However, the optimization problem would be too hard to solve and the computational cost would be unaffordable.

Instead, we can deploy attention mechanisms separately on each particular dimension of the features, i.e., level-wise, spatial-wise and channel-wise. The scale-aware attention module is deployed only on the level dimension. It learns the relative importance of various semantic levels to enhance the features at the appropriate level for each individual object based on its scale. The spatial-aware attention module is deployed on the spatial dimension (i.e., height × width). It learns coherently discriminative representations across spatial locations. The task-aware attention module is deployed on the channel dimension. It directs different feature channels to favor different tasks separately (e.g., classification, box regression, and center/keypoint learning) based on the different convolution kernel responses from objects.

In this way, we explicitly implement a unified attention mechanism for the detection head. Although these attention mechanisms are applied separately on different dimensions of the feature tensor, their effects can complement each other. Extensive experiments on the MS-COCO benchmark demonstrate the effectiveness of our approach. It offers great potential for learning better representations and can be used to improve various object detection models, with AP gains of 1.2% ~ 3.2%. With the standard ResNeXt-101-DCN backbone, the method achieves a new state of the art of 54.0% AP on COCO. In addition, compared with EfficientDet [27] and SpineNet [8], the training time of the dynamic head is 1/20 of theirs, with better performance. Furthermore, leveraging the latest transformer backbone and extra data from self-training, we can push the current best COCO result to a new record of 60.6 AP (see the appendix for details).

2. Related Work

Recent research has mainly studied object detection from the aspects of scale-awareness, spatial-awareness and task-awareness.

Scale-awareness.

Since objects of widely different scales often coexist in natural images, many studies have addressed the importance of scale-awareness in object detection. Early work demonstrated the importance of utilizing image pyramids for multi-scale training [6, 24, 25]. [15] proposed to replace the image pyramid with a feature pyramid, which improves efficiency by concatenating down-sampled convolutional features, and this has become a standard component of modern object detectors. However, features from different levels are usually extracted from different depths of the network, which creates a noticeable semantic gap. To solve this problem, [18] proposed enhancing the lower-level features via a bottom-up path augmentation on top of the feature pyramid. Later, [20] improved it by introducing balanced sampling and a balanced feature pyramid. Recently, [31] proposed a pyramid convolution based on a modified 3D convolution to extract scale and spatial features simultaneously.

In this work, we propose a scale-aware attention in the detection head that makes the importance of the various feature levels adaptive to the input.

Spatial-awareness.

Previous work has tried to improve spatial-awareness in object detection for better semantic learning. Convolutional neural networks are limited in learning the spatial transformations present in images [41]. Some works alleviate this problem by increasing model capacity (size) [13, 32] or by involving expensive data augmentation [14], resulting in extremely high computational cost for inference and training. Later, new convolution operators were proposed to improve the learning of spatial transformations. [34] proposed using dilated convolutions to aggregate contextual information from exponentially expanded receptive fields. [7] proposed deformable convolution, which samples spatial locations using self-learned offsets conditioned on the input. [37] re-formulated the offsets by introducing a learned feature amplitude and further improved its capability.

In this work, we propose a spatial-aware attention in the detection head, which not only applies attention to each spatial location but also adaptively aggregates multiple feature levels together to learn a more discriminative representation.

Task-awareness.

Object detection originated from the two-stage paradigm [39, 6], where object proposals are first generated and then the proposals are classified into different classes or background. [23] formalized the modern two-stage framework by introducing the Region Proposal Network (RPN), formulating the two stages as a single convolutional network. Later, one-stage object detectors [22] became popular because of their high efficiency. [16] further improved the architecture by introducing task-specific branches, surpassing the accuracy of two-stage detectors while maintaining the speed of earlier one-stage detectors.

Recently, an increasing number of studies have found that various representations of objects can potentially improve performance. [12] first demonstrated that combining bounding boxes with object segmentation masks can further improve performance. [28] proposed to use a center representation to solve object detection in a per-pixel prediction manner. [35] further improved the performance of the center-based method by adaptively selecting positive and negative samples according to the statistical characteristics of objects. Later, [33] simplified learning by representing objects as keypoints. [9] further improved performance by detecting each object as a triplet (rather than a pair) of keypoints to reduce false predictions. Recently, [21] proposed to enhance point features by extracting border features from the extreme points of each border, and achieved state-of-the-art performance.

In this work, we propose a task-aware attention in the detection head, which allows attention to be deployed on channels. It can adaptively favor various tasks, whether for single-stage/two-stage detectors or for box/center/keypoint-based detectors.

More importantly, in our head design, all of the above properties are integrated into a unified attention mechanism. To our knowledge, this is the first general detection head framework, which takes a step toward understanding the role attention plays in the success of object detection heads.

3. Our Approach

3.1. Motivation

To simultaneously achieve scale-awareness, spatial-awareness and task-awareness in a unified object detection head, we first need a general view of previous improvements to object detection heads.

Given the concatenation of L feature levels from a feature pyramid, $\mathcal{F}_{in} = \{F_i\}_{i=1}^{L}$, we can use upsampling or downsampling to resize the features of the consecutive levels to the scale of the median-level feature. The rescaled feature pyramid can be expressed as a four-dimensional tensor $\mathcal{F} \in \mathbb{R}^{L \times H \times W \times C}$, where L denotes the number of levels in the pyramid, and H, W and C denote the height, width and number of channels of the median-level feature, respectively. We further define $S = H \times W$ and reshape the tensor into a three-dimensional tensor $\mathcal{F} \in \mathbb{R}^{L \times S \times C}$. Based on this representation, we explore the role of each dimension of the tensor.
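As a concrete illustration of this tensor construction, the following numpy sketch resizes a toy feature pyramid to the median level's scale and flattens the spatial dimensions. The level count, channel count and spatial sizes are illustrative assumptions, and nearest-neighbor resizing stands in for whatever up/downsampling a real implementation would use.

```python
import numpy as np

# Hypothetical feature pyramid: L = 3 levels from a backbone, each with C = 16
# channels but different spatial sizes (all sizes here are illustrative).
C = 16
rng = np.random.default_rng(0)
pyramid = [rng.random((C, 64, 64)),   # high-resolution level
           rng.random((C, 32, 32)),   # median level
           rng.random((C, 16, 16))]   # low-resolution level

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a (C, H, W) map (stand-in for up/downsampling)."""
    c, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows][:, :, cols]

# Rescale every level to the median level's scale -> 4D tensor (L, H, W, C)
H, W = pyramid[1].shape[1:]
F4 = np.stack([resize_nearest(f, H, W).transpose(1, 2, 0) for f in pyramid])

# Flatten the spatial dims: S = H * W -> 3D tensor (L, S, C)
F = F4.reshape(F4.shape[0], H * W, C)
print(F.shape)  # (3, 1024, 16)
```

The resulting (L, S, C) tensor is the object that the three attentions below operate on, one dimension each.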

1) Differences in object scale are related to features at different levels. Improving the representation learning of F across different levels benefits the scale-awareness of object detection.

2) Various geometric transformations of differently shaped objects are related to features at different spatial locations. Improving the representation learning of F across different spatial locations benefits the spatial-awareness of object detection.

3) Different object representations and tasks can be associated with features in different channels. Improving the representation learning of F across different channels benefits the task-awareness of object detection.

In this paper, we show that all of the above directions can be unified into an efficient attention learning problem. Our work is the first attempt to combine multiple attentions on all three dimensions into a unified head and to maximize their joint improvements.

3.2. Dynamic Head: Unifying with Attentions

Given the feature tensor $\mathcal{F} \in \mathbb{R}^{L \times S \times C}$, the general formulation for applying self-attention is:

$$W(\mathcal{F}) = \pi(\mathcal{F}) \cdot \mathcal{F} \tag{1}$$

where π(·) is an attention function. A good solution is to implement this attention function through fully connected layers. But due to the high dimensionality of tensors, directly learning attention functions for all dimensions is computationally expensive and practically unaffordable.

Instead, we decompose the attention function into three sequential attentions, each focusing on only one dimension:

$$W(\mathcal{F}) = \pi_C\big(\pi_S\big(\pi_L(\mathcal{F}) \cdot \mathcal{F}\big) \cdot \mathcal{F}\big) \cdot \mathcal{F} \tag{2}$$

where $\pi_L(\cdot)$, $\pi_S(\cdot)$ and $\pi_C(\cdot)$ are three different attention functions applied on the dimensions L, S and C, respectively.

Scale-aware Attention $\pi_L$.

We first introduce a scale-aware attention based on semantic importance to dynamically fuse features of different scales.

$$\pi_L(\mathcal{F}) \cdot \mathcal{F} = \sigma\left(f\left(\frac{1}{SC}\sum_{S,C}\mathcal{F}\right)\right) \cdot \mathcal{F} \tag{3}$$

where $f(\cdot)$ is a linear function approximated by a 1 × 1 convolutional layer, and $\sigma(x) = \max(0, \min(1, (x+1)/2))$ is a hard sigmoid function.
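A minimal numpy sketch of Equation 3: average the tensor over its S and C dimensions, pass the per-level result through a linear function and the hard sigmoid, and rescale each level. Note that the paper's $f(\cdot)$ is a learned 1 × 1 convolution; the scalar `weight`/`bias` here are a simplified stand-in.

```python
import numpy as np

def hard_sigmoid(x):
    """sigma(x) = max(0, min(1, (x + 1) / 2)) from the paper."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F, weight, bias):
    """Eq. 3 sketch for F of shape (L, S, C). `weight`/`bias` parameterize a
    simplified linear f applied to the per-level average over S and C."""
    pooled = F.mean(axis=(1, 2))                  # (L,): average over S and C
    gate = hard_sigmoid(weight * pooled + bias)   # (L,): per-level importance
    return gate[:, None, None] * F                # broadcast gate over S and C

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 1024, 16))
out = scale_aware_attention(F, weight=1.0, bias=0.0)
print(out.shape)  # (3, 1024, 16)
```

Because the gate lies in [0, 1], each level's features are attenuated in proportion to its learned importance.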

Spatial-aware Attention $\pi_S$.

We apply a spatial-aware attention module based on the fused features to focus on discriminative regions that consistently co-exist across spatial locations and feature levels. Considering the high dimensionality of S, we decompose this module into two steps: first making the attention learning sparse using deformable convolution [7], and then aggregating features across levels at the same spatial locations:

$$\pi_S(\mathcal{F}) \cdot \mathcal{F} = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot \mathcal{F}(l;\, p_k + \Delta p_k;\, c) \cdot \Delta m_k \tag{4}$$

where K is the number of sparse sampling locations, $p_k + \Delta p_k$ is a location shifted by the self-learned spatial offset $\Delta p_k$ to focus on a discriminative region, and $\Delta m_k$ is a self-learned importance scalar at location $p_k$. Both are learned from the input feature at the median level of $\mathcal{F}$.
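To make the sparse sampling in Equation 4 concrete, here is a toy numpy sketch for a single output location and a single channel: it samples K offset locations with bilinear interpolation, weights each sample by its importance scalar, and averages across levels. The convolution weights $w_{l,k}$ are omitted (folded into the masks) and the offsets/masks are random stand-ins for the self-learned ones.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinear lookup in a (H, W) map at a fractional location (y, x)."""
    H, W = feat.shape
    y = float(np.clip(y, 0, H - 1)); x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def spatial_attention_at(levels, p, offsets, masks):
    """Eq. 4 sketch for one location p and one channel.
    levels  : list of L (H, W) maps at a common scale
    offsets : (K, 2) stand-ins for the self-learned offsets Delta p_k
    masks   : (K,)   stand-ins for the importance scalars Delta m_k
    Returns the mean over levels of the K-point deformable aggregation."""
    acc = 0.0
    for feat in levels:
        for (dy, dx), m in zip(offsets, masks):
            acc += bilinear_sample(feat, p[0] + dy, p[1] + dx) * m
    return acc / len(levels)

rng = np.random.default_rng(1)
levels = [rng.standard_normal((32, 32)) for _ in range(3)]
offsets = rng.standard_normal((9, 2))   # K = 9, as in a 3x3 deformable kernel
masks = rng.random(9)
val = spatial_attention_at(levels, p=(16.0, 16.0), offsets=offsets, masks=masks)
print(np.isfinite(val))  # True
```

A real implementation would instead run a deformable convolution (e.g., `torchvision.ops.deform_conv2d`) over the whole map and predict the offsets and masks from the median-level feature.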

Task-aware Attention $\pi_C$.

To enable joint learning and to generalize different representations of objects, we deploy a task-aware attention at the end. It dynamically switches ON and OFF channels of features to favor different tasks:

$$\pi_C(\mathcal{F}) \cdot \mathcal{F} = \max\left(\alpha^1(\mathcal{F}) \cdot \mathcal{F}_c + \beta^1(\mathcal{F}),\ \alpha^2(\mathcal{F}) \cdot \mathcal{F}_c + \beta^2(\mathcal{F})\right) \tag{5}$$

where $\mathcal{F}_c$ is the feature slice of the c-th channel and $[\alpha^1, \alpha^2, \beta^1, \beta^2]^T = \theta(\cdot)$ is a hyper function that learns to control the activation thresholds. $\theta(\cdot)$ is implemented similarly to [3]: global average pooling over the L × S dimensions to reduce dimensionality, followed by two fully connected layers and a normalization layer, and finally a shifted sigmoid function to normalize the output to [−1, 1].
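The channel-wise max of two learned linear pieces in Equation 5 can be sketched directly in numpy. The pooling and fully connected layers of $\theta(\cdot)$ are assumed away here; `theta_raw` stands in for their raw per-channel output before the final shifted sigmoid.

```python
import numpy as np

def shifted_sigmoid(x):
    """Normalizes theta's raw output to [-1, 1], as described for theta(.)."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def task_aware_attention(F, theta_raw):
    """Eq. 5 sketch for F of shape (L, S, C). theta_raw has shape (4, C):
    the hypothetical raw [a1, a2, b1, b2] output of the hyper function."""
    a1, a2, b1, b2 = shifted_sigmoid(theta_raw)   # each of shape (C,)
    # per-channel piecewise-linear activation: max of two linear pieces
    return np.maximum(a1 * F + b1, a2 * F + b2)

rng = np.random.default_rng(2)
F = rng.standard_normal((3, 1024, 16))
theta_raw = rng.standard_normal((4, 16))
out = task_aware_attention(F, theta_raw)
print(out.shape)  # (3, 1024, 16)
```

With suitable coefficients this family of activations can reduce to ReLU or an identity map per channel, which is how the head can suppress or pass channels for different tasks.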

Finally, since the above three attention mechanisms are applied sequentially, we can apply Equation 2 multiple times, effectively stacking multiple $\pi_L$, $\pi_S$ and $\pi_C$ blocks on top of each other. The detailed configuration of our dynamic head (abbreviated as DyHead) block is shown in Figure 2(a).
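The repeated application of Equation 2 amounts to composing the three attentions and stacking the composite block several times. The following sketch uses toy, parameter-free stand-ins for the three learned modules purely to show the data flow and shape-preservation of a DyHead stack; it is not the paper's implementation.

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def dyhead_block(F):
    """One DyHead block sketch on F of shape (L, S, C); the three attentions
    run in sequence, each gating one dimension (toy stand-ins, not learned)."""
    # scale-aware stand-in: per-level gate from the (S, C) average
    F = hard_sigmoid(F.mean(axis=(1, 2)))[:, None, None] * F
    # spatial-aware stand-in: per-location gate from the channel average
    F = hard_sigmoid(F.mean(axis=2, keepdims=True)) * F
    # task-aware stand-in: channel-wise max of two fixed linear pieces
    F = np.maximum(1.0 * F + 0.0, 0.5 * F - 0.1)
    return F

rng = np.random.default_rng(3)
F = rng.standard_normal((3, 1024, 16))
for _ in range(6):           # the paper stacks several DyHead blocks
    F = dyhead_block(F)
print(F.shape)  # (3, 1024, 16)
```

Because each block maps (L, S, C) to (L, S, C), blocks compose freely, which is what lets DyHead plug into different detector heads.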

Overall, the whole paradigm of our proposed dynamic head for object detection is shown in Figure 1. Any backbone network can be used to extract the feature pyramid, which is then resized to the same scale, forming a three-dimensional tensor $\mathcal{F} \in \mathbb{R}^{L \times S \times C}$ that serves as the input of the dynamic head. Next, several DyHead blocks, each containing scale-aware, spatial-aware and task-aware attention, are stacked in sequence. The output of the dynamic head can serve different tasks and representations of object detection, such as classification, center/box regression, etc.

At the bottom of Figure 1, we show the output of each attention type. We can see that the initial feature maps from the backbone are noisy due to the domain difference from ImageNet pre-training. After passing through our scale-aware attention module, the feature maps become more sensitive to the scale differences of foreground objects; after further passing through our spatial-aware attention module, the feature maps become sparser and focus on the discriminative spatial locations of foreground objects. Finally, after passing through our task-aware attention module, the feature maps re-form into different activations according to the needs of different downstream tasks. These visualizations nicely demonstrate the effectiveness of each attention module.

[Figure 1: overall paradigm of the dynamic head, with the feature maps after each attention module visualized at the bottom.]

[Figure 2: (a) detailed configuration of a DyHead block; (b) applying DyHead to a one-stage detector; (c) applying DyHead to a two-stage detector.]

3.3. Generalizing to Existing Detectors

In this section, we demonstrate how the proposed dynamic head can be integrated into an existing detector to effectively improve its performance.

**One-stage Detector**

The one-stage detector predicts object locations by densely sampling locations on the feature map, which simplifies the detector design. A typical one-stage detector (e.g., RetinaNet [16]) consists of a backbone network that extracts dense features and multiple task-specific sub-network branches that handle the different tasks. As shown in previous work [3], the object classification sub-network behaves very differently from the bounding box regression sub-network. Unlike this conventional approach, we attach only one unified branch to the backbone instead of multiple branches. Thanks to our multiple attention mechanisms, it can handle multiple tasks at the same time, which further simplifies the architecture and improves efficiency. Recently, anchor-free variants of one-stage detectors have become popular, e.g., FCOS [28], ATSS [35] and RepPoints [33], which reformulate objects as centers and keypoints to improve performance. Compared with RetinaNet, these methods require additional centerness or keypoint predictions on the classification or regression branch, which makes the construction of task-specific branches non-trivial. In contrast, deploying our dynamic head is more flexible, as it only appends the various types of predictions to the end of the head, as shown in Figure 2(b).

**Two-stage Detector**

The two-stage detector utilizes region proposals and RoI-pooling [23] layers to extract intermediate representations from the feature pyramid of the backbone network. To match this characteristic, we first apply our scale-aware attention and spatial-aware attention to the feature pyramid before the RoI-pooling layer, and then use our task-aware attention to replace the original fully connected layers, as shown in Figure 2(c).

3.4. Relation to Other Attention Mechanisms

Deformable.

Deformable convolution [7, 37] significantly improves the transformation learning of traditional convolutional layers by introducing sparse sampling. It has been widely used in object detection backbones to enhance feature representation. Although it is rarely used in object detection heads, we can regard it as modeling the S sub-dimension alone in our representation. We find that deformable modules used in the backbone can complement our proposed dynamic head. In fact, with the deformable variant of the ResNeXt-101-64x4d backbone, our dynamic head achieves a new state-of-the-art object detection result.

Non-local.

Non-local Networks [30] was a pioneering work that used attention modules to improve object detection performance. However, it uses a simple dot-product formulation to enhance a pixel's features by fusing features of other pixels from different spatial locations. This behavior can be viewed as modeling only the L × S sub-dimensions in our representation.

Transformer.

In recent years, introducing the Transformer module [29] from natural language processing into computer vision tasks has become a trend. Preliminary works [2, 38, 5] have demonstrated promising results in improving object detection. The Transformer provides a simple solution that applies multi-head fully connected layers to learn cross-attention correspondences and fuse features from different modalities. This behavior can be viewed as modeling only the S × C sub-dimensions in our representation.

The above three types of attention only partially model sub-dimensions of the feature tensor. As a unified design, our dynamic head combines attention on different dimensions into a coherent and efficient implementation. The following experiments show that this dedicated design helps existing object detectors achieve significant gains. Furthermore, compared with the implicit working principles of existing solutions, our attention mechanism explicitly addresses the challenges of object detection.

4. Experiment

(Tables and figures of the experimental results are omitted here.)

5. Conclusion

In this paper, we propose a novel object detection head that unifies scale-aware, spatial-aware and task-aware attention in a single framework, offering a new perspective on object detection heads. As a plug-in block, the dynamic head can be flexibly integrated into any existing object detection framework to boost its performance, and it is efficient to learn. Our study shows that designing and learning attention for object detection heads is an interesting direction that deserves further study. This work is just one step; further improvements could address how to make the full attention model easy to learn and compute, and how to systematically incorporate more attention modalities into the head design for better performance.

Origin blog.csdn.net/charles_zhang_/article/details/127746776