ScarfNet: Multi-scale Features with Deeply Fused and Redistributed Semantics

1. Existing feature pyramid methods

To detect objects of varying sizes, a feature pyramid-based detector makes its decisions on k feature maps taken from different layers. As shown in (a) below, the baseline detector uses the feature map X_l of the l-th layer:

(figure: (a) baseline feature pyramid detector; (c) top-down structure with lateral connections)

                                                             X_l = B_l(X_{l-1}), \qquad \text{Detection output} = D_l(X_l)

where l = n-k+1, ..., n. Here X_{1:n-k} (= [X_1, X_2, ..., X_{n-k}]) are the feature maps generated by the backbone network, while X_{n-k+1:n} are obtained from the convolutional layers that follow it. B_l(\cdot) denotes the operation performed by the l-th convolutional layer, and D_l(\cdot) denotes a detection sub-network, usually a single 3 \times 3 convolutional layer that produces the classification and box-regression outputs. Because the pyramid layers are taken from different depths, the shallower bottom features lack semantic information.
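As a concrete sketch of this per-layer pipeline (X_l = B_l(X_{l-1}), output = D_l(X_l)), the snippet below uses hypothetical stand-ins: a 2x2 average pool for the learned convolutional stage B_l and a per-channel mean for the detection sub-network D_l.

```python
import numpy as np

def B(x):
    """Stand-in for one strided convolutional stage B_l: a 2x2
    average pool that halves the spatial resolution (channels kept)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def D(x):
    """Stand-in for the 3x3 detection sub-network D_l: a global
    mean per channel, playing the role of class/box predictions."""
    return x.mean(axis=(1, 2))

# Backbone output X_{n-k}: 8 channels, 32x32 spatial size.
x = np.random.rand(8, 32, 32)

pyramid, detections = [], []
for _ in range(3):           # build X_{n-k+1:n} with k = 3 levels
    x = B(x)                 # X_l = B_l(X_{l-1})
    pyramid.append(x)
    detections.append(D(x))  # per-level detection output D_l(X_l)

print([p.shape for p in pyramid])  # spatial size halves per level
```

Note how each level is predicted from independently: this is exactly why the shallow levels never see the deep semantics.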

To reduce the semantic gap between pyramid layers, several works have proposed a top-down structure with lateral connections, as shown in (c). This structure propagates semantic information from the top layer down to the bottom layers at increasing resolution, while the lateral connections preserve high spatial resolution. The feature map X'_l of layer l is generated as

                                                             X'_l = L_l(X_l) \oplus T_l(X'_{l+1})

where l = n-k+1, ..., n, L_l(\cdot) is the lateral connection of the l-th layer, and T_l(\cdot) is its top-down connection. The operator \oplus combines two feature maps, e.g., by channel-wise concatenation or element-wise addition; the various methods differ only in their choices of T_l(\cdot) and L_l(\cdot). These feature pyramid methods still have limitations. Because the top-down connections propagate semantics in only one direction, the semantics remain unevenly distributed across layers, so the semantic gap between pyramid levels persists. Moreover, processing each layer through a single unidirectional lateral connection limits the amount of new semantic information that can be generated. To address this, instead of unidirectional lateral connections we use a biLSTM to create a deep fusion across all feature layers. The following sections present the details of the proposed method.
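A minimal sketch of the top-down fusion X'_l = L_l(X_l) ⊕ T_l(X'_{l+1}), assuming for illustration an identity lateral connection, nearest-neighbour upsampling for T_l, and addition for ⊕ (all stand-ins for the learned layers):

```python
import numpy as np

def lateral(x):
    """Stand-in for the 1x1 lateral connection L_l (identity here)."""
    return x

def upsample2x(x):
    """Stand-in for the top-down connection T_l: nearest-neighbour
    2x upsampling of the coarser map (bilinear in many variants)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Pyramid X_{n-k+1:n}, finest first; all levels share 8 channels.
pyramid = [np.random.rand(8, 16, 16),
           np.random.rand(8, 8, 8),
           np.random.rand(8, 4, 4)]

# X'_n = L_n(X_n); then X'_l = L_l(X_l) + T_l(X'_{l+1}) going down.
fused = [lateral(pyramid[-1])]
for x in reversed(pyramid[:-1]):
    fused.append(lateral(x) + upsample2x(fused[-1]))
fused = fused[::-1]          # restore finest-to-coarsest order

print([f.shape for f in fused])
```

The one-way loop makes the limitation visible: information flows only top-down, so the coarse levels receive nothing back from the fine ones.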

2. ScarfNet: overall structure

ScarfNet resolves the inconsistency of semantic information in two steps: (1) fuse the fragmented semantic information across layers with a biLSTM; (2) redistribute the fused features to each feature layer with channel-wise attention modules. The overall structure is shown below:

(figure: overall structure of ScarfNet)

Taking the k pyramid features X_{n-k+1:n} as input, ScarfNet generates the feature map X'_l of layer l as:

                                                             X'_l = \text{ScarfNet}_l(X_{n-k+1:n})        (6)

where l = n-k+1, ..., n. As equation (6) shows, ScarfNet consists of two parts: the semantic combining network (ScNet) and the attention redistribution network (ArNet). First, ScNet fuses the pyramid features X_{n-k+1:n} with a biLSTM and generates output features carrying the fused semantics. Second, ArNet collects the biLSTM outputs and applies channel-wise attention to generate highly semantic multi-scale features, which are concatenated with the original feature pyramid. Finally, each resulting feature map is processed separately by the detection sub-network D_l(\cdot) to produce the final detection results.

3. Semantic Combining Network (ScNet)

ScNet produces the feature maps X_{n-k+1:n}^{f} as follows:

                                                             X^f_{n-k+1:n} = \text{ScNet}(X_{n-k+1:n})

X^f_l is the output feature map of layer l. The figure below shows the details of ScNet. ScNet uses a biLSTM to uniformly fuse the fragmented semantics spread across the pyramid levels; through its gate functions, the biLSTM selectively fuses semantic information over the multi-scale layers. ScNet consists of a matching module and the biLSTM. The matching module first rescales the pyramid features X_{n-k+1:n} to a common spatial size, then adjusts the channel dimension with a 1 \times 1 convolutional layer; as a result, it outputs feature maps with identical size and channel count. The rescaling is done by bilinear interpolation. The biLSTM follows reference [23]: based on globally pooled features, convolutional layers compute the input connections and gate parameters, which significantly saves computation.
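The matching module can be sketched as follows. The bilinear resize is implemented directly in NumPy, and the 1x1 convolution is simulated by a random per-pixel channel projection; the real module of course uses learned weights, and the target size and channel count here (16x16, 128) are illustrative assumptions.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of a (C, H, W) map to (C, out_h, out_w)."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]; wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def conv1x1(x, out_ch, rng):
    """Stand-in for the 1x1 conv that matches channel dimensions:
    a random (out_ch, in_ch) projection applied at every pixel."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
# Pyramid levels with different spatial sizes and channel counts.
pyramid = [rng.random((256, 16, 16)),
           rng.random((512, 8, 8)),
           rng.random((1024, 4, 4))]

# Matching module: resize every level to the finest resolution,
# then project all of them to a common channel count (128 here).
matched = [conv1x1(bilinear_resize(x, 16, 16), 128, rng) for x in pyramid]
print([m.shape for m in matched])  # all (128, 16, 16)
```

Once all levels share one shape, they can be treated as an ordered sequence and fed to the biLSTM.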

(figure: detailed structure of ScNet)

In particular, the operation of the biLSTM can be summarized as:

                                                             i_l = \sigma(W_i x_l + U_i h_{l-1} + b_i)\\
                                                             f_l = \sigma(W_f x_l + U_f h_{l-1} + b_f)\\
                                                             o_l = \sigma(W_o x_l + U_o h_{l-1} + b_o)\\
                                                             c_l = f_l \circ c_{l-1} + i_l \circ \tanh(W_c x_l + U_c h_{l-1} + b_c)\\
                                                             h_l = o_l \circ \tanh(c_l)

where \circ denotes the Hadamard product. The state of the biLSTM is updated in both the forward and backward directions; the formulas above describe the forward update, and the backward update takes a similar form.
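The forward update is the standard LSTM cell recurrence. Below is a NumPy sketch that runs it over the k pyramid levels as the input sequence, in both directions; random weights and plain vectors (rather than pooled feature maps) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward LSTM update; * on gate terms is the Hadamard product.
    W, U hold the stacked input/recurrent weights of the four gates."""
    z = W @ x + U @ h_prev + b           # pre-activations, shape (4d,)
    d = h_prev.size
    i = sigmoid(z[:d])                   # input gate i_l
    f = sigmoid(z[d:2*d])                # forget gate f_l
    o = sigmoid(z[2*d:3*d])              # output gate o_l
    g = np.tanh(z[3*d:])                 # candidate cell state
    c = f * c_prev + i * g               # c_l = f ∘ c_{l-1} + i ∘ g
    h = o * np.tanh(c)                   # h_l = o ∘ tanh(c_l)
    return h, c

rng = np.random.default_rng(0)
d = 8                                    # state size (illustrative)
W, U = rng.standard_normal((4*d, d)), rng.standard_normal((4*d, d))
b = np.zeros(4*d)

# Treat the k matched pyramid features as the biLSTM's input sequence.
seq = [rng.standard_normal(d) for _ in range(3)]

h, c = np.zeros(d), np.zeros(d)
for x in seq:                            # forward pass over the levels
    h, c = lstm_step(x, h, c, W, U, b)
hb, cb = np.zeros(d), np.zeros(d)
for x in reversed(seq):                  # backward pass is symmetric
    hb, cb = lstm_step(x, hb, cb, W, U, b)

print(h.shape, hb.shape)
```

The gates are what make the fusion selective: each level can keep or discard semantics arriving from the levels before and after it.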

4. Attention Redistribution Network (ArNet)

ArNet generates highly semantic feature maps, which are concatenated with the original pyramid feature maps X_l. The expression is:

                                                             X'_l = X_l \oplus \text{ArNet}_l(X^f_{n-k+1:n})

The operator \oplus here denotes channel-wise concatenation. The specific structure of ArNet is shown in Figure 4. ArNet concatenates the biLSTM outputs X_{n-k+1:n}^{f} and applies a channel-wise attention mechanism to them. The attention weights are obtained by global average pooling, which yields a 1 \times 1 vector, followed by two fully connected layers and finally a sigmoid function. Note that these channel-wise attention modules allow semantics to be selectively propagated to each pyramid layer. After the attention weights are applied, a matching module downsamples the resulting feature map and applies a 1 \times 1 convolution to match the channel dimensions of the original pyramid features. Finally, the matching-module output is concatenated with the original feature map X_l to generate the highly semantic feature X'_l.
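The channel-wise attention step (global average pooling, two fully connected layers, a sigmoid, then Hadamard reweighting) can be sketched as below; the random weights and the channel count/reduction ratio are stand-ins for the learned layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, W1, W2):
    """Channel-wise attention: global average pool -> FC -> ReLU
    -> FC -> sigmoid -> per-channel reweighting of the input."""
    s = x.mean(axis=(1, 2))                  # (C,) global average pool
    a = sigmoid(W2 @ np.maximum(W1 @ s, 0))  # (C,) attention weights
    return x * a[:, None, None]              # Hadamard reweighting

rng = np.random.default_rng(0)
C, r = 64, 16                                # channels, reduction ratio
W1 = rng.standard_normal((C // r, C))        # first FC layer
W2 = rng.standard_normal((C, C // r))        # second FC layer

fused = rng.random((C, 16, 16))              # stand-in biLSTM output
weighted = channel_attention(fused, W1, W2)
print(weighted.shape)
```

Each pyramid layer has its own attention module, so every level draws a different mix of channels from the shared fused features.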

(Figure 4: detailed structure of ArNet)

Origin blog.csdn.net/weixin_36670529/article/details/105519441