3.1 The Existing Feature Pyramid Method
To detect targets of varying sizes, a feature-pyramid-based detector makes its decisions on K feature maps taken from different feature layers. As shown in (a) below, the baseline detector applies the detection sub-network directly to the features of each pyramid layer.
Here c_l denotes the feature map generated by the backbone network, obtained from the preceding layer as c_l = f_l(c_{l-1}), where f_l represents the operation performed by the l-th convolutional layer and d(·) represents a detection sub-network, usually a single convolutional layer that generates the classification and box-regression outputs. Because the pyramid layers are taken from different depths, the shallower bottom-level features lack semantic information.
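As an illustration, the baseline scheme can be sketched as follows. This is a minimal PyTorch sketch with assumed channel counts and head layout, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BaselinePyramidDetector(nn.Module):
    """Applies a single-conv detection sub-network d to each pyramid level."""
    def __init__(self, channels, num_anchors=3, num_classes=20):
        super().__init__()
        # one 3x3 conv per level, producing class scores + 4 box offsets per anchor
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_anchors * (num_classes + 4), 3, padding=1)
            for c in channels
        )

    def forward(self, feats):
        # feats: backbone maps c_k ... c_K, coarser as depth grows
        return [head(f) for head, f in zip(self.heads, feats)]

# illustrative shapes only (assumptions)
det = BaselinePyramidDetector(channels=[128, 256, 512])
feats = [torch.randn(1, 128, 64, 64),
         torch.randn(1, 256, 32, 32),
         torch.randn(1, 512, 16, 16)]
outs = det(feats)
```

Each level is detected on independently, which is why the shallow levels keep their weak semantics in this baseline.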
To reduce the semantic gap between pyramid layers, several works propose a top-down structure with lateral connections, as shown in (c). This structure propagates semantic information from the top layer to the bottom layers at increasing resolution, while the lateral connections maintain high spatial resolution. The feature map of layer l is generated as

p_l = L_l(c_l) ⊕ T_{l+1}(p_{l+1})
where L_l(·) is the lateral connection at layer l and T_{l+1}(·) is the top-down connection from layer l+1. The operator ⊕ combines the two feature maps, e.g., by channel-wise concatenation or element-wise addition; different methods differ only in their choices of L_l, T_{l+1}, and ⊕. Although these feature pyramid methods are effective, they still have some limitations. Because the top-down connections propagate semantics in a single direction, the semantics remain unevenly distributed across layers, so a semantic gap between the pyramid feature layers persists. Moreover, one-directional lateral connections limit the ability to enrich the semantic information of all feature layers. To solve this problem, we develop a method that uses a biLSTM to deeply fuse the semantics of all feature layers instead of relying on one-directional lateral connections. The following sections present the details of the proposed method.
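The top-down merge described above can be sketched with element-wise addition as the ⊕ operator (a hedged sketch; channel sizes and the 1x1 lateral convolutions are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """p_l = L_l(c_l) + upsample(p_{l+1}): lateral 1x1 conv plus top-down path."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):
        # start from the top (semantically strongest) level
        p = self.laterals[-1](feats[-1])
        outs = [p]
        for idx in range(len(feats) - 2, -1, -1):
            # here ⊕ is element-wise addition; concatenation is the other common choice
            p = self.laterals[idx](feats[idx]) + F.interpolate(p, scale_factor=2, mode="nearest")
            outs.append(p)
        return outs[::-1]  # bottom-to-top order

fpn = TopDownPyramid([128, 256, 512])
ps = fpn([torch.randn(1, 128, 64, 64),
          torch.randn(1, 256, 32, 32),
          torch.randn(1, 512, 16, 16)])
```

Note that information only flows downward here, which is the one-directional limitation discussed above.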
3.2 ScarfNet: Overall Structure
ScarfNet resolves the inconsistency of semantic information in two steps: (1) it fuses the fragmented semantic information using a biLSTM; (2) it redistributes the fused features to each feature layer using a channel-wise attention module. The overall structure is shown below:
Taking the K pyramid features p_k, …, p_K as input, ScarfNet generates the l-th feature map as

p̂_l = ArNet_l(ScNet(p_k, …, p_K))    (6)
As shown in equation (6), ScarfNet is composed of two parts: the semantic combining network (ScNet) and the attentive redistribution network (ArNet). First, ScNet fuses the pyramid features with a biLSTM and generates output features carrying the fused semantics. Second, ArNet gathers the output features from the biLSTM and applies channel-wise attention to generate multi-scale features with high-quality semantics, which are concatenated with the original feature pyramid. Finally, the resulting feature maps are processed separately by the detection sub-networks to produce the final detection results.
3.3 Semantic Combining Network (ScNet)
ScNet produces its feature maps as

q_l = ScNet_l(p_k, …, p_K)
where q_l is the output feature map of layer l. The figure below describes the details of ScNet. ScNet uses a biLSTM to uniformly fuse the fragmented features across the pyramid levels; through its gate functions, the biLSTM selects which semantic information from the multi-scale layers to fuse. ScNet consists of a matching module and the biLSTM. The matching module first resizes the pyramid features so that they share the same spatial size, and then adjusts the channel dimension with a convolutional layer; as a result, it generates feature maps with the same number of channels and the same size. The resizing is performed by bilinear interpolation. The biLSTM follows reference [23]: based on globally pooled features, convolutional layers compute the input connections and gate parameters, which significantly saves computation.
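The matching module described above can be sketched as follows (a minimal sketch; the target size and channel count are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    """Resizes pyramid features to a common size (bilinear interpolation)
    and matches their channel dimensions (1x1 convolution)."""
    def __init__(self, in_channels, out_channels=256, out_size=(64, 64)):
        super().__init__()
        self.out_size = out_size
        self.projs = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):
        out = []
        for proj, f in zip(self.projs, feats):
            f = F.interpolate(f, size=self.out_size, mode="bilinear", align_corners=False)
            out.append(proj(f))
        return out  # identical spatial size and channel count at every level

mm = MatchingModule([128, 256, 512])
qs = mm([torch.randn(1, 128, 64, 64),
         torch.randn(1, 256, 32, 32),
         torch.randn(1, 512, 16, 16)])
```

After matching, the levels can be treated as a uniform sequence for the biLSTM to traverse.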
In particular, the operation of the biLSTM can be simplified as

i_l = σ(W_i * x_l + U_i * h_{l-1} + b_i)
f_l = σ(W_f * x_l + U_f * h_{l-1} + b_f)
o_l = σ(W_o * x_l + U_o * h_{l-1} + b_o)
m_l = f_l ⊙ m_{l-1} + i_l ⊙ tanh(W_m * x_l + U_m * h_{l-1} + b_m)
h_l = o_l ⊙ tanh(m_l)
where x_l is the matched input feature of level l, h_l and m_l are the hidden and cell states, σ is the sigmoid function, * denotes convolution, and ⊙ denotes the Hadamard product. The state of the biLSTM is updated in both the forward and backward directions; the formulas above describe the forward update, and the backward update takes a similar form.
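Treating the pyramid levels as the sequence dimension, one forward update step can be sketched with convolutional gates. This is a hedged sketch: the fused-gate layout and channel sizes are assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class ConvLSTMStep(nn.Module):
    """One forward LSTM update over pyramid levels, with convolutional gates."""
    def __init__(self, channels):
        super().__init__()
        # a single conv computes all four gate pre-activations from [x_l, h_{l-1}]
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x, h_prev, m_prev):
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = z.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        m = f * m_prev + i * torch.tanh(g)  # "*" here is the Hadamard product
        h = o * torch.tanh(m)
        return h, m

step = ConvLSTMStep(64)
x = torch.randn(1, 64, 32, 32)
h = m = torch.zeros(1, 64, 32, 32)
h, m = step(x, h, m)
```

The backward direction runs the same kind of cell from level K down to k, and the two hidden sequences together form the biLSTM output.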
3.4 Attentive Redistribution Network (ArNet)
ArNet generates a high-level semantic feature map a_l, which is concatenated with the original pyramid feature map p_l:

p̂_l = p_l ⊕ a_l
where the operator ⊕ denotes concatenation along the channel dimension. The specific structure of ArNet is shown in Figure 4. ArNet concatenates the outputs of the biLSTM and applies a channel-wise attention mechanism to them. The attention weights are obtained by constructing a vector with global average pooling, passing it through two fully connected layers, and finally through a sigmoid function. Note that these channel-wise attention modules allow the network to select which semantics are propagated to each pyramid layer. After the attention weights are applied, a matching module downsamples the resulting feature maps and applies convolutions to match the channel dimensions of the original pyramid features. Finally, the outputs of the matching module are concatenated with the original feature maps to generate the high-semantic features p̂_l.
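The attention-weight computation (global average pooling, two fully connected layers, then a sigmoid) can be sketched as a squeeze-and-excitation-style module; the channel count and reduction ratio below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """GAP -> FC -> ReLU -> FC -> sigmoid, producing per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                          # global average pooling -> (N, C)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))
        return x * w[:, :, None, None]                  # rescale each channel

attn = ChannelAttention(64)
y = attn(torch.randn(2, 64, 32, 32))
```

One such module per pyramid layer lets each layer take a different mixture of the fused semantics before the downsampling and concatenation steps.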