Single object tracking - [Transformer] MixFormer: End-to-End Tracking with Iterative Mixed Attention

Paper
Code

Article focus

The starting point of this paper is to unify the first two steps [feature extraction, feature fusion] of the existing multi-stage Siamese tracking framework [feature extraction - feature fusion - bounding box prediction]. Originally, [feature extraction] extracts template and search-region features separately, and [feature fusion] then fuses the template features with the search-region features. MixFormer instead feeds the image pixels of the template and the search region together, using the self-attention mechanism for feature extraction and enhancement and the cross-attention mechanism for cross-feature fusion. The above concerns only the spatial dimension; along the temporal dimension, a template update strategy is applied to handle challenges such as occlusion.
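For intuition, here is a rough pseudocode contrast between the conventional multi-stage pipeline and the unified one. All names below (backbone, fusion, mixed_attention_stages, head) are hypothetical placeholders for illustration, not the authors' API.

```python
# A minimal sketch (not the authors' code) contrasting the two pipelines.
def siamese_tracker(template, search, backbone, fusion, head):
    # Three separate steps: extract, fuse, predict.
    z = backbone(template)      # feature extraction (template)
    x = backbone(search)        # feature extraction (search region)
    fused = fusion(z, x)        # feature fusion (e.g. correlation or attention)
    return head(fused)          # bounding-box prediction

def mixformer_tracker(template_tokens, search_tokens, mixed_attention_stages, head):
    # Extraction and fusion happen jointly inside every mixed-attention stage.
    for stage in mixed_attention_stages:
        template_tokens, search_tokens = stage(template_tokens, search_tokens)
    return head(search_tokens)  # prediction runs on the search-region tokens
```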

Network structure

MAM (Mixed Attention Module)

The role of this module is both to extract features and to fuse them: self-attention performs feature extraction and enhancement, while cross-attention performs the fusion between template and search region.

  • Step 1 (input): feature tokens of the target template and of the search region (shallow features produced by convolution).
  • Step 2: encode the spatial position of the tokens. The tokens are reshaped & padded into 2D feature maps, normalized, and then passed through a depth-wise convolution that realizes the positional encoding; Flatten & Linear then maps the tokens back into the Transformer input format.
  • Step 3: apply the attention operation to the target tokens and the search-region tokens. The paper adopts an asymmetric scheme, shown as the blue line in its figure: the target tokens undergo self-attention among themselves, while the search-region tokens undergo cross-attention over both sets [search-region tokens as query, search-region tokens + target tokens as key and value]. The orange line is dotted because the paper chooses not to perform the symmetric cross-attention [target tokens as query, search-region tokens + target tokens as key and value]: the authors argue that this would pollute the target template by mixing in distracting elements from the search region, which can also be seen in the visualizations of TransT. A minimal sketch of this asymmetric mixed attention follows the list.
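Below is a minimal single-head PyTorch sketch of that asymmetric mixed attention, written under simplifying assumptions: plain linear projections stand in for the convolutional projection / positional encoding, and multi-head details are omitted. It only illustrates the query/key/value routing, not the authors' implementation.

```python
# A minimal sketch of the asymmetric mixed attention described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricMixedAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, target_tokens, search_tokens):
        # target_tokens: (B, Nt, C); search_tokens: (B, Ns, C)
        mixed = torch.cat([search_tokens, target_tokens], dim=1)

        # Target branch: self-attention over the target tokens only,
        # so the template is not polluted by the search region.
        qt, kt, vt = self.q(target_tokens), self.k(target_tokens), self.v(target_tokens)
        attn_t = F.softmax((qt @ kt.transpose(-2, -1)) * self.scale, dim=-1)
        target_out = attn_t @ vt

        # Search branch: cross-attention with the search tokens as query and
        # the concatenated search + target tokens as key and value.
        qs, km, vm = self.q(search_tokens), self.k(mixed), self.v(mixed)
        attn_s = F.softmax((qs @ km.transpose(-2, -1)) * self.scale, dim=-1)
        search_out = attn_s @ vm

        return target_out, search_out
```

For example, `AsymmetricMixedAttention(dim=64)` applied to a (1, 49, 64) template tensor and a (1, 256, 64) search tensor returns updated tensors of the same shapes.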

MixFormer

The MAM module is a simple building block that can be stacked to form a backbone, much like the residual block of ResNet; the overall network stacks MAM layers in several stages and attaches a prediction head.
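As a rough illustration of stacking mixed-attention blocks into stages (reusing the toy AsymmetricMixedAttention class from the sketch above), consider the following; the depths, widths, and MLP ratio are illustrative placeholders, not the paper's exact configuration.

```python
# A rough sketch of stacking mixed-attention blocks into stages, in the same
# spirit as stacking residual blocks in ResNet.
import torch.nn as nn

class MixedAttentionBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = AsymmetricMixedAttention(dim)   # toy block from the sketch above
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, target, search):
        t, s = self.attn(self.norm1(target), self.norm1(search))
        target, search = target + t, search + s          # attention residual
        target = target + self.mlp(self.norm2(target))   # MLP residual
        search = search + self.mlp(self.norm2(search))
        return target, search

def build_backbone(dims=(64, 192, 384), depths=(1, 4, 16)):
    # One group of blocks per stage; the real network also downsamples and
    # re-embeds tokens between stages via convolution, which is omitted here.
    return nn.ModuleList(
        [nn.ModuleList([MixedAttentionBlock(dim) for _ in range(depth)])
         for dim, depth in zip(dims, depths)]
    )
```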

  • The detailed per-stage parameters are listed in the paper's table. Each layer consists of a MAM module plus a linear mapping layer; H denotes the number of heads in the multi-head attention, D the dimension of the feature embedding, and R the expansion ratio of the features in the MLP.
  • The head is a fully convolutional corner-prediction network designed with reference to STARK: several Conv-BN-ReLU layers predict probability maps for the top-left and bottom-right corners of the bounding box, as sketched after this list.
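A hedged sketch of such a corner head is given below; the channel widths and the number of Conv-BN-ReLU layers are illustrative assumptions rather than the exact STARK/MixFormer configuration.

```python
# A simplified corner-prediction head in the spirit of STARK: stacks of
# Conv-BN-ReLU layers ending in a one-channel score map per corner.
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                         nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))

class CornerHead(nn.Module):
    def __init__(self, in_dim=384):
        super().__init__()
        def branch():
            return nn.Sequential(conv_bn_relu(in_dim, 128),
                                 conv_bn_relu(128, 64),
                                 conv_bn_relu(64, 32),
                                 nn.Conv2d(32, 1, kernel_size=1))
        self.top_left = branch()      # top-left corner logits
        self.bottom_right = branch()  # bottom-right corner logits

    def forward(self, search_feat):
        # search_feat: (B, C, H, W) feature map of the search region
        tl_prob = self.top_left(search_feat).flatten(2).softmax(dim=-1)
        br_prob = self.bottom_right(search_feat).flatten(2).softmax(dim=-1)
        return tl_prob, br_prob       # (B, 1, H*W) probability maps
```

The corner coordinates would then be obtained as the expectation of the spatial positions under these probability maps (a soft-argmax), as done in STARK.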



Origin blog.csdn.net/qq_42312574/article/details/126460042