A review of research on visual Transformers in the field of low-level vision

Basic principles of the visual Transformer


In image processing, the Vision Transformer (ViT) first divides the input image into patches, applies a linear embedding to map each patch to a vector, and arranges the resulting vectors into a sequence that serves as the encoder input. For classification tasks, a learnable embedding vector (the class token) is added to this sequence; its output is used as the representation of the category prediction, and the final result is produced by a fully connected layer.
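
A minimal PyTorch-style sketch of this pipeline is shown below; the module name, image size, patch size, and embedding dimension are illustrative assumptions rather than details from the article.

```python
import torch
import torch.nn as nn

class SimplePatchEmbedding(nn.Module):
    """Split an image into patches, embed them linearly, and prepend a learnable class token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution cuts the image into patches and embeds them in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D): one embedding per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1)      # (B, N+1, D): class token first

tokens = SimplePatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # torch.Size([2, 197, 768])
```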

Attention mechanism

The attention mechanism allows the network to focus on the relevant parts of the input while suppressing irrelevant information.

Calculation steps:
  1. Project each input vector $a_i$ into a query vector, a queried (key) vector, and an information (value) vector:
    $$q_i = a_i W^q, \qquad k_i = a_i W^k, \qquad v_i = a_i W^v$$
    where $q_i$ is the query vector, which is later matched against each $k_i$; $k_i$ is the queried vector, which is later matched against each $q_i$; and $v_i$ is the information vector extracted from $a_i$.
  2. Calculate the similarity between $q_i$ and each $k_j$ to obtain the raw weights, commonly as a scaled dot product:
    $$\alpha_{i,j} = \frac{q_i \cdot k_j}{\sqrt{d}}$$
  3. Normalize the similarity weights. The softmax function is commonly used to normalize the similarity matrix into an attention weight matrix:
    $$\hat{\alpha}_{i,j} = \mathrm{softmax}(\alpha_{i,j}) = \frac{\exp(\alpha_{i,j})}{\sum_{j'} \exp(\alpha_{i,j'})}$$
    The softmax function converts the raw scores into a probability distribution whose values lie in [0, 1] and sum to 1.
  4. Attention is obtained by summing the information vectors according to the weights:
    $$\mathrm{Attention}(Q, K, V) = \sum_{i=1}^{L_x} \mathrm{softmax}\big(\mathrm{Similarity}(Q, K_i)\big) \cdot V_i$$
    where $L_x$ is the length of the input sequence, Similarity denotes the similarity calculation, and $Q$, $K$, and $V$ are the query, queried, and information vectors respectively. A short code sketch of these four steps follows the list.
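
These four steps can be written in a few lines of PyTorch. The function name, tensor shapes, and the scaled dot-product similarity are assumptions made for this sketch, not code from the article.

```python
import torch
import torch.nn.functional as F

def single_head_attention(a, W_q, W_k, W_v):
    """Steps 1-4: project the inputs to q/k/v, score, normalize with softmax, weight-sum v."""
    q = a @ W_q                                            # step 1: query vectors
    k = a @ W_k                                            #         queried (key) vectors
    v = a @ W_v                                            #         information (value) vectors
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # step 2: scaled dot-product similarity
    weights = F.softmax(scores, dim=-1)                    # step 3: normalize into attention weights
    return weights @ v                                     # step 4: weighted sum of the value vectors

a = torch.randn(10, 64)                                    # ten input vectors a_i of dimension 64
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
out = single_head_attention(a, W_q, W_k, W_v)
print(out.shape)                                           # torch.Size([10, 64])
```
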
Image serialization and positional encoding

The input to a Transformer is a sequence, so to process an image, the two-dimensional image must first be converted into a one-dimensional sequence of patch vectors. Because this flattening discards the spatial layout, a positional encoding is added to each patch embedding so the model retains information about where each patch came from.
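
As an illustration of this serialization step, the sketch below cuts an image into non-overlapping patches, flattens each patch into a vector, and adds a learnable positional embedding (the scheme used by ViT); the concrete shapes and variable names are assumptions for the example.

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
N = (H // P) * (W // P)                          # number of patch tokens: 14 * 14 = 196

img = torch.randn(B, C, H, W)
# Serialize: cut the image into non-overlapping P x P patches and flatten each one.
patches = img.unfold(2, P, P).unfold(3, P, P)    # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)

embed = nn.Linear(C * P * P, D)                  # linear patch embedding
pos = nn.Parameter(torch.zeros(1, N, D))         # learnable positional encoding, one per position
tokens = embed(patches) + pos                    # the 1-D sequence the Transformer consumes
print(tokens.shape)                              # torch.Size([2, 196, 768])
```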

Transformer module

The Transformer module is based on an encoder-decoder architecture, in which both the encoder and the decoder are stacks of multiple layers. The encoder is responsible for extracting features, and the decoder converts the extracted features into the output. Each encoder layer consists of an attention layer and a fully connected (feed-forward) layer.
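
A minimal PyTorch sketch of one encoder layer matching this description follows; the class name, residual connections, layer-normalization placement, and dimensions are standard choices assumed for illustration rather than taken from the article.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention plus a fully connected block, each with a residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                                  # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention sub-layer + residual
        x = x + self.mlp(self.norm2(x))                    # feed-forward sub-layer + residual
        return x

print(EncoderLayer()(torch.randn(2, 197, 768)).shape)      # torch.Size([2, 197, 768])
```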

Advantages and Disadvantages of Visual Transformer

Advantages
  • Strong multi-modal fusion ability
  • Wider receptive field
Disadvantages
  • ViT involves a large amount of computation, a large number of parameters, and high algorithmic complexity (see the rough cost estimate after this list).
  • High demand for training data
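
To make the cost argument concrete, the rough estimate below counts the multiply-adds of global self-attention for a standard ViT-Base-style setting (224x224 input, 16x16 patches, embedding dimension 768); these numbers are assumptions for illustration, not figures from the article.

```python
# Global self-attention compares every token with every other token,
# so the per-layer cost of Q K^T and (attention @ V) grows as O(N^2 * d).
patch, img, d = 16, 224, 768
N = (img // patch) ** 2           # 14 * 14 = 196 patch tokens
attn_macs = 2 * N * N * d         # Q K^T plus attention @ V, ignoring the q/k/v projections
print(N, f"{attn_macs:.2e}")      # 196 5.90e+07; doubling the image side gives 4x the tokens and 16x this cost
```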

Application of Transformer in low-level vision tasks

Commonly used data sets for low-level vision tasks


Origin blog.csdn.net/weixin_47020721/article/details/133012434