Transformer【ViT】

reference

The blogger's reproduction is extremely detailed, so I am making a note of it here.

Neural Network Learning Notes 67 - A Detailed Explanation of the Pytorch Version Vision Transformer (VIT) Model Reproduction

Summary of Transformer Model Innovation Ideas in Computer Vision (Tom Hardy's Blog, CSDN)

Vision Transformer Explained in Detail

ViT

pre-processing

network structure

overall thinking

Object detection: DETR (2020.5) --> Classification: ViT (2020.10) --> Segmentation: SETR (2020.12) --> Swin Transformer (2021.3) --> ...

The difficulty of using the Transformer in computer vision is that the sequence is too long. Previous work includes applying the Transformer after extracting features with a CNN, performing self-attention within small windows, and applying self-attention along the image height and width separately. All of these work reasonably well, and self-attention that completely replaces convolution had already appeared in vision, but a network that applies a standard Transformer directly to images had not.

The Transformer lacks the inductive biases of a CNN (the locality of the convolution kernel and the translation invariance of the sliding convolution operation give a CNN certain prior information), so it requires a larger amount of training data.

ViT uses image-specific inductive bias only when splitting the image into patches and in the position encoding. This is done to show, as directly as possible, that the standard Transformer from NLP can perform visual tasks.

specific structure

feature extraction

(224, 224, 3) --> (14, 14, 768) --> (196, 768) --> (197, 768) --> (197, 768)

1. Patch

A 16×16 convolution with stride 16, followed by flattening of the height and width dimensions, plus a Cls Token of shape (1, 768).

The Cls Token goes through feature extraction together with the patch tokens.
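
Here is a minimal sketch of this step, assuming the standard ViT-B/16 setting (224×224 input, 16×16 patches, embedding dimension 768); the class name PatchEmbed and the example shapes are illustrative, not the exact code reproduced in the blog:

import torch
from torch import nn

class PatchEmbed(nn.Module):
    """Cut the image into 16x16 patches with a stride-16 convolution, then flatten H and W."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                    # (B, 3, 224, 224) -> (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # -> (B, 196, 768)
        return x

# The Cls Token is a learnable (1, 768) vector (stored here with a batch axis) that is
# prepended to the patch tokens so that it goes through feature extraction with them.
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
patches   = PatchEmbed()(torch.randn(2, 3, 224, 224))                  # (2, 196, 768)
tokens    = torch.cat([cls_token.expand(2, -1, -1), patches], dim=1)   # (2, 197, 768)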

2. Position Embedding

Add position information to all features so that the network can distinguish different regions.

nn.Parameter() generates a learnable tensor of shape (196, 768); concatenating it with the entry for the Cls Token above gives (197, 768), which is then added to the tensor obtained in step 1.
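
A minimal sketch of this step, following the description above (a learnable table for the 196 patch positions is concatenated with an entry for the Cls Token position and added to the (197, 768) tokens; the exact shapes and variable names vary between implementations and are assumptions here):

import torch
from torch import nn

num_patches, dim = 196, 768

# Learnable position embeddings: one row per patch plus one row for the Cls Token
patch_pos = nn.Parameter(torch.zeros(1, num_patches, dim))   # (1, 196, 768)
cls_pos   = nn.Parameter(torch.zeros(1, 1, dim))             # (1, 1, 768)
pos_embed = torch.cat([cls_pos, patch_pos], dim=1)           # (1, 197, 768)

# Dropout applied before entering the encoder (mentioned in the next section)
pos_drop = nn.Dropout(0.1)

tokens = torch.randn(2, 197, dim)        # the (B, 197, 768) tensor from step 1
x = pos_drop(tokens + pos_embed)         # still (B, 197, 768)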

3. Transformer Encoder

(1) Description

1) The L above indicates how many Transformer blocks are stacked.

2) Inside the attention, dropout is applied after the q·k multiplication and again after the output fully connected layer; outside the attention, dropout is likewise applied at the end of the MLP. The dropout here zeroes out values of the input feature map, and as more layers are stacked the probability of zeroing becomes lower (presumably the point of this operation is to slightly disrupt the network's fit and prevent overfitting; the disruption probability is very low, and the source code is worth studying). Dropout is also applied once before entering the encoder.

3) In the illustration, the sequence length is only 3 and the feature length of each sequence element is only 3. In ViT's Transformer Encoder, the sequence length is 197 and the feature length of each element within each head is 768 // num_heads.

(2) Internal execution details

(3) Specific modules

Norm

nn.LayerNorm

Multi-Head Attention

import torch
from torch import nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads  = num_heads
        self.scale      = (dim // num_heads) ** -0.5

        self.qkv        = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop  = nn.Dropout(attn_drop)
        self.proj       = nn.Linear(dim, dim)
        self.proj_drop  = nn.Dropout(proj_drop)

    def forward(self, x):
        # batchsize, 197, 768
        B, N, C     = x.shape
        # Expand the feature dimension 3x with a fully connected layer, then split it into num_heads parts: 3 (qkv), batchsize, 12 (num_heads), 197 (patches), 64 (768 // 12)
        qkv         = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        # Split into q, k, v, each of shape: batchsize, 12 (num_heads), 197 (patches), 64 (768 // 12)
        q, k, v     = qkv[0], qkv[1], qkv[2]

        # Matrix-multiply q and k (scaled dot product)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Softmax turns each row into proportions that sum to 1
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # Multiply the attention weights with v
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # Output projection (linear layer)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
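
A quick shape check of the module above, using the numbers quoted in this post (the batch size and qkv_bias value are arbitrary choices for the example):

import torch

attn = Attention(dim=768, num_heads=12, qkv_bias=True)
x    = torch.randn(2, 197, 768)     # (batchsize, sequence length, feature length)
out  = attn(x)
print(out.shape)                    # torch.Size([2, 197, 768])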

MLP

Two nn.Linear() layers, with GELU used as the activation function in between.

class Mlp(nn.Module):
    """ MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features    = out_features or in_features
        hidden_features = hidden_features or in_features
        drop_probs      = (drop, drop)

        self.fc1    = nn.Linear(in_features, hidden_features)
        self.act    = act_layer()
        self.drop1  = nn.Dropout(drop_probs[0])
        self.fc2    = nn.Linear(hidden_features, out_features)
        self.drop2  = nn.Dropout(drop_probs[1])

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x

add

Note the two connecting lines in the middle of the block diagram: these are the residual connections.

One patch is multiplied with all patches to compute its importance scores, and those scores are then multiplied back onto all patches and summed to produce that one (new) patch. Replace "one" with "all" and that is the whole self-attention process.

Summary: each patch ends up as a weighted sum of all the patches, with the weights measured relative to that patch.
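
Putting the pieces together, here is a minimal sketch of one encoder block built from the Attention and Mlp modules above, with pre-norm and the two residual connections; drop path (stochastic depth) and other details of the reproduced source code are omitted, so treat this as an illustration rather than the exact implementation:

from torch import nn

class Block(nn.Module):
    """One Transformer encoder block: Norm -> Attention -> add, then Norm -> MLP -> add."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0., attn_drop=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn  = Attention(dim, num_heads=num_heads, attn_drop=attn_drop, proj_drop=drop)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp   = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), drop=drop)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # first residual connection
        x = x + self.mlp(self.norm2(x))    # second residual connection
        return x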

Classification

(1) Description

(197, 768) --> (768)

At this point, only the Cls Token is taken. As mentioned earlier, the Cls Token has interacted with all the other patches and carries their information, so a single fully connected layer is enough to produce the classification.
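
A minimal sketch of the classification step described above (the final LayerNorm and the num_classes value are assumptions based on common implementations, not taken from this post):

import torch
from torch import nn

num_classes = 1000                  # assumed number of classes
norm = nn.LayerNorm(768)            # final norm, as in common implementations
head = nn.Linear(768, num_classes)  # the single fully connected layer

x      = torch.randn(2, 197, 768)   # encoder output
cls    = norm(x)[:, 0]              # keep only the Cls Token: (2, 768)
logits = head(cls)                  # (2, num_classes)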

(2) Internal execution details

(3) Specific modules

Post-processing

loss function

Reprinted from: blog.csdn.net/qq_41804812/article/details/131083819