reference
The blogger's reproduction is very detailed, so I am taking notes on it here.
Summary of Transformer Model Innovation Ideas in Computer Vision_Tom Hardy's Blog-CSDN Blog
ViT
pre-processing
network structure
overall thinking
Object Detection DETR (2020.5) --> Classification ViT (2020.10) --> Segmentation SETR (2020.12) --> Swin Transformer (2021.3) -->
The difficulty of using a transformer in computer vision is that the sequence is too long: if every pixel is a token, self-attention, whose cost grows quadratically with sequence length, becomes impractical. Previous work got around this by running a transformer on features extracted by a CNN, by restricting self-attention to small local windows, or by applying self-attention along the image height and width separately. All of these work acceptably. Self-attention had also completely replaced convolution in some earlier CV work, but a network that applies a standard transformer directly to images had not yet appeared.
The transformer lacks the inductive biases of a CNN (the locality of the convolution kernel and the translation equivariance that comes from sliding the same kernel over the image; these biases give a CNN useful prior information), so it requires a larger amount of training data.
ViT uses image-specific inductive biases only when splitting the image into patches and in the position encoding. The aim is to show that the standard transformer from NLP can handle visual tasks with as few modifications as possible.
specific structure
feature extraction
(224, 224, 3)-->(14, 14, 768)/(196, 768)/(197, 768)-->(197, 768)
1.Patch
A 16*16 convolution with a stride of 16, followed by flattening the height and width dimensions, then concatenation with the Cls Token (1, 768).
The Cls Token goes through feature extraction together with the patch tokens.
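A minimal sketch of the patch step, to make the shape chain concrete. It assumes ViT-Base/16 settings (224x224 input, 768-dim tokens) and PyTorch's channels-first layout; the variable names are illustrative, not taken from the blogger's code.

```python
import torch
import torch.nn as nn

# Assumed ViT-Base/16 settings: 224x224 RGB input, 16x16 patches, 768-dim tokens.
img_size, patch_size, in_chans, embed_dim = 224, 16, 3, 768

# A 16x16 convolution with stride 16 projects each non-overlapping patch to 768 dims.
patch_proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
# Learnable classification token, shared across the batch.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

x = torch.randn(2, in_chans, img_size, img_size)  # (B, 3, 224, 224), channels-first
feat = patch_proj(x)                              # (B, 768, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)          # (B, 196, 768): height and width flattened
# Prepend the Cls Token to every sample: (B, 197, 768)
tokens = torch.cat([cls_token.expand(x.shape[0], -1, -1), tokens], dim=1)
print(feat.shape, tokens.shape)
```

224 / 16 = 14 patches per side, 14 * 14 = 196 patch tokens, and the Cls Token makes 197.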
2.Position Embedding
Add position information to all features so that the network can distinguish different regions.
nn.Parameter() creates a learnable position tensor. In the standard ViT it has shape (197, 768), matching the (196, 768) patch tokens concatenated with the Cls Token above. It is then added element-wise to the tensor obtained in step 1.
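A short sketch of the position-embedding step, assuming the standard ViT layout with one learnable vector per token (Cls Token included). The truncated-normal initialisation is an illustrative choice, and `tokens` is a random stand-in for the output of step 1.

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768  # assumed ViT-Base/16 values

# One learnable position vector per token, including the Cls Token slot.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)  # illustrative initialisation choice

tokens = torch.randn(2, num_patches + 1, embed_dim)  # stand-in for the (B, 197, 768) tokens
out = tokens + pos_embed                             # broadcast over the batch; shape unchanged
print(out.shape)
```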
3.Transformer Encoder
(1) Description
1) The L above is the number of Transformer blocks that are stacked.
2) Inside the attention, dropout is applied to the attention weights after q and k are multiplied (and passed through softmax), and again after the final fully connected projection. Outside the attention and outside the MLP, a further drop is applied to each branch. This outer drop sets the entire input feature map of a sample to 0 (DropPath, also called stochastic depth). In timm-style implementations the drop probability is scheduled per block, rising linearly from 0 at the first block to the configured maximum at the last, so it stays small throughout. Its purpose is to perturb the network slightly during training and prevent overfitting (check the source code for the exact schedule). Dropout is also applied to the tokens before they enter the encoder.
3) In the illustrated example, the sequence length is only 3 and the feature length of each sequence element is only 3. In ViT's Transformer Encoder, the sequence length is 197 and the feature length of each element within one head is 768 // num_heads (64 with 12 heads).
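The drop applied outside the attention and the MLP, which zeroes a whole sample's feature map, can be sketched as below. This is a minimal version written for these notes, not the blogger's source. The linear schedule over depth is an assumption modelled on timm, and `depth` and `max_rate` are hypothetical values.

```python
import torch

def drop_path(x: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
    """Zero the whole feature map of randomly chosen samples (stochastic depth)."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over all remaining dimensions.
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.bernoulli(torch.full(mask_shape, keep_prob, dtype=x.dtype, device=x.device))
    # Rescale kept samples so the expected value matches the input.
    return x / keep_prob * mask

# Assumed schedule: the rate grows linearly from 0 to max_rate over `depth` blocks.
depth, max_rate = 12, 0.1
rates = torch.linspace(0, max_rate, depth).tolist()

x = torch.ones(4, 197, 768)
y = drop_path(x, drop_prob=0.5, training=True)
per_sample = y.flatten(1)  # each row is either all zeros or all 1 / keep_prob
print(rates[0], rates[-1], per_sample[:, 0])
```

At inference time (`training=False`) the function returns its input unchanged.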
(2) Internal execution details
(3) Specific modules
Norm
nn.LayerNorm
Multi-Head Attention
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        # Scale by 1 / sqrt(head_dim) to keep the q·k logits in a stable range
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        # x: (batchsize, 197, 768)
        B, N, C = x.shape
        # The linear layer expands the channel dimension 3x, then it is split per head:
        # (3 (q, k, v), batchsize, 12 (num_heads), 197 (tokens), 64 (768 // 12))
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        # Unpack: each is (batchsize, 12 (num_heads), 197 (tokens), 64 (768 // 12))
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Multiply q by k transposed: (batchsize, 12, 197, 197)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Softmax over the last dimension turns each row into weights that sum to 1
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        # Multiply the attention weights by v, then merge the heads: (batchsize, 197, 768)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # Output linear projection
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
MLP
Two nn.Linear() layers with a GELU activation function in between.
class Mlp(nn.Module):
    """MLP as used in Vision Transformer, MLP-Mixer and related networks."""
    # act_layer defaults to PyTorch's built-in nn.GELU so the class runs without a custom GELU
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        # Fall back to the input width when no explicit sizes are given
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        drop_probs = (drop, drop)
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop_probs[0])
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(drop_probs[1])

    def forward(self, x):
        # Linear -> GELU -> Dropout -> Linear -> Dropout
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x
add
Note the two skip lines in the middle of the block: they are the residual connections, one around the attention and one around the MLP.
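The two residual connections can be sketched as a pre-norm block. To keep the example self-contained, PyTorch's built-in `nn.MultiheadAttention` stands in for the Attention class in these notes, and the dropout layers are left out. The `mlp_ratio` of 4 is the usual ViT-Base value, assumed here.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for the Attention class in the notes (assumption: built-in module).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # first residual connection
        x = x + self.mlp(self.norm2(x))                     # second residual connection
        return x

block = Block().eval()
x = torch.randn(2, 197, 768)
with torch.no_grad():
    y = block(x)
print(y.shape)
```

Because both branches are added back to `x`, the block keeps the (B, 197, 768) shape, which is what allows L of them to be stacked.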
The query of (one) patch is multiplied with the keys of all patches to compute importance scores. These scores then weight the values of all patches, and the weighted sum becomes the new feature of that (one) patch. Replacing (one) with all gives the whole self-attention process.
Summary: each patch becomes a weighted sum of all patches, with the weights computed relative to that patch.
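A toy single-head example of the weighted-sum view, using 4 patches with 8 features each. The q, k, v tensors are random stand-ins for the projected tokens, so the sizes and names are purely illustrative.

```python
import torch

torch.manual_seed(0)
N, D = 4, 8                               # toy sizes: 4 patches, 8 features each
q, k, v = torch.randn(3, N, D).unbind(0)  # hypothetical projected queries, keys, values

scores = (q @ k.T) * D ** -0.5    # (N, N): row i holds patch i's scores against every patch
weights = scores.softmax(dim=-1)  # each row sums to 1
out = weights @ v                 # (N, D): row i is a weighted sum of all value rows

# The same result for patch 0, written out as an explicit weighted sum.
manual = sum(weights[0, j] * v[j] for j in range(N))
print(torch.allclose(out[0], manual, atol=1e-6))
```

The matrix product `weights @ v` does for all patches at once what the explicit sum does for patch 0.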
Classification
(1) Description
(197, 768)-->(, 768)
At this point the Cls Token is extracted. As mentioned earlier, the Cls Token has interacted with all other patches and so carries their information. Passing it through one fully connected layer gives the classification result.
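A minimal sketch of the classification step, assuming the Cls Token sits at index 0 and that a final LayerNorm precedes the head, as is common in ViT implementations. `num_classes = 1000` is a hypothetical value, and `tokens` is a random stand-in for the encoder output.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000  # num_classes is a hypothetical value

norm = nn.LayerNorm(embed_dim)            # final norm, assumed to follow the last block
head = nn.Linear(embed_dim, num_classes)  # the single fully connected layer

tokens = torch.randn(2, 197, embed_dim)   # stand-in for the encoder output (B, 197, 768)
cls_feat = norm(tokens)[:, 0]             # take the Cls Token at index 0: (B, 768)
logits = head(cls_feat)                   # (B, num_classes)
print(cls_feat.shape, logits.shape)
```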
(2) Internal execution details
(3) Specific modules