1. Traditional RNN network
Each step can only be computed after the previous step has finished, so the sequence cannot be processed in parallel.
1.1 Self-attention
For a word such as "it" in a sentence: which word does it refer to, and what is its context? Self-attention answers this by relating every token to every other token.
self-attention calculation
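A minimal sketch of the self-attention calculation (numpy; the sizes and the random projection matrices Wq, Wk, Wv are illustrative stand-ins for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n_tokens, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) pairwise relevance
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # row-wise softmax
    return weights @ V                               # each token = weighted mix of all tokens

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                      # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```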
1.2 Multi-head mechanism
1.3 Stacking multiple layers of self-attention is analogous to stacking convolution layers
1.4 Positional encoding
1.5 Residual connection and normalization
Normalization (makes training faster and more stable): μ = 0 means the mean is 0, σ = 1 means the standard deviation is 1.
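A minimal layer-norm sketch: each token's feature vector is normalized to mean 0 and standard deviation 1:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to mean 0 (mu = 0) and std 1 (sigma = 1)
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
print(y.mean(), y.std())  # ~0.0, ~1.0
```

(In a real layer there are also learnable scale and shift parameters; they are omitted here.)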
1.6 Decoder
Otherwise the same as the encoder.
1.7 Overall architecture
Encoder: a text sequence is passed through N stacked encoder layers, each performing multi-head self-attention (multi-head attention). Because representations may get worse as layers are stacked, residual connections and normalization are added.
Decoder: adds a mask; its attention takes k1...kn and v1...vn from the encoder output and q1...qn from the decoder itself. Otherwise the same as the encoder.
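The decoder's mask can be sketched like this: score positions above the diagonal are set to −∞ before the softmax, so a token cannot attend to later tokens (toy 4-token example with all raw scores equal):

```python
import numpy as np

n = 4
# Mask: True above the diagonal, i.e. token i must not see tokens j > i
mask = np.triu(np.ones((n, n)), k=1).astype(bool)
scores = np.zeros((n, n))          # all raw scores equal, for clarity
scores[mask] = -np.inf             # -inf before softmax -> attention weight 0
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
print(weights.round(2))
```

Row i distributes its attention uniformly over tokens 0..i and gives exactly zero weight to future tokens.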
2 Image-processing architecture: ViT
The image is convolved to extract features, which are converted into, e.g., 300-dimensional vectors. Each vector then passes through a fully connected layer, for example mapping the 300-dimensional vector to a 256-dimensional one (feature reintegration).
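The patch-to-token mapping above is just one matrix multiply; a sketch using the 300→256 sizes from the notes (9 patches and random weights are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((9, 300))   # 9 patch feature vectors, 300-dim each
W = rng.standard_normal((300, 256)) * 0.02
b = np.zeros(256)
tokens = patches @ W + b                  # fully connected layer: 300 -> 256 (feature reintegration)
print(tokens.shape)  # (9, 256)
```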
2.1 Positional encoding after ViT patch splitting
In ViT, method 1: no positional encoding; method 2: two-dimensional positional encoding; method 3: one-dimensional positional encoding in patch order.
Token 0 in the positional scheme (the class token) is not used in all tasks: it is generally used for classification and is dropped for segmentation and detection.
The encoder converts the image into a feature form that can be recognized by the computer.
For the classification task, information from tokens 1-9 is aggregated into token 0 (the class token 0*), and the 0* feature vector drives the classification head. Tokens 0-9 are the 10 token encodings completed in step 2. Figure 1, Figure 2, Figure 3.
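A sketch of the class-token mechanism (9 patch tokens and a 10-class head are assumed here for illustration; in practice the class token is a learned parameter and aggregation happens through the encoder's attention):

```python
import numpy as np

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((9, 256))     # tokens 1..9 from the image patches
cls_token = np.zeros((1, 256))                   # token 0*, learnable in practice
z0 = np.concatenate([cls_token, patch_tokens])   # 10 tokens total
# ... after the encoder, only token 0 feeds the classification head
W_head = rng.standard_normal((256, 10)) * 0.02
logits = z0[0] @ W_head
print(z0.shape, logits.shape)  # (10, 256) (10,)
```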
2.2 ViT computation formulas
E is the patch embedding: each flattened patch (image block) of size P×P×C is mapped by a fully connected layer to dimension D, e.g., 256 mapped to 512.
The last dimension D of the positional encoding E_pos must match E. N+1 means one extra token beyond the N patch tokens (N is the number of patches from splitting the image): the extra 0* is the classification token.
The first E in the formula is this mapping to dimension D.
z_0 = [x_class; x_p^1·E; …; x_p^N·E] + E_pos: positional-encoding information is added to every token.
z'_l = MSA(LN(z_{l-1})) + z_{l-1}: MSA is the multi-head attention mechanism, LN is normalization, and adding z_{l-1} is the residual connection.
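A minimal numpy sketch of the per-layer update (a single attention head stands in for MSA; all sizes are illustrative):

```python
import numpy as np

def ln(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def msa(x):  # single-head stand-in for multi-head self-attention
    s = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # two-layer MLP with ReLU

rng = np.random.default_rng(0)
z = rng.standard_normal((10, 64))       # z_0: N+1 tokens with position info added
W1 = rng.standard_normal((64, 128)) * 0.05
W2 = rng.standard_normal((128, 64)) * 0.05
z_mid = msa(ln(z)) + z                  # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_out = mlp(ln(z_mid), W1, W2) + z_mid  # z_l  = MLP(LN(z'_l)) + z'_l
print(z_out.shape)  # (10, 64)
```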
3 TNT
↑ The inner transformer splits each image patch again into multiple sub-patches and runs attention over them; the outer transformer works the same way as a normal one over the original patches.
↑ TNT: recombination and construction of the inner sequence.
ViT summary:
For the position_embeddings/patch embeddings of the image, only one convolution is required.
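The "one convolution" point can be illustrated: a convolution whose kernel size equals its stride (= the patch size) is exactly a patch split followed by one linear map. Sizes here are illustrative:

```python
import numpy as np

P, C, D = 4, 3, 96                      # patch size, input channels, embedding dim (assumed)
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, C))
kernel = rng.standard_normal((P * P * C, D)) * 0.02   # one conv kernel, flattened
# split into non-overlapping PxP patches, flatten each, apply the single linear map
patches = img.reshape(8, P, 8, P, C).transpose(0, 2, 1, 3, 4).reshape(64, -1)
tokens = patches @ kernel               # (64, 96): one embedding token per patch
print(tokens.shape)
```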
4. Swin Transformer
A traditional transformer treats the image as patches, each patch being one element of the sequence. The finer the patches, the longer the sequence and the more tokens are needed; since every token must be related to every other token, the computation grows quickly. Moreover, if 400 tokens enter the first layer, there are still 400 tokens at the next layer: in a traditional transformer the output has the same number and dimension of tokens as the input.
The first layer of a Swin Transformer has 400 tokens, the second layer merges them to 200, and so on.
Steps:
4.1 Initial image input
4.2 Convert the sequence in the image feature map into multiple windows, i.e., window-based self-attention
Reshape operation (56×56 → 64×7×7): 64 windows, each 7×7 in size.
4.3 Compute the self-attention scores within each window to obtain the weight matrix
Each window consists of 7×7 = 49 tokens. Each token is handled by a 3-head attention mechanism, and each head handles a 32-dimensional vector.
Meaning of the attention result (64, 3, 49, 49): 64 windows, 3 different heads (weight sets), and a 49×49 self-attention score matrix per window (each of the 49 tokens scores the other 48 tokens plus itself = 49 scores).
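The shapes in 4.2-4.3 can be traced with a quick sketch (the raw features serve as both queries and keys here, without learned projections, purely to check shapes):

```python
import numpy as np

H = W = 56; win = 7; C = 96; heads = 3; head_dim = C // heads  # head_dim = 32
x = np.zeros((H, W, C))
# partition 56x56 into 64 non-overlapping 7x7 windows
wins = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
wins = wins.reshape(-1, win * win, C)                            # (64, 49, 96)
q = wins.reshape(64, 49, heads, head_dim).transpose(0, 2, 1, 3)  # (64, 3, 49, 32)
scores = q @ q.transpose(0, 1, 3, 2) / np.sqrt(head_dim)         # (64, 3, 49, 49)
print(wins.shape, scores.shape)
```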
4.4 Window reconstruction: restore the windows to the input feature layout
The new features (64, 49, 96): 64 windows, each with 7×7 = 49 points, each point a 96-dimensional vector. After attention, each point's 96-dimensional vector also expresses its relationship to the other tokens in its window.
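The window reconstruction of 4.4 is the exact inverse of the 4.2 partition; a quick roundtrip check:

```python
import numpy as np

win = 7; C = 96
x = np.arange(56 * 56 * C, dtype=float).reshape(56, 56, C)
# forward partition (as in 4.2): (56, 56, C) -> (64, 49, C)
wins = x.reshape(8, win, 8, win, C).transpose(0, 2, 1, 3, 4).reshape(64, win * win, C)
# window reconstruction: undo the partition back to the original feature map
restored = wins.reshape(8, 8, win, win, C).transpose(0, 2, 1, 3, 4).reshape(56, 56, C)
print(np.array_equal(restored, x))  # True
```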
4.5 After computing the intra-window features, shift the windows and compute attention features again
4.6 The window-shift problem and its solution
Originally there were 4 large blocks (A, B, C and the blank area); after the shift the map is divided into nine regions numbered 0-8. The calculation, however, still uses four windows: region 4 is one window, regions 5 and 3 form one, regions 1 and 7 form one, and regions 0, 2, 6 and 8 form one, giving four windows in total.
Self-attention is then computed inside each of the four windows, and a mask suppresses the meaningless positions (pairs of tokens from different regions) so they do not affect the calculation.
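A toy sketch of the shifted-window mask (an 8×8 map with 4×4 windows and shift 2 instead of the real sizes): positions are labeled by their pre-shift region, and score pairs whose tokens come from different regions are filled with a large negative number so softmax gives them ~0 weight:

```python
import numpy as np

H = W = 8; win = 4; s = win // 2           # toy: 2x2 windows of 4x4, shift 2
region = np.zeros((H, W), dtype=int)
region[-s:, :] += 1                        # rows that wrap around after the cyclic shift
region[:, -s:] += 2                        # columns that wrap around
shifted = np.roll(region, (-s, -s), (0, 1))  # cyclic shift, as applied to the feature map
# partition the shifted region labels into windows and build the attention mask
wins = shifted.reshape(H // win, win, W // win, win).transpose(0, 2, 1, 3).reshape(-1, win * win)
mask = wins[:, :, None] != wins[:, None, :]  # True where two tokens came from different regions
scores = np.zeros(mask.shape)
scores[mask] = -100.0   # large negative before softmax -> ~0 weight, so it "does not affect the calculation"
print(mask.shape)       # (4, 16, 16)
```

The top-left window contains only one region (mask all False), while the bottom-right window mixes four regions and is heavily masked.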
The inputs of W-MSA and SW-MSA are the same, both (3, 64, 3, 49, 32) as in 4.3, with the same meaning. SW-MSA just shifts the windows and introduces masking; the rest is identical to W-MSA.
4.7 Downsampling
Take image blocks at alternating intervals (patch merging).
The first stage has 64 windows, the second 16, the third 4, and the fourth 1. The window size stays 7 because the feature-map sizes remain divisible by 7. This finally yields the feature map.
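Patch merging (the interval sampling above) can be sketched on a toy 8×8×96 map: pixels at alternating positions are stacked on the channel axis, halving the spatial size and quadrupling the channels:

```python
import numpy as np

H = W = 8; C = 96
x = np.arange(H * W * C, dtype=float).reshape(H, W, C)
# take points at alternating intervals and concatenate them along the channel axis
merged = np.concatenate(
    [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1)
print(merged.shape)  # (4, 4, 384): spatial size halved, channels x4
```

(In Swin a linear layer then maps the 4C channels down to 2C.)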
4.8 Code summary
Figure 5
3136 (= 56×56) feature points, each composed of a 96-dimensional vector (Figure 6).
Figure 7