Transformer source code

1. Traditional RNN network

Each step can only be computed after the previous step has finished, so an RNN cannot process the sequence in parallel.

1.1 Self-attention

What does "it" refer to in a sentence, and what is its context? Self-attention answers this by letting every token attend to the other tokens in the sequence, so each token's representation reflects its context.
Self-attention calculation: each token is projected into query, key and value vectors; the queries are compared against the keys, the scores are normalized with a softmax, and the values are summed with those weights.
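As a concrete illustration of the calculation described above, here is a minimal single-head self-attention sketch in PyTorch; the dimensions and random projection matrices are illustrative assumptions, not values from this post.

```python
# Minimal sketch of single-head scaled dot-product self-attention (PyTorch).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q                                      # queries
    k = x @ w_k                                      # keys
    v = x @ w_v                                      # values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (seq_len, seq_len) similarities
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                               # context-aware token features

# toy usage: a "sentence" of 5 tokens, model dim 8, head dim 8
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # (5, 8)
```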

1.2 Multi-head mechanism

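The multi-head mechanism runs several attention heads in parallel on different learned projections and concatenates their outputs. Below is a hedged sketch using PyTorch's built-in nn.MultiheadAttention; the sizes (embed_dim=256, num_heads=8) are illustrative assumptions.

```python
# Hedged sketch of multi-head self-attention via PyTorch's built-in module.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 256)            # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)       # self-attention: query = key = value = x
print(out.shape)                       # torch.Size([2, 10, 256])
print(attn_weights.shape)              # torch.Size([2, 10, 10]), averaged over heads
```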

1.3 Stacking multiple self-attention layers is analogous to stacking convolutions


1.4 Positional encoding

Self-attention by itself is order-independent, so a positional encoding is added to each token embedding to inject its position in the sequence.
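As one common concrete choice (an assumption here, since the post does not spell out the formula), the sinusoidal positional encoding from the original Transformer paper can be sketched as follows.

```python
# Minimal sketch of sinusoidal positional encoding; seq_len and d_model are illustrative.
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1) positions
    i = torch.arange(0, d_model, 2).float()            # even feature indices
    div = torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)                 # even dims: sine
    pe[:, 1::2] = torch.cos(pos / div)                 # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=256)
x = torch.randn(50, 256)
x = x + pe    # positional information is simply added to the token embeddings
```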

1.5 Residual connection and normalization

Normalization makes training faster and more stable: layer normalization rescales the features so that the mean is μ = 0 and the standard deviation is σ = 1.
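A small sketch of this "Add & Norm" step, assuming PyTorch; the sub-layer here is a stand-in Linear module and the sizes are illustrative.

```python
# Residual connection followed by layer normalization (post-norm style).
import torch
import torch.nn as nn

d_model = 256
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or the feed-forward block

x = torch.randn(2, 10, d_model)
out = ln(x + sublayer(x))                # add the input back, then normalize
print(out.mean(-1).abs().max(), out.std(-1).mean())   # ~0 mean, ~1 std per token
```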

1.6 Decoder

The decoder adds a mask to its self-attention so that each position can only attend to earlier positions; otherwise its structure is the same as the encoder.

1.7 Overall architecture

Encoder: the input text sequence passes through N stacked encoder blocks, each performing multi-head self-attention. Because a deep stack may get worse as it learns, residual connections and normalization are added.
Decoder: a mask is added; in the cross-attention the keys k1...kn and values v1...vn come from the encoder, while the queries q1...qn come from the decoder. Everything else is the same as the encoder.
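A hedged sketch of the decoder's causal mask: positions after the current one receive -inf before the softmax so their weights become zero. The sequence length is an illustrative assumption.

```python
# Causal (look-ahead) mask: position i may only attend to positions <= i.
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = torch.randn(seq_len, seq_len)              # raw attention scores
scores = scores.masked_fill(mask, float('-inf'))    # block future positions
weights = torch.softmax(scores, dim=-1)             # each row still sums to 1
print(weights[0])   # for the first token, only the first entry is non-zero
```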

2 Image-processing architecture: VIT

The image is convolved to extract features, and each patch is turned into a vector (e.g. 300-dimensional). The vector then passes through a fully connected layer, for example mapping the 300-dimensional vector to 256 dimensions (feature re-integration).
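A minimal sketch of this patch-extraction and feature re-integration step, assuming PyTorch; the 224x224 image, 16x16 patches and the 300 -> 256 mapping mirror the numbers above, while the conv-based patch extraction is a common implementation choice rather than something the post specifies.

```python
# A convolution with kernel size = stride = patch size produces one vector per patch;
# a linear layer then re-integrates the features into the model dimension.
import torch
import torch.nn as nn

patch, in_dim, out_dim = 16, 300, 256
to_patches = nn.Conv2d(3, in_dim, kernel_size=patch, stride=patch)
reproject = nn.Linear(in_dim, out_dim)                  # feature re-integration

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)     # (1, 196, 300): 14x14 patches
tokens = reproject(tokens)                              # (1, 196, 256)
print(tokens.shape)
```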

2.1 Positional encoding after VIT splits the image into patches

Positional-encoding options in VIT: method 1, no positional encoding; method 2, 2-D positional encoding; method 3, 1-D positional encoding following the patch order.
Position 0 in the positional encoding is not used in every task: it is generally used for classification and is not used for segmentation or detection.
The encoder converts the image into a feature form that the computer can recognize.
For a classification task, the information from tokens 1-9 is aggregated into token 0, and the 0* feature vector is then used to drive the classification. Tokens 0-9 are the 10 token encodings produced in step 2.
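A hedged sketch of the 0* classification token described above: it is prepended to the N patch tokens, a positional embedding of length N+1 is added, and only token 0 feeds the classification head. The module names and sizes are illustrative assumptions.

```python
# Class token (position 0) prepended to the patch tokens, plus positional embedding.
import torch
import torch.nn as nn

N, D, num_classes = 9, 256, 10            # 9 patch tokens, matching the 0-9 example above
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
head = nn.Linear(D, num_classes)

patches = torch.randn(1, N, D)                              # patch tokens 1..9
x = torch.cat([cls_token.expand(1, -1, -1), patches], 1)    # token 0 goes first
x = x + pos_embed                                           # add positional encoding
# ... x would pass through the transformer encoder here ...
logits = head(x[:, 0])                                      # classify from token 0 only
```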


2.2 VIT calculation formulas

E is the patch-embedding matrix: a flattened patch (image segmentation block) of size P x P x C, i.e. P^2·C values, is mapped by a fully connected layer to dimension D (for example 256 mapped to 512), so E has shape (P^2·C) x D and each patch becomes a D-dimensional vector after the mapping.
The last dimension D of the positional encoding E_pos must match E; its length is N + 1 (N is the number of patches), the extra one being the 0* classification token.
z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos, i.e. positional-encoding information is added to every token.
z'_l = MSA(LN(z_{l-1})) + z_{l-1}: MSA is the multi-head attention mechanism, LN is normalization, and adding z_{l-1} is the residual connection.
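A minimal sketch of one encoder block following these formulas (z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'), assuming PyTorch; the 512 dimension and 8 heads are illustrative.

```python
# One ViT-style encoder block: pre-norm, multi-head attention, MLP, residuals.
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h)[0] + z     # MSA(LN(z)) + z  (residual connection)
        z = self.mlp(self.ln2(z)) + z    # MLP(LN(z')) + z'
        return z

z0 = torch.randn(1, 10, 512)             # N+1 = 10 tokens, as in the step above
z1 = ViTBlock()(z0)                      # same shape out: (1, 10, 512)
```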

3 TNT (Transformer in Transformer)

The inner transformer splits each image patch again into several smaller patches and runs attention over them; the outer transformer does the same thing as a normal transformer on the patch-level tokens.
The inner sequence is then recombined and used to rebuild the outer patch representation.
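A hedged sketch of the splitting described above, using tensor unfolds; the 224x224 image, 16x16 patches and 4x4 sub-patches are illustrative assumptions, not values from the post.

```python
# Each 16x16 patch is split again into 4x4 sub-patches for the inner transformer,
# while the outer transformer sees the patch-level tokens as usual.
import torch

img = torch.randn(1, 3, 224, 224)
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)      # (1, 3, 14, 14, 16, 16)
patches = patches.reshape(1, 3, 196, 16, 16)           # 196 outer patches
sub = patches.unfold(3, 4, 4).unfold(4, 4, 4)          # split each patch into 4x4 blocks
sub = sub.reshape(1, 3, 196, 16, 4, 4)                 # 16 sub-patches per patch
print(sub.shape)   # inner tokens: 196 patches x 16 sub-patches each
```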

VIT summary:

For the position_embeddings on the image, only one convolution is required.

4. Swin Transformer

A traditional transformer treats the image as patches, each patch being one element of the sequence. The finer the patches are cut, the longer the sequence becomes and the more tokens are needed; since every token attends to every other token, the amount of computation is large. If 400 tokens enter the first layer, there are still 400 tokens in the next layer: in a traditional transformer the input and output have the same number of tokens and the same vector dimension.
In the Swin Transformer, the first layer has 400 tokens, the second layer merges them down to 200, and so on.
The steps are as follows:

4.1 Initial input of the image
4.2 Convert the sequence in the image feature map into multiple windows (window-based self-attention)
A reshape operation (56 x 56 -> 64 x 7 x 7) produces 64 windows, each 7 x 7 in size.
4.3 Compute the self-attention scores within each window to obtain the weight matrix
Each window consists of 7 x 7 = 49 tokens. Each token is handled by a 3-head attention mechanism, and each head processes a 32-dimensional vector.
Meaning of the attention result: 64 is the number of windows, 3 is the number of heads (three different sets of weights), and 49 x 49 is each window's self-attention score matrix (each of the 49 tokens gets scores against the other 48 tokens plus itself, i.e. 48 + 1 = 49 scores).
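A minimal sketch of the window partition and the resulting attention-score shapes, matching the numbers in the text (56x56 map, 96 channels, 7x7 windows, 3 heads of 32 dims). For brevity the learned q/k/v projections are omitted and the raw features play both roles, so this only illustrates the shapes.

```python
# Window partition: (1, 56, 56, 96) -> 64 windows of 49 tokens, then per-window attention scores.
import torch

B, H, W, C = 1, 56, 56, 96
win = 7
x = torch.randn(B, H, W, C)

# partition: (1, 56, 56, 96) -> (64, 49, 96), i.e. 64 windows of 7x7 tokens
x = x.view(B, H // win, win, W // win, win, C)
windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)        # (64, 49, 96)

# 3 heads, each working on a 32-dimensional slice of the 96 channels
num_heads, head_dim = 3, C // 3
heads = windows.reshape(64, 49, num_heads, head_dim).permute(0, 2, 1, 3)  # (64, 3, 49, 32)
attn = (heads @ heads.transpose(-2, -1)) / head_dim ** 0.5                # (64, 3, 49, 49)
attn = attn.softmax(dim=-1)   # per-window attention scores: 49 tokens vs 49 tokens
print(attn.shape)             # torch.Size([64, 3, 49, 49])
```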
4.4 Window reconstruction: restore the windows to the feature map shape of the input
The new features (64, 49, 96) represent 64 windows; each window has 7 x 7 = 49 points, and each point is a 96-dimensional vector. At this stage the 96-dimensional vector also encodes the point's relationship to the other tokens in its window.
Each window point corresponds to a 96-dimensional vector, and these 96 dimensions now carry the feature meaning produced by attention.
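A hedged sketch of the reverse operation (window reconstruction), folding the (64, 49, 96) window features back into the (1, 56, 56, 96) map.

```python
# Window reverse: the inverse of the partition shown in 4.2/4.3.
import torch

win, C = 7, 96
windows = torch.randn(64, 49, C)                       # output of window attention
x = windows.reshape(1, 8, 8, win, win, C)              # 8x8 grid of 7x7 windows
x = x.permute(0, 1, 3, 2, 4, 5).reshape(1, 56, 56, C)  # back to the feature map
print(x.shape)   # torch.Size([1, 56, 56, 96])
```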
4.5 After computing the features inside each window, shift the windows and compute attention features again
4.6 The window-shift problem and its solution
Originally there were 4 large blocks (A, B, C and the blank area); after the shift the map is divided into nine regions numbered 0 to 8. The computation is still organized as four windows: region 4 forms one window, regions 5 and 3 form one, regions 1 and 7 form one, and regions 0, 2, 6 and 8 form one, which again gives four windows.
Self-attention is then computed within each of the four blocks, and a mask zeroes out the meaningless positions so they do not affect the calculation, as in the sketch below.
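A hedged sketch of the shift and the mask, assuming PyTorch: torch.roll moves the feature map by half a window so new windows straddle the old boundaries, and a toy mask blocks attention between tokens from different regions (the region ids here are made up for illustration).

```python
# Shifted windows via torch.roll, plus a toy attention mask between unrelated regions.
import torch

x = torch.randn(1, 56, 56, 96)
shift = 7 // 2                                        # shift by half the window size
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# toy mask for one window: tokens from different regions must not attend to each other
region_id = torch.tensor([0, 0, 1, 1])                # 4 tokens, two regions
mask = (region_id[:, None] != region_id[None, :])     # True where regions differ
scores = torch.randn(4, 4).masked_fill(mask, float('-inf'))
weights = scores.softmax(dim=-1)                      # cross-region weights become 0
print(weights)
```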
W-MSA and SW-MSA take the same input, the (3, 64, 3, 49, 32) tensor from step 4.3, with the same meaning. SW-MSA simply shifts the windows and introduces masking; everything else is identical to W-MSA.
4.7 Downsampling
Image patches are taken at intervals.
The first stage has 64 windows, the second 16, the third 4, and the fourth 1. A window size of 7 is used because the feature-map sizes divide evenly by it. The final result is the feature map.
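A minimal sketch of this interval sampling (patch merging), assuming PyTorch: points are taken every second row and column, concatenated along the channel axis, and reduced by a linear layer. The 96 -> 192 channel change follows the usual Swin patch-merging design and is an assumption here.

```python
# Downsampling by taking points at intervals of 2 and merging them channel-wise.
import torch
import torch.nn as nn

x = torch.randn(1, 56, 56, 96)
x0 = x[:, 0::2, 0::2, :]          # four interleaved grids, each half the resolution
x1 = x[:, 1::2, 0::2, :]
x2 = x[:, 0::2, 1::2, :]
x3 = x[:, 1::2, 1::2, :]
merged = torch.cat([x0, x1, x2, x3], dim=-1)   # (1, 28, 28, 384)
reduce = nn.Linear(4 * 96, 2 * 96)
out = reduce(merged)                           # (1, 28, 28, 192): half resolution, doubled channels
print(out.shape)
```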
4.8 Code summary
3136 is equivalent to 3136 feature points (56 x 56), each composed of a 96-dimensional vector.

Origin blog.csdn.net/weixin_43917045/article/details/131964314