[Self-attention neural networks] The Swin Transformer

1. Overview

Swin Transformer is a hierarchical Vision Transformer that computes self-attention within shifted (moving) windows.

        In the vision domain, a Transformer needs to solve the following two problems:

                ① Scale: objects with the same semantics can appear at very different scales (sizes) in an image.

                ② Resolution: if every pixel is treated as a token, the sequence becomes far too long.

        Because self-attention is computed only within each window, Swin Transformer greatly reduces the amount of computation; at the same time, shifting the windows lets information flow between windows, and together with patch merging this produces hierarchical feature maps, which makes the model much better suited to vision tasks.
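        For reference, the paper quantifies this saving. For an h×w feature map with C channels and an M×M window (M = 7 by default), the complexities of global and window-based multi-head self-attention are:

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2hwC$$

        The global-attention term grows quadratically with the number of patches hw, while the window-attention term grows only linearly once M is fixed.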

2. Important operations

        1. Shifted windows

                For vision tasks, multi-scale features are extremely important. In object detection, for example, the common approach is to extract features at several scales and fuse them (FPN), which captures objects of different sizes well. Semantic segmentation likewise needs multi-scale features (skip connections, dilated convolutions, etc.).

                patch: the smallest computing unit

                window: contains 7×7 = 49 patches by default in the paper

                Shift operation: move the window partition toward the lower right by ⌊window size / 2⌋ patches (2 in the paper's illustration; 3 for the default 7×7 window), and re-partition the feature map along the extension lines of the original windows. This lets the self-attention inside a window see information from patches that previously belonged to neighboring windows (see the sketch at the end of this subsection).

                Attention mask:

                         After the shift, the resulting windows have different sizes (and there are more of them), so they cannot be batched together directly. The usual fix is to pad the smaller windows up to the regular window size, but this increases the amount of computation.

                         Instead, a cyclic shift is performed after the window shift: the feature map is rolled so that the pieces recombine into a regular grid of equally sized windows.

                After the cyclic shift, the upper-left windows still contain patches that were originally adjacent, but the windows along the wrapped-around border mix patches that came from far-apart parts of the image. Self-attention is therefore computed with a specific mask that blocks the attention entries between patches that do not belong to the same original region (shielding the wrong combinations), as shown in the figure below.

                The mask template is as follows: 
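                Below is a minimal PyTorch sketch of the cyclic shift and the mask construction, following the idea of the official implementation; the helper `window_partition` and the variable names here are illustrative, not the paper's exact code.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping (window_size x window_size) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (B * num_windows, window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

H, W = 56, 56                      # feature-map size after patch embedding (224 / 4)
window_size, shift_size = 7, 3     # shift = window_size // 2

# 1. Cyclic shift: roll the map up-left so the shifted windows recombine
#    into a regular grid of equally sized windows.
x = torch.randn(1, H, W, 96)       # dummy feature map
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# 2. Build the attention mask: label the 9 regions created by the shift,
#    then forbid attention between patches whose region labels differ.
img_mask = torch.zeros((1, H, W, 1))
slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in slices:
    for w in slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

mask_windows = window_partition(img_mask, window_size).view(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)    # (num_windows, 49, 49)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))     # block mixed pairs
# attn_mask is added to the attention logits before softmax inside each window.
```

                After window attention, a reverse roll (with shifts of `+shift_size`) restores the original layout.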

        2. Patch merging

                Patch merging is used to generate multi-scale features, similar to pooling in a CNN. The specific method is to merge adjacent small patches into one larger patch (a concrete sketch follows).
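                A minimal PyTorch sketch of the merging step (the class name `PatchMerging` and the argument `dim` are illustrative); the linear layer over the channel dimension plays the role of the 1×1 convolution described in Section 3.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches: spatial size /2, channels x2."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)   # 4C -> 2C

    def forward(self, x):
        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]    # sample every other point: four interleaved grids
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

# Example: 56x56x96 -> 28x28x192
y = PatchMerging(96)(torch.randn(1, 56, 56, 96))
print(y.shape)   # torch.Size([1, 28, 28, 192])
```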

3. Model Architecture

        ① Patch Partition: split the image into patches (4×4 in the paper; a standard 224×224×3 image becomes 56×56×48 after this step).

        ② Linear Embedding: project each patch vector to a preset dimension; the paper sets the hyperparameter C = 96 (after this step the size is 56×56×96; the 56×56 spatial grid is flattened into a sequence of length 3136, and the 96 becomes the dimension of each token). Since 3136 tokens is too long a sequence for a plain Transformer, window-based self-attention is used, with each window containing only 7×7 = 49 patches by default (a sketch of steps ① and ② appears after this list).

        ③ Swin Transformer Block: blocks come in pairs, and each pair performs two multi-head self-attention operations: ① window-based MSA (W-MSA); ② shifted-window MSA (SW-MSA). This alternation enables communication between windows.

        ④ Patch Merging: a plain Transformer does not change the scale between input and output. To obtain multi-scale features like a CNN, Patch Merging is used: adjacent small patches are merged into one larger patch (in the paper this is a 2× downsampling, done by sampling every other point so that a 2×2 neighborhood splits into four interleaved grids);

                        After merging, however, the number of channels becomes 4C, whereas a CNN's 2× spatial downsampling normally only doubles the channels. So after merging, a 1×1 convolution (equivalently a linear layer) reduces the channels from 4C to 2C. (spatial size /2, number of channels ×2)
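        A minimal PyTorch sketch of steps ① and ②: a strided convolution is a common way to implement Patch Partition and Linear Embedding in one shot; the class name `PatchEmbed` is illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch Partition + Linear Embedding: 4x4 patches projected to C = 96 dimensions."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A conv with kernel = stride = patch_size cuts the image into
        # non-overlapping 4x4 patches (4*4*3 = 48 values each) and projects
        # each patch to embed_dim in a single step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224) -> (B, 96, 56, 56)
        x = self.proj(x)
        # Flatten the 56x56 grid into a sequence of 3136 tokens of dimension 96.
        return x.flatten(2).transpose(1, 2)        # (B, 3136, 96)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 3136, 96])
```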

Origin: blog.csdn.net/weixin_37878740/article/details/129299358