Basic paper study (1) - ViT

The Vision Transformer (ViT) architecture was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", presented as a conference paper at ICLR 2021. It was developed and published by Alexey Dosovitskiy, Neil Houlsby, and ten other authors on the Google Research Brain Team. Fine-tuning code and pre-trained ViT models are available on GitHub from the Google Research team. The ViT models are pre-trained on the ImageNet and ImageNet-21k datasets.

In the following, we highlight some of the most important Vision Transformers developed over the years. They are based on the Transformer architecture, originally proposed in 2017 for Natural Language Processing (NLP).

| Date | Model | Brief description | Vision Transformer? |
| --- | --- | --- | --- |
| 2017 | Transformer | Attention-only model that performs well on NLP tasks. | No |
| 2018 | BERT | Pre-trained Transformer models begin to dominate NLP. | No |
| 2020 | DETR | A simple yet effective high-level vision framework that treats object detection as a direct set prediction problem. | Yes |
| 2020 | GPT-3 | A huge Transformer with 175B parameters, an important step towards general NLP models. | No |
| 2020 | iGPT | Transformer models originally developed for NLP can also be used for image pre-training. | Yes |
| 2020 | ViT | A pure Transformer architecture that is effective for visual recognition. | Yes |
| 2020 | IPT / SETR / CLIP | Transformers applied to low-level vision, segmentation, and multimodal tasks, respectively. | Yes |
| 2021 onward | ViT variants | Numerous ViT variants, including DeiT, PVT, TNT, Swin, and CSWin (2022). | Yes |

Difference Between CNN and ViT (ViT vs CNN)

Compared with Convolutional Neural Networks (CNNs), the Vision Transformer (ViT) achieves remarkable results while requiring substantially fewer computational resources for pre-training. At the same time, ViT has a weaker inductive bias than CNNs, which leads to an increased reliance on model regularization or data augmentation (AugReg) when training on smaller datasets. ViT is a vision model built on the Transformer architecture originally designed for text-based tasks. It represents an input image as a sequence of image patches, much like the sequence of word embeddings used when applying a Transformer to text, and directly predicts the image's class label. When trained on sufficient data, ViT exhibits remarkable performance, matching or beating comparable state-of-the-art CNNs while using roughly four times fewer computational resources.

A CNN operates directly on arrays of pixels, whereas ViT splits the input image into visual tokens: it divides the image into fixed-size patches, embeds each patch linearly, and adds positional embeddings before feeding the sequence into the transformer encoder. In addition, ViT is reported to outperform comparable CNNs by nearly a factor of four in computational efficiency and accuracy. The self-attention layers in ViT allow information to be integrated globally across the entire image, and the model learns from the training data to encode the relative positions of image patches, reconstructing the structure of the image.

1. Vision Transformer (ViT) architecture

Several vision transformer models have been proposed in the literature. The overall Vision Transformer architecture consists of the following steps, sketched in code right after the list:

  • Split an image into patches (fixed sizes)
  • Flatten the image patches
  • Create lower-dimensional linear embeddings from these flattened image patches
  • Add positional embeddings
  • Feed the sequence as an input to a state-of-the-art transformer encoder
  • Pre-train the ViT model with image labels (fully supervised on a huge dataset)
  • Fine-tune on the downstream dataset for image classification
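
A minimal sketch of these steps purely in terms of tensor shapes (illustrative only, not the author's code; it assumes a 256x256 RGB input, 32x32 patches, and an embedding dimension of 1024):

import torch
from torch import nn
from einops import rearrange

img = torch.randn(1, 3, 256, 256)                                    # (b, c, H, W)
# 1-2. split into 32x32 patches and flatten each patch
patches = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
                    p1=32, p2=32)                                     # (1, 64, 3072)
# 3. lower-dimensional linear embedding
tokens = nn.Linear(32 * 32 * 3, 1024)(patches)                        # (1, 64, 1024)
# 4. prepend a class token and add (here random) positional embeddings
cls = torch.zeros(1, 1, 1024)                                         # learnable in the real model
tokens = torch.cat([cls, tokens], dim=1) + torch.randn(1, 65, 1024)   # (1, 65, 1024)
# 5-7. this sequence is fed to the transformer encoder; the class-token
#      output then goes to the classification head (full code below)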

(figure: Vision Transformer architecture overview)
Vision Transformer (ViT) is an architecture that uses self-attention to process images. It consists of a patch embedding layer, a series of Transformer encoder blocks, and an MLP classification head; each encoder block contains two sublayers, a multi-head self-attention layer and a feed-forward layer.

The patch embedding layer divides the image into fixed-size patches and maps each patch, through a linear projection, to a high-dimensional token vector. These token embeddings are then fed into the Transformer blocks for further processing.
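
As a side note, the same patch embedding can equivalently be written as a strided convolution whose kernel size equals the patch size; this is a common implementation choice, shown here as a sketch (patch size 32 and dim 1024 follow the running example below, not the paper):

import torch
from torch import nn

# equivalent patch embedding via a strided convolution (sketch, not the author's code)
patch_embed = nn.Conv2d(3, 1024, kernel_size=32, stride=32)
img = torch.randn(1, 3, 256, 256)
tokens = patch_embed(img)                   # (1, 1024, 8, 8): one 1024-d vector per patch
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 64, 1024): a sequence of 64 patch tokens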

The multi-head self-attention layer computes attention weights for each token based on its relationship to all other tokens in the sequence, and multi-head attention extends this mechanism by letting the model attend to different parts of the input sequence simultaneously. The feed-forward layer then applies a non-linear transformation to the output of the self-attention layer.

The final output of the ViT is the class prediction, obtained by passing the CLS token from the output of the last Transformer block through the MLP classification head, which usually consists of a single fully connected layer.

The blocks themselves never change; the only thing that varies between model sizes is their number and width. To show that larger ViT variants benefit from more training data, three models are proposed: ViT-Base, ViT-Large, and ViT-Huge. They differ in Layers, the number of encoder blocks stacked; Heads, the number of heads in multi-head attention; MLP size, the hidden dimension of the multi-layer perceptron (in practice just a pair of linear layers); and Hidden size D, the token embedding dimension, which stays fixed through every layer. Why keep it fixed? Because the residual (skip) connections require each block's input and output to have the same dimension.
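
For reference, the three variants are configured as follows (figures taken from Table 1 of the original ViT paper, not from this post):

| Model | Layers | Hidden size D | MLP size | Heads | Params |
| --- | --- | --- | --- | --- | --- |
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |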

ViT is pre-trained on a large dataset and then fine-tuned on smaller downstream datasets. The only modification is to drop the pre-training prediction head (MLP head) and attach a new D×K linear layer, where K is the number of classes in the downstream dataset.
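
A minimal sketch of that modification, assuming the ViT class defined later in this post with D = 1024 and a hypothetical downstream dataset with K = 10 classes:

from torch import nn

# `model` stands in for a pre-trained ViT instance (class defined further below)
model = ViT(image_size=256, patch_size=32, num_classes=1000,
            dim=1024, depth=6, heads=16, mlp_dim=2048)
K = 10                               # number of classes in the small downstream dataset
model.mlp_head = nn.Linear(1024, K)  # drop the old prediction head, attach a new D x K linear layer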

1 Image patching and dimensionality reduction

Because the transformer encoder expects a sequence as input, the simplest approach is to divide the image into patches and flatten them into a sequence. Assume the input image is 256x256 and is split into 64 patches, each 32x32 pixels:

x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)

This uses Einstein-style notation, implemented with the einops library, which provides a number of such operators; rearrange is one of them and is very efficient. Here p is the patch size. If the input is [b, 3, 256, 256], rearrange first interprets it as (b, 3, 8x32, 8x32) and then produces (b, 8x8, 32x32x3), i.e. (b, 64, 3072): each image is divided into 64 patches, and each patch is flattened to a vector of length 32x32x3 = 3072. In other words, the image becomes a sequence of length 64 whose elements are 3072-dimensional encodings.
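
A quick check of this shape bookkeeping (a small usage example with the same assumptions, here with b = 2):

import torch
from einops import rearrange

img = torch.randn(2, 3, 256, 256)
x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=32, p2=32)
print(x.shape)  # torch.Size([2, 64, 3072])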

Considering that 3072 is a bit large, the author performs dimensionality reduction first:

# project 3072 down to dim, assumed to be 1024
self.patch_to_embedding = nn.Linear(patch_dim, dim)
x = self.patch_to_embedding(x)

2 Adding the CLS token

Looking carefully at the figure from the paper above, you can see that although the image is cut into 9 patches, the transformer actually receives 10 vectors: an extra token is prepended at position 0 (the class embedding, marked with an asterisk in the figure). Why add it? Because there is no decoder here; classification is predicted directly from the encoder output, so the encoder has to take on a little of the decoder's role and needs something like a "start decoding" flag, very similar to the way a standard transformer decoder right-shifts its target embeddings by one position. Suppose there were no extra input: 9 patches go in and 9 encoded vectors come out, so which output should be used for classification? No single choice makes more sense than another, so the authors append an extra learnable embedding vector to the input.

Why make this extra embedding learnable, rather than using a fixed token as in NLP? My personal, irresponsible guess is that this reflects a difference between vision and NLP: each word in NLP has a specific, discrete meaning, whereas images have no such discrete tokens, only continuous features or pixels, so if the token were not learnable it is unclear what fixed value would be appropriate; all zeros or all ones would be meaningless. With the extra token there are 10 input vectors and 10 output vectors, and the 0th output is taken for the classification prediction; in this sense the encoder really does take on a bit of the decoder's function. The implementation is very simple: position 0 gets its own position encoding vector, and the class token itself is a learnable embedding vector.

# dim = 1024
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
# broadcast the cls token to the batch: (b, 1, 1024)
cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
# prepend the extra token: x becomes (b, 65, 1024)
x = torch.cat((cls_tokens, x), dim=1)

3 Position encoding (PE)

1-D position encoding: for example, a 3x3 grid has 9 patches in total, encoded 1 to 9

2-D position encoding: the patches are encoded 11, 12, 13, 21, 22, 23, 31, 32, 33, i.e. the X- and Y-axis information is taken into account simultaneously, with each axis using an encoding of dimension D/2
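
For illustration only, here is a rough sketch of how such a 2-D scheme could be built (the code below does not do this; it uses a single learnable 1-D table), giving each axis a learnable embedding of dimension D/2 and concatenating the two halves:

import torch
from torch import nn

D, grid = 1024, 8                                  # embedding dim and 8x8 patch grid (assumed)
row_emb = nn.Parameter(torch.randn(grid, D // 2))  # one D/2 embedding per row index
col_emb = nn.Parameter(torch.randn(grid, D // 2))  # one D/2 embedding per column index

# position (i, j) -> concat(row_emb[i], col_emb[j]), giving a (grid*grid, D) table
pos_2d = torch.cat([
    row_emb[:, None, :].expand(grid, grid, D // 2),
    col_emb[None, :, :].expand(grid, grid, D // 2),
], dim=-1).reshape(grid * grid, D)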

No significant differences were found among the various position embedding schemes that were tried. This is probably because the transformer encoder works at the patch level, so learned embeddings that capture the ordering relationships between patches (spatial information) are not that important: it is much easier to reason about the relationships among a small grid of P x P patches than among all the pixels of the full H x W image.

The implementation here is fairly simple: instead of using a fixed sin/cos encoding, the position embedding is made learnable, and the effect is similar. After training, adjacent positions end up with similar position encoding vectors, and their overall arrangement mirrors the 2-D layout of the patches. The patch embedding vectors and position encoding vectors are added together to form the encoder input:

# num_patches = 64, dim = 1024; the +1 is for the extra cls token that acts as the "start decoding" flag
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
x += self.pos_embedding[:, :(n + 1)]
x = self.dropout(x)

Visualize the trained pos_embedding as follows:
(figure: visualization of the learned pos_embedding)

4 Transformer Encoder

The author uses a standard transformer encoder without any modifications, so there is little to add here.

self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

If the input is (b, 65, 1024), the transformer output is also (b, 65, 1024).

5 Classification head

The fully connected (fc) classification head is attached after the encoder:

self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, num_classes)
        )

# of the 65 outputs, only the 0th (cls) output is needed for the final classification
self.mlp_head(x[:, 0])

Overall architecture

That covers all the pieces, and it is all quite simple. The overall flow of the outer layer is:
(figure: overall ViT forward flow)

import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class FeedForward(nn.Module):  
    # LN + FC + GELU + Dropout + FC + Dropout
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)
        

class Attention(nn.Module):
    # LN(x) -> qkv -> Softmax(q @ k^T / sqrt(d_k)) @ v -> FC
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm = nn.LayerNorm(dim)

        self.attend = nn.Softmax(dim = -1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        x = self.norm(x)

        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)


class Transformer(nn.Module):
    # depth layers of (Attention + FeedForward), followed by a final LayerNorm
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout),
                FeedForward(dim, mlp_dim, dropout = dropout)
            ]))
            
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return self.norm(x)


class ViT(nn.Module):
    # to_patch_embedding + cat_cls_token + add_pos_embedding + transformer + mlp_head
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()  # identity mapping: passes the input through unchanged

        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)
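
A quick usage example for the model above (the hyperparameters here are arbitrary, chosen only to match the 256x256 / patch-32 running example):

model = ViT(image_size=256, patch_size=32, num_classes=1000,
            dim=1024, depth=6, heads=16, mlp_dim=2048,
            dropout=0.1, emb_dropout=0.1)

img = torch.randn(1, 3, 256, 256)
preds = model(img)  # (1, 1000) class logits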

2. Experimental Analysis

The author's conclusion is that applying transformers to computer vision requires a large amount of pre-training data; with the same amount of data, performance is not as good as a CNN's. Once enough data is available (at the cost of much longer training), ViT easily surpasses CNNs.



Origin blog.csdn.net/weixin_54338498/article/details/132398238