ViT: the Vision Transformer backbone network, with paper and code explained in detail

Visual Transformer

Author: louwill

Machine Learning Lab

    

Today we begin the Visual Transformer series with its first article, on Vision Transformer. Vision Transformer (ViT) can be regarded as the backbone network for visual Transformer tasks as a whole.

The paper that proposed the ViT model, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, was published in October 2020. Although it appeared a little later than some models applying Transformers to visual tasks (such as DETR), as a visual classification network with a pure Transformer structure it is still of great pioneering significance.

The overall idea of ViT is to perform image classification with a pure Transformer structure. The experiments in the paper show that, after pre-training on a large-scale dataset, the ViT model can be transferred to classification tasks on small and medium-scale datasets and achieve better performance than CNNs.

Detailed Explanation of the ViT Model

An overview of the overall structure of the ViT model is shown in Figure 1.

The core pipeline of ViT consists of four main parts: image patch processing (make patches), patch embedding plus position encoding, the Transformer encoder, and the MLP classification head. The following sections describe the basic design of ViT in terms of these four parts.

Image patch processing (make patches)

The first step can be seen as image preprocessing. A CNN can apply two-dimensional convolutions directly to the image, so no special preprocessing is required. The Transformer structure, however, cannot process the image directly; the image must first be split into patches.

Suppose an image x ∈ R^(H×W×C) is divided into patches of size P×P×C. There are then N = HW/P² patches, and the full set of patches has dimension N×P×P×C. Each patch is flattened, giving data of dimension N×(P²·C). Here N can be understood as the length of the sequence fed into the Transformer, C is the number of channels of the input image, and P is the size of each image patch.
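A quick worked example of this arithmetic (assuming a 224×224 RGB image with 16×16 patches, the setting behind the paper's title):

# worked example of the patch arithmetic (assumed 224x224 RGB image, 16x16 patches)
H, W, C = 224, 224, 3     # image height, width, channels
P = 16                    # patch size
N = (H * W) // (P * P)    # number of patches: 196
flat_dim = P * P * C      # flattened patch dimension: 768
print(N, flat_dim)        # 196 768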

patch embedding

Splitting the image into patches is only preprocessing. To convert the N×(P²·C) vectors into a two-dimensional input of size N×D, a patch embedding operation is also required. Similar to word embedding in NLP, patch embedding is a way of mapping high-dimensional vectors to lower-dimensional ones.

The so-called patch embedding is simply a linear transformation of each flattened patch vector, i.e. a fully connected layer, which reduces the dimension to D.
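The corresponding formula, reproduced here from Eq. (1) of the ViT paper, is:

    z_0 = [x_{class};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{pos},
    \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D}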

E in the above formula is the fully connected layer of the patch embedding; its input size is (P²·C) and its output size is D.

It is worth noting that a classification vector is prepended to the sequence of N patch embeddings in the above formula; it is used to learn category information during Transformer training. Suppose the image is divided into 9 patches, i.e. N=9, so 9 vectors are fed into the Transformer encoder. Which of these 9 vectors should be used for the classification prediction? None of them is a natural choice. A reasonable approach is to add an extra category vector, a learnable embedding, and feed it into the Transformer encoder together with the 9 patch embedding vectors; the first output vector is then taken as the category prediction. This additional vector can therefore be understood as the place where the other 9 image patches aggregate category information.

position encoding

To retain the spatial position information of the input image patches, a position encoding vector must also be added to the patch embeddings, shown as E_pos in the above formula. ViT does not use a more elaborate 2D position embedding; it simply uses a one-dimensional learnable position embedding, because the authors found in practice that 2D embeddings did not perform better than 1D ones.
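A minimal PyTorch sketch of the class token and the 1D learnable position embedding (hypothetical shapes; the actual implementation appears in the vit-pytorch code later in this article):

import torch
from torch import nn

B, N, D = 4, 9, 64                                       # hypothetical batch size, patch count, embedding dim
patch_tokens = torch.randn(B, N, D)                      # output of the patch embedding layer
cls_token = nn.Parameter(torch.randn(1, 1, D))           # learnable class embedding
pos_embedding = nn.Parameter(torch.randn(1, N + 1, D))   # 1D learnable position encoding

x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # (B, N+1, D)
x = x + pos_embedding                                    # add position information
# after the Transformer encoder, x[:, 0] would be used for classification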

ViT forward process

Once the class token, patch embedding, and position encoding have been combined into the input vector, it can be fed directly into the Transformer encoder, which mainly consists of MSA and MLP blocks. The forward computation of ViT's encoder can therefore be summarized as follows:
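The four formulas, reproduced here from the ViT paper, are:

    z_0 = [x_{class};\; x_p^1 E;\; \dots;\; x_p^N E] + E_{pos}                                  (1)
    z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \dots L        (2)
    z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \dots L               (3)
    y = \mathrm{LN}(z_L^0)                                                                       (4)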

The first formula is the aforementioned patch embedding, class-token concatenation, and position encoding. The second formula is the MSA part, consisting of multi-head self-attention, a skip connection (Add), and layer normalization (Norm); the third formula is the MLP part, consisting of a feedforward network (FFN), a skip connection (Add), and layer normalization (Norm). Together, the second and third formulas form one encoder block, which is repeated L times. The fourth formula is a final layer normalization. Finally, an MLP is used as the classification head.

To show the structure of the ViT model and how the vector dimensions change during the forward pass more clearly, the following figure gives a dimension-change diagram of ViT.

The figure is from the Jishi platform (极市平台).

ViT training and experiments

ViT training method

The basic training strategy of ViT is to pre-train on a large dataset and then transfer to smaller datasets. The large datasets used for ViT pre-training are:

  • ILSVRC-2012 ImageNet dataset: 1,000 classes

  • ImageNet-21k: 21k classes

  • JFT-300M: 18k classes, about 300M high-resolution images

Among these, JFT is a large-scale internal Google image dataset containing about 300M images annotated with 18,291 classes.

The downstream datasets that the pre-trained ViT is transferred to include:

  • CIFAR-10/100

  • Oxford-IIIT Pets

  • Oxford Flowers-102

  • VTAB

The paper designs three ViT models of different sizes: Base, Large, and Huge. The parameters of the three models are summarized below.
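For reference, the three configurations reported in the ViT paper are roughly:

  • ViT-Base: 12 layers, hidden size D = 768, MLP size 3072, 12 heads, about 86M parameters

  • ViT-Large: 24 layers, hidden size D = 1024, MLP size 4096, 16 heads, about 307M parameters

  • ViT-Huge: 32 layers, hidden size D = 1280, MLP size 5120, 16 heads, about 632M parameters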


For example, ViT-B/16 means the ViT-Base model with a patch size of 16.

ViT experimental design

The core experiment of ViT implements the training strategy described above: pre-train on a large-scale dataset and then transfer to small datasets to evaluate the model. To compare against CNNs, the paper uses Big Transfer (BiT), a large CNN model proposed at ECCV 2020 that performs supervised transfer learning with large ResNets. The other CNN baseline is the Noisy Student model from CVPR 2020, a large semi-supervised CNN model.

The accuracies of the ViT, BiT, and Noisy Student models on each small dataset, after pre-training on the three large datasets, are shown in the following table.


It can be seen that, after pre-training on a large dataset, the transferred ViT exceeds the accuracy of state-of-the-art CNN models on each of the small datasets. However, surpassing CNNs in this way requires the combination of a large pre-training dataset and a large model.

The second experiment therefore asks: how large does the pre-training dataset need to be for ViT? The paper runs a comparative experiment, pre-training on ImageNet, ImageNet-21k, and JFT-300M, which serve as the small, medium, and large pre-training datasets respectively. The results are shown in the figure below.


As the figure shows, when pre-training on the smallest dataset, ImageNet, the ViT-Large model performs worse than ViT-Base even though the authors add substantial regularization, and both fall well short of BiT. On the medium-sized ImageNet-21k dataset the models perform comparably. Only on a large dataset such as JFT-300M can the ViT models show their advantage.

All in all, a large pre-training dataset coupled with a large model is the key factor for ViT to achieve SOTA performance.

ViT code usage and interpretation

The ViT model has an open-source implementation, vit-pytorch, that can be used directly and installed via pip:

pip install vit-pytorch

The usage of vit-pytorch is as follows:

import torch
from vit_pytorch import ViT
# create a ViT model instance
v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
# create a random image input
img = torch.randn(1, 3, 256, 256)
# get the output (class logits)
preds = v(img) # (1, 1000)

The meaning of each parameter is:

  • image_size: original image size

  • patch_size: the size of the image patch

  • num_classes: number of classes

  • dim: Transformer hidden variable dimension size

  • depth: Transformer encoder layers

  • heads: number of heads in MSA

  • mlp_dim: dimension of the hidden layer of the FFN (MLP) in the Transformer encoder

  • dropout: dropout ratio

  • emb_dropout: Embedding layer dropout ratio
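Continuing the usage snippet above, preds holds raw class logits of shape (1, 1000); a predicted class index can be obtained with standard PyTorch calls (with random weights this is only a shape sanity check):

# turn the logits from preds = v(img) into a predicted class index
probs = preds.softmax(dim=-1)      # class probabilities, shape (1, 1000)
top_class = probs.argmax(dim=-1)   # most likely class index, shape (1,)
print(top_class)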

Now let's walk through the code in vit.py. ViT is built on Attention and the Transformer, so the construction logic is the same as for a Transformer: first build the low-level components, then assemble them according to ViT's forward process. The low-level components required by ViT are the normalization layer, the FFN, and Attention; the Transformer is built from these three components, and ViT is then built from the Transformer following its forward process. Let's look at the construction of ViT in three steps.

(1) Low-level components: normalization layer, FFN, and Attention

# import the required modules
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
from einops.layers.torch import Rearrange


# helper function that turns a value into a tuple
def pair(t):
    return t if isinstance(t, tuple) else (t, t)


# class wrapper for pre-normalization (LayerNorm applied before the wrapped module)
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)
# FFN
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)
# Attention
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)


        self.heads = heads
        self.scale = dim_head ** -0.5


        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)


        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()


    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        # project to queries, keys and values, then split into three tensors
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # reshape each to (batch, heads, tokens, dim_head)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
        # scaled dot-product attention scores: (batch, heads, tokens, tokens)
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)
        # weighted sum of values, then merge the heads back to (batch, tokens, inner_dim)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
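As a quick sanity check of these components, the shapes can be verified with assumed dimensions (65 tokens = 64 patches plus the class token, at dim 1024):

# shape check for Attention and FeedForward (assumed dimensions)
attn = Attention(dim=1024, heads=16, dim_head=64)
ffn = FeedForward(dim=1024, hidden_dim=2048)
x = torch.randn(1, 65, 1024)
print(attn(x).shape)   # torch.Size([1, 65, 1024])
print(ffn(x).shape)    # torch.Size([1, 65, 1024])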

(2) Build Transformer

# build the Transformer encoder from PreNorm, Attention and FeedForward
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
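The encoder keeps the token dimension unchanged, which can be confirmed with the same assumed dimensions:

# shape check for the Transformer encoder (assumed dimensions)
encoder = Transformer(dim=1024, depth=6, heads=16, dim_head=64, mlp_dim=2048)
x = torch.randn(1, 65, 1024)
print(encoder(x).shape)   # torch.Size([1, 65, 1024])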

(3) Build ViT

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)


        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
        # number of patches
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        # flattened patch dimension
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
        # define the patch embedding
        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )
        # define the position encoding
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # define the class token
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)


        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)


        self.pool = pool
        self.to_latent = nn.Identity()
        # define the MLP classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )
        
    # ViT forward process
    def forward(self, img):
        # patch embedding
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        # prepend the class token
        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        # add the position encoding
        x += self.pos_embedding[:, :(n + 1)]
        # dropout
        x = self.dropout(x)
        # feed into the Transformer encoder
        x = self.transformer(x)
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)
        # MLP classification head
        return self.mlp_head(x)
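To connect the code back to the dimension-change diagram shown earlier, the intermediate shapes can be traced with the configuration from the usage example (image_size=256, patch_size=32, dim=1024):

# trace intermediate shapes with the configuration from the usage example
v = ViT(image_size=256, patch_size=32, num_classes=1000, dim=1024,
        depth=6, heads=16, mlp_dim=2048)
img = torch.randn(1, 3, 256, 256)
tokens = v.to_patch_embedding(img)   # (1, 64, 1024): 64 patches of 32x32, each embedded to dim 1024
print(tokens.shape)
preds = v(img)                       # (1, 1000): logits after class token, encoder and MLP head
print(preds.shape)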

summary

As a pioneering study of the Visual Transformer, ViT can be regarded as a must-read paper for understanding this direction. Since the first half of this year, a large number of ViT-based models for visual tasks have been proposed, with ViT playing a backbone role much like VGG16 or ResNet-50 does for CNNs. Although it is groundbreaking work, ViT still has clear limitations: the large datasets and large models it requires are prohibitive for most practitioners. Of course, these shortcomings are being addressed in subsequent studies.

References:
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://github.com/lucidrains/vit-pytorch
https://mp.weixin.qq.com/s/ozUHHGMqIC0-FRWoNGhVYQ

