Intensive reading of deep learning papers [14]: Vision Transformer


Starting from this article, we turn our attention in deep-learning semantic segmentation from CNNs to Transformer-based models, which build on ViT. Before formally introducing Transformer segmentation networks, you need to understand the ViT classification network: Vision Transformer (ViT) can be regarded as the backbone network for the whole family of visual Transformer tasks.

The paper proposing the ViT model is titled An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale and was published in October 2020. Although it appeared later than some models applying Transformers to vision tasks (such as DETR), as a visual classification network with a pure Transformer structure its work is still of great groundbreaking significance.

The general idea of ViT is to perform image classification with a pure Transformer structure. The experiments in the paper show that, after pre-training on large-scale datasets, the ViT model can achieve better performance than CNNs.

Detailed explanation of ViT model

An overview of the overall structure of the ViT model is shown in Figure 1.

[Figure 1: Overall structure of the ViT model]

The core pipeline of ViT consists of four main parts: patch splitting (make patches), patch embedding and position encoding, the Transformer encoder, and the MLP classification head. The following describes the basic design of ViT along these four parts.

Image block processing (make patches)

The first step can be seen as an image preprocessing step. A CNN can apply two-dimensional convolutions to the image directly, so no special preprocessing is required. A Transformer, however, cannot process a 2D image directly; the image must first be split into patches.

Suppose an image x ∈ R^(H×W×C) is split into patches of size P×P×C; there are then N = HW/P² patches, and the stacked patches have dimensions N×P×P×C. Each patch is then flattened, giving data of dimension N×(P²·C). Here N can be understood as the sequence length fed into the Transformer, C is the number of channels of the input image, and P is the side length of each image patch.
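As a concrete illustration of these shapes (a minimal sketch with assumed example values H = W = 224, C = 3 and P = 16, which are not fixed by the text above), the splitting can be written with einops:

import torch
from einops import rearrange

# assumed example values: a 224x224 RGB image split into 16x16 patches
H, W, C, P = 224, 224, 3, 16
x = torch.randn(1, C, H, W)                  # a single image (batch size 1)

N = (H * W) // (P * P)                       # N = HW / P^2 = 196 patches
# split into patches and flatten each patch: (1, N, P*P*C) = (1, 196, 768)
patches = rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=P, p2=P)
print(patches.shape)                         # torch.Size([1, 196, 768])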

Image block embedding (patch embedding)

Patch splitting is only a preprocessing step. To convert the N×(P²·C) vectors into a two-dimensional input of size N×D, a patch embedding operation is required. Similar to word embedding in NLP, patch embedding is a way to map high-dimensional vectors to lower-dimensional ones.

The so-called patch embedding is simply a linear transformation, i.e. a fully connected layer, applied to each flattened patch vector, reducing its dimension to D.

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos,   E ∈ R^((P²·C)×D),   E_pos ∈ R^((N+1)×D)

E in the formula above is the fully connected layer performing the patch embedding; its input size is P²·C and its output size is D.
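A minimal sketch of E (assuming example values P = 16, C = 3 and D = 768; D is simply whatever hidden size the model uses) is just a single linear layer:

import torch
from torch import nn

P, C, D, N = 16, 3, 768, 196                 # assumed example values
E = nn.Linear(P * P * C, D)                  # patch embedding: (P^2*C) -> D

flat_patches = torch.randn(1, N, P * P * C)  # flattened patches from the previous step
tokens = E(flat_patches)                     # (1, N, D)
print(tokens.shape)                          # torch.Size([1, 196, 768])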

It is worth noting that an extra classification vector is prepended to the N patch embeddings in the embedding formula above; it is used to learn category information during Transformer training. Suppose the image is split into 9 patches, i.e. N = 9, so 9 vectors are fed into the Transformer encoder. Which of these 9 vectors should be used for the classification prediction? None of them is an obvious choice. A reasonable approach is to add an artificial category vector, a learnable embedding, which is fed into the Transformer encoder together with the other 9 patch embedding vectors; the output at this first position is then taken as the category prediction. This additional vector can therefore be understood as gathering the category information from the other 9 image patches.

Position encoding

In order to preserve the spatial position information of the input image patches, a position encoding vector is also added to the patch embeddings, shown as E_pos in the formula above. ViT does not adopt the more elaborate 2D position embedding schemes; it directly uses a one-dimensional learnable position embedding. The authors found that 2D embeddings did not give better results than 1D ones in practice.
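Putting the class token and the 1D learnable position embedding together, a minimal sketch (reusing the assumed shapes N = 196 and D = 768 from above) looks like this:

import torch
from torch import nn

N, D = 196, 768                                        # assumed example values
tokens = torch.randn(1, N, D)                          # patch embeddings

cls_token = nn.Parameter(torch.randn(1, 1, D))         # learnable class embedding
pos_embedding = nn.Parameter(torch.randn(1, N + 1, D)) # learnable 1D position embedding

x = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)  # prepend class token
x = x + pos_embedding                                  # add position encoding
print(x.shape)                                         # torch.Size([1, 197, 768])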

ViT forward flow

After combining the class vector, the patch embeddings and the position encoding into the input sequence, we can feed it directly into the Transformer encoder, which mainly consists of two parts: MSA and MLP. The encoder forward computation of ViT can therefore be summarized as follows:

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos

z'_l = MSA(LN(z_(l-1))) + z_(l-1),   l = 1, …, L

z_l = MLP(LN(z'_l)) + z'_l,   l = 1, …, L

y = LN(z_L^0)

The first formula is the aforementioned patch embedding together with the class vector and position encoding; the second formula is the MSA part, consisting of multi-head self-attention, a skip connection (Add) and layer normalization (Norm), repeated for L blocks; the third formula is the MLP part, consisting of a feed-forward network (FFN), a skip connection (Add) and layer normalization (Norm), likewise repeated for L blocks. The fourth formula is the final layer normalization. Finally, an MLP is used as the classification head (Classification Head).
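The two middle formulas are exactly a pre-norm residual block. The sketch below (using standard PyTorch modules rather than the vit-pytorch classes shown later, and sharing one set of weights across layers purely to keep it short, which a real ViT does not do) maps each formula to a line of code:

import torch
from torch import nn

D, heads, mlp_dim, L = 768, 12, 3072, 12          # ViT-Base-like example values
msa = nn.MultiheadAttention(D, heads, batch_first=True)
mlp = nn.Sequential(nn.Linear(D, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, D))
ln1, ln2, ln_out = nn.LayerNorm(D), nn.LayerNorm(D), nn.LayerNorm(D)

z = torch.randn(1, 197, D)                        # z_0: [cls] + 196 patch tokens
for _ in range(L):                                # repeat the encoder block L times
    h = ln1(z)
    z = msa(h, h, h, need_weights=False)[0] + z   # z'_l = MSA(LN(z_(l-1))) + z_(l-1)
    z = mlp(ln2(z)) + z                           # z_l  = MLP(LN(z'_l)) + z'_l
y = ln_out(z[:, 0])                               # y = LN(z_L^0): class-token output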

To show the structure of the ViT model and how the tensor dimensions change during the forward pass more clearly, the figure below illustrates the dimension changes in ViT.

[Figure: tensor dimension changes through the ViT forward pass]

The picture comes from Jishi platform

ViT training and experiments

ViT training method

The basic training strategy of ViT is to pre-train on a large dataset first and then transfer (fine-tune) on smaller datasets. The large datasets used by ViT for pre-training include:

  • ILSVRC-2012 ImageNet dataset: 1,000 classes

  • ImageNet-21k: 21k classes

  • JFT: 18k classes of high-resolution images

Among them, JFT is Google's internal large-scale image dataset, containing about 300M images labeled with 18,291 categories.

The downstream datasets that ViT pre-training transfers to include:

  • CIFAR-10/100

  • Oxford-IIIT Pets

  • Oxford Flowers-102

  • VTAB

The paper designs three ViT models of different sizes: Base, Large and Huge, representing the basic, large and extra-large models respectively. The parameters of the three models are shown in the table below.

Model       Layers   Hidden size D   MLP size   Heads   Params
ViT-Base    12       768             3072       12      86M
ViT-Large   24       1024            4096       16      307M
ViT-Huge    32       1280            5120       16      632M

For example, ViT-B/16 means the ViT-Base model with a patch size of 16.
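For reference, a ViT-B/16-style model can be instantiated with the vit-pytorch constructor used in the next section (a sketch; the 224×224 input size and 1000 classes are assumed, not dictated by the table):

from vit_pytorch import ViT

# roughly ViT-Base/16: 12 layers, hidden size 768, 12 heads, MLP size 3072
vit_b_16 = ViT(
    image_size = 224,      # assumed input resolution
    patch_size = 16,
    num_classes = 1000,    # assumed number of classes
    dim = 768,
    depth = 12,
    heads = 12,
    mlp_dim = 3072
)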

ViT experimental design

The core experiment of ViT implements the aforementioned training scheme: after pre-training on a large-scale dataset, the model is transferred to small datasets to evaluate its performance. For comparison with CNNs, the paper uses Big Transfer (BiT), a large CNN model proposed at ECCV 2020 that performs supervised transfer learning with large ResNets. The other CNN baseline is Noisy Student (CVPR 2020), a large-scale semi-supervised CNN model.

The accuracies of the ViT, BiT and Noisy Student models on each small dataset, after pre-training on the three large datasets, are shown in the table below.

[Table: transfer accuracy of ViT, BiT and Noisy Student on the downstream datasets]

It can be seen that after pre-training on a large dataset, ViT's accuracy after transfer exceeds that of these SOTA CNN models on each small dataset. However, achieving this performance beyond CNNs requires the combination of a large pre-training dataset and a large model.

This raises the second experiment: how large does the pre-training dataset need to be for ViT? The paper runs a comparative experiment on this question, pre-training on ImageNet, ImageNet-21k and JFT-300M, which serve as a small, medium-scale and very large dataset respectively. The pre-training effects are shown in the figure below.

[Figure: transfer accuracy versus pre-training dataset size (ImageNet, ImageNet-21k, JFT-300M) for ViT and BiT models]

As can be seen from the figure, when pre-training on the smallest dataset, ImageNet, the ViT-Large model performs worse than ViT-Base even though the authors added substantial regularization, and both are far behind BiT. On the medium-scale ImageNet-21k dataset, the models perform similarly. Only on a dataset as large as JFT-300M can the ViT models show their advantage.

All in all, a large pre-training dataset coupled with a large model is the key factor for ViT to achieve SOTA performance.

ViT code usage and interpretation

An open-source implementation of the ViT model, vit-pytorch, can be called directly and installed with pip:

pip install vit-pytorch

The usage of vit-pytorch is as follows:

import torch
from vit_pytorch import ViT
# create a ViT model instance
v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
# a random image as input
img = torch.randn(1, 3, 256, 256)
# get the output predictions
preds = v(img) # (1, 1000)

The meanings of each parameter are:

  • image_size: original image size

  • patch_size: the size of the image patch

  • num_classes: number of categories

  • dim: dimension of the Transformer hidden (token) vectors

  • depth: number of Transformer encoder layers

  • heads: number of heads in MSA

  • mlp_dim: hidden dimension of the FFN (MLP) in each encoder block

  • dropout: dropout ratio

  • emb_dropout: dropout ratio of the embedding layer
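As a small follow-up to the usage snippet above (a sketch reusing the preds tensor from that example), the predicted class can be read off the logits like this:

import torch.nn.functional as F

probs = F.softmax(preds, dim=-1)      # (1, 1000) class probabilities
pred_class = probs.argmax(dim=-1)     # index of the most likely class
print(pred_class.item())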

Let's focus on the code in vit.py. ViT is built on Attention and the Transformer, so its construction follows the same logic as a Transformer: build the low-level components first, then assemble them according to the ViT forward process. The low-level components ViT needs are the normalization layer, the FFN and Attention; the Transformer is built on top of these three, and ViT is finally built on top of the Transformer following its forward process. Let's walk through the construction of ViT in three steps.

(1) The underlying components: normalization layer, FFN, Attention

# import required modules
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
from einops.layers.torch import Rearrange


# helper function: turn a value into a tuple
def pair(t):
    return t if isinstance(t, tuple) else (t, t)


# class wrapper for the (pre-)normalization layer
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)
# FFN
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)
# Attention
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head *  heads
        project_out = not (heads == 1 and dim_head == dim)


        self.heads = heads
        self.scale = dim_head ** -0.5


        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)


        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()


    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
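As a quick shape sanity check of the Attention block (a sketch with assumed values; dim = 1024 and 16 heads mirror the earlier ViT instantiation):

# assumed example: batch of 2 sequences of 65 tokens (64 patches + 1 cls token)
attn = Attention(dim=1024, heads=16, dim_head=64)
x = torch.randn(2, 65, 1024)
print(attn(x).shape)                  # torch.Size([2, 65, 1024]) -- shape preserved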

(2) Build Transformer

# build the Transformer from PreNorm, Attention and FFN
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
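Similarly, the Transformer preserves the (batch, tokens, dim) shape across its depth (a sketch with the same assumed values):

encoder = Transformer(dim=1024, depth=6, heads=16, dim_head=64, mlp_dim=2048)
tokens = torch.randn(2, 65, 1024)
print(encoder(tokens).shape)          # torch.Size([2, 65, 1024])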

(3) Build ViT

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)


        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
        # number of patches
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        # flattened patch dimension
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
        # patch embedding
        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )
        # position embedding
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # class token
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)


        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)


        self.pool = pool
        self.to_latent = nn.Identity()
        # MLP classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )
        
    # ViT forward pass
    def forward(self, img):
        # patch embedding
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        # prepend the class token
        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        # add position encoding
        x += self.pos_embedding[:, :(n + 1)]
        # dropout
        x = self.dropout(x)
        # feed into the Transformer encoder
        x = self.transformer(x)
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)
        # MLP head
        return self.mlp_head(x)
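As a final shape check (a sketch reusing the hyperparameters of the earlier usage example): 256/32 = 8 patches per side gives 64 patches, so the Transformer sees 65 tokens once the class token is prepended.

model = ViT(image_size=256, patch_size=32, num_classes=1000, dim=1024,
            depth=6, heads=16, mlp_dim=2048)
img = torch.randn(1, 3, 256, 256)
print(model.to_patch_embedding(img).shape)   # torch.Size([1, 64, 1024]) -- 64 patch tokens
print(model(img).shape)                      # torch.Size([1, 1000])     -- class logits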

Summary

As the pioneering work on visual Transformers, ViT can be regarded as a must-read paper for understanding this direction. Since the first half of this year, a large number of ViT-based vision studies have been proposed, in which ViT basically plays the role that backbones such as VGG16 or ResNet-50 play in CNNs. Although it is a pioneering work, ViT still has considerable usage restrictions: it requires large datasets and large models, two requirements that discourage most practitioners. Of course, these shortcomings have been continuously addressed in subsequent research.

Previous articles in this series:
Intensive reading of deep learning papers [13]: Deeplab v3+
Intensive reading of deep learning papers [12]: Deeplab v3
Intensive reading of deep learning papers [11]: Deeplab v2
Intensive reading of deep learning papers [10]: Deeplab v1
Intensive reading of deep learning papers [9]: PSPNet
Intensive reading of deep learning papers [8]: ParseNet
Intensive reading of deep learning papers [7]: nnUNet
Intensive reading of deep learning papers [6]: UNet++
Intensive reading of deep learning papers [5]: Attention UNet
Intensive reading of deep learning papers [4]: RefineNet
Intensive reading of deep learning papers [3]: SegNet
Intensive reading of deep learning papers [2]: UNet
Intensive reading of deep learning papers [1]: FCN (fully convolutional networks)

