Visual Transformer
Author: louwill
Machine Learning Lab
This is the first article in the Visual Transformer series, and it covers the Vision Transformer. The Vision Transformer (ViT) can be regarded as the backbone network for the whole family of visual Transformer tasks.
The paper that proposed ViT is titled An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale and was published in October 2020. Although it appeared slightly later than some models that apply the Transformer to vision tasks (such as DETR), it is a classification network built purely from the Transformer structure, so the work is still highly pioneering.
The overall idea of ViT is to perform image classification with a pure Transformer structure. The experiments in the paper show that, after pre-training on large-scale datasets, ViT can be transferred to classification tasks on small and medium-sized datasets and achieve better performance than CNNs.
Detailed Explanation of ViT Model
An overview of the overall structure of the ViT model is shown in Figure 1.
The core pipeline of ViT has four main parts: splitting the image into patches (make patches), patch embedding together with position encoding, the Transformer encoder, and the MLP classification head. The following sections describe the basic design of ViT along these four parts.
Image patch processing (make patches)
The first step can be seen as image preprocessing. A CNN can apply two-dimensional convolutions directly to the image, so no special preprocessing is required. The Transformer structure, however, takes a sequence of vectors as input and cannot process the image directly, so the image has to be split into patches first.
Suppose an image x ∈ R^(H×W×C) is divided into patches of size P×P×C. There are then N = HW/P^2 patches, and all patches together have dimension N×P×P×C. Each patch is then flattened, so the data has dimension N×(P^2×C). Here N can be understood as the length of the sequence fed to the Transformer, C is the number of channels of the input image, and P is the side length of an image patch.
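The patch-splitting step can be sketched in a few lines. This is a minimal illustration that assumes NumPy and a channels-last (H, W, C) array; the helper name make_patches and the 6×6 toy image are made up for the example and are not taken from the paper.

```python
import numpy as np

def make_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches of shape (N, P*P*C)."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    # (H, W, C) -> (H/P, P, W/P, P, C): separate the patch grid from in-patch offsets
    x = image.reshape(h // p, p, w // p, p, c)
    # -> (H/P, W/P, P, P, C): bring the two grid axes to the front
    x = x.transpose(0, 2, 1, 3, 4)
    # -> (N, P*P*C): one flattened row per patch, in row-major grid order
    return x.reshape((h // p) * (w // p), p * p * c)

# Toy input: a 6x6 RGB image with P = 2 gives N = 6*6 / 2^2 = 9 patches
image = np.arange(6 * 6 * 3, dtype=np.float32).reshape(6, 6, 3)
patches = make_patches(image, patch_size=2)
print(patches.shape)
```

Each row of patches is one P×P×C block of the image laid out as a single vector, which is exactly the N×(P^2×C) layout described above.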
Patch embedding
Splitting the image into patches is only preprocessing. To convert the N×(P^2×C) patch vectors into a two-dimensional N×D input, a patch embedding operation is also required. Similar to word embedding in NLP, patch embedding converts high-dimensional vectors into lower-dimensional ones.
The so-called patch embedding is simply a linear transformation of each flattened patch vector, that is, a fully connected layer whose output dimension is D:
z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos,  E ∈ R^((P^2×C)×D),  E_pos ∈ R^((N+1)×D)
E in the above formula is the fully connected layer of the patch embedding; its input size is P^2×C and its output size is D.
It is worth noting that a classification vector is prepended to the sequence of N patch vectors in the above formula; it is used to learn class information during Transformer training. Suppose the image is divided into 9 patches, that is, N=9, so 9 vectors are fed into the Transformer encoder. Which of these 9 vectors should be used for the classification prediction? None of them is a natural choice. A reasonable approach is to add an extra class vector, which is a learnable embedding, feed it into the Transformer encoder together with the 9 patch embedding vectors, and finally take the output at the first position as the class prediction. This additional vector can therefore be understood as the class information that the other 9 image patches are looking for.
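The embedding and class-vector steps can be sketched as follows. This is a minimal NumPy illustration using the N=9 example from the text; P, C, D and the random initial values are assumptions for the example, and in a real model E and x_class are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, C, D = 9, 2, 3, 8   # 9 patches as in the text; P, C, D are illustrative

patches = rng.normal(size=(N, P * P * C))       # flattened patches, N x (P^2 * C)
E = rng.normal(size=(P * P * C, D)) * 0.02      # patch-embedding weights (learned in practice)
x_class = rng.normal(size=(1, D)) * 0.02        # class vector (learned in practice)

tokens = patches @ E                             # N x D: one embedding per patch
z = np.concatenate([x_class, tokens], axis=0)    # (N + 1) x D, class vector first
print(z.shape)
```

The encoder therefore sees N + 1 vectors, and position 0 is the one read out for classification.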
Position encoding
To preserve the spatial position information of the input image patches, a position encoding vector must also be added to the patch embeddings, shown as E_pos in the above formula. ViT does not use a 2D-aware position embedding; it directly uses a standard learnable one-dimensional position embedding, because the authors of the paper found that 2D embeddings did not perform better than 1D ones in practice.
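Why the position encoding is needed can be checked with a small experiment. The sketch below assumes NumPy and uses a deliberately simplified single-head self-attention with identity projections (not the ViT attention layer); E_pos is a random stand-in for a learned embedding. Without position information, reordering the patches only reorders the outputs, so the layer cannot tell where a patch came from.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
N, D = 9, 8
tokens = rng.normal(size=(N, D))   # patch embeddings without position information
E_pos = rng.normal(size=(N, D))    # stand-in for a learned 1D position embedding
perm = np.arange(N)[::-1]          # reverse the patch order

# Without position embeddings, shuffling the input just shuffles the output.
no_pos_equivariant = np.allclose(
    self_attention(tokens)[perm], self_attention(tokens[perm])
)

# With position embeddings, slot i always receives E_pos[i], so the order matters.
with_pos_equivariant = np.allclose(
    self_attention(tokens + E_pos)[perm], self_attention(tokens[perm] + E_pos)
)
print(no_pos_equivariant, with_pos_equivariant)
```

The first check holds and the second does not, which is the sense in which E_pos lets the encoder distinguish patch positions.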
ViT forward process
After the class vector, the patch embeddings, and the position encoding are combined into the embedded input, it is passed to the Transformer encoder, which mainly consists of MSA and MLP blocks. The forward computation of ViT's encoder can be summarized as follows:
z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos
z'_l = MSA(LN(z_(l-1))) + z_(l-1),  l = 1, ..., L
z_l = MLP(LN(z'_l)) + z'_l,  l = 1, ..., L
y = LN(z_L^0)
The first formula is the patch embedding, class vector, and position encoding described above. The second formula is the MSA part, consisting of multi-head self-attention, a skip connection (Add), and layer normalization (Norm); it is repeated L times. The third formula is the MLP part, consisting of a feedforward network (FFN), a skip connection (Add), and layer normalization (Norm); it is also repeated L times. The fourth formula is the final layer normalization applied to the output at the class vector position. Finally, an MLP is used as the classification head.
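The encoder recurrence can be sketched end to end in NumPy. This is a simplified illustration, not the paper's implementation: it uses a single attention head in place of MSA, ReLU in place of GELU, a layer normalization without learned scale and shift, and small random weights; all sizes are assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention, standing in for MSA."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def mlp(x, w1, w2):
    """Two-layer feedforward network; ReLU is used instead of GELU for brevity."""
    return np.maximum(x @ w1, 0.0) @ w2

def encoder_block(z, params):
    z = attention(layer_norm(z), *params["attn"]) + z   # z'_l = MSA(LN(z_(l-1))) + z_(l-1)
    z = mlp(layer_norm(z), *params["mlp"]) + z          # z_l  = MLP(LN(z'_l)) + z'_l
    return z

rng = np.random.default_rng(0)
seq_len, D, hidden, L = 10, 8, 16, 3   # N + 1 = 10 tokens; sizes are illustrative
layers = [
    {
        "attn": [rng.normal(size=(D, D)) * 0.1 for _ in range(3)],
        "mlp": [rng.normal(size=(D, hidden)) * 0.1, rng.normal(size=(hidden, D)) * 0.1],
    }
    for _ in range(L)
]

z = rng.normal(size=(seq_len, D))   # stand-in for z_0
for params in layers:               # repeat the block L times
    z = encoder_block(z, params)
y = layer_norm(z[0])                # y = LN(z_L^0): representation at the class position
print(z.shape, y.shape)
```

Note that the sequence shape is unchanged by every block; only the vector at position 0 is passed on to the classification head.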
To show the structure of the ViT model and how the vectors change during training more clearly, the following figure illustrates the vector dimension changes in ViT.
Image source: Jishi Platform (极市平台)
ViT training and experiments
ViT training method
The basic training strategy of ViT is to pre-train on a large dataset and then transfer to small datasets. The datasets used by ViT for pre-training include:
ILSVRC-2012 ImageNet: 1,000 classes
ImageNet-21k: 21k classes
JFT: 18k classes, high-resolution images
Among them, JFT is an internal Google large-scale image dataset with about 300M images annotated with 18,291 classes.
The datasets to which the pre-trained ViT is transferred include:
CIFAR-10/100
Oxford-IIIT Pets
Oxford Flowers-102
VTAB
The paper designs three ViT models of different sizes: Base, Large, and Huge, which are the base model, the large model, and the huge model respectively. The parameters of the three models are shown in the following table.
For example, ViT-B/16 means the ViT-Base model with a patch size of 16.
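The patch size in the model name directly sets the sequence length seen by the encoder. The short sketch below makes that arithmetic explicit; the function name vit_sequence_length and the 224×224 input resolution are assumptions for illustration, and the /32 variant is included only for comparison.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Tokens fed to the encoder: (H / P) * (W / P) patches plus one class vector."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    grid = image_size // patch_size
    return grid * grid + 1

# 224 x 224 is an assumed input resolution for this example.
for name, patch_size in [("ViT-B/32", 32), ("ViT-B/16", 16)]:
    print(name, vit_sequence_length(224, patch_size))
```

Halving the patch size roughly quadruples the sequence length, and since self-attention cost grows quadratically with sequence length, smaller patches are considerably more expensive.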
ViT experimental design
The core experiment of ViT implements the training method described above: pre-train on a large-scale dataset, then transfer to small datasets and evaluate the model. For comparison with CNNs, the paper uses Big Transfer (BiT), a large CNN model presented at ECCV 2020 that performs supervised transfer learning with large ResNets. The other CNN baseline is Noisy Student from CVPR 2020, a large CNN model trained with semi-supervised learning.
The accuracies of the ViT, BiT, and Noisy Student models on each small dataset after pre-training on the three datasets are shown in the following table.
It can be seen that, after pre-training on large datasets, the transfer accuracy of ViT on each small dataset exceeds the results of some SOTA CNN models. However, surpassing CNNs in this way requires combining a large pre-training dataset with a large model.
The second experiment therefore asks what requirements ViT places on the size of the pre-training dataset. The paper runs a comparative experiment on this question, pre-training on ImageNet, ImageNet-21k, and JFT-300M, which serve as the small, medium-sized, and large datasets respectively. The pre-training results are shown in the figure below.
As can be seen from the figure, when pre-training on the smallest dataset, ImageNet, the ViT-Large model performs worse than the ViT-Base model and far worse than BiT, even though the authors add a lot of regularization. On the medium-sized ImageNet-21k dataset, all models perform similarly. Only on a large dataset such as JFT-300M can the ViT model show its advantages.
All in all, a large pre-training dataset coupled with a large model is the key factor for ViT to achieve SOTA performance.
ViT code usage and interpretation
The ViT model has an open-source implementation, vit-pytorch, which can be called directly and installed with pip:
pip install vit-pytorch
The usage of vit-pytorch is as follows:
import torch
from vit_pytorch import ViT
# Create a ViT model instance
v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)
# Create a random input image
img = torch.randn(1, 3, 256, 256)
# Get the output
preds = v(img) # (1, 1000)
The meaning of each parameter is:
image_size: size of the input image
patch_size: size of each image patch
num_classes: number of classes
dim: hidden dimension of the Transformer
depth: number of Transformer encoder layers
heads: number of heads in MSA
mlp_dim: hidden dimension of the feedforward (MLP) layer
dropout: dropout ratio
emb_dropout: dropout ratio of the embedding layer
Let's focus on the code in vit.py. ViT is based on attention and the Transformer, so it is built with the same logic as the Transformer: first build the underlying components, then assemble them according to the ViT forward process. The underlying components required by ViT are the normalization layer, the FFN, and Attention. The Transformer is built from these three components, and ViT is finally built from the Transformer and the ViT forward process. Let's walk through the construction of ViT in three steps.
(1) Underlying components: normalization layer, FFN, and Attention
# Import the required modules
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
# Helper function: return the input as a tuple
def pair(t):
    return t if isinstance(t, tuple) else (t, t)
# Wrapper class that applies layer normalization before the wrapped module
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)
# FFN: Linear -> GELU -> Dropout -> Linear -> Dropout
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)
# Attention (multi-head self-attention)
class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dropout = 0.):
        super().__init__()
        inner_dim = dim_head * heads
        # The output projection is skipped only for a single head with dim_head == dim
        project_out = not (heads == 1 and dim_head == dim)
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.attend = nn.Softmax(dim = -1)
        # One linear layer produces q, k and v together
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()
    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        # Split the last dimension into h heads
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)
        # Scaled dot-product scores between all pairs of tokens
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)
        # Weighted sum of the values, then merge the heads back together
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
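The head splitting and merging in Attention.forward can be traced with plain arrays. The sketch below is a NumPy re-expression of the same steps under stated assumptions: random weights, no dropout, no bias on the output projection, and small illustrative sizes. The function name multi_head_attention is made up for this example and is not part of vit-pytorch.

```python
import numpy as np

def multi_head_attention(x, w_qkv, w_out, heads):
    """x: (b, n, dim); w_qkv: (dim, 3 * heads * d); w_out: (heads * d, dim)."""
    b, n, _ = x.shape
    inner_dim = w_qkv.shape[1] // 3
    d = inner_dim // heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)               # each (b, n, h * d)
    # 'b n (h d) -> b h n d'
    q, k, v = (t.reshape(b, n, heads, d).transpose(0, 2, 1, 3) for t in (q, k, v))
    dots = np.einsum("bhid,bhjd->bhij", q, k) * d ** -0.5    # scaled dot products
    dots = dots - dots.max(axis=-1, keepdims=True)           # numerical stability
    attn = np.exp(dots)
    attn = attn / attn.sum(axis=-1, keepdims=True)           # softmax over the keys
    out = np.einsum("bhij,bhjd->bhid", attn, v)
    # 'b h n d -> b n (h d)'
    out = out.transpose(0, 2, 1, 3).reshape(b, n, heads * d)
    return out @ w_out, attn

rng = np.random.default_rng(0)
b, n, dim, heads, dim_head = 2, 10, 8, 2, 4   # illustrative sizes
x = rng.normal(size=(b, n, dim))
w_qkv = rng.normal(size=(dim, 3 * heads * dim_head)) * 0.1
w_out = rng.normal(size=(heads * dim_head, dim)) * 0.1

out, attn = multi_head_attention(x, w_qkv, w_out, heads)
print(out.shape, attn.shape)
```

Each head produces its own n×n attention map whose rows sum to 1, and the output has the same shape as the input, which is what allows the residual connection in the Transformer block.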
(2) Build Transformer
# Build the Transformer from PreNorm, Attention and FFN
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))
    def forward(self, x):
        for attn, ff in self.layers:
            # Residual connections around the attention and FFN sub-layers
            x = attn(x) + x
            x = ff(x) + x
        return x
(3) Build ViT
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
        # Number of patches
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        # Dimension of a flattened patch
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
        # Patch embedding: split into patches, flatten, then project to dim
        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )
        # Learnable position embedding (one extra slot for the class token)
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        # Learnable class token
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)
        self.pool = pool
        self.to_latent = nn.Identity()
        # MLP classification head
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )
    # ViT forward process
    def forward(self, img):
        # Patch embedding
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        # Prepend the class token
        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        # Add the position embedding
        x += self.pos_embedding[:, :(n + 1)]
        # Dropout
        x = self.dropout(x)
        # Feed into the Transformer
        x = self.transformer(x)
        # Pool: mean over all tokens, or take the class token output
        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)
        # MLP head
        return self.mlp_head(x)
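To get a feel for the model size, the learnable parameters of the ViT class as listed above can be counted by hand. The helper below is a sketch that mirrors the layers in this listing (square images assumed, count_vit_params is a made-up name); the installed vit-pytorch release may contain additional layers, so its actual count may differ.

```python
def count_vit_params(image_size, patch_size, num_classes, dim, depth, heads,
                     mlp_dim, channels=3, dim_head=64):
    """Count learnable parameters of the ViT class as listed above (square images assumed)."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = channels * patch_size * patch_size
    inner_dim = heads * dim_head
    layer_norm = 2 * dim                                # scale and shift

    patch_embedding = patch_dim * dim + dim             # nn.Linear(patch_dim, dim)
    pos_embedding = (num_patches + 1) * dim
    cls_token = dim

    to_qkv = dim * inner_dim * 3                        # bias = False
    project_out = not (heads == 1 and dim_head == dim)
    to_out = inner_dim * dim + dim if project_out else 0
    feed_forward = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    per_layer = layer_norm + to_qkv + to_out + layer_norm + feed_forward

    mlp_head = layer_norm + dim * num_classes + num_classes
    total = patch_embedding + pos_embedding + cls_token + depth * per_layer + mlp_head
    return {"per_layer": per_layer, "total": total}

# Same configuration as the usage example earlier in the article
counts = count_vit_params(image_size=256, patch_size=32, num_classes=1000,
                          dim=1024, depth=6, heads=16, mlp_dim=2048)
print(counts)
```

Most of the parameters sit in the encoder layers, so depth, dim, and mlp_dim dominate the model size, while the patch size mainly affects the patch embedding and the sequence length.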
Summary
As a pioneering study of the visual Transformer, ViT is a must-read paper for understanding this direction. Since the first half of this year, a large number of ViT-based methods for vision tasks have been proposed, and ViT essentially plays the backbone role that VGG16 or ResNet-50 plays among CNNs. Although it is groundbreaking work, ViT still has limitations: its reliance on large datasets and large models is already prohibitive for most people. These shortcomings are, of course, being addressed in subsequent studies.
References:
An Image Is Worth 16X16 Words: Transformers for Image Recognition at Scale
https://github.com/lucidrains/vit-pytorch
https://mp.weixin.qq.com/s/ozUHHGMqIC0-FRWoNGhVYQ