[Deep Learning] (ICCV-2021) PVT-Pyramid Vision Transformer and PVT_V2

0. Details

Name: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Paper: Original
Code: Official Code

Note reference:
1. Transformer in Semantic Segmentation (Part 3): PVT — Pyramid Vision Transformer
2. The author's own explanation! - Pyramid Vision Transformer in plain language, for dense prediction tasks
3. Pyramid Transformer: a Transformer backbone better suited to dense prediction tasks
4. Thoughts on PVT
5. Concise version - PVT (Pyramid Vision Transformer) algorithm summary
6. Translation version

1. Brief introduction

  • The previously summarized ViT backbone is not designed with a structure suitable for dense prediction tasks in vision, such as segmentation and detection.
    Subsequent papers such as SETR simply use ViT as an encoder and process the single-scale features it extracts with some simple decoders, to verify how well a Transformer works on semantic segmentation.
    However, we know that multi-scale features are very important in semantic segmentation, so PVT proposes a vision Transformer backbone that can extract multi-scale features.

2. Main work

2.1 Problems left by ViT

ViT has the same columnar structure as the original Transformer. This means:
1) ViT lacks multi-scale features: the feature map it outputs is essentially the same size throughout the network.
As a result, when ViT is used as an encoder it can only output single-scale features,
i.e. only 16-stride or 32-stride feature maps over the whole process;
2) The computational cost grows sharply.
Once the resolution of the input image is slightly larger, memory usage becomes very high or the memory even overflows.
Compared with classification, segmentation and detection usually require higher-resolution inputs. So on the one hand, we need to divide the image into more patches than in classification to obtain features of the same granularity; if the number of patches stays the same, the features become coarser and performance degrades. On the other hand, the computational overhead of a Transformer is positively correlated with the number of tokenized patches: the more patches, the higher the cost. Increasing the number of patches therefore puts even more pressure on already limited computing resources. These are the main flaws of applying ViT to dense prediction tasks.

The solution is straightforward:

  • If the output resolution is not enough, increase it;
  • The patch token sequence is too long, which makes the attention matrix expensive to compute, so either shorten the whole sequence, or shorten only the keys and values (as shown in the figure below and in the sketch after this list).
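A minimal sketch (not the paper's code) of the second idea: if only K and V are shortened by a spatial-reduction ratio R, the attention matrix shrinks from N×N to N×(N/R²), while the output still keeps one row per query token:

import torch

N, C, R = 3136, 64, 8              # stage-1 token count, channels, reduction ratio
q = torch.randn(1, N, C)           # queries keep the full length
kv = torch.randn(1, N // R**2, C)  # keys/values shortened to N/R^2 = 49 tokens

attn = (q @ kv.transpose(-2, -1)).softmax(dim=-1)  # (1, 3136, 49) instead of (1, 3136, 3136)
out = attn @ kv                                    # (1, 3136, 64): one output row per query token
print(attn.shape, out.shape)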

2.2 Introduce pyramid structure

After years of development, CNN backbones in computer vision have settled on some general design patterns.
The most typical is the pyramid structure.
A simple summary is:

1) The resolution of the feature map gradually decreases as the network deepens;

2) The number of channels of the feature map gradually increases as the network deepens.
Almost all dense prediction algorithms are designed around such feature pyramids.

How can this structure be introduced into a Transformer?
In the end, simply stacking multiple independent Transformer encoders turned out to work best.
This gives us PVT, as shown in the figure below: in each stage, the resolution of the input is gradually reduced by Patch Embedding.

Besides the pyramid structure, we also made some adjustments to Multi-Head Attention so that high-resolution (4-stride or 8-stride) feature maps can be processed at a lower cost.
(figure)
To reduce the amount of computation while keeping the resolution of the feature map and the global receptive field, we reduce the height and width of the key (K) and value (V) to 1/R_i of the original. In this way, 4-stride and 8-stride feature maps can be processed at a small cost.
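A rough back-of-envelope comparison (my own estimate, counting only the two attention matrix multiplications, with sequence length N = H·W and channel dimension C):

Ω(MHA) ≈ 2·N²·C        vs.        Ω(SRA) ≈ 2·N·(N/R_i²)·C

so shrinking K and V by a ratio of R_i cuts the attention cost by roughly a factor of R_i² (about 64x in stage 1, where R_1 = 8).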

3. Design scheme of PVT

The model consists of 4 stages that generate features at different scales; each stage consists of a Patch Embedding layer and several Transformer blocks (modified compared with the original Transformer).

  • Patch Embedding: its purpose is to split the input into patches, reducing the spatial size of the image while increasing the channel depth of the data
  • Transformer Encoder: its purpose is to compute attention over the image. As the depth grows the computational complexity increases, so the author uses Spatial Reduction (SRA) to reduce the computational cost

In the first stage, given an input image of size H × W × 3, we process it as follows:

First, it is divided into HW/4² patches (here, to align with ResNet, the largest output feature map is 1/4 of the original resolution), and each patch has size 4 × 4 × 3;
then, the flattened patches are sent to a linear projection, giving embedded patches of size HW/4² × C1;
finally, the embedded patches together with the position embedding are sent to a Transformer encoder, and the output is reshaped to H/4 × W/4 × C1.

In a similar way, using the output of the previous stage as input, we obtain the features F2, F3 and F4.

H × W × 3 -> stage1 block -> H/4 × W/4 × C1 -> stage2 block -> H/8 × W/8 × C2 -> stage3 block -> H/16 × W/16 × C3 -> stage4 block -> H/32 × W/32 × C4
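A quick way to verify these shapes (a sketch using the pvt_small settings from the code at the end of this note: patch size 4 in stage 1, then 2):

H = W = 224
embed_dims  = [64, 128, 320, 512]   # C1..C4 of pvt_small
patch_sizes = [4, 2, 2, 2]          # per-stage patch size
for i, (c, s) in enumerate(zip(embed_dims, patch_sizes), start=1):
    H, W = H // s, W // s
    print(f"stage{i}: {H}x{W}x{c}, tokens = {H * W}")
# stage1: 56x56x64, tokens = 3136
# stage2: 28x28x128, tokens = 784
# stage3: 14x14x320, tokens = 196
# stage4: 7x7x512, tokens = 49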


3.1 Patch embedding

The Patch Embedding part is the same as the patch-partition operation on the image in ViT.
In this way we can flexibly adjust the feature map size of each stage, so that a feature pyramid can be built for the Transformer.

At the beginning of each stage, the input is first tokenized as in ViT, i.e. patch embedding is performed; the patch size is 4x4 for the first stage and 2x2 for the others. The idea is somewhat similar to pooling or strided convolution: it reduces the resolution so that the model can extract more abstract information. This means that at each stage (except the first) the height and width of the feature map are halved, so the number of tokens is reduced by a factor of 4.

Each patch is then sent through a linear layer that adjusts the number of channels, and the result is reshaped into tokens.

Overall this makes PVT look similar to ResNet: the feature maps produced by the four stages are 1/4, 1/8, 1/16 and 1/32 of the original resolution, which means PVT can generate features at different scales.

Note: Since the number of tokens differs between stages, each stage has its own position embedding, which is added right after the patch embedding. When the size of the input image changes, the position embeddings are adapted by interpolation.
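A minimal sketch of that interpolation (the full version is _get_pos_embed in the code at the end of this note): the (1, H·W, C) position embedding is folded back to 2-D, resized with bilinear interpolation, and flattened again. The 80×80 target grid here is just a hypothetical larger input:

import torch
import torch.nn.functional as F

pos_embed = torch.zeros(1, 56 * 56, 64)                       # learned for a 56x56 token grid
H_new, W_new = 80, 80                                         # token grid of a larger input
pe = pos_embed.reshape(1, 56, 56, 64).permute(0, 3, 1, 2)     # -> (1, 64, 56, 56)
pe = F.interpolate(pe, size=(H_new, W_new), mode="bilinear")  # -> (1, 64, 80, 80)
pe = pe.reshape(1, 64, H_new * W_new).permute(0, 2, 1)        # -> (1, 6400, 64)
print(pe.shape)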

The code

1. The input data has shape (bs, channel, H, W). For convenience, we use a single image with batch size 1 as an example, so the input is (1, 3, 224, 224). The corresponding code:

model = pvt_small(**cfg)
data = torch.randn((1, 3, 224, 224))
output = model(data)

2. The input first goes through the Patch Embedding of the stage-1 block. This splits the 224x224 image into 4x4 patches, and it is implemented with a convolution: a convolution with kernel size 4 and stride 4 is applied to the 224x224 image.
The corresponding code:

self.proj = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=4)
# 64 is the output feature dimension of the first stage, i.e. C1 in the figure
print(x.shape)  # torch.Size([1, 3, 224, 224])
x = self.proj(x)
print(x.shape)  # torch.Size([1, 64, 56, 56])

In this way, each position of the 56x56 map represents one 4x4 patch of the original image.

3. The (1, 64, 56, 56) tensor is then flattened starting from the second dimension.
The corresponding code:

print(x.shape) # torch.Size([1, 64, 56, 56])
x = x.flatten(2)
print(x.shape) # torch.Size([1, 64, 3136])

At this point, a sequence of 3136 tokens represents the 224x224 image.

4. To make the computation convenient, the second and third dimensions are swapped, and then layer norm is applied to the data.
The corresponding code:

print(x.shape) # torch.Size([1, 64, 3136])
x = x.transpose(1, 2)
print(x.shape) # torch.Size([1, 3136, 64])
x = self.norm(x)
print(x.shape) # torch.Size([1, 3136, 64])

That completes the Patch Embedding operation. The complete code:

def forward(self, x):
    B, C, H, W = x.shape # 1, 3, 224, 224
    x = self.proj(x) # convolution, output 1, 64, 56, 56
    x = x.flatten(2) # flatten, output 1, 64, 3136
    x = x.transpose(1, 2) # swap dimensions, output 1, 3136, 64
    x = self.norm(x) # layer norm, output 1, 3136, 64
    H, W = H // 4, W // 4 # the final height and width become 56, 56
    return x, (H, W)
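For reference, here is a quick usage sketch of the complete PatchEmbed class defined in the full code at the end of this note:

import torch
patch_embed = PatchEmbed(img_size=224, patch_size=4, in_chans=3, embed_dim=64)
x, (H, W) = patch_embed(torch.randn(1, 3, 224, 224))
print(x.shape, H, W)  # torch.Size([1, 3136, 64]) 56 56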


3.2 Position embedding

1. This part is basically the same as ViT's position encoding: create a learnable parameter whose size matches the tensor from patch embedding, i.e. (1, 3136, 64).
The corresponding code:

pos_embed = nn.Parameter(torch.zeros(1, 3136, 64))

2. The position encoding is used in the same way as in ViT: it is directly added to the output x, so the shape does not change.
The corresponding code:

print(x.shape) # torch.Size([1, 3136, 64])
x = x + pos_embed
print(x.shape) # torch.Size([1, 3136, 64])

3. After the addition, the author applies dropout for regularization.
The corresponding code:

pos_drop = nn.Dropout(p=drop_rate)
x = pos_drop(x)

That completes the position embedding operation. The complete code:

x = x + pos_embed
x = pos_drop(x)


3.3 Encoder

The encoder of the i-th stage is composed of depth[i] blocks; the main difference between pvt_tiny and pvt_large is this depth parameter:
(figure)
For pvt_tiny, for example, each encoder consists of two blocks, and each block has the structure shown in the figure below:
(figure)
The input of the first block of the first encoder is the tensor obtained after the position embedding analyzed above, so its input size is (1, 3136, 64); at this point the spatial size after patch embedding is 56x56.

1. As the figure above shows, a copy of the input is kept for the residual connection. The input x first passes through a layer norm (the shape is unchanged) and then through the author's modified multi-head attention layer (SRA, discussed later), and the result is added to the copied input.
The corresponding code:

print(x.shape) # (1,3136,64)
x = x + self.drop_path(self.attn(self.norm1(x), H, W))
print(x.shape) # (1,3136,64)

2. A copy of the features after the SRA layer is kept for the residual connection; the tensor then goes through another layer norm (same shape), is fed to the feed-forward layer (discussed later), and is added to the copied input.
The corresponding code:

print(x.shape) # (1,3136,64)
x = x + self.drop_path(self.mlp(self.norm2(x)))
print(x.shape) # (1,3136,64)

Therefore, after one block the shape of the tensor does not change. The complete code:

def forward(self, x, H, W):
      x = x + self.drop_path(self.attn(self.norm1(x), H, W)) # SRA layer
      x = x + self.drop_path(self.mlp(self.norm2(x))) # feed-forward layer
      return x
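A quick usage sketch of the Block class from the full code at the end of this note, using the stage-1 settings of pvt_small (dim 64, 1 head, mlp_ratio 8, sr_ratio 8):

import torch
block = Block(dim=64, num_heads=1, mlp_ratio=8., qkv_bias=True, sr_ratio=8)
x = torch.randn(1, 3136, 64)   # tokens after patch + position embedding
out = block(x, H=56, W=56)     # H and W let the SRA fold tokens back into a 2-D map
print(out.shape)               # torch.Size([1, 3136, 64]) -- unchanged, as noted above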

3. In this way, the tensor obtained after depth[i] blocks still has size (1, 3136, 64); it only needs to be reshaped back into image form before being passed to the next stage. To restore the shape, simply call reshape: the features become (bs, channel, H, W), i.e. (1, 64, 56, 56). The corresponding code:

print(x.shape) # 1,3136,64
x = x.reshape(B, H, W, -1)
print(x.shape) # 1,56,56,64
x = x.permute(0, 3, 1, 2).contiguous()
print(x.shape) # 1,64,56,56

At this point the tensor fed into stage 2 is (1, 64, 56, 56), and the analysis of the data flow through the first stage is complete. Finally, pvt_tiny, pvt_small, pvt_medium and pvt_large are built by stacking different numbers of blocks in the different encoders.
The full diagram is as follows:
(figure)
So, after stage 1 the (1, 3, 224, 224) input becomes a (1, 64, 56, 56) tensor, which is fed into the next stage to repeat the computation above; this completes the PVT design.

3.4 Spatial-reduction attention (SRA)

After patch embedding, the tokenized patches are fed into several Transformer blocks for processing.

The number of tokens differs between stages, and the earlier stages have more patches. The computation of self-attention is proportional to the square of the sequence length N, so if PVT used the same parameters for every Transformer encoder, as ViT does, the amount of computation would be unbearable.

1. To reduce the amount of computation, PVT uses different network parameters in different stages.
The parameter settings of the different PVT variants are shown below, where P is the patch size, C is the feature dimension, N is the number of heads in MHA (multi-head attention), and E is the expansion ratio of the FFN (the Transformer default is 4).
(figure: per-stage parameter settings)
As the stages progress, the feature dimension gradually increases: stage 1 has a feature dimension of only 64, while stage 4 has 512. This is similar to conventional CNN designs, so although the earlier stages contain many patches, their feature dimension is small and the computation stays manageable. PVT variants of different sizes mainly differ in the number of Transformer encoder layers per stage.
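Since the parameter table above is only available as an image, here is the same information compiled from the model definitions in the full code at the end of this note (my own summary; all variants share the per-stage settings and differ only in the depths):

# Shared per-stage settings (taken from the pvt_* constructors in the full code below)
patch_sizes = [4, 2, 2, 2]          # P_i: patch size of each stage's patch embedding
embed_dims  = [64, 128, 320, 512]   # C_i: feature dimension
num_heads   = [1, 2, 5, 8]          # N_i: attention heads
mlp_ratios  = [8, 8, 4, 4]          # E_i: FFN expansion ratio
sr_ratios   = [8, 4, 2, 1]          # R_i: SRA reduction ratio

# L_i: number of encoder layers per stage for each variant
depths = {
    "pvt_tiny":   [2, 2, 2, 2],
    "pvt_small":  [3, 4, 6, 3],
    "pvt_medium": [3, 4, 18, 3],
    "pvt_large":  [3, 8, 27, 3],
}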

2. In order to further reduce the amount of calculation, the author replaces the multi-head attention (MHA) with the proposed spatial-reduction attention (SRA).

The core of SRA is to reduce the number of keys and values in the attention layer. In conventional MHA the number of key/value tokens equals the sequence length, whereas SRA reduces it to 1/R_i² of the original.

The processing of SRA in stage i can be written as follows (formulas reconstructed from the paper):

SRA(Q, K, V) = Concat(head_1, ..., head_{N_i}) W^O
head_j = Attention(Q W_j^Q, SR(K) W_j^K, SR(V) W_j^V)

where W_j^Q, W_j^K, W_j^V ∈ R^{C_i × d_head} and W^O ∈ R^{C_i × C_i} are linear projections, and the dimension of each head is d_head = C_i / N_i. SR(·) is the spatial-reduction (downsampling) operation, defined as:

SR(x) = Norm(Reshape(x, R_i) W^S)

Here Reshape(x, R_i) reshapes the input x ∈ R^{(H_i W_i) × C_i} to shape (H_i W_i / R_i²) × (R_i² C_i), W^S ∈ R^{(R_i² C_i) × C_i} projects it back to C_i channels, and Norm is layer normalization. The attention itself is the standard

Attention(q, k, v) = Softmax(q kᵀ / √d_head) v

(In a word, the hyperparameters involved in the proposed scheme are the following:

P_i: patch size of stage i;
C_i: number of channels of stage i;
L_i: number of encoder layers of stage i;
R_i: reduction ratio of the SRA in stage i;
N_i: number of heads of stage i;
E_i: MLP expansion ratio of stage i.
)
In terms of implementation:

  • First, K and V with shape (HW, C) are reshaped into 3-D feature maps of shape (H, W, C);
  • Then they are evenly divided into patches of size R × R, and each patch is mapped by a linear transformation to patch embeddings of shape (HW/R², C) (in practice this is similar to the patch embedding operation, i.e. equivalent to a strided convolution);
  • Finally, a layer norm is applied. This greatly reduces the number of K and V tokens. The core code is implemented exactly this way:

self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
self.norm = nn.LayerNorm(dim)

At each stage, after being processed by several SRA modules, the obtained features are reshaped into a 3D feature map and input to the next stage.

1. First, if the parameter self.sr_ratio is 1, the attention of PVT is exactly the same as that of ViT:
(figure)

2. So let us analyze what is different:

self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
x_ = self.norm(x_)
kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)

2.1: The input x has shape (1, 3136, 64).
2.2: permute swaps the dimensions, giving (1, 64, 3136).
2.3: reshape gives (1, 64, 56, 56).
2.4: self.sr(x_) is a convolution whose stride and kernel size both equal sr_ratio, which is 8 here, so the 56x56 spatial size shrinks to 1/8 in each direction (the area shrinks to 1/64); the output shape is (1, 64, 7, 7).
2.5: reshape(B, C, -1) gives (1, 64, 49).
2.6: permute(0, 2, 1) gives (1, 49, 64).
2.7: The shape is unchanged after layer norm.
2.8: kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4) generates k and v from x just as in ViT, except that here x has been shrunk by the convolution.
The shape changes in this line are: (1, 49, 64) -> (1, 49, 128) -> (1, 49, 2, 1, 64) -> (2, 1, 1, 49, 64).
2.9: From kv with shape (2, 1, 1, 49, 64), taking indices 0 and 1 gives k and v (k, v = kv[0], kv[1]), so k and v each have shape (1, 1, 49, 64).

3. The following code is the same as in ViT: after obtaining q, k and v, q is multiplied with all of k, softmax is applied, and the result is used to weight v.
The corresponding code:

attn = (q @ k.transpose(-2, -1)) * self.scale # (1, 1, 3136, 64)@(1, 1, 64, 49) = (1, 1, 3136, 49)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)

x = (attn @ v).transpose(1, 2).reshape(B, N, C)# (1, 1, 3136, 49)@(1, 1, 49, 64) = (1, 1, 3136, 64)
x = self.proj(x)
x = self.proj_drop(x)
# x: (1, 3136, 64)

4. So the size of x input to the attention module is (1,3136,64) and the output shape is also (1,3136,64)

Full code

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., sr_ratio=1):
        super().__init__()
        assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."

        self.dim = dim
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.sr_ratio = sr_ratio
        # in implementation this is equivalent to a convolution layer
        if sr_ratio > 1:
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        if self.sr_ratio > 1:
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1) # here x_.shape = (B, N/R^2, C)
            x_ = self.norm(x_)
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        else:
            kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x

3.5 Feed forward

This part is relatively simple; it is just an MLP module.

1. Full code:
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
The input of the forward function is the result of adding the attention output and the original input through the residual connection.
Input size: (1, 3136, 64)
fc1 output: (1, 3136, 512)
act is the GELU activation; output: (1, 3136, 512)
drop output: (1, 3136, 512)
fc2 output: (1, 3136, 64)
drop output: (1, 3136, 64)

Experimental results

Semantic segmentation

We choose Semantic FPN [21] as the baseline; it is a simple segmentation method that requires no special operations (e.g. dilated convolutions), so using it as a baseline is a good test of the raw effectiveness of the backbone. As in the object detection setup, we feed the feature pyramid directly into Semantic FPN and use bilinear interpolation to resize the pre-trained position embeddings.

Experiment settings

We chose ADE20K [63],
a challenging scene parsing benchmark for semantic segmentation. ADE20K contains 150 fine-grained semantic categories, with 20,210, 2,000 and 3,352 images for training, validation and testing, respectively. We evaluate our PVT backbone by applying it to Semantic FPN [21] (Panoptic Feature Pyramid Networks), a simple segmentation method without dilated convolutions [57]. During training, the backbone is initialized with weights pre-trained on ImageNet [9] and the newly added layers are initialized with Xavier [13]. We optimize the model using AdamW [33] with an initial learning rate of 1e-4. Following the common setup [21, 6], we train the model for 80k iterations on 4 V100 GPUs with a batch size of 16. The learning rate decays according to a polynomial schedule with power 0.9. We randomly resize and crop training images to 512 × 512, and scale images to 512 on the short side during testing.

Results. As shown in Table 5, under different parameter scales our PVT consistently outperforms ResNet [15] and ResNeXt [56] for semantic segmentation with Semantic FPN. For example, with almost the same number of parameters, PVT-Tiny/Small/Medium outperform ResNet-18/50/101 by at least 2.8 points of mIoU. In addition, although Semantic FPN + PVT-Large has 20% fewer parameters than Semantic FPN + ResNeXt101-64x4d, its mIoU is still 1.9 higher (42.1 vs. 40.2). This shows that for semantic segmentation PVT can extract better features than a CNN, thanks to the global attention mechanism.

Finally, the paper uses PVT as the encoder and Semantic FPN as the decoder to test the model's performance on the ADE20K dataset. The details are as follows:
(table: ADE20K results)

Thoughts

It reflects an important phenomenon: gradually shrinking the token sequence can achieve the same effect as the fixed-length token sequences of DeiT and ViT, or even better.

That is, the way tokens are constructed from the image still needs further optimization so that the local context of the image can be extracted more efficiently. This is also reflected in the T2T module of Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, and in the Tokenizer and related structures of Visual Transformers: Token-based Image Representation and Processing for Computer Vision.

Setting that aside, the extraction of multi-resolution features here is both traditional and trendy:

  • It echoes the common enhancement strategy for dense prediction tasks of taking multi-level features from existing CNNs to provide rich multi-scale information to subsequent structures;
  • There is also the recent SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers), an effective attempt to feed features from intermediate Transformer layers into a CNN decoder to help recover the segmentation prediction.

The two ideas fit together quite naturally. But wanting to try something is not the same as being able to pull it off: the concrete practical details, the problems encountered, and the targeted solutions and strategy adjustments are not spelled out. Judging by the results, though, it is a good exploration that connects existing CNN and Transformer work.

The author's thoughts

1. Why is PVT better than CNN under the same parameter amount?

I think there are two points
1) global receptive field and
2) dynamic weights.

In fact, in essence, Multi-Head Attention (MHA) and Conv have some similarities. MHA can be roughly regarded as a convolution with a global receptive field, and the result is a weighted average according to the attention weight . Therefore, the feature expression ability of Transformer will be stronger.

2. Subsequent expandable ideas

1) More efficient attention: as the input resolution grows, PVT's resource consumption grows faster than ResNet's, so PVT is better suited to medium input resolutions (see the ablation study of PVT for details). Finding a more efficient attention scheme is therefore important.

2) Position embedding: PVT's position embeddings are the same as ViT's, i.e. randomly initialized parameters that are simply learned, and when the input resolution changes they have to be resized by interpolation. This is another area that can be improved by finding a method better suited to 2D images.

3) Pyramid structure: PVT is just a fairly simple pyramid Transformer, with stages connected through Patch Embedding; there may be more elegant solutions.

【PVT v2】PVTv2: Improved Baselines with Pyramid Vision Transformer

Paper
Code

Note reference:
PVT, PVTv2
Concise version
Code explanation version

Motivation

Starting point: optimizing PVT v1

  • ViT and PVT v1 encode the image with non-overlapping 4x4 patches, which ignores some of the local continuity of the image.
  • Both ViT and PVT v1 use fixed-size position encodings, which is not friendly to processing images of arbitrary size.
  • The amount of computation is still large.

Improvements

1. Overlapping Patch Embedding

Previously, ViT and PVT used a non-overlapping patch partition, so the information at patch boundaries is sometimes not fully exploited and the local continuity between patches is lost.
Improvement:
enlarge the patch window so that adjacent windows overlap by half of their area, and pad the feature map with zeros to keep the resolution. In this work the authors use zero-padded convolutions to implement the overlapping patch embedding.

Specifically, given an input of size h × w × c, it is fed into a convolution with stride S, kernel size 2S−1, padding S−1, and C output kernels. The output size is h/S × w/S × C.

See the code for details:

1. For the patch embedding module of PVT v1, the author modifies the original convolution:

# original projection layer: stride = kernel size
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

# PVT v2 projection layer
# kernel size = 2 * stride - 1
# padding = stride - 1 (i.e. kernel size // 2)
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride, padding=(patch_size[0] // 2, patch_size[1] // 2))

With this modified patch embedding, an input of (1, 3, 224, 224) still produces an output of (bs, channel, 56, 56), consistent with the original stride-4, kernel-size-4 convolution. The difference is that each encoded patch now also mixes in information from the neighbouring patches above, below, left and right, as can be seen in the lower part of the figure above. It feels similar to Swin, but the implementation is simpler.
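A quick sanity check of that claim (a sketch; for the first stage S = 4, so kernel = 2S−1 = 7 and padding = S−1 = 3):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)
non_overlap = nn.Conv2d(3, 64, kernel_size=4, stride=4)             # PVT v1 patch embedding
overlap     = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)  # PVT v2 overlapping version
print(non_overlap(x).shape)  # torch.Size([1, 64, 56, 56])
print(overlap(x).shape)      # torch.Size([1, 64, 56, 56]) -- same resolution, overlapping patches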

2. Convolutional Feed-Forward

The fixed-size position encoding is removed; instead, zero padding provides positional information in PVT, as shown in Figure 1(b):
a 3×3 depth-wise convolution is added between the first FC layer and the GELU in the feed-forward network.
(A quick recap of depth-wise convolution:
a depthwise separable convolution first applies M 3×3 kernels to the M input feature maps one-to-one, without summing across channels, producing M outputs;
it then applies N 1×1 kernels to those M outputs and sums across channels, producing N outputs.
The operation is therefore split into two steps: a depthwise convolution followed by a pointwise convolution.
)
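A rough parameter count (my own arithmetic, for M = N = 64 channels and 3×3 kernels, ignoring biases) shows why this design is cheap; note that the DWConv below uses only the depthwise step, since the surrounding FC layers already mix channels:

M = N = 64
regular   = M * N * 3 * 3    # standard 3x3 convolution: 36864 weights
depthwise = M * 3 * 3        # one 3x3 kernel per input channel: 576 weights
pointwise = M * N            # 1x1 convolution mixing channels: 4096 weights
print(regular, depthwise + pointwise)  # 36864 4672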

class DWConv(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # depth-wise: groups=dim gives one 3x3 kernel per channel
        self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = x.transpose(1, 2).view(B, C, H, W)  # tokens -> 2-D feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)        # back to tokens

        return x

class Mlp(nn.Module):
    def forward(self, x, H, W):
        x = self.fc1(x)
        x = self.dwconv(x, H, W)  # the added depth-wise convolution
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

3. Linear Spatial Reduction Attention

This further reduces the computational cost of PVT.
The convolution-based spatial reduction in PVT's SRA is replaced by pooling followed by a convolution, which saves computation.

Linear SRA uses average pooling to reduce the spatial dimension (i.e. h × w) to a fixed size (i.e. P × P) before the attention operation, where P is the pooling size of the linear SRA. As a result, linear SRA has linear computational and memory cost, like a convolutional layer.

print(x.shape) # [1, 3136, 64]
x_ = x.permute(0, 2, 1).reshape(B, C, H, W) 
print(x_.shape) # [1, 64, 56, 56]
x_ = self.pool(x_)
print(x_.shape) # [1, 64, 7, 7]
x_ = self.sr(x_)
print(x_.shape) # [1, 64, 7, 7]
x_ = x_.reshape(B, C, -1)
print(x_.shape) # [1, 64, 49]
x_ = x_.permute(0, 2, 1)
print(x_.shape) # [1, 49, 64]

Step 1: restore the input x from tokens back to 2-D, with full shape (bs, channel, H, W).
Step 2: pass it through the adaptive average pooling layer with output size 7.
Step 3: pass it through the convolution layer.
Step 4 onwards: reshape back to (1, H*W, dim), i.e. (1, 49, 64) here.

Full code

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., sr_ratio=1, linear=False):
        # (only the parts that differ from PVT v1 are shown; the q/kv/proj layers are the same as above)
        if not linear:
            if sr_ratio > 1: # this branch is the original SRA, explained in the PVT notes above
                self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
                self.norm = nn.LayerNorm(dim)
        else: # this is the linear SRA
            self.pool = nn.AdaptiveAvgPool2d(7) # an added pooling layer
            self.sr = nn.Conv2d(dim, dim, kernel_size=1, stride=1) # kernel size 1
            self.norm = nn.LayerNorm(dim)
            self.act = nn.GELU() # activation
        self.apply(self._init_weights)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        if not self.linear: # this is the original SRA
            if self.sr_ratio > 1:
                x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
                x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
                x_ = self.norm(x_)
                kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
            else:
                kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)

        else: # this is the linear SRA
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(self.pool(x_)).reshape(B, C, -1).permute(0, 2, 1)
            x_ = self.norm(x_)
            x_ = self.act(x_)
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x

Note

Note that in PVTv2 only pvt_v2_b2_li uses this linear SRA; the best model released by the author, pvt_v2_b5, does not use it. So the main gains come from the other motivation: using a larger convolution kernel (the overlapping patch embedding) to strengthen the connections between patches. For image tasks the relationship between patches clearly matters a great deal, and there should still be a better way to do patch embedding!

Complete pvt.py test code for PVT v1

# dependencies
python3 -m pip install timm
# run
python3 pvt.py

import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial

from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from timm.models.registry import register_model
from timm.models.vision_transformer import _cfg

__all__ = [
    'pvt_tiny', 'pvt_small', 'pvt_medium', 'pvt_large'
]


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., sr_ratio=1):
        super().__init__()
        assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."

        self.dim = dim
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        if self.sr_ratio > 1:
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
            x_ = self.norm(x_)

            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        else:
            kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x


class Block(nn.Module):

    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, sr_ratio=1):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
            attn_drop=attn_drop, proj_drop=drop, sr_ratio=sr_ratio)
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x, H, W):
        x = x + self.drop_path(self.attn(self.norm1(x), H, W))
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        return x


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)

        self.img_size = img_size
        self.patch_size = patch_size
        # assert img_size[0] % patch_size[0] == 0 and img_size[1] % patch_size[1] == 0, \
        #     f"img_size {img_size} should be divided by patch_size {patch_size}."
        self.H, self.W = img_size[0] // patch_size[0], img_size[1] // patch_size[1]
        self.num_patches = self.H * self.W
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        B, C, H, W = x.shape

        x = self.proj(x)
        x = x.flatten(2)
        x = x.transpose(1, 2)
        x = self.norm(x)

        H, W = H // self.patch_size[0], W // self.patch_size[1]

        return x, (H, W)


class PyramidVisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dims=[64, 128, 256, 512],
                 num_heads=[1, 2, 4, 8], mlp_ratios=[4, 4, 4, 4], qkv_bias=False, qk_scale=None, drop_rate=0.,
                 attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm,
                 depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1], num_stages=4):
        super().__init__()
        self.num_classes = num_classes
        self.depths = depths
        self.num_stages = num_stages

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]  # stochastic depth decay rule
        cur = 0

        for i in range(num_stages):
            patch_embed = PatchEmbed(img_size=img_size if i == 0 else img_size // (2 ** (i + 1)),
                                     patch_size=patch_size if i == 0 else 2,
                                     in_chans=in_chans if i == 0 else embed_dims[i - 1],
                                     embed_dim=embed_dims[i])
            num_patches = patch_embed.num_patches if i != num_stages - 1 else patch_embed.num_patches + 1
            pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dims[i]))
            pos_drop = nn.Dropout(p=drop_rate)

            block = nn.ModuleList([Block(
                dim=embed_dims[i], num_heads=num_heads[i], mlp_ratio=mlp_ratios[i], qkv_bias=qkv_bias,
                qk_scale=qk_scale, drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[cur + j],
                norm_layer=norm_layer, sr_ratio=sr_ratios[i])
                for j in range(depths[i])])
            cur += depths[i]

            setattr(self, f"patch_embed{
      
      i + 1}", patch_embed)
            setattr(self, f"pos_embed{
      
      i + 1}", pos_embed)
            setattr(self, f"pos_drop{
      
      i + 1}", pos_drop)
            setattr(self, f"block{
      
      i + 1}", block)

        self.norm = norm_layer(embed_dims[3])

        # cls_token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dims[3]))

        # classification head
        self.head = nn.Linear(embed_dims[3], num_classes) if num_classes > 0 else nn.Identity()

        # init weights
        for i in range(num_stages):
            pos_embed = getattr(self, f"pos_embed{
      
      i + 1}")
            trunc_normal_(pos_embed, std=.02)
        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)


    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        # return {'pos_embed', 'cls_token'} # has pos_embed may be better
        return {'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def _get_pos_embed(self, pos_embed, patch_embed, H, W):
        if H * W == self.patch_embed1.num_patches:
            return pos_embed
        else:
            return F.interpolate(
                pos_embed.reshape(1, patch_embed.H, patch_embed.W, -1).permute(0, 3, 1, 2),
                size=(H, W), mode="bilinear").reshape(1, -1, H * W).permute(0, 2, 1)

    def forward_features(self, x):
        B = x.shape[0]

        for i in range(self.num_stages):
            patch_embed = getattr(self, f"patch_embed{
      
      i + 1}")
            pos_embed = getattr(self, f"pos_embed{
      
      i + 1}")
            pos_drop = getattr(self, f"pos_drop{
      
      i + 1}")
            block = getattr(self, f"block{
      
      i + 1}")
            x, (H, W) = patch_embed(x)
            """
            stage0: 
            """

            if i == self.num_stages - 1:
                cls_tokens = self.cls_token.expand(B, -1, -1)
                x = torch.cat((cls_tokens, x), dim=1)
                pos_embed_ = self._get_pos_embed(pos_embed[:, 1:], patch_embed, H, W)
                pos_embed = torch.cat((pos_embed[:, 0:1], pos_embed_), dim=1)
            else:
                pos_embed = self._get_pos_embed(pos_embed, patch_embed, H, W)

            x = pos_drop(x + pos_embed)

            for blk in block:
                x = blk(x, H, W)
            if i != self.num_stages - 1:
                x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()

        x = self.norm(x)

        return x[:, 0]

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x)

        return x


def _conv_filter(state_dict, patch_size=16):
    """ convert patch embedding weight from manual patchify + linear proj to conv"""
    out_dict = {}
    for k, v in state_dict.items():
        if 'patch_embed.proj.weight' in k:
            v = v.reshape((v.shape[0], 3, patch_size, patch_size))
        out_dict[k] = v

    return out_dict


@register_model
def pvt_tiny(pretrained=False, **kwargs):
    model = PyramidVisionTransformer(
        patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4], qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[2, 2, 2, 2], sr_ratios=[8, 4, 2, 1],
        **kwargs)
    model.default_cfg = _cfg()

    return model


@register_model
def pvt_small(pretrained=False, **kwargs):
    model = PyramidVisionTransformer(
        patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4], qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1], **kwargs)
    model.default_cfg = _cfg()

    return model


@register_model
def pvt_medium(pretrained=False, **kwargs):
    model = PyramidVisionTransformer(
        patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4], qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 18, 3], sr_ratios=[8, 4, 2, 1],
        **kwargs)
    model.default_cfg = _cfg()

    return model


@register_model
def pvt_large(pretrained=False, **kwargs):
    model = PyramidVisionTransformer(
        patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4], qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 8, 27, 3], sr_ratios=[8, 4, 2, 1],
        **kwargs)
    model.default_cfg = _cfg()

    return model


@register_model
def pvt_huge_v2(pretrained=False, **kwargs):
    model = PyramidVisionTransformer(
        patch_size=4, embed_dims=[128, 256, 512, 768], num_heads=[2, 4, 8, 12], mlp_ratios=[8, 8, 4, 4], qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 10, 60, 3], sr_ratios=[8, 4, 2, 1],
        # drop_rate=0.0, drop_path_rate=0.02)
        **kwargs)
    model.default_cfg = _cfg()

    return model

if __name__ == '__main__':
    cfg = dict(
        num_classes = 2,
        pretrained=False
    )
    model = pvt_small(**cfg)
    data = torch.randn((1, 3, 224, 224))
    output = model(data)
    print(output.shape)


Origin blog.csdn.net/zhe470719/article/details/124807854