[Paper Notes] Reading Notes on the PVT Series Papers

PVT series paper reading notes, including PVTv1 and PVTv2.

Contents

1. PVTv1
  I. Introduction
  II. Network structure
    1. Feature Pyramid for Transformer
    2. Spatial Reduction Attention
    3. Network structure
  III. Experimental results
    1. Image classification
    2. Object detection and instance segmentation
    3. Semantic segmentation
2. PVTv2
  I. Introduction
  II. Network structure
    1. Overlapping Patch Embedding
    2. Convolutional Feed-Forward
    3. Linear Spatial Reduction Attention
    4. Network structure
  III. Experimental results
    1. Image classification
    2. Object detection and instance segmentation


1. PVTv1

Paper: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

ViT works well for image classification, but it is not well suited to dense prediction tasks because its output feature map is low-resolution (determined by the number of patches) and single-scale. PVT (Pyramid Vision Transformer) is designed to output high-resolution feature maps and introduces SRA (spatial reduction attention) to reduce the computational cost. Like a CNN, PVT outputs feature maps at multiple resolutions, so it can serve as a backbone for various downstream tasks (semantic segmentation, object detection, etc.).

I. Introduction

ViT has been applied successfully to image classification, but it cannot be used directly for tasks such as semantic segmentation and object detection, for two main reasons:

(1) The output feature map has a single, fixed, and low resolution;

(2) For images of normal size (e.g., images whose short side is 800 pixels), the computational cost is high.

To address these limitations, the authors propose PVT (Pyramid Vision Transformer). PVT can replace a CNN backbone for various downstream tasks (semantic segmentation, object detection, etc.). Its main improvements are:

(1) Patch embedding generates more patches, producing the high-resolution feature maps needed by downstream tasks;

(2) Similar to a CNN backbone, the feature-map resolution is progressively reduced as the network deepens, which reduces the amount of computation;

(3) SRA (spatial reduction attention) is used to further reduce the amount of computation.

II. Network structure

1. Feature Pyramid for Transformer

Just as a CNN obtains feature maps of different scales by using different convolution strides, PVT uses the stride of the patch embedding to reduce the resolution of the feature map.

Assuming the input feature map has shape [N, C, H, W] and the patch embedding uses patch size P, the output corresponds to a grid of H/P × W/P patches, i.e., a feature map of shape [N, embed_dim, H/P, W/P], which reduces the feature-map resolution.

The PatchEmbedding code is as follows:

import torch.nn as nn
from timm.models.layers import to_2tuple


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding """

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)

        self.img_size = img_size
        self.patch_size = patch_size
        assert img_size[0] % patch_size[0] == 0 and img_size[1] % patch_size[1] == 0, \
            f"img_size {img_size} should be divided by patch_size {patch_size}."
        self.H, self.W = img_size[0] // patch_size[0], img_size[1] // patch_size[1]
        self.num_patches = self.H * self.W
        # A convolution with kernel_size == stride == patch_size splits the image into
        # non-overlapping patches and projects each patch to embed_dim channels.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        B, C, H, W = x.shape

        # [B, C, H, W] -> [B, embed_dim, H/P, W/P] -> [B, (H/P)*(W/P), embed_dim]
        x = self.proj(x).flatten(2).transpose(1, 2)
        x = self.norm(x)
        H, W = H // self.patch_size[0], W // self.patch_size[1]

        return x, (H, W)
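
As a quick shape check (a usage sketch, assuming the PatchEmbed module above is in scope), a 224×224 input with patch_size=4 yields a 56×56 grid of patch tokens:

import torch

# Usage sketch: verify the output shape of PatchEmbed (assumes the class above is defined).
patch_embed = PatchEmbed(img_size=224, patch_size=4, in_chans=3, embed_dim=64)
x = torch.randn(1, 3, 224, 224)   # [N, C, H, W]
tokens, (H, W) = patch_embed(x)
print(tokens.shape, H, W)         # torch.Size([1, 3136, 64]) 56 56, since 56 * 56 = 3136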

2. Spatial Reduction Attention

To reduce the computational cost of attention, PVT proposes a simple and effective method. Suppose Q, K, and V each have shape [H*W, C]. A reduction ratio R is introduced: K and V are first reshaped to [H*W/R^2, R^2*C] and then projected back to [H*W/R^2, C] by a fully connected layer, which reduces the cost of the attention computation. (In the implementation, this can be done with a single convolution whose kernel size and stride are both R, which directly maps [H*W, C] to [H*W/R^2, C].)

 Code:

import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., sr_ratio=1):
        super().__init__()
        assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."

        self.dim = dim
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:  # spatial reduction ratio R
            # A conv with kernel_size == stride == R shrinks the K/V sequence by a factor of R^2.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        if self.sr_ratio > 1:
            # Reshape tokens back to a feature map, reduce it spatially, then compute K and V.
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
            x_ = self.norm(x_)
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        else:
            kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x
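
A small shape check (a usage sketch, assuming the Attention module above is in scope): with sr_ratio=8, the 3136 query tokens attend to only 49 key/value tokens:

import torch

# Usage sketch: with sr_ratio=8, K/V are reduced from 56*56=3136 tokens to 7*7=49 tokens.
attn = Attention(dim=64, num_heads=1, sr_ratio=8)
x = torch.randn(1, 56 * 56, 64)   # [B, H*W, C]
out = attn(x, 56, 56)
print(out.shape)                  # torch.Size([1, 3136, 64]) -- the output keeps full resolution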

3. Network structure

The figure in the paper shows the overall architecture: PVT outputs feature maps at several resolutions and, like a CNN backbone, can serve as a general-purpose backbone for various downstream tasks.

The detailed configurations are given in the paper's table, where R is the spatial reduction ratio of the attention, N is the number of attention heads, and E is the expansion ratio of the MLP in the Transformer block; an illustrative per-stage configuration is sketched below.
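
The following is purely illustrative, in the style of the PVT-Small configuration (values recalled from the official PVT repository; verify them against the paper's table before relying on them), to make R, N, and E concrete:

# Illustrative per-stage hyperparameters in the style of PVT-Small
# (recalled from the official PVT repository; treat as an assumption and verify).
pvt_small_cfg = dict(
    patch_sizes=[4, 2, 2, 2],        # patch-embedding stride of each of the 4 stages
    embed_dims=[64, 128, 320, 512],  # channel dimension of each stage
    depths=[3, 4, 6, 3],             # number of Transformer blocks per stage
    num_heads=[1, 2, 5, 8],          # N: attention heads per stage
    mlp_ratios=[8, 8, 4, 4],         # E: MLP expansion ratio per stage
    sr_ratios=[8, 4, 2, 1],          # R: spatial reduction ratio of SRA per stage
)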

III. Experimental results

1. Image classification

Performance of PVT on ImageNet-1K data.

 2. Object detection and instance segmentation

Performance of PVT on the COCO 2017 dataset.

 3. Semantic segmentation

 Performance of PVT on the ADE20K dataset.

2. PVTv2

Paper: PVTv2: Improved Baselines with Pyramid Vision Transformer

Compared with PVTv1, PVTv2 makes three improvements:

(1) Overlapping patch embedding;

(2) Convolution is added to the feed-forward network;

(3) The computational cost of attention is further reduced.

I. Introduction

PVTv1 has three shortcomings:

(1) Like ViT, PVTv1 uses non-overlapping patch embedding;

(2) The positional encoding of PVTv1 has a fixed size, so it does not handle inputs of arbitrary size well;

(3) The computational cost is still high.

To address these shortcomings, PVTv2 makes the following improvements:

(1) Overlapping patch embedding;

(2) The positional encoding is removed, and convolution is added to the feed-forward network;

(3) Linear Spatial Reduction Attention is proposed.

II. Network structure

These three changes to PVTv1 bring the following benefits:

(1) Local continuity between the image and the feature map is preserved;

(2) Inputs of arbitrary resolution can be handled;

(3) The computational complexity becomes linear, as in a CNN.

1. Overlapping Patch Embedding

Overlapping patch embedding can be implemented with a convolution simply by making the kernel size larger than the stride (with appropriate padding), as shown in the sketch below.
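
A minimal sketch of this idea follows: the kernel is larger than the stride, so neighbouring patches overlap. The kernel-size 7 / stride 4 / padding 3 setting follows the first stage of the official PVTv2 code and should be treated as an assumption here.

import torch.nn as nn


class OverlapPatchEmbed(nn.Module):
    """ Overlapping patch embedding: kernel_size > stride, so adjacent patches overlap. """

    def __init__(self, patch_size=7, stride=4, in_chans=3, embed_dim=64):
        super().__init__()
        # padding = patch_size // 2 keeps the output size at H/stride x W/stride
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                   # [B, embed_dim, H/stride, W/stride]
        _, _, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)   # [B, H*W, embed_dim]
        x = self.norm(x)
        return x, (H, W)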

2. Convolutional Feed-Forward

The positional encoding makes the network unfriendly to inputs of different sizes. PVTv2 therefore removes it and instead adds a convolution to the feed-forward (MLP) module, using a depthwise separable convolution to keep the extra computation small, so that positional information between pixels is still captured.

The code is as follows:

import torch.nn as nn


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0., linear=False):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        # DWConv is a 3x3 depthwise convolution on the reshaped tokens (a sketch is given below).
        self.dwconv = DWConv(hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)
        self.linear = linear
        if self.linear:
            self.relu = nn.ReLU(inplace=True)
        # The weight-initialization call of the original repository is omitted in this excerpt.

    def forward(self, x, H, W):
        x = self.fc1(x)
        if self.linear:
            x = self.relu(x)
        x = self.dwconv(x, H, W)   # the depthwise conv injects positional information
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
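
DWConv used above is a 3×3 depthwise convolution applied after reshaping the tokens back into a 2D feature map. A sketch consistent with the official PVTv2 code (the exact settings are an assumption to verify against the repository):

import torch.nn as nn


class DWConv(nn.Module):
    """ 3x3 depthwise convolution (groups == channels) on the reshaped token sequence. """

    def __init__(self, dim=768):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1,
                                bias=True, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = x.transpose(1, 2).view(B, C, H, W)   # tokens -> feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)         # feature map -> tokens
        return x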

3. Linear Spatial Reduction Attention

To further reduce the cost of attention, LSRA (Linear Spatial Reduction Attention) is proposed. SRA reduces the resolution of the K/V feature map with a strided convolution, while Linear SRA reduces it with average pooling to a fixed size.

Since attention cost scales with (number of queries) × (number of keys) × (channels), SRA costs on the order of (hw)·(hw/R²)·c, whereas Linear SRA, which pools K/V down to a fixed P×P grid, costs on the order of (hw)·P²·c, i.e., linear in the input resolution. A sketch of the key/value path is given below.
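
The following sketch of the key/value reduction used by Linear SRA follows the linear branch of the official PVTv2 attention code; the fixed 7×7 pooling size and the 1×1 convolution + GELU are assumptions to verify against the repository:

import torch.nn as nn


class LinearSRReduce(nn.Module):
    """ Key/value reduction of Linear SRA: pool the feature map to a fixed 7x7 grid. """

    def __init__(self, dim, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)     # H*W tokens -> pool_size^2 tokens
        self.sr = nn.Conv2d(dim, dim, kernel_size=1, stride=1)
        self.norm = nn.LayerNorm(dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = x.permute(0, 2, 1).reshape(B, C, H, W)      # tokens -> feature map
        x = self.sr(self.pool(x)).reshape(B, C, -1).permute(0, 2, 1)  # [B, pool_size^2, C]
        x = self.act(self.norm(x))
        return x  # K and V are computed from this reduced sequence; Q keeps full resolution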

4. Network structure

The detailed configurations of the PVTv2 variants are listed in the paper's table; the notation follows the PVTv1 configuration table described above.

III. Experimental results

1. Image classification 

ImageNet-1K classification accuracy.

2. Object detection and instance segmentation

Performance on the COCO 2017 dataset is as follows.

Source: blog.csdn.net/qq_40035462/article/details/123589554