DPT: Deformable Patch-based Transformer for Visual Recognition

Paper: https://arxiv.org/abs/2107.14467
Code: https://github.com/CASIA-IVA-Lab/DPT

Transformers have recently achieved great success in computer vision, but how to split images into patches effectively is still an open problem. Existing methods usually divide the image into multiple fixed-size patches and then embed them. The author points out that fixed-size patches have the following two disadvantages:

  1. Destruction of local structure: as shown in (a) below, cropping fixed-size (16×16) patches makes it hard to capture the complete local structure of the target object;
  2. Inconsistent semantics across images: the same object in different images may undergo different geometric variations (scale, rotation, etc.), so fixed-size patches may capture inconsistent semantic information for the same object across images. Both effects can destroy semantic information and degrade performance.
    [Figure: illustration of the two problems caused by fixed-size patch splitting]

To solve this problem, the author proposes a deformable patch (DePatch) module, which adaptively splits images into patches with different positions and sizes in a data-driven manner instead of using a fixed splitting scheme. In this way, the destruction of semantic information caused by the original method is avoided and the semantics within each patch are well preserved.

Prerequisites: Vision Transformer

A vision transformer consists of three parts:

  1. a patch embedding module
    The patch embedding module divides the image into patches of fixed size and position and then feeds each patch through a linear layer to produce a fixed-dimensional vector. Assume the input image or feature map is $A \in \mathbb{R}^{H \times W \times C}$ with $H = W$. Previous work divides $A$ into $N$ patches of size $s \times s$, where $s = \lfloor H / \sqrt{N} \rfloor$; after the linear layer, this yields the vectors $\{z^{(i)}\}_{1 \le i \le N}$.
    To better explain the patch-splitting process, the author reformulates the patch embedding module. Each $z^{(i)}$ is regarded as a rectangular region of the input image with center coordinates $(x^{(i)}_{ct}, y^{(i)}_{ct})$. Because the patch size is fixed to $s$, its top-left corner is $(x^{(i)}_{ct} - s/2,\; y^{(i)}_{ct} - s/2)$ and its bottom-right corner is $(x^{(i)}_{ct} + s/2,\; y^{(i)}_{ct} + s/2)$. Each point inside the patch can be written as $\hat{a}^{(i,j)}$ with $1 \le j \le s \times s$, and the patch is then processed by the linear layer according to formula (1) (see the sketch after this list).
    Formula (1): the $s \times s$ points of each patch are concatenated and passed through the linear embedding layer, i.e. $z^{(i)} = \mathrm{Linear}\big(\mathrm{Concat}(\hat{a}^{(i,1)}, \dots, \hat{a}^{(i,s \times s)})\big)$.

  2. multi-head self-attention blocks

  3. feed-forward multi-layer perceptrons (MLP)
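
For reference, below is a minimal sketch of this fixed-size patch embedding in PyTorch. The unfold-plus-linear formulation, the class name, and the default sizes are chosen here for illustration; this is not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FixedPatchEmbed(nn.Module):
    """Split an image into non-overlapping s x s patches and project each
    flattened patch with a single linear layer (the formula (1) view)."""

    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.s = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)

    def forward(self, x):                # x: (B, C, H, W), H and W divisible by s
        patches = F.unfold(x, kernel_size=self.s, stride=self.s)  # (B, C*s*s, N)
        patches = patches.transpose(1, 2)                         # (B, N, C*s*s)
        return self.proj(patches)                                 # (B, N, embed_dim)


# A 224x224 RGB image gives N = (224 / 16)^2 = 196 patch tokens.
tokens = FixedPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```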

DePatch Module

To obtain adaptive patches, the author adds two learnable prediction branches, location and scale, to the network. The location branch predicts two offsets $(\delta_x, \delta_y)$ of the original patch center, and the scale branch predicts the patch height $s_h$ and width $s_w$. From these four predicted parameters, the top-left and bottom-right corners of the adjusted patch are:
$(x^{(i)}_{lt},\; y^{(i)}_{lt}) = (x^{(i)}_{ct} + \delta_x - s_w/2,\; y^{(i)}_{ct} + \delta_y - s_h/2)$
$(x^{(i)}_{rb},\; y^{(i)}_{rb}) = (x^{(i)}_{ct} + \delta_x + s_w/2,\; y^{(i)}_{ct} + \delta_y + s_h/2)$
As shown in the figure below, with adaptive patches the local region of the target object can be cropped completely.
[Figure: examples of deformable patches covering complete local structures]
The patches obtained this way have different sizes, and most of their coordinates are floating-point numbers, while the network requires a fixed embedding dimension. The author therefore samples $k \times k$ points inside each patch to represent the region:
[Equation: the $k \times k$ sampling locations inside the rectangle defined by the predicted corners]
Since these sampling locations are fractional, bilinear interpolation is used to obtain the feature values at floating-point coordinates (formula (8) follows directly from the definition of bilinear interpolation):
$\hat{a}^{(i,j)} = \sum_{q} g(q_x, p_x)\, g(q_y, p_y)\, A(q), \qquad g(a, b) = \max(0,\; 1 - |a - b|)$
where $p = (p_x, p_y)$ is a fractional sampling location, $q$ enumerates the integer grid positions around it, and $A(q)$ is the feature at $q$.
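
The following is a minimal, simplified sketch of the DePatch idea in PyTorch. It assumes a per-patch head that predicts the offsets and size from the patch's own pooled feature, a uniform $k \times k$ sampling grid inside the predicted rectangle, and it uses F.grid_sample to perform the bilinear interpolation of formula (8). All names and details are illustrative and differ from the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DePatchEmbed(nn.Module):
    """Simplified deformable patch embedding: for every fixed s x s patch, predict
    a center offset and a size, sample k x k points inside the predicted rectangle
    with bilinear interpolation, and embed them with a linear layer."""

    def __init__(self, in_chans=64, embed_dim=128, patch_size=16, k=3):
        super().__init__()
        self.s, self.k = patch_size, k
        # One head predicts (dx, dy, dw, dh) per patch, in normalized [-1, 1] units.
        self.offset_scale = nn.Linear(in_chans, 4)
        nn.init.zeros_(self.offset_scale.weight)
        nn.init.zeros_(self.offset_scale.bias)   # start exactly from the fixed patches
        self.proj = nn.Linear(in_chans * k * k, embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        s, k = self.s, self.k
        nh, nw = H // s, W // s
        N = nh * nw

        # Predict deformation parameters from each fixed patch's pooled feature.
        pooled = F.avg_pool2d(x, s).flatten(2).transpose(1, 2)   # (B, N, C)
        dx, dy, dw, dh = self.offset_scale(pooled).unbind(-1)    # each (B, N)

        # Fixed patch centers in normalized [-1, 1] coordinates, shifted by the offsets.
        ys, xs = torch.meshgrid(torch.arange(nh, device=x.device),
                                torch.arange(nw, device=x.device), indexing="ij")
        cx = ((xs.flatten() + 0.5) * s / W) * 2 - 1 + dx         # (B, N)
        cy = ((ys.flatten() + 0.5) * s / H) * 2 - 1 + dy
        # Base half-extent of a fixed patch plus a predicted residual (a real
        # implementation would constrain the size to stay positive).
        half_w = s / W + dw
        half_h = s / H + dh

        # k x k sampling locations spread uniformly over each predicted rectangle.
        lin = torch.linspace(-1.0, 1.0, k, device=x.device)
        gx = cx[..., None, None] + lin.view(1, 1, 1, k) * half_w[..., None, None]
        gy = cy[..., None, None] + lin.view(1, 1, k, 1) * half_h[..., None, None]
        grid = torch.stack((gx.expand(B, N, k, k),
                            gy.expand(B, N, k, k)), dim=-1).reshape(B, N, k * k, 2)

        # grid_sample does the bilinear interpolation at the fractional locations.
        sampled = F.grid_sample(x, grid, mode="bilinear", align_corners=False)  # (B, C, N, k*k)
        tokens = sampled.permute(0, 2, 1, 3).reshape(B, N, C * k * k)
        return self.proj(tokens)                                  # (B, N, embed_dim)


# Example: a 56x56 feature map with 64 channels and patch size 2 gives 784 tokens.
out = DePatchEmbed(in_chans=64, embed_dim=128, patch_size=2, k=3)(torch.randn(2, 64, 56, 56))
print(out.shape)  # torch.Size([2, 784, 128])
```

With the zero-initialized prediction head, the module initially behaves like the fixed patch embedding and learns the deformations during training.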

Overall Architecture

[Figure: overall DPT architecture; left: the four-stage pyramid structure, right: the stages with DePatch replacing the patch embedding]
The author builds DPT on top of PVT. PVT has four stages, producing features at four different scales (left in the figure above). The author replaces the patch embedding modules of stage 2, stage 3, and stage 4 in PVT with DePatch and keeps all other settings unchanged (right in the figure above).
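
As a rough sketch of where the two modules from the earlier snippets would sit in such a four-stage layout (the channel widths and patch sizes below follow the PVT-Small configuration, and only the patch embedding of each stage is modeled, not the full backbone):

```python
import torch.nn as nn

# Stage 1 keeps the fixed patch embedding; stages 2-4 use the DePatch sketch above.
patch_embeds = nn.ModuleList([
    FixedPatchEmbed(patch_size=4, in_chans=3, embed_dim=64),       # stage 1: unchanged
    DePatchEmbed(in_chans=64,  embed_dim=128, patch_size=2, k=3),  # stage 2
    DePatchEmbed(in_chans=128, embed_dim=320, patch_size=2, k=3),  # stage 3
    DePatchEmbed(in_chans=320, embed_dim=512, patch_size=2, k=3),  # stage 4
])
```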

Experiment

Image Classification

[Table: ImageNet image classification results]
As shown in the table above, the smallest model, DPT-Tiny, achieves a top-1 accuracy of 77.4%, 2.3% higher than the corresponding baseline PVT model. DPT-Medium achieves a top-1 accuracy of 81.9%, surpassing even models with more parameters and computation, such as PVT-Large and DeiT-Base.

Object Detection

[Table: object detection results on COCO compared with PVT and ResNe(X)t]
The table above compares DPT with PVT and ResNe(X)t. Under a similar amount of computation, DPT-Small outperforms PVT-Small by 2.1% mAP and ResNet50 by 6.2% mAP.
[Table: instance segmentation results with Mask R-CNN on COCO]
The results with Mask R-CNN are similar: DPT-Small achieves 43.1% box mAP and 39.9% mask mAP under the 1× schedule, which are 2.7% and 2.1% higher than PVT-Small, respectively.

[Table: detection results with DETR on COCO]
On DETR, DPT-Small achieves 37.7% box mAP, which is 3.0% higher than PVT-Small and 5.4% higher than ResNet50.

