MobileViT
The picture comes from Director B.
Within each patch, the tokens at the same relative position are taken out and joined into one sequence [quite amazing]. It seems that as long as the total number of elements B*C*H*W of the [B, C, H, W] tensor stays unchanged, the dimensions can be reshaped at will. The unfold code:
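def my_unfold(x, patch_h=2, patch_w=2):
    # x is a torch.Tensor of shape [B, C, H, W].
    # The 2x2 default patch size is an assumption; H and W must be divisible by it.
    batch_size, in_channels, height, width = x.shape
    num_patch_h, num_patch_w = height // patch_h, width // patch_w
    num_patches = num_patch_h * num_patch_w
    patch_area = patch_h * patch_w
    # [B, C, H, W] -> [B, C, n_h, p_h, n_w, p_w]
    x = x.reshape(batch_size, in_channels, num_patch_h, patch_h, num_patch_w, patch_w)
    # [B, C, n_h, p_h, n_w, p_w] -> [B, C, n_h, n_w, p_h, p_w]
    x = x.transpose(3, 4)
    # [B, C, n_h, n_w, p_h, p_w] -> [B, C, n_h*n_w, p_h*p_w], i.e. [B, C, N, P]
    x = x.reshape(batch_size, in_channels, num_patches, patch_area)
    # [B, C, N, P] -> [B, P, N, C]
    x = x.transpose(1, 3)
    # [B, P, N, C] -> [B*P, N, C]: B*P is batch size times pixels per patch;
    # each of the B*P sequences holds N tokens, one from every patch
    x = x.reshape(batch_size * patch_area, num_patches, -1)
    return x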
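
To see which pixels end up in the same sequence, here is a small worked example. It is only a sketch: it restates the same reshape/transpose steps in NumPy (swapaxes stands in for the two-axis transpose), and the name unfold_np, the 2x2 patch size, and the 4x4 single-channel input are all assumptions made for illustration.

```python
import numpy as np

def unfold_np(x, patch_h=2, patch_w=2):
    """NumPy restatement of the unfold steps (illustrative helper).

    Assumes H and W are divisible by patch_h and patch_w.
    """
    b, c, h, w = x.shape
    n_h, n_w = h // patch_h, w // patch_w
    x = x.reshape(b, c, n_h, patch_h, n_w, patch_w)    # [B, C, n_h, p_h, n_w, p_w]
    x = x.swapaxes(3, 4)                               # [B, C, n_h, n_w, p_h, p_w]
    x = x.reshape(b, c, n_h * n_w, patch_h * patch_w)  # [B, C, N, P]
    x = x.swapaxes(1, 3)                               # [B, P, N, C]
    return x.reshape(b * patch_h * patch_w, n_h * n_w, c)  # [B*P, N, C]

# A 4x4 image whose pixel values equal their row-major index:
#  0  1  2  3
#  4  5  6  7
#  8  9 10 11
# 12 13 14 15
image = np.arange(16).reshape(1, 1, 4, 4)
seqs = unfold_np(image)

print(seqs.shape)        # (4, 4, 1): P=4 sequences, N=4 tokens each, C=1
print(seqs[0, :, 0])     # [ 0  2  8 10]: the top-left pixel of every patch
print(seqs[3, :, 0])     # [ 5  7 13 15]: the bottom-right pixel of every patch
```

Each row of the result collects the pixel at one fixed position inside the patch, taken from all four patches, so attention over a sequence mixes information across patches rather than within a patch.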