MobileViT code implementation details and highlights

MobileViT

[Figures omitted. The pictures come from Director B.]
From each patch, the token (pixel) at the same relative position is taken out, and the tokens collected this way form one sequence [quite amazing]. It seems that
as long as the total number of elements B*C*H*W of the [B, C, H, W] tensor remains unchanged, the dimensions can be reshaped and rearranged at will. The unfold code:

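    def my_unfold(x, patch_h, patch_w):
        # x is a torch tensor of shape [B,C,H,W]; H and W must be divisible by patch_h and patch_w
        batch_size, in_channels, height, width = x.shape
        num_patch_h, num_patch_w = height // patch_h, width // patch_w
        num_patches = num_patch_h * num_patch_w  # N: number of patches
        patch_area = patch_h * patch_w           # P: pixels per patch
        # [B,C,H,W] -> [B,C,n_h,p_h,n_w,p_w]
        x = x.reshape(batch_size, in_channels, num_patch_h, patch_h, num_patch_w, patch_w)
        # [B,C,n_h,p_h,n_w,p_w] -> [B,C,n_h,n_w,p_h,p_w]
        x = x.transpose(3, 4)
        # [B,C,n_h,n_w,p_h,p_w] -> [B,C,n_h*n_w,p_h*p_w], i.e. [B,C,N,P]
        x = x.reshape(batch_size, in_channels, num_patches, patch_area)
        # [B,C,N,P] -> [B,P,N,C]
        x = x.transpose(1, 3)
        # [B,P,N,C] -> [BP,N,C], where BP = batch size times pixels per patch
        x = x.reshape(batch_size * patch_area, num_patches, -1)

        return x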
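
To make it easier to see which pixels end up in the same sequence, here is a small plain-Python sketch that only tracks indices; it does not use PyTorch. The function name `unfold_indices` and its arguments are hypothetical, introduced only for illustration. It assumes the height and width are divisible by the patch size, and it mirrors the reshape/transpose order of the unfold code above: sequence `p` collects, from every patch, the pixel at in-patch position `p`.

```python
def unfold_indices(height, width, patch_h, patch_w):
    """Hypothetical helper: for each in-patch position p, list the (row, col)
    pixels of the original H x W map that form sequence p after unfolding."""
    num_patch_h, num_patch_w = height // patch_h, width // patch_w
    sequences = []
    for p in range(patch_h * patch_w):
        # (i, j) is the position inside a patch; p = i * patch_w + j
        i, j = divmod(p, patch_w)
        seq = []
        for n in range(num_patch_h * num_patch_w):
            # (a, b) is the patch's place in the patch grid; n = a * num_patch_w + b
            a, b = divmod(n, num_patch_w)
            seq.append((a * patch_h + i, b * patch_w + j))
        sequences.append(seq)
    return sequences


if __name__ == "__main__":
    # A 4x4 feature map split into 2x2 patches: 4 patches, 4 pixels per patch.
    seqs = unfold_indices(4, 4, 2, 2)
    print(seqs[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

In this 4x4 example, sequence 0 holds the top-left pixel of each of the four patches, so attention over that sequence mixes information across patches rather than within one patch.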

Origin blog.csdn.net/weixin_44040169/article/details/127943022