SuperViT: Super Vision Transformer

This article proposes a new method aimed at reducing the computational cost of Vision Transformers. In ViT, the number of tokens scales with the inverse square of the patch size: smaller patches produce far more tokens and a higher computational cost, while larger patches degrade model accuracy. This trade-off works against both goals at once. The authors of SuperViT improve on it from two directions: multi-scale patch partitioning and multiple token keeping rates, minimizing computation to speed up inference while maintaining good model performance. For image classification this approach causes essentially no problems, but in super-resolution, discarding tokens (and the pixels they cover) still seriously hurts model performance.
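To make the two ideas concrete, below is a minimal sketch (not the authors' code) of what multi-scale patch partitioning and token keeping can look like: one image is embedded at several patch sizes, and only a fraction of each token sequence is retained. The patch sizes, keeping rates, and the L2-norm scoring rule are illustrative assumptions; SuperViT itself derives token importance differently.

```python
import torch
import torch.nn as nn


class MultiScalePatchEmbed(nn.Module):
    """Split one image into token sequences at several patch sizes."""

    def __init__(self, patch_sizes=(8, 16, 32), dim=192):
        super().__init__()
        # One non-overlapping conv projection per patch size.
        self.projs = nn.ModuleList(
            nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes
        )

    def forward(self, x):
        # Each entry: (B, N_p, dim) with N_p = (H / p) * (W / p) tokens,
        # so halving the patch size quadruples the token count.
        return [proj(x).flatten(2).transpose(1, 2) for proj in self.projs]


def keep_tokens(tokens, keep_rate):
    """Retain the top `keep_rate` fraction of tokens by L2-norm score.
    (A stand-in scoring rule; the paper scores tokens more carefully.)"""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))
    scores = tokens.norm(dim=-1)                       # (B, N)
    idx = scores.topk(k, dim=1).indices                # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))


if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    embed = MultiScalePatchEmbed()
    # Pair each scale with an assumed keeping rate for illustration.
    for tokens, rate in zip(embed(x), (0.5, 0.7, 1.0)):
        kept = keep_tokens(tokens, rate)
        print(f"{tokens.shape[1]:4d} tokens -> keep {kept.shape[1]:4d}")
```

Running the sketch shows the trade-off from the paragraph above directly: the 8-pixel patches yield 784 tokens versus 49 for 32-pixel patches, which is why combining coarse scales with aggressive keeping rates can cut computation without shrinking every patch.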

Original link: Super Vision Transformer

Origin blog.csdn.net/qq_45122568/article/details/125480313