ViT (Vision Transformer) paper reading notes

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"

Paper: https://arxiv.org/abs/2010.11929

Code: https://github.com/google-research/vision_transformer

DAMO Academy's open-source model platform, the ModelScope community, lets you quickly try out ViT models.

Principle: use the Transformer's self-attention to model relationships across a long sequence of tokens; every token can attend to every other token, capturing long-range dependencies in one layer. A minimal sketch follows.
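The sketch below shows scaled dot-product self-attention, the core operation ViT builds on. It is an illustrative PyTorch snippet, not the paper's code; the function name, weight matrices, and shapes are assumptions chosen for clarity.

```python
# Minimal scaled dot-product self-attention sketch (illustrative, not ViT's code).
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, dim) sequence of patch tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project to queries/keys/values
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # pairwise token similarities
    attn = scores.softmax(dim=-1)                             # each token attends to all tokens
    return attn @ v                                           # attention-weighted sum of values

dim = 64
x = torch.randn(1, 197, dim)                                  # e.g. 196 patch tokens + 1 class token
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                        # (1, 197, 64)
```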

Method: the paper is among the first to apply a pure Transformer to image classification. The input image is split into fixed-size patches (e.g. 16x16) that are linearly projected into tokens, learnable position embeddings are added, an extra learnable [class] token is prepended, and an MLP head on the final [class] token produces the prediction. A sketch of this input pipeline follows.
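The sketch below follows that pipeline in PyTorch, assuming a 224x224 image and 16x16 patches; the class name and hyperparameters are illustrative, see the official repo linked above for the real (JAX/Flax) implementation.

```python
# Minimal sketch of the ViT input pipeline: patchify -> embed -> [class] token
# -> position embeddings. Illustrative only; hyperparameters are assumptions.
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2                      # 14*14 = 196 tokens
        # A conv with kernel == stride == patch size splits the image into
        # non-overlapping 16x16 patches and linearly embeds each one.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # extra learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # learnable positions

    def forward(self, x):                                         # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)               # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)           # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed           # prepend [class], add positions
        return x                                                  # fed to the Transformer encoder

tokens = ViTEmbed()(torch.randn(2, 3, 224, 224))                  # (2, 197, 768)
# After the encoder, an MLP head on tokens[:, 0] (the [class] token) predicts the label.
```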

Result: with large-scale pre-training (ImageNet-21k, JFT-300M), ViT matches or exceeds state-of-the-art CNNs on image classification benchmarks while requiring substantially less compute to train.

Origin: https://blog.csdn.net/tantanweiwei/article/details/128319452