"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
Paper: https://arxiv.org/abs/2010.11929
Code: https://github.com/google-research/vision_transformer
ViT models can also be tried out quickly on ModelScope, DAMO Academy's open-source model platform (the ModelScope community).
Principle: use a Transformer to model long-range dependencies across a sequence of tokens via self-attention.
Method: one of the first works to apply a pure Transformer to image classification. The input image is split into fixed-size patches, each patch is flattened and linearly projected into a token, learnable position embeddings are added, an extra learnable classification token is prepended to the sequence, and a classification head on that token's output produces the prediction.
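The tokenization step above can be sketched in NumPy. This is a minimal illustration with random weights (ViT-Base-style sizes assumed: 224x224 input, 16x16 patches, 768-dim embeddings), not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224                # input image size (assumed, as in ViT-Base/16)
P = 16                     # patch size: the "16x16 words"
C = 3                      # color channels
D = 768                    # embedding dimension (ViT-Base)
N = (H // P) * (W // P)    # number of patches: 14 * 14 = 196

image = rng.standard_normal((H, W, C))

# 1) Split into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)              # (196, 768)

# 2) Linear projection of each flattened patch: the patch embedding.
W_embed = rng.standard_normal((P * P * C, D)) * 0.02  # random stand-in weights
tokens = patches @ W_embed                            # (196, D)

# 3) Prepend a learnable classification token.
cls_token = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls_token, tokens], axis=0)  # (197, D)

# 4) Add learnable position embeddings (one per token, including [class]).
pos_embed = rng.standard_normal((N + 1, D)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (197, 768): the sequence fed to the Transformer encoder
```

The resulting (197, 768) sequence is what the standard Transformer encoder consumes; the classification head reads only the first (class-token) position of the encoder output.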
Result: with large-scale pre-training, ViT matches or exceeds state-of-the-art CNNs on image classification benchmarks.