YOLOv8 improvement: CVPR 2023: Add the EfficientViT backbone: a memory-efficient ViT with cascaded group attention

   1. This article belongs to the YOLOv5/YOLOv7/YOLOv8 improvement column, which collects a large number of improvement methods, drawn mainly from the latest 2023 papers and from 2022 papers.
2. It provides detailed improvement recipes, such as adding an attention mechanism at different positions in the network, which makes experimentation convenient and can also serve as an innovation point for a paper.
3. Accuracy gain: adding EfficientViT effectively improves detection accuracy (see the training sketch after this list).
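Once an EfficientViT module has been registered in the YOLOv8 code base, training proceeds through the usual Ultralytics entry point. The sketch below is only an assumed usage pattern: the file name yolov8-efficientvit.yaml is a hypothetical placeholder for a custom model config, not a file shipped with the library.

```python
# Hypothetical usage: build a YOLOv8 model from a custom config that swaps in
# the EfficientViT backbone, then train it with the standard Ultralytics API.
# "yolov8-efficientvit.yaml" is a placeholder name for such a config.
from ultralytics import YOLO

model = YOLO("yolov8-efficientvit.yaml")                  # build model from the custom config
model.train(data="coco128.yaml", epochs=100, imgsz=640)   # coco128 is a small demo dataset
```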

Paper address

Vision Transformers have achieved great success due to their high model capacity. However, their remarkable performance comes with heavy computational cost, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on an Nvidia V100 GPU and an Intel Xeon CPU, respectively.
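To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch of a sandwich-layout block with cascaded group attention: each head attends over one split of the channels and additionally receives the previous head's output, and a single such attention layer sits between FFN layers. All names, dimensions, and omitted details (depthwise-conv token mixing, attention biases, normalization) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class CascadedGroupAttention(nn.Module):
    """Each head works on one channel split; head i also receives head i-1's output."""
    def __init__(self, dim, num_heads=4, key_dim=16):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.split_dim = dim // num_heads
        self.qkvs = nn.ModuleList([
            nn.Linear(self.split_dim, 2 * key_dim + self.split_dim)
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(dim, dim)
        self.scale = key_dim ** -0.5

    def forward(self, x):                          # x: (B, N, C) token sequence
        splits = x.chunk(self.num_heads, dim=-1)   # feed heads different splits of the full feature
        outs = []
        feat = splits[0]
        for i, qkv in enumerate(self.qkvs):
            if i > 0:                              # cascade: add the previous head's output
                feat = splits[i] + outs[-1]
            q, k, v = qkv(feat).split(
                [self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)
        return self.proj(torch.cat(outs, dim=-1))


class SandwichBlock(nn.Module):
    """A single memory-bound attention layer between efficient FFN layers."""
    def __init__(self, dim, num_heads=4, ffn_ratio=2, num_ffn=1):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.ReLU(),
                                 nn.Linear(dim * ffn_ratio, dim))
        self.pre_ffns = nn.ModuleList([ffn() for _ in range(num_ffn)])
        self.attn = CascadedGroupAttention(dim, num_heads)
        self.post_ffns = nn.ModuleList([ffn() for _ in range(num_ffn)])

    def forward(self, x):
        for f in self.pre_ffns:                    # FFN(s) before attention
            x = x + f(x)
        x = x + self.attn(x)                       # single attention layer in the middle
        for f in self.post_ffns:                   # FFN(s) after attention
            x = x + f(x)
        return x


if __name__ == "__main__":
    block = SandwichBlock(dim=64, num_heads=4)
    print(block(torch.randn(2, 196, 64)).shape)    # torch.Size([2, 196, 64])
```

The cascade means later heads refine features already mixed by earlier heads, which is what the abstract credits for the improved attention diversity at lower cost.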

Reprinted from: blog.csdn.net/m0_51530640/article/details/131783759