Daily Academic Express 4.16

CV - Computer Vision  | ML - Machine Learning  | RL - Reinforcement Learning  | NLP - Natural Language Processing 

Subjects: cs.CV

1.SpectFormer: Frequency and Attention is what you need in a Vision Transformer

Title: SpectFormer: Frequency and Attention Are What You Need in Vision Transformer 

Authors: Badri N. Patro, Vinay P. Namboodiri, Vijay Srinivas Agneeswaran

Article link: https://arxiv.org/abs/2304.06446

Project code: https://badripatro.github.io/SpectFormers/

Summary:

        Vision transformers have been applied successfully to image recognition tasks. There have been works based on multi-head self-attention (ViT, DeiT), analogous to text models, and more recently works based on spectral layers (FNet, GFNet, AFNO). We hypothesize that both spectral and multi-head attention layers play important roles. We investigate this hypothesis and observe that combining spectral and multi-head attention layers indeed yields a better transformer architecture. We therefore propose the novel SpectFormer architecture, which combines spectral and multi-head attention layers. We believe the resulting representation allows the transformer to capture appropriate features, and it yields higher performance than other transformer representations. For example, it improves top-1 accuracy on ImageNet by 2% compared to GFNet-H and LiT. SpectFormer-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art among small versions), and SpectFormer-L achieves 85.7%, the state of the art for the base version of comparable transformers. We further verify that SpectFormer obtains reasonable results in other scenarios, such as transfer learning on standard datasets including CIFAR-10, CIFAR-100, Oxford-IIIT Flowers, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset, and observe that SpectFormer performs consistently on par with the best backbones and can be further optimized. We therefore believe that a combination of spectral and attention layers is what a vision transformer needs.
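
To make the core idea concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) of a SpectFormer-style block: early blocks mix tokens with a learnable filter in the frequency domain, later blocks use standard multi-head self-attention. All dimensions and the 4-spectral / 8-attention split are assumptions for illustration only.

```python
# Minimal sketch of a SpectFormer-style stack (illustrative, not the official code).
import torch
import torch.nn as nn


class SpectralMixer(nn.Module):
    """Mix tokens in the frequency domain with a learnable complex filter."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One learnable complex weight per (frequency, channel); rfft keeps
        # num_tokens // 2 + 1 frequencies for a real-valued input.
        self.filter = nn.Parameter(torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        freq = torch.fft.rfft(x, dim=1, norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft(freq, n=x.shape[1], dim=1, norm="ortho")


class Block(nn.Module):
    """Pre-norm transformer block whose token mixer is spectral or attention."""

    def __init__(self, dim: int, num_tokens: int, heads: int, spectral: bool):
        super().__init__()
        self.spectral = spectral
        self.norm1 = nn.LayerNorm(dim)
        if spectral:
            self.mixer = SpectralMixer(num_tokens, dim)
        else:
            self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        mixed = self.mixer(h) if self.spectral else self.mixer(h, h, h, need_weights=False)[0]
        x = x + mixed
        return x + self.mlp(self.norm2(x))


# Example: 12 blocks, the first 4 spectral, the rest attention (illustrative split).
dim, num_tokens = 384, 196
blocks = nn.Sequential(*[Block(dim, num_tokens, heads=6, spectral=i < 4) for i in range(12)])
tokens = torch.randn(2, num_tokens, dim)  # patch embeddings for 2 images
print(blocks(tokens).shape)               # torch.Size([2, 196, 384])
```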

2.Verbs in Action: Improving verb understanding in video-language models

Title: Verbs in Action: Improving Verb Understanding in Video Language Models

Authors: Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

Article link: https://arxiv.org/abs/2304.06708

Summary:

        Understanding verbs is critical to modeling how people and objects interact with each other and with their environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, which limits their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a novel Verb-Focused Contrastive (VFC) framework. It consists of two main parts: (1) leveraging a pretrained large language model (LLM) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy that balances the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained verb phrase alignment loss. Our method achieves state-of-the-art zero-shot results on three downstream tasks focused on verb understanding: video-text matching, video question answering, and video classification. To the best of our knowledge, this is the first work to propose a method that alleviates the verb understanding problem rather than simply highlighting it.
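
As an illustration of the contrastive part of such a framework, the sketch below (a simplification under my own assumptions, not the paper's implementation) adds one verb-swapped hard-negative caption per video to a standard in-batch contrastive loss. The hard negatives are assumed to be pre-generated, e.g. by prompting an LLM to replace the verb ("a person opens a door" -> "a person closes a door").

```python
# Contrastive loss with verb-swapped hard negatives (illustrative sketch only).
import torch
import torch.nn.functional as F


def contrastive_loss_with_hard_negatives(
    video_emb: torch.Tensor,      # (B, D) video embeddings
    text_emb: torch.Tensor,       # (B, D) embeddings of the true captions
    hard_neg_emb: torch.Tensor,   # (B, D) embeddings of verb-swapped captions
    temperature: float = 0.07,
) -> torch.Tensor:
    # Normalize so the dot product is a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # Standard in-batch similarities (B x B) plus one extra column per video
    # holding the similarity to its own verb-swapped hard negative.
    in_batch = v @ t.T / temperature                         # (B, B)
    hard = (v * n).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    logits = torch.cat([in_batch, hard], dim=1)              # (B, B + 1)

    # The positive for video i is caption i; the appended hard negatives
    # should receive low probability.
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)


# Toy usage with random features standing in for encoder outputs.
B, D = 4, 512
loss = contrastive_loss_with_hard_negatives(
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
)
print(loss.item())
```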

3.RECLIP: Resource-efficient CLIP by Training with Small Images

Title: RECLIP: Resource Efficient CLIP with Small Image Training

Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

Article link: https://arxiv.org/abs/2304.06028

Summary:

        We propose RECLIP (Resource-efficient CLIP), a simple method that minimizes the computational resource footprint of CLIP (Contrastive Language-Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finally fine-tune the model with high-resolution data. Since the complexity of a vision transformer depends heavily on the input image size, our approach significantly reduces training resource requirements both in theory and in practice. Using the same batch size and number of training epochs, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8× fewer computational resources and 7 to 9× fewer FLOPs than the baselines. Compared to state-of-the-art contrastive learning methods, RECLIP demonstrates 5 to 59× savings in training resources while maintaining highly competitive zero-shot classification and retrieval performance. We hope this work will pave the way for the broader research community to explore language-supervised pretraining in more resource-friendly settings.
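
To see why training on small images saves so much compute for a ViT-based image encoder, here is a rough back-of-the-envelope sketch (my own illustration, not taken from the paper): the patch-token count grows with the square of the image side, and the self-attention cost grows quadratically with the token count. The 64px/224px resolutions below are assumptions for illustration.

```python
# Back-of-the-envelope comparison of ViT attention cost at two image sizes.

def vit_token_count(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image."""
    return (image_size // patch_size) ** 2


def relative_attention_cost(image_size: int, base_size: int = 224) -> float:
    """Self-attention cost relative to a base resolution (quadratic in tokens)."""
    return (vit_token_count(image_size) / vit_token_count(base_size)) ** 2


# Main pretraining stage on small images, short final fine-tuning at full size.
for stage, size in [("pretrain (small images)", 64), ("fine-tune (full res)", 224)]:
    print(f"{stage}: {size}x{size} -> {vit_token_count(size)} tokens, "
          f"attention cost ~{relative_attention_cost(size):.3f}x of 224px")
```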

More AI information: Princess AiCharm
