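Surveys
- Multimodal Machine Learning: A Survey and Taxonomy
Paper: https://arxiv.org/pdf/1705.09406.pdf
Chinese translation: Multimodal Machine Learning: A Survey and Taxonomy (multimodal survey)
- Multimodal Learning with Transformers: A Survey
Paper: https://arxiv.org/pdf/2206.06488.pdf
Chinese translation: 300+ references! A detailed walkthrough of the latest advances in Transformer-based multimodal learning (the translation is incomplete; reading the original paper is recommended)
- A Survey of Open-Domain Dialogue Technology Research (开放型对话技术研究综述)
Summary: A Survey of Open-Domain Dialogue System Research (开放型对话系统研究综述)
- A Survey of Research Progress on Natural Language Generation in Task-Oriented Dialogue Systems (任务型对话系统中的自然语言生成研究进展综述)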
Tutorial
- Vision-Language Pretraining: Current Trends and the Future
URL: https://vlp-tutorial-acl2022.github.io/
Models
- Transformer
Paper: Attention Is All You Need
- BERT
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Video walkthrough by Mu Li (李沐): BERT paper, paragraph-by-paragraph close reading [Paper Close Reading]
- ViLT
Paper: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Source code: https://github.com/dandelin/vilt
Video walkthrough by bryanyzhu: ViLT paper close reading [Paper Close Reading]
Personal notes: [Paper & Model Walkthrough] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- VL-BEiT
Paper: VL-BEiT: Generative Vision-Language Pretraining
Related paper: BEiT: BERT Pre-Training of Image Transformers
- CLIP
Paper: Learning Transferable Visual Models From Natural Language Supervision
Source code: https://github.com/OpenAI/CLIP
Video walkthrough by bryanyzhu: CLIP paper, paragraph-by-paragraph close reading [Paper Close Reading]
Personal notes: [Paper & Model Walkthrough] CLIP (Learning Transferable Visual Models From Natural Language Supervision)
- VideoBERT
Paper: VideoBERT: A Joint Model for Video and Language Representation Learning
Source code: https://github.com/ammesatyajit/VideoBERT
Personal notes: [Paper & Model Walkthrough] VideoBERT: A Joint Model for Video and Language Representation Learning
- Two-Stream Convolutional Networks for Action Recognition in Videos
Paper: https://arxiv.org/abs/1406.2199
Personal notes: [Paper & Model Walkthrough] Two-Stream Convolutional Networks for Action Recognition in Videos