Review
- Multimodal Machine Learning: A Survey and Taxonomy
paper URL: https://arxiv.org/pdf/1705.09406.pdf
Chinese translation: Multimodal Machine Learning: A Survey and Taxonomy (multimodal review)
- Multimodal Learning with Transformers: A Survey
paper URL: https://arxiv.org/pdf/2206.06488.pdf
Chinese translation: 300+ references! A detailed explanation of the latest developments in Transformer-based multimodal learning (the translated content is incomplete; reading the original is recommended)
- A Survey of Open-Domain Dialogue Technology Research
- A Survey of Research Progress in Natural Language Generation in Task-Oriented Dialogue Systems
Tutorial
- Vision-Language Pretraining: Current Trends and the Future
URL: https://vlp-tutorial-acl2022.github.io/
Model
- Transformer
paper URL: Attention Is All You Need
- BERT
paper URL: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Li Mu's video explanation: BERT paper intensive reading, paragraph by paragraph [paper intensive reading]
- ViLT
paper URL: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
source code URL: https://github.com/dandelin/vilt
bryanyzhu's video explanation: ViLT paper intensive reading [paper intensive reading]
personal notes: [paper & model explanation] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- VL-BEiT
paper URL: VL-BEiT: Generative Vision-Language Pretraining
related papers: BEiT: BERT Pre-Training of Image Transformers
- CLIP
paper URL: Learning Transferable Visual Models From Natural Language Supervision
source code URL: https://github.com/OpenAI/CLIP
bryanyzhu's video explanation: CLIP paper intensive reading, paragraph by paragraph [paper intensive reading]
personal notes: [paper & model explanation] CLIP (Learning Transferable Visual Models From Natural Language Supervision); a minimal zero-shot usage sketch based on the repo above appears after this list
- VideoBERT
paper URL: VideoBERT: A Joint Model for Video and Language Representation Learning
source code URL: https://github.com/ammesatyajit/VideoBERT
personal notes: [paper & model explanation] VideoBERT: A Joint Model for Video and Language Representation Learning
- Two-Stream Convolutional Networks for Action Recognition in Videos
paper URL: https://arxiv.org/abs/1406.2199
personal notes: [paper & model explanation] Two-Stream Convolutional Networks for Action Recognition in Videos
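
As referenced in the CLIP entry above, here is a minimal zero-shot image-classification sketch using the linked OpenAI CLIP repository. It is only a sketch under stated assumptions, not part of any listed paper: it assumes the `clip` package installed from https://github.com/OpenAI/CLIP (plus `torch` and `Pillow`), and the image path and candidate labels are placeholders.

```python
# Minimal zero-shot classification sketch with the OpenAI CLIP repo linked above.
# Assumed setup (not from the reading list itself):
#   pip install torch torchvision Pillow git+https://github.com/openai/CLIP.git
# "example.jpg" and the candidate labels are placeholders for illustration.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads pretrained weights

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # logits_per_image holds the scaled cosine similarities between the image
    # embedding and each text embedding in CLIP's shared space.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```

The prompt-style labels ("a photo of a ...") follow the convention used for zero-shot evaluation in the CLIP paper.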