Applications of BERT in Multimodal Domains

Since its introduction, BERT (Bidirectional Encoder Representations from Transformers) has substantially raised the baseline performance of a wide range of NLP tasks, thanks to the Transformer's powerful feature-learning capacity and the bidirectional encoding achieved through its masked language model. Given this strong learning ability, BERT began to be applied to multimodal tasks in 2019. These applications fall into two main camps: single-stream models, in which textual and visual information are fused from the very start; and dual-stream models, in which textual and visual information first pass through two independent encoding modules and are then fused via cross-modal attention between the two streams. This article introduces and compares five BERT models applied to image-text tasks: VisualBERT, Unicoder-VL, VL-BERT, ViLBERT, and LXMERT.
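The single-stream/dual-stream distinction can be sketched with a toy single-head attention layer. This is only an illustration of the two fusion strategies under simplifying assumptions (random features, no learned projections, no layer norm or feed-forward blocks), not the implementation of any of the models above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: queries attend over keys, mix values.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def single_stream(text, visual):
    # Single-stream: concatenate text tokens and visual regions into one
    # joint sequence, then run self-attention over the combined sequence.
    joint = np.concatenate([text, visual], axis=0)
    return attention(joint, joint, joint)

def dual_stream(text, visual):
    # Dual-stream: each modality is kept separate; fusion happens through
    # cross-attention, with each stream's queries attending to the other.
    text_fused = attention(text, visual, visual)    # text attends to visual
    visual_fused = attention(visual, text, text)    # visual attends to text
    return text_fused, visual_fused

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, hidden size 8
visual = rng.normal(size=(3, 8))  # 3 region features, hidden size 8

print(single_stream(text, visual).shape)  # (7, 8): one joint sequence
t, v = dual_stream(text, visual)
print(t.shape, v.shape)                   # (4, 8) (3, 8): two streams
```

In real models, the visual inputs are region features from an object detector (or video frame features for VideoBERT), and the cross-attention is interleaved with self-attention layers within each stream.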

Single-Stream Models

1. VisualBERT 

Paper: VisualBERT: A Simple and Performant Baseline for Vision and Language

Link: https://arxiv.org/abs/1908.03557

Code: https://github.com/uclanlp/visualbert

2. Unicoder-VL

Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Link: https://arxiv.org/abs/1908.06066

3. VL-BERT 

Paper: VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Link: https://arxiv.org/abs/1908.08530

Code: https://github.com/jackroos/VL-BERT

Dual-Stream Models

1. ViLBERT 

Paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Link: https://arxiv.org/abs/1908.02265

Code: https://github.com/facebookresearch/vilbert-multi-task

2. LXMERT 

Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Link: https://arxiv.org/abs/1908.07490

Code: https://github.com/airsplay/lxmert

Video-Based BERT

1. VideoBERT

Paper: VideoBERT: A Joint Model for Video and Language Representation Learning

Venue: ICCV 2019


Reposted from www.cnblogs.com/wangxiaocvpr/p/12402975.html