【Paper】Deep Multimodal Representation Learning: A Survey (Part 1)

Citations: 6 (as of 04/22/20)
Year: 2019
Original paper: see the link in the original post



Deep Multimodal Representation Learning: A Survey

深度多模态表征学习研究综述


ABSTRACT

Multimodal representation learning, which aims to narrow the heterogeneity gap among different modalities, plays an indispensable role in the utilization of ubiquitous multimodal data. Due to its powerful representation ability with multiple levels of abstraction, deep learning-based multimodal representation learning has attracted much attention in recent years. In this paper, we provide a comprehensive survey of deep multimodal representation learning, a topic that has never been the sole focus of a survey before. To facilitate the discussion on how the heterogeneity gap is narrowed, according to the underlying structures in which different modalities are integrated, we categorize deep multimodal representation learning methods into three frameworks: joint representation, coordinated representation, and encoder-decoder. Additionally, we review some typical models in this area, ranging from conventional models to newly developed technologies. This paper highlights the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have never been reviewed previously, even though they have become the major focus of much contemporary research. For each framework or model, we discuss its basic structure, learning objective, application scenes, key issues, advantages, and disadvantages, such that both novice and experienced researchers can benefit from this survey. Finally, we suggest some important directions for future work.

INDEX TERMS Multimodal representation learning, multimodal deep learning, deep multimodal fusion, multimodal translation, multimodal adversarial learning.

多模态表示学习旨在缩小不同模式之间的异质性差异,在利用普遍存在的多模态数据方面发挥着不可或缺的作用。基于深度学习的多模态表示学习由于具有强大的多层次抽象表示能力,近年来受到广泛关注。本文对目前尚未完全集中的深度多模态表征学习进行了综述。为了便于讨论如何缩小异质性差距,根据不同模式整合的底层结构,我们将深度多模态表示学习方法分为三个框架:联合表示、协调表示和编码器-解码器。此外,我们还回顾了这方面的一些典型模型,从传统模型到新开发的技术。本文从多模态表征学习的角度,重点讨论了新发展的编码-解码模型、生成式对抗网络、注意机制等关键技术,据我们所知,这些技术以前从未被回顾过,尽管它们已经成为当代许多研究的主要焦点。对于每一个框架或模型,我们都会讨论其基本结构、学习目标、应用场景、关键问题、优缺点,以便新的和有经验的研究人员都能从中受益。最后,提出了今后工作的一些重要方向。

关键词:多模态表征学习、多模态深度学习、多模态深度融合、多模态翻译、多模态对抗学习。


I. INTRODUCTION

To convey comprehensive information about objects in the world, various cognitive signals describing different aspects of the same object are recorded in different kinds of media such as text, image, video, sound, and graph. In the representation learning area, the word ‘‘modality’’ refers to a particular way or mechanism of encoding information. Hence, the different types of media listed above also correspond to modalities, and representation learning tasks involving several modalities are characterized as multimodal.

为了传达世界上关于物体的综合信息,人们在文本、图像、视频、声音、图形等不同媒体上记录了描述同一物体不同方面的各种认知信号。在表征学习领域,“情态”一词是指编码信息的特定方式或机制。因此,以上所列的不同类型的媒体也涉及模式,涉及多种模式的表征学习任务将被描述为多模态。

Since multimodal data depict an object from different viewpoints, usually complementary or supplementary in content, they are more informative than unimodal data. For example, early research on speech recognition showed that the visual modality provides information on lip motions and articulations of the mouth, including opening and closing, and thus can help to improve speech recognition performance. Therefore, it is valuable to exploit the comprehensive semantics provided by several modalities.

由于多模态数据从不同的角度描述对象,通常是内容上的补充或补充,因此它们比单模态数据更具信息性。例如,早期的语音识别研究表明,视觉模态提供了嘴唇运动和口的张开和闭合等发音信息,有助于提高语音识别性能。因此,利用多种模式提供的综合语义是有价值的。

However, although it is easy for human beings to perceive the world through comprehensive information from multiple sensory organs [3], how to endow machines with analogous cognitive capabilities is still an open question. One of the challenges we are confronted with is the heterogeneity gap in multimodal data. As Fig. 1 shows, since the feature vectors from different modalities are originally located in different subspaces, the vector representations associated with similar semantics can be completely different. Here, this phenomenon is referred to as the heterogeneity gap, which hinders multimodal data from being comprehensively utilized by subsequent machine learning modules [4]. A popular method for addressing this problem is projecting the heterogeneous features into a common subspace, where the multimodal data with similar semantics will be represented by similar vectors [5]. Thus, the primary objective of multimodal representation learning is narrowing the distribution gap in a joint semantic subspace while keeping modality-specific semantics intact.

然而,尽管人类很容易通过来自多个感官的综合信息来感知世界[3],如何赋予机器类似的认知能力仍然是一个悬而未决的问题。我们面临的挑战之一是多模态数据的异质性缺口。如图1所示,由于来自不同模式的特征向量最初位于不等子空间中,因此与类似语义相关联的向量表示将完全不同。在这里,这种现象被称为异质差距,这将阻碍多模态数据被随后的机器学习模块综合利用[4]。解决这一问题的一种常用方法是将异构特征投影到一个公共子空间中,其中具有相似语义的多模态数据将由相似向量表示[5]。因此,多模态表示学习的主要目标是缩小联合语义子空间中的分布差距,同时保持特定模态语义的完整性。

FIGURE 1. Schematic of the common subspace learning (adapted from [5]), which aims to project the heterogeneous data of different modalities into a common subspace, where the multimodal data with similar semantics will be represented by similar vectors.(图1。公共子空间学习(改编自[5])的原理图,其目的是将不同模式的异构数据投影到公共子空间中,其中具有相似语义的多模态数据将由相似向量表示。)

To narrow the heterogeneity gap, numerous studies following various approaches have been conducted in the past decades. As a result, the advancement of multimodal representation learning has benefited plenty of applications. For example, by utilizing fused features from multiple modalities, improved performance can be achieved in cross-media analysis tasks such as video classification [6], event detection [7], [8], and sentiment analysis [9], [10]. Further, by exploiting cross-modal similarity or cross-modal correlation, it becomes possible to retrieve images using a sentence as input or vice versa, a task known as cross-modal retrieval [11]. Most recently, a novel type of multimodal application, cross-modal translation [12], has drawn great attention in the computer vision community. As the name suggests, it strives to translate one modality into another. Exemplary applications within this category include image captioning [13], video description [14], and text-to-image synthesis [15].

为了缩小异质性差距,近几十年来,国内外学者进行了大量的研究。因此,多模态表示学习的发展为多模态表示学习的应用提供了有益的帮助。例如,通过利用来自多模态的融合特征,可以在跨媒体分析任务中获得改进的性能,例如视频分类[6]、事件检测[7]、[8]和情感分析[9]、[10]。此外,通过利用跨模态相似度或跨模态相关度,可以使用句子作为输入来检索图像,反之亦然,这是一项称为跨模态检索的任务[11]。最近,一种新型的多模态应用,跨模态翻译,引起了计算机视觉界的广泛关注。顾名思义,它努力将一种情态转换成另一种情态。这一类中的示例性应用包括图像标题[13]、视频描述[14]和文本到图像合成[15]。

In recent years, due to the powerful representation ability with multiple levels of abstraction, deep learning has demonstrated outstanding results in various applications involving computer vision, natural language processing, and speech recognition [16]. Additionally, another key advantage of deep learning is that a hierarchical representation can be learned directly using a general-purpose learning procedure, without requiring a design or selection process of handcrafted features. Motivated by this success, deep multimodal representation learning, which is a natural extension of its unimodal version, has recently attracted tremendous research attention.

近年来,由于强大的表示能力和抽象的多个层次,深度学习已在涉及计算机视觉,自然语言处理和语音识别的各种应用中显示了出色的成果[16]。此外,深度学习的另一个关键优势是可以使用通用学习过程直接学习层次表示,而无需手工设计或选择功能。受此成功的推动,深度多模式表示学习是单模式版本的自然扩展,最近引起了极大的研究关注。

The goal of this article is to provide a comprehensive survey on deep multimodal representation learning and suggest future directions in this active field. Generally, machine learning tasks based on multimodal data include three necessary steps: modality-specific feature extraction; multimodal representation learning, which aims to integrate diverse features from different modalities in a common subspace; and a reasoning step such as classification or clustering. This paper mainly focuses on the second step, multimodal representation learning in deep learning scenarios, and will also make a brief reference to the other two steps without going into the details.

本文的目的是对深度多模式表示学习进行全面的调查,并为该活跃领域提出未来的方向。通常,基于多模式数据的机器学习任务包括三个必要步骤:特定于模式的特征提取,多模式表示学习旨在将来自不同模态的各种功能集成到一个公共子空间中,并进行推理步骤,例如分类或聚类。本文主要着眼于第二步,多模式表示学习和深度学习场景,还将简要介绍其他两个步骤,但不赘述。

The focus of this paper is the key issues of how to narrow the heterogeneity gap while keeping modality-specific semantics intact in different multimodal application scenes. To facilitate the discussion, according to the underlying structures in which different modalities are integrated, shown in Fig. 2, we categorize these methods into three types of frameworks: joint representation, coordinated representation, and encoder-decoder. Each framework has its distinct architecture and approach to integrating multimodal features. Additionally, we review some typical models, including probabilistic graphical models (PGM), multimodal autoencoders, deep canonical correlation analysis (DCCA), generative adversarial networks (GAN), and the attention mechanism, which have either proven to be effective or shown promising results.

本文的重点是如何缩小异构性差距,同时在不同的多模态应用场景中保持模态特定的语义完整的关键问题。为了便于讨论,根据集成了不同模式的底层结构,如图2所示,我们将这些方法分为三种类型的框架:联合表示,协调表示和编码器-解码器。每个框架都有其独特的体系结构和集成多模式功能的方法。此外,我们回顾了一些典型模型,包括概率图形模型(PGM),多模式自动编码器,深度典范相关分析(DCCA),生成对抗网络(GAN)和注意力机制,这些模型已被证明是有效的或有希望的结果。

The connection between the typical models and the three frameworks can be seen in Table 1. Each of the typical models described here can be categorized into one or more of the frameworks or can be integrated with them. For each type of framework or model, we will discuss its basic structure, learning objective, application scenes, key issues, advantages, and disadvantages, such that both novice and experienced researchers will benefit from this survey. The key issues relevant to different frameworks and models will be marked in bold and summarized in Section IV (Table 3).

表1列出了典型模型与三个框架之间的联系。此处描述的每个典型模型都可以分类为一个或多个框架,或者可以与它们集成。对于每种类型的框架或模型,我们将讨论其基本结构,学习目标,应用场景,关键问题,优点和缺点,以便新颖和经验丰富的研究人员都将从该调查中受益。与不同框架和模型相关的关键问题将以粗体显示,并在第四部分(表3)中进行总结。

Most recently, several surveys [17]–[20] related to the topic of multimodal learning have been published. Compared to previous reviews, the focus of our paper is distinctive in that we seek to survey the literature from a cross-perspective of multimodal representation learning and deep learning, which has not been fully covered before. For example, the review by Zhao et al. [17] mainly focuses on conventional methods. The work by Baltrušaitis et al. [18] focuses on the challenges of multimodal machine learning; as one of the five challenges they defined, representation learning is only a small part of their concern. From the perspective of multimodal representation learning, the closest work to ours is that of Li et al. [19], which concentrates on multi-view representation learning, including shallow and deep methods, whereas ours highlights the latter, which have gained more attention in recent years. From the perspective of multimodal deep learning, the closest effort to ours is [20], which mainly reviews the models and applications relying on multimodal feature fusion (categorized as joint representation in ours). Compared to [20], more types of integration frameworks and models are discussed in this paper.

最近,已经发表了与多模式学习主题相关的几项调查[17] – [20]。与以前的评论相比,本文的重点是与众不同的,因为我们试图从多模态表示学习和深度学习的交叉视角来调查文献,而这些观点从来没有完全集中。例如,Zhao等人提出的评论。 [17]主要关注常规方法。 Baltrušaitis等人提出的工作。 [18]关注多模式机器学习的挑战,作为他们定义的五个挑战之一,表示学习只是他们关注的一小部分。从多模式表示学习的角度来看,与我们最接近的工作是李等人的工作。 [19]专注于多视图表示学习,包括浅层和深层方法,而与此相反,我们的方法则强调后者,近年来受到了越来越多的关注。从多模式深度学习的角度来看,与我们最接近的工作是[20],它主要回顾了依赖于多模式特征融合(在我们的分类为联合表示)的模型和应用程序。与[20]相比,本文将讨论更多类型的集成框架和模型

[Table 1. The connection between the typical models and the three frameworks (table image in the original post).]

In contrast to previous surveys, another difference of ours is that we highlight the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks (GAN), and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have never been reviewed previously, even though they have become the major focus of much contemporary research. For example, the encoder-decoder models were previously introduced mainly as one of the implementation approaches for the cross-modal translation task, while in this paper, for the first time, they are discussed further from the representation learning perspective.

与以前的调查相比,我们的另一个不同之处在于,我们从多模式表示学习的角度强调了新开发技术的关键问题,例如编码器-解码器模型,生成对抗网络(GAN)和注意力机制。最好的知识,尽管它们已经成为许多当代研究的主要焦点,但以前从未进行过评论。例如,以前,编码器-解码器模型主要是作为跨模态翻译任务的一种实现方式引入的,而本文则首次从表示学习的角度对其进行了讨论。

The rest of this paper is organized as follows: In Section II, we discuss the key issues on how to narrow the heterogeneity gap in three types of frameworks. In Section III, we review the typical models listed in Table 1. In Section IV, we finish with a conclusion and suggest some future directions in this active field.

本文的其余部分安排如下:在第二部分中,我们讨论了如何缩小三种框架中的异质性差距的关键问题。在第三节中,我们回顾了表1中列出的典型模型。在第四节中,我们得出了结论,并提出了该活跃领域的一些未来方向。


II. DEEP MULTIMODAL REPRESENTATION LEARNING FRAMEWORKS

深度多模态表示学习框架

To facilitate the discussion on how to narrow the heterogeneity gap, and inspired by the definitions in [18], according to the underlying structures illustrated in Fig. 2, we categorize deep multimodal representation methods into three types of frameworks: (i) joint representation, which aims to project unimodal representations together into a shared semantic subspace such that the multimodal features can be fused; (ii) coordinated representation, including cross-modal similarity models and canonical correlation analysis, which seeks to learn separated but constrained representations for each modality in a coordinated subspace; (iii) encoder-decoder models, which endeavor to learn an intermediate representation used for mapping one modality into another. Each framework has its own way of integrating several modalities and is shared by certain applications. To give a general impression of their possible applications, Table 2 summarizes the typical applications and the relevant modalities involved in these frameworks.

为了便于讨论如何缩小异质性差距,并受[18]中定义的启发,根据图2所示的底层结构,我们将深度多模态表示方法分为三类框架:(i)联合表示,其目的是将单一模态表示一起投影到共享的语义子空间,使多模态特征能够融合;(ii)协调表示,包括跨模态相似模型和典型相关分析,其寻求学习协调子空间中每个模态的分离但受约束的表示;(iii)编码器-解码器模型,它努力学习一种用于将一种情态映射到另一种情态的中间表示。每个框架都有其集成多种模式的方式,并由一些应用程序共享。为了对其可能的应用有一个总体印象,在表2中,对这些框架中涉及的典型应用和相关模式进行了总结。
[Table 2. Typical applications and the relevant modalities involved in the three frameworks (table image in the original post).]

As Fig. 2 shows, before multimodal representation learning can be applied, modality-specific features should be extracted via appropriate methods. Thus, in this section, we will first introduce unimodal representation methods, which may significantly impact the performance, and then start our discussion of the three types of frameworks.

如图2所示,在应用多模态表示学习之前,应通过适当的方法提取模态特性。因此,在本节中,我们将首先介绍可能会显著影响性能的单模态表示方法,然后开始讨论三种类型的框架。


A. MODALITY-SPECIFIC REPRESENTATIONS

模态特定表示

Although a variety of different multimodal representation learning models may share similar architectures, the essential components used for extracting modality-specific features could be quite different from each other. Here, we will introduce some of the most popular components appropriate for different modalities, without going into technical details.

尽管各种不同的多模态表示学习模型可能共享相似的体系结构,但用于提取特定于模态特征的基本组件可能彼此大不相同。在这里,我们将介绍一些最流行的组件,适合不同的模式,而不涉及技术细节。

The most popular models used for image feature learning are convolutional neural networks (CNN) such as LeNet [45], AlexNet [46], GoogleNet [47], VGGNet [48], and ResNet [49]. They can be integrated into multimodal learning models and trained together with other components. However, considering the requirement for sufficient training data and computation resources, the pre-trained version of CNN may be a better choice for multimodal representation learning.

用于图像特征学习的最流行模型是卷积神经网络(CNN),如LeNet[45]、AlexNet[46]、GoogleNet[47]、VGGNet[48]和ResNet[49]。它们可以集成到多模式学习模型中,并与其他组件一起进行训练。然而,考虑到需要足够的训练数据和计算资源,CNN的预训练版本可能是多模态表示学习的较好选择。
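As a concrete illustration of the pre-trained option, the sketch below strips the classification head from a pre-trained ResNet and uses the remaining layers as a frozen image encoder. The choice of ResNet-50, the input size, and the 2048-dimensional output are assumptions for the example, not requirements of any particular multimodal model.

```python
import torch
import torchvision.models as models

# A minimal sketch: use a pre-trained CNN as a frozen image feature extractor.
# (Depending on the torchvision version, weights can also be selected via `pretrained=True`.)
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()        # drop the ImageNet classification head
resnet.eval()                          # frozen: no fine-tuning in this sketch

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # a dummy batch of RGB images
    features = resnet(images)              # one 2048-d feature vector per image
```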

FIGURE 2. Three types of frameworks for deep multimodal representation. (a) Joint representation aims to learn a shared semantic subspace. (b) The coordinated representation framework learns separated but coordinated representations for each modality under some constraints. (c) The encoder-decoder framework translates one modality into another and keeps their semantics consistent.

The fundamental tasks for natural language processing involve representing words and encoding sentences. A popular way to represent words is word embedding, such as word2vec [50] or GloVe [51], which maps words into a distributional vector space where the similarity between words can be measured. In NLP tasks, a common issue that should be considered is the unknown word problem, also known as out-of-vocabulary (OOV) words, which can potentially affect the performance of many systems. To deal with the unknown word issue, character embeddings [52], [53] are a viable option for representing language inputs. For example, Kim et al. [52] trained a convolutional neural network to yield word representations based on character-level embeddings. Bojanowski et al. [53] proposed to learn the vector representations of character n-grams; then, by treating each word as a bag of character n-grams, the embedding of a word can be obtained as the sum of these vector representations. Experiments [54], [55] showed that handling the OOV issue properly improves the performance of NLP systems considerably.

神经语言处理的基础工作包括表示单词和编码句子。代表单词的流行方法是单词嵌入,例如word2vec [50]或Glove [51],它将单词映射到分布矢量空间,可以在其中测量单词之间的相似性。在NLP任务中,应考虑的一个常见问题是未知单词问题,也称为词汇不足(OOV)单词,它可能会影响许多系统的性能。为了处理未知的单词问题,字符嵌入[52],[53]是表示语言输入的可行选择。例如,Kim等。 [52]训练了卷积神经网络,以基于字符级嵌入产生单词表示。 Bojanowski等。 [53]提出学习字符n-gram的向量表示,然后,通过将每个单词视为字符n-gram的袋子,可以通过这些向量表示的总和获得单词的嵌入。实验[54],[55]表明,正确处理OOV问题将大大改善NLP系统的性能。
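The bag-of-character-n-grams idea from Bojanowski et al. [53] can be sketched in a few lines. The n-gram range, the hashed bucket table, and the random vectors below are illustrative assumptions, not the trained fastText embeddings.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, padded with boundary symbols '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# Illustrative embedding table: hash each n-gram into a fixed number of buckets.
DIM, BUCKETS = 100, 100_000
rng = np.random.default_rng(0)
ngram_table = rng.normal(scale=0.1, size=(BUCKETS, DIM))

def word_vector(word):
    """Word embedding = sum of its character n-gram vectors (also works for OOV words)."""
    return sum(ngram_table[hash(g) % BUCKETS] for g in char_ngrams(word))

v = word_vector("multimodal")   # defined even if "multimodal" was never seen during training
```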

Recurrent neural networks (RNN) [56] are a powerful tool for dealing with variable-length sequences such as sentences, videos, and audio. Since the activation of the current hidden state at time t depends on that of all the previous time steps, it can be seen as a summarization of the sequence up to step t. However, it is difficult for vanilla RNNs to capture long-term dependencies because of the gradient vanishing problem [57]. In practice, a better choice is long short-term memory (LSTM) [58], [59] networks or gated recurrent unit (GRU) [60] networks, which perform better at capturing long-term dependencies [61], [62]. Further, bidirectional recurrent neural networks (BRNN) [63] and the bidirectional versions of LSTM [64] or GRU [65] are also widely used for capturing semantics. In addition to RNNs, CNNs are another widely used model for extracting salient n-gram features from sentences. Experiments showed that CNN-based models perform remarkably well in sentence-level classification [66] and sentiment analysis tasks [67].

递归神经网络(RNN)[56]是处理各种长度序列(例如句子,视频和音频)的强大工具。由于在时间t的当前隐藏状态的激活取决于所有先前时间步的激活,因此可以将其视为直至步骤t的序列的汇总。但是,由于梯度消失问题,香草RNN很难捕获长期依赖关系[57]。实际上,更好的选择是长短期记忆(LSTM)[58],[59]网络或门控循环单元(GRU)[60]网络,它们在捕获长期依存关系方面具有更好的性能[61], [62]。此外,双向递归神经网络(BRNN)[63]和LSTM [64]或GRU [65]的双向版本也被广泛用于捕获语义。除了RNN,CNN是从句子中提取显着n-gram特征的另一种广泛使用的模型。实验表明,基于CNN的模型在句子级分类[66]和情感分析任务[67]中表现出色。

As for the video modality, since the input at each time step is an image, its features can be extracted via the techniques used for handling images. In addition to deep features, handcrafted features are still widely used in the video and audio modalities [10], [68]. Further, some toolkits have been developed to extract handcrafted features. For example, OpenFace [69] can be used to extract facial features such as facial landmarks, head pose, and eye gaze. Another tool is openSMILE [70], which can be used to extract acoustic features including Mel-frequency cepstral coefficients (MFCC), voice intensity, pitch, and their statistics. After the frames of videos and audio have been encoded, the CNN or RNN networks mentioned above can be used to summarize the sequences into individual vector representations.

对于视频模态,由于每个时间步的输入都是图像,因此可以通过用于处理图像的技术来提取其特征。除深层功能外,手工功能仍广泛用于视频和音频形式[10],[68]。此外,已经开发了一些工具包来提取手工特征。例如,OpenFace [69]可用于提取面部特征,例如面部标志,头部姿势和视线。另一个工具是Opensmile [70],可用于提取声学特征,包括梅尔频率倒谱系数(MFCC),语音强度,音调及其统计数据。在视频和音频的帧已编码后,可以使用上述的CNN或RNN网络将序列汇总为单独的矢量表示形式。


B. JOINT REPRESENTATION

The strategy of integrating different types of features to improve the performance of machine learning methods has long been used by researchers. A natural extension of this strategy in a multimodal setting is the utilization of fused heterogeneous features. Following this strategy, promising results have been shown in many multimodal classification or clustering tasks, such as video classification [6], [21], event detection [7], [8], sentiment analysis [9], [10], and visual question answering [23].

集成不同类型特征以提高机器学习方法性能的策略一直被研究者所采用。这种策略在多模式环境中的自然扩展是利用融合的异构特征。根据该策略,在视频分类[6]、[21]、事件检测[7]、[8]、情感分析[9]、[10]和视觉问答[23]等多模式分类或聚类任务中都显示出了良好的结果。

To bridge the heterogeneity gap of different modalities, joint representation aims to project unimodal representations into a shared semantic subspace, where the multimodal features can be fused [18]. As Fig. 2(a) showed, after each modality is encoded via an individual neural network, both of them will be mapped into a shared subspace, where the conceptions shared by modalities will be extracted and fused into a single vector.

为了弥补不同模式之间的异质性差异,联合表示的目的是将单模式表示投影到一个共享语义子空间中,在该子空间中可以融合多模式特征[18]。如图2(a)所示,在通过单个神经网络对每个模态进行编码之后,它们将被映射到共享子空间,在该子空间中,模态共享的概念将被提取并融合到单个向量中。

The simplest way to fuse multimodal features is to concatenate them directly. More commonly, however, this subspace is implemented by a distinct hidden layer, in which the transformed modality-specific vectors are added, and thus the semantics from different modalities are combined. This property can be seen from (1), where z is the activation of the output nodes in the shared layer, v is the output of a modality-specific encoding network, w denotes the weights connecting a modality-specific encoding layer to the shared layer, and the subscript index denotes different modalities.

融合多模态特征的最简单方法是直接将它们连接起来。然而,该子空间大多是通过一个不同的隐藏层来实现的,在这个隐藏层中,将添加转换后的模态特定向量,从而组合来自不同模态的语义。从(1)可以看出,其中z是共享层中输出节点的激活,v是特定于模态的编码网络的输出,w是特定于模态的编码层与共享层之间的连接权重,下标索引表示不同的模态。

z = f(w_1^T v_1 + w_2^T v_2 + ... + w_n^T v_n)    (1)
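A minimal sketch of the additive fusion in (1), assuming two modality-specific encoders that output v1 and v2 and a shared layer of width 512; the dimensions and the ReLU activation are illustrative choices rather than prescriptions from the surveyed models.

```python
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    """Shared hidden layer computing z = f(w_1 v_1 + w_2 v_2), as in (1)."""
    def __init__(self, dim_v1, dim_v2, dim_shared):
        super().__init__()
        self.w1 = nn.Linear(dim_v1, dim_shared, bias=False)
        self.w2 = nn.Linear(dim_v2, dim_shared, bias=False)

    def forward(self, v1, v2):
        # the transformed modality-specific vectors are added, then passed through f
        return torch.relu(self.w1(v1) + self.w2(v2))

fusion = AdditiveFusion(dim_v1=2048, dim_v2=300, dim_shared=512)
z = fusion(torch.randn(4, 2048), torch.randn(4, 300))   # z has shape (4, 512)
```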

Other than the fusion process in a distinct hidden layer, usually called an additive approach, a multiplicative method is also adopted in some literature. In a sentiment analysis task, Zadeh et al. [10] proposed to fuse the language, video, and audio modalities in a tensor, which is constructed from the outer product of all the modality-specific feature vectors. In this way, the authors intend to exploit both intra-modality and inter-modality dynamics. The definition of the fused tensor can be formulated as follows:

除了在一个明显的隐藏层中的融合过程(通常称为加法方法)外,在一些文献中也采用乘法方法。在情绪分析任务中,Zadeh等人。[10] 提出将语言、视频和音频模式融合成张量,张量由所有特定于模式的特征向量的乘积构成。通过这种方式,作者试图利用模态内或模态间的动力学。融合张量的定义可表述如下:
z_m = z_l ⊗ z_v ⊗ z_a    (2)

where z_m denotes the fused tensor, z_l, z_v, and z_a denote the feature vectors of the different modalities respectively, and ⊗ indicates the outer product operator. However, since the outer product is computationally expensive, Fukui et al. [23] alternatively proposed to utilize Multimodal Compact Bilinear pooling (MCB) to fuse the language and image modalities in a more efficient way. As formulated in (3), given vectors x and q, the proposed method seeks to reduce the dimension of the outer product x ⊗ q using the Count Sketch projection function Ψ. In particular, the count sketch of the outer product can be decomposed into a convolution of separate count sketches [71], which means that the explicit computation of the outer product can be avoided. Further, the authors use the Fast Fourier Transform (FFT) to accelerate the computation of this convolution.

其中zm表示融合张量,zl,zv,za分别表示不同的模态,而⊗表示外积算子。但是,由于外部产品的成本昂贵,因此,Fukuiet等人的方法更为有效。 [23]或者提议利用多模式紧凑双线性池(MCB)融合语言和图像模态。在给定向量x和q的情况下,公式(3)表示为:通过Count Sketch投影函数9来减小外部乘积x⊗q的尺寸。特别是,可以将外部乘积的count草图分解为的卷积。分开的计数草图[71],这意味着可以避免计算外部乘积。此外,作者使用快速傅立叶变换(FFT)来加快计算速度。
Ψ(x ⊗ q) = Ψ(x) ∗ Ψ(q)    (3)
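To make the two fusion schemes concrete, the sketch below first forms the full outer-product tensor of (2) with an Einstein summation, and then approximates a pairwise outer product with the count-sketch decomposition of (3), where the sketch of the outer product is recovered as a circular convolution computed via the FFT. The vector sizes, the output dimension d, and the fixed random seed are assumptions of this example.

```python
import numpy as np

# --- Eq. (2): explicit outer-product (tensor) fusion of three modalities ---
z_l, z_v, z_a = np.random.randn(32), np.random.randn(16), np.random.randn(8)
z_m = np.einsum('i,j,k->ijk', z_l, z_v, z_a)        # fused tensor, shape (32, 16, 8)
z_m_flat = z_m.reshape(-1)                          # flattened for a downstream classifier

# --- Eq. (3): Multimodal Compact Bilinear pooling for a pair of modalities ---
def count_sketch(x, h, s, d):
    """Psi(x): random sign/index projection of x into d dimensions."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)                          # accumulate, handling repeated indices
    return y

def mcb_pool(x, q, d=16000, seed=0):
    """Approximate x ⊗ q via Psi(x ⊗ q) = Psi(x) * Psi(q), a circular convolution done by FFT."""
    rng = np.random.default_rng(seed)
    hx = rng.integers(0, d, size=x.size); sx = rng.choice([-1.0, 1.0], size=x.size)
    hq = rng.integers(0, d, size=q.size); sq = rng.choice([-1.0, 1.0], size=q.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    return np.fft.irfft(fx * fq, n=d)

pooled = mcb_pool(np.random.randn(2048), np.random.randn(300))   # 16000-d fused feature
```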

Although the model shown in Fig. 2(a) is designed for the setting in which parallel data are available during the training and inference steps, the ability to deal with partially missing data in some modalities is also desired, such that more training data can be exploited, or the performance of downstream tasks is only slightly influenced when data are missing from one or more modalities. To this end, a widely used method is training the model with data that include only some of the modalities, excluding a different modality in different training epochs [1], [72].

尽管图2(a)所示的模型是为在训练和推理步骤中可以使用并行数据的环境而设计的,但也希望能够以某些方式处理部分数据丢失的问题,以便可以提供更多的训练数据如果数据从一种或多种方式中丢失,则对下游任务的利用或对下游任务的性能的影响只会受到很小的影响。为此,一种广泛使用的方法是通过仅包括某些模态的数据来训练模型,但不包括不同训练时期的模态[1],[72]。

Interestingly, the training trick used for tackling missing data is also helpful for obtaining the modality-invariant property, which means that the difference in the statistical distributions between modalities is minimized, or, in other words, that the feature vectors contain minimal modality-specific characteristics. The work of Aytar et al. [73] shows that, constrained by a statistical regularization which encourages activations in the intermediate hidden layers to have similar statistical distributions across modalities, the modality-invariant property can be strengthened. Their model encourages different modalities to be aligned with each other automatically in the representation layer, even when the training data are unaligned.

有趣的是,用于处理数据丢失的训练技巧也有助于获得模态不变的特性,这意味着模态之间统计分布的差异被最小化,换句话说,特征向量包含最小的模态特定特性。 Aytar等人提出的工作。 [73]表明,受统计正则化的约束,该统计正则化鼓励中间隐藏层中的激活具有跨模态的相似统计分布,模态不变性可以得到增强。他们的模型鼓励即使在训练数据未对齐的情况下,不同的模态也可以在表示层中自动对齐。
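One simple way to encourage such cross-modal statistical alignment is to penalize differences between the activation statistics of the two modality-specific hidden layers. The first- and second-moment matching below is only an illustrative instantiation of this idea, not the exact regularizer used in [73].

```python
import torch

def statistics_alignment(h_a, h_b):
    """Penalize mismatched activation statistics between two modality-specific layers (a sketch)."""
    mean_gap = (h_a.mean(dim=0) - h_b.mean(dim=0)).pow(2).sum()
    var_gap = (h_a.var(dim=0) - h_b.var(dim=0)).pow(2).sum()
    return mean_gap + var_gap

# usage: total_loss = task_loss + lambda_reg * statistics_alignment(hidden_audio, hidden_video)
```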

To be more expressive, the learned vector is expected to fuse complementary semantics from different modalities. This complementary property cannot be guaranteed automatically, since joint representation tends to preserve shared semantics across modalities while ignoring modality-specific information. A solution is adding extra regularization terms to the optimization objectives [74]. For example, the reconstruction loss used in multimodal autoencoders [1] can be considered as a regularization term that plays a role in preserving modality independence. Another example is the approach proposed by Jiang et al. [21], which imposes a trace norm regularization over the network weights to reveal the hidden correlations and diversity of the multimodal features. Intuitively, if a pair of features are highly correlated, the weights used for fusing them should be similar, such that their contributions to the fused representation will be roughly equal. Thus, the goal of trace norm regularization is to discover the relationships between modalities and adjust the weights of the fusion layer accordingly. Their experiments in video classification tasks showed that this regularization term is helpful for improving performance.

为了更富表现力,学习的向量有望融合不同形式的互补语义。互补的属性无法自动得到保证,因为联合表示往往会在跨模态的情况下保留共享的语义,而忽略了特定于模态的信息。解决方案为优化目标增加了额外的正则化项[74]。例如,多模态自动编码器[1]中使用的重构损失可被视为正则化项扮演维护模态独立性的角色。另一个例子是Jiang等人提出的方法。 [21],这对网络权重施加了跟踪范数正则化,以揭示多模式特征的隐藏的相关性和多样性。直观地,如果一对特征高度相关,则用于融合它们的权重应该相似,以使它们对融合表示的贡献大致相等。因此,跟踪规范正则化的目的是发现模态之间的关系并相应地调整融合层的权重。他们在视频分类任务中的实验表明,该正则化术语有助于提高效果。
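The trace (nuclear) norm penalty on the fusion weights can be added to the task loss in one line. Stacking the per-modality weight matrices into a single matrix, as below, is one plausible reading of the idea rather than the exact formulation of [21].

```python
import torch

def trace_norm_penalty(weight_matrices, lam=1e-3):
    """Nuclear-norm regularizer over stacked fusion weights (a sketch, not the exact loss of [21])."""
    W = torch.cat(weight_matrices, dim=1)            # stack the modality-specific weight matrices
    return lam * torch.linalg.matrix_norm(W, ord='nuc')

# usage: total_loss = task_loss + trace_norm_penalty([fusion.w1.weight, fusion.w2.weight])
```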

Compared to other frameworks, one of the advantages of joint representation is that it is convenient to fuse several modalities, since there is no need to coordinate the modalities explicitly. Another advantage is that the shared common subspace tends to be modality-invariant, which is helpful for transferring knowledge from one modality to another [1], [73]. One disadvantage of this framework, however, is that it cannot be used to infer separate representations for each modality.

与其他框架相比,联合表示的优点之一在于,由于不需要明确地协调模式,因此可以方便地融合多个模式。另一个优点是,共享的公共子空间往往是模态不变的,这有助于将知识从一种模态转移到另一种模态[1],[73]。尽管此框架的缺点之一是它不能用于推断每种模态的分离表示。


C. COORDINATED REPRESENTATION

Another type of method popular in multimodal learning is coordinated representation. As Fig. 2(b) shows, instead of learning representations in a joint subspace, the coordinated representation framework learns separated but coordinated representations for each modality under some constraints [18]. Since the information contained in different modalities is unequal, learning separated representations is beneficial for preserving the exclusive and useful modality-specific characteristics [31]. Typically, depending on the constraint types, coordinated representation methods can be categorized into two groups: cross-modal similarity based and cross-modal correlation based. Cross-modal similarity based methods aim to learn a common subspace where the distance between vectors from different modalities can be measured directly [75], while cross-modal correlation based methods aim to learn a shared subspace such that the correlation between the representation sets of different modalities is maximized [5]. In this section, we will review the former and leave the latter to Section III-C.

在多模式学习中流行的另一种方法是协调表示。如图2(b)所示,协调表示框架不是在联合子空间中学习表示,而是在某些约束下为每种模态学习分离但协调的表示[18]。由于包含在不同模态中的信息是不平等的,因此学习分离的表示形式对于坚持专有和有用的模态特定特性是有益的[31]。通常,根据约束类型的条件,可将协调表示方法分为两类,基于交叉模式相似性和基于交叉模式相关性。基于交叉模态相似性的方法旨在学习一个公共子空间,其中可以直接测量矢量与不同模态之间的距离[75],而基于交叉模态相关性的方法旨在学习一个共享子空间,以便从不同的模态被最大化[5]。在本节中,我们将审查前者,并将后者留在III-C节中。

Cross-modal similarity methods learn coordinated representations under the constraint of a similarity measurement. The learning objective of this model is to preserve both the inter-modality and intra-modality similarity structure, which expects the cross-modal distance between items associated with the same semantics or object to be as small as possible, and the distance between items with dissimilar semantics to be as large as possible.

跨模式相似性方法在相似性度量的约束下学习协调表示。该模型的学习目标是保留模态间和模态内相似性结构,该结构期望与相同语义或对象关联的跨模态相似性距离尽可能小,而期望具有不同语义的距离尽可能小。尽可能最大。

A widely used constraint is cross-modal ranking. Taking visual-text embedding as an example, ignoring the regularization terms and denoting the matched embedding vectors of visual and text as (v, t) ∈ D, the optimization objective can be expressed as the loss function in (4), where α is the margin, S is the similarity measurement function, t⁻ denotes embedding vectors unmatched to v, and v⁻ denotes embedding vectors unmatched to t. Commonly, t⁻ and v⁻ are known as negative samples, which are selected randomly from the dataset D, and (4) is known as the margin rank loss [36].

广泛使用的约束是交叉模式排名。以视觉文本嵌入为例,忽略正则项并将匹配的视觉和文本嵌入向量表示为(v,t)∈D,优化目标可以表示为(4)中的损失函数,其中α是裕度,S是相似性度量函数,t-是与v不匹配的嵌入向量,v-是与t不匹配的嵌入向量。通常,t-和v-被称为从数据集D中随机选择的负样本,而(4)被称为余量秩损失[36]。
min Σ_{(v,t)∈D} [ Σ_{t⁻} max(0, α − S(v, t) + S(v, t⁻)) + Σ_{v⁻} max(0, α − S(t, v) + S(t, v⁻)) ]    (4)
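A sketch of the bidirectional margin rank loss in (4) for a mini-batch, using cosine similarity as S and treating all non-matching pairs inside the batch as negatives. The margin value and the in-batch negative sampling are illustrative choices, not requirements of the surveyed models.

```python
import torch
import torch.nn.functional as F

def margin_rank_loss(v, t, alpha=0.2):
    """Bidirectional hinge ranking loss of (4); row i of v and row i of t form a matched pair."""
    v = F.normalize(v, dim=1)
    t = F.normalize(t, dim=1)
    S = v @ t.t()                                   # S[i, j] = cosine similarity of (v_i, t_j)
    pos = S.diag().unsqueeze(1)                     # similarities of the matched pairs
    cost_t = (alpha - pos + S).clamp(min=0)         # image as query, sentences as negatives
    cost_v = (alpha - pos.t() + S).clamp(min=0)     # sentence as query, images as negatives
    mask = torch.eye(S.size(0), dtype=torch.bool)
    return cost_t.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()

loss = margin_rank_loss(torch.randn(16, 512), torch.randn(16, 512))
```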

Based on the cross-modal ranking constraint, a variety of cross-modal applications have been developed. For example, Frome et al. [34] used a combination of dot-product similarity and margin rank loss to learn a visual-semantic embedding model (DeViSE) for visual recognition. DeViSE firstly pre-trains a pair of deep networks to map images and their correlated labels into embedding vectors v and t then leverages the cross-modal similarity model to learn a shared semantic embedding space for both modalities. Following the notations in (4), the loss function for each training sample can be defined as follows:

基于交叉模式排序约束,已经开发了多种交叉模式应用程序。例如,Frome等。 [34]结合点积相似性和边际等级损失来学习视觉语义嵌入模型(DeViSE)进行视觉识别。 DeViSE首先对一对深度网络进行预训练,以将图像及其相关标签映射到嵌入向量v和t中,然后利用交叉模式相似性模型为这两种模式学习共享的语义嵌入空间。按照(4)中的符号,可以将每个训练样本的损失函数定义如下:
loss(v, t) = Σ_{t⁻ ≠ t} max(0, α − t M v + t⁻ M v)    (5)

where M is a linear transformation matrix used for transforming v into the shared semantic embedding space, and the dot product between t and Mv is the similarity measurement used for both training and testing. Under the constraint in (5), the model is expected to produce a higher dot-product similarity between matched vectors than between unmatched ones, and it subsequently endows the image embeddings with rich semantic information transferred from the language modality. This idea is also shared by the work of Lazaridou and Baroni [35], which aims to integrate and propagate visual information into word embeddings. Their experimental results implied that the transferred visual knowledge is helpful for representing abstract concepts.

其中M是用于将v转换为共享语义嵌入空间的线性转换矩阵,t和Mv之间的点积是用于训练和测试的相似性度量。在(5)中的约束下,期望该模型在匹配向量之间产生比在不匹配向量之间更高的点积相似度,并且随后赋予嵌入有丰富语义信息的图像,该语义信息是从语言模态传递来的。 Lazaridou和Baroni [35]提出的工作也分享了这一想法,该工作旨在将视觉信息集成并传播到单词嵌入中。他们的实验结果表明,转移的视觉知识有助于表示抽象概念。

Inspired by the success of DeViSE, Kiros et al. [36] extended this model to learn a joint image-sentence embedding used for image captioning. They pre-trained a CNN network to obtain image features v and trained an LSTM network to encode its relevant sentences into t, then mapped both encodings into a coordinated embedding space where the similarity between them can be exploited by a cross-modal similarity model similar to [34]. Their model adopted the same similarity measurement used in DeViSE but employed a bi-directional rank loss formulated in (4) such that much richer cross-modal relationships can be discovered. This model is also employed in the work proposed by Socher et al. [32], which aims to map sentences and images into a common space for cross-modal retrieval. They introduced dependency trees based recursive neural network (DTRNN) to encode language modality and argued that the proposed DTRNN is robust to surface changes such as word order.

受DeViSE成功的启发,Kiros等人。 [36]扩展了该模型,以学习用于图像字幕的联合图像句子嵌入。他们预先训练了CNN网络以获取图像特征v,并训练了LSTM网络将其相关句子编码为t,然后将这两种编码映射到协调的嵌入空间中,通过类似的跨模式相似性模型可以利用它们之间的相似性至[34]。他们的模型采用了与DeViSE中相同的相似性度量,但是采用了(4)中公式化的双向秩损失,因此可以发现更丰富的跨模态关系。该模型也用于Socher等人提出的工作中。 [32],其目的是将句子和图像映射到用于跨模式检索的公共空间中。他们引入了基于依赖树的递归神经网络(DTRNN)来编码语言模态,并认为所提出的DTRNN对于表面变化(如单词顺序)具有鲁棒性。

Further, Karpathy and Fei-Fei [76] extended this framework to learn a fine-grained cross-modal alignment between words and image regions for generating region-level descriptions of images. Unfortunately, this task suffers from a lack of the necessary supervision information. Given images and their correlated sentences, the one-to-one correspondence between a word and the region it describes is not known. To address this problem, they chose to infer the alignment between segments of sentences and regions of the image in a cross-modal embedding space. The key idea is to formulate the image-sentence score as a function of the individual region-word similarities. Let v_i denote the image regions and s_t denote the words in a sentence; the score between image k and sentence l is defined as follows:

此外,Karpathy和Fei-Fei [76]扩展了该框架,以学习单词和图像区域之间的细粒度交叉模型对齐,以生成图像的区域级描述。不幸的是,该任务缺乏必要的监督信息。在给定图像及其相关句子的情况下,单词及其描述区域之间的一一对应关系是未知的。为了解决这个问题,他们选择了在跨模态嵌入空间中推断句子片段和图像区域之间的对齐方式。关键思想是根据单个区域词的相似性来制定图像句子分数。设图像代表图像区域并标出句子中的单词,图像k和句子l之间的得分定义如下:
S_kl = Σ_{t ∈ g_l} max_{i ∈ g_k} (v_i^T s_t)    (6)

where g_k is the set of fragments in image k, g_l is the set of snippets in sentence l, and each word s_t aligns to its single best image region. Additionally, assuming that k = l denotes a matched image-sentence pair, the cross-modal ranking constraint can be defined as the loss function in (7), which encourages aligned image-sentence pairs to have a higher score than misaligned pairs.

其中,gk是图像k中的片段集,gl是句子l中的片段集,每个单词都与唯一的最佳图像区域对齐。另外,假设k = 1表示匹配的图像句子对,则交叉模式排名约束可以定义为(7)中的损失函数,这鼓励对齐的图像句子对比未对齐的句子对具有更高的分数。
C(θ) = Σ_k [ Σ_l max(0, S_kl − S_kk + 1) + Σ_l max(0, S_lk − S_kk + 1) ]    (7)
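The image-sentence score of (6) and the ranking loss of (7) can be sketched as follows for pre-computed region and word embeddings. The embedding dimension and the batch layout (one sentence per image, in matching order) are assumptions of the example.

```python
import numpy as np

def image_sentence_score(regions, words):
    """Eq. (6): each word aligns to its best-matching region; sum the best dot products."""
    sims = words @ regions.T              # (num_words, num_regions)
    return sims.max(axis=1).sum()

def ranking_loss(region_sets, word_sets):
    """Eq. (7): aligned pairs (k == l) should outscore misaligned pairs by a margin of 1."""
    n = len(region_sets)
    S = np.array([[image_sentence_score(region_sets[k], word_sets[l])
                   for l in range(n)] for k in range(n)])
    loss = 0.0
    for k in range(n):
        loss += np.maximum(0, S[k, :] - S[k, k] + 1).sum() - 1   # over sentences, excluding l == k
        loss += np.maximum(0, S[:, k] - S[k, k] + 1).sum() - 1   # over images, excluding the same term
    return loss

# toy example: 2 images with 5 regions each, 2 sentences with 7 words each, 512-d embeddings
R = [np.random.randn(5, 512) for _ in range(2)]
W = [np.random.randn(7, 512) for _ in range(2)]
print(ranking_loss(R, W))
```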

The strategy of measuring image-sentence similarity based on individual region-word scores is also adopted by Peng et al. [31], who aim to preserve the modality-specific characteristics by utilizing the fine-grained information within each modality during cross-modal correlation learning. The authors argued that different modalities have imbalanced and complementary relationships; thus, instead of measuring the similarity in a common space, they construct an independent semantic space for each modality and measure the cross-modal similarity in both spaces simultaneously. After that, the modality-specific similarity scores are combined into a final measurement used for cross-modal retrieval.

Peng等人也采用了基于单个区域词得分来衡量图像句子相似度的策略。 [31],他们的目标是在跨模态相关性学习期间通过利用每个模态内的细粒度信息来保留模态特定的特征。作者认为,不同的情态具有不平衡和互补的关系,因此,他们没有在一个公共空间中测量相似性,而是为每个情势构建了一个独立的语义空间,并同时测量了这两个空间中的跨模式相似性。在那之后,特定于模态的相似性分数将被组合成用于跨模态检索的最终度量。

In addition to cross-modal ranking, another widely used constraint is the Euclidean distance. The mainstream approach in this category is to minimize the distance between paired samples [33], [77], [78]. An example is the model proposed by Pan et al. [33], which aims to learn a visual-semantic embedding used for generating video descriptions. The model projects both visual and language representations into a low-dimensional embedding space, where the distances between paired samples are minimized such that the semantics of the visual embeddings will be consistent with their relevant sentences. This constraint can be expressed as a loss term:

除了交叉模式排序外,另一个广泛使用的约束是欧几里得距离。该类别中的主流方法是最小化配对样本的距离[33],[77],[78]。 Pan等人提出的模型就是一个例子。 [33],旨在学习用于生成视频描述的视觉语义嵌入。该模型将视觉和语言表示都投影到一个低维的嵌入空间中,在该空间中配对样本之间的距离被最小化,从而使视觉嵌入的语义与其相关句子保持一致。此约束可以表示为损失项:
E(V, S) = Σ_{(v,s)∈D} ||T_v v − T_s s||_2^2    (8)

where T_v and T_s are the transformation matrices for video v and sentence s, and (v, s) are paired samples from dataset D.

其中Tv和Ts是视频v和句子s的变换矩阵,v和s是来自数据集D的配对样本。

Another example is the model for cross-modal matching proposed by Liong et al. [78], which aims to reduce the modality gap of paired data by minimizing the difference between the hidden representations over all layers. Supposing that the visual modality v and the text modality t are encoded via homogeneous feed-forward neural networks, the loss can be formulated as follows:

另一个例子是Liong等人提出的交叉模式匹配模型。 [78],旨在通过最小化所有层上隐藏表示的差异来减小配对数据的模态差距。假设视觉模态v和文本模态t是通过同质前馈神经网络编码的,则损失可以表示为
Σ_i Σ_l ||h_l^{v,i} − h_l^{t,i}||_2^2    (9)

where l indicates a layer of the two modality-specific networks, i indicates a pair of training instances, and h denotes the hidden representations. Further, the authors also imposed a large-margin criterion on the distance between unpaired data, which aims to minimize the intra-class distance and maximize the inter-class distance, such that more discriminative information can be exploited. This criterion is defined as follows:

其中,l表示两个模态专用网络的一层,i表示训练数据的一对实例,h表示隐藏的表示。此外,作者还对未配对数据的距离施加了较大的余量标准,其目的是最小化类别内距离并最大化类别间距离,以便可以利用更多判别信息。此标准定义如下:
d(t_i, v_j) ≤ θ_1  if l_{t_i,v_j} = 1,    d(t_i, v_j) ≥ θ_2  if l_{t_i,v_j} = −1    (10)

where t_i denotes sentence i, v_j denotes image j, and θ_1, θ_2 are the small and large thresholds, respectively. The condition l_{t_i,v_j} = 1 means that t_i and v_j belong to the same class; otherwise, they belong to different classes.

其中ti表示句子i,vj表示图像j,θ1,θ2分别是小阈值和大阈值。条件lti,vj = 1表示ti和vj属于同一类别,否则属于不同类别。

In addition to learning an inter-modality similarity measurement, another key issue of cross-modal applications is to preserve the intra-modality similarity structure. A widely used strategy is classifying the categories of the learned features such that they are also discriminative within each modality [30], [79]. Another method is to keep the neighborhood structure within each view; the constraint in (10) is one of the implementations in this group. A further example is the work of Wang et al. [80], who proposed to learn image-text embeddings via a coordinated representation model which combines cross-view ranking constraints with within-view neighborhood structure preservation constraints in the loss function. Let N(v_i) denote the neighborhood of image v_i and N(t_i) denote the neighborhood of sentence t_i; the within-view neighborhood structure preservation constraints can be formulated as follows:

除了学习模态间的相似性度量外,跨模态应用的另一个关键问题是保留模态内的相似性结构。一种广泛使用的策略是对学习特征的类别进行分类,以使它们在每个模态中也具有区别性[30],[79]。此外,另一种方法是将邻域结构保留在每个视图内。 (10)中的约束是该组中的一种实现。另一个例子是Wang等人的工作。 [80],它提出了通过协调表示模型学习图像文本嵌入,该模型结合了跨视图排序约束与视图内邻域结构保存约束的损失函数。令N(vi)表示图像邻居的邻域N(ti)表示句子ti的邻域,视图内邻域结构保存约束可以表述为:
d(v_i, v_j) + m < d(v_i, v_k),  ∀ v_j ∈ N(v_i), v_k ∉ N(v_i)
d(t_i, t_j) + m < d(t_i, t_k),  ∀ t_j ∈ N(t_i), t_k ∉ N(t_i)    (11)
where m denotes a margin and d(·,·) is the distance in the embedding space.

In addition to applications characterized as finding one modality from another, such as cross-modal retrieval [75], [77], [80] and retrieval-based visual description [32], another type of application of coordinated representation is transferring knowledge across modalities, which may enhance the semantic description capability of the embeddings in the target modality. The basic idea is to minimize the cross-modal distance between paired multimodal data in a common subspace during training, such that the embeddings can capture their shared semantics, which means that the knowledge has been transferred. Several pieces of literature mentioned above [33]–[36] can be considered as representative examples of this idea. Furthermore, coordinated representation can also be used for cross-domain transfer learning, which would partially reduce the need for labeled data. For example, in order to transfer knowledge from a large-scale cross-media dataset to a small-scale one, the works of Huang et al. [37], [38] proposed to train a pair of networks, one for each domain, and coordinate them by minimizing the maximum mean discrepancy (MMD) [81].

除了具有从另一种模式中找到一种模态的应用程序(例如跨模态检索[75],[77],[80]和基于检索的视觉描述[32])之外,协调表示的另一种应用是交叉知识传递模态,这可能会增强目标模态中嵌入的语义描述能力。基本思想是在训练过程中最小化公共子空间中成对的多峰数据的跨峰相似性,以便嵌入可以捕获它们的共享语义,这意味着知识已被转移。上面提到的几篇文献[33]-[36]可以被认为是该思想的代表。此外,协调表示也可以用于跨域转移学习,这将部分减少对标记数据的需求。例如,为了将知识从大规模的跨媒体数据集转移到小规模的数据集,Huang et al。的著作。 [37],[38]提出了训练一对网络,每个网络用于一个域,并通过最小化最大平均差异(MMD)来协调它们[81]。
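A sketch of a (biased) MMD estimate with a Gaussian kernel, which can serve as such a coordination term between the two domain networks. The kernel bandwidth and the biased estimator are simplifying assumptions for illustration.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    """Biased estimate of MMD^2 between two sample sets x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

# usage: total_loss = task_loss + lam * mmd(features_source_domain, features_target_domain)
```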

Compared to other frameworks, coordinated representation tends to preserve the exclusive and useful modality-specific characteristics within each modality [31]. Since different modalities are encoded in separate networks, one of the advantages is that each modality can be inferred individually. This property is also beneficial for cross-modal transfer learning, which aims to transfer knowledge across different modalities or domains. A disadvantage of this framework is that, in most cases, it is hard to learn representations with more than two modalities.

与其他框架相比,协调表示倾向于在每个模态中坚持独有且有用的模态特定特征[31]。由于不同的模态是在分离的网络中编码的,因此优点之一是可以分别推断每个模态。此属性对于跨模式传输学习也很有帮助,该学习模式可以跨不同的模态或域来传递知识。该框架的缺点是,大多数情况下,很难学习具有两种以上模式的表示形式。


D. ENCODER-DECODER

Recently, the encoder-decoder framework has been widely used for multimodal translation tasks which map one modality into another, such as image captioning [13], [39], video description [14], [41], and image synthesis [15], [82]. Typically, as shown in Fig. 2(c), the encoder-decoder framework is mainly composed of two components: an encoder and a decoder. The encoder maps the source modality into a latent vector v, and then, based on the vector v, the decoder generates a novel sample of the target modality.

最近,编码器-解码器框架已广泛用于将一种模式映射到另一种模式的多模式翻译任务,例如图像标题[13],[39],视频描述[14],[41]和图像合成[15], [82]。通常,如图2(c)所示,编码器-解码器框架主要由两个组件组成:编码器和解码器。编码器将源模态映射到潜在向量v,然后,基于向量v,解码器将生成目标模态的新样本。

Although most of the encoder-decoder models contain only an encoder and a decoder, some of the variants can also be composed of several encoders or decoders. For example, Mor et al. [83] proposed a model to translate music across musical instruments, where a single encoder and several decoders are involved. The shared encoder is responsible for extracting domain-independent music semantics, and each decoder will reproduce a piece of music in the target domain. An example including two encoders is the image-to-image translation model proposed by Huang et al. [84]. It consists of a content encoder and a style encoder, each is responsible for part of the duty.

尽管大多数编码器/解码器模型仅包含一个编码器和一个解码器,但是某些变体也可以由多个编码器或解码器组成。例如,Mor等。 [83]提出了一种跨乐器翻译音乐的模型,其中涉及单个编码器和几个解码器。共享编码器负责提取与域无关的音乐语义,每个解码器将在目标域中再现一段音乐。包含两个编码器的一个示例是Huang等人提出的图像到图像转换模型。 [84]。它由内容编码器和样式编码器组成,每个编码器负责一部分职责。

The generalized learning objective of encoder-decoder models, taking visual description as an example [41], can be expressed as follows:

编码器-解码器模型的广义学习目标,以视觉描述为例[41],可以表示为:

θ* = argmax_θ Σ_{(V,S)} log p(S | V; θ)    (12)

which maximizes the log likelihood of the sentence S given the corresponding visual content V and the model parameters θ. Further, assuming that each word in the sequence is produced in order, the log probability of the sentence can be expressed as

在给定相应的视觉内容V和模型参数θ的情况下,这最大化了句子S的对数似然性。此外,假设序列中的每个单词按顺序产生,则句子的对数概率可以表示为

log p(S | V) = Σ_{i=1}^{N} log p(S_{w_i} | V, S_{w_1}, ..., S_{w_{i−1}})    (13)

where S_{w_i} represents the i-th word in the sentence and N is the total number of words.

其中Swi表示句子中的第i个单词,N是单词总数。
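The objective in (12)-(13) reduces to a per-word cross-entropy when the decoder is trained with teacher forcing. The sketch below conditions a GRU decoder on the visual vector by using it as the initial hidden state, which is one common design choice rather than the only one; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Maximizes sum_i log p(S_wi | V, S_w1..S_w(i-1)), i.e. (12)-(13), via teacher forcing."""
    def __init__(self, vocab_size, visual_dim, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.init_h = nn.Linear(visual_dim, hidden_dim)   # condition the decoder on V
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual, words):
        # words: (batch, N) ground-truth token ids; predict word i from the words before it
        h0 = torch.tanh(self.init_h(visual)).unsqueeze(0)      # (1, batch, hidden)
        states, _ = self.gru(self.embed(words[:, :-1]), h0)    # shifted (teacher-forced) inputs
        logits = self.out(states)                              # (batch, N-1, vocab)
        # negative log-likelihood of each next word = cross-entropy over the vocabulary
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), words[:, 1:].reshape(-1))

decoder = CaptionDecoder(vocab_size=10000, visual_dim=2048)
loss = decoder(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
```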

Superficially, the latent vector learned by the encoder-decoder model seems to relate only to the source modality, but in fact, it closely relates to both the source and target modalities. Since the error-correction signal flows from the decoder to the encoder, the encoder is guided by the decoder during training. Consequently, the generated representation tends to capture the shared semantics of both modalities.

从表面上看,由编码器-解码器模型学习的潜矢量似乎仅与源模式有关,但实际上,它与源和目标模态都密切相关。由于纠错信号的流动方向是从解码器到编码器,因此在训练期间,编码器由解码器引导。随后,生成的表示倾向于从这两种模态中捕获共享的语义。

To capture the shared semantics more effectively, a popular solution is keeping the semantic consistency among modalities via some regularization terms. This depends on the coordination between the encoder and the decoder: both a correct understanding of the semantics in the source modality and a pertinent generation of novel samples in the target modality are important for success. Taking image captioning [85] as an example, the description generated by the decoder may cover multiple visual aspects of an image, including objects, attributes such as color and size, backgrounds, scenes, and spatial relationships; hence, the encoder has to detect and encode the necessary information correctly, and the decoder is responsible for reasoning about high-level semantics and generating grammatically well-formed sentences.

为了更有效地捕获共享语义,一种流行的解决方案是通过一些正则化术语来保持模态之间的语义一致性。这取决于编码器和解码器之间的协调。对源语态中语义的正确理解和目标语态中新样本的相关生成,对于成功都至关重要。以图像标题[85]为例,由解码器生成的描述可能覆盖图像的多个视觉方面,包括对象,属性(例如颜色和大小,背景,场景和空间关系),因此编码器必须进行检测和编码正确地获取必要的信息,此外,解码器将负责推理高级语义并生成语法良好的句子。

An example of explicitly considering the semantic consistency between modalities is the model proposed by Gao et al. [42], which aims to translate videos into sentences. To tackle this problem, on the one hand, they maximized the likelihood formulated in (13) such that sentences can be generated correctly; on the other hand, they minimized the representation difference in a common subspace such that the semantics of the two modalities are correlated with each other. Supposing that v denotes the visual features, s denotes the sentence embedding, and R denotes a matrix used for linearly projecting s into the subspace where v is located, the consistency constraint can be written as the loss term in (14). Another example is the work proposed by Reed et al. [15], which endeavors to translate characters into pixels via a generative adversarial network (GAN) [82]. In their model, within each class, the similarity between the source and target encodings is maximized such that the semantics of both modalities remain consistent. Since models for image synthesis are mostly implemented with GANs, more examples of this task are left to Section III-D, which concentrates on generative adversarial learning.

明确考虑模态之间语义一致性的一个例子是Gao等人提出的模型。 [42],旨在将视频翻译成句子。为了解决这个问题,一方面,他们最大化了(13)中公式化的可能性,以便可以正确生成句子;另一方面,他们最小化了公共子空间中的表示差异,从而使它们的语义相互关联。 。假设v表示视觉特征,s表示句子嵌入,R表示用于将s线性投影到v所在的子空间中的矩阵,则一致性约束可以写为(14)中的损失项。另一个例子是里德等人提出的工作。 [15],它致力于通过生成对抗网络(GAN)将字符转换为像素[82]。在他们的模型中,在每个类中,源编码和目标编码之间的相似性被最大化,从而使两种方式中的语义保持一致。由于图像合成模型主要由GAN实施,该任务的更多示例将留给III-D部分,该部分着重于生成对抗性学习。

||v − R s||_2^2    (14)

Provided that the semantic consistency between modalities has been modeled explicitly, this framework can be used to learn cross-modal semantic embeddings. For example, based on the encoder-decoder framework, Gu et al. [86] proposed to learn cross-modal embeddings used for retrieval. Their model translates each modality into the other via distinct encoder-decoder networks and expects the generated images or sentences to be similar to their sources. In this model, the similarity between a generated sentence and its corresponding reference sentences is measured by a standard evaluation metric such as BLEU [87], and the similarity between images is measured by a discriminator which is responsible for distinguishing whether an image comes from the generator or not.

在模态之间的语义一致性已被明确建模的条件下,该框架可用于学习跨模态语义嵌入。例如,基于编码器-解码器框架,Gu等人。 [86]提出学习用于检索的交叉模态嵌入。他们的模型通过独特的编码器/解码器网络将每个模态转换为另一个模态,并期望生成的图像或句子与其来源相似。在该模型中,所生成的句子与其对应的参考句子之间的相似性是通过像BLEU [87]这样的标准评估度量来衡量的,而图像之间的相似性则是由鉴别器来衡量的,该鉴别器负责区分图像是否来自生成器。或不。

In early works [88], [89], the representation of the visual modality is usually a fixed visual-semantic list, such as objects and their relationships, which is detected explicitly by the encoder. Then, based on n-gram language models or sentence templates, a sentence is generated by the decoder. In this way, the problem is simplified. However, it is difficult for these models to deal with a large vocabulary or to model complex sentence structures [41].

在早期的作品[88],[89]中,视觉模态的表示通常是固定的视觉语义列表,例如对象及其与对象之间的关系,编码器会明确地对其进行检测。然后,基于n-gram语言模型或句子模板,由解码器生成句子。这样可以简化问题。然而,这些模型很难处理大量词汇或建模复杂的句子结构[41]。

Recently, a more accessible way of representing the source modality is encoding the essential information into a single vectorial representation [14]. Compared to traditional methods, it is more convenient for neural networks to encode information and generate samples in this way. However, using a single vector as a bridge, it is challenging for both the encoder and the decoder to translate semantics between modalities. A problem for the encoder is that the high-level vectorial representation distilled from the source may lose some information which is useful for generating the target modality [13]. Another problem arises in the decoder once RNN models are used for generating a long sequence: the information contained in the original representation vector diminishes as it is propagated through the time steps.

近来,一种表示源模态的更可访问的方法是将基本信息编码为单个矢量表示[14]。与传统方法相比,神经网络更方便地对信息进行编码并生成样本。然而,使用单个向量作为桥梁,编码器和解码器在模态之间翻译语义都是具有挑战性的。编码器的问题是从源中提取的高级矢量表示可能会丢失一些信息,这对于生成目标模态很有用[13]。同样,一旦将RNN模型用于生成长序列,解码器中将出现另一个问题。原始表示向量中包含的信息将在传递过程中通过时间步长而减少。

The attention mechanism has become a popular solution to both of the aforementioned problems. Rather than merely using the single vector resulting from the last step of the encoder, the attention mechanism allows utilizing the intermediate representations, which are distributed among the time steps of an RNN network [90] or the localized regions of a CNN network [91]. For the encoder, this mechanism relieves the requirement that the full information be integrated into a single vector, and thus gives more flexibility to the design of the encoder.

注意机制已成为上述两个问题的流行解决方案。注意机制不仅仅是使用编码器最后一步所产生的单个矢量,还允许利用中间表示,这些中间表示分布在RNN网络[90]的时间步长或CNN网络[91]的局部区域之间。对于编码器,此机制减轻了将完整信息集成到单个向量中的要求,从而为编码器设计提供了更大的灵活性。

FIGURE 3. The model of deep multimodal RBM (adapted from [96]), which models the joint distribution over image and text inputs.

On the other hand, for the decoder, this mechanism provides the ability to concentrate on parts of the scene selectively and dynamically during the prediction process. Due to its ability to select prominent features, the attention mechanism has been successfully used in a variety of neural networks and has demonstrated its unique power in improving performance in many applications [90]–[92]. Considering its significance for multimodal representation learning, we will take a more detailed look at its impact in Section III-E.

另一方面,对于解码器,此机制提供了在预测过程中选择性地动态地集中于场景部分的能力。由于其具有选择突出特征的能力,因此注意力机制已成功用于各种神经网络,并已展示出其在许多应用中提高性能的独特能力[90]-[92]。考虑到它对多模式表示学习的重要性,我们将更详细地研究它在III-E节中的影响。
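A sketch of additive (Bahdanau-style) soft attention over a set of encoder states, which is one common realization of the mechanism described above; the scoring function and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive attention: score each encoder state against the decoder state, then mix."""
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, num_steps_or_regions, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_states)
                                   + self.W_dec(dec_state).unsqueeze(1)))   # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)                              # attention weights
        context = (weights * enc_states).sum(dim=1)                         # weighted summary
        return context, weights.squeeze(-1)

attn = SoftAttention(enc_dim=512, dec_dim=512)
context, w = attn(torch.randn(4, 49, 512), torch.randn(4, 512))   # e.g. 7x7 CNN regions
```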

To address the encoding and decoding problems of multimodal sequences, deep reinforcement learning (DRL) is another promising solution, in which either the encoding or the decoding of a sequence can be treated as a sequential decision-making problem. For example, via deep reinforcement learning, Chen et al. [93] proposed to train a feature selection module for determining whether the input at time step t should be included during encoding, such that salient features are retained while noise is excluded. Conversely, an exemplary application of deep reinforcement learning during decoding is image captioning [94], [95].

为了解决多模式序列的编码和解码问题,深度强化学习(DRL)是另一种有前途的解决方案,其中序列的编码或解码都可以视为顺序决策问题。例如,Chen等人通过深度强化学习。 [93]提出训练特征选择模块,该特征选择模块用于确定在编码期间是否应该包括在时间步t处的输入,从而可以包括显着特征而将噪声排除在外。相反,在解码期间深度强化学习的示例性应用是图像字幕[94],[95]。

Compared to other frameworks, one of the advantages of the encoder-decoder framework is its ability to generate novel samples of the target modality conditioned on the representations of the source modality. On the contrary, a disadvantage of this framework is that each encoder-decoder can only encode one of the modalities. Further, the complexity of designing the generator should be taken into consideration, since the techniques for generating plausible targets are still under development.

与其他框架相比,编解码框架的一个优点是能够根据源模态的表示生成新的目标模态条件样本。相反,该框架的缺点是每个编码器-解码器只能对其中一种模式进行编码。此外,由于生成似然目标的技术仍在发展中,因此在设计发生器时应考虑到复杂性。


The total word count has exceeded 100,000, so the remaining sections are continued in the next post (Part 2).


Reposted from blog.csdn.net/weixin_39653948/article/details/105698642