【Paper】Deep Multimodal Representation Learning: A Survey (Part2)

Continued from: 【Paper】Deep Multimodal Representation Learning: A Survey (Part1)



Deep Multimodal Representation Learning: A Survey



III. TYPICAL MODELS

In this section, some typical models in deep multimodal representation learning are summarized. They range from conventional models, including probabilistic graphical models, multimodal autoencoders, and deep canonical correlation analysis, to newly developed technologies, including generative adversarial networks and the attention mechanism. The typical models described here can be categorized into one or more of the frameworks introduced above, or can be integrated with them.


A. PROBABILISTIC GRAPHICAL MODELS

In the deep representation learning area, probabilistic graphical models include deep belief networks (DBN) [97] and deep Boltzmann machines (DBM) [98]. Although both of them are trained layer-wise from stacked restricted Boltzmann machines (RBM) [99], their structures are different. The former is a partially directed model which consists of a directed belief network and an RBM layer, while the latter is a fully undirected model.


An example of probabilistic graphical models is the multimodal DBN proposed by Srivastava and Salakhutdinov [72]. By adding a shared RBM hidden layer on top of the modality-specific DBNs, it learns a joint representation across modalities. Another model, also from Srivastava and Salakhutdinov [96], is the multimodal deep Boltzmann machine, which instead uses DBMs as the basic units for processing data from each modality. As a fully undirected model, the states of its hidden units influence each other across modalities. Hence, the modality fusion process is distributed across all hidden units of all layers.


The learning objective of multimodal probabilistic graphical models is to maximize the joint distribution over modalities. Take the multimodal DBM illustrated in Fig. 3 as an example: supposing that each modality is encoded via a DBM with two hidden layers, the joint distribution can be written as:

$$P(v_m, v_t; \theta) = \sum_{h_m^{(2)}, h_t^{(2)}, h^{(3)}} P\left(h_m^{(2)}, h_t^{(2)}, h^{(3)}\right) \sum_{h_m^{(1)}} P\left(v_m, h_m^{(1)} \mid h_m^{(2)}\right) \sum_{h_t^{(1)}} P\left(v_t, h_t^{(1)} \mid h_t^{(2)}\right)$$

where $v_m$ and $v_t$ denote the image and text input respectively, $\theta$ denotes the parameters, $h_m = \{h_m^{(1)}, h_m^{(2)}\}$ and $h_t = \{h_t^{(1)}, h_t^{(2)}\}$ denote the hidden layers in each modality, and $h^{(3)}$ denotes the shared representation layer.


Unlike the strategy which connects different modalities via a shared representation layer, Feng et al. [28] maximize the correspondence between modalities layer-wise. At each equivalent hidden layer, the two RBMs from different modalities are connected by a correlation loss function. In this way, the cross-modal correlation essential for cross-modal retrieval is captured.


By fusing modalities together in a unified latent space, probabilistic graphical models can be used to learn the essential cross-modal correlations. Based on multimodal deep belief networks, several applications such as audio-visual emotion recognition [25], audio-visual speech recognition [27], and information trustworthiness estimation [100] have been reported. Also, based on multimodal deep Boltzmann machines, several solutions used for human pose estimation [101] and video emotion prediction [26] have been proposed.


One of the advantages of probabilistic graphical models is that they can be trained in an unsupervised fashion, allowing the use of unlabeled data. Another advantage comes from their generative nature, which makes it possible to generate a missing modality conditioned on the other ones [96]. However, due to the expensive approximate inference algorithm, a crucial disadvantage of multimodal deep Boltzmann machines is their considerably high computational cost [102].



B. MULTIMODAL AUTOENCODERS

Autoencoders are popular for their ability to learn representations in an unsupervised manner; no labels are needed [103]. The basic structure of an autoencoder includes two components: one is an encoder and the other is a decoder. The encoder converts the input into a compressed hidden vector, also known as the latent representation, while the decoder endeavors to reconstruct the input based on this latent representation such that the reconstruction loss is minimized.


Inspired by denoising autoencoders [104], Ngiam et al. [1] extended autoencoders to a multimodal setting. They trained a bimodal deep autoencoder to learn a shared representation across audio and video modalities. As shown in Fig. 4, in this model, two separate autoencoders are combined in a common latent representation layer while keeping their encoders and decoders independent. To capture the cross-modal correlations robustly, each modality can be reconstructed from the shared representation, even when one of the modalities is absent. Let $(x_i, y_i)$ denote a pair of inputs and $(\hat{x}_i, \hat{y}_i)$ their reconstructed outputs; the basic optimization objective of this model is to minimize the reconstruction loss of both modalities, formulated as follows:

$$\min_{\theta} \sum_{i=1}^{N} \left( L\left(x_i, \hat{x}_i\right) + L\left(y_i, \hat{y}_i\right) \right)$$

where $L$ denotes a reconstruction loss such as the squared error.
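To make this concrete, below is a minimal PyTorch sketch of a bimodal autoencoder in the spirit of Fig. 4. The layer sizes, the fusion-by-concatenation scheme, and the squared-error loss are illustrative assumptions, not the configuration used in [1]:

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Two modality-specific encoders joined by a shared latent layer,
    with two decoders reconstructing both modalities from that layer."""
    def __init__(self, x_dim=100, y_dim=100, shared_dim=32):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU())
        self.enc_y = nn.Sequential(nn.Linear(y_dim, 64), nn.ReLU())
        self.shared = nn.Linear(64 + 64, shared_dim)  # common representation layer
        self.dec_x = nn.Sequential(nn.Linear(shared_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
        self.dec_y = nn.Sequential(nn.Linear(shared_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))

    def forward(self, x, y):
        h = torch.relu(self.shared(torch.cat([self.enc_x(x), self.enc_y(y)], dim=1)))
        return self.dec_x(h), self.dec_y(h)

model = BimodalAutoencoder()
x, y = torch.randn(8, 100), torch.randn(8, 100)
x_hat, y_hat = model(x, y)
loss = ((x_hat - x) ** 2).mean() + ((y_hat - y) ** 2).mean()  # both reconstruction terms
```

To encourage robustness to a missing modality, Ngiam et al. [1] additionally train on examples where one modality is zeroed out at the input while both modalities must still be reconstructed; the same trick can be applied to this sketch by masking x or y.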

Similar to the work of Ngiam, Silberer and Lapata [105] proposed a variant to learn semantic representations from textual and visual input. In addition to the reconstruction loss, a classification loss is optimized simultaneously to ensure that different objects can be discriminated based on the learned latent representations. Another variant is the model proposed by Wang et al. [106], which imposes orthogonal regularization on the weights to reduce the redundancy in the learned representation.


Other than learning a representation in a common subspace, Feng et al. [11] proposed to learn a pair of independent yet correlated representations for each modality. In their model, each modality is encoded via an individual autoencoder. In addition to the reconstruction loss of both modalities, the model minimizes a similarity loss between modalities such that the correlation between them can be captured. The authors imply that a balance between both losses is vital for higher performance. This idea is also adopted by Wang et al. [107], who assigned separate weights to the reconstruction losses of different modalities.


Besides the above-mentioned models, autoencoders are also used for extracting intermediate features. Generally, this type of model can be characterized as a two-stage learning strategy. In the first step, based on unsupervised learning, modality-specific features are extracted via separate autoencoders. Then, in the next step, a particular supervised learning procedure is imposed to capture the cross-modal correlations. For example, based on autoencoders, Liu et al. [6] extracted modality-specific features separately, then fused them in a neural network via supervised learning. Another instance is the work of Hong et al. [108], which learns a mapping from one modality to another based on the features learned from autoencoders.


The first advantage of autoencoders is that the learned latent representation can preserve the dominant semantic information of the input data. In the view of the generative model, since the input can be reconstructed from this latent representation, it is believable that the critical factors for generating the input have been encoded. The second advantage is that they can be trained in an unsupervised manner, without labels required. However, since this model is mainly designed for general purposes, in order to improve its performance on specific tasks, additional constraints or a supervised learning process should be involved.



C. DEEP CANONICAL CORRELATION ANALYSIS


Canonical correlation analysis (CCA) [109] is a method originally used for measuring the correlation between a pair of sets. In the multimodal representation learning scenario, given two sets of data $x = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{n \times d_x}$ and $y = [y_1, y_2, \cdots, y_n] \in \mathbb{R}^{n \times d_y}$, where each pair $(x_i, y_i)$ is a data sample including two modalities, CCA aims to find two sets of basis vectors $w_x$ and $w_y$ used for mapping the multimodal data into a shared $d$-dimensional subspace, such that the correlation between the projected representations, $P_x = w_x^{\mathsf{T}} x$ and $P_y = w_y^{\mathsf{T}} y$, is maximized [5], [110]. In the case where each set $x$ and $y$ has zero mean, the objective function can be written as (17), where $\rho$ denotes the correlation coefficient and $C$ denotes the covariance matrix.


$$\rho = \max_{w_x, w_y} \operatorname{corr}\left(w_x^{\mathsf{T}} x, \; w_y^{\mathsf{T}} y\right) = \max_{w_x, w_y} \frac{w_x^{\mathsf{T}} C_{xy} w_y}{\sqrt{w_x^{\mathsf{T}} C_{xx} w_x \cdot w_y^{\mathsf{T}} C_{yy} w_y}} \tag{17}$$

Since $\rho$ is invariant to the scale of $w_x$ or $w_y$, the optimization objective can be further reformulated as a constrained optimization problem as follows:


$$\max_{w_x, w_y} w_x^{\mathsf{T}} C_{xy} w_y \quad \text{subject to} \quad w_x^{\mathsf{T}} C_{xx} w_x = 1, \quad w_y^{\mathsf{T}} C_{yy} w_y = 1 \tag{18}$$
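The constrained problem above has a closed-form solution obtained by whitening the two views and taking an SVD. The following NumPy sketch recovers the basis vectors and the canonical correlations; the small regularization constant is an assumption added for numerical stability:

```python
import numpy as np

def linear_cca(X, Y, d, reg=1e-4):
    """X: (n, dx), Y: (n, dy); returns Wx, Wy and the top-d canonical correlations."""
    X = X - X.mean(axis=0)                      # zero-mean each view
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # singular values of Cxx^{-1/2} Cxy Cyy^{-1/2} are the canonical correlations
    ex, Ux = np.linalg.eigh(Cxx)
    ey, Uy = np.linalg.eigh(Cyy)
    Cxx_inv_sqrt = Ux @ np.diag(ex ** -0.5) @ Ux.T
    Cyy_inv_sqrt = Uy @ np.diag(ey ** -0.5) @ Uy.T
    U, s, Vt = np.linalg.svd(Cxx_inv_sqrt @ Cxy @ Cyy_inv_sqrt)
    Wx = Cxx_inv_sqrt @ U[:, :d]                # project with Px = X @ Wx
    Wy = Cyy_inv_sqrt @ Vt[:d].T                # project with Py = Y @ Wy
    return Wx, Wy, s[:d]
```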

The basic CCA is limited to modeling linear relationships, regardless of the true probability distribution in different data views. To address this problem, many extensions have been proposed. One of the non-linear extensions is kernel CCA (KCCA) [111], which transforms the data into a higher dimensional Hilbert space before applying the CCA method. However, KCCA suffers from poor scalability [112], in that its closed-form solution requires high time complexity and memory consumption. Alternatively, approximation methods such as the Nyström method [113], incomplete Cholesky decomposition [114], partial Gram-Schmidt orthogonalization [115], and block incremental SVD [116] can be used to speed up this model. Another drawback of KCCA is its poor efficiency, which results from its requirement of access to all training sets when transforming an unseen instance [117].


A new extension of CCA is deep CCA (DCCA) [117], which aims to learn a pair of more complex non-linear transformations for different modalities. The basic structure of this model is illustrated in Fig. 2(b): each modality is encoded by a deep neural network, and then, in a common subspace, the canonical correlation between modalities is maximized. Let $H_x = f_x(x; \theta_x)$ and $H_y = f_y(y; \theta_y)$ be non-linear transformation functions implemented by neural networks which map $x$ and $y$ into a shared subspace; the optimization objective is to maximize the cross-modality correlation between $H_x$ and $H_y$, formulated as follows:

$$\left(\theta_x^{*}, \theta_y^{*}\right) = \underset{\theta_x, \theta_y}{\arg\max} \; \operatorname{corr}\left(f_x(x; \theta_x), \; f_y(y; \theta_y)\right) \tag{19}$$
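On a mini-batch, this objective can be evaluated as the sum of singular values of $T = C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}$, where the covariance matrices are estimated from the encoder outputs. A hedged PyTorch sketch follows; the encoder architectures, batch size, and regularization constant are illustrative assumptions:

```python
import torch
import torch.nn as nn

def cca_corr_loss(Hx, Hy, eps=1e-4):
    """Negative total canonical correlation between two views of shape (batch, d)."""
    n = Hx.size(0)
    Hx = Hx - Hx.mean(0)
    Hy = Hy - Hy.mean(0)
    Cxx = Hx.T @ Hx / (n - 1) + eps * torch.eye(Hx.size(1))
    Cyy = Hy.T @ Hy / (n - 1) + eps * torch.eye(Hy.size(1))
    Cxy = Hx.T @ Hy / (n - 1)
    ex, Ux = torch.linalg.eigh(Cxx)
    ey, Uy = torch.linalg.eigh(Cyy)
    Cxx_inv_sqrt = Ux @ torch.diag(ex.clamp_min(eps) ** -0.5) @ Ux.T
    Cyy_inv_sqrt = Uy @ torch.diag(ey.clamp_min(eps) ** -0.5) @ Uy.T
    T = Cxx_inv_sqrt @ Cxy @ Cyy_inv_sqrt
    return -torch.linalg.svdvals(T).sum()       # minimizing this maximizes correlation

fx = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))  # view-1 encoder
fy = nn.Sequential(nn.Linear(112, 256), nn.ReLU(), nn.Linear(256, 10))  # view-2 encoder
x, y = torch.randn(256, 784), torch.randn(256, 112)
loss = cca_corr_loss(fx(x), fy(y))              # backpropagates through both encoders
```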

Compared to a particular kernel function used in KCCA, the non-linear function learned by the neural network is far more general. Hence, DCCA exhibits better adaptability and flexibility. Meanwhile, as a parametric method, DCCA scales better with data size and does not require access to the training data during testing.


Commonly, a maximized correlation objective focuses on learning the shared semantic information but tends to ignore modality-specific knowledge. To address this problem, extra regularization terms should be considered. For example, Wang et al. [118] proposed a variant of DCCA named deep canonically correlated autoencoders (DCCAE). In addition to maximizing the correlation between views, this model also minimizes the reconstruction error of each view via an autoencoder architecture. The role of the additional autoencoders can be interpreted as a regularization term which aims to raise the lower bound of the mutual information between views.


So far, most DCCA-based applications can be characterized as predicting one modality given another, while DCCA can also be used to generate novel samples. Based on the probabilistic interpretation of CCA [119], Wang et al. [120] proposed an extension named deep variational canonical correlation analysis (VCCA). As a generative model, VCCA enables us to obtain unseen samples of each view. The basic probabilistic interpretation of CCA assumes that the two views of observed variables $x$ and $y$ are generated according to the conditional probabilities $p(x|z)$ and $p(y|z)$, where $z$ is a latent variable shared by both views. Rather than assuming a linear relationship between $x$, $y$, and $z$, VCCA, implemented via DNNs, aims to model a non-linear relationship among them, which potentially has stronger representation power. Specifically, the optimization objective of VCCA is a variational lower bound of the likelihood which can be expressed as a sum over data samples. Hence, the model can be trained conveniently via stochastic gradient descent.


A challenge for DCCA is its relatively poor scalability. Directly inherited from basic CCA, the standard correlation function couples all training samples together and cannot be expressed as a sum over data samples. Thus, Andrew et al. [117] chose a batch-based algorithm (L-BFGS) to optimize the network. However, it computes gradients over the entire dataset and requires a high memory volume, which is infeasible for large datasets. In order to improve the scalability of DCCA, some efforts have been made. Wang et al. [121], [122] adopted a stochastic optimization method with large mini-batches to approximate the gradients. As a result, the problem of memory consumption is relieved.


Recently, a more efficient optimization solution named Soft CCA, which requires lower computational complexity, has been proposed by Chang et al. [123]. Unlike traditional CCA, which constrains the correlation matrix over the training batch to be an identity matrix, Soft CCA relaxes this constraint to the loss in (20), which minimizes the L1 loss of the off-diagonal elements of the constraint matrix. By expressing the CCA objective as a loss function, Soft CCA avoids some computationally expensive operations such as matrix inversion and singular value decomposition (SVD). Thus, Soft CCA is effective and more scalable in computation.

$$\mathcal{L}_{\text{decorr}} = \sum_{i \neq j} \left| \Sigma_{ij} \right| \tag{20}$$

where $\Sigma$ denotes the correlation matrix computed over the training batch.
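A minimal PyTorch version of this decorrelation term is shown below. Note that Chang et al. [123] estimate the correlation matrix with a running average over mini-batches; this simplified sketch uses a single batch:

```python
import torch

def soft_decorrelation_loss(H, eps=1e-8):
    """L1 penalty on the off-diagonal entries of the batch correlation matrix."""
    H = (H - H.mean(0)) / (H.std(0) + eps)       # standardize each dimension
    C = H.T @ H / (H.size(0) - 1)                # batch correlation matrix
    off_diag = C - torch.diag(torch.diag(C))
    return off_diag.abs().sum()
```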

Compared to the cross-modal similarity method, another type of model in the coordinated framework, one of the advantages of DCCA is that it can be trained in an unsupervised manner. Due to this advantage, DCCA has been widely used for various multi-view and multimodal learning tasks, including word embedding in a multilingual context [124], [125], acoustic feature representation [121], matching images and text [29], music retrieval [126], and speech recognition [127], [128]. On the contrary, the drawback of DCCA is its higher computational complexity, which may limit its scalability in data size.



D. GENERATIVE ADVERSARIAL NETWORK

FIGURE 5. The conceptual structure of basic generative adversarial networks.

Generative adversarial network (GAN) is an emerging deep learning technique. As an unsupervised learning method, it can be used for learning data representations without involving labels, which significantly lowers the dependence on manual annotations. Also, as a generative method, it can be used for generating high-quality novel samples according to the distribution of the training data. Since being proposed by Goodfellow et al. [82] in 2014, the generative adversarial learning strategy has been successfully used for various unimodal applications. One of the best-known applications is image synthesis [82], [129], [130], which generates high-quality images according to a random input drawn from a normal distribution. Other successful examples include image-to-image translation [131] and image super-resolution [132]. Most recently, the generative adversarial learning strategy has been further extended to multimodal cases such as text-to-image synthesis [15], [44], visual captioning [40], [43], cross-modal retrieval [30], multimodal feature fusion [4], and multimodal storytelling [133]. In this section, we will briefly introduce the fundamental concepts of GANs and discuss their role in multimodal representation learning.


Generally, a generative adversarial network is composed of two components contesting with each other: a generative network G playing as a generator and a discriminative network D playing as a discriminator. The network G is responsible for generating new samples according to the learned data distribution, while the network D aims to discriminate between an instance generated by network G and an item sampled from the training set. Commonly, both components, G and D, are implemented via deep neural networks.


The generator G can be considered as a function mapping a vector in latent space, $z$, into a sample in data space; this mapping can be formulated as $G(z; \theta_g) \rightarrow x$, where $\theta_g$ denotes the parameters of G. Similarly, the discriminator D can be formulated as $D(x; \theta_d) \rightarrow p$, mapping a matrix or a vector into a scalar probability value predicting whether a sample is drawn from the training data or not, where $\theta_d$ denotes the parameters of D and $p \in (0, 1)$. Although G generates novel samples from the distribution $P_g(x)$, it endeavors to capture the ground truth $P_{data}(x)$. Once the distribution $P_g$ estimates $P_{data}$ well enough, the discriminator D will be confused and its prediction accuracy will be lowered. Theoretically, Goodfellow et al. [82] show that the global optimum can be achieved on condition that $P_g = P_{data}$. In such a case, the discriminator is unable to distinguish between them, and the predicted probability $p$ will be 0.5 for all inputs.


$$G^{*} = \arg \min_{G} \max_{D} V(G, D) \tag{21}$$

$$V(G, D) = \mathbb{E}_{x \sim P_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_{z}(z)}\left[\log \left(1 - D(G(z))\right)\right] \tag{22}$$

The optimization objective of GANs is the solution of (21), where the function $V(G, D)$ is the cross-entropy loss of discriminator D formulated in (22). During the training process, G and D are updated in an iterative paradigm: while one of the components is updated, the parameters of the other are kept fixed. In step one, given samples from either the generator or the training dataset, the discriminator is trained to tell them apart; this objective is achieved by maximizing the function V. In step two, the generator is trained to produce samples sufficient to confuse the discriminator; this objective is achieved by minimizing the function V. In such an adversarial manner, both subnets evolve alternately.

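The two alternating updates can be summarized in a short PyTorch sketch; the network sizes, data dimensionality, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    n = x_real.size(0)
    z = torch.randn(n, 64)
    # step one: update D to maximize V, telling real samples from generated ones
    opt_d.zero_grad()
    d_loss = bce(D(x_real), torch.ones(n, 1)) + bce(D(G(z).detach()), torch.zeros(n, 1))
    d_loss.backward()
    opt_d.step()
    # step two: update G to minimize V, producing samples that confuse D
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(n, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```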

Compared to classic representation learning methods, a visible difference for GANs is that the learning process of the data representation is not straightforward; it is rather an implicit paradigm. Unlike traditional unsupervised representation methods such as autoencoders, which learn a mapping from data to latent variables directly, GANs learn a reverse mapping from latent variables to data samples. Specifically, the generator maps a random vector into a distinctive sample; thus, this random signal is a representation corresponding to the generated data. On condition that $P_g$ fits $P_{data}$ well, this random signal is a good enough representation for realistic training data.


However, despite the success of GANs in image synthesis, a disadvantage of basic GANs is that the latent representation is hard to interpret, since such a random representation has no connection with meaningful semantics. To improve the interpretability of this latent representation, Chen et al. [134] introduced a semantically meaningful method named InfoGAN, which separates the random noise vector into several groups, $z$ and $c = (c_1, \ldots, c_L)$. By maximizing the mutual information between the latent variable $c$ and the generator distribution $G(z, c)$, the model encourages the different $c_i$ to represent uncoupled salient attributes. As a result, a modification of the value of $c_i$ will lead to a change of its relevant data attributes such as shape or style.


Another disadvantage of basic GANs is the lack of a direct mapping from data to latent space, which is critical for representation learning in traditional tasks such as retrieval and classification. To address this problem, some techniques equipped with an additional inference network have been proposed [135], [136]. Other typical models which can translate representations between data space and latent space bi-directionally include the Adversarially Learned Inference model (ALI) [137] and Bidirectional Generative Adversarial Networks (BiGANs) [138]. In these models, the generator comprises a pair of parallel networks: a decoder used for mapping a latent vector $z$ into a novel sample $\hat{x}$, and an encoder which is responsible for inferring $\hat{z}$ from $x$. The decoder and the encoder are optimized jointly such that the tuples $(\hat{x}, z)$ and $(x, \hat{z})$ are similar enough to fool the discriminator.


FIGURE 6. The conceptual structure of text-to-image generative adversarial networks.

Most recently, the generative adversarial learning strategy has been extended to multimodal representation cases, mainly including cross-modal translation and retrieval. Although in both applications the core role of adversarial learning is narrowing the distribution difference between modalities, their focuses are slightly different. Specifically, in cross-modal translation applications, GANs help the encoder to capture shared semantic concepts among modalities, while in cross-modal retrieval, given paired multimodal inputs, they help the coupled encoders to yield paired representations that are similar enough in the common subspace.


In the cross-modal translation area, taking text-to-image synthesis as an example, one of the key challenges is to properly encode visual concepts from text descriptions, such as object categories, colors, and locations, into a vector such that the other modality can be generated correctly according to this intermediate representation. To address this problem, based on conditional generative adversarial nets (CGAN) [139], Reed et al. [15] proposed an end-to-end architecture to train the text encoder. As Fig. 6 illustrates, in this model, the text input acting as the condition is encoded into a vector T; then T, along with a noise vector Z, is translated into an image; after that, the discriminator tells whether T and the image encoding V are compatible or not. To gain a visually-discriminative vector representation of the text descriptions, the optimization objective is a structured loss [140] formulated as follows:


$$\frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\left(y_n, f_v(v_n)\right) + \mathbf{1}\left(y_n, f_t(t_n)\right) \tag{23}$$

where $\{(v_n, t_n, y_n) : n = 1, \ldots, N\}$ is the training set, $\mathbf{1}$ is the 0-1 loss, $v_n$ are the images, $t_n$ are the text descriptions, and $y_n$ are the class labels. The classifiers $f_v$ and $f_t$ are defined as follows:


$$f_v(v) = \underset{y \in \mathcal{Y}}{\arg\max} \; \mathbb{E}_{t \sim T(y)}\left[\varphi(v)^{\mathsf{T}} \phi(t)\right] \tag{24}$$

$$f_t(t) = \underset{y \in \mathcal{Y}}{\arg\max} \; \mathbb{E}_{v \sim V(y)}\left[\varphi(v)^{\mathsf{T}} \phi(t)\right] \tag{25}$$

where $\phi$ denotes the text encoder, $\varphi$ denotes the image encoder, $T(y)$ denotes the set of text descriptions belonging to class $y$, and likewise $V(y)$ for images. Via optimizing the loss function (23), the adversarial process between G and D will not only guide the generator to align images with the text descriptions, but also help the text encoder to capture the shared visual semantic concepts among modalities.

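The conditioning interface of Fig. 6 can be sketched as follows, assuming the text has already been encoded into a vector T; the fully-connected layers are illustrative stand-ins for the convolutional networks used in [15]:

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, t_dim=128, z_dim=100, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(t_dim + z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, img_dim), nn.Tanh())

    def forward(self, t, z):
        return self.net(torch.cat([t, z], dim=1))    # image conditioned on text

class MatchingDiscriminator(nn.Module):
    def __init__(self, t_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2))
        self.score = nn.Sequential(nn.Linear(512 + t_dim, 1), nn.Sigmoid())

    def forward(self, img, t):
        v = self.img_enc(img)                        # image encoding V
        return self.score(torch.cat([v, t], dim=1))  # are T and V compatible?
```

During training, the discriminator in [15] treats (real image, matching text) as positive and both (generated image, text) and (real image, mismatched text) as negative, so it must judge text-image compatibility rather than image realism alone.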

To improve the performance of text-to-image synthesis, several models [44], [141], [142] sharing the same basic structure illustrated in Fig. 6 have been proposed. In different ways, they improve the text encoder such that visual information from the text descriptions can be encoded more explicitly. Zhang et al. [44] adopted a sketch-refinement process to generate photo-realistic images: conditioned on text descriptions, their model first sketches a low-resolution image and then generates a high-resolution image in the refinement stage. Also, in this model, in order to improve the diversity of the synthesized images, they introduce a Conditioning Augmentation technique to encourage smoothness of the text encoding in the latent conditioning space. In [141], Reed et al. combined object location information, provided by bounding boxes or key points, with the text descriptions to specify what content to draw at which location. Rather than using a sentence as the condition, Johnson et al. [142] proposed to use a scene graph as the input of the translation network. To process the scene graphs, in the proposed model, a graph convolution network is designed to encode the node and edge information into representation vectors. Compared to unstructured text, structured scene graphs, which describe objects and their relationships explicitly, help in generating complex images.


FIGURE 7. Two methods used for improving the modality-invariant property via adversarial learning. The key idea is mapping paired inputs into a common subspace such that the discriminator cannot distinguish which modality a feature comes from. (a) Discriminate which modality a feature comes from. (b) Discriminate whether the input is a pair or not.

In the cross-modal retrieval area, the main role of GANs is to help the coupled encoders to yield paired representations that are similar enough in the common subspace. The key idea is mapping paired inputs into a common subspace such that the discriminator cannot distinguish which modality a feature comes from. According to the input contents of the discriminator, the typical structures of cross-modal adversarial models can be generalized into two categories. In the first category, illustrated in Fig. 7(a), the inputs of the modality discriminator are features generated by encoders, while in the second category, illustrated in Fig. 7(b), the inputs are data samples. In the rest of this section, we will describe both types of learning strategies.


As Fig. 7(a) shows, the cross-modal adversarial model of the first category is composed of two generators and a discriminator. Each generator is a feature encoder used for mapping either text or images into a common latent subspace, where features from different modalities can be compared directly. The desired goal is to narrow the distribution gap between modalities, which means that data with similar semantics from different modalities may be mapped into adjacent points in the common space. During training, the generators seek to yield modality-invariant representations; on the contrary, a modality classifier, also the discriminator of the GAN, is used for discriminating where a feature comes from. Once the discriminator cannot distinguish the source of the feature vectors, the distribution gap between modalities will be minimized accordingly.

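A minimal sketch of this first-category strategy is given below, assuming pre-extracted image and text features; the feature dimensions are illustrative, and published models [4], [30], [143] add label predictors or ranking constraints on top of this adversarial core:

```python
import torch
import torch.nn as nn

# two generators: feature encoders projecting each modality into a common subspace
enc_img = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(), nn.Linear(256, 64))
enc_txt = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 64))
# the discriminator: a modality classifier over common-subspace features
modality_clf = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
bce = nn.BCELoss()

def discriminator_loss(img, txt):
    # the classifier learns to predict a feature's source (1 = image, 0 = text)
    f_img, f_txt = enc_img(img).detach(), enc_txt(txt).detach()
    return bce(modality_clf(f_img), torch.ones(img.size(0), 1)) + \
           bce(modality_clf(f_txt), torch.zeros(txt.size(0), 1))

def encoder_loss(img, txt):
    # the encoders are trained with flipped labels, so that the classifier
    # cannot tell which modality a common-subspace feature comes from
    f_img, f_txt = enc_img(img), enc_txt(txt)
    return bce(modality_clf(f_img), torch.zeros(img.size(0), 1)) + \
           bce(modality_clf(f_txt), torch.ones(txt.size(0), 1))
```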

Based on the learning strategy of the first category, several models used for cross-modal retrieval have been proposed [4], [30], [143]. In these models, the adversarial process serves to enforce the distributions of projected representations from different modalities to be closer to each other. The main difference between them is how they preserve the intra-modality and inter-modality similarities simultaneously. For example, Wang et al. [30] proposed to learn representations that are modality-invariant and discriminative. In addition to the modality classifier, a label predictor is integrated into this model to keep the learned features discriminative within each modality. Further, a triplet margin rank constraint is added to the label classifier such that the inter-modality similarity can be preserved.


Peng et al. [4] proposed to learn a discriminative common representation for bridging the heterogeneity gap. In their model, the generator is formed by a cross-modal autoencoder with a weight-sharing constraint, and the discriminator is composed of two kinds of discriminative modules: intra-modality and inter-modality discriminators. The generator seeks to project multimodal inputs into the common subspace with two useful properties, keeping semantic consistency within each modality and distribution consistency among modalities; on the contrary, the discriminators try to detect the inconsistency. Specifically, the intra-modality discriminator aims to distinguish the generated reconstruction feature from the original input, while the inter-modality discriminator endeavors to tell which modality a feature comes from.


The model proposed by Xu et al. [143] aims to learn cross-modal representations which are maximally correlated and statistically indistinguishable in the common subspace. They decompose the whole problem into three loss terms: an adversarial loss, which is utilized to minimize the statistical difference between the distributions of different modalities; a feature discrimination loss, which ensures that the representations are discriminative within each modality; and a cross-modal correlation loss, which is responsible for keeping the cross-modal similarity structure. Specifically, the cross-modal correlation loss is measured by the squared distance between pairs of samples from different modalities. If a pair comes from the same category, its distance is minimized; otherwise, it is maximized.


As Fig. 7(b) shows, the cross-modal adversarial model of the second category contains an encoder-decoder network, which translates one modality into another. For example, given a pair of inputs $(v, t)$, the encoder maps $t$ into a representation vector; then the decoder, playing as the generator, maps this vector into a reproduced sample $\hat{v}$. The generated sample $\hat{v}$ is expected to be sufficiently similar to $v$, such that the reproduced pair $(\hat{v}, t)$ is considered a real pair by the discriminator. On condition that the learned representation can be translated into another modality soundly, it is believable that the cross-modal invariant property has been preserved. An example in this category is the model proposed by Gu et al. [86], which integrated a generative adversarial network to train a text encoder. In the following, more examples will be shown to demonstrate how this model can be used in practice.


Zhang et al. [144] adopted GANs to model cross-modal hashing in an unsupervised fashion. In addition to preserving the inter-modality and intra-modality correlations in the common hash space, their model also aims to preserve the manifold structure across different modalities. Given a sample from one modality, the generator is trained to select a sample from another modality located on the same manifold. Then, the discriminator determines whether the generated pair of samples belongs to the same manifold structure or not. Here, the hash codes play a key role for both the generator and the discriminator: the generator selects samples conditioned on hash codes, and the discriminator judges their correlation between modalities based on hash codes. The adversarial learning process is used for enhancing the property of preserving the cross-modal manifold structure in a common hash space.


Wu et al. [145] extended CycleGAN [146] to learn cross-modal hash functions in the setting where no paired training samples are available. CycleGAN can be seen as a special case of the second category, which includes a pair of encoder-decoders, each of them designed to translate one modality into another. For example, given an input $v$, the model translates $v$ into $t$, and then $t$ is reversely translated back to $\hat{v}$; it is expected that $v \approx \hat{v}$. Similarly, given an input $t$, a reconstructed $\hat{t}$ is expected to roughly equal $t$. Based on the cycle-consistency constraint in both modalities, the model can be trained in the absence of paired training samples.

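The cycle-consistency constraint itself is compact. In the hedged sketch below, G_vt and G_tv are hypothetical translation networks mapping the visual modality to the textual one and back:

```python
import torch

def cycle_consistency_loss(v, t, G_vt, G_tv):
    """L1 cycle losses: v -> G_vt(v) -> G_tv(...) should recover v, and likewise for t."""
    v_rec = G_tv(G_vt(v))   # translate v into the other modality and back
    t_rec = G_vt(G_tv(t))   # translate t into the other modality and back
    return (v_rec - v).abs().mean() + (t_rec - t).abs().mean()
```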

One of the advantages of GANs is that they can be trained by unsupervised learning, which significantly lowers the dependence on manual annotations. Another advantage is their powerful ability to generate high-quality novel samples according to the distribution of the training data. However, though a unique global optimum exists theoretically, it is challenging to train a GAN system, which may suffer from training instability, either "collapsing" or failing to converge [147]. Although several improvements have been proposed [147]–[150], the way to stabilize the training of GANs remains an open problem.





Reposted from blog.csdn.net/weixin_39653948/article/details/105674305