Multi-Task Learning for Dense Prediction Tasks: A Survey - Network Architectures (Part 2)

 [ TPAMI 2021 ]

Multi-Task Learning for Dense Prediction Tasks:

A Survey

[ The authors ]

Simon Vandenhende, Wouter Van Gansbeke and Marc Proesmans

Center for Processing Speech and Images, Department of Electrical Engineering, KU Leuven.

Stamatios Georgoulis and Dengxin Dai

Computer Vision Lab, Department of Electrical Engineering, ETH Zurich.

Luc Van Gool

Center for Processing Speech and Images, KU Leuven;

Computer Vision Lab, ETH Zurich.

[ Paper | Code ]

Multi-Task Learning for Dense Prediction Tasks: A Survey

GitHub - SimonVandenhende/Multi-Task-Learning-PyTorch: PyTorch implementation of multi-task learning architectures, incl. MTI-Net (ECCV2020).

Figure 1 shows a structured overview of the paper. Our code is made publicly available to ease the adoption of the reviewed MTL techniques: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch.

[ CSDN Links ]

Since the full survey is quite long, it is covered in four separate posts; the related links are listed below:

Multi-Task Learning for Dense Prediction Tasks: A Survey - Abstract & Introduction

Multi-Task Learning for Dense Prediction Tasks: A Survey - Network Architectures (Part 1)

Multi-Task Learning for Dense Prediction Tasks: A Survey - Network Architectures (Part 2)

Multi-Task Learning for Dense Prediction Tasks: A Survey - Optimization Methods

____________________\triangledown____________________

Contents

2  Deep Multi-Task Architectures

2.3 Decoder-Focused Architectures

2.3.1 PAD-Net

2.3.2 Pattern-Affinitive Propagation Networks

2.3.3 Joint Task-Recursive Learning

2.3.4 Multi-Scale Task Interaction Networks

2.4 Other Approaches


2  Deep Multi-Task Architectures

For Sections 2.1 and 2.2, please refer to the companion post: Network Architectures (Part 1).

2.3 Decoder-Focused Architectures

The encoder-focused architectures in Section 2.2 follow a common pattern: they directly predict all task outputs from the same input in one processing cycle (i.e. all predictions are generated once, in parallel or sequentially, and are not refined afterwards). By doing so, they fail to capture commonalities and differences among tasks that are likely fruitful for one another (e.g. depth discontinuities are usually aligned with semantic edges). Arguably, this might be the reason for the only moderate performance improvements achieved by the encoder-focused approaches to MTL (see Section 4.3.1). To alleviate this issue, a few recent works first employed a multi-task network to make initial task predictions, and then leveraged features from these initial predictions in order to further improve each task output – in a one-off or recursive manner. As these MTL approaches also share or exchange information during the decoding stage, we refer to them as decoder-focused architectures (see Figure 3b).

The encoder-focused architectures of Section 2.2 follow a common pattern: they directly predict all task outputs from the same input in a single processing cycle (i.e. all predictions are generated once, in parallel or sequentially, and are not refined afterwards). In doing so, they fail to capture the commonalities and differences among tasks that are likely fruitful for one another (e.g. depth discontinuities are usually aligned with semantic edges). Arguably, this may be why encoder-focused approaches to MTL achieve only moderate performance improvements (see Section 4.3.1). To alleviate this issue, several recent works first use a multi-task network to make initial task predictions, and then leverage features from these initial predictions to further improve each task output, in a one-off or recursive manner. Because these MTL approaches also share or exchange information during the decoding stage, they are referred to as decoder-focused architectures (see Figure 3b).

2.3.1 PAD-Net

PAD-Net [13] was one of the first decoder-focused architectures. The model itself is visualized in Figure 6. As can be seen, the input image is first processed by an off-the-shelf backbone network. The backbone features are further processed by a set of task-specific heads that produce an initial prediction for every task. These initial task predictions add deep supervision to the network, but they can also be used to exchange information between tasks, as will be explained next. The task features in the last layer of the task-specific heads contain a per-task feature representation of the scene. PAD-Net proposed to re-combine them via a multi-modal distillation unit, whose role is to extract cross-task information, before producing the final task predictions.

PAD-Net performs the multi-modal distillation by means of a spatial attention mechanism. Particularly, the output features F^o_k for task k are calculated as

F^o_k = F^i_k + \sum_{l \neq k} \sigma(W_{k,l} F^i_l) \odot F^i_l , \qquad (2)

where \sigma(W_{k,l} F^i_l) returns a spatial attention mask that is applied to the initial task features F^i_l from task l. The attention mask itself is found by applying a convolutional layer W_{k,l} to extract local information from the initial task features. Equation 2 assumes that the task interactions are location dependent, i.e. tasks are not in a constant relationship across the entire image. This can be understood from a simple example. Consider two dense prediction tasks, e.g. monocular depth prediction and semantic segmentation. Depth discontinuities and semantic boundaries often coincide. However, when we segment a flat object, e.g. a magazine, from a flat surface, e.g. a table, we will still find a semantic boundary where the depth map is rather continuous. In this particular case, the depth features provide no additional information for the localization of the semantic boundaries. The use of spatial attention explicitly allows the network to select information from other tasks at locations where it is useful.

The encoder-focused approaches in Section 2.2 shared features amongst tasks using the intermediate representations in the encoder. Differently, PAD-Net models the task interactions by applying a spatial attention layer to the features in the task-specific heads. In contrast to the intermediate feature representations in the encoder, the task features used by PAD-Net are already disentangled according to the output task. We hypothesize that this makes it easier for other tasks to distill the relevant information. This multistep decoding strategy from PAD-Net is applied and refined in other decoder-focused approaches.

PAD-Net [13] was one of the first decoder-focused architectures; the model is shown in Figure 6. The input image is first processed by an off-the-shelf backbone network. The backbone features are then further processed by a set of task-specific heads that produce an initial prediction for every task. These initial task predictions add deep supervision to the network, but they can also be used to exchange information between tasks, as explained next. The task features in the last layer of the task-specific heads contain a per-task feature representation of the scene. PAD-Net proposes to re-combine them via a multi-modal distillation unit, whose role is to extract cross-task information before producing the final task predictions.

   

PAD-Net performs the multi-modal distillation with a spatial attention mechanism. Specifically, the output features F^o_k of task k are computed as in Equation (2), where \sigma(W_{k,l} F^i_l) returns a spatial attention mask that is applied to the initial task features F^i_l of task l. The attention mask itself is obtained by applying a convolutional layer W_{k,l} that extracts local information from the initial task features. Equation 2 assumes that the task interactions are location dependent, i.e. the tasks are not in a constant relationship across the entire image. This can be understood from a simple example. Consider two dense prediction tasks, e.g. monocular depth prediction and semantic segmentation. Depth discontinuities and semantic boundaries often coincide. However, when a flat object (e.g. a magazine) is segmented from a flat surface (e.g. a table), we still find a semantic boundary even though the depth map is rather continuous there. In this particular case, the depth features provide no additional information for localizing the semantic boundaries. The use of spatial attention explicitly allows the network to select information from other tasks at the locations where it is useful.

The encoder-focused approaches in Section 2.2 share features among tasks via the intermediate representations in the encoder. Differently, PAD-Net models the task interactions by applying a spatial attention layer to the features in the task-specific heads. In contrast to the intermediate feature representations in the encoder, the task features used by PAD-Net are already disentangled according to the output task, which presumably makes it easier for other tasks to distill the relevant information. PAD-Net's multi-step decoding strategy is adopted and refined by other decoder-focused approaches.
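Below is a minimal PyTorch sketch of this spatial-attention-based multi-modal distillation (Equation 2). It is an illustration only: the module and argument names are invented here, and the official implementation in the linked repository may differ.

```python
import torch
import torch.nn as nn

class MultiModalDistillation(nn.Module):
    """Sketch of PAD-Net-style distillation: F^o_k = F^i_k + sum_{l!=k} sigma(W_{k,l} F^i_l) * F^i_l."""
    def __init__(self, tasks, channels):
        super().__init__()
        self.tasks = tasks
        # One 3x3 convolution W_{k,l} per ordered task pair (task k receives from task l).
        self.attention = nn.ModuleDict({
            f"{k}_{l}": nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for k in tasks for l in tasks if l != k
        })

    def forward(self, feats):
        # feats: dict mapping task name -> initial task features of shape (B, C, H, W)
        out = {}
        for k in self.tasks:
            refined = feats[k]
            for l in self.tasks:
                if l == k:
                    continue
                mask = torch.sigmoid(self.attention[f"{k}_{l}"](feats[l]))  # spatial attention mask
                refined = refined + mask * feats[l]                         # distill task l into task k
            out[k] = refined
        return out
```

The final task predictions would then be obtained by passing each refined feature map through its task-specific output head.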

[13] PAD-Net: Multitasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. CVPR, 2018

 

2.3.2 Pattern-Affinitive Propagation Networks

Pattern-Affinitive Propagation Networks (PAP-Net) [14] used an architecture similar to PAD-Net (see Figure 7), but the multi-modal distillation in this work is performed in a different manner. The authors argue that directly working on the task features space via the spatial attention mechanism, as done in PAD-Net, might be a suboptimal choice. As the optimization still happens at a different space, i.e. the task label space, there is no guarantee that the model will learn the desired task relationships. Instead, they statistically observed that  pixel affinities  tend to align well with common local structures on the task label space. Motivated by this observation, they proposed to leverage pixel affinities in order to perform multi-modal distillation.

To achieve this, the backbone features are first processed by a set of task-specific heads to get an initial prediction for every task. Second, a per-task pixel affinity matrix M_{T_j} is calculated by estimating pixel-wise correlations upon the task features coming from each head. Third, a cross-task information matrix \hat{M}_{T_j} for every task T_j is learned by adaptively combining the affinity matrices M_{T_i} for tasks T_i with learnable weights \alpha^{T_j}_i:

\hat{M}_{T_j} = \sum_{i} \alpha^{T_j}_i \, M_{T_i} . \qquad (3)

Finally, the task features coming from each head j are refined using the cross-task information matrix \hat{M}_{T_j}. In particular, the cross-task information matrix is diffused into the task feature space to spread the correlation information across the image. This effectively weakens or strengthens the pixel correlations for task T_j, based on the pixel affinities from other tasks T_i. The refined features are used to make the final predictions for every task.

All previously discussed methods only use limited local information when fusing features from different tasks. For example, cross-stitch networks and NDDR-CNNs combine the features in a channel-wise fashion, while PAD-Net only uses the information from within a 3×3 pixel window to construct the spatial attention mask. Differently, PAP-Net also models the non-local relationships through pixel affinities measured across the entire image. Zhou et al. [17] extended this idea to specifically mine and propagate both inter- and intra-task patterns.

Pattern-Affinitive Propagation Networks (PAP-Net) [14] use an architecture similar to PAD-Net (see Figure 7), but the multi-modal distillation is performed in a different manner. The authors argue that working directly on the task feature space via a spatial attention mechanism, as PAD-Net does, may be a suboptimal choice: since the optimization still happens in a different space, namely the task label space, there is no guarantee that the model will learn the desired task relationships. Instead, they statistically observed that pixel affinities tend to align well with common local structures in the task label space, and based on this observation they proposed to leverage pixel affinities for the multi-modal distillation.

To achieve this:

First, the backbone features are processed by a set of task-specific heads to obtain an initial prediction for every task.

Second, a per-task pixel affinity matrix M_{T_j} is computed by estimating pixel-wise correlations on the task features coming from each head.

Third, the affinity matrices M_{T_i} of the tasks T_i are adaptively combined with learnable weights \alpha^{T_j}_i to obtain a cross-task information matrix \hat{M}_{T_j} for every task T_j, as in Equation (3).

Finally, the task features from each head j are refined using the cross-task information matrix \hat{M}_{T_j}.

Concretely, the cross-task information matrix is diffused into the task feature space, spreading the correlation information across the image. This effectively weakens or strengthens the pixel correlations for task T_j based on the pixel affinities of the other tasks T_i. The refined features are then used to make the final prediction for every task.

The previously discussed methods use only limited local information when fusing features from different tasks. For example, cross-stitch networks and NDDR-CNNs combine the features in a channel-wise fashion, while PAD-Net only uses the information within a 3×3 pixel window to construct its spatial attention mask. Differently, PAP-Net also models non-local relationships through pixel affinities measured across the entire image. Zhou et al. [17] extended this idea to specifically mine and propagate both inter- and intra-task patterns.
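A rough sketch of this affinity-based distillation is given below. It only illustrates the idea behind Equation (3) and the diffusion step; the cosine-similarity affinity, the normalization and all names are assumptions made for this example, not the official PAP-Net code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_affinity(feat):
    """feat: (B, C, H, W) -> row-normalized pixel affinity matrix of shape (B, H*W, H*W)."""
    b, c, h, w = feat.shape
    x = feat.flatten(2).transpose(1, 2)              # (B, HW, C)
    x = F.normalize(x, dim=-1)
    aff = torch.relu(x @ x.transpose(1, 2))          # non-negative pairwise similarities
    return aff / aff.sum(dim=-1, keepdim=True).clamp(min=1e-6)

class AffinityDistillation(nn.Module):
    def __init__(self, tasks):
        super().__init__()
        self.tasks = tasks
        # Learnable mixing weights alpha_i^{T_j}, one scalar per (target task, source task) pair (Eq. 3).
        self.alpha = nn.ParameterDict({
            f"{tj}_{ti}": nn.Parameter(torch.tensor(1.0 / len(tasks)))
            for tj in tasks for ti in tasks
        })

    def forward(self, feats):                        # feats: dict task -> (B, C, H, W)
        affinities = {t: pixel_affinity(f) for t, f in feats.items()}
        out = {}
        for tj in self.tasks:
            # Adaptively combine the per-task affinities into a cross-task information matrix.
            mixed = sum(self.alpha[f"{tj}_{ti}"] * affinities[ti] for ti in self.tasks)
            b, c, h, w = feats[tj].shape
            flat = feats[tj].flatten(2)              # (B, C, HW)
            diffused = flat @ mixed.transpose(1, 2)  # spread correlation information across the image
            out[tj] = feats[tj] + diffused.view(b, c, h, w)
        return out
```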

[14] Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation. CVPR, 2019

2.3.3 Joint Task-Recursive Learning

Joint Task-Recursive Learning (JTRL) [15] recursively predicts two tasks at increasingly higher scales in order to gradually refine the results based on past states. The architecture is illustrated in Figure 8. Similarly to PAD-Net and PAP-Net, a multi-modal distillation mechanism is used to combine information from earlier task predictions, through which later predictions are refined. Differently, the JTRL model predicts two tasks sequentially, rather than in parallel, and in an intertwined manner. The main disadvantage of this approach is that it is not straightforward, or even possible, to extend this model to more than two tasks given the intertwined manner in which task predictions are refined.

Joint Task-Recursive Learning (JTRL) [15] recursively predicts two tasks at increasingly higher scales so that the results are gradually refined based on past states. The architecture is illustrated in Figure 8. Similarly to PAD-Net and PAP-Net, a multi-modal distillation mechanism is used to combine information from earlier task predictions, through which the later predictions are refined. Differently, the JTRL model predicts the two tasks sequentially, rather than in parallel, and in an intertwined manner. The main drawback of this approach is that, given the intertwined way in which the task predictions are refined, it is not straightforward, or even possible, to extend the model to more than two tasks.
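The sketch below illustrates the task-recursive idea for two tasks: the two task streams are refined alternately, each refinement conditioned on the other task's latest features, and the resolution is increased between refinement steps. The structure and names are illustrative assumptions, not the published JTRL implementation.

```python
import torch
import torch.nn as nn

class TaskRecursiveRefinement(nn.Module):
    """Alternately refine two tasks across scales, each step conditioned on the other task's latest estimate."""
    def __init__(self, channels, num_scales=3):
        super().__init__()
        self.refine_a = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 3, padding=1) for _ in range(num_scales))
        self.refine_b = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 3, padding=1) for _ in range(num_scales))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, feat_a, feat_b):
        # feat_a / feat_b: coarsest-scale features for the two tasks (e.g. depth and segmentation)
        for i, (ra, rb) in enumerate(zip(self.refine_a, self.refine_b)):
            feat_a = ra(torch.cat([feat_a, feat_b], dim=1))   # refine task A given task B
            feat_b = rb(torch.cat([feat_b, feat_a], dim=1))   # refine task B given the refined A
            if i < len(self.refine_a) - 1:                    # move to the next, higher scale
                feat_a, feat_b = self.up(feat_a), self.up(feat_b)
        return feat_a, feat_b
```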

[15] Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation. ECCV, 2018 

  

2.3.4 Multi-Scale Task Interaction Networks

In the decoder-focused architectures presented so far, the multi-modal distillation was performed at a fixed scale, i.e. the features of the backbone’s last layer. This rests on the assumption that all relevant task interactions can solely be modeled through a single filter operation with a specific receptive field. However, Multi-Scale Task Interaction Networks (MTI-Net) [16] showed that this is a rather strict assumption. In fact, tasks can influence each other differently at different receptive fields.

To account for this restriction, MTI-Net explicitly took into account task interactions at multiple scales. Its architecture is illustrated in Figure 9. First, an off-the-shelf backbone network extracts a multi-scale feature representation from the input image. From the multi-scale feature representation an initial prediction for every task is made at each scale. The task predictions at a particular scale are found by applying a task-specific head to the backbone features extracted at that scale. Similarly to PAD-Net, the features in the last layer of the task-specific heads are combined and refined to make the final predictions. Differently, in MTI-Net the per-task feature representations can be distilled at each scale separately. This allows for multiple task interactions, each modeled within a specific receptive field. The distilled multi-scale features are upsampled to the highest scale and concatenated, resulting in a final feature representation for every task. The final task predictions are found by decoding these final feature representations in a task-specific manner again. The performance was further improved by also propagating information from the lower-resolution task features to the higher-resolution ones using a Feature Propagation Module.

The experimental evaluation in [16] shows that distilling task information at multiple scales increases the multitasking performance compared to PAD-Net where such information is only distilled at a single scale. Furthermore, since MTI-Net distills the features at multiple scales, i.e. using different pixel dilations, it overcomes the issue of using only limited local information to fuse the features, which was already shown to be beneficial in PAP-Net.

In the decoder-focused architectures presented so far, the multi-modal distillation is performed at a fixed scale, namely on the features of the backbone's last layer. This rests on the assumption that all relevant task interactions can be modeled by a single filter operation with one specific receptive field. However, Multi-Scale Task Interaction Networks (MTI-Net) [16] showed that this is a rather strict assumption: in fact, tasks can influence each other differently at different receptive fields.

To account for this limitation, MTI-Net explicitly considers task interactions at multiple scales; its architecture is shown in Figure 9. First, an off-the-shelf backbone network extracts a multi-scale feature representation from the input image. From this multi-scale representation, an initial prediction for every task is made at each scale by applying a task-specific head to the backbone features extracted at that scale. Similarly to PAD-Net, the features in the last layer of the task-specific heads are combined and refined to make the final predictions. Differently, in MTI-Net the per-task feature representations can be distilled at each scale separately, which allows multiple task interactions, each modeled within a specific receptive field. The distilled multi-scale features are upsampled to the highest scale and concatenated, yielding a final feature representation for every task, which is then decoded again in a task-specific manner to obtain the final task predictions. Performance is further improved by propagating information from the lower-resolution task features to the higher-resolution ones through a Feature Propagation Module.

The experimental evaluation in [16] shows that distilling task information at multiple scales improves multi-task performance compared to PAD-Net, where such information is distilled at a single scale only. Furthermore, because MTI-Net distills the features at multiple scales, i.e. with different pixel dilations, it overcomes the problem of fusing features with only limited local information, which had already been shown to be beneficial in PAP-Net.
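A condensed sketch of the multi-scale distillation idea is shown below: one distillation unit per scale, followed by upsampling and concatenation of the distilled features before the final task-specific decoding. The class and parameter names are made up for this illustration and the Feature Propagation Module is omitted; the official code is available in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleDistillation(nn.Module):
    """Simplified per-scale distillation: each task attends to the other tasks' features at one scale."""
    def __init__(self, tasks, channels):
        super().__init__()
        self.tasks = tasks
        self.attn = nn.ModuleDict({
            f"{k}_{l}": nn.Conv2d(channels, channels, 3, padding=1)
            for k in tasks for l in tasks if l != k})

    def forward(self, feats):  # feats: dict task -> (B, C, Hs, Ws)
        return {k: feats[k] + sum(torch.sigmoid(self.attn[f"{k}_{l}"](feats[l])) * feats[l]
                                  for l in self.tasks if l != k)
                for k in self.tasks}

class MultiScaleDistillation(nn.Module):
    def __init__(self, tasks, channels, num_scales=4):
        super().__init__()
        # One distillation unit per scale, so task interactions are modeled
        # within several receptive fields instead of a single one.
        self.distill = nn.ModuleList(ScaleDistillation(tasks, channels) for _ in range(num_scales))
        self.heads = nn.ModuleDict({t: nn.Conv2d(channels * num_scales, channels, 3, padding=1)
                                    for t in tasks})
        self.tasks = tasks

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list over scales (highest resolution first) of dicts task -> (B, C, Hs, Ws)
        distilled = [d(f) for d, f in zip(self.distill, multi_scale_feats)]
        size = multi_scale_feats[0][self.tasks[0]].shape[-2:]      # target: the highest resolution
        out = {}
        for t in self.tasks:
            # Upsample every scale to the highest resolution, concatenate, and decode per task.
            feats = [F.interpolate(s[t], size=size, mode="bilinear", align_corners=False)
                     for s in distilled]
            out[t] = self.heads[t](torch.cat(feats, dim=1))
        return out
```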

[16] MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning. ECCV, 2020

2.4 Other Approaches

A number of approaches that fall outside the aforementioned categories have been proposed in the literature. For example, multilinear relationship networks [56] used tensor normal priors to the parameter set of the task-specific heads to allow interactions in the decoding stage. Different from the standard parallel ordering scheme, where layers are aligned and shared (e.g. [5], [7]), soft layer ordering [64] proposed a flexible sharing scheme across tasks and network depths. Yang et al. [65] generalized matrix factorisation approaches to MTL in order to learn cross-task sharing structures in every layer of the network. Routing networks [66] proposed a principled approach to determine the connectivity of a network’s function blocks through routing. Piggyback [67] showed how to adapt a single, fixed neural network to a multi-task network by learning binary masks. Huang et al. [68] introduced a method rooted in Neural Architecture Search (NAS) for the automated construction of a tree-based multi-attribute learning network. Stochastic filter groups [57] re-purposed the convolution kernels in each layer of the network to support shared or taskspecific behaviour. In a similar vein, feature partitioning [69] presented partitioning strategies to assign the convolution kernels in each layer of the network into different tasks.

In general, these works have a different scope within MTL, e.g. automate the network architecture design. Moreover, they mostly focus on solving multiple (binary) classification tasks, rather than multiple dense prediction tasks. As a result, they fall outside the scope of this survey, with one notable exception that is discussed next.

A number of approaches that fall outside the above categories have been proposed in the literature. For example, multilinear relationship networks [56] place tensor normal priors on the parameter set of the task-specific heads to allow interactions in the decoding stage. Different from the standard parallel ordering scheme, in which layers are aligned and shared (e.g. [5], [7]), soft layer ordering [64] proposes a flexible sharing scheme across tasks and network depths. Yang et al. [65] generalize matrix factorisation approaches to MTL in order to learn cross-task sharing structures in every layer of the network. Routing networks [66] propose a principled approach to determine the connectivity of a network's function blocks through routing. Piggyback [67] shows how to adapt a single, fixed neural network into a multi-task network by learning binary masks. Huang et al. [68] introduce a method based on Neural Architecture Search (NAS) to automatically construct a tree-based multi-attribute learning network. Stochastic filter groups [57] re-purpose the convolution kernels in each layer of the network to support shared or task-specific behaviour. Similarly, feature partitioning [69] presents partitioning strategies that assign the convolution kernels in each layer of the network to different tasks.

In general, these works have a different scope within MTL, e.g. automating the network architecture design. Moreover, they mostly focus on solving multiple (binary) classification tasks rather than multiple dense prediction tasks. They therefore fall outside the scope of this survey, with one notable exception that is discussed next.

[56] Learning multiple tasks with multilinear relationship networks. NIPS, 2017.

[57] Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. ICCV, 2019.

 

[64] Beyond shared hierarchies: Deep multitask learning through soft layer ordering. ICLR, 2018.

[65] Deep multi-task representation learning: A tensor factorisation approach. arXiv:1605.06391, 2016.

[66] Routing networks: Adaptive selection of non-linear functions for multi-task learning. ICLR, 2018.

[67] Piggyback: Adapting a single network to multiple tasks by learning to mask weights. ECCV, 2018.

[68] Gnas: A greedy neural architecture search method for multi-attribute learning. ACMMM, 2018.

[69] Feature partitioning for efficient multi-task architectures. arXiv:1908.04339, 2019.

Attentive Single-Tasking of Multiple Tasks (ASTMT) [18] proposed to take a 'single-tasking' route for the MTL problem. That is, within a multi-tasking framework they performed separate forward passes, one for each task, that activate shared responses among all tasks, plus some residual responses that are task-specific. Furthermore, to suppress the negative transfer issue they applied adversarial training at the gradient level, which enforces the gradients to be statistically indistinguishable across tasks. An advantage of this approach is that shared and task-specific information within the network can be naturally disentangled. On the negative side, however, the tasks can not be predicted altogether, but only one after the other, which significantly increases the inference time and somehow defies the purpose of MTL.

Attentive Single-Tasking of Multiple Tasks (ASTMT) [18] takes a 'single-tasking' route for the MTL problem: within a multi-task framework, a separate forward pass is performed for each task, activating responses that are shared among all tasks plus some residual responses that are task-specific. In addition, to suppress negative transfer, adversarial training is applied at the gradient level, forcing the gradients to be statistically indistinguishable across tasks. The advantage of this approach is that shared and task-specific information in the network can be naturally disentangled. The downside is that the tasks cannot all be predicted at once, but only one after the other, which significantly increases the inference time and to some extent defeats the purpose of MTL.
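The 'single-tasking' idea can be sketched as a network block with a shared response plus a task-specific residual adapter, selected per forward pass. This is only a schematic under assumed names (the adversarial gradient training is not shown); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskAdaptedBlock(nn.Module):
    """One shared convolution plus a per-task residual adapter, used one task at a time."""
    def __init__(self, channels, tasks):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)        # shared response
        self.adapters = nn.ModuleDict({                                  # task-specific residual response
            t: nn.Conv2d(channels, channels, 1) for t in tasks})

    def forward(self, x, task):
        shared = torch.relu(self.shared(x))
        return shared + self.adapters[task](shared)                      # residual adapter for the active task

# One forward pass per task ("single-tasking"), e.g.:
# block = TaskAdaptedBlock(64, ["depth", "semseg"])
# depth_feat  = block(features, "depth")
# semseg_feat = block(features, "semseg")
```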

____________________\triangle____________________

[ Links ]

Since the full survey is quite long, it is covered in four separate posts; the related links are listed below:

Multi-Task Learning for Dense Prediction Tasks: A Survey - Abstract & Introduction

Multi-Task Learning for Dense Prediction Tasks: A Survey - Network Architectures (Part 1)

Multi-Task Learning for Dense Prediction Tasks: A Survey - Network Architectures (Part 2)

Multi-Task Learning for Dense Prediction Tasks: A Survey - Optimization Methods

____________________\triangledown____________________

[ Extension ]

Multi-Task Learning with Deep Neural Networks: A Survey (2020)
