A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Optimization Methods

[ TPAMI 2021 ]

Multi-Task Learning for Dense Prediction Tasks:

A Survey

[ The authors ]

Simon Vandenhende, Wouter Van Gansbeke and Marc Proesmans

Center for Processing Speech and Images, Department Electrical Engineering, KU Leuven.

Stamatios Georgoulis and Dengxin Dai

Computer Vision Lab, Department Electrical Engineering, ETH Zurich.

Luc Van Gool

Center for Processing Speech and Images, KU Leuven;

Computer Vision Lab, ETH Zurich.

[ Paper | Code ]

Multi-Task Learning for Dense Prediction Tasks: A Survey

GitHub - SimonVandenhende/Multi-Task-Learning-PyTorch: PyTorch implementation of multi-task learning architectures, incl. MTI-Net (ECCV2020).

Figure 1 shows a structured overview of the paper. Our code is made publicly available to ease the adoption of the reviewed MTL techniques: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch.

[ CSDN Links ]

Since the full survey is quite long, it is split into 4 parts that are covered in separate posts. The related blog posts are linked below:

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Abstract and Introduction

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Network Architectures (Part 1)

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Network Architectures (Part 2)

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Optimization Methods

____________________ ▽ ____________________

3  Optimization in MTL

In the previous section, we discussed the construction of network architectures that are able to learn multiple tasks concurrently. Still, a significant challenge in MTL stems from the optimization procedure itself. In particular, we need to carefully balance the joint learning of all tasks to avoid a scenario where one or more tasks have a dominant influence on the network weights. In this section, we discuss several methods that have considered this task balancing problem.

3.1 Task Balancing Approaches

Without loss of generality, the optimization objective in an MTL problem, assuming task-specific weights w_i and task-specific loss functions L_i, can be formulated as
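
The formula itself did not survive the repost; written out in the notation above, the weighted-sum objective (Equation 4) is:

L_{MTL}(W) = \sum_{i} w_i \cdot L_i(W)        (Equation 4)

with W denoting the network weights.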

When using stochastic gradient descent to minimize the objective from Equation 4, which is the standard approach in the deep learning era, the network weights in the shared layers W_{sh} are updated by the following rule
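
Equation 5, reconstructed here with \eta as our symbol for the learning rate, is the corresponding update rule:

W_{sh} \leftarrow W_{sh} - \eta \sum_{i} w_i \cdot \frac{\partial L_i}{\partial W_{sh}}        (Equation 5)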

From Equation 5 we can draw the following conclusions. First, the network weight update can be suboptimal when the task gradients conflict, or dominated by one task when its gradient magnitude is much higher than that of the other tasks. This motivated researchers [8], [19], [20], [21] to balance the gradient magnitudes by setting the task-specific weights w_i in the loss. To this end, other works [22], [26], [70] have also considered the influence of the direction of the task gradients. Second, each task’s influence on the network weight update can be controlled, either indirectly by adapting the task-specific weights w_i in the loss, or directly by operating on the task-specific gradients \partial L_i / \partial W_{sh}. A number of methods that tried to address these problems are discussed next.

3.1.1 Uncertainty Weighting

Kendall et al. [19] used the homoscedastic uncertainty to balance the single-task losses. The homoscedastic uncertainty or task-dependent uncertainty is not an output of the model, but a quantity that remains constant for different input examples of the same task. The optimization procedure is carried out to maximise a Gaussian likelihood objective that accounts for the homoscedastic uncertainty. In particular, they optimize the model weights W and the noise parameters σ1, σ2 to minimize the following objective
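
Following Kendall et al. [19], the two-task objective (Equation 6) can be reconstructed as:

L(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2} L_1(W) + \frac{1}{2\sigma_2^2} L_2(W) + \log \sigma_1 \sigma_2        (Equation 6)

The log term acts as a regularizer that keeps the noise parameters from growing without bound.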

The loss functions L_1, L_2 belong to the first and second task respectively. By minimizing the loss L w.r.t. the noise parameters σ1, σ2, one can essentially balance the task-specific losses during training. The optimization objective in Equation 6 can easily be extended to account for more than two tasks too. The noise parameters are updated through standard backpropagation during training.

Note that, increasing the noise parameter σi reduces the weight for task i. Consequently, the effect of task i on the network weight update is smaller when the task’s homoscedastic uncertainty is high. This is advantageous when dealing with noisy annotations since the task-specific weights will be lowered automatically for such tasks.
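
As a concrete illustration, a minimal PyTorch sketch of this weighting scheme could look as follows. It learns log σ_i^2 rather than σ_i for numerical stability; the class name and this parameterization are our own choices, not code from the paper or from the survey's repository.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Minimal sketch of homoscedastic uncertainty weighting (Kendall et al. [19]).
    We learn s_i = log(sigma_i^2) per task; names and parameterization are ours."""

    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_i = log(sigma_i^2)

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])  # 1 / sigma_i^2
            # 0.5 * precision * L_i + 0.5 * log(sigma_i^2) == L_i / (2 sigma_i^2) + log(sigma_i)
            total = total + 0.5 * precision * loss + 0.5 * self.log_vars[i]
        return total
```

The module's parameters are simply added to the same optimizer as the network weights, so the noise parameters are indeed updated through standard backpropagation, as described above.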

3.1.2 Gradient Normalization

Gradient normalization (GradNorm) [20] proposed to control the training of multi-task networks by stimulating the task-specific gradients to be of similar magnitude. By doing so, the network is encouraged to learn all tasks at an equal pace. Before presenting this approach, we introduce the necessary notations in the following paragraph.

We define the L2 norm of the gradient of the weighted single-task loss w_i(t) \cdot L_i(t) at step t w.r.t. the weights W as G^W_i(t). We additionally define the following quantities,

• the mean task gradient \bar{G}^W(t), averaged across all task gradients G^W_i(t) w.r.t. the weights W at step t: \bar{G}^W(t) = E_{task}[G^W_i(t)];

• the inverse training rate \tilde{L}_i(t) of task i at step t: \tilde{L}_i(t) = L_i(t) / L_i(0);

• the relative inverse training rate of task i at step t: r_i(t) = \tilde{L}_i(t) / E_{task}[\tilde{L}_i(t)].

GradNorm aims to balance two properties during the training of a multi-task network.

First, balancing the gradient magnitudes G^W_i. To achieve this, the mean gradient \bar{G}^W is considered as a common basis from which the relative gradient sizes across tasks can be measured.

Second, balancing the pace at which different tasks are learned. The relative inverse training rate r_i(t) is used to this end. When the relative inverse training rate r_i(t) increases, the gradient magnitude G^W_i(t) for task i should increase as well to stimulate the task to train more quickly. GradNorm tackles both objectives by minimizing the following loss 
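
Following Chen et al. [20], the GradNorm objective (Equation 7) can be reconstructed as follows, with \alpha an additional hyperparameter from the GradNorm paper that sets the strength of the balancing force:

L_{grad}(t; w_i(t)) = \sum_{i} \left| G^W_i(t) - \bar{G}^W(t) \cdot [r_i(t)]^{\alpha} \right|_1        (Equation 7)

where the target \bar{G}^W(t) \cdot [r_i(t)]^{\alpha} is treated as a constant when differentiating w.r.t. the task-specific weights.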

Remember that, the gradient magnitude G^W_i(t) for task i depends on the weighted single-task loss w_i(t) \cdot L_i(t). As a result, the objective in Equation 7 can be minimized by adjusting the task-specific weights w_i. In practice, during training these task-specific weights are updated in every iteration using backpropagation. After every update, the task-specific weights w_i(t) are re-normalized in order to decouple the learning rate from the task-specific weights.

Note that, calculating the gradient magnitude G^W_i(t) requires a backward pass through the task-specific layers of every task i. However, savings on computation time can be achieved by considering the task gradient magnitudes only w.r.t. the weights in the last shared layer.
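
To make the procedure concrete, the sketch below performs one GradNorm-style update in PyTorch, computing the gradient norms only w.r.t. the parameters of the last shared layer as suggested above. The function signature, the values of alpha and lr_w, and the plain gradient step on the task weights are our simplifications, not the authors' reference implementation.

```python
import torch

def gradnorm_update(task_losses, initial_losses, task_weights, shared_params,
                    alpha=1.5, lr_w=0.025):
    """One illustrative GradNorm step (Chen et al. [20]). `task_weights` are leaf
    tensors with requires_grad=True, `shared_params` the parameters of the last
    shared layer, `initial_losses` the values L_i(0) recorded at the start of
    training. All names and the values of alpha / lr_w are our own choices."""
    shared_params = list(shared_params)
    weighted_losses = [w * l for w, l in zip(task_weights, task_losses)]

    # Gradient norms G^W_i(t) of each weighted loss w.r.t. the last shared layer only.
    grad_norms = []
    for wl in weighted_losses:
        grads = torch.autograd.grad(wl, shared_params, retain_graph=True, create_graph=True)
        grad_norms.append(torch.cat([g.flatten() for g in grads]).norm(2))
    grad_norms = torch.stack(grad_norms)

    # Relative inverse training rates r_i(t); the target is treated as a constant.
    with torch.no_grad():
        loss_ratios = torch.stack([l.detach() / l0 for l, l0 in zip(task_losses, initial_losses)])
        r = loss_ratios / loss_ratios.mean()
        target = grad_norms.mean() * r.pow(alpha)

    gradnorm_loss = (grad_norms - target).abs().sum()  # Equation 7
    weight_grads = torch.autograd.grad(gradnorm_loss, task_weights, retain_graph=True)

    with torch.no_grad():
        for w, g in zip(task_weights, weight_grads):
            w -= lr_w * g
        # Re-normalize so that the task-specific weights keep summing to the number of tasks.
        scale = len(task_weights) / sum(w for w in task_weights)
        for w in task_weights:
            w *= scale

    return sum(weighted_losses)  # backpropagated by the caller to update the network weights
```

The returned weighted loss is backpropagated with the regular optimizer to update the network weights, while the task-specific weights are only adjusted through the GradNorm loss, mirroring the procedure described above.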

Different from uncertainty weighting, GradNorm does not take into account the task-dependent uncertainty to reweight the task-specific losses. Rather, GradNorm tries to balance the pace at which tasks are learned, while avoiding gradients of different magnitude.

3.1.3 Dynamic Weight Averaging

Similarly to GradNorm, Liu et al. [8] proposed a technique, termed Dynamic Weight Averaging (DWA), to balance the pace at which tasks are learned. Differently, DWA only requires access to the task-specific loss values. This avoids having to perform separate backward passes during training in order to obtain the task-specific gradients. In DWA, the task-specific weight w_i for task i at step t is set as
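
Following Liu et al. [8], the weighting rule (Equation 8) can be reconstructed as:

w_i(t) = \frac{N \exp(r_i(t-1) / T)}{\sum_{n} \exp(r_n(t-1) / T)}, \quad r_n(t-1) = \frac{L_n(t-1)}{L_n(t-2)}        (Equation 8)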

with N being the number of tasks. The scalars r_n(·) estimate the relative descending rate of the task-specific loss values L_n. The temperature T controls the softness of the task weighting in the softmax operator. When the loss of a task decreases at a slower rate compared to other tasks, the task-specific weight in the loss is increased.

Note that, the task-specific weights w_i are solely based on the rate at which the task-specific losses change. Such a strategy requires balancing the overall loss magnitudes beforehand, else some tasks could still overwhelm the others during training. GradNorm avoids this problem by balancing both the training rates and the gradient magnitudes through a single objective (see Equation 7).
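
A minimal sketch of this rule, assuming the two most recent loss values per task have been recorded (function and variable names are ours):

```python
import math

def dwa_weights(loss_history, T=2.0):
    """Illustrative DWA weighting [8]: loss_history[n] is a list with the recorded
    loss values of task n; T controls the softness of the weighting."""
    N = len(loss_history)
    if any(len(h) < 2 for h in loss_history):
        return [1.0] * N                          # warm-up iterations: equal weights
    r = [h[-1] / h[-2] for h in loss_history]     # relative descending rates r_n(t-1)
    exps = [math.exp(x / T) for x in r]
    s = sum(exps)
    return [N * e / s for e in exps]              # weights sum to N, as in Equation 8
```

With T = 1 the weighting reduces to an N-scaled softmax over the loss ratios; larger values of T spread the weights more evenly across the tasks.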

3.1.4 Dynamic Task Prioritization

The task balancing techniques in Sections 3.1.1-3.1.3 opted to optimize the task-specific weights w_i as part of a Gaussian likelihood objective [19], or in order to balance the pace at which the different tasks are learned [8], [20]. In contrast, Dynamic Task Prioritization (DTP) [21] opted to prioritize the learning of ’difficult’ tasks by assigning them a higher task-specific weight. The motivation is that the network should spend more effort to learn the ’difficult’ tasks. Note that, this is opposed to uncertainty weighting, where a higher weight is assigned to the ’easy’ tasks. We hypothesize that the two techniques do not necessarily conflict, but uncertainty weighting seems better suited when tasks have noisy labeled data, while DTP makes more sense when we have access to clean ground-truth annotations.

To measure the task difficulty, one could consider the progress on every task using the loss ratio \tilde{L}_i(t) defined by GradNorm. However, since the loss ratio depends on the initial loss L_i(0), its value can be rather noisy and initialization dependent. Furthermore, measuring the task progress using the loss ratio might not accurately reflect the progress on a task in terms of qualitative results. Therefore, DTP proposes the use of key performance indicators (KPIs) to quantify the difficulty of every task. In particular, a KPI \kappa_i is selected for every task i, with 0 < \kappa_i < 1. The KPIs are picked to have an intuitive meaning, e.g. accuracy for classification tasks. For regression tasks, the prediction error can be thresholded to obtain a KPI that lies between 0 and 1. Further, we define a task-level focusing parameter \gamma_i ≥ 0 that allows us to adjust the rate at which easy or hard tasks are down-weighted. DTP sets the task-specific weight w_i for task i at step t as
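
Following the focal-loss formulation of DTP [21], Equation 9 can be reconstructed as:

w_i(t) = FL(\kappa_i(t); \gamma_i) = -(1 - \kappa_i(t))^{\gamma_i} \log(\kappa_i(t))        (Equation 9)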

Note that, Equation 9 employs a focal loss expression [71] to down-weight the task-specific weights for the ’easy’ tasks. In particular, as the value of the KPI \kappa_i increases, the weight w_i for task i is reduced.

DTP requires the KPIs to be carefully selected. For example, consider choosing a threshold to measure the performance on a regression task. Depending on the threshold’s value, the task-specific weight will be higher or lower during training. We conclude that the choice of the KPIs in DTP is not determined in a straightforward manner. Furthermore, similar to DWA, DTP requires the overall magnitude of the loss values to be balanced beforehand. After all, Equation 9 does not take into account the loss magnitudes to calculate the task-specific weights. As a result, DTP still involves manual tuning to set the task-specific weights.

3.1.5 MTL as Multi-Objective Optimization

A global optimum for the multi-task optimization objective in Equation 4 is hard to find. Due to the complex nature of this problem, a certain choice that improves the performance for one task could lead to performance degradation for another task. The task balancing methods discussed beforehand try to tackle this problem by setting the task-specific weights in the loss according to some heuristic. Differently, Sener and Koltun [22] view MTL as a multi-objective optimization problem, with the overall goal of finding a Pareto optimal solution among all tasks.

In MTL, a Pareto optimal solution is reached when the following condition is satisfied: the loss of no task can be decreased further without increasing the loss on at least one of the other tasks. A multiple gradient descent algorithm (MGDA) [72] was proposed in [22] to find a Pareto stationary point. In particular, the shared network weights are updated by finding a common direction among the task-specific gradients. As long as there is a common direction along which the task-specific losses can be decreased, we have not reached a Pareto optimal point yet. An advantage of this approach is that, since the shared network weights are only updated along common directions of the task-specific gradients, conflicting gradients are avoided in the weight update step.
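
For intuition, in the two-task case the common-direction search used by MGDA reduces to a one-dimensional quadratic problem. A standard statement of this sub-problem, reconstructed from Sener and Koltun [22] rather than quoted from the survey, is:

\min_{\alpha \in [0,1]} \left\| \alpha \frac{\partial L_1}{\partial W_{sh}} + (1-\alpha) \frac{\partial L_2}{\partial W_{sh}} \right\|_2^2

which has the closed-form solution \hat{\alpha} = \frac{(\nabla L_2 - \nabla L_1)^{\top} \nabla L_2}{\| \nabla L_1 - \nabla L_2 \|_2^2}, clipped to [0, 1]; the shared weights are then updated along -(\hat{\alpha} \nabla L_1 + (1-\hat{\alpha}) \nabla L_2).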

Lin et al. [23] observed that MGDA only finds one out of many Pareto optimal solutions. Moreover, it is not guaranteed that the obtained solution will satisfy the users’ needs. To address this problem, they generalized MGDA to generate a set of well-representative Pareto solutions from which a preferred solution can be selected. So far, however, the method was only applied to small-scale datasets (e.g. Multi-MNIST).

3.1.6 Discussion

In Section 3.1, we described several methods for balancing the influence of each task when training a multi-task network. Table 1 provides a qualitative comparison of the described methods. We summarize some conclusions below.

(1) We find discrepancies between these methods, e.g. uncertainty weighting assigns a higher weight to the ’easy’ tasks, while DTP advocates the opposite. This discrepancy can be attributed to the fact that, in the literature, the different task balancing strategies were often evaluated on different datasets or task dictionaries. We suspect that an appropriate task balancing strategy should be decided for each case individually.

(2) We also find commonalities between the described methods, e.g. uncertainty weighting, GradNorm and MGDA opted to balance the loss magnitudes as part of their learning strategy. In Section 4.4, we provide extensive ablation studies under more common datasets or task dictionaries to verify what task balancing strategies are most useful to improve the multi-tasking performance, and under which circumstances.

(3) A number of works (e.g. DWA, DTP) still require careful manual tuning of the initial hyperparameters, which can limit their applicability when dealing with a larger number of tasks.

3.2 Other Approaches

The task balancing works in Section 3.1 can be plugged into most existing multi-task architectures to regulate the task learning. Another group of works also tried to regulate the training of multi-task networks, albeit for more specific setups. We touch upon several of these approaches here. Note that, some of these concepts can be combined with task balancing strategies too.

Zhao et al. [26] empirically found that tasks with gradients pointing in opposite directions can cause destructive gradient interference. This observation is related to the update rule in Equation 5. They proposed to add a modulation unit to the network in order to alleviate the competing gradients issue during training.

Liu et al. [24] considered a specific multi-task architecture where the feature space is split into a shared and a task-specific part. They argue that the shared features should contain more common information, and no information that is specific to a particular task only. The network weights are regularized by enforcing this prior. More specifically, an adversarial approach is used to prevent task-specific features from creeping into the shared representation. Similarly, [18], [25] added an adversarial loss to the single-task gradients in order to make them statistically indistinguishable from each other in the shared parts of the network.

Chen et al. [29] proposed gradient sign dropout, a modular layer that can be plugged into any network with multiple gradient signals. Following MGDA, the authors argue that conflicts in the weight update arise when the gradient values of the different learning signals have opposite signs. Gradient sign dropout operates by choosing the sign of the gradient based on the distribution of the gradient values, and masking out the gradient values with opposite sign. It is shown that the method has several desirable properties, and increases performance and robustness compared to competing works.

Finally, some works relied on heuristics to balance the tasks. Sanh et al. [27] trained the network by randomly sampling a single task for the weight update during every iteration. The sampling probabilities were set proportionally to the available amount of training data for every task. Raffel et al. [28] used temperature scaling to balance the tasks. So far, however, both procedures were used in the context of natural language processing.
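
As an illustration of these heuristics, the sketch below combines both ideas: sampling probabilities proportional to the amount of training data, in the spirit of Sanh et al. [27], optionally flattened by a temperature in the spirit of Raffel et al. [28]. The function, its arguments, and the optional cap K are our own construction, not code from either paper.

```python
def task_sampling_probabilities(dataset_sizes, T=1.0, K=None):
    """Sampling probability per task: proportional to dataset size when T=1,
    pushed towards uniform for larger T. K optionally caps very large datasets."""
    rates = [min(n, K) if K is not None else n for n in dataset_sizes]
    scaled = [r ** (1.0 / T) for r in rates]
    total = sum(scaled)
    return [s / total for s in scaled]

# Example: three tasks with very different amounts of training data.
probs = task_sampling_probabilities([100_000, 10_000, 1_000], T=2.0)
```

At every training iteration, a single task can then be drawn according to these probabilities and only its loss is used for the weight update; with T = 1 the sampling is purely proportional to the dataset sizes, while larger temperatures move it towards uniform sampling over the tasks.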

____________________ △ ____________________

[ Links ]

Since the full survey is quite long, it is split into 4 parts that are covered in separate posts. The related blog posts are linked below:

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Abstract and Introduction

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Network Architectures (Part 1)

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Network Architectures (Part 2)

A Survey of Multi-Task Learning (MTL) for Dense Prediction Tasks - Optimization Methods

____________________ ▽ ____________________

[ Extension ]

Multi-Task Learning with Deep Neural Networks: A Survey (2020)


Reposted from: blog.csdn.net/u014546828/article/details/121222848