Intensive reading of the paper - Gradient Surgery for Multi-Task Learning

Introduction to Multi-Task Learning and the PCGrad Method

Multi-task Learning (MTL) is a popular research area in machine learning in which a single model learns multiple related tasks simultaneously. The core motivation is that different tasks may share underlying representations or features, so they can assist each other and improve learning efficiency. However, training can become difficult when the gradient update directions of different tasks conflict.

The PCGrad method proposed in the paper "Gradient Surgery for Multi-Task Learning" addresses this problem. When the gradients of multiple tasks conflict with each other during an update, PCGrad modifies ("trims") these gradients so that they no longer conflict. Specifically, it removes the conflict by projecting each task's gradient onto the plane orthogonal to any other task's gradient with which it conflicts.

The intuition is that if the gradients produced by two tasks conflict (for example, one task pushes the model weights in one direction while the other pushes them in the opposite direction), the conflict can hinder learning. To mitigate this, PCGrad trims the gradients so that each task's gradient has no component along directions that conflict with the other tasks' gradients. In this way, each task only updates the model weights in directions that do not conflict with the other tasks.

The experiments in the paper demonstrate that this approach is effective in multi-task scenarios, both in multi-task supervised learning and multi-task reinforcement learning (RL).

Explanation - "It eliminates conflicts between gradients by projecting the gradients of each task onto orthogonal planes in directions that conflict with the gradients of other tasks" :

Use a simplified 2D example to understand this concept:

Suppose we have two tasks A and B, and we are considering gradients in 2D space. So, there is a gradient vector for each task.

  1. Example situation:
    • The gradient vector of task A is $\vec{g_A}$, pointing to the upper right.
    • The gradient vector of task B is $\vec{g_B}$, pointing to the lower right.

These two vectors are aligned horizontally (both push the model weights to the right), but opposed vertically (one up, one down).

  2. Find the conflict:

    • Since the two gradients conflict in the vertical direction, we need to resolve this conflict.
  3. Resolve the conflict by projecting onto the orthogonal plane:

    • For task A, we want to eliminate its gradient conflict with task B. Specifically, we look for a vector that is orthogonal to $\vec{g_B}$ and as close as possible to the direction of $\vec{g_A}$. This vector is the projection of $\vec{g_A}$ onto the direction orthogonal to $\vec{g_B}$.
    • Similarly, for task B, we look for a vector that is orthogonal to $\vec{g_A}$ and as close as possible to the direction of $\vec{g_B}$. This vector is the projection of $\vec{g_B}$ onto the direction orthogonal to $\vec{g_A}$.
  4. Result:

    • Both tasks now have corrected gradient vectors. The new vectors stay close to the original directions, but they have no components along directions that conflict with the other task's gradient.

Illustration:
Imagine a coordinate system. In this coordinate system, the gradient of task A is a vector pointing from the origin to the first quadrant. The gradient of task B is a vector pointing from the origin to the fourth quadrant. To eliminate the conflict in the vertical direction, we will find a new vector that is the projection of the gradient of task A in the horizontal direction (the orthogonal direction of the gradient of task B). In this way, we get a vector that is entirely in the horizontal direction and does not conflict with the gradient of task B.

Similarly, the gradient of task B will also be projected to the horizontal direction to obtain a new gradient vector that does not conflict with the gradient of task A.

In this way, both tasks can perform gradient updates simultaneously without interfering with each other.


Computing the projection of one vector onto the direction orthogonal to another vector:

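Below is a minimal NumPy sketch of this computation (an illustration with toy values, not code from the paper): we remove from $\vec{g_A}$ its component along $\vec{g_B}$, leaving only the part of $\vec{g_A}$ orthogonal to $\vec{g_B}$.

```python
import numpy as np

g_A = np.array([1.0, 2.0])    # task A: points to the upper right
g_B = np.array([2.0, -3.0])   # task B: points to the lower right

if g_A @ g_B < 0:             # negative dot product => conflicting gradients
    g_A_pc = g_A - (g_A @ g_B) / (g_B @ g_B) * g_B
else:
    g_A_pc = g_A              # non-conflicting gradients are left unchanged

print(g_A_pc)                 # ~[1.615, 1.077]
print(g_A_pc @ g_B)           # 0.0: the conflicting component has been removed
```

The sign check on the dot product mirrors PCGrad's rule that only conflicting gradients are modified.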

Paper information

Paper title: Gradient Surgery for Multi-Task Learning
Authors: Tianhe Yu¹, Saurabh Kumar¹, Abhishek Gupta², Sergey Levine², Karol Hausman³, Chelsea Finn¹
Research institutions: ¹Stanford University, ²UC Berkeley, ³Robotics at Google
Venue: NeurIPS
Year of publication: 2020
Paper link: https://proceedings.neurips.cc/paper/2020/file/3fe78a8acf5fda99de95303940a2420c-Paper.pdf
Open-source code: https://paperswithcode.com/paper/gradient-surgery-for-multi-task-learning-1
Paper contribution: In this paper, the authors identify three conditions of the multi-task optimization landscape that cause detrimental gradient interference and develop a simple and general method to avoid such interference between task gradients. The authors propose a form of gradient surgery that projects the gradient of a task onto the normal plane of the gradient of any other task with a conflicting gradient. This approach brings substantial improvements in efficiency and performance on a range of challenging multi-task supervised and multi-task RL problems. Furthermore, it is model-agnostic and can be combined with previously proposed multi-task architectures to enhance performance.

Paper core diagram


Abstract (translation)

Although deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in areas such as image classification, game playing, and robotic control, data efficiency remains a significant challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify three conditions of the multi-task optimization landscape that lead to detrimental gradient interference, and we develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. This approach leads to substantial gains in efficiency and performance on a range of challenging multi-task supervised and multi-task RL problems. Moreover, it is model-agnostic and can be combined with previously proposed multi-task architectures to further improve performance.

Introduction (translation)


Figure 1: Visualization of PCGrad on a 2D multi-task optimization problem. (a) A multi-task objective landscape. (b) and (c) are contour plots of the individual task objectives that make up (a). (d) The trajectory of gradient updates on the multi-task objective using the Adam optimizer. The gradient vectors of the two tasks at the end of the trajectory are indicated by blue and red arrows, where the relative lengths are on a log scale. (e) The trajectory of gradient updates on the multi-task objective using Adam with PCGrad. In (d) and (e), the color of the optimization trajectory changes from black to yellow as optimization progresses.

Although deep learning and deep reinforcement learning (RL) have shown great promise in enabling systems to learn complex tasks, the data requirements of current methods make it difficult to learn each task individually from scratch. Faced with this multi-task learning problem, an intuitive approach is to train all tasks jointly, aiming to discover shared structure across tasks in order to achieve higher efficiency and performance than solving the tasks individually. However, learning multiple tasks simultaneously becomes a difficult optimization problem, which sometimes performs worse than learning each task individually [42, 50]. These optimization problems are so common that many multi-task RL algorithms first train the tasks separately and then combine these independent models into a single multi-task model [32, 42, 50, 21, 56], giving up the efficiency gains over independent training. If we could effectively solve the optimization challenges of multi-task learning, we might actually realize the expected advantages of multi-task learning without sacrificing final performance.

Although a large number of multi-task learning studies have been conducted [6, 49], the optimization challenges are not well understood. Prior work has pointed out that differing learning speeds across tasks [8, 26] and plateaus in the optimization landscape [52] may be responsible, while other studies have focused on the model architecture [40, 33]. **In this work, we hypothesize that the main optimization problem in multi-task learning arises from conflicting gradients from different tasks hindering learning progress. We say two gradients conflict when they point in opposite directions (i.e., their cosine similarity is negative).** This conflict becomes detrimental when (a) conflicting gradients coincide with (b) high positive curvature and (c) large differences in gradient magnitude.

For example, consider the 2D optimization landscape of two task objectives in Figure 1a-c. The optimization landscape of each task contains a deep valley, a phenomenon observed in neural network optimization [22], and each valley has high positive curvature with a large difference in the magnitudes of the task gradients. In this case, the gradient of one task dominates the multi-task gradient, sacrificing the performance of the other task. Due to the high curvature, the improvement on the dominating task may be overestimated, while the performance degradation on the non-dominating task may be underestimated.

As a result, the optimizer struggles to make progress on the optimization objective. In Figure 1d, the optimizer reaches the deep valley of task 1 but cannot traverse the valley floor due to conflicting gradients, high curvature, and large differences in gradient magnitude (the gradients at this point are shown in Figure 1d). In Section 5.3, we experimentally show that this situation also occurs in higher-dimensional neural network multi-task learning problems.

**The main contribution of this work is a method for mitigating gradient interference by directly modifying the gradients, i.e., "gradient surgery". If two gradients conflict, we adjust the gradients by projecting each onto the normal plane of the other, thereby preventing the interfering components of the gradients from being applied to the network. We call this particular form of gradient surgery projecting conflicting gradients (PCGrad).** PCGrad does not rely on a specific model form and requires only a single modification to the application of gradients. Therefore, it can easily be applied to a variety of problem settings, including multi-task supervised learning and multi-task reinforcement learning, and can be combined with other multi-task learning methods (e.g., those that modify the architecture). We theoretically show under what local conditions PCGrad outperforms standard multi-task gradient descent, and we empirically evaluate PCGrad on a variety of challenging problems, including multi-task CIFAR classification, multi-objective scene understanding, challenging multi-task RL domains, and goal-conditioned RL. Overall, we find that PCGrad brings significant improvements in data efficiency, optimization speed, and final performance compared to prior approaches, including more than 30% absolute improvement on multi-task reinforcement learning problems. Furthermore, on multi-task supervised learning tasks, PCGrad can be combined with previous multi-task learning approaches to achieve even higher performance.

2 Multi-Task Learning with PCGrad

Although, in principle, the multi-task problem can be addressed by using standard single-task algorithms with an appropriate task identifier provided to the model, or with a simple multi-head or multi-output model, prior work [42, 50, 53] has found this problem to be quite difficult. In this section, we introduce notation, identify the difficulties of multi-task optimization, propose a simple and general approach to mitigate them, and analyze this approach theoretically.

2.1 Preliminaries: Problem and Notation

The goal of multi-task learning is to find parameters $\theta$ of a model $f_{\theta}$ that achieve high average performance across all training tasks drawn from a task distribution $p(\mathcal{T})$. More formally, we aim to solve the problem:
$$\min_{\theta} \; \mathbb{E}_{\mathcal{T}_{i} \sim p(\mathcal{T})}\left[\mathcal{L}_{i}(\theta)\right],$$
where $\mathcal{L}_{i}$ is the loss function of the $i$-th task $\mathcal{T}_{i}$ that we wish to minimize. For a set of tasks $\{\mathcal{T}_{i}\}$, we denote the multi-task loss as $\mathcal{L}(\theta) = \sum_{i} \mathcal{L}_{i}(\theta)$ and the gradient of each task as $\mathbf{g}_{i} = \nabla \mathcal{L}_{i}(\theta)$ for a particular $\theta$. To obtain a model for a specific task drawn from the task distribution $p(\mathcal{T})$, we define a task-conditioned model $f_{\theta}(y \mid x, z_{i})$, with input $x$, output $y$, and an encoding $z_{i}$ of task $\mathcal{T}_{i}$, which could be provided as a one-hot encoding or in any other form.

Explanation of the formula $f_{\theta}(y \mid x, z_{i})$:

This formula represents a conditional probability model; decomposing it makes the meaning clearer.

  1. Function form: $f_{\theta}$

This is a function parameterized by $\theta$. Typically, in the context of deep learning or machine learning, $f_{\theta}$ is a neural network or another type of model, where $\theta$ represents the parameters of the model, such as weights and biases.

  2. Conditioning symbol: $\mid$

This symbol means "given" or "conditioned on" and is commonly used to express conditional probability. Here, it means that we want to predict $y$, but the prediction is conditioned on the given $x$ and $z_{i}$.

  3. Variables: $y, x, z_{i}$
  • $y$: the variable we want to predict, i.e., the output.

  • $x$: the input variable or features. In many machine learning tasks, $x$ represents the data on which we base our predictions.

  • $z_{i}$: the encoding of task $\mathcal{T}_{i}$. As the original text notes, this encoding may be a one-hot vector or take some other form that identifies or describes task $\mathcal{T}_{i}$. Such an encoding is common in multi-task learning and tells the model which specific task it should work on.

Summary: $f_{\theta}(y \mid x, z_{i})$ is a model parameterized by $\theta$ that attempts to predict $y$ given the input $x$ and the task encoding $z_{i}$. This form usually appears in multi-task learning scenarios, where the model needs to know which specific task it is working on in order to make correct predictions.
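To make this concrete, here is a small PyTorch sketch (our own illustration, not code from the paper) of a task-conditioned model in which the one-hot task encoding $z_{i}$ is simply concatenated with the input $x$; the names `TaskConditionedModel`, `x_dim`, `num_tasks`, and `hidden` are our own assumptions.

```python
import torch
import torch.nn as nn

class TaskConditionedModel(nn.Module):
    """A toy f_theta(y | x, z_i): the task code z_i is concatenated with the input x."""
    def __init__(self, x_dim: int, num_tasks: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + num_tasks, hidden),   # input x concatenated with z_i
            nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, z], dim=-1))  # prediction of y given (x, z_i)

# Example: a batch of inputs for task 0 out of 3 tasks, encoded as one-hot vectors.
model = TaskConditionedModel(x_dim=10, num_tasks=3, y_dim=1)
x = torch.randn(4, 10)
z = nn.functional.one_hot(torch.zeros(4, dtype=torch.long), num_classes=3).float()
y_pred = model(x, z)
```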

2.2 The Tragic Triad: Conflicting Gradients, Dominating Gradients, High Curvature

We hypothesize that a key optimization problem in multi-task learning arises from conflicting gradients, i.e., gradients for different tasks that point away from each other, as measured by a negative inner product. However, conflicting gradients are not harmful by themselves. In fact, simply averaging the task gradients should provide a reasonable solution for decreasing the multi-task objective. However, in some cases such conflicting gradients can lead to a significant drop in performance. Consider an optimization problem with two tasks. If the magnitude of one task's gradient is much larger than that of the other, it will dominate the average gradient. If there is also high positive curvature along the direction of the task gradients, the performance gain on the dominating task may be greatly overestimated, and the performance drop on the dominated task may be greatly underestimated. We can therefore characterize the co-occurrence of three conditions as follows: (a) the gradients of multiple tasks conflict with each other, (b) the gradient magnitudes differ greatly, causing some task gradients to dominate the others, and (c) there is high curvature in the multi-task optimization landscape. Below, we formally define these three conditions.

$\textbf{Definition 1.}$ We define $\phi_{ij}$ as the angle between two task gradients $\mathbf{g}_{i}$ and $\mathbf{g}_{j}$. We say the gradients conflict when $\cos \phi_{ij} < 0$.

$\textbf{Definition 2.}$ We define the gradient magnitude similarity between two gradients $\mathbf{g}_{i}$ and $\mathbf{g}_{j}$ as
$$\Phi(\mathbf{g}_{i}, \mathbf{g}_{j}) = \frac{2\|\mathbf{g}_{i}\|_{2} \|\mathbf{g}_{j}\|_{2}}{\|\mathbf{g}_{i}\|_{2}^{2} + \|\mathbf{g}_{j}\|_{2}^{2}}.$$
When the magnitudes of the two gradients are equal, this value is 1. As the difference in magnitude grows, this value approaches zero.

$\textbf{Definition 3.}$ We define the multi-task curvature as
$$\mathbf{H}(\mathcal{L}; \theta, \theta^{\prime}) = \int_{0}^{1} \nabla \mathcal{L}(\theta)^{T} \nabla^{2} \mathcal{L}\left(\theta + a(\theta^{\prime} - \theta)\right) \nabla \mathcal{L}(\theta)\, da,$$
which is the averaged curvature of $\mathcal{L}$ between $\theta$ and $\theta^{\prime}$ in the direction of the multi-task gradient $\nabla \mathcal{L}(\theta)$.

When $\mathbf{H}(\mathcal{L}; \theta, \theta^{\prime}) > C$ for some large positive constant $C$, for model parameters $\theta$ and $\theta^{\prime}$ at the current and next iteration, we characterize the optimization landscape as having high curvature.
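As a quick illustration of Definitions 1 and 2 (a sketch with toy values of our own, not code from the paper), the cosine of the angle $\phi_{ij}$ and the magnitude similarity $\Phi$ can be computed as follows:

```python
import numpy as np

def cos_angle(g_i: np.ndarray, g_j: np.ndarray) -> float:
    """cos(phi_ij); a negative value means the gradients conflict (Definition 1)."""
    return float(g_i @ g_j / (np.linalg.norm(g_i) * np.linalg.norm(g_j)))

def magnitude_similarity(g_i: np.ndarray, g_j: np.ndarray) -> float:
    """Phi(g_i, g_j) from Definition 2: 1 for equal norms, approaches 0 as norms diverge."""
    n_i, n_j = np.linalg.norm(g_i), np.linalg.norm(g_j)
    return float(2 * n_i * n_j / (n_i ** 2 + n_j ** 2))

g_i = np.array([1.0, 1.0])    # task i: points to the upper right
g_j = np.array([1.0, -3.0])   # task j: points to the lower right, larger magnitude
print(cos_angle(g_i, g_j))             # ~ -0.45 < 0: conflicting gradients
print(magnitude_similarity(g_i, g_j))  # ~ 0.75: magnitudes differ noticeably
```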

Our goal is to study this tragic triad and observe the presence of these three conditions through two examples. First, consider the two-dimensional optimization landscape shown in Figure 1a, where the landscape of each task objective corresponds to a deep valley with large curvature (see Figures 1b and 1c). The optimum of this multi-task objective lies where the two valleys meet. See Appendix D for more details on this optimization landscape. Certain points in this landscape exhibit the three conditions above, and we observe that the Adam optimizer [30] stalls at exactly one of these points, preventing it from reaching the optimum. This provides some empirical evidence for our hypothesis. Our experiments in Section 5.3 further show that this phenomenon does occur in multi-task learning with deep networks. Motivated by these observations, we develop an algorithm that aims to mitigate the optimization challenges caused by conflicting gradients, dominating gradients, and high curvature, which we describe next.

2.3 PCGrad: Resolving Gradient Conflicts

Our goal is to break one of the conditions of the tragic triad by directly modifying the gradients to avoid conflicts. In this section, we outline how the gradients are modified. In the next section, we theoretically show that resolving gradient conflicts can benefit multi-task learning in the presence of dominating gradients and high curvature.

To achieve maximum effectiveness and broad applicability, we aim to modify the gradients in a way that generates positive interactions between task gradients and does not make assumptions about the model form. Therefore, we do not change the gradient when the gradients do not conflict. When gradients conflict, PCGrad aims to modify the gradients for each task to minimize negative conflicts with other task gradients, which will further alleviate the underestimation and overestimation problems caused by high curvature.

To resolve gradient conflicts during optimization, PCGrad uses a simple procedure: if the gradients of two tasks conflict, i.e., their cosine similarity is negative, we project the gradient of each task onto the normal plane of the other task's gradient. This is equivalent to removing the conflicting component of the gradient for each task, thereby reducing gradient interference between tasks. A diagram of this idea is shown in Figure 2.
Figure 2: Conflicting gradients and PCGrad. In (a), tasks i and j have conflicting gradient directions, which may lead to destructive interference. In (b) and (c), we illustrate the PCGrad algorithm in the case of conflicting gradients. PCGrad projects the gradient of task i onto the normal vector of the gradient of task j, and vice versa. Non-conflicting task gradients (d) are not changed by PCGrad, allowing constructive interaction.

Suppose the gradient of task $\mathcal{T}_{i}$ is $\mathbf{g}_{i}$ and the gradient of task $\mathcal{T}_{j}$ is $\mathbf{g}_{j}$. PCGrad proceeds as follows:

(1) First, it determines whether $\mathbf{g}_{i}$ conflicts with $\mathbf{g}_{j}$ by computing the cosine similarity between the vectors $\mathbf{g}_{i}$ and $\mathbf{g}_{j}$, where a negative value indicates conflicting gradients.

(2) If the cosine similarity is negative, we replace $\mathbf{g}_{i}$ with its projection onto the normal plane of $\mathbf{g}_{j}$: $\mathbf{g}_{i} = \mathbf{g}_{i} - \frac{\mathbf{g}_{i} \cdot \mathbf{g}_{j}}{\|\mathbf{g}_{j}\|^{2}} \mathbf{g}_{j}$. If the gradients do not conflict, i.e., the cosine similarity is non-negative, the original gradient $\mathbf{g}_{i}$ is left unchanged.

(3) PCGrad repeats this process with all of the other tasks sampled from the current batch in random order, i.e., for all $j \neq i$, producing the gradient $\mathbf{g}_{i}^{\mathrm{PC}}$ that is applied for task $\mathcal{T}_{i}$.

We perform the same procedure for all tasks in the batch to obtain their respective gradients. The complete update procedure is described in Algorithm 1, and a discussion of using a random task order is included in Appendix H.

This procedure, while simple to implement, ensures that the gradients we apply for each task in each batch interfere minimally with the other tasks in the batch, thereby reducing gradient conflicts and producing a variant of standard first-order gradient descent for the multi-objective setting. In practice, PCGrad can be combined with any gradient-based optimizer, including commonly used methods such as SGD with momentum and Adam [30], by simply passing the computed update to the corresponding optimizer instead of the raw gradients. Our experimental results verify the hypothesis that this procedure reduces gradient conflicts, and we find that, as a result, learning progress improves substantially.
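As a minimal sketch of this per-batch procedure (our reading of Algorithm 1 in plain NumPy, not the authors' released implementation; the function name `pcgrad` and the toy gradients are ours), each task gradient is projected away from every other task gradient it conflicts with, in random order, and the projected gradients are then summed into a single update that can be handed to any optimizer:

```python
import random
import numpy as np

def pcgrad(task_grads, seed=0):
    """task_grads: list of flattened per-task gradients g_i; returns sum_i g_i^PC."""
    rng = random.Random(seed)
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.copy()
        order = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(order)                      # random task order, as in Algorithm 1
        for j in order:
            g_j = task_grads[j]
            dot = g @ g_j
            if dot < 0:                         # conflict: project g onto g_j's normal plane
                g = g - dot / (g_j @ g_j) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)            # combined update for the shared parameters

update = pcgrad([np.array([1.0, 2.0]), np.array([2.0, -3.0])])
print(update)  # hand this to SGD/Adam in place of the raw summed gradient
```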

2.4 Theoretical analysis of PCGrad

In this section, we will theoretically analyze the performance of PCGrad when handling two tasks:

Definition 4. Consider two task loss functions $\mathcal{L}_{1}: \mathbb{R}^{n} \rightarrow \mathbb{R}$ and $\mathcal{L}_{2}: \mathbb{R}^{n} \rightarrow \mathbb{R}$. We define the two-task learning objective as $\mathcal{L}(\theta) = \mathcal{L}_{1}(\theta) + \mathcal{L}_{2}(\theta)$ for all $\theta \in \mathbb{R}^{n}$, where $\mathbf{g}_{1} = \nabla \mathcal{L}_{1}(\theta)$, $\mathbf{g}_{2} = \nabla \mathcal{L}_{2}(\theta)$, and $\mathbf{g} = \mathbf{g}_{1} + \mathbf{g}_{2}$.

First, our goal is to verify that the PCGrad update corresponds to a sensible optimization procedure under simplifying assumptions. We analyze the convergence of PCGrad in the convex setting, under the standard assumptions stated in Theorem 1. For further convergence analysis, including the non-convex setting, more than two tasks, and momentum-based optimizers, see Appendices A.1 and A.4.

Theorem 1. Assume $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are convex and differentiable. Suppose the gradient of $\mathcal{L}$ is $L$-Lipschitz with $L > 0$. Then, the PCGrad update rule with step size $t \leq \frac{1}{L}$ will converge either to (1) a location in the optimization landscape where $\cos(\phi_{12}) = -1$, or (2) the optimal value $\mathcal{L}(\theta^{*})$.
Proof. See Appendix A.1.

Simply put, Theorem 1 states that, in the convex setting, applying the PCGrad update to the two-task multi-task loss $\mathcal{L}$ converges either to the minimizer of $\mathcal{L}$ or to a potentially sub-optimal point where the gradients conflict completely. This means that if the gradient directions of the two tasks are exactly opposite, the PCGrad update is zero, which could lead to a sub-optimal solution. In practice, however, because we use SGD, which is only a noisy estimate of the true gradient, the cosine similarity between the gradients of two tasks on a mini-batch is rarely exactly $-1$, so this condition is avoided. It is also worth noting that, in theory, convergence may be slow if the angle between the two gradients is very close to a right angle. In practice, we did not observe this, as shown by the learning curves in Appendix B.
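As a quick numerical check of the degenerate case mentioned above (our own toy example, not from the paper), when the two task gradients point in exactly opposite directions ($\cos\phi_{12} = -1$), projecting each onto the other's normal plane zeroes both, so the PCGrad update vanishes:

```python
import numpy as np

g1 = np.array([2.0, 1.0])
g2 = -0.5 * g1                               # exactly opposite direction, cos(phi_12) = -1

g1_pc = g1 - (g1 @ g2) / (g2 @ g2) * g2      # projection of g1 onto g2's normal plane
g2_pc = g2 - (g2 @ g1) / (g1 @ g1) * g1      # projection of g2 onto g1's normal plane
print(g1_pc + g2_pc)                         # [0. 0.]: no update is applied at this point
```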

Now that we have checked the soundness of PCGrad, we aim to understand how PCGrad relates to the three conditions of the tragic triad. In particular, we derive sufficient conditions under which PCGrad achieves a lower loss after one update. Here, we still analyze the two-task setting, but no longer assume convexity of the loss functions.

Definition 5. We define the multi-task curvature bounding measure as
$$\xi\left(\mathbf{g}_{1}, \mathbf{g}_{2}\right) = \left(1 - \cos^{2} \phi_{12}\right) \frac{\|\mathbf{g}_{1} - \mathbf{g}_{2}\|_{2}^{2}}{\|\mathbf{g}_{1} + \mathbf{g}_{2}\|_{2}^{2}}.$$

Based on the above definition, we propose our next theorem:

Theorem 2. Suppose $\mathcal{L}$ is differentiable and the gradient of $\mathcal{L}$ is Lipschitz continuous with constant $L > 0$. Let $\theta^{MT}$ and $\theta^{\text{PCGrad}}$ be the parameters after applying one update to $\theta$ with $\mathbf{g}$ and with the PCGrad-modified gradient $\mathbf{g}^{PC}$, respectively, using step size $t > 0$. Moreover, assume $\mathbf{H}\left(\mathcal{L}; \theta, \theta^{MT}\right) \geq \ell \|\mathbf{g}\|_{2}^{2}$ for some constant $\ell \leq L$, i.e., the multi-task curvature is lower-bounded. Then, if (a) $\cos \phi_{12} \leq -\Phi\left(\mathbf{g}_{1}, \mathbf{g}_{2}\right)$, (b) $\ell \geq \xi\left(\mathbf{g}_{1}, \mathbf{g}_{2}\right) L$, and (c) $t \geq \frac{2}{\ell - \xi\left(\mathbf{g}_{1}, \mathbf{g}_{2}\right) L}$, then $\mathcal{L}\left(\theta^{\text{PCGrad}}\right) \leq \mathcal{L}\left(\theta^{MT}\right)$.
Proof. See Appendix A.2.

Intuitively, Theorem 2 says that PCGrad achieves a lower loss value after a single gradient update compared to standard multi-task gradient descent when (i) the angle between the task gradients is not too small, i.e., the two tasks conflict sufficiently (condition (a)), (ii) the difference in magnitude is sufficiently large (condition (a)), (iii) the curvature of the multi-task gradient is large (condition (b)), and (iv) the learning rate is big enough that the large curvature leads to overestimation of the performance improvement on the dominating task and underestimation of the performance degradation on the dominated task (condition (c)). The first three points (i-iii) correspond exactly to the conditions of the tragic triad outlined in Section 2.2, while the last condition (iv) is desirable since we want to learn quickly. In Figure 4 of Section 5.3, we empirically verify that the first three points (i-iii) are often satisfied in neural network multi-task learning problems. For further analysis of the complete set of sufficient and necessary conditions under which the PCGrad update outperforms the plain multi-task gradient, see Appendix A.3.
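The following sketch (our own illustration with made-up gradients, not from the paper) computes the quantities that enter these conditions, $\Phi$ from Definition 2 and $\xi$ from Definition 5, and checks condition (a), i.e., whether the two gradients conflict and differ in magnitude strongly enough:

```python
import numpy as np

def phi(g1, g2):
    """Gradient magnitude similarity Phi(g1, g2) from Definition 2."""
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    return 2 * n1 * n2 / (n1 ** 2 + n2 ** 2)

def xi(g1, g2):
    """Multi-task curvature bounding measure xi(g1, g2) from Definition 5."""
    cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
    return (1 - cos ** 2) * np.linalg.norm(g1 - g2) ** 2 / np.linalg.norm(g1 + g2) ** 2

g1, g2 = np.array([1.0, 0.1]), np.array([-10.0, 0.5])   # strongly conflicting, very different magnitudes
cos12 = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(cos12 <= -phi(g1, g2))   # condition (a) of Theorem 2: True for this pair
print(xi(g1, g2))              # enters conditions (b) and (c) via the curvature bound
```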

3 PCGrad in practical applications

We use PCGrad in supervised learning and reinforcement learning to handle multi-task or multi-objective problems. Below, we will discuss how PCGrad is actually used in these scenarios.

In multi-task supervised learning, each task $\mathcal{T}_{i} \sim p(\mathcal{T})$ has a corresponding training dataset $\mathcal{D}_{i}$ consisting of labeled training examples, i.e., $\mathcal{D}_{i} = \{(x, y)_{n}\}$. In this supervised setting, the objective for each task is defined as $\mathcal{L}_{i}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_{i}}\left[-\log f_{\theta}(y \mid x, z_{i})\right]$, where $z_{i}$ is the one-hot encoding of task $\mathcal{T}_{i}$. At each training step, we randomly sample a batch of data points $\mathcal{B}$ from the union of all datasets $\bigcup_{i} \mathcal{D}_{i}$ and then group the sampled data by task encoding into per-task mini-batches $\mathcal{B}_{i}$ for each task $\mathcal{T}_{i}$. We denote the set of tasks appearing in $\mathcal{B}$ as $\mathcal{B}_{\mathcal{T}}$. After sampling, we pre-compute the gradient of each task in $\mathcal{B}_{\mathcal{T}}$ as $\nabla_{\theta} \mathcal{L}_{i}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{B}_{i}}\left[-\nabla_{\theta} \log f_{\theta}(y \mid x, z_{i})\right]$. Given these pre-computed gradients, we also pre-compute the cosine similarity between every pair of gradients in the set. Using the pre-computed gradients and their similarities, we can obtain the PCGrad update following Algorithm 1 without re-computing the task gradients or running additional backpropagation through the network. Since PCGrad only modifies the gradients of shared parameters during optimization, it is model-agnostic and can be applied to any architecture with shared parameters. In Section 5, we verify the effectiveness of PCGrad across multiple architectures.
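A hedged PyTorch sketch of one such training step is given below (our own illustration; the names `pcgrad_combine` and `train_step`, and the assumption that `model(x, z)` returns class logits, are ours and not from the paper's released code). It groups the sampled batch by task encoding, pre-computes one flattened gradient per task, combines the gradients with the same projection as in Section 2.3, and hands the result to the optimizer in place of the raw gradient.

```python
import random
import torch
import torch.nn.functional as F

def pcgrad_combine(task_grads):
    """Project each flattened task gradient away from conflicting ones and sum them."""
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        order = [j for j in range(len(task_grads)) if j != i]
        random.shuffle(order)
        for j in order:
            dot = torch.dot(g, task_grads[j])
            if dot < 0:                                   # conflicting pair: remove the component
                g = g - dot / task_grads[j].pow(2).sum() * task_grads[j]
        projected.append(g)
    return torch.stack(projected).sum(dim=0)

def train_step(model, optimizer, batch):
    """batch: list of (x, y, z) samples; z is the one-hot task encoding, y a class index."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Group the sampled batch B into per-task mini-batches B_i by task encoding.
    by_task = {}
    for x, y, z in batch:
        by_task.setdefault(tuple(z.tolist()), []).append((x, y, z))
    # Pre-compute one flattened gradient per task in the batch.
    task_grads = []
    for samples in by_task.values():
        x = torch.stack([s[0] for s in samples])
        y = torch.stack([s[1] for s in samples])
        z = torch.stack([s[2] for s in samples])
        loss = F.cross_entropy(model(x, z), y)            # -log f_theta(y | x, z_i)
        grads = torch.autograd.grad(loss, params)
        task_grads.append(torch.cat([g.reshape(-1) for g in grads]))
    # Combine with PCGrad and hand the result to the optimizer instead of the raw gradient.
    merged, offset = pcgrad_combine(task_grads), 0
    for p in params:
        p.grad = merged[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    optimizer.step()
    optimizer.zero_grad()
```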

For multi-task reinforcement learning and goal-conditioned reinforcement learning, PCGrad can be applied directly to policy gradient methods by modifying the per-task policy gradients in the same way as in the supervised learning case. For actor-critic algorithms, applying PCGrad is also straightforward: we simply replace the task gradients of both the actor and the critic with the gradients computed via PCGrad. For more details on the practical implementation for reinforcement learning, see Appendix C.

(to be continued)

Reference article

  1. Multi-task learning——[ICLR 2020] PCGrad

  2. Paper reading: Gradient Surgery for Multi-Task Learning


Origin blog.csdn.net/Waldocsdn/article/details/132702853