Multi-objective loss optimization

Uncertainty weighting seems better suited to data with noisier labels, while DTP may work better on cleanly labeled data.

1. Uncertainty weighting with homoscedastic (same-variance) uncertainty (low weight for high-noise tasks, high weight for low-noise, easier tasks)

Reference: Using uncertainty to weigh loss functions in multi-task learning

The optimal weight for each task depends on the measurement scale and ultimately on the magnitude of the task noise.

The weights of the different task loss functions are set by considering the homoscedastic uncertainty of each task (the specific meaning of homoscedastic uncertainty is explained further below).

In summary, there are three main innovations in this paper:

  1. A new way of setting the weights of a multi-task loss is proposed, which uses homoscedastic uncertainty to simultaneously learn regression and classification objectives of different scales and units.
  2. A unified framework for semantic segmentation, instance segmentation and depth regression is proposed.
  3. The effect of different loss-weight settings on the final multi-task network's performance is demonstrated, along with how the method achieves better performance than models trained separately for each task.

Multi Task Learning with Homoscedastic Uncertainty

Multi-task learning is the problem of optimizing a model against multiple objectives, which in essence means unifying the losses. The most basic approach, shown in the formula below, simply takes a linear weighted sum of the per-task losses:
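In symbols, this naive weighted sum is (cf. the paper's formula (1)):

```latex
\mathcal{L}_{\text{total}}(W) = \sum_{i} w_i \, \mathcal{L}_i(W)
```

where the w_i are fixed, hand-tuned scalars.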

However, this way of computing the loss has many drawbacks. As the figure below shows, the tasks are very sensitive to the choice of each w_i, and different settings produce large differences in performance; the last row also shows that the weights designed in this paper make multi-task training outperform single-task training.

《Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics》

CVPR 2018, 676 citations

【Main idea】:

This paper hopes to give "easy" tasks a higher weight.

【Background】:

The NIPS 2017 paper "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" notes that when we train a model on a dataset with input x and output y, we face two kinds of uncertainty: epistemic and aleatoric.

  • Epistemic: cognitive bias caused by a lack of data. When data are scarce, the sample distribution of the training set poorly represents the global data distribution, so model training is biased. This uncertainty can be reduced by adding more data.
  • Aleatoric: bias caused by the data or the task itself. Its defining characteristic is that it does not improve as the amount of data grows; the bias remains no matter how much data is added.

(1) Homoscedastic uncertainty (the model's confidence in its own output)

Before the technical details, a brief note on how "uncertainty" should be understood. Deep learning performs excellently in many fields, a performance that rests on powerful compute and deep network architectures, and a trained model always gives one concrete answer to a problem; a semantic-segmentation network for autonomous driving, for example, may simply label a vehicle as background. The point is that the model outputs exactly one answer, however internally uncertain that answer may be; it reports no confidence in its own final output.

One might object that the model does pick the most probable of many candidates (classification, for instance, usually ends in a softmax). But consider an extreme case: a classifier trained on classes A and B that is fed a class-C image at test time will still output a high probability, with unpredictable consequences, and such errors cannot be tolerated in low-fault-tolerance industries such as aerospace or the military. If, in extreme situations like these, the model emitted a very low confidence alongside its result, that low confidence would serve as an early warning and allow human intervention, a much better outcome. Producing such a confidence output depends on Bayesian modeling.

Aleatoric uncertainty can be further subdivided into two categories:

(1) Data-dependent (heteroscedastic) uncertainty: this uncertainty depends on the input data and is predicted as an output of the model.

(2) Task-dependent (homoscedastic) uncertainty: does not depend on the input data and is not a model output; it is a constant that is the same for all inputs but differs between tasks. Because of this property it is called task-dependent uncertainty.

In multi-task learning, task uncertainty expresses the relative confidence between tasks, reflecting the uncertainty inherent in each regression or classification problem. This paper therefore proposes using homoscedastic uncertainty as the noise term with which to learn the weights in multi-task learning.

(2) Multi-task likelihoods

This section derives a multi-task loss function by maximizing a Gaussian likelihood with homoscedastic uncertainty. First, define a probabilistic model:
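Following formula (2) of the paper, this is a Gaussian with the network output as its mean and noise scale σ:

```latex
p\left(\mathbf{y} \mid \mathbf{f}^{W}(\mathbf{x})\right) = \mathcal{N}\left(\mathbf{f}^{W}(\mathbf{x}),\ \sigma^{2}\right)
```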

This is the probabilistic model for a regression problem: f^W(x) is the output of the neural network, x the input data, and W the network weights.

For classification problems, the output is usually squashed through a softmax, as follows:
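Following formula (3) of the paper, this classification likelihood reads:

```latex
p\left(\mathbf{y} \mid \mathbf{f}^{W}(\mathbf{x})\right) = \mathrm{Softmax}\left(\mathbf{f}^{W}(\mathbf{x})\right)
```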

Next, define the likelihood function for multitasking:
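With K model outputs, the joint likelihood factorizes over the tasks (formula (4) in the paper):

```latex
p\left(\mathbf{y}_1,\ldots,\mathbf{y}_K \mid \mathbf{f}^{W}(\mathbf{x})\right) = p\left(\mathbf{y}_1 \mid \mathbf{f}^{W}(\mathbf{x})\right)\cdots p\left(\mathbf{y}_K \mid \mathbf{f}^{W}(\mathbf{x})\right)
```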

where y_i is the output of the i-th subtask.

Then maximum likelihood estimation can be expressed as in formula (5), which shows that the log-likelihood is proportional to the expression on the right, where σ is the standard deviation of the Gaussian, i.e. the model's observation noise. The task that follows is to maximize the likelihood with respect to W and σ.
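For the Gaussian regression model above, this log-likelihood is (formula (5)):

```latex
\log p\left(\mathbf{y} \mid \mathbf{f}^{W}(\mathbf{x})\right) \propto -\frac{1}{2\sigma^{2}}\left\|\mathbf{y}-\mathbf{f}^{W}(\mathbf{x})\right\|^{2} - \log\sigma
```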

Taking two outputs y1 and y2 as an example, we obtain the factorized Gaussian in (6):
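Assuming both outputs share the network features f^W(x), this is:

```latex
p\left(\mathbf{y}_1,\mathbf{y}_2 \mid \mathbf{f}^{W}(\mathbf{x})\right) = p\left(\mathbf{y}_1 \mid \mathbf{f}^{W}(\mathbf{x})\right)\cdot p\left(\mathbf{y}_2 \mid \mathbf{f}^{W}(\mathbf{x})\right) = \mathcal{N}\left(\mathbf{y}_1;\ \mathbf{f}^{W}(\mathbf{x}),\ \sigma_1^{2}\right)\cdot \mathcal{N}\left(\mathbf{y}_2;\ \mathbf{f}^{W}(\mathbf{x}),\ \sigma_2^{2}\right)
```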

Then the objective obtained from maximum likelihood, i.e. the negative log-likelihood to be minimized, is (7):
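Taking the negative log of (6) gives (formula (7)):

```latex
\begin{aligned}
\mathcal{L}(W,\sigma_1,\sigma_2) &= -\log p\left(\mathbf{y}_1,\mathbf{y}_2 \mid \mathbf{f}^{W}(\mathbf{x})\right) \\
&\propto \frac{1}{2\sigma_1^{2}}\left\|\mathbf{y}_1-\mathbf{f}^{W}(\mathbf{x})\right\|^{2} + \frac{1}{2\sigma_2^{2}}\left\|\mathbf{y}_2-\mathbf{f}^{W}(\mathbf{x})\right\|^{2} + \log\sigma_1\sigma_2 \\
&= \frac{1}{2\sigma_1^{2}}\mathcal{L}_1(W) + \frac{1}{2\sigma_2^{2}}\mathcal{L}_2(W) + \log\sigma_1\sigma_2
\end{aligned}
```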

It can be seen that in the last step the squared distance between y and f is written as the task loss, i.e. L1(W) = ||y1 − f^W(x)||²; in the same way, L2(W) = ||y2 − f^W(x)||².

Continuing with formula (7): since the task is to minimize this negative log-likelihood, when σ (the noise) increases, the corresponding task weight 1/(2σ²) decreases; conversely, as the noise σ decreases, the corresponding weight increases. The log σ terms act as a regularizer that keeps σ from growing without bound.

Next, the classification problem is also considered. A softmax is applied to a version of the output scaled by the uncertainty, as shown in formula (8):
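Scaling the logits by 1/σ² before the softmax, i.e. a Boltzmann (Gibbs) distribution with temperature σ², gives:

```latex
p\left(\mathbf{y} \mid \mathbf{f}^{W}(\mathbf{x}), \sigma\right) = \mathrm{Softmax}\left(\frac{1}{\sigma^{2}}\,\mathbf{f}^{W}(\mathbf{x})\right)
```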

Then the softmax log-likelihood is (9):
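For class label c this reads:

```latex
\log p\left(\mathbf{y}=c \mid \mathbf{f}^{W}(\mathbf{x}), \sigma\right) = \frac{1}{\sigma^{2}} f_c^{W}(\mathbf{x}) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^{2}} f_{c'}^{W}(\mathbf{x})\right)
```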

Next, consider this situation: the model has two outputs, a continuous y1 and a discrete y2, modeled with a Gaussian and a softmax likelihood respectively; this yields formula (10):
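Applying the paper's simplifying approximation (1/σ2)·Σ_c' exp((1/σ2²) f_c'^W(x)) ≈ (Σ_c' exp(f_c'^W(x)))^(1/σ2²), which becomes exact as σ2 → 1, the joint objective reduces to:

```latex
\mathcal{L}(W,\sigma_1,\sigma_2) \approx \frac{1}{2\sigma_1^{2}}\mathcal{L}_1(W) + \frac{1}{\sigma_2^{2}}\mathcal{L}_2(W) + \log\sigma_1 + \log\sigma_2
```

where L1(W) is the Euclidean loss of the continuous output y1.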

In the same way, L2(W) is replaced by the classification cross-entropy, L2(W) = −log Softmax(y2, f^W(x)).
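As a concrete illustration (not the authors' code), here is a minimal NumPy sketch of this combined objective, parameterized with s_i = log σ_i², a common choice in practice for numerical stability:

```python
import numpy as np

def uncertainty_weighted_loss(l1, l2, log_var1, log_var2):
    """Combine a regression loss l1 and a classification loss l2 in the
    spirit of formula (10), with log_var_i = log(sigma_i^2)."""
    w1 = 0.5 * np.exp(-log_var1)   # 1 / (2 sigma_1^2), Gaussian term
    w2 = np.exp(-log_var2)         # 1 / sigma_2^2, softmax term
    # 0.5 * log_var_i equals log(sigma_i), the regularizer in (10)
    return w1 * l1 + w2 * l2 + 0.5 * log_var1 + 0.5 * log_var2

low_noise = uncertainty_weighted_loss(1.0, 1.0, 0.0, 0.0)  # sigma_1 = sigma_2 = 1
noisy_t1 = uncertainty_weighted_loss(1.0, 1.0, 2.0, 0.0)   # task 1 much noisier
```

In a real model the log-variances would be trainable parameters learned jointly with W: a larger log-variance shrinks the task's effective weight, while the 0.5·log-variance term penalizes ignoring a task entirely.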

2. DTP, Dynamic Task Prioritization (low weight for easy, high-KPI tasks; high weight for difficult tasks)

Reference: Loss weighting for multi-objective learning in recommendation systems, console.log(**)'s blog, CSDN

DTP hopes to give higher weight to harder-to-learn tasks.

Intuitively, tasks with a high KPI are relatively easy to learn and receive a smaller weight; conversely, difficult tasks receive a larger weight.

【Evaluation】:

Advantages:

Only the KPI values at different steps are needed, which avoids having to compute per-task gradients, so it is fast.

Disadvantages:

DTP ignores the magnitudes of the different tasks' losses, so extra work is needed to bring them to the same scale; and the KPIs must be recomputed frequently.

# Dynamic Task Prioritization
import tensorflow as tf

def focal_loss_dtp(auc, k, alpha=0.95, gamma=2.0):
    # k is the running (EMA) estimate of the task's KPI; auc is the
    # current KPI value (here the task's AUC).
    k = alpha * auc + (1 - alpha) * k
    # Focal-loss-style weight: the lower the KPI (harder task),
    # the larger the weight.
    return -tf.pow(1 - k, gamma) * tf.math.log(k)

# auc_i[1] is presumably the update op from tf.metrics.auc for task i;
# k_i holds the previous KPI estimate for task i.
dtp_1 = focal_loss_dtp(auc_1[1], k_1)
dtp_2 = focal_loss_dtp(auc_2[1], k_2)
dtp_3 = focal_loss_dtp(auc_3[1], k_3)

loss = dtp_1 * loss_1 + dtp_2 * loss_2 + dtp_3 * loss_3

3. Dynamic Weight Averaging (DWA)
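For reference, DWA (introduced in Liu et al.'s "End-to-End Multi-Task Learning with Attention") sets each task's weight from the ratio of its two most recent loss values: tasks whose loss is descending slowly get larger weights. A minimal sketch, with illustrative function and variable names:

```python
import numpy as np

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Averaging: weight each task by how slowly its
    loss has been falling over the last two recorded steps."""
    # r_i = L_i(t-1) / L_i(t-2); near (or above) 1 means slow descent
    ratios = np.asarray(prev_losses, dtype=float) / np.asarray(prev_prev_losses, dtype=float)
    exp_terms = np.exp(ratios / temperature)
    # Softmax over tasks, rescaled so the weights sum to the task count K
    return len(ratios) * exp_terms / exp_terms.sum()

# Task 0's loss barely moved (ratio 1.0); task 1's halved (ratio 0.5)
w = dwa_weights([1.0, 0.5], [1.0, 1.0])
```

The temperature T smooths the weight distribution; as T grows, the weights approach uniform.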

4. Gradient Normalization (GradNorm)

 


Further reading:

  • Optimization in Multi-task Learning (Zhihu). Most articles and surveys on multi-task learning focus on iterating the network structure, but optimizing the multi-task loss is just as important; this article builds on a 2020 multi-task learning survey. https://zhuanlan.zhihu.com/p/269492239

  • How to fix losses that fail to converge synchronously in a multi-task model (ABEL Su's blog, CSDN). Differences in data distribution, positive/negative sample ratio, and loss scale across tasks can make some losses converge quickly while others stagnate, so the stagnating tasks fail to benefit from the shared input layers; the post introduces "Multi-Task Learning Using Uncertainty to Weigh Losses". https://blog.csdn.net/weixin_36261487/article/details/112180071

  • Multi-task learning: network design and loss-function optimization (CSDN). Multi-objective learning develops along two directions: network-structure design and loss-function optimization. MTL networks are usually either hard-parameter sharing (tasks share the bottom hidden layers) or soft-parameter sharing (e.g. separate per-task parameters with an L2-norm constraint between them, or a learned combination of per-task hidden layers). Hard sharing is less prone to overfitting than soft sharing, but its results degrade when the tasks differ greatly. https://blog.csdn.net/u012513618/article/details/110439185

  • Multi-objective ranking in the CRM business-opportunity intelligent distribution system of 58 Yellow Pages (local services). https://mp.weixin.qq.com/s/gb3UPtDSW6h7kgBrQWNs_A


Origin blog.csdn.net/u013385018/article/details/120891324