Uncertainty modeling seems better suited to data with noisy labels, while DTP may work better on cleanly labeled data.
1. Uncertainty weighting via homoscedastic uncertainty (high weight for low-noise tasks, i.e. high weight for tasks that are easy to learn)
Reference: Using uncertainty to weigh loss functions in multi-task learning (Zhihu)
The optimal weight for each task depends on its measurement scale and, ultimately, on the magnitude of the task's noise.
The weights of the different task losses are set by considering the homoscedastic uncertainty of each task (the precise meaning of homoscedastic uncertainty is explained further below).
In summary, there are three main innovations in this paper:
- A new method for setting multi-task loss weights is proposed, which uses homoscedastic uncertainty to simultaneously learn classification and regression problems of different scales and units.
- A unified architecture for semantic segmentation, instance segmentation, and depth regression is proposed.
- The effect of different loss-weight settings on the performance of the final multi-task network is demonstrated, showing how the multi-task model can outperform models trained separately on each task.
Multi Task Learning with Homoscedastic Uncertainty
Multi-task learning optimizes a model against several objectives at once, which in essence means unifying the losses. The most basic approach, shown below, simply takes a linear weighted sum of the per-task losses:

$$L_{total} = \sum_i w_i L_i$$
However, this formulation has serious drawbacks. As the figure below shows, performance is very sensitive to the choice of each $w_i$, and different settings lead to large differences in results; the last row also shows that the weighting method designed in this paper lets multi-task training outperform single-task training.
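As a quick sketch (plain Python, with made-up loss values), the naive weighted sum and its sensitivity to the hand-chosen weights look like this:

```python
def weighted_sum_loss(losses, weights):
    # Naive multi-task loss: a fixed linear combination of per-task losses
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-task losses on very different scales,
# e.g. a classification cross-entropy vs. a depth-regression L2 loss:
task_losses = [0.5, 120.0]

print(weighted_sum_loss(task_losses, [1.0, 1.0]))   # 120.5 -- task 2 dominates
print(weighted_sum_loss(task_losses, [1.0, 0.01]))  # 1.7 -- the balance depends entirely on w_i
```

Because the balance depends entirely on the hand-set $w_i$, finding good weights requires an expensive grid search, which is what motivates learning the weights instead.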
《Multi-task learning using uncertainty to weigh losses for scene geometry and semantics》
CVPR 2018, Cites:676
【Main idea】:
This paper hopes to give "easy" tasks a higher weight.
【Background】:
The NIPS 2017 paper "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" notes that when we train a model mapping inputs x to outputs y, we face two kinds of uncertainty: epistemic and aleatoric.
- Epistemic uncertainty: bias due to a lack of data. When data is scarce, the sample distribution in the training set is a poor representation of the global data distribution, so the trained model is biased. This uncertainty can be reduced by adding more data.
- Aleatoric uncertainty: bias caused by the data or the task itself. Its defining characteristic is that it does not improve as the amount of data increases; even with more data, the bias remains.
(1) Homoscedastic uncertainty (the confidence of the model output)
Before introducing the relevant background, a brief word on how to understand "uncertainty". Deep learning performs excellently in many settings, relying on powerful compute and deep network architectures, and it always gives a single concrete answer to a question — for example, a semantic-segmentation model for autonomous driving may label a vehicle as background. The point is that the model outputs one answer, with no statement of how confident it is in that answer. One might object that the softmax at the end of a classifier already picks the highest-probability candidate, but in extreme cases — say, a classifier trained on classes A and B that is shown an image of class C at test time — the classifier will still produce a confident but unpredictable result. Such errors cannot be tolerated in domains with extremely low fault tolerance, such as aerospace or the military. If, in situations like these, the model also emitted a very low confidence alongside its output, that low confidence could serve as an early warning and allow a human to intervene, which would be far better. Producing such confidence estimates is what Bayesian modeling provides.
Aleatoric uncertainty can be further divided into two categories:
(1) Data-dependent (heteroscedastic) uncertainty: depends on the input data and is predicted as one of the model's outputs.
(2) Task-dependent (homoscedastic) uncertainty: does not depend on the input data and is not a model output; it is a constant that is the same for all inputs but differs between tasks. Because of this property, it is called task-dependent uncertainty.
In multi-task learning, task uncertainty reflects the relative confidence between tasks and captures the uncertainty inherent in the regression and classification problems. This paper therefore proposes to use homoscedastic uncertainty as a noise term for weighting the losses in multi-task learning.
(2) Multi-task likelihoods
This section derives a multi-task loss function by maximizing a Gaussian likelihood with homoscedastic uncertainty. First, define a probabilistic model:

$$p\left(y \mid f^W(x)\right) = \mathcal{N}\left(f^W(x), \sigma^2\right)$$

This is the probabilistic model for a regression problem: $f^W(x)$ is the output of the neural network, x is the input data, and W are the network weights.
For classification problems, the output is usually squashed through a softmax function:

$$p\left(y \mid f^W(x)\right) = \mathrm{Softmax}\left(f^W(x)\right)$$
Next, define the multi-task likelihood, factored over the task outputs:

$$p\left(y_1, \ldots, y_K \mid f^W(x)\right) = p\left(y_1 \mid f^W(x)\right) \cdots p\left(y_K \mid f^W(x)\right)$$

where each $y_i$ is the output of one of the subtasks.
Maximum likelihood estimation can then be expressed as in Eq. (5), which shows that the log likelihood is proportional to the right-hand side, where σ is the standard deviation of the Gaussian — the model's observation noise. The task is to maximize the log likelihood with respect to W and σ:

$$\log p\left(y \mid f^W(x)\right) \propto -\frac{1}{2\sigma^2}\left\|y - f^W(x)\right\|^2 - \log\sigma$$
Taking two outputs y1 and y2 as an example, the Gaussian factorization of Eq. (6) is:

$$p\left(y_1, y_2 \mid f^W(x)\right) = \mathcal{N}\left(y_1; f^W(x), \sigma_1^2\right) \cdot \mathcal{N}\left(y_2; f^W(x), \sigma_2^2\right)$$

The corresponding negative log likelihood to be minimized is then Eq. (7):

$$\mathcal{L}\left(W, \sigma_1, \sigma_2\right) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(W) + \log\sigma_1 + \log\sigma_2$$
In the last step, the distance between y and $f^W(x)$ is written as a loss function, namely $\mathcal{L}_1(W) = \left\|y_1 - f^W(x)\right\|^2$; in the same way, $\mathcal{L}_2(W) = \left\|y_2 - f^W(x)\right\|^2$.
Continuing with Eq. (7): our task is to minimize this negative log likelihood, so as a task's noise σ increases, its effective weight $\frac{1}{2\sigma^2}$ decreases; conversely, as σ decreases, its weight increases. The $\log\sigma$ terms keep σ from simply growing without bound.
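A minimal numeric sketch of Eq. (7) in NumPy (the loss values are made up). Following the paper's practical trick, the loss is parameterized by log σ for numerical stability, and the effective task weight $\frac{1}{2\sigma^2}$ visibly shrinks as σ grows:

```python
import numpy as np

def joint_loss(l1, l2, log_s1, log_s2):
    # Eq.(7): 1/(2*s1^2)*L1 + 1/(2*s2^2)*L2 + log(s1) + log(s2),
    # parameterized by log(sigma) for numerical stability
    return (0.5 * np.exp(-2.0 * log_s1) * l1
            + 0.5 * np.exp(-2.0 * log_s2) * l2
            + log_s1 + log_s2)

def task_weight(log_s):
    # Effective weight 1/(2*sigma^2) of one task
    return 0.5 * np.exp(-2.0 * log_s)

print(task_weight(0.0))  # sigma = 1 -> weight 0.5
print(task_weight(1.0))  # sigma = e -> weight ~0.068: noisier task, smaller weight
```

In training, log σ1 and log σ2 would be free parameters optimized jointly with W, so the network learns the task weighting instead of having it hand-tuned.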
Next, the classification case. Here a softmax, scaled by the uncertainty, is applied to the model output, as in Eq. (8):

$$p\left(y \mid f^W(x), \sigma\right) = \mathrm{Softmax}\!\left(\frac{1}{\sigma^2} f^W(x)\right)$$

The softmax log likelihood is then:

$$\log p\left(y = c \mid f^W(x), \sigma\right) = \frac{1}{\sigma^2} f_c^W(x) - \log \sum_{c'} \exp\!\left(\frac{1}{\sigma^2} f_{c'}^W(x)\right)$$
Now consider a model with two outputs: a continuous output y1 and a discrete output y2, modeled with a Gaussian and a softmax respectively. This yields Eq. (10):

$$\mathcal{L}\left(W, \sigma_1, \sigma_2\right) \approx \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{\sigma_2^2}\mathcal{L}_2(W) + \log\sigma_1 + \log\sigma_2$$
As before, $\mathcal{L}_1(W) = \left\|y_1 - f^W(x)\right\|^2$ is the Euclidean loss of y1, and $\mathcal{L}_2(W) = -\log \mathrm{Softmax}\left(y_2, f^W(x)\right)$ is the cross-entropy loss of y2.
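A NumPy sketch of the combined objective of Eq. (10) (the loss value and logits are made up; the cross-entropy uses the unscaled softmax, following the paper's approximation):

```python
import numpy as np

def softmax_ce(logits, target):
    # Cross-entropy of the unscaled softmax: -log Softmax(y2, f^W(x))
    z = logits - logits.max()  # shift for numerical stability
    return -(z[target] - np.log(np.exp(z).sum()))

def combined_loss(l1, logits, target, log_s1, log_s2):
    # Eq.(10): 1/(2*s1^2)*L1 + 1/(s2^2)*L2 + log(s1) + log(s2)
    l2 = softmax_ce(logits, target)
    return (0.5 * np.exp(-2.0 * log_s1) * l1
            + np.exp(-2.0 * log_s2) * l2
            + log_s1 + log_s2)

logits = np.array([2.0, 0.5, -1.0])  # hypothetical classification logits
print(combined_loss(4.0, logits, target=0, log_s1=0.0, log_s2=0.0))
```

Note the asymmetry between the regression term (factor 1/2) and the classification term, which comes from the Gaussian vs. scaled-softmax likelihoods.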
2. DTP: Dynamic Task Prioritization (high weight for difficult tasks)
Reference: Loss weights for multi-objective learning in recommender systems — console.log(**)'s blog, CSDN
DTP hopes to give higher weight to harder-to-learn tasks.
Intuitively, tasks with a high KPI are relatively easy to learn and get a smaller weight; conversely, difficult tasks get a larger weight.
【Evaluation】:
Advantages:
Only the KPI values at each step are needed, which avoids having to compute per-task gradients, so it is fast to compute.
Disadvantages:
DTP does not consider the loss magnitudes of the different tasks, so extra operations are needed to normalize each task's loss to the same scale; and the KPIs need to be computed frequently.
# Dynamic Task Prioritization
def focal_loss_dtp(auc, k, alpha=0.95, gamma=2.0):
    # Smooth the task KPI (here AUC) with a moving average
    k = alpha * auc + (1 - alpha) * k
    # Focal-style weight: the lower the KPI k, the larger the weight
    return -tf.pow(1 - k, gamma) * tf.math.log(k)

# auc_i[1] holds each task's current AUC metric; k_i its running KPI
dtp_1 = focal_loss_dtp(auc_1[1], k_1)
dtp_2 = focal_loss_dtp(auc_2[1], k_2)
dtp_3 = focal_loss_dtp(auc_3[1], k_3)
loss = dtp_1 * loss_1 + dtp_2 * loss_2 + dtp_3 * loss_3
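To see the weighting behaviour numerically, here is a NumPy re-implementation of the focal weight with made-up KPI values (the smoothing step is omitted for clarity):

```python
import numpy as np

def dtp_weight(kpi, gamma=2.0):
    # Focal-style DTP weight: -(1 - KPI)^gamma * log(KPI)
    return -np.power(1.0 - kpi, gamma) * np.log(kpi)

print(dtp_weight(0.9))  # easy task (high KPI) -> small weight
print(dtp_weight(0.6))  # hard task (low KPI)  -> much larger weight
```

As the KPI approaches 1 the weight vanishes, so a task that is already solved contributes almost nothing to the total loss.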
3. Dynamic Weight Averaging (DWA)
4. Gradient Normalization (GradNorm)
References:
- Optimization in Multi-task Learning — Zhihu: https://zhuanlan.zhihu.com/p/269492239
- How to get losses to converge synchronously in a multi-task model? (introduces "Multi-Task Learning Using Uncertainty to Weigh Losses") — ABEL Su's blog, CSDN: https://blog.csdn.net/weixin_36261487/article/details/112180071