Uncertainty in Deep Learning

Refer to the original text: Two Kinds of Uncertainty in Deep Learning

Note: The concepts, formulas and experiments in this article are based on Alex Kendall & Yarin Gal's paper: https://arxiv.org/pdf/1703.04977.pdf

Uncertainty

At present, deep learning performs very well in many fields. For example, the accuracy of semantic segmentation used in autonomous driving is quite impressive. But as we all know, Tesla's self-driving cars were involved in several accidents a while ago, one of them fatal. The root cause was that the algorithm misidentified a light-colored truck as the sky.

One problem exposed by this accident is that traditional deep learning algorithms can almost only give a specific result, but cannot tell us how confident the model itself is in that result. It is true that in classification problems we add a softmax function to the last layer of the network to obtain probabilities, but imagine the following situation: suppose I trained a model to classify human faces and orangutan faces, and during testing I gave it a picture of a big-faced cat. The model would give a rather meaningless result, with no way of telling us "I really don't know what this is." Some people may ask: in this case, wouldn't the final output be something like [0.5, 0.5] to indicate that the model is unsure? In fact, the characteristics of the softmax function make it very unlikely that the network outputs a result like [0.5, 0.5] in this situation [1].
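As a quick numerical illustration (a toy sketch of my own, not from the paper): as soon as the logits differ even a little, softmax pushes most of the probability mass onto one class, even for an input the model has never seen.

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for an out-of-distribution image (e.g. a cat fed to a
# face-vs-orangutan classifier). Both raw scores are "wrong", but one is
# slightly larger, so softmax still reports high confidence.
logits = np.array([3.0, 1.0])
print(softmax(logits))   # -> [0.88, 0.12], not the intuitive [0.5, 0.5]
```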

This is a very important question. I once saw an answer on Zhihu from someone working on spacecraft, explaining why ML cannot yet be applied in the aerospace field. He said that a NN gives very good results in most cases, but occasionally gives a particularly bad one, and that particularly bad result is absolutely unacceptable in their field. If the model could attach a very low confidence to such a result, humans could be informed that intervention is needed, and ML could then be applied in a much wider range of fields.

So how can we get the network to output a confidence level? A very common method at present is the BNN (Bayesian Neural Network). The general principle of a BNN is that the weight of each parameter in the network is no longer a specific number, but is replaced by a prior distribution. In this way, the network we train is no longer a single function but a distribution over functions [2]. Through this distribution we can obtain a confidence level for the result. However, anyone who has implemented a BNN with pyro knows that BNNs are difficult to apply to large networks with hundreds of convolutional layers: their training speed and computational complexity limit their adoption.
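As a purely conceptual sketch (not the pyro implementation referred to in [2]), a Bayesian layer can be thought of as a layer whose weights are drawn from a learned distribution on every forward pass:

```python
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    """Toy illustration: the weights are a distribution, not a point estimate.
    The mean and log standard deviation are the learnable parameters."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_logstd = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        # A fresh weight sample is drawn on every call, so the output of the
        # network is itself a random variable.
        w = self.w_mu + torch.exp(self.w_logstd) * torch.randn_like(self.w_mu)
        return x @ w.t()
```

Training such a layer properly requires a variational objective (an ELBO with a KL term against the prior), and that extra machinery is exactly what becomes expensive for networks with hundreds of convolutional layers.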

This article will discuss the different causes of uncertainty in deep learning and describe how to quantify these uncertainties. We will perform Bayesian inference through a method called MC Dropout (Monte Carlo Dropout), and then modify the loss function to obtain the uncertainty.

Aleatoric Uncertainty & Epistemic Uncertainty

Let’s first explain the two different types of uncertainty that exist in deep learning.

1. Aleatoric uncertainty

When we were learning physics in junior high and high school, the teacher must have mentioned the term random (accidental) error. When we measure the gravitational acceleration in a free-fall experiment, the value obtained fluctuates up and down each time. This is caused by air flow disturbance, insufficient measurement accuracy, and so on, and it is an unavoidable type of error. In deep learning, we call this kind of error aleatoric uncertainty.

As an example from the perspective of deep learning, let's take the facial keypoint regression problem that everyone should be familiar with [3]:

We can see that for a very similar pair of images, there is a relatively large error in the dataset's labels (see the right edge of the right picture). Such errors are not introduced by our model; they are errors inherent in the data. The larger this bias in the dataset, the larger our aleatoric uncertainty should be.

2. Epistemic uncertainty

Epistemic uncertainty is the uncertainty that exists in our model. Take the example given at the beginning of this article: suppose we train a model to classify human faces and orangutan faces, with no augmentation during training, meaning no rotation, blurring, or similar operations on the dataset. If I give the model a normal human face or a normal orangutan face, it should have high confidence in its result. But if I give it a picture of a cat, a blurred human face, or an orangutan face rotated 90°, its confidence should be particularly low. In other words, epistemic uncertainty measures whether the input data lies within the distribution of data the model has already seen.

Quantification of two types of uncertainty

Note: This article only focuses on the quantification of uncertainty in regression problems. For classification problems, the formulas I list below will become more complicated, and I will describe them in detail in "Two Uncertainties in Deep Learning (Part 2)".

1. Quantification of epistemic uncertainty

Unlike what we usually do, in order to obtain epistemic uncertainty we will not disable Dropout during the testing phase.

Why is this? We still need to start with BNN.

The ultimate goal of a BNN is to find the posterior distribution P(W|D), where W is the weights and D is the dataset. Applying Bayes' formula gives:

P(W|D)=\frac{P(D|W)P(W)}{P(D)}\\

Among these terms, P(D) is very difficult to obtain. First of all, P(D) is the marginal probability of the data itself (the evidence), which in general cannot be computed (if it could be obtained easily, what would we still need ML for...). Secondly, you might think of this formula:

P(D)=\sum_i P(D|W_i)P(W_i)\\

What this formula means is that you would need to traverse all possible W to compute P(D), which is obviously infeasible.

Now we need a way around the intractable posterior. Bayesian neural networks are generally trained with variational inference; interested readers can find more information in the reference reading. What we are going to use here is another method: the Monte Carlo method.

The essence of the Monte Carlo method is to build an estimate of the posterior distribution P(W|D) from a finite number of samples. We can then use the resulting approximate distribution in place of P(W|D), and with this posterior distribution we can tell whether an input lies within the distribution we have learned, thereby obtaining epistemic uncertainty.

However, an ordinary network produces a fixed value every time we feed it the same data; every "sample" gives the same result, so is there no way to apply the Monte Carlo method? This is exactly why we choose not to disable Dropout during the testing phase: we treat Dropout as a natural random generator, so that we can obtain samples from P(W|D) without making any changes to our existing network. Readers interested in the derivation can find a very detailed one in the reference reading [4].

So, in the end, how do you get epistemic uncertainty? Although the principle above is rather obscure, the operation itself is very simple. You only need to feed the data into the network T times, take the average of the results as your final prediction, and compute the variance of the results; that variance is your epistemic uncertainty. The formulas are as follows:

E(\boldsymbol y)=\frac{1}{T}\sum_{t=1}^{T}f(\boldsymbol x_t)\\
Var(\boldsymbol y)=\sigma^2+\frac{1}{T}\sum_{t=1}^{T}f(\boldsymbol x_t)^Tf(\boldsymbol x_t)-E(\boldsymbol y)^TE(\boldsymbol y)\\

Here, f(\boldsymbol x_t) denotes the output of the t-th stochastic forward pass, and Var(\boldsymbol y) is your epistemic uncertainty. As for where this \sigma^2 comes from, we will talk about it next.
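A minimal PyTorch sketch of this procedure (names such as `model` and `T` are placeholders, and the model is assumed to output a prediction together with a log variance, as described in the next section): keep the Dropout layers in training mode at test time, run T stochastic forward passes, and compute the mean and variance of the outputs.

```python
import torch

def mc_dropout_predict(model, x, T=50):
    model.eval()
    # Re-enable only the Dropout layers, so that each forward pass samples a
    # different sub-network (this is what "not disabling Dropout" means).
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

    preds, log_vars = [], []
    with torch.no_grad():
        for _ in range(T):
            y_hat, log_var = model(x)   # assumed outputs: (f(x), log sigma^2)
            preds.append(y_hat)
            log_vars.append(log_var)

    preds = torch.stack(preds)          # shape: (T, ...)
    mean = preds.mean(dim=0)            # E(y), the final prediction
    epistemic = preds.var(dim=0)        # variance over the T passes
    aleatoric = torch.exp(torch.stack(log_vars)).mean(dim=0)  # average sigma^2
    return mean, epistemic, aleatoric
```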

2. Quantification of aleatoric uncertainty

For aleatoric uncertainty, the authors derive the loss function in a rather involved way (the derivation is not in the paper itself but in the references; interested readers can look it up):

L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\frac{||y_i-f(x_i)||^2}{2\sigma(x_i)^2}+\frac{1}{2}\log(\sigma(x_i)^2)\\

Our model is denoted f, and its output is now the pair \{f(x_i),\sigma^2\}. In the loss function, \sigma(x_i)^2 describes the model's aleatoric uncertainty on the data point x_i, i.e. the variance of the data. The model now learns this variance in an unsupervised way. The derivation of this loss function is complicated, but we can understand it intuitively and qualitatively.

Imagine a very simple regression problem. The training set is a sin function plus noise (the simple experiment below uses exactly this dataset). The goal of training is to fit this noisy sin as well as possible, which means we hope our model gets close to the true underlying distribution, the clean sin. If the network had no regularization and simply overfit, then for every x_i the term ||y_i-f(x_i)||^2 would be driven to 0. But our network does have regularization (Dropout), so it tries to fit the overall trend rather than memorizing every data point. We can therefore treat the residual ||y_i-f(x_i)||^2 as error in the data itself. In other words, if my regularized network has trained very well on the noisy sin and learned the trend, but there is always a small residual, then that residual can be regarded as the noise.

The loss function tries to offset this noise-induced loss by dividing ||y_i-f(x_i)||^2 by \sigma(x_i)^2. Then why do we need the extra term \frac{1}{2}\log(\sigma(x_i)^2), which looks like a regularizer? Because if we only divided by \sigma(x_i)^2, the network would tend to predict every \sigma^2 as infinity to minimize the loss. With \frac{1}{2}\log(\sigma(x_i)^2) added, the network learns to increase \sigma(x_i)^2 only where ||y_i-f(x_i)||^2 is very large. In this way, the network learns to output aleatoric uncertainty. The authors of the original paper note that this structure is not a deliberate design but a derived property.

In this way, every time the model is given a data point x, it can output both a prediction and a \sigma^2 representing the aleatoric uncertainty. At the same time, this \sigma^2 is also plugged into the variance formula above when computing uncertainty.

In actual training, however, we do not let the network output \sigma^2 directly, because if \sigma^2 is 0 the loss function immediately becomes NaN. So during training we let the network output \log(\sigma^2) instead, which avoids the zero. The loss function then becomes:

L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\exp(-\log(\sigma(x_i)^2))\,||y_i-f(x_i)||^2+\frac{1}{2}\log(\sigma(x_i)^2)\\
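A short PyTorch sketch of this loss, assuming the network has a second output head that predicts `log_var` = \log(\sigma(x_i)^2) (the head name and shapes here are illustrative, not the paper's code):

```python
import torch

def heteroscedastic_loss(y_pred, log_var, y_true):
    """Regression loss with learned aleatoric uncertainty.
    y_pred:  network prediction f(x_i)
    log_var: predicted log(sigma(x_i)^2)
    y_true:  target y_i
    """
    precision = torch.exp(-log_var)            # 1 / sigma^2, always positive
    sq_err = (y_true - y_pred) ** 2
    # 0.5 * exp(-log sigma^2) * ||y - f(x)||^2 + 0.5 * log sigma^2
    return (0.5 * precision * sq_err + 0.5 * log_var).mean()
```

Because the network predicts \log(\sigma^2) rather than \sigma^2, the exponential keeps the precision strictly positive and the loss stays finite even when the predicted variance is very small.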

In terms of application, if you want to choose a threshold value and try to reject uncertain results, you can add the two uncertainties together.

Experiment

Let's first do an experiment on a very simple data set. This is a simple sin function, the blue data is our training data and the orange data is our test data.

In the interval [2.5, 5.0] we add noise drawn from \mathcal N(0,\sigma^2=0.5) to the training data. Note that the range of the test data significantly exceeds that of the training data. We trained with Adam, lr=0.001, weight_decay=1e-4, and a Dropout probability of 0.1, and obtained the following results:
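A sketch of how such a toy dataset can be generated; the noise interval and variance mirror the description above, while the training range [0, 5], the test range [-2, 8], and the random seed are assumptions of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training inputs and noisy targets: sin(x), with Gaussian noise of
# variance 0.5 added only on the interval [2.5, 5.0].
x_train = np.linspace(0.0, 5.0, 500)
noise = np.where((x_train >= 2.5) & (x_train <= 5.0),
                 rng.normal(0.0, np.sqrt(0.5), size=x_train.shape),
                 0.0)
y_train = np.sin(x_train) + noise

# Test inputs deliberately extend beyond the training range, so that
# epistemic uncertainty should grow outside it.
x_test = np.linspace(-2.0, 8.0, 500)
y_test = np.sin(x_test)
```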

Aleatoric uncertainty:

The red line is the prediction; the orange area is 1× the aleatoric uncertainty

Epistemic uncertainty:

The red line is the prediction; the orange area is 1× the epistemic uncertainty

We can see that the aleatoric uncertainty increases significantly in the noisy interval, while the epistemic uncertainty gradually increases outside the training range.

Next, we use the example in the Alex Kendall & Yarin Gal paper to see how this method performs on a more complex problem. The following pictures are all from the paper mentioned at the beginning of this article (originally I wanted to reproduce the result myself, but I don't have a graphics card at hand...).

The authors used a DenseNet and ran depth regression on the NYUv2 depth dataset. The following are the results.

From left to right: input image, ground truth, network output, aleatoric uncertainty, epistemic uncertainty

We can see that high aleatoric uncertainty often appears in places that are very deep or have no depth label at all, while epistemic uncertainty concentrates on object edges and at large depths where the network is prone to prediction failure. Detailed information can be found in the paper.

Closing remarks

We have introduced how to use MC dropout to quantify uncertainty in a network, but this article is limited to relatively simple regression problems. For classification, the formulas become more complicated. I am currently running a DenseNet semantic segmentation model, and I will write up the results and elaborate in [Experimental Notes] Two Uncertainties in Deep Learning (Part 2).

Thanks for reading!

Reference reading

[1] For the reason why results like [0.5, 0.5] are unlikely to appear with softmax, please refer to: http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa532c1ce.html

[2] For the detailed principles of BNN, please refer to: https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd

[3] Sample dataset and reference data come from: Exploring YouTube Faces with Keypoints Dataset

[4] http://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf
