Bayesian Neural Networks from Theory to Practice Using Python

1. Description

        In this article, we learn how to build a machine learning model that combines the power of neural networks while maintaining a probabilistic approach to prediction. To do this, we build a so-called Bayesian neural network. The idea is not to optimize the loss of a single neural network, but to optimize the loss of infinitely many neural networks. In other words, we optimize the probability distribution of the model parameters given the data set.

        We do this using a loss function that contains an indicator called the Kullback-Leibler divergence, which measures the distance between two probability distributions.

        After optimizing the loss function, we can use the model probabilistically. This means that if we run the model twice, we get two different results, and if we run it 10k times, we can extract a robust statistical distribution of the results.

        We use torch and a library called torchbnn to achieve this. We construct a simple regression task and solve it with a two-layer feedforward neural network.

2. Physicists and Engineers

        Physics and engineering are two distinct sciences that share a desire to understand nature and the ability to simulate it.

        The method of physicists is more theoretical. Physicists observe the world and try to model it as accurately as possible. The reality that physicists model is imperfect and has approximations, but once those imperfections are taken into account, reality becomes neat, perfect, and elegant.

        The engineer’s method is more practical. Engineers are aware of all the limitations of physicists' models and try to make the experience in the lab as smooth as possible. Engineers may make cruder approximations (e.g. pi = 3), but their approximations actually work better in real-life experiments.

        This quote from Gordon Lindsay Glegg sums up the difference between the practical approach of engineers and the elegant theoretical approach of physicists:

A scientist can discover a new star, but he cannot create one. He would have to ask an engineer to do it for him.

        In the daily life of a researcher, it works a bit like this. A physicist is a person who has a theory about a specific phenomenon. Engineers are scientists who can set up an experiment and see whether the theory works.

        Actually, when I started switching from physics to engineering, a question I was often asked was:

“Okay, your model seems to work… but how powerful is it?”

This is a typical engineer question.

When you have a physical model, you have a model that, under certain assumptions, is theoretically perfect.

        But when you run the experiment in the real world, there is a certain degree of error, and you have to be able to estimate it correctly.


        In the specific example we are considering, how do we estimate the energy difference between the theoretical output and the experimental result?

        Two options:

        A. If the model is deterministic, you can change the initial conditions by a small increment (e.g., apply the deterministic rule to a noisy version of the input).

        B. If the model is probabilistic, for a given input, you can extract statistical information from the output (e.g. mean, standard deviation, uncertainty bounds...).

        Now let’s get into the language of machine learning. In this specific case:

        A. If the machine learning model is deterministic, we can test its robustness by shuffling the training and validation sets.

        B. If the machine learning model is probabilistic, for a given input, you can extract statistical information from the output (such as mean, standard deviation, uncertainty bounds...).

        Now, let’s assume that the model we want to use is a neural network.
        The first question to ask: do you really need a neural network? If the answer is yes, then you have to use it (you don't say). The question then becomes:

“Is your machine learning model robust?”

        Neural networks, in their original definition, are purely deterministic.
        We can indeed shuffle the training, validation, and test sets, but we need to consider that a neural network may take a long time to train; if we want to run multiple tests (say CV = 10,000), we may have to wait a while.

        Another thing we need to consider is that neural networks are optimized using an algorithm called gradient descent. The idea is that we start from a point in parameter space and, as the name suggests, descend along the direction indicated by the negative gradient of the loss. Ideally, this takes us to a global minimum (spoiler: it is never actually global).

        For an unrealistically simple one-dimensional loss function, the ideal situation looks like this:

        Now, in this case, if we change the starting point, we still converge to the unique global minimum.

        A more realistic situation is this:

        Therefore, if we randomly restart the training algorithm from a different starting point, we will converge to a different local minimum.

        So if we start from point 1 or point 3, we end up at a lower point than if we start from point 2.

        The loss function may be filled with local minima, so finding the true global minimum can be a difficult task. Another thing we can do is restart training from different starting points and compare the loss function values. This approach has the same problem as before: we can only do it so many times.
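To make this concrete, here is a minimal sketch of the restart idea (a toy one-dimensional loss chosen purely for illustration, not part of the original article): plain gradient descent started from different points can settle in different local minima with different loss values.

import numpy as np

def loss(w):
    # toy 1D loss with several local minima (purely illustrative)
    return np.sin(3*w) + 0.1*w**2

def grad(w):
    # analytical derivative of the toy loss
    return 3*np.cos(3*w) + 0.2*w

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)   # step along the negative gradient
    return w, loss(w)

# restart from different starting points and compare the final losses
for w0 in [-2.0, 0.0, 2.0]:
    w_final, loss_final = gradient_descent(w0)
    print(f"start {w0:+.1f} -> w = {w_final:+.3f}, loss = {loss_final:.3f}")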

        There is a more powerful, rigorous, and elegant way to use the same computational power of neural networks in a probabilistic way; it's called a Bayesian neural network.

        In this article, we will learn:

  1. The idea behind Bayesian neural networks
  2. The mathematical formulation behind Bayesian neural networks
  3. How to implement a Bayesian neural network using Python (more specifically, PyTorch)
  4. How to use Bayesian neural networks to solve regression problems

Let's start!

3. What is a Bayesian neural network?

        As we said before, the idea of Bayesian neural networks is to add a probabilistic "feel" to a typical neural network. How do we do that?

        Before getting into Bayesian neural networks, we should probably review Bayes' theorem.

        A very effective way to look at Bayes' theorem is the following:

"Bayes' theorem is a mathematical theorem that explains why if all cars in the world are blue, then my car must be blue, but just because my car is blue, it doesn't mean All cars in the world are blue.

        In mathematical terms, given two events "A" and "B", the probability of event "A" occurring given that event "B" has occurred is written P(A|B).

        The probability of event "B" occurring given that event "A" has occurred is written P(B|A).

The formula linking the two expressions is Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)
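To make the formula concrete, here is a tiny numeric illustration in Python (the probability values below are made up, purely for the sake of example):

p_A = 0.01          # P(A): prior probability of event A (made-up value)
p_B_given_A = 0.9   # P(B|A): probability of B given that A occurred (made-up value)
p_B = 0.05          # P(B): overall probability of event B (made-up value)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.18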

Understood? Great. Now, let's say you have your neural network model. This neural network is nothing but a set of parameters used to convert a given input into the desired output.

A feedforward neural network (the simplest deep learning structure) processes your input by multiplying it by a matrix of parameters. Then, a nonlinear activation function (this is the true power of neural networks) is applied to the result of that matrix multiplication. The result becomes the input of the next layer, where the same process is applied.
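As a minimal sketch of that forward pass (generic code with arbitrary layer sizes, not the author's model), each layer is a matrix multiplication followed by a nonlinear activation:

import torch

x = torch.randn(1, 4)                        # one input sample with 4 features
W1, b1 = torch.randn(4, 8), torch.randn(8)   # parameters of the first layer
W2, b2 = torch.randn(8, 1), torch.randn(1)   # parameters of the second layer

h = torch.relu(x @ W1 + b1)   # matrix multiplication + nonlinear activation
y = h @ W2 + b2               # the result becomes the input of the next layer
print(y)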

Now, we will call the parameter set of the model w. Now we can ask ourselves the hard question.

Suppose I have a data set D, a set of input and output pairs (x_i, y_i): for example, the i-th image of an animal and the i-th label (cat or dog):

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}

Given some data set D, what is the probability of having a set of parameters?

You may need to read this question 3 or 4 times to get the hang of it, but the idea is there. If there is a certain mapping between input and output, then in the extreme deterministic case only one set of parameters can process the input and produce the desired output. In a probabilistic fashion, some sets of parameters will be more likely than others.

So the quantity we are interested in is p(w|D): the probability of the model parameters given the data set.

Now, there are three cool things:

  1.    By taking the mean of this distribution, you can still recover a standard neural network model. For example:

        y_mean(x) = Σ_i NN(x; w_i) · p(w_i | D),   i = 1, …, N

        The left-hand side of the equation is the computed average output, and the right-hand side is the mean over the N possible sets of parameters, with the probability distribution p(w_i|D) providing the weight of each outcome.

        2. Although p(w|D) is obviously a mystery, p(D|w) is something we can always study. If we applied the above equation with a huge N, no machine learning would even be needed: you could simply "try all possible models given a certain neural network and weigh all the possible outcomes using the equation above".

        3. When we get p(w|D), we get more than just a machine learning model; we actually get infinitely many machine learning models. This means we can extract uncertainty bounds and statistics from our predictions. The result is not just "10.23" but more like "10.23 with a possible error of 0.50".

I hope I have hyped you up. Let's go to the next chapter.

4. Some Mathematics

I don’t want this post to be all small talk, but I don’t want it to be painful either. If you already understand the concept of Bayesian neural networks, or you already know the mathematics behind them, feel free to skip this chapter. If you want a reference, a good one is Hands-on Bayesian Neural Networks – A Tutorial for Deep Learning Users.

        Now, this all seems cool, but I think that if you are a machine learning user, you will have this thought:

"How could I optimize such a strange creature?"

The short answer is: "By maximizing a quantity known as the evidence lower bound (ELBO)."

        But I don't think that is self-evident, so let's unpack it.

        In this case, the idea behind the optimization is that the true distribution p(w|D) is unknown, so we approximate it with another distribution, which we will call q, and we need a measure of the distance between the two distribution functions.

The metric we will use is called the Kullback-Leibler divergence:

KL(q || p) = ∫ q(w) log( q(w) / p(w) ) dw

        Some interesting facts about it:

  1. It is 0 for two equal distributions
  2. It diverges to infinity if the distribution in the denominator goes to zero where the one in the numerator remains non-zero
  3. It is asymmetric (see the small sketch after this list)
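Here is a minimal numeric sketch of those properties, using two made-up discrete distributions (not part of the original article):

import numpy as np

def kl(q, p):
    # discrete Kullback-Leibler divergence KL(q || p)
    return np.sum(q * np.log(q / p))

q = np.array([0.4, 0.6])
p = np.array([0.5, 0.5])

print(kl(q, q))  # 0.0 for two equal distributions
print(kl(q, p))  # differs from kl(p, q): the divergence is asymmetric
print(kl(p, q))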

Now, the loss function we use is a proxy for the Kullback-Leibler divergence, and it is called the Evidence Lower Bound (ELBO).

The distribution q of the weights is assumed to be a normal distribution with mean mu and variance sigma^2:

        Optimization therefore means determining the best mu and sigma values for this distribution.

        In the actual PyTorch implementation, the MSE between the distribution mean and the target is also added to our loss L(mu, sigma), so the total cost is of the form MSE + kl_weight * KL.

5. The Pyt(orch)hon implementation

        Implementing a Bayesian neural network in Python is very simple using PyTorch with the help of a library called torchbnn.

Installing it is very easy:

pip install torchbnn

As we will see, we will build something very similar to a standard torch neural network:

model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1, out_features=1000),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1000, out_features=1),
)

Actually, there is also a utility, transform_model, that can convert your torch model into its Bayesian surrogate:

transform_model(model, nn.Conv2d, bnn.BayesConv2d, 
                args={"prior_mu":0, "prior_sigma":0.1, "in_channels" : ".in_channels",
                      "out_channels" : ".out_channels", "kernel_size" : ".kernel_size",
                      "stride" : ".stride", "padding" : ".padding", "bias":".bias"
                     }, 
                attrs={"weight_mu" : ".weight"})

But let's do a hands-on detailed example:

6. Hands-on regression task

The first thing to do is to import some libraries:

import numpy as np
from sklearn import datasets
import torch
import torch.nn as nn
import torch.optim as optim
import torchbnn as bnn
import matplotlib.pyplot as plt

After that we will make a very simple 2D dataset:

x = torch.linspace(-2, 2, 500)                        # 500 evenly spaced points in [-2, 2]
y = x.pow(5) - 10*x.pow(1) + 2*torch.rand(x.size())   # noisy fifth-degree target
x = torch.unsqueeze(x, dim=1)                         # reshape to (500, 1) for the network
y = torch.unsqueeze(y, dim=1)

plt.scatter(x.data.numpy(), y.data.numpy())
plt.show()

So, given our 1D input x (ranging from -2 to 2), we want to find our y.

def clean_target(x):
    return x.pow(5) - 10*x.pow(1) + 1
def target(x):
    return x.pow(5) - 10*x.pow(1) + 2*torch.rand(x.size())

clean_target is our ground-truth generator, and target is the noisy data generator.

Now we will define the Bayesian feedforward neural network:

model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1, out_features=1000),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0, prior_sigma=0.1, in_features=1000, out_features=1),
)

As we can see, it is a two-layer feedforward neural network with Bayesian layers. This will allow us to obtain a probabilistic output.

Now we will define our MSE loss and our Kullback-Leibler divergence loss:

mse_loss = nn.MSELoss()
kl_loss = bnn.BKLLoss(reduction='mean', last_layer_only=False)
kl_weight = 0.01

optimizer = optim.Adam(model.parameters(), lr=0.01)

Both losses will be used in our optimization step:

for step in range(2000):
    pre = model(x)                # forward pass
    mse = mse_loss(pre, y)        # data-fit term
    kl = kl_loss(model)           # Kullback-Leibler term over the Bayesian layers
    cost = mse + kl_weight*kl     # total cost: MSE plus weighted KL

    optimizer.zero_grad()
    cost.backward()
    optimizer.step()

print('- MSE : %2.2f, KL : %2.2f' % (mse.item(), kl.item()))

2000 epochs have been used.

Let's define our test set:

x_test = torch.linspace(-2, 2, 300)
y_test = target(x_test)

x_test = torch.unsqueeze(x_test, dim=1)
y_test = torch.unsqueeze(y_test, dim=1)

Now, the model's output is probabilistic. This means that if we run the model 10,000 times, we will get 10,000 slightly different values. For each data point from -2 to 2, we will get a mean and a standard deviation.

models_result = np.array([model(x_test).data.numpy() for k in range(10000)])
models_result = models_result[:,:,0]    
models_result = models_result.T
mean_values = np.array([models_result[i].mean() for i in range(len(models_result))])
std_values = np.array([models_result[i].std() for i in range(len(models_result))])

We'll plot our confidence intervals.

plt.figure(figsize=(10,8))
plt.plot(x_test.data.numpy(),mean_values,color='navy',lw=3,label='Predicted Mean Model')
plt.fill_between(x_test.data.numpy().T[0],mean_values-3.0*std_values,mean_values+3.0*std_values,alpha=0.2,color='navy',label='99.7% confidence interval')
#plt.plot(x_test.data.numpy(),mean_values,color='darkorange')
plt.plot(x_test.data.numpy(),y_test.data.numpy(),'.',color='darkorange',markersize=4,label='Test set')
plt.plot(x_test.data.numpy(),clean_target(x_test).data.numpy(),color='green',markersize=4,label='Target function')
plt.legend()
plt.xlabel('x')
plt.ylabel('y')

7. References

      A. From Theory to Practice with Bayesian Neural Network, Using Python | by Piero Paialunga | Towards Data Science
