Introduction to Explainable Artificial Intelligence - Bayesian Methods
Chapter 2 Bayesian Methods
Bayesian methods model the joint probability distribution of multiple random variables, characterizing uncertainty and correlation in data and models.
2.1 Bayesian Networks
Bayesian networks are an important class of probabilistic graphical models. They address three main problems: representation, inference, and learning.
Key elements: a directed acyclic graph G and a probability distribution p.
Representation
Random variables $X = (X_1, X_2, \ldots, X_d)$; $\pi_k$ is the set of parent nodes of $X_k$, and $X_{\pi_k}$ is the corresponding set of random variables.
The joint probability distribution factorizes into a product of local conditional distributions:
$$p(X) = \prod_{i=1}^{d} p(X_i \mid X_{\pi_i})$$
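To make the factorization concrete, here is a minimal Python sketch for a hypothetical three-node chain A -> B -> C; the probability tables are made up for illustration.

```python
# Hypothetical chain A -> B -> C with made-up conditional probability tables.
p_A = {0: 0.4, 1: 0.6}                       # p(A)
p_B_given_A = {0: {0: 0.9, 1: 0.1},          # p(B | A), indexed [a][b]
               1: {0: 0.3, 1: 0.7}}
p_C_given_B = {0: {0: 0.8, 1: 0.2},          # p(C | B), indexed [b][c]
               1: {0: 0.25, 1: 0.75}}

def joint(a, b, c):
    """p(A=a, B=b, C=c) = p(a) * p(b | a) * p(c | b)."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```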
Conditional independence: A and B are independent given C, i.e. $p(A, B \mid C) = p(A \mid C)\,p(B \mid C)$.
Three basic structures for conditional independence: fork, chain, and collider.
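The fork structure can be checked numerically. The sketch below uses a hypothetical fork A <- C -> B with made-up tables: A and B are dependent marginally but become independent once C is observed.

```python
# Hypothetical fork A <- C -> B; all numbers are made up for illustration.
p_C = {0: 0.5, 1: 0.5}
p_A_given_C = {0: 0.2, 1: 0.9}   # p(A=1 | C=c)
p_B_given_C = {0: 0.2, 1: 0.9}   # p(B=1 | C=c)

# Marginal joint vs product of marginals.
p_ab = sum(p_C[c] * p_A_given_C[c] * p_B_given_C[c] for c in (0, 1))
p_a = sum(p_C[c] * p_A_given_C[c] for c in (0, 1))
p_b = sum(p_C[c] * p_B_given_C[c] for c in (0, 1))
print(p_ab, p_a * p_b)   # 0.425 vs 0.3025: A and B are marginally dependent

# Given C=c, the joint p(A, B | C=c) factorizes as
# p(A | C=c) * p(B | C=c) by construction: A and B are independent given C.
```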
Inference
- Likelihood: observe a value e of some variables (the evidence) and compute its probability. (E.g., the probability P(A=1, D=1) that both the mouse and eagle populations are doing well.)
- Conditional (posterior) probability: observe the evidence e and compute the posterior probability of an unobserved variable. (E.g., given that the mouse population is doing well, how is the eagle population doing: P(A | D=1)?)
- Maximum a posteriori (MAP) value: given the evidence e, find the most probable value of an unobserved variable. (Continuing the example: the most likely state of the eagle population, argmax_a p(A=a | D=1).)
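The three query types can be illustrated by brute-force enumeration on a tiny hypothetical chain D (mouse) -> S (snake) -> A (eagle); the tables and the intermediate snake variable are assumptions for illustration, not from the text.

```python
# Hypothetical chain D -> S -> A with made-up conditional probability tables.
p_D = {0: 0.3, 1: 0.7}
p_S_given_D = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p(S=s | D=d)
p_A_given_S = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(A=a | S=s)

def joint(d, s, a):
    return p_D[d] * p_S_given_D[d][s] * p_A_given_S[s][a]

# 1) Likelihood of the evidence: P(A=1, D=1), summing out S.
lik = sum(joint(1, s, 1) for s in (0, 1))        # ~ 0.455

# 2) Posterior: P(A=1 | D=1) = P(A=1, D=1) / P(D=1).
post = lik / p_D[1]                               # ~ 0.65

# 3) MAP value: argmax_a P(A=a | D=1).
map_a = max((0, 1), key=lambda a: sum(joint(1, s, a) for s in (0, 1)))
print(lik, post, map_a)
```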
Variable elimination: an exact inference method.
Approximate inference methods trade exactness for speed. There are two main categories: sampling-based methods such as Markov chain Monte Carlo (MCMC), and variational inference, which searches a family of tractable distributions for the member closest to the true posterior and uses it as an approximation.
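As a minimal sketch of the sampling-based idea, the following uses rejection sampling (a simpler relative of MCMC) on a hypothetical two-node network D -> A with made-up tables: draw ancestral samples and keep only those consistent with the evidence A=1, then estimate the posterior P(D=1 | A=1) from the kept samples.

```python
import random

random.seed(0)

# Hypothetical network D -> A with made-up probabilities.
p_D1 = 0.7                          # p(D=1)
p_A1_given_D = {0: 0.3, 1: 0.8}     # p(A=1 | D=d)

kept, hits = 0, 0
for _ in range(200_000):
    d = 1 if random.random() < p_D1 else 0
    a = 1 if random.random() < p_A1_given_D[d] else 0
    if a != 1:
        continue                    # reject: sample contradicts evidence A=1
    kept += 1
    hits += d

estimate = hits / kept
print(round(estimate, 3))           # close to the exact 0.56/0.65 ~ 0.862
```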
Learning Bayesian Networks
- Parameter learning: assuming the Bayesian network structure is given, estimate the optimal parameters or probability distributions.
  Point estimation: choose parameters by minimizing a statistical divergence; maximum likelihood estimation is equivalent to minimizing the KL divergence.
  Full Bayesian approach: treat the model parameters as (global) random variables with a prior, apply Bayes' rule to estimate the posterior distribution over the parameters, and average predictions over all resulting models.
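The two estimation styles can be contrasted on the simplest possible case, a single binary variable with made-up counts; the Beta prior here is a standard conjugate choice for the Bernoulli likelihood, assumed for illustration rather than prescribed by the text.

```python
# Made-up observation counts for a single binary variable.
heads, tails = 7, 3

# Point estimate: maximum likelihood.
theta_mle = heads / (heads + tails)          # 0.7

# Full Bayesian: a uniform Beta(1, 1) prior over the parameter is updated
# by the counts to a Beta(1 + heads, 1 + tails) posterior; predictions
# average over this whole distribution rather than using a single point.
a, b = 1 + heads, 1 + tails
posterior_mean = a / (a + b)                 # 8/12 ~ 0.667

print(theta_mle, round(posterior_mean, 3))
```

Note how the posterior mean is pulled slightly toward the prior; with more data the two estimates converge.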
- Structure learning: learn the graph structure itself from data.
Bayesian Program Learning
Few-shot learning: given only a handful of examples, how can one learn a suitable model that still makes good predictions?
Bayesian Program Learning (BPL) is an interpretable hierarchical Bayesian model:
- Representation: a symbol level (BPL samples primitive units to build sub-parts, composes them into parts and relations between parts, and finally into characters) plus an entity level (writing out a concrete instance step by step from a given template).
- Inference: given an image, BPL infers the posterior probability distribution over the corresponding parts, sub-parts, and relations. (A random walk starting from the upper-left corner samples the possible parses and yields an approximate posterior.)
- Learning: two levels, traditional learning (training on many different characters and inferring the posterior distribution over parameters) and learning to learn (transferring previous experience to new data).
2.2 Bayesian Deep Learning
Bayesian deep learning is the cross-fertilization of Bayesian learning and deep learning:
- Deep generative models: use the fitting capacity of neural networks to capture complex relationships among variables in probabilistic modeling, yielding more expressive probabilistic models.
- Bayesian neural networks: use Bayesian inference to characterize model uncertainty in deep learning, replacing point-valued weights with probability distributions.
Deep generative models
The two main families are the variational autoencoder (VAE) and the generative adversarial network (GAN). On their own, how they fit the data is not interpretable; the interpretable part can therefore be expressed as a Bayesian network, with the neural network fitting the rest.
Example: Graphical-GAN, a probabilistic graphical generative adversarial network, can automatically learn interpretable features without semantic annotation.
Bayesian neural networks
Dropout can be viewed as approximate Bayesian inference in deep learning.
MC dropout keeps dropout active at test time: it samples different random versions of the same network as draws from an approximate posterior, averages their outputs to estimate the mean prediction, and uses their spread to estimate predictive uncertainty.
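The idea can be sketched without a real neural network: the toy model below is a single linear layer with made-up weights, run many times with random dropout masks; the mean of the outputs approximates the prediction and their variance serves as an uncertainty estimate.

```python
import random

random.seed(0)

# Toy "model": one linear layer with made-up weights and a made-up input.
weights = [0.5, -1.2, 0.8, 2.0]
x = [1.0, 0.5, -1.0, 0.25]
p_keep = 0.8    # each weight is kept with this probability

def forward_with_dropout(w, x):
    # Inverted dropout: drop each weight independently with probability
    # 1 - p_keep, and rescale kept weights by 1 / p_keep.
    return sum((wi / p_keep) * xi
               for wi, xi in zip(w, x)
               if random.random() < p_keep)

# MC dropout: run the same input through many random masked versions.
preds = [forward_with_dropout(weights, x) for _ in range(1000)]
mean = sum(preds) / len(preds)
var = sum((p - mean) ** 2 for p in preds) / len(preds)
print(round(mean, 2), round(var, 2))  # mean prediction and its uncertainty
```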
From Bayesian Networks to Interpretable Causal Models
Causal models take into account variables outside the model, and their (directed) edges describe causal relationships.