A Survey of Bayesian Theoretical Frameworks

This article aims to give a more comprehensive understanding of Efficient Global Optimization (EGO), an optimization method closely related to Bayesian inference. Organized by the type of data involved (discrete and continuous), it reviews representative applications of Bayesian methods in discrete and continuous spaces. In discrete spaces, Bayesian methods often rely on frequency statistics and prior assumptions; the article briefly describes the application scenarios and preconditions of Bayesian inference under discrete conditions. For continuous data, Bayesian inference depends more on the construction of continuous distributions; here the article focuses on the inference principles, application backgrounds, and distinctions for the two cases of whether the likelihood function can be computed or not.

Bayes' Theorem

Bayes' theorem, the foundation of Bayesian inference, relates the conditional and marginal probability distributions of random variables. It provides a way to obtain the target posterior by computing conditional probabilities.

p\left(f \mid D_{1: t}\right)=\frac{p\left(D_{1: t} \mid f\right) p(f)}{p\left(D_{1: t}\right)}

Here f represents the unknown objective function; D_{1: t}=\left\{\left(\boldsymbol{x}_{1}, y_{1}\right),\left(\boldsymbol{x}_{2}, y_{2}\right), \ldots,\left(\boldsymbol{x}_{t}, y_{t}\right)\right\} represents the set of observations, where \boldsymbol{x}_{t} is a decision vector and y_{t}=f\left(\boldsymbol{x}_{t}\right)+\varepsilon_{t} is the observed value with observation error \varepsilon_{t}; p(f) is the prior probability distribution of f, that is, the assumption about the state of the unknown objective function; p\left(D_{1: t} \mid f\right) is the likelihood distribution of the observations, which, because of the observation error, is also said to be "noisy"; p\left(D_{1: t}\right) is the marginal likelihood or "evidence", obtained by marginalizing out f. Because the marginal likelihood involves products and integrals of probability density functions, a closed-form expression is usually hard to obtain, and it is typically determined by estimation methods such as least squares; p\left(f \mid D_{1: t}\right) is the posterior probability distribution of f, which describes the confidence in the unknown objective function after the prior is corrected by the observed data set D_{1: t}. Bayesian methods can then be divided by data type into discrete and continuous cases; the following introduces application methods based on Bayesian inference for each type of data.
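Over a discrete set of hypotheses, the theorem reduces to multiplying prior by likelihood and normalizing by the evidence. The following is a minimal numeric sketch; the probability values are hypothetical placeholders for illustration.

```python
# Bayes' theorem over a finite hypothesis set: posterior ∝ prior × likelihood,
# normalized by the evidence (marginal likelihood). Values are illustrative.

def posterior(priors, likelihoods):
    """Return the normalized posterior p(h|D) from p(h) and p(D|h)."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)  # marginal likelihood p(D)
    return [u / evidence for u in unnormalized]

# Two hypotheses with equal priors; the data are 4x more likely under h1.
print(posterior([0.5, 0.5], [0.8, 0.2]))  # → [0.8, 0.2]
```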

Discrete Space

Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem. It assumes that each attribute influences the result independently, so the computation of a joint probability density is converted into the computation of multiple one-dimensional probability densities, which reduces the computational cost.
With the likelihood estimated from frequency counts, the goal of Naive Bayes is to find the class with the largest posterior probability, that is, the most probable classification (maximum a posteriori, MAP):

P\left(c_{i} \mid a_{1}, a_{2}, a_{3}, a_{4}\right)=\alpha P\left(c_{i}\right) \prod_{j=1}^{N} P\left(a_{j} \mid c_{i}\right)

C_{\mathrm{MAP}}=\underset{c_{i} \in C}{\arg \max } P\left(c_{i} \mid a_{1}, a_{2}, a_{3}, a_{4}\right)

For text classification, many excellent neural network models are available, but they often require substantial time and resources to train, whereas the Naive Bayes model is simple yet achieves classification performance that basically meets the needs of many projects. It is commonly used in data classification, attack identification, and resource-allocation decisions.
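The MAP rule above can be sketched as a minimal frequency-count Naive Bayes text classifier; the tiny corpus below is invented for illustration, and add-one (Laplace) smoothing is used to avoid zero likelihoods for unseen words.

```python
import math
from collections import Counter, defaultdict

# A minimal Naive Bayes text classifier from frequency counts: estimate
# P(c) and P(a_j | c) by counting, then pick the class maximizing
# log P(c) + sum_j log P(a_j | c). Corpus and labels are illustrative.

def train(docs):
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total_docs)  # log prior
        # add-one smoothing: every word gets a pseudo-count of 1
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["cheap", "pills", "offer"], "spam"),
        (["meeting", "schedule", "notes"], "ham"),
        (["cheap", "offer", "now"], "spam"),
        (["project", "meeting", "agenda"], "ham")]
model = train(docs)
print(classify(["cheap", "offer"], *model))  # → spam
```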

Bayesian network inference algorithm

A Bayesian network, also known as a belief network, is a statistical reasoning method based on multivariate statistical analysis techniques. The probabilistic relationships between events are expressed through a directed graph whose nodes are connected by directed arrows. Each node represents a random variable, each variable is conditionally independent of its non-descendants given its parents, and an arrow represents a cause-effect relationship between variables: a change in the parent variable can cause a change in the child variable.

 \mathrm{P}(\mathrm{A}, \mathrm{B}, \mathrm{C}, \mathrm{D}, \mathrm{E})=P(E \mid C) P(D \mid C) P(C \mid A, B) P(A, B) \\ =P(E \mid C) P(D \mid C) P(C \mid A, B) P(A) P(B)

A Bayesian network encodes causal probabilities, and all prior probabilities are obtained from empirical statistics. Its prerequisites are structured data and a constructed network. It is commonly used in risk analysis and behavior analysis.
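The factorized joint above, P(A)P(B)P(C|A,B)P(D|C)P(E|C), can be evaluated directly once the conditional probability tables are given. The tables below are hypothetical values for binary variables, chosen only to illustrate the computation.

```python
from itertools import product

# Hypothetical conditional probability tables (CPTs) for binary variables,
# following the factorization P(A) P(B) P(C|A,B) P(D|C) P(E|C).
p_a = 0.3                                                   # P(A=1)
p_b = 0.6                                                   # P(B=1)
p_c_given_ab = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.4, (False, False): 0.1}    # P(C=1 | A, B)
p_d_given_c = {True: 0.8, False: 0.2}                       # P(D=1 | C)
p_e_given_c = {True: 0.7, False: 0.1}                       # P(E=1 | C)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    return (bernoulli(p_a, a) * bernoulli(p_b, b)
            * bernoulli(p_c_given_ab[(a, b)], c)
            * bernoulli(p_d_given_c[c], d)
            * bernoulli(p_e_given_c[c], e))

# Sanity check: a valid factorization sums to 1 over all 32 assignments.
total = sum(joint(*v) for v in product([True, False], repeat=5))
print(round(total, 10))  # → 1.0
```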

Continuous Space

Likelihood/posterior distributions can be obtained

When optimizing the composition of hazardous chemical reagents, the wrong combination of components may cause a devastating explosion; in aerospace, unsuitable component sizes and structural configurations may lead to unstable operation of a space shuttle or even serious accidents. Since evaluating such optimization objectives takes a great deal of time and cost, and may even endanger lives, it is usually desirable to obtain a satisfactory solution with only a small number of costly evaluations during optimization.

Bayesian optimization has two parts:

(1) Use a probabilistic model as a surrogate for the costly evaluation of the complex objective function. From the observed points D_{1: t}, the posterior distribution p\left(f \mid D_{1: t}\right) is obtained.

(2) Use the posterior information of the surrogate model p\left(f \mid D_{1: t}\right) to construct a strategy for selecting sample points, known as the infill strategy or acquisition function.

Probabilistic models fall into parametric and non-parametric models. Parametric models include the Beta-Bernoulli model and linear models (e.g., with radial basis functions); non-parametric models include Gaussian processes and deep neural networks.

 p\left(\boldsymbol{w} \mid D_{1: t}\right)=\prod_{i=1}^{K} \operatorname{Beta}\left(w_{i} \mid \alpha+n_{i, 1}, \beta+n_{i, 0}\right)
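The Beta-Bernoulli posterior above follows a simple conjugate update: starting from a prior Beta(α, β), observing n₁ successes and n₀ failures for an arm gives the posterior Beta(α + n₁, β + n₀). A minimal sketch, with illustrative numbers:

```python
# Conjugate Beta-Bernoulli update: Beta(alpha, beta) prior plus binary
# observations yields Beta(alpha + n1, beta + n0). Values are illustrative.

def beta_bernoulli_update(alpha, beta, observations):
    n1 = sum(observations)           # number of successes (1s)
    n0 = len(observations) - n1      # number of failures (0s)
    return alpha + n1, beta + n0

# Uniform prior Beta(1, 1), then observe 3 successes and 2 failures.
a, b = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1, 0])
print(a, b)          # → 4.0 3.0
print(a / (a + b))   # posterior mean, ≈ 0.571
```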

p\left(\boldsymbol{w}, \sigma \mid \mu_{0}, \boldsymbol{V}_{0}, \alpha_{0}, \beta_{0}\right)=\left|2 \pi \sigma^{2} V_{0}\right|^{-\frac{1}{2}} \times \frac{\beta_{0}^{\alpha_{0}}}{\Gamma\left(\alpha_{0}\right) \sigma^{2 \alpha_{0}+2}} \times \exp \left(-\frac{\left(\boldsymbol{w}-\mu_{0}\right)^{T} \boldsymbol{V}_{0}^{-1}\left(\boldsymbol{w}-\mu_{0}\right)+2 \beta_{0}}{2 \sigma^{2}}\right)

p(\boldsymbol{y} \mid \boldsymbol{f})=\mathcal{N}\left(\boldsymbol{f}, \sigma^{2} \boldsymbol{I}\right)

Infill strategies include the probability of improvement (PI), expected improvement (EI), upper confidence bound (UCB), and Thompson sampling. The combination of a Gaussian process surrogate with EI used in Bayesian optimization is known as EGO (Efficient Global Optimization).
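One EGO iteration can be sketched as follows: fit a Gaussian-process surrogate to the observed points, then choose the next sample by maximizing EI. The RBF kernel hyperparameters and toy objective below are illustrative choices, not tuned values.

```python
import math
import numpy as np

# One EGO step: GP regression with an RBF kernel (fixed, illustrative
# hyperparameters), then Expected Improvement for a 1-D minimization.

def rbf(X1, X2, length=0.5, var=1.0):
    d2 = np.subtract.outer(X1, X2) ** 2
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression equations via a Cholesky factorization.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs) - v.T @ v)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: E[max(y_best - f(x), 0)] under the GP posterior.
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (y_best - mu) * cdf + sigma * pdf

f = lambda x: np.sin(3 * x) + x ** 2 - 0.7 * x   # toy objective to minimize
X = np.array([-1.0, 0.0, 1.0, 2.0])              # observed points D_{1:t}
y = f(X)
Xs = np.linspace(-1.0, 2.0, 301)                 # candidate inputs
mu, sigma = gp_posterior(X, y, Xs)
ei = expected_improvement(mu, sigma, y.min())
x_next = Xs[np.argmax(ei)]                       # next point to evaluate
print(float(x_next))
```

EI balances exploitation (low predicted mean) against exploration (high predictive uncertainty), which is why it needs the full posterior rather than only a point estimate.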

Conditions for Bayesian optimization: the input range is known, the objective function can be fitted by a surrogate model, and the observed data are (relatively) few. Disadvantages: it depends strongly on the surrogate model, that is, on the prior and posterior distributions. Even if a well-fitted Gaussian process is eventually obtained, the Gaussian process model assumes the data are Gaussian-distributed and can only describe the distribution of a particular kind of output, while the task remains to find inputs within the input range. If the dimensionality is high and the amount of data is large, constructing the probabilistic model becomes difficult and inaccurate; at the same time, the selection of candidate points becomes less meaningful, and both the construction of the probabilistic model and the solution speed slow down. This leads to two directions for the development of Bayesian optimization: enhancing the probabilistic model (high dimensionality, multitasking, freeze-thaw) and enhancing the infill strategy (parallelization, constraint sensitivity, distance sensitivity).

Popular fields of application: recommendation systems, where companies such as Google and Microsoft recommend relevant news articles to subscribers based on the websites, videos, and music they subscribe to, and help design and improve products through metrics such as click-through rate; and biochemical material design.

Likelihood/posterior distributions are difficult to obtain

Approximate Bayesian computation (ABC)

When the likelihood distribution p\left(D_{1: t} \mid f\right) is unavailable or intractable (the black-box problem in engineering), the posterior distribution p\left(f \mid D_{1: t}\right) cannot be obtained, making it extremely difficult to describe the target distribution. ABC maintains the basic framework of Bayesian analysis and its probabilistic interpretation while removing the hard dependence on an exact analytical form of the data likelihood. Its distinguishing feature is that evaluation of the likelihood function is replaced by simulation. This method has clear advantages, especially for estimating complex models.

Approximate Bayesian computation has two approximation parts:

(1) Data approximation: the observed data \boldsymbol{y}_{n} are considered high-dimensional and complex, so low-dimensional sufficient summary statistics of the observed data are required.

(2) Simulation approximation: for the known observed data, the generated data must stay within a certain range of approximation: D\left(\eta(z), \eta\left(\boldsymbol{y}_{n}\right)\right) \leqslant \delta

The prior distribution and the data generator are chosen differently for different problems, with the goal of finding inputs consistent with the observed outputs. The requirements of ABC are: the observed data lie (mostly) within the solution distribution, not just anywhere in the input space; a reasonable data-generating model and prior distribution are available; and the true evaluation/statistic computation is not expensive. ABC is often used in parameter estimation, structural damage identification, and uncertainty-factor analysis. In the parametric analysis of a mathematical model, the goal is to make the model's output as consistent as possible with the observed output.
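The two approximation steps above can be sketched as rejection ABC for estimating the success probability θ of a Bernoulli process: sample θ from the prior, simulate data, and keep θ only if the summary statistic of the simulated data is within δ of the observed one. The prior, summary statistic, and tolerance below are illustrative choices.

```python
import random

# Rejection ABC: likelihood-free inference of a Bernoulli success rate.
# Summary statistic eta = sample mean; accept when D(eta(z), eta(y_n)) <= delta.
random.seed(0)

observed = [1] * 70 + [0] * 30           # observed data y_n (70% successes)
eta_obs = sum(observed) / len(observed)  # summary statistic eta(y_n)

def simulate(theta, n):
    """Data generator: n Bernoulli(theta) draws."""
    return [1 if random.random() < theta else 0 for _ in range(n)]

accepted = []
for _ in range(20000):
    theta = random.random()              # prior: Uniform(0, 1)
    z = simulate(theta, len(observed))
    eta_z = sum(z) / len(z)
    if abs(eta_z - eta_obs) <= 0.02:     # tolerance delta = 0.02
        accepted.append(theta)

# Mean of the accepted thetas approximates the posterior mean (near 0.7).
print(sum(accepted) / len(accepted))
```

Shrinking δ makes the accepted sample approximate the true posterior more closely, at the cost of a lower acceptance rate, which is the central trade-off of ABC.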


Origin blog.csdn.net/qq_44785318/article/details/126544834