Maximum A Posteriori (MAP) Estimation for Bayesian Inference

This article works through the mathematics of the Bayesian posterior distribution in detail, applies it to a binary classification problem, and shares my understanding of Bayesian inference.

1. Binary classification problem

We are given a dataset of \(N\) samples, denoted \(X\). Each sample \(x_n\) has two attributes and ultimately belongs to one of two classes:


\(t_n\in\left\{0,1\right\}\)



\(\mathbf{x_n}=\begin{pmatrix}x_{n1} \\ x_{n2} \end{pmatrix}\) , assuming model parameters \(w=\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}\)



\(\mathbf{X}=\begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T\end{bmatrix}\)


The sample set can be drawn as a scatter plot of the two attributes, with points colored by class (the original post shows such a figure).
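To make the setup concrete, here is a minimal NumPy sketch that builds a toy sample matrix \(\mathbf{X}\) and label vector \(t\) from two synthetic clusters (the cluster centers, sizes, and variable names are assumptions for illustration, not data from the original post):

```python
import numpy as np

# Hypothetical toy data: N samples with two attributes each, labelled 0 or 1.
rng = np.random.default_rng(42)
N = 100
X = np.vstack([
    rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(N // 2, 2)),  # cluster for class 0
    rng.normal(loc=[+1.0, +1.0], scale=0.8, size=(N // 2, 2)),  # cluster for class 1
])
t = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])          # labels t_n in {0, 1}
```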

According to the Bayesian formula:

\[p(w|t,X)=\frac {p(t|X,w)p(w)} {p(t|X)} \] (Equation 1)

\(p(w|t,X)\) tells us: given the training sample set \(X\) and the class labels \(t\) of those samples, we want to solve for the model parameters \(w\). In other words, \(w\) is unknown, and Bayes' formula lets us obtain the distribution \(p(w|t,X)\) from the samples, from which the model parameters \(w\) are then estimated.

Once we have obtained the optimal model parameters \(w^*\), for a new sample \(\mathbf{x_{new}}\) we compute \(P(T_{new}=1|x_{new}, w^*)\), the probability that \(\mathbf{x_{new}}\) is classified as 1; this is the model's prediction.

There are three parts on the right side of Equation 1: \(p(t|X,w)\) is called the likelihood, and \(p(w)\) is called the prior; these two are generally easy to specify. The hardest part is the denominator \(p(t|X)\), called the marginal likelihood. Fortunately, the marginal likelihood does not depend on the model parameters \(w\), so the denominator can be treated as a constant with respect to \(w\).

Mathematically, if the prior and the likelihood are conjugate, the posterior \(p(w|t,X)\) follows the same family of distributions as the prior. For example, if the prior follows a Beta distribution and the likelihood is binomial, the two are conjugate, so the posterior also follows a Beta distribution.
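As a concrete, standard illustration of that Beta-binomial conjugacy (not tied to the classification model of this article): with a \(\mathrm{Beta}(\alpha,\beta)\) prior on a success probability \(\theta\) and \(k\) successes observed in \(n\) trials,

\[p(\theta|k)\propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{binomial likelihood}}\;\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{Beta prior}}\;\Longrightarrow\; \theta|k\sim \mathrm{Beta}(\alpha+k,\ \beta+n-k)\]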

Therefore, when applying Bayes' formula, if the chosen prior is conjugate to the likelihood, the posterior can be computed easily and exactly for the model parameters \(w^*\), that is, the posterior can be computed analytically. In practice, however, the two are often not conjugate, which leads to three commonly used approximation methods:

  • Point estimation (Point Estimate--MAP method)
  • Laplace approximation
  • Sampling--Metropolis-Hastings

This article only introduces the point estimation method.

Returning to Equation 1, look first at the prior \(p(w)\). The prior plays the role of existing experience brought into a decision. Based on experience, we assume the prior follows a Gaussian distribution: \[p(w)=N(0,\sigma^2 I)\] where \(w\) is a vector and \(I\) is the identity matrix.
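For later use, writing out the log-density of this \(D\)-dimensional Gaussian prior (a standard expansion, with the zero mean already substituted) gives:

\[\log p(w|\sigma^2)=-\frac{D}{2}\log(2\pi)-\frac{D}{2}\log\sigma^2-\frac{1}{2\sigma^2}w^Tw\]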

Next is the likelihood \(p(t|X,w)\). Assuming that, given the model parameters \(w\) and the sample set \(X\), the classification results of the individual samples are independent of each other, we have:


\[p(t|X,w)=\prod_{n=1}^N p(t_n|x_n, w)\] (Equation 2)


For example, when the model parameters \(w\) are known, \(w\) may predict \(x_1\) as the positive class, \(x_2\) as the negative class, ..., and \(x_n\) as the positive class. The prediction for each sample is independent of the others: the prediction of \(w\) on \(x_1\) does not affect the prediction on \(x_2\).

Since this is a binary classification problem, \(t_n\in\left\{0,1\right\}\), Equation 2 can be further written as \[p(t|X,w)=\prod_{n=1}^N p(T_n=t_n|x_n, w)\] where \(T_n\) is the random variable representing the class that sample \(x_n\) is assigned to, and \(t_n\) is the value taken by that random variable. For example, \(t_n=0\) means that sample \(x_n\) is classified as the positive class, and \(t_n=1\) means it is classified as the negative class.

2. The sigmoid function

Since the probability of a random variable taking a certain value lies in \([0,1]\), and we need to compute \(p(t|X,w)\), our goal is to find a function \(f(\mathbf{x_n};w)\) that produces a probability value. To simplify the discussion, we choose the sigmoid function \(\mathrm{sigmoid}(w^T x)\), so:

\[P(T_n=1|x_n,w)=\frac{1}{1+\exp(-w^T x_n)}\]

So:

\[P(T_n=0|x_n,w)=1-P(T_n=1|x_n,w)=\frac{\exp(-w^T x_n)}{1+\exp(-w^T x_n)}\]

Combine the above two formulas into one:

\[P(T_n=t_n|x_n,w)=P(T_n=1|x_n,w)^{t_n}P(T_n=0|x_n,w)^{1-t_n}\]

For N samples, Equation 2 can be written as:

\[p(t|X,w)=\prod_{n=1}^N \left(\frac{1}{1+\exp(-w^T x_n)}\right)^{t_n}\left(\frac{\exp(-w^T x_n)}{1+\exp(-w^T x_n)}\right)^{1-t_n}\] (Equation 3)
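As a quick sketch of Equation 3 in code, here is the sigmoid and the corresponding log-likelihood in NumPy (the names `sigmoid` and `log_likelihood`, and the convention that `X` stores one sample per row, are my own assumptions, not from the original post):

```python
import numpy as np

def sigmoid(z):
    # P(T = 1 | x, w) = 1 / (1 + exp(-w^T x))
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, t):
    # Log of Equation 3: sum_n [ t_n*log(P_n) + (1 - t_n)*log(1 - P_n) ]
    P = sigmoid(X @ w)
    return np.sum(t * np.log(P) + (1 - t) * np.log(1 - P))
```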

So far, the prior follows a Gaussian distribution and the likelihood is given by Equation 3, so the posterior \(p(w|X,t,\sigma^2)\) can be solved for.

Once the posterior is obtained, the following formula gives the probability that a new sample is classified into the negative class:


\[P(T_{new}=1|x_{new}, X, t)=E_{p(w|X,t,\sigma^2)}\left[\frac{1}{1+\exp(-w^T x_{new})}\right]\]


To explain this formula: the expression for the posterior \(p(w|X,t,\sigma^2)\) has now been obtained, and \(f(x_{new};w)\) is a function of \(w\); we compute the expected value of this function under the posterior, and this expectation is the predicted probability that the new sample \(x_{new}\) belongs to class 1.

Well, the next step is to solve the posterior probability.

3. Solve the posterior probability

As mentioned earlier, the prior follows the Gaussian distribution \(N(0,\sigma^2 I)\), the likelihood is given by Equation 3, and the denominator (the marginal likelihood) does not depend on \(w\). So define a function \(g(w;X,t,\sigma^2)=p(t|X,w)p(w|\sigma^2)\); the function \(g\) is clearly proportional to the posterior \(p(w|X,t,\sigma^2)\). Therefore, finding the maximizer of \(g\) is equivalent to finding the optimal parameter \(w^*\) of the posterior.

A question arises here: why can the function \(g\) be maximized at all? \(g\) is a function of \(w\); at which value of \(w\) does \(g\) reach its maximum?

The Newton-Raphson method is needed here. Newton's method can be used to find the zeros of a function via the following iteration:

\[x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}\]

Iterate continuously, and finally find the point where the function value is 0.
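As a side illustration of this update rule (a generic one-dimensional root-finding sketch, not code from the original post):

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    # Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until the step is (nearly) zero.
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive zero of f(x) = x^2 - 2 is sqrt(2) ≈ 1.4142
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
```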

In mathematics, the following criterion is used to decide whether a function attains an extremum at a point:

Take a univariate differentiable function \(h(x)\) as an example. A point where the derivative of \(h(x)\) is 0 is an extreme point, but is this extremum a minimum or a maximum? This can be decided by the second derivative of \(h(x)\): if \(h'(x_n)=0\) and \(h''(x_n)<0\), then \(h(x)\) attains a maximum at \(x_n\).

Therefore, if we can show that the second derivative of \(g(w;X,t,\sigma^2)\) with respect to \(w\) is negative, then we can use Newton's method to find the zero of the first derivative of \(g(w;X,t,\sigma^2)\) with respect to \(w\), that is, the value \(w_0\) at which \(g'(w;X,t,\sigma^2)=0\); this \(w_0\) is the optimal solution \(w^*\).

So let us prove that the second derivative of \(g(w;X,t,\sigma^2)\) with respect to \(w\) is negative. Since \(w\) is a vector, in the multivariate case this amounts to proving that the Hessian matrix of \(g(w;X,t,\sigma^2)\) with respect to \(w\) is negative definite.

Take the logarithm of the function \(g\), so the task becomes maximizing \(\log g(w;X,t,\sigma^2)\):

\[\log g(w;X,t,\sigma^2)=\log\big(p(t|X,w)\,p(w|\sigma^2)\big)\]

\[=\log p(t|X,w)+\log p(w|\sigma^2)\]


To simplify the formulas, adopt the following convention for the sigmoid output of sample \(n\):
\[P_n := P(T_n=1|x_n,w)=\frac{1}{1+\exp(-w^T x_n)}\]

Suppose \(w\) is a \(D\)-dimensional vector. Substituting the Gaussian prior and Equation 3 gives:
\[\log g(w;X,t,\sigma^2)= -\frac{D}{2}\log(2\pi)-\frac{D}{2}\log\sigma^2-\frac{1}{2\sigma^2}w^Tw+\sum_{n=1}^N\Big[t_n\log P_n+(1-t_n)\log(1-P_n)\Big]\]

The first three terms come from expanding the Gaussian prior. Differentiating them uses the vector derivative \[\frac{\partial w^Tw}{\partial w}=2w\] so the derivative of \(-\frac{1}{2\sigma^2}w^Tw\) with respect to \(w\) is \(-\frac{1}{\sigma^2}w\).

By the chain rule, using \(\frac{\partial P_n}{\partial w}=P_n(1-P_n)\,x_n\),

we get:
\[\frac{\partial \log P_n}{\partial w}=(1-P_n)\,x_n,\qquad \frac{\partial \log(1-P_n)}{\partial w}=-P_n\,x_n\]

So the first-order partial derivative (gradient) of \(\log g(w;X,t,\sigma^2)\) with respect to \(w\) is:
\[\frac{\partial \log g}{\partial w}=-\frac{1}{\sigma^2}w+\sum_{n=1}^N (t_n-P_n)\,x_n\]

The second-order partial derivatives (the Hessian) are:
\[\frac{\partial^2 \log g}{\partial w\,\partial w^T}=-\frac{1}{\sigma^2}I-\sum_{n=1}^N P_n(1-P_n)\,x_n x_n^T\]

Here \(I\) is the identity matrix and \(0\le P_n\le 1\) is a probability value, so each summed term is negative semi-definite and the \(-\frac{1}{\sigma^2}I\) term is strictly negative definite; the resulting Hessian matrix is therefore negative definite.

The proof is complete.

At this point, you can safely use Newton's method to iterate to find the value of the parameter \(w\) at which \(g(w;X,t,\sigma^2)\) attains its maximum, and that value is \(w^*\).
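Putting the gradient and the Hessian above together, the Newton update can be coded directly. The following is a minimal NumPy sketch of the MAP fit (the function names, the prior variance value, and the synthetic data are my own assumptions for illustration, not part of the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(X, t, sigma2=1.0, n_iter=20):
    """Newton-Raphson iteration for w* = argmax log g(w; X, t, sigma^2)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        P = sigmoid(X @ w)                        # P_n for every sample
        grad = X.T @ (t - P) - w / sigma2         # first-order partials of log g
        hess = -(X * (P * (1 - P))[:, None]).T @ X - np.eye(D) / sigma2  # Hessian
        w = w - np.linalg.solve(hess, grad)       # Newton update on the gradient's zero
    return w

# Tiny synthetic usage (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100) > 0).astype(float)
w_star = fit_map(X, t, sigma2=10.0)
```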

Now that \(w^*\) has been computed, the following formula gives the probability that the new sample \(x_{new}\) is predicted to be the negative class (i.e. \(T_{new}\) takes the value 1):

\[P(T_{new}=1|x_{new},w^*)=\frac{1}{1+\exp(-w^{*T} x_{new})}\]
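In code, prediction with the MAP estimate is just a plug-in evaluation of the sigmoid at \(w^*\); a small sketch (the function name is hypothetical):

```python
import numpy as np

def predict_proba(w_star, x_new):
    # P(T_new = 1 | x_new, w*) = sigmoid(w*^T x_new)
    return 1.0 / (1.0 + np.exp(-(x_new @ w_star)))
```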

4. Decision boundary

Since it is a binary classification problem, let us see what the decision boundary looks like when classifying with the Bayesian posterior. Since the output is a probability value, it is clear that when \(P(T_{new}=1|x_{new},w^*)>0.5\) the sample is predicted to be the negative class, and when \(P(T_{new}=1|x_{new},w^*)<0.5\) it is predicted to be the positive class. What about when it equals 0.5?

From \[P(T_{new}=1|x_{new},w^*)=\frac{1}{1+\exp(-w^{*T} x_{new})}=0.5\] we infer:

\[w^{*T}x=w_1^*x_1+w_2^*x_2=0\]

\[x_2=-\frac{w_1^*}{w_2^*}\,x_1\]

That is, the two attributes \(x_1\) and \(x_2\) of a sample \(\mathbf{x}=\begin{pmatrix}x_{1} \\ x_{2} \end{pmatrix}\) are linearly related on the boundary, and this straight line is the decision boundary.
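For illustration, the slope of this boundary line can be read directly off a fitted \(w^*\) (the numeric values below are assumed, standing in for the output of the Newton sketch above):

```python
import numpy as np

w_star = np.array([1.4, -1.9])   # assumed MAP estimate, e.g. from the Newton sketch
slope = -w_star[0] / w_star[1]   # boundary: x_2 = -(w_1*/w_2*) * x_1, requires w_2* != 0
x1 = np.linspace(-3, 3, 50)
x2 = slope * x1                  # points on the decision boundary
```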

Summary

The Bayesian method is a commonly used method in machine learning. There are three parts in Bayes' formula: the prior distribution, the likelihood, and the marginal likelihood (the denominator of Bayes' formula). After obtaining these three parts, the posterior distribution is obtained; then, for a new sample \(x_{new}\), the expected value of the prediction under the posterior distribution is computed, which is the prediction of the Bayesian model.

Since the calculation of the posterior distribution depends on the prior and the likelihood, when the two are conjugate the posterior follows the same family of distributions as the prior, so the posterior can be computed analytically. When the two are not conjugate, however, an approximation of the posterior is computed instead. There are three common methods for this: the point estimate method (MAP), the Laplace approximation, and Metropolis-Hastings sampling. This article mainly introduced the first: the point estimate (maximum a posteriori).

Where is the "maximum" in maximum a posteriori? It lies in maximizing the likelihood times the prior, i.e. the unnormalized posterior: the negative definiteness of the Hessian matrix proves that \(g(w;X,t,\sigma^2)\) has a maximum, and Newton's method is then used to iterate toward that maximum.

References

Newton's method: https://zh.wikipedia.org/wiki/Newton's method
Blog Park Markdown formula garbled: http://www.cnblogs.com/cmt/p/markdown-latex.html

Original text: http://www.cnblogs.com/hapjin/p/8834794.html
