Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Probability (MAP)

This article is reproduced from: Leavingseason, http://www.cnblogs.com/sylvanas2012/p/5058065.html

1) Maximum Likelihood Estimation (MLE)

Given a set of data, suppose we know it was drawn at random from some distribution, but we do not know the specific parameters of that distribution; in other words, "the model is known, the parameters are unknown." For example, we may know the distribution is normal but not know its mean and variance, or know it is a binomial distribution but not know its success probability. Maximum Likelihood Estimation (MLE) can be used to estimate the parameters of such a model. The goal of MLE is to find the set of parameters $\theta$ that maximizes the probability that the model produces the observed data $X$:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(X \mid \theta)$$

where $P(X \mid \theta)$ is the likelihood function, which gives the probability of the observed data appearing under the parameters $\theta$. Assuming each observation is independent, we have

$$P(X \mid \theta) = \prod_{i} P(x_i \mid \theta)$$

For convenience of differentiation, the logarithm of the objective is usually taken, so maximizing the likelihood function is equivalent to maximizing the log-likelihood function:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \sum_{i} \log P(x_i \mid \theta)$$
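As a quick numerical sanity check, here is a minimal sketch (not from the original article; the normal model with known variance, the data values, and the grid are all illustrative assumptions) showing that the likelihood and the log-likelihood are maximized at the same parameter value, which also matches the closed-form MLE for the mean of a normal distribution, i.e. the sample mean.

```python
# Sketch: the likelihood and the log-likelihood pick the same maximizer.
# Model: normal with known variance 1 and unknown mean mu (illustrative).
import numpy as np
from scipy.stats import norm

X = np.array([1.2, 0.7, 2.1, 1.5, 0.9])      # hypothetical observations
grid = np.linspace(-2.0, 4.0, 601)           # candidate values of mu

likelihood     = np.array([np.prod(norm.pdf(X, loc=mu, scale=1.0)) for mu in grid])
log_likelihood = np.array([np.sum(norm.logpdf(X, loc=mu, scale=1.0)) for mu in grid])

# Both criteria peak at the same mu (up to grid resolution),
# which equals the sample mean, the closed-form MLE.
print(grid[np.argmax(likelihood)], grid[np.argmax(log_likelihood)], X.mean())
```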

Take the simple example of tossing a coin. Suppose we have a coin whose heads and tails are not symmetric; heads is recorded as H and tails as T. Tossing it 10 times yields 2 heads and 8 tails, for example the sequence H H T T T T T T T T.

What is the probability that the coin will land heads?

Obviously this probability is 0.2. Now let us solve it with the idea of MLE. Each coin toss follows a Bernoulli distribution (a binomial with a single trial); let the probability of heads be $\theta$. Then the likelihood of a single toss is:

$$P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$$

where $x = 1$ means heads and $x = 0$ means tails. The log-likelihood of the 10 tosses is then:

$$\log P(X \mid \theta) = \sum_{i=1}^{10} \log P(x_i \mid \theta) = 2\log\theta + 8\log(1-\theta)$$

Taking the derivative with respect to $\theta$:

$$\frac{\partial \log P(X \mid \theta)}{\partial \theta} = \frac{2}{\theta} - \frac{8}{1-\theta}$$

Setting the derivative to 0, it is easy to obtain:

$$\hat{\theta} = \frac{2}{10}$$

That is, $\hat{\theta} = 0.2$, which matches the intuitive answer.
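Here is a minimal sketch of this calculation, assuming one concrete toss sequence with 2 heads and 8 tails (the exact order does not matter): a grid search over the Bernoulli log-likelihood agrees with the closed-form answer $n_H / n$.

```python
# Sketch: MLE for the coin-toss example via grid search over theta.
import numpy as np

tosses = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # 1 = heads (H), 0 = tails (T)
n_heads, n = tosses.sum(), len(tosses)

theta_grid = np.linspace(0.001, 0.999, 999)
log_lik = n_heads * np.log(theta_grid) + (n - n_heads) * np.log(1 - theta_grid)

print(theta_grid[np.argmax(log_lik)])   # ~0.2 from the grid search
print(n_heads / n)                      # 0.2, the closed-form MLE
```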

2) Maximum A Posteriori Probability (MAP)

The MLE above seeks a set of parameters that maximizes the likelihood function, i.e., $\hat{\theta}_{MLE} = \arg\max_{\theta} P(X \mid \theta)$. Now the problem gets a little more complicated: what if the parameter itself has a prior probability? In the coin-toss example above, suppose experience tells us that coins are usually symmetric, that is, $\theta = 0.5$ is the most probable value and $\theta = 0.2$ is relatively unlikely; how should the parameter be estimated then? This is what MAP takes into account. MAP optimizes the posterior probability, i.e., it maximizes the probability of $\theta$ given the observed data:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid X)$$

Expanding the formula above with Bayes' rule:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \frac{P(X \mid \theta)\,P(\theta)}{P(X)} = \arg\max_{\theta} P(X \mid \theta)\,P(\theta)$$

where $P(X)$ can be dropped because it does not depend on $\theta$.

We can see that the first term is the likelihood function and the second term is the prior knowledge about the parameters. Taking the log gives:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \left( \sum_{i} \log P(x_i \mid \theta) + \log P(\theta) \right)$$

Going back to the coin-toss example, suppose the parameter $\theta$ has a prior that follows a Beta distribution, namely:

$$P(\theta) = \frac{1}{B(\alpha, \beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$

and each coin toss still follows the Bernoulli distribution:

$$P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$$

Then the derivative of the objective function with respect to $\theta$ splits into two terms:

$$\frac{\partial}{\partial\theta}\left( \sum_{i} \log P(x_i \mid \theta) + \log P(\theta) \right) = \frac{\partial \sum_{i} \log P(x_i \mid \theta)}{\partial\theta} + \frac{\partial \log P(\theta)}{\partial\theta}$$

The first term was already computed in the MLE section above; the second term is:

$$\frac{\partial \log P(\theta)}{\partial\theta} = \frac{\alpha-1}{\theta} - \frac{\beta-1}{1-\theta}$$

Setting the derivative to 0 and solving gives:

$$\hat{\theta} = \frac{n_H + \alpha - 1}{n + \alpha + \beta - 2}$$

where $n_H$ denotes the number of heads and $n$ the total number of tosses. We can see here that the difference between MLE and MAP is that the MAP result additionally involves the parameters of the prior distribution.
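Here is a minimal sketch of the MAP estimate, assuming the same 2-heads / 8-tails data and a hypothetical symmetric Beta(5, 5) prior: a grid search over the log-posterior agrees with the closed-form formula above, and the estimate is pulled from the MLE value 0.2 toward the prior mode 0.5.

```python
# Sketch: MAP estimate for the coin with a Beta(alpha, beta) prior on theta.
import numpy as np

n_heads, n = 2, 10
alpha, beta = 5.0, 5.0            # hypothetical prior, symmetric around 0.5

theta_grid = np.linspace(0.001, 0.999, 999)
log_posterior = (n_heads * np.log(theta_grid) + (n - n_heads) * np.log(1 - theta_grid)
                 + (alpha - 1) * np.log(theta_grid) + (beta - 1) * np.log(1 - theta_grid))

theta_map_closed = (n_heads + alpha - 1) / (n + alpha + beta - 2)
print(theta_grid[np.argmax(log_posterior)])   # grid estimate
print(theta_map_closed)                       # 6/18 = 0.333..., pulled toward 0.5
```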

 

Supplementary knowledge: Beta distribution

The Beta distribution is a common prior distribution. Its shape is controlled by two parameters $\alpha$ and $\beta$, and its support is $[0, 1]$:

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$$

For $\alpha, \beta > 1$, the Beta distribution attains its maximum (its mode) at:

$$x = \frac{\alpha - 1}{\alpha + \beta - 2}$$

So in the coin toss, if our prior knowledge is that the coin is symmetric, we set $\alpha = \beta$. But clearly, even when the two are equal, their actual values have a large influence on the final result: the larger both values are, the less likely the estimate is to deviate from symmetry.
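A minimal sketch of this effect (the specific values of $\alpha = \beta$ are just examples): the mode stays at 0.5 whenever $\alpha = \beta > 1$, while the standard deviation of the prior shrinks as both parameters grow, so larger values encode a stronger belief in symmetry.

```python
# Sketch: equal alpha and beta keep the Beta prior centered at 0.5,
# and larger values make it more concentrated there.
from scipy.stats import beta as beta_dist

for a in (2, 5, 20):                       # alpha = beta = a
    mode = (a - 1) / (2 * a - 2)           # (alpha-1)/(alpha+beta-2) = 0.5 here
    spread = beta_dist.std(a, a)           # standard deviation shrinks as a grows
    print(f"alpha=beta={a}: mode={mode:.2f}, std={spread:.3f}")
```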
