Parameter Estimation Methods in Machine Learning

Original: https://blog.csdn.net/yt71656/article/details/42585873

 

A few days ago in our machine learning class, the teacher covered three methods of parameter estimation: MLE, MAP, and Bayesian estimation. After class I looked up some related material, as well as the paper on LDA that the teacher recommended, "Parameter estimation for text analysis". This post describes the three parameter estimation methods used in that text-analysis paper - maximum likelihood estimation (MLE), maximum a posteriori estimation (MAP), and Bayesian estimation - and the differences between them.

1. Maximum likelihood estimation (MLE)

First, recall Bayes' formula:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$

This formula is also called the inverse probability formula: it expresses the posterior probability in terms of the likelihood function and the prior probability, i.e.

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

Maximum likelihood estimation takes the parameter value at which the likelihood function attains its maximum as the parameter estimate. The likelihood function can be written as

$$L(\theta \mid X) = p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$

Because the likelihood function is a product, it is usually simpler to take its logarithm and work with the log-likelihood instead. The maximum likelihood estimation problem can then be written as

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta \mid X) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

This is an optimization problem over the parameters. It is usually solved by setting the derivative with respect to the parameters to zero and finding the extreme point; the parameter value at which the (log-)likelihood attains its maximum is the estimate of the model parameters.
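As a small illustration (my own example, not from the original post), the following Python sketch maximizes a log-likelihood numerically for an assumed exponential model and checks it against the closed-form solution obtained by setting the derivative to zero:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data, assumed to follow an exponential distribution with rate lam
x = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])

def neg_log_likelihood(lam):
    # Negative exponential log-likelihood: -(n*log(lam) - lam*sum(x))
    return -(len(x) * np.log(lam) - lam * x.sum())

# Maximize the log-likelihood by minimizing its negative over a bounded interval
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(f"Numerical MLE of lam:     {result.x:.3f}")
print(f"Closed-form MLE n/sum(x): {len(x) / x.sum():.3f}")  # the two should agree
```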

Take the coin toss, a Bernoulli experiment, as an example. The outcomes of N trials follow a binomial distribution with parameter p, the probability that a single trial comes up heads. To estimate p by maximum likelihood, the likelihood function can be written as

$$P(X \mid p) = \prod_{i=1}^{N} p(x_i \mid p) = p^{\,n_1} (1-p)^{\,n_0}$$

where x_i denotes the outcome of the i-th trial, n_1 is the number of heads, and n_0 is the number of tails. Finding the extremum of the likelihood function by setting its derivative to zero gives

$$\frac{\partial P(X \mid p)}{\partial p} = n_1\, p^{\,n_1 - 1}(1-p)^{\,n_0} - n_0\, p^{\,n_1}(1-p)^{\,n_0 - 1} = 0$$

so the maximum likelihood estimate of the parameter p is

$$\hat{p}_{\mathrm{MLE}} = \frac{n_1}{n_1 + n_0} = \frac{n_1}{N}$$

That is, the estimated binomial probability p of the event equals the relative frequency of the event over the N independent, repeated trials.

If we run the experiment 20 times and observe 12 heads and 8 tails, the maximum likelihood estimate of p is 12/20 = 0.6.
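As a quick check of this result, here is a minimal Python sketch (the data encoding and variable names are my own, not from the original post):

```python
import numpy as np

# 20 Bernoulli trials: 12 heads (1) and 8 tails (0)
tosses = np.array([1] * 12 + [0] * 8)

# For a Bernoulli/binomial parameter, the MLE is simply the sample frequency
p_mle = tosses.mean()
print(f"MLE estimate of p: {p_mle:.3f}")  # 0.600
```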

 

2. Maximum a posteriori estimation (MAP)

Maximum a posteriori estimation is similar to maximum likelihood estimation, except that a prior over the estimated parameters is added to the objective. In other words, we no longer maximize the likelihood function alone, but the entire posterior probability computed via Bayes' formula, that is,

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \arg\max_{\theta}\, p(X \mid \theta)\, p(\theta)$$

Note that p(X) does not depend on the parameters, so maximizing the posterior is equivalent to maximizing the numerator. Compared with maximum likelihood estimation, we now multiply in one more term: a prior probability distribution. In practice, this prior can describe regularities that are already known or generally accepted. For example, in the coin-toss experiment, the probability that a single toss comes up heads should itself follow a probability distribution that attains its maximum at 0.5; this distribution is the prior distribution. The parameters of the prior distribution are called hyperparameters, i.e.

$$p(\theta) = p(\theta \mid \alpha)$$

By the same reasoning, when the posterior above attains its maximum, we obtain the MAP estimate of the parameters. Given an observed sample data set, the probability of a new observation x̃ is then

$$p(\tilde{x} \mid X) = \int_{\theta} p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta \approx p(\tilde{x} \mid \hat{\theta}_{\mathrm{MAP}})$$

We again use the coin toss as an example. Since we expect the prior distribution to attain its maximum at 0.5, we can use a Beta distribution, i.e.

$$p(p \mid \alpha, \beta) = \frac{p^{\,\alpha - 1}(1-p)^{\,\beta - 1}}{B(\alpha, \beta)} = \mathrm{Beta}(p \mid \alpha, \beta)$$

where B(α, β) is the Beta function, which expands as

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

and when x is a positive integer,

$$\Gamma(x) = (x - 1)!$$

The random variable of a Beta distribution ranges over [0, 1], so it is well suited to generating normalized probability values. The figure below shows the probability density function of the Beta distribution for different parameter settings.
[Figure: probability density functions of the Beta distribution for different values of α and β]
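A figure of this kind can be reproduced with a short script; the (α, β) pairs below are illustrative choices of mine, not necessarily the ones in the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 200)
# A few (alpha, beta) settings chosen for illustration
for a, b in [(0.5, 0.5), (1, 1), (2, 2), (5, 5), (2, 5)]:
    plt.plot(x, beta.pdf(x, a, b), label=f"Beta({a}, {b})")
plt.xlabel("p")
plt.ylabel("density")
plt.legend()
plt.show()
```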
We take a prior of this form that attains its maximum at 0.5. Now we find the extreme point of the MAP objective; taking the derivative with respect to p as before, we have

$$\frac{\partial}{\partial p}\Big[P(X \mid p)\, p(p \mid \alpha, \beta)\Big] \propto \frac{\partial}{\partial p}\Big[p^{\,n_1 + \alpha - 1}(1-p)^{\,n_0 + \beta - 1}\Big] = 0$$

which yields the maximum a posteriori estimate of the parameter p:

$$\hat{p}_{\mathrm{MAP}} = \frac{n_1 + \alpha - 1}{n_1 + n_0 + \alpha + \beta - 2}$$

Comparing this with the result of maximum likelihood estimation, we find extra pseudo-counts of this kind in the estimate; this is the prior at work. Moreover, the larger the hyperparameters, the more observations are needed to shift the belief conveyed by the prior; correspondingly, the Beta distribution becomes more concentrated, squeezed tightly around its maximum.

If we run the experiment 20 times and observe 12 heads and 8 tails, then with hyperparameters α = β = 5 the MAP estimate of p is (12 + 5 − 1)/(20 + 5 + 5 − 2) = 16/28 ≈ 0.571, which is smaller than the maximum likelihood estimate of 0.6. This shows the influence of the prior belief that "a coin is generally fair on both sides" on the parameter estimate.
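In code, this MAP estimate can be checked with a few lines (a sketch assuming the Beta(5, 5) prior used above):

```python
# MAP estimate for the coin with an assumed Beta(5, 5) prior
n_heads, n_tails = 12, 8
alpha, beta_param = 5, 5

p_map = (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta_param - 2)
print(f"MAP estimate of p: {p_map:.3f}")  # 16/28 = 0.571
```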

 

3. Bayesian estimation

Bayesian estimation is a further extension of MAP. Instead of directly estimating a single value for the parameters, it lets the parameters follow a probability distribution. Recall Bayes' formula:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$

Now we are no longer asked to maximize the posterior, so the probability of the observed evidence must itself be computed; by the law of total probability it expands as

$$p(X) = \int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta$$

When new data are observed, the posterior probability can be updated accordingly. However, computing this total probability (the evidence) is usually the tricky part of Bayesian estimation.

So how does Bayesian estimation make predictions? If we want the probability of a new observation x̃, it can be computed by

$$p(\tilde{x} \mid X) = \int_{\theta} p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta$$

Note that the second factor inside this integral, p(θ | X), is no longer collapsed to a single point estimate as it effectively is under MLE and MAP; this is the big difference between Bayesian estimation and those two methods.
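Before specializing to the coin example, here is a minimal numerical sketch of this predictive integral (my own illustration, assuming a Bernoulli likelihood and a discretized uniform prior over the parameter):

```python
import numpy as np

# Discretized parameter grid and uniform prior (illustrative assumptions)
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / len(theta)

# Illustrative observations: 1 = success, 0 = failure
data = np.array([1, 0, 1, 1, 0, 1])
likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

# Posterior p(theta | X) on the grid, via Bayes' formula
posterior = likelihood * prior
posterior /= posterior.sum()

# Predictive probability of a new success: sum over theta of p(1 | theta) * p(theta | X)
p_new_success = np.sum(theta * posterior)
print(f"Posterior predictive P(x_new = 1): {p_new_success:.3f}")
```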

We continue with the Bernoulli coin-toss example. As in the MAP case, we assume a Beta prior. Under Bayesian estimation, however, we no longer approximate the parameter by the posterior maximum; instead we work with the full posterior distribution of the parameter p (and, when a point summary is needed, its expectation). Applying Bayes' formula, we have

$$p(p \mid X, \alpha, \beta) = \frac{p(X \mid p)\, p(p \mid \alpha, \beta)}{\int_0^1 p(X \mid p)\, p(p \mid \alpha, \beta)\, dp} = \frac{p^{\,n_1 + \alpha - 1}(1-p)^{\,n_0 + \beta - 1}}{B(n_1 + \alpha,\; n_0 + \beta)} = \mathrm{Beta}(p \mid n_1 + \alpha,\; n_0 + \beta)$$

Note the use of the formula

$$\Delta(\vec{\alpha}) = \int \prod_{k=1}^{T} p_k^{\,\alpha_k - 1}\, d\vec{p} = \frac{\prod_{k=1}^{T} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{T} \alpha_k\right)}$$

When T = 2 this applies to the Beta distribution; when T is higher-dimensional it applies to the Dirichlet distribution.

From this result we can see that under Bayesian estimation the parameter p follows a new Beta distribution. Recall that we chose a Beta distribution as the prior for p; when p parameterizes a binomial distribution, the posterior obtained by Bayesian estimation is still a Beta distribution, so we say the Beta distribution is the conjugate distribution of the binomial distribution. In probabilistic language models, the conjugate distribution is usually chosen as the prior because it brings computational convenience. The most typical example is LDA: the topic distribution of each document follows a Multinomial distribution, whose prior is chosen as its conjugate, the Dirichlet distribution; the word distribution of each topic also follows a Multinomial distribution, whose prior is likewise chosen as its conjugate, the Dirichlet distribution.
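As a small illustration of conjugacy (my own example, not from the original post): with a Dirichlet prior over a multinomial parameter vector, the posterior after observing counts is again a Dirichlet whose hyperparameters are simply incremented by the counts.

```python
import numpy as np

# Dirichlet prior over a 3-outcome multinomial (hyperparameters chosen for illustration)
alpha_prior = np.array([2.0, 2.0, 2.0])

# Observed counts for each of the three outcomes
counts = np.array([10, 3, 7])

# Conjugacy: the posterior is Dirichlet with the counts added to the hyperparameters
alpha_posterior = alpha_prior + counts
posterior_mean = alpha_posterior / alpha_posterior.sum()
print("Posterior hyperparameters:", alpha_posterior)  # [12. 5. 9.]
print("Posterior mean:", posterior_mean)
```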

Using the formulas for the expectation and variance of the Beta distribution, we have

$$E[p \mid X] = \frac{n_1 + \alpha}{n_1 + n_0 + \alpha + \beta}, \qquad \mathrm{Var}[p \mid X] = \frac{(n_1 + \alpha)(n_0 + \beta)}{(n_1 + n_0 + \alpha + \beta)^2\,(n_1 + n_0 + \alpha + \beta + 1)}$$

We can see that the expectation of p estimated this way differs from the values obtained by MLE and MAP. If we again run 20 trials with 12 heads and 8 tails, then under Bayesian estimation the parameter p follows a Beta(12 + 5, 8 + 5) distribution, whose mean and variance are 17/30 ≈ 0.567 and 17 × 13 / (31 × 30²) ≈ 0.0079 respectively. The expected value of p computed here is smaller than both the MLE and the MAP estimate, and closer to 0.5.
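The same numbers can be obtained directly from the Beta posterior, e.g. with SciPy (a sketch reusing the assumed Beta(5, 5) prior):

```python
from scipy.stats import beta

n_heads, n_tails = 12, 8
a, b = 5, 5  # Beta prior hyperparameters, as in the MAP example

posterior = beta(n_heads + a, n_tails + b)  # Beta(17, 13)
print(f"Posterior mean:     {posterior.mean():.3f}")   # ~0.567
print(f"Posterior variance: {posterior.var():.4f}")    # ~0.0079
```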

In summary, we can visualize the parameter estimates obtained by MLE, MAP, and Bayesian estimation as follows.

[Figure: comparison of the parameter estimates obtained by MLE, MAP, and Bayesian estimation]
My personal understanding is that, going from MLE to MAP to Bayesian estimation, the representation of the parameter becomes more and more complete, the estimation results get closer and closer to the prior probability of 0.5, and the estimate is increasingly able to reflect the true situation of the parameter given the sample.


4. The differences between the three methods

First, we can see that both maximum likelihood estimation and maximum a posteriori estimation assume that the parameter θ to be estimated is a fixed but unknown value. Maximum likelihood is the simplest form: although the parameter is unknown, it is treated as a definite value, and we look for the parameter under which the observed sample has the greatest likelihood. Maximum a posteriori estimation instead optimizes the posterior probability, which adds a prior probability term. The biggest difference of Bayesian estimation from the other two is that it treats the parameter as a random variable whose value is not fixed: given the posterior distribution p(θ | X), θ can take any value from 0 to 1, but with different probabilities. MLE and MAP just pick a single point on the whole distribution p(θ | X), losing some of the information provided by the observed data X (this is also where the biggest difference between the classical and Bayesian schools of statistics lies).
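To make the contrast concrete, here is a minimal sketch (reusing the coin numbers and the assumed Beta(5, 5) prior from the examples above) showing that MLE and MAP each return a single point, while Bayesian estimation returns a whole distribution over p:

```python
from scipy.stats import beta

n_heads, n_tails = 12, 8
a, b = 5, 5  # assumed Beta prior hyperparameters

p_mle = n_heads / (n_heads + n_tails)                        # a single point
p_map = (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)  # a single point
posterior = beta(n_heads + a, n_tails + b)                   # a full distribution

lo, hi = posterior.interval(0.95)
print(f"MLE point estimate: {p_mle:.3f}")
print(f"MAP point estimate: {p_map:.3f}")
print(f"Bayesian posterior: Beta({n_heads + a}, {n_tails + b}), "
      f"mean {posterior.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```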

 

 

References:

1. Gregor Heinrich, Parameter estimation for text analysis, technical report.

2. Parameter estimation for text language models - maximum likelihood estimation, MAP and Bayesian estimation, http://blog.csdn.net/yangliuy/article/details/8296481

3. Reading notes on "Gibbs Sampling for the Uninitiated" (part 1) - an introduction to parameter estimation methods and Gibbs sampling, http://crescentmoon.info/2013/06/29/Gibbs%20Sampling%20for%20the%20UniniTiated-1/
----------------
Disclaimer: this article is an original post by CSDN blogger "yt71656", released under the CC 4.0 BY-SA copyright agreement; please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/yt71656/article/details/42585873
