Maximum Likelihood Summary

In machine learning, we often want to approximate the overall data distribution by maximizing the likelihood of the observed data. By introducing the maximum likelihood method and some of its properties, this article aims to explain maximum likelihood in simple terms.

0. Bayesian probability

Let's first look at the classic Bayes formula:
\[ p(Y \mid X) = \frac{p(X \mid Y)\,p(Y)}{p(X)} \]

Here, \(p(Y)\) is called the prior probability (\(prior\)): the distribution of the variable \(Y\) obtained from prior knowledge. \(p(X \mid Y)\) is called the likelihood function (\(likelihood\)). \(p(X)\) is the probability of the variable \(X\). \(p(Y \mid X)\) is the conditional probability of \(Y\) given the variable \(X\), called the posterior probability (\(posterior\)).
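As a toy numerical sketch of the formula (all numbers below are made up purely for illustration): suppose the prior is \(p(Y) = 0.01\), the likelihood is \(p(X \mid Y) = 0.9\), and \(p(X \mid \neg Y) = 0.05\); the posterior then follows directly:

```python
# Toy check of Bayes' rule: p(Y|X) = p(X|Y) * p(Y) / p(X).
# All numbers are made-up illustration values.
p_Y = 0.01              # prior p(Y)
p_X_given_Y = 0.90      # likelihood p(X|Y)
p_X_given_not_Y = 0.05  # p(X|~Y), needed to expand p(X)

# Law of total probability: p(X) = p(X|Y) p(Y) + p(X|~Y) p(~Y)
p_X = p_X_given_Y * p_Y + p_X_given_not_Y * (1 - p_Y)

p_Y_given_X = p_X_given_Y * p_Y / p_X   # posterior p(Y|X)
print(round(p_Y_given_X, 3))            # ~0.154
```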

1. The likelihood function

Likelihood means possibility; as the name suggests, the likelihood function is a function expressing possibility. In statistics, it expresses the likelihood of the model parameters, i.e., it is a function of the parameters of a statistical model. Its general form is as follows:

\[ L(\omega)=p(D | \omega) = p(x_1, x_2, \cdots ,x_n| \omega) \]

where \(D\) represents the sample set \(\{x_1, x_2, \cdots, x_n\}\) and \(\omega\) represents the parameter vector.

The likelihood function expresses, for different parameter vectors \(\omega\), how likely the observed data is to occur; it is a function of the parameter vector \(\omega\). In a sense, it can be regarded as a conditional probability read in reverse \(^{[1]}\).
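As a minimal sketch of this definition (assuming, purely for illustration, i.i.d. Bernoulli observations with parameter \(\omega\) and a made-up sample set \(D\)), the likelihood is just the product of the per-observation probabilities, viewed as a function of \(\omega\):

```python
import numpy as np

def likelihood(omega, D):
    """L(omega) = p(D | omega) = prod_n p(x_n | omega) for i.i.d. Bernoulli data (toy model)."""
    D = np.asarray(D)
    per_sample = omega ** D * (1.0 - omega) ** (1 - D)  # p(x_n | omega)
    return per_sample.prod()

D = [1, 0, 1, 1, 0]   # made-up sample set {x_1, ..., x_n}
for omega in (0.3, 0.5, 0.6):
    print(omega, likelihood(omega, D))   # same data, different parameter -> different likelihood
```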

Here we use an example from Wikipedia \(^{[1]}\) to briefly illustrate the likelihood function, which also leads naturally to maximum likelihood estimation.

Consider an experiment on the quality of a coin. Generally speaking, we assume our coins are "fair" (of uniform texture), i.e., the probability of landing heads up (Head) is \(p_H = 0.5\). From this we can work out the probability of each possible outcome after a number of tosses.

For example, the probability that a coin tossed twice lands heads up both times is 0.25. Expressed as a conditional probability:
\[ P(HH \mid p_H = 0.5) = 0.5^2 = 0.25 \]
If the coin's texture is not uniform, it may be "unfair". In statistics, what we care about is the reverse question: given a known series of toss results, what can be said about the probability of the coin landing heads up? We can build a statistical model: assume the coin lands heads up with probability \(p_H\) and tails up with probability \(1 - p_H\). Given the two observed tosses, the conditional probability can be rewritten as a likelihood function:
\[ L(p_H = 0.5) = P(HH \mid p_H = 0.5) = 0.25 \]

That is, for the given likelihood function and the observation of two heads in a row, the likelihood of \(p_H = 0.5\) is 0.25. Note that the converse does not hold: when the likelihood function equals 0.25, we cannot deduce that \(p_H = 0.25\).

Now consider \(p_H = 0.6\); the value of the likelihood function changes:
\[ L(p_H = 0.6) = P(HH \mid p_H = 0.6) = 0.36 \]
As the figure below shows, the value of the likelihood function has become larger. This means that if the parameter \(p_H\) takes the value 0.6, the probability of observing two heads in a row is greater than under the assumption \(p_H = 0.5\). In other words, the parameter value 0.6 is more convincing, more "reasonable", than 0.5.

[Figure: the likelihood function \(L(p_H)\) for the observation HH, plotted against \(p_H\)]

In short, what matters about the likelihood function is not its specific value, but whether it becomes larger or smaller as the parameter varies.

For the same likelihood function, the model it represents can have many possible parameter values; but if there is one parameter value that maximizes the likelihood function, then that value is the most "reasonable" parameter value.

In this example, the likelihood function is maximized when \(p_H = 1\). That is, having observed two heads in a row, assuming that the probability of the coin landing heads up is 1 is the most "reasonable" choice.
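A short sketch of this coin example (assuming the observation is HH, so \(L(p_H) = p_H^2\)): evaluating the likelihood over a grid of candidate values reproduces the numbers above and shows the maximum at \(p_H = 1\):

```python
import numpy as np

def L(p_H):
    # Likelihood of observing HH as a function of the parameter: L(p_H) = P(HH | p_H) = p_H^2
    return p_H ** 2

print(L(0.5))                        # 0.25
print(L(0.6))                        # 0.36 -> larger, so 0.6 is the more "reasonable" of the two

grid = np.linspace(0.0, 1.0, 1001)   # candidate values of p_H
print(grid[np.argmax(L(grid))])      # 1.0 -> the maximum likelihood estimate for HH
```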

In the example above we see an extreme conclusion: every future toss will land heads up. This is how the maximum likelihood method is widely used within the frequentist school of thought. In the frequentist view, \(\omega\) is considered a fixed parameter whose value is determined by an estimator. In the Bayesian view, however, there is only the one data set \(D\) (namely the observed data set), and the uncertainty in the parameter \(\omega\) is expressed through a probability distribution over \(\omega\). For Bayesians, including a prior probability is a very natural thing, and a Bayesian approach that includes a prior will not reach the extreme conclusion above.

There are two further points worth noting. First, the likelihood function is not a probability distribution over \(\omega\): its integral over \(\omega\) is not necessarily 1. Second, likelihood \(\ne\) probability. Probability is used to predict outcomes when the parameters are known, whereas likelihood is used to estimate parameters when some outcomes are known. On the second point, for example: if I have a coin of uniform texture (known parameter), then the probability of it landing heads up is 0.5 (outcome); conversely, if I toss a coin 100 times and it lands heads up 52 times (outcome), then in all likelihood I would judge the coin to be of uniform texture (estimated parameter).
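To make the probability-vs-likelihood direction concrete, here is a sketch of the 100-toss example above using scipy's binomial distribution:

```python
import numpy as np
from scipy.stats import binom

# Probability: parameters known (fair coin, p = 0.5), predict an outcome.
print(binom.pmf(52, n=100, p=0.5))     # P(exactly 52 heads in 100 tosses | p = 0.5)

# Likelihood: outcome known (52 heads in 100 tosses), compare candidate parameters.
grid = np.linspace(0.01, 0.99, 99)
lik = binom.pmf(52, n=100, p=grid)     # L(p) for the observed outcome
print(grid[np.argmax(lik)])            # ~0.52 -- and L(0.5) is nearly as large, so "fair" is very plausible
```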

2. Maximum likelihood estimation (MLE)

Having understood the likelihood function, maximum likelihood estimation is easy to grasp: it is a method for estimating the parameters of a probability model. According to the definition of the likelihood function above, once we have a data set \(D\) we can form an estimate of \(\omega\); maximum likelihood estimation finds the most likely value, i.e., the \(\omega\) that maximizes the likelihood of sampling \(D\).

Mathematically, we search over all values of \(\omega\) for the one that maximizes the likelihood function; this estimation method is called maximum likelihood estimation. The maximum likelihood estimate is itself a function of the sample. A maximum likelihood estimate need not exist, nor need it be unique.

Estimating the coin parameter \(p_H\) in Section 1 is a classic example of maximum likelihood estimation. For other examples, see reference \(^{[2]}\).

Now let us look at the application of maximum likelihood estimation to the normal distribution:

Suppose we have an observed data set \(\mathbf{x} = (x_1, \cdots, x_N)^T\), representing \(N\) observations of a scalar variable \(x\). We assume that the observations are drawn independently from a Gaussian distribution whose mean \(\mu\) and variance \(\sigma^2\) are unknown, and we would like to determine these parameters from the data set. The joint probability of independent events is the product of their marginal probabilities. Since our data set \(\mathbf{x}\) is independent and identically distributed, given \(\mu\) and \(\sigma^2\) we can write the likelihood function of the Gaussian distribution as:
\[ p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) \]

To simplify the analysis and help with numerical computation, we take the logarithm of the likelihood function (maximizing the log-likelihood is equivalent to maximizing the likelihood function, which is easy to prove):
\[ \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) \]
Maximizing the log-likelihood with respect to \(\mu\) gives the maximum likelihood solution for \(\mu\):
\[ \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n \]
which is simply the sample mean. Similarly, the maximum likelihood solution for the variance \(\sigma^2\) is:
\[ \sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2 \]
This completes the maximum likelihood estimation for the normal distribution.
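A minimal sketch of these two closed-form solutions on synthetic data (assuming numpy's random generator; note that np.var uses the same \(1/N\) convention by default, so it matches \(\sigma_{ML}^2\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic data: true mu = 2, sigma^2 = 9

mu_ml = x.sum() / len(x)                          # mu_ML     = (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).sum() / len(x)     # sigma2_ML = (1/N) sum_n (x_n - mu_ML)^2

print(mu_ml, sigma2_ml)                           # close to 2 and 9
print(np.isclose(mu_ml, x.mean()), np.isclose(sigma2_ml, x.var()))  # True True (np.var uses ddof=0)
```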

3. The bias of maximum likelihood

The maximum likelihood approach to estimating parameters has some limitations \(^{[3]}\). Besides extreme conclusions like the coin case in Section 1, it can also produce biased estimates, where the expectation of the estimator \(\ne\) the true value. In particular, the maximum likelihood approach systematically underestimates the variance of the distribution. The proof is as follows:

The expectation \(E[\mu_{ML}]\) of the mean estimate \(\mu_{ML}\) is:
\[ E(\mu_{ML}) = E\left(\frac{1}{N}\sum_{n=1}^{N} x_n\right) = \frac{1}{N} E\left(\sum_{n=1}^{N} x_n\right) = \frac{1}{N}\sum_{n=1}^{N} E(x_n) = \mu \]

The expectation \(E[\sigma_{ML}^2]\) of the variance estimate \(\sigma_{ML}^2\) is:
\[ E[\sigma_{ML}^2] = E\left(\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2\right) = E\left(\frac{1}{N}\sum_{n=1}^{N} x_n^2 - \mu_{ML}^2\right) = \frac{1}{N}\sum_{n=1}^{N} E(x_n^2) - E(\mu_{ML}^2) \]

We then compute the two remaining expectations. The second moment of the normal distribution gives
\[ E(x_n^2) = \mu^2 + \sigma^2 \]
and
\[ E(\mu_{ML}^2) = E\left(\left(\frac{x_1 + x_2 + \cdots + x_N}{N}\right)^2\right) = \frac{1}{N^2}\left(N^2\mu^2 + N\sigma^2\right) = \mu^2 + \frac{\sigma^2}{N} \]

Therefore:
\[ E[\sigma_{ML}^2] = \left(\mu^2 + \sigma^2\right) - \left(\mu^2 + \frac{\sigma^2}{N}\right) = \frac{N-1}{N}\sigma^2 \]
which demonstrates that the maximum likelihood estimate of the variance is biased. The two expectations used above can be verified from basic properties of the normal distribution.
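The \(\frac{N-1}{N}\) factor can be checked by a quick simulation (a sketch with made-up settings: \(N = 2\) points per data set and true \(\sigma^2 = 1\), so the expected ML variance should be about 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, trials = 2, 1.0, 200_000

# Many independent data sets of size N drawn from N(0, sigma^2)
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))

mu_ml = samples.mean(axis=1, keepdims=True)          # per-data-set mu_ML
sigma2_ml = ((samples - mu_ml) ** 2).mean(axis=1)    # per-data-set sigma2_ML (the 1/N estimator)

print(sigma2_ml.mean())       # ~0.5, i.e. (N-1)/N * sigma^2 -- systematically below the true value 1.0
```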

Here, PRML \(^{[3]}\) gives a more intuitive explanation, shown in the figure below:

[Figure from PRML: the true Gaussian (green) and three maximum likelihood fits (red), each to a data set of two points (blue)]

The green curve represents the true Gaussian distribution from which the data points are generated; the three red curves are the Gaussians fitted by maximum likelihood to three data sets, each containing two blue data points. Averaging over the three data sets, the variance is clearly underestimated, because it is measured relative to the sample mean rather than relative to the true mean.

4. Postscript

Maximum likelihood is one of the most commonly used methods in machine learning, so a deep understanding of its meaning is both necessary and useful, and it should also help a great deal with understanding probability theory and common models. Of course, the maximum likelihood method has other properties, such as functional invariance and asymptotic behaviour, whose proofs are not given here owing to my limited time and ability; interested readers can consult Wikipedia \(^{[2]}\). Most of this article is a summary and excerpt of existing material, shared in the hope that we can all learn from it.

References:

  1. https://zh.wikipedia.org/wiki/%E4%BC%BC%E7%84%B6%E5%87%BD%E6%95%B0
  2. https://zh.wikipedia.org/wiki/%E6%9C%80%E5%A4%A7%E4%BC%BC%E7%84%B6%E4%BC%B0%E8%AE%A1
  3. Pattern Recognition and Machine Learning (PRML)
  4. Theory of Point Estimation
  5. https://www.zhihu.com/question/35670078
