Likelihood, maximum likelihood, log likelihood, maximum a posteriori, etc.

1 Likelihood

Let the population X obey the distribution P(x; θ) (a probability density function when X is a continuous random variable, a probability mass function when X is a discrete random variable), where θ is the parameter to be estimated (the system parameter). Let X1, X2, ..., Xn be samples drawn from the population X, and let x1, x2, ..., xn be one set of observed values of X1, X2, ..., Xn. Then the joint distribution of the samples, viewed as a function of θ, is called the likelihood function L(θ). For discrete variables it takes the form

L(θ) = L(x1, x2, ..., xn; θ) = ∏_{i=1}^{n} p(xi; θ)
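For a concrete feel, here is a minimal sketch (with hypothetical coin-flip data, not from the original post) that evaluates L(θ) for a Bernoulli model at a few candidate values of θ:

```python
import numpy as np

# Hypothetical observations: 1 = heads, 0 = tails (made up for illustration)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def likelihood(theta, x):
    """L(theta) = product over i of p(x_i; theta) for a Bernoulli(theta) model."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

for theta in [0.3, 0.5, 0.75, 0.9]:
    print(f"theta={theta:.2f}  L(theta)={likelihood(theta, x):.6f}")
# The largest value occurs at theta = 0.75, the sample proportion of heads (6 out of 8).
```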

2 Maximum Likelihood

Maximum likelihood estimation therefore takes the value of θ that maximizes L(θ), i.e. the parameter under which the observed values x1, x2, ..., xn are most likely to have occurred. When the likelihood function attains its maximum, the parameter θ fits the given data distribution well: under θ, the values predicted by the model are close to the real values, which corresponds to a smaller loss.
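Continuing the same hypothetical coin-flip data, the sketch below maximizes L(θ) over a grid of candidate values and compares the result with the known closed-form Bernoulli MLE (the sample proportion of heads):

```python
import numpy as np

# Same hypothetical Bernoulli observations as above
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Evaluate the likelihood on a fine grid of candidate parameters
thetas = np.linspace(0.01, 0.99, 981)
L = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])

theta_mle_grid = thetas[np.argmax(L)]
theta_mle_closed = x.mean()  # closed-form Bernoulli MLE: the sample mean

print(f"grid-search MLE : {theta_mle_grid:.3f}")   # 0.750
print(f"closed-form MLE : {theta_mle_closed:.3f}") # 0.750
```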

3 Log Likelihood

The reason for using the log likelihood instead of the plain likelihood is mathematical convenience: taking the logarithm turns multiplication into addition. At the same time, the probabilities entering the joint probability are all numbers below 1, so a product of many such probabilities becomes smaller and smaller; when the value gets too small, it causes numerical precision (underflow) problems when programming.
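A quick numerical illustration (with made-up per-sample probabilities) of why the logarithm matters in practice: the raw product underflows to 0 in floating point, while the sum of logs remains a finite, usable number.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=10_000)  # hypothetical per-sample probabilities, all < 1

prod = np.prod(p)            # product of 10,000 numbers below 1: underflows to 0.0
log_sum = np.sum(np.log(p))  # log likelihood: a finite negative number, safe to work with

print(prod)     # 0.0
print(log_sum)  # finite, on the order of -10000
```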

4 MAP and MLE

Frequentist school - Maximum Likelihood Estimation (MLE)

Bayesian school - Maximum A Posteriori estimation (MAP)

4.1 The Bayesian School

In the Bayesian school there are two inputs and one output. The inputs are the prior p(θ) and the likelihood p(X|θ), and the output is the posterior p(θ|X). They are related by Bayes' theorem: p(θ|X) = p(X|θ) p(θ) / p(X).
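A minimal numerical sketch of these two inputs and one output (the data and the prior below are made up for illustration): multiply the prior by the likelihood on a grid of θ values, then normalize to obtain the posterior.

```python
import numpy as np

# Hypothetical Bernoulli data: 6 heads out of 8 flips
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

thetas = np.linspace(0.001, 0.999, 999)  # grid over the parameter theta

# Assumed prior: an (unnormalized) Gaussian-shaped belief that theta is near 0.5
prior = np.exp(-(thetas - 0.5) ** 2 / (2 * 0.1 ** 2))

# Likelihood p(X|theta) evaluated at every grid point
likelihood = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])

unnormalized = prior * likelihood              # p(X|theta) * p(theta)
posterior = unnormalized / unnormalized.sum()  # dividing by the sum plays the role of p(X)

print(f"prior mode     : {thetas[np.argmax(prior)]:.3f}")       # 0.500
print(f"likelihood mode: {thetas[np.argmax(likelihood)]:.3f}")  # 0.750 (the MLE)
print(f"posterior mode : {thetas[np.argmax(posterior)]:.3f}")   # between the two, pulled toward the prior
```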

4.2 MAP

The MAP estimate maximizes the posterior. Writing it out:

θ_MAP = argmax_θ p(θ|X)
      = argmax_θ p(X|θ) p(θ) / p(X)
      = argmax_θ p(X|θ) p(θ)
      = argmax_θ [log p(X|θ) + log p(θ)]

From the second row to the third row, p(X) can be discarded because it has nothing to do with θ.

Assuming the prior is a Gaussian distribution centered at zero, p(θ) ∝ exp(−||θ||² / (2σ²)), then log p(θ) = −||θ||² / (2σ²) + constant, and the MAP objective becomes

θ_MAP = argmax_θ [log p(X|θ) − (1/(2σ²)) ||θ||²]

where log p(X|θ) is exactly the MLE objective. Using a Gaussian prior in MAP is therefore equivalent to adding an L2 regularization term to the MLE!
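The sketch below illustrates this equivalence on hypothetical data, assuming a unit-variance Gaussian likelihood and a zero-mean Gaussian prior on a scalar θ: the negative log posterior is exactly the negative log likelihood plus an L2 penalty, and the MAP estimate is shrunk toward 0 relative to the MLE.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data, assumed drawn from N(theta_true, 1)
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=20)
tau2 = 0.5  # assumed prior variance: theta ~ N(0, tau2)

def neg_log_likelihood(theta):
    # -log p(X|theta) for a unit-variance Gaussian (additive constants dropped)
    return 0.5 * np.sum((x - theta) ** 2)

def neg_log_posterior(theta):
    # -[log p(X|theta) + log p(theta)] = MLE objective + L2 penalty on theta
    return neg_log_likelihood(theta) + theta ** 2 / (2 * tau2)

theta_mle = minimize_scalar(neg_log_likelihood).x
theta_map = minimize_scalar(neg_log_posterior).x

print(f"MLE (sample mean)     : {theta_mle:.3f}")
print(f"MAP (shrunk toward 0) : {theta_map:.3f}")
print(f"closed-form MAP check : {x.sum() / (len(x) + 1 / tau2):.3f}")
```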

5 Maximum Likelihood Is Equivalent to Minimum KL Divergence


Maximum likelihood estimation is equivalent to minimizing the KL divergence between the empirical data distribution and the model distribution p(·; θ): the θ that maximizes the log likelihood is exactly the θ that minimizes this KL divergence.
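Since the original figure is not reproduced here, a short derivation of the equivalence is written out below, with p̂_data denoting the empirical distribution of the observations x1, x2, ..., xn:

```latex
\begin{aligned}
\theta_{\mathrm{MLE}}
&= \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i;\theta)
 = \arg\max_{\theta} \; \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}}\!\left[\log p(x;\theta)\right] \\
&= \arg\min_{\theta} \; \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}}\!\left[\log \hat{p}_{\mathrm{data}}(x) - \log p(x;\theta)\right]
 \qquad (\text{the added term does not depend on } \theta) \\
&= \arg\min_{\theta} \; D_{\mathrm{KL}}\!\left(\hat{p}_{\mathrm{data}} \,\|\, p(\cdot;\theta)\right)
\end{aligned}
```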

 
