From Maximum Likelihood to EM Algorithms: A Consistent Approach to Understanding

Recently, I have been thinking about unsupervised learning in NLP and some topics related to probabilistic graphical models, so I have taken the opportunity to reorganize some parameter estimation methods.

In deep learning, parameter estimation is one of the most basic steps; it is what we usually call model training. To train a model there must be a loss function, and for readers who have not systematically studied probability theory, the most natural loss function that comes to mind is the mean squared error, which corresponds to the familiar Euclidean distance.
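
As a quick illustration (a minimal sketch with made-up numbers, not taken from the article), the mean squared error is just the squared Euclidean distance between predictions and targets, averaged over the samples:

```python
import numpy as np

# Hypothetical targets and model outputs, for illustration only.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

# Mean squared error ...
mse = np.mean((y_pred - y_true) ** 2)
# ... is the squared Euclidean distance divided by the number of samples.
squared_euclidean = np.sum((y_pred - y_true) ** 2)

print(mse, squared_euclidean / len(y_true))  # the two values agree
```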

In theory, the loss that best matches a probability model is the "cross-entropy", which is derived from the maximum likelihood principle in probability theory.

Maximum likelihood

Existence is reasonable

What is maximum likelihood? There is a saying in philosophy that "whatever exists is reasonable", and maximum likelihood says that "whatever is observed is the most reasonable". Specifically, suppose the probability distribution of an event X is p(X), and that in one round of observation the values actually observed are X1, X2, …, XN, assumed to be independent of each other. Then

$$L = \prod_{i=1}^{N} p(X_i)\tag{1}$$

is the largest. If p(X) is a probability distribution pθ(X) with parameter θ, then we should choose θ so that L is maximized, that is:

$$\theta = \mathop{\arg\max}_{\theta} L = \mathop{\arg\max}_{\theta} \prod_{i=1}^{N} p_{\theta}(X_i)\tag{2}$$

Since the logarithm is monotonically increasing, taking the logarithm of the probability gives the equivalent form:

$$\theta = \mathop{\arg\max}_{\theta} \sum_{i=1}^{N} \log p_{\theta}(X_i)\tag{3}$$

If we divide the right-hand side by N, we get a more refined expression:

$$\theta = \mathop{\arg\max}_{\theta} L(\theta),\qquad L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}(X_i)\tag{4}$$

where we call −L(θ) the cross-entropy.
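
To make formula (4) concrete, here is a minimal sketch, assuming a Bernoulli model and made-up 0/1 observations (both are illustrative choices, not part of the original derivation): we pick θ by maximizing the average log-likelihood L(θ), i.e. by minimizing the cross-entropy −L(θ).

```python
import numpy as np

# Hypothetical observations X_1, ..., X_N, assumed drawn from a Bernoulli distribution.
X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def avg_log_likelihood(theta, X):
    # L(theta) = (1/N) * sum_i log p_theta(X_i) for the Bernoulli model.
    return np.mean(X * np.log(theta) + (1 - X) * np.log(1 - theta))

# Maximize L(theta) (equivalently, minimize the cross-entropy -L(theta))
# by a simple grid search over candidate values of theta.
thetas = np.linspace(0.01, 0.99, 99)
best_theta = thetas[np.argmax([avg_log_likelihood(t, X) for t in thetas])]

print(best_theta)   # close to 0.7
print(X.mean())     # the sample frequency, which is the analytic maximizer
```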

Theoretical form

In theory, from the existing data we can obtain the statistical frequency p̃(X) of each X, and then the above formula has the equivalent form:

$$\theta = \mathop{\arg\max}_{\theta} \sum_{X} \tilde{p}(X) \log p_{\theta}(X) = \mathop{\arg\max}_{\theta} \mathbb{E}_{X\sim \tilde{p}(X)}\left[\log p_{\theta}(X)\right]\tag{5}$$

But in practice it is almost impossible to obtain p̃(X) exactly (especially for continuous distributions). What we can directly compute is the expectation with respect to it, which is precisely (4): to estimate the expectation we only need to compute log pθ(Xi) for each sample, then sum and divide by N. Therefore, formula (5) is mainly of theoretical value, but it will facilitate the derivations that follow.
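
For a discrete X the two forms coincide exactly, since (4) is just the expectation in (5) written as a sample average. A minimal sketch (the categorical data and the candidate model below are illustrative assumptions):

```python
import numpy as np

# Hypothetical discrete observations (a categorical variable taking values 0, 1, 2).
X = np.array([0, 1, 2, 1, 1, 0, 2, 1, 0, 1])
theta = np.array([0.2, 0.5, 0.3])   # some candidate model p_theta(X = k)

# Form (4): average log-likelihood over the samples.
form_4 = np.mean(np.log(theta[X]))

# Form (5): expectation of log p_theta(X) under the empirical frequencies p_tilde.
values, counts = np.unique(X, return_counts=True)
p_tilde = counts / counts.sum()
form_5 = np.sum(p_tilde * np.log(theta[values]))

print(form_4, form_5)   # identical up to floating-point error
```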

It should be noted that the above description is very general: X can be any object, and it may also be a continuous real number, in which case the summation should be replaced by an integral and p(X) becomes a probability density function. Of course, this poses no essential difficulty.
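
Under that replacement, the continuous counterpart of (5) would read (written out here only for reference, with p̃ understood as a density):

$$\theta = \mathop{\arg\max}_{\theta} \int \tilde{p}(x)\,\log p_{\theta}(x)\,\mathrm{d}x$$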
