Recently I have been thinking about unsupervised learning in NLP and some topics related to probabilistic graphical models, so I have reorganized my notes on a few parameter estimation methods.
In deep learning, parameter estimation is one of the most basic steps: it is what we usually call model training. To train a model there must be a loss function, and for readers who have not systematically studied probability theory, the most natural loss that comes to mind is the mean squared error, which corresponds to the Euclidean distance.
In theory, however, the best match for a probabilistic model is the "cross-entropy" loss, which is derived from the maximum likelihood principle in probability theory.
Maximum likelihood
Reasonable existence
What is maximum likelihood? There is a saying in philosophy that "whatever exists is reasonable", and maximum likelihood says "whatever exists is the most reasonable". Specifically, suppose the probability distribution of an event $X$ is $p(X)$, and suppose the values actually observed are $X_1, X_2, \dots, X_N$. Assuming the observations are independent of each other, the probability of seeing exactly this batch of data is
$$L = \prod_{i=1}^{N} p(X_i)\tag{1}$$
Maximum likelihood asserts that this probability
is the largest. If $p(X)$ is a probability distribution $p_{\theta}(X)$ with parameter $\theta$, then we should find a way to choose $\theta$ so that $L$ is maximized, that is:
$$\theta = \mathop{\arg\max}_{\theta} \prod_{i=1}^{N} p_{\theta}(X_i)\tag{2}$$
Taking the logarithm of the probability gives the equivalent form:
$$\theta = \mathop{\arg\max}_{\theta} \sum_{i=1}^{N} \log p_{\theta}(X_i)\tag{3}$$
If we divide the right-hand side by $N$, we get a tidier expression:
$$\theta = \mathop{\arg\max}_{\theta} L(\theta),\quad L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}(X_i)\tag{4}$$
where we call $-L(\theta)$ the cross-entropy.
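As a minimal sketch (the data and grid below are my own toy choices, not from the original text), we can estimate a Bernoulli parameter by minimizing the cross-entropy $-L(\theta)$ from (4) over a grid of candidate $\theta$; the minimizer should land at the sample mean, which is the closed-form MLE for a Bernoulli distribution:

```python
import numpy as np

# Toy setup: N coin flips drawn from a Bernoulli(0.7) distribution.
rng = np.random.default_rng(0)
samples = rng.binomial(1, 0.7, size=10000)  # observations X_1 .. X_N

# Candidate parameter values theta in (0, 1), step 0.01.
thetas = np.linspace(0.01, 0.99, 99)

# Cross-entropy -L(theta) = -(1/N) * sum_i log p_theta(X_i),
# with p_theta(1) = theta and p_theta(0) = 1 - theta.
neg_log_lik = -np.mean(
    samples[None, :] * np.log(thetas[:, None])
    + (1 - samples[None, :]) * np.log(1 - thetas[:, None]),
    axis=1,
)
theta_hat = thetas[np.argmin(neg_log_lik)]
print(theta_hat)        # grid point nearest the sample mean
print(samples.mean())   # closed-form MLE for comparison
```

The grid search is of course only for illustration; in deep learning the same objective is minimized by gradient descent over the model parameters.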
Theoretical form
In theory, from the existing data we can compute the empirical frequency $\tilde{p}(X)$ of each value $X$, which gives an equivalent form of the formula above:
$$L(\theta) = \sum_{X} \tilde{p}(X) \log p_{\theta}(X)\tag{5}$$
But in practice it is almost impossible to obtain $\tilde{p}(X)$ itself (especially for continuous distributions); what we can compute directly is the expectation with respect to it, which is precisely (4): to estimate the expectation, we only need to evaluate $\log p_{\theta}(X_i)$ at each sample, sum, and divide by $N$. Formula (5) therefore has only theoretical value, but it will make the derivations below more convenient.
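For discrete data the two forms are exactly equal, since grouping identical samples turns the average in (4) into the frequency-weighted sum in (5). A small check with toy numbers of my own:

```python
import math
from collections import Counter

samples = [0, 1, 1, 2, 2, 2, 2, 1, 0, 1]   # observed values X_1 .. X_N
p_theta = {0: 0.2, 1: 0.3, 2: 0.5}          # some model distribution p_theta

N = len(samples)
# Equation (4): average of log p_theta(X_i) over the samples.
L_4 = sum(math.log(p_theta[x]) for x in samples) / N
# Equation (5): sum over distinct X of empirical frequency times log p_theta(X).
freq = {x: c / N for x, c in Counter(samples).items()}
L_5 = sum(freq[x] * math.log(p_theta[x]) for x in freq)
print(L_4, L_5)  # identical up to floating point
```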
It should be noted that the above discussion is quite general: $X$ can be any object, including a continuous real number. In the continuous case the summation should be replaced by an integral and $p(X)$ becomes a probability density function, but nothing essential changes.