Moved from friendly-intro-to-cross-entropy-loss
Overview
......
In this article we focus on cases where the classes are mutually exclusive. For example, if we want to know whether a picture shows a landscape, a horse, or something else, our model takes the picture as input and outputs three numbers, each representing the probability of the corresponding class.
During training, say we feed in a landscape picture; we want the probability output to be close to the one-hot label $y = (1.0, 0.0, 0.0)$. If our model instead predicts a different distribution, say $\hat{y} = (0.4, 0.1, 0.5)$, then we want to keep training the network's parameters to bring $\hat{y}$ as close as possible to the accurate output $y$.
But how do we judge "closeness"? That is, how do we measure the difference between $\hat{y}$ and $y$?
Cross entropy
Suppose you're standing beside a highway in the city during rush hour, and you want to report every type of car you see to your roommate on the other side of the highway. You have to send binary codes, and you can't send them carelessly, because each symbol costs 10 yuan to send.
Suppose there are $n$ types of cars on the road. The simplest scheme assigns every type a fixed-length code of $\log_2 n$ bits. But this doesn't seem good enough.
Consider the Civic and the Tesla: there are far more Civics on the road than Teslas. If you had more specific information about the distribution, say from a sample survey done in advance, could you save a lot of communication cost?
Suppose the number of Civics is 128 times the number of Teslas. Then when designing the code we give more frequent types shorter codewords: a type seen with probability $p$ gets a codeword of $\log_2 \frac{1}{p}$ bits, so the Civic's codeword is $\log_2 128 = 7$ bits shorter than the Tesla's.
If you have studied the principles of communication, you will recognize that this is exactly the definition of information content, $I(x) = \log_2 \frac{1}{p(x)}$, which comes from Shannon's theory.
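To make the bit-counting concrete, here is a small sketch in Python; the 128:1 Civic-to-Tesla ratio is the assumed example from above:

```python
import math

def info_bits(p):
    """Shannon information content: the ideal code length, in bits,
    for an event that occurs with probability p."""
    return math.log2(1 / p)

# Assumed two-car world where Civics are 128x as common as Teslas.
p_civic = 128 / 129
p_tesla = 1 / 129

# The rarer Tesla deserves a longer codeword; the gap is log2(128) = 7 bits.
gap = info_bits(p_tesla) - info_bits(p_civic)
print(round(gap, 6))  # ≈ 7.0
```

Note that rare events carry more information, and hence earn longer codewords, which is exactly why the frequency survey saves money.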
Then the information entropy, the average number of bits per symbol under the optimal code, can be written as

$$H(p) = \sum_i p_i \log_2 \frac{1}{p_i}$$
Maybe you can already see where this is going: cross entropy is what we get when we encode data that actually follows $p$ using the code designed for $q$. That is, using the wrong code instead of the correct one, the average number of bits per symbol is never less than in the accurate case.
So the cross entropy is expressed as

$$H(p, q) = \sum_i p_i \log_2 \frac{1}{q_i}$$
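As a sketch, the formula above is a few lines of Python; the label and prediction values here are illustrative:

```python
import math

def cross_entropy(p, q):
    """Average bits per symbol when data from p is sent with a code built for q.
    Terms with p_i = 0 contribute nothing, so they are skipped."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

y     = [1.0, 0.0, 0.0]  # one-hot "true" distribution
y_hat = [0.4, 0.1, 0.5]  # an illustrative model prediction
print(cross_entropy(y, y_hat))  # log2(1/0.4) ≈ 1.32 bits
```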
KL divergence is the difference between cross entropy and information entropy, that is, the extra bits we pay compared with the completely accurate case:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

Then we seem to have found the answer to the question above: treating $y$ as $p$ and $\hat{y}$ as $q$, if we want $\hat{y}$ to be close to $y$, we can minimize $H(y, \hat{y})$ (since $H(y)$ is fixed by the data, this also minimizes the KL divergence).
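We can check numerically that KL divergence, computed as this difference, is non-negative; the two distributions below are made up for illustration:

```python
import math

def entropy(p):
    """Average bits per symbol under the optimal code for p."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits per symbol when p-distributed data uses q's code."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid, on average, for using the wrong code."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]  # true distribution (illustrative)
q = [0.25, 0.25, 0.5]  # mismatched coding distribution
print(kl_divergence(p, q))  # 0.25 extra bits per symbol
print(kl_divergence(p, p))  # 0.0: the right code costs nothing extra
```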
......
Predictive power
After the discussion above, perhaps we can excitedly use cross entropy to measure the distance between the two distributions $y$ and $\hat{y}$, and use the cross entropy over all training samples as our loss function. In particular, if we use $n$ to index the $N$ training examples, the loss can be written as the mean of the per-example cross entropies,

$$L = \frac{1}{N} \sum_{n} H(y_n, \hat{y}_n)$$
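A minimal sketch of this loss over a tiny hand-made batch; the labels and predictions are invented for illustration:

```python
import math

def cross_entropy(p, q):
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

# A hypothetical batch of N = 2 one-hot labels and model predictions.
ys     = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
y_hats = [[0.4, 0.1, 0.5],
          [0.2, 0.7, 0.1]]

# Mean per-example cross entropy, as written above.
loss = sum(cross_entropy(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)
print(round(loss, 4))
```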
Let's look at another approach: what if we used a function to directly measure the model's predictive power?
The most common such method is to adjust the parameters so that the likelihood of the observed data under our model is maximal. If we assume our samples are independently and identically distributed, then the likelihood of the whole data set can be written as the product of the individual examples' likelihoods,

$$L = \prod_n L_n$$
So how do we represent the likelihood $L_n$ of the $n$th example? It is the probability the model assigns to that sample's particular true class:

$$L_n = \prod_i \hat{y}_{n,i}^{\,y_{n,i}}$$
For example, suppose the true distribution is $y = (1.0, 0.0, 0.0)$ and the model outputs $\hat{y} = (0.4, 0.1, 0.5)$; then the likelihood of this example is $0.4^{1.0} \cdot 0.1^{0.0} \cdot 0.5^{0.0} = 0.4$. Note that although this looks like we simply read off the first value, it is not: it equals 0.4 because the corresponding $y_i$ is the only one equal to 1, so every factor whose exponent $y_i = 0$ contributes 1 and drops out of the product.
If we can maximize the likelihood, then we can find parameters that give our model the strongest predictive power. In fact, maximizing the likelihood is the same as maximizing its logarithm, which in turn is the same as minimizing the logarithm of its reciprocal, the negative log-likelihood $-\log L = \sum_n -\log L_n$.
So

$$-\log_2 L_n = -\log_2 \prod_i \hat{y}_{n,i}^{\,y_{n,i}} = \sum_i y_{n,i} \log_2 \frac{1}{\hat{y}_{n,i}} = H(y_n, \hat{y}_n)$$
That is to say, minimizing the negative log-likelihood over the training set is exactly minimizing the cross-entropy loss: the two approaches lead to the same objective.
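A quick numerical check of this equivalence, using the same illustrative one-hot example as before:

```python
import math

y     = [1.0, 0.0, 0.0]  # one-hot label
y_hat = [0.4, 0.1, 0.5]  # illustrative model output

# Likelihood of this example: product of predictions raised to the labels.
likelihood = math.prod(q ** p for p, q in zip(y, y_hat))  # 0.4

# Negative log-likelihood (in bits) vs. cross entropy: they coincide.
nll = -math.log2(likelihood)
ce  = sum(p * math.log2(1 / q) for p, q in zip(y, y_hat) if p > 0)

print(likelihood)            # 0.4
print(abs(nll - ce) < 1e-9)  # True
```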