Understanding Cross entropy

Moved from friendly-intro-to-cross-entropy-loss

Overview

......

In this post we focus on cases where the classes are mutually exclusive. For example, suppose we want to know whether a picture shows a landscape, a horse, or something else. Our model takes the picture as input and outputs three numbers, each representing the probability of the corresponding category.

During training, say we feed in a landscape picture; we want the output probabilities to be close to $y = (1.0,\ 0.0,\ 0.0)$. If our model predicts a different distribution, outputting, say, $\hat{y} = (0.4,\ 0.1,\ 0.5)$, then we want to keep adjusting the network's parameters to bring the output as close to the target as possible.

But how do we judge this "closeness"? That is, how do we measure the difference between $y$ and $\hat{y}$?
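To fix notation for the rest of the post, here is a minimal sketch of the two distributions in NumPy (the prediction $(0.4,\ 0.1,\ 0.5)$ is only the illustrative example used throughout this post, not the output of any real model):

```python
import numpy as np

# One-hot target for a landscape picture, over the classes (landscape, horse, other)
y = np.array([1.0, 0.0, 0.0])

# An illustrative model prediction over the same three classes
y_hat = np.array([0.4, 0.1, 0.5])

# Both are probability distributions: non-negative entries summing to one
assert np.isclose(y.sum(), 1.0) and np.isclose(y_hat.sum(), 1.0)
```

The rest of the post is about turning the gap between `y` and `y_hat` into a single number that we can minimize.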

Cross entropy

Suppose you are standing next to a highway in the city during rush hour, and you want to tell your roommate on the other side of the highway the type of every car you see. You can only send binary codes, and you cannot send them carelessly, because each bit costs 10 yuan to send.

Suppose there are $n$ types of vehicles on the road, and we use the same fixed-length code of $\log_2 n$ bits for every type. But that doesn't seem good enough.

Consider the Civic and the Tesla: there are far more Civics on the road than Teslas. If you knew the distribution in more detail, for instance by doing a sample survey in advance, couldn't you save a lot on communication costs?

Suppose the number of Civics on the road is 128 times the number of Teslas. Then when we design the code, we should give the common Civic a shorter codeword and the rare Tesla a longer one; roughly speaking, a type that appears with probability $p$ deserves a codeword of about $\log_2 \frac{1}{p}$ bits, so the Civic's code can be $\log_2 128 = 7$ bits shorter than the Tesla's.

If you have studied the principles of communication, you will recognize that this is exactly the definition of the information content of a symbol, which comes from Shannon's information theory:

$$I(x) = \log_2 \frac{1}{p(x)}$$

Then the information entropy, the average number of bits per symbol when the code is built for the true distribution $p$, can be written as

$$H(p) = \sum_x p(x)\, \log_2 \frac{1}{p(x)} = -\sum_x p(x)\, \log_2 p(x)$$
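As a minimal numerical sketch of these two formulas, here is the Civic/Tesla example with the 128:1 ratio from above, pretending for simplicity that only these two vehicle types exist (an assumption made purely for illustration):

```python
import numpy as np

# Made-up two-type road that keeps the post's 128:1 Civic-to-Tesla ratio
p = {"civic": 128 / 129, "tesla": 1 / 129}

# Information content of each type: log2(1/p), the ideal code length in bits
for car, prob in p.items():
    print(f"{car}: {np.log2(1 / prob):.2f} bits")
# civic: 0.01 bits, tesla: 7.01 bits -> the Civic's code is about 7 bits
# shorter, matching log2(128) = 7

# Information entropy H(p): average bits per car when the code is built for p
H_p = sum(prob * np.log2(1 / prob) for prob in p.values())
print(f"H(p) = {H_p:.3f} bits per car")   # about 0.066 bits per car
```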

Maybe you can already see what is going on here: the only difference between cross entropy and information entropy is which distribution we build the code from, $q$, versus which distribution the symbols actually come from, $p$. That is to say, if we encode with a code built for the wrong distribution while the symbols follow the true one, the average number of bits per symbol is never less than in the accurate case.

So the cross entropy is expressed as

$$H(p, q) = \sum_x p(x)\, \log_2 \frac{1}{q(x)} = -\sum_x p(x)\, \log_2 q(x)$$

KL divergence is the difference between cross entropy and information entropy, that is, how many extra bits we need compared with the completely accurate case:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x)\, \log_2 \frac{p(x)}{q(x)}$$

Now it seems we have found the answer to the question at the top: if we take $p$ to be the true distribution $y$ and $q$ to be the model's prediction $\hat{y}$, then we can measure how far $\hat{y}$ is from $y$ with the cross entropy $H(y, \hat{y})$ (or equivalently the KL divergence, since $H(y)$ does not depend on the model).
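Continuing the same made-up two-type example, suppose we wrongly assume a 50/50 split and build our code for that distribution $q$, while cars actually arrive according to $p$. A short sketch of how $H(p)$, $H(p, q)$, and the KL divergence relate:

```python
import numpy as np

p = np.array([128 / 129, 1 / 129])   # true distribution (Civic, Tesla), as above
q = np.array([0.5, 0.5])             # wrongly assumed distribution used to build the code

H_p  = np.sum(p * np.log2(1 / p))    # information entropy: optimal bits per car
H_pq = np.sum(p * np.log2(1 / q))    # cross entropy: bits per car with the wrong code
kl   = np.sum(p * np.log2(p / q))    # KL divergence: extra bits paid for being wrong

print(H_p, H_pq, kl)                 # H_pq = 1.0 bit, which is >= H_p ≈ 0.066 bits
assert np.isclose(kl, H_pq - H_p)    # D_KL(p || q) = H(p, q) - H(p)
```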

......

Predictive power

After the discussion above, perhaps we are excited to use cross entropy to compare the two distributions $y$ and $\hat{y}$, and to use the cross entropy summed over all training samples as our loss function. In particular, if we use $n$ to index the training examples, the total loss can be written as

$$L = \sum_n H\!\left(y^{(n)}, \hat{y}^{(n)}\right) = -\sum_n \sum_i y_i^{(n)} \log \hat{y}_i^{(n)}$$
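As a sketch, here is that loss computed for a tiny hypothetical batch; the first row is the landscape example from above, and the other two rows are made up just to show the sum over examples:

```python
import numpy as np

# One-hot targets and predicted distributions for three hypothetical examples
# over the classes (landscape, horse, other)
y = np.array([
    [1.0, 0.0, 0.0],   # the landscape example from above
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
y_hat = np.array([
    [0.4, 0.1, 0.5],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

# Total cross-entropy loss: L = -sum_n sum_i y_i^(n) * log(y_hat_i^(n)).
# The natural log is used here; changing the base only rescales the loss.
loss = -np.sum(y * np.log(y_hat))
print(loss)   # ~1.63
```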

Let's look at another approach: what if we used a function that directly measures the predictive power of the model?

The more common method is to adjust the parameters so that the likelihood of the observed data under our model is as large as possible. ... If we assume that our samples are independent and identically distributed, then the likelihood of the whole dataset can be written as the product of the individual example likelihoods:

$$\mathcal{L} = \prod_n \mathcal{L}^{(n)}$$

So how do we write the likelihood of the $n$-th example? It is the probability that the model assigns to the class the sample actually belongs to, i.e., the entry of $\hat{y}^{(n)}$ corresponding to the true label $y^{(n)}$.

Say the true distribution for the landscape example is $y = (1.0,\ 0.0,\ 0.0)$ and the model's estimate is $\hat{y} = (0.4,\ 0.1,\ 0.5)$; then the likelihood of this example is $\mathcal{L}^{(n)} = 0.4$. Note that it equals $0.4$ not because it happens to be the first entry of $\hat{y}$, but because the corresponding entry of $y$ equals $1$: the likelihood is a product in which each $\hat{y}_i$ is the base and the corresponding $y_i$ is the exponent (see below),

$$\mathcal{L}^{(n)} = \prod_i \left(\hat{y}_i^{(n)}\right)^{y_i^{(n)}}$$
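A quick numeric check of this expression for the landscape example, with the same $y$ and $\hat{y}$ as before:

```python
import numpy as np

y     = np.array([1.0, 0.0, 0.0])   # true one-hot distribution for the landscape example
y_hat = np.array([0.4, 0.1, 0.5])   # model's predicted distribution

# Likelihood of this example: the product of y_hat_i raised to the power y_i.
# Every factor with y_i = 0 equals 1, so only the entry where y_i = 1 survives.
likelihood = np.prod(y_hat ** y)
print(likelihood)   # 0.4
```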

If we can maximize the likelihood, then we can find parameters that give our model the strongest predictive ability. In fact, maximizing the likelihood is the same as minimizing the logarithm of its reciprocal, which is

$$\log \frac{1}{\mathcal{L}} = \sum_n \log \frac{1}{\mathcal{L}^{(n)}}$$

So

$$\log \frac{1}{\mathcal{L}^{(n)}} = -\log \prod_i \left(\hat{y}_i^{(n)}\right)^{y_i^{(n)}} = -\sum_i y_i^{(n)} \log \hat{y}_i^{(n)} = H\!\left(y^{(n)}, \hat{y}^{(n)}\right)$$

That is to say,

$$\log \frac{1}{\mathcal{L}} = \sum_n H\!\left(y^{(n)}, \hat{y}^{(n)}\right) = L$$

so minimizing the cross-entropy loss is exactly the same as maximizing the likelihood of the training data.
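Putting the pieces together numerically with the same single example: the log of the reciprocal of the likelihood coincides with the cross entropy between $y$ and $\hat{y}$, so minimizing one is minimizing the other.

```python
import numpy as np

y     = np.array([1.0, 0.0, 0.0])
y_hat = np.array([0.4, 0.1, 0.5])

neg_log_likelihood = np.log(1 / np.prod(y_hat ** y))   # log of the reciprocal of the likelihood
cross_entropy      = -np.sum(y * np.log(y_hat))        # H(y, y_hat), with natural log

print(neg_log_likelihood, cross_entropy)               # both are ~0.916
assert np.isclose(neg_log_likelihood, cross_entropy)
```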
