Understanding the cross-entropy loss function and its role

Today we’ll look at a boring but important concept in deep learning—the cross-entropy loss function.

As a loss function, its job is to compare the model's "predicted values" against the "true values (labels)" and output a loss value. When the loss value converges, the neural network can be considered trained.

So what exactly is this so-called "cross entropy" and why can it be used as a loss function?

1. Entropy and cross-entropy

"Cross entropy" includes the two parts of "cross" and "entropy".

The description of "entropy" is explained in more detail inUnderstanding the Nature of Entropy. In general, entropy can be used to measure the uncertainty of a random variable, which can be expressed mathematically as:

H(P) = -∑ P(i) * log(P(i))

We can reshape this slightly by folding the negative sign into log(P(i)) and treating the result as a single quantity:

PP(i) = -log(P(i))

Then the formula for entropy can be written as: H(P) = ∑ P(i) * PP(i)

In this form, P(i) and PP(i) come from the same probability distribution. The entropy H(P) is therefore the mathematical expectation of PP(i) under P, which can be understood informally as a weighted average.
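As a small illustration of entropy as the weighted average of PP(i), here is a minimal Python/NumPy sketch (the coin distributions are just example inputs):

```python
import numpy as np

def entropy(p):
    """H(P) = sum_i P(i) * PP(i), where PP(i) = -log(P(i)) = log(1 / P(i))."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]                   # convention: 0 * log(0) = 0
    surprisal = np.log(1.0 / nonzero)    # PP(i), the information content of outcome i
    return float(np.sum(nonzero * surprisal))

print(entropy([0.5, 0.5]))    # ~0.69: a fair coin, high uncertainty
print(entropy([0.99, 0.01]))  # ~0.06: a heavily biased coin, low uncertainty
```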

The greater the entropy, the greater the uncertainty of the event. Cross entropy, in turn, measures the difference between two probability distributions. For two probability distributions P and Q,

cross entropy is defined as:

H(P, Q) = -∑ P(i) * log(Q(i))

Here, P(i) and Q(i) come from two different probability distributions: the probabilities of P weight the information content -log(Q(i)) of Q. This mixing of two distributions is where the "cross" in cross entropy comes from.

P(i) is the true distribution, i.e., the distribution of the labels used during training; Q(i) is the predicted distribution, i.e., the distribution output by the model at each iteration.

The smaller the cross entropy, the closer the two probability distributions are.
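As a minimal sketch (Python/NumPy, with illustrative numbers), computing the definition directly shows this behavior:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_i P(i) * log(Q(i))."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)  # avoid log(0)
    return float(-np.sum(p * np.log(q)))

p = [1.0, 0.0, 0.0]                          # true distribution (a one-hot label)
print(cross_entropy(p, [0.5, 0.3, 0.2]))     # ~0.69: rough prediction
print(cross_entropy(p, [0.9, 0.05, 0.05]))   # ~0.11: prediction much closer to P
```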

Therefore, as training drives the cross entropy down, the model's predictions move closer to the true labels; when the loss converges, training is complete.

For the detailed mathematics, see "The Nature of Entropy"; we don't need to go deeper here, the conclusions above are enough.

2. Cross entropy as loss function

Suppose there is a dataset of animal images with five different animals and only one animal in each image.

(Image source: https://www.freeimages.com/)

We label the animal in each image using one-hot encoding. If you are not familiar with one-hot encoding, see "Here is a one-hot that you can definitely understand".

After encoding, the label for each class looks like this:

Dog:      [1, 0, 0, 0, 0]
Fox:      [0, 1, 0, 0, 0]
Horse:    [0, 0, 1, 0, 0]
Eagle:    [0, 0, 0, 1, 0]
Squirrel: [0, 0, 0, 0, 1]

We can regard each image's one-hot label as its probability distribution, then:

The probability that the first image is a dog is 1.0 (100%).

The probability that the second image is a fox is 1.0 (100%).

And so on. Since each label puts all of its probability on a single class, the entropy of every image's label distribution is zero.

In other words, one-hot encoded labels tell us with 100% certainty which animal is in each image: the first image cannot be 90% dog and 10% cat, because it is 100% dog.

Because this is the training label, it is a fixed and certain distribution.
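A quick check of this in Python (a sketch assuming the five-class one-hot labels above):

```python
import numpy as np

# Label distribution of the first image (dog), one-hot encoded.
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Only the nonzero entry contributes (0 * log(0) is taken to be 0),
# and 1.0 * log(1.0 / 1.0) = 0, so the entropy is exactly zero.
h = np.sum(p[p > 0] * np.log(1.0 / p[p > 0]))
print(h)  # 0.0
```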

Now, imagine you have a neural network model making predictions on these images. After the neural network performs one training iteration, it may classify the first image (dog) as follows:

This classification shows that the first image has a 40% probability of being a dog, a 30% probability of being a fox, a 5% probability of being a horse, a 5% probability of being an eagle, and a 20% probability of being a squirrel.

However, the label says the image is 100% dog; the label gives us the exact probability distribution for this image.

So, how to evaluate the effect of model prediction at this time?

We can calculate the cross entropy using the one-hot encoding of the labels as the true probability distribution P and the model predictions as Q:
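Because the label P puts all of its probability on "dog", every term with P(i) = 0 vanishes and only the dog term remains:

H(P, Q) = -∑ P(i) * log(Q(i)) = -log(0.4) ≈ 0.92 (natural log), or about 1.32 if base 2 is used.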

This result is significantly higher than the label's entropy of zero, indicating that the prediction is not very good.

Let's move on to another example.

Assume the model has improved, and after another round of training (or inference) it produces the following prediction for the first image: a 98% probability of being a dog, which is already very close to the label's 100%.

We still calculate the cross entropy:
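As before, only the dog term survives:

H(P, Q) = -log(0.98) ≈ 0.02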

You can see that the cross entropy is now much lower. As the prediction becomes more accurate, the cross entropy decreases, and if the prediction were perfect it would drop to zero.

Because of this property, many classification models use cross entropy as their loss function.

In machine learning, for various reasons (such as simpler derivatives), the logarithm is usually taken with base e rather than base 2. Changing the base causes no problems, because it only rescales the result by a constant factor.
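In practice you rarely compute this by hand. As one example, here is a minimal sketch using PyTorch (assuming PyTorch; the logits below are made-up scores): its built-in nn.CrossEntropyLoss uses the natural log, expects raw pre-softmax scores rather than probabilities, and takes the true class index rather than a one-hot vector.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Illustrative logits for the five classes (dog, fox, horse, eagle, squirrel).
logits = torch.tensor([[2.0, 1.0, 0.1, 0.1, 0.5]])
target = torch.tensor([0])  # the first image's true class: dog

loss = loss_fn(logits, target)  # softmax is applied internally, then -log(Q(dog))
print(loss.item())              # ~0.64 for these made-up scores
```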


Original article: https://blog.csdn.net/dongtuoc/article/details/134888586