Derivation of the classification loss function

The role of the loss function

The loss function measures the difference between the model's predicted value and the ground-truth value. It is a non-negative real-valued function and should be differentiable everywhere.
In the training phase, the model's prediction is obtained through forward propagation. The loss function then quantifies the difference between this prediction and the ground truth, backpropagation computes the gradient of the loss with respect to every weight parameter, and the optimizer uses these gradients to update the weights so that the prediction moves closer to the ground truth. This is how learning is achieved.
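As a minimal sketch of this training loop (the model, data, and hyperparameters below are illustrative placeholders, not from the original post), one step in PyTorch might look like:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                      # any differentiable model (placeholder)
criterion = nn.CrossEntropyLoss()               # the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(32, 784)                   # dummy batch of inputs
labels = torch.randint(0, 10, (32,))            # dummy ground-truth labels

outputs = model(inputs)                         # forward propagation -> predicted values
loss = criterion(outputs, labels)               # difference between prediction and ground truth

optimizer.zero_grad()
loss.backward()                                 # backpropagation -> gradients of the loss w.r.t. the weights
optimizer.step()                                # optimizer updates the weights using those gradients
```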

Analysis of cross entropy from the perspective of entropy

Since the true distribution given by human-labeled data may differ from the distribution predicted by the model, a unified framework is needed to quantitatively compare the two, and that framework is entropy. Entropy is a concept from information theory. To understand how entropy measures the difference between two probability distributions (systems), the following concepts are needed.

Amount of information

The basic idea behind the amount of information (self-information) is how hard it is for an event to go from uncertain to certain: the smaller the probability of an event, the more information its occurrence carries.
Assume $X$ is a discrete random variable with value set $\{x_1, x_2, \dots, x_n\}$. The amount of information of the event $X = x_i$ is defined as:

$$I(x_i) = -\log p(x_i)$$

where $p(x_i)$ is the probability that $X$ takes the value $x_i$, with range $[0, 1]$. Because a smaller event probability should correspond to a larger amount of information, the formula carries a factor of $-1$: as the probability tends to 0 the amount of information tends to infinity, and as the probability tends to 1 it tends to 0.
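A quick numerical check of this definition (the natural logarithm is used here; changing the base only changes the unit of information):

```python
import math

def information(p):
    """Amount of information I(x) = -log p(x)."""
    return -math.log(p)

print(information(0.99))   # ~0.01 -> an almost certain event carries little information
print(information(0.01))   # ~4.61 -> a rare event carries a lot of information
```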

Information entropy

Information entropy, or simply entropy, is the expectation of the amount of information over all events in a system. The larger the entropy, the higher the uncertainty of the system and the greater its disorder. The entropy of a discrete random variable $X$ is:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
To build further intuition about entropy, take the example of which team wins a championship. Consider the two probability systems below: in the first, every team has the same probability of winning; in the second, the probabilities are unequal. As the figure shows, the entropy of a system is the sum, over all of its events, of each event's probability multiplied by the corresponding amount of information.
(Figure: entropy computed for the two championship probability systems.)
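The same comparison can be reproduced numerically; the four-team probabilities below are made up for illustration:

```python
import math

def entropy(probs):
    """H = -sum_i p(x_i) * log p(x_i): the expected amount of information."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # every team equally likely to win
skewed  = [0.70, 0.20, 0.05, 0.05]   # one clear favourite

print(entropy(uniform))  # ~1.39 -> more uncertainty, higher entropy
print(entropy(skewed))   # ~0.87 -> less uncertainty, lower entropy
```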

KL divergence

If there are two separate probability distributions $P(x)$ and $Q(x)$ for the same random variable $X$, the KL divergence measures how different the two distributions are.
In machine learning, $P(x)$ is usually the true distribution of the samples and $Q(x)$ the distribution predicted by the model. The smaller the KL divergence, the closer $Q(x)$ is to $P(x)$, and by training and updating the parameters $Q(x)$ can be made to approximate $P(x)$ arbitrarily closely. The KL divergence is computed as:

$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)}$$

KL divergence has the following two properties:

First, $D_{KL}(p \| q) \ge 0$, with equality if and only if $p = q$. Second, KL divergence is not symmetric: in general $D_{KL}(p \| q) \ne D_{KL}(q \| p)$.
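A small numerical illustration of these properties (the two distributions are arbitrary examples):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log(p(x_i) / q(x_i))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]   # stand-in for the true distribution
q = [0.5, 0.3, 0.2]   # stand-in for the predicted distribution

print(kl_divergence(p, q))   # ~0.085 > 0
print(kl_divergence(p, p))   # 0.0 when the distributions coincide
print(kl_divergence(q, p))   # ~0.092, differs from D_KL(p || q): KL is asymmetric
```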

Cross entropy

Expanding the KL divergence formula above:

$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i) \log p(x_i) - \sum_{i=1}^{n} p(x_i) \log q(x_i)$$

The first sum is the negative of the information entropy $H(p)$ of the true distribution, and the second term, $-\sum_{i=1}^{n} p(x_i) \log q(x_i)$, is the cross entropy $H(p, q)$. So KL divergence = cross entropy - information entropy, which means the cross entropy equals:

$$H(p, q) = D_{KL}(p \| q) + H(p)$$

In the training phase of the model, the input data and labels are fixed, so the true distribution $P(x)$ is determined and the information entropy $H(p)$ is a constant.
Since the KL divergence measures the gap between the true distribution $P(x)$ and the predicted distribution $Q(x)$, and a smaller value means the two distributions are closer, the training goal becomes minimizing the KL divergence. Because the cross entropy equals the KL divergence plus a constant and its formula is simpler, minimizing the cross entropy is equivalent to minimizing the KL divergence while requiring less computation, so the cross entropy is optimized instead.
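A quick sanity check of the relation cross entropy = information entropy + KL divergence, again with made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution (fixed by the labels)
q = np.array([0.5, 0.3, 0.2])   # predicted distribution (changes during training)

entropy_p     = -np.sum(p * np.log(p))        # H(p): a constant w.r.t. the model
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
kl            =  np.sum(p * np.log(p / q))    # D_KL(p || q)

print(np.isclose(cross_entropy, entropy_p + kl))   # True: H(p, q) = H(p) + D_KL(p || q)
```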

Cross entropy loss function

Cross entropy is a concept from information theory and is defined on probability distributions, so before applying the cross-entropy loss the model's outputs must be converted into probabilities with softmax or sigmoid. When to use softmax and when to use sigmoid depends on the following two situations.
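The practical difference is easy to see on a single set of logits (the values are arbitrary):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])

softmax_probs = torch.softmax(logits, dim=0)   # single-label: classes compete for one target
sigmoid_probs = torch.sigmoid(logits)          # multi-label: each class scored on its own

print(softmax_probs, softmax_probs.sum())      # probabilities sum to 1
print(sigmoid_probs, sigmoid_probs.sum())      # sum is generally not 1
```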

Single-label classification tasks

A single-label classification task is one in which each input sample contains a target of exactly one category, as in classification on ImageNet, CIFAR, or MNIST.
Taking the MNIST handwritten digit classification task as an example, the loss is computed as follows.
Assume a single sample whose true (one-hot) label distribution is:

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

If the raw output (logits) of the network is:

[7, 1, 4, 3, 4, 2, 1, 4, 5, 2]

then passing it through softmax gives the output probability distribution:

[0.7567, 0.0019, 0.0377, 0.0139, 0.0377, 0.0051, 0.0019, 0.0377,
0.1024, 0.0051]

The probabilities output by softmax represent how likely the sample is to belong to each category. The categories are not independent of each other (they are mutually exclusive), so the probabilities of all categories sum to 1.
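This softmax step can be reproduced directly (the logits are the ones listed above):

```python
import numpy as np

logits = np.array([7, 1, 4, 3, 4, 2, 1, 4, 5, 2], dtype=float)

probs = np.exp(logits) / np.sum(np.exp(logits))
print(np.round(probs, 4))   # [0.7567 0.0019 0.0377 0.0139 0.0377 0.0051 0.0019 0.0377 0.1024 0.0051]
print(probs.sum())          # 1.0
```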
For a single sample with a single label, let the true distribution be $y$, the network's output distribution be $\hat{y}$, and the total number of categories be $n$. The cross-entropy loss is:

$$Loss = -\sum_{i=1}^{n} y_i \log \hat{y}_i$$
Therefore, in this example the loss is:

$$Loss = -\left(1 \times \log 0.7567 + 0 \times \log 0.0019 + \dots + 0 \times \log 0.0051\right) = -\log 0.7567 \approx 0.2788$$
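The same number can be checked by hand and, as a cross-check, against PyTorch's built-in loss (which applies softmax internally and takes the raw logits plus the class index):

```python
import numpy as np
import torch
import torch.nn.functional as F

y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
probs  = np.array([0.7567, 0.0019, 0.0377, 0.0139, 0.0377,
                   0.0051, 0.0019, 0.0377, 0.1024, 0.0051])

# Only the true class contributes because the label distribution is one-hot.
print(-np.sum(y_true * np.log(probs)))   # ~0.2788 = -log(0.7567)

logits = torch.tensor([[7., 1., 4., 3., 4., 2., 1., 4., 5., 2.]])
target = torch.tensor([0])               # index of the true class
print(F.cross_entropy(logits, target))   # ~0.2788
```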
Therefore, for a batch of $m$ samples, the cross-entropy loss for a single-label $n$-class classification task is:

$$Loss = -\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} y_{ji} \log \hat{y}_{ji}$$
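In PyTorch this batched form corresponds to `nn.CrossEntropyLoss`, which averages over the batch by default; the batch below is random dummy data:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()    # softmax + cross entropy, averaged over the batch

logits = torch.randn(4, 10)          # dummy batch: 4 samples, 10 classes
labels = torch.tensor([3, 0, 7, 1])  # true class index of each sample

print(criterion(logits, labels))
```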

Multi-label classification task

In a multi-label classification task a sample carries multiple labels, i.e. targets of several different categories; for example, samples in the VOC and COCO datasets contain objects of multiple categories. Unlike single-label classification, which uses softmax, multi-label classification uses the sigmoid function to convert each network output into a probability that the image contains that type of object:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Because the classes are independent of each other, the sum of the network's output probabilities is not necessarily 1.
In a multi-label task, if each class is analyzed separately, its true distribution is a 0-1 distribution, and the model's prediction can be read as the probability that the label is 1. Therefore, for a single multi-label sample, let the true distribution be $y$, the network output distribution be $\hat{y}$, and the total number of categories be $n$. The cross-entropy loss is:
$$Loss = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
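A manual version of this per-sample loss, with a hypothetical 5-class sample in which classes 0 and 3 are present:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_true = np.array([1, 0, 0, 1, 0], dtype=float)   # hypothetical multi-label target
logits = np.array([2.0, -1.0, 0.5, 1.5, -2.0])    # hypothetical network outputs

y_pred = sigmoid(logits)                          # independent probability per class

loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(loss)   # sum of the per-class binary cross entropies
```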
Therefore, for a batch of $m$ samples, the cross-entropy loss for a multi-label $n$-class classification task is:

$$Loss = -\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} \left[ y_{ji} \log \hat{y}_{ji} + (1 - y_{ji}) \log (1 - \hat{y}_{ji}) \right]$$
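In PyTorch the usual counterpart is `nn.BCEWithLogitsLoss`, which fuses the sigmoid with the binary cross entropy; note that by default it averages over every element (batch × classes) rather than summing over classes first, so its scale differs from the formula above by a constant factor:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()     # sigmoid + binary cross entropy

logits  = torch.randn(2, 5)            # dummy batch: 2 samples, 5 classes
targets = torch.tensor([[1., 0., 0., 1., 0.],
                        [0., 1., 1., 0., 0.]])

print(criterion(logits, targets))      # mean over all elements by default
```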

Analysis of the multi-label classification loss function from the perspective of the likelihood function

From the introduction above we know that each neuron's output is converted into a probability by the sigmoid function, representing the probability that the image contains that type of object, and that the predictions for the different categories are independent of each other.
For a single multi-label sample, if each category is analyzed independently, each category is a binary classification problem whose true distribution is a 0-1 (Bernoulli) distribution. Therefore:

$$P(y_i = 1 \mid x) = \hat{y}_i, \qquad P(y_i = 0 \mid x) = 1 - \hat{y}_i$$
Combining the two cases into a single expression:

$$P(y_i \mid x) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1 - y_i}$$
Since the predictions for the different categories are independent, the likelihood of the sample is:

$$L = \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1 - y_i}$$

Taking the logarithm of the likelihood and negating it turns maximizing the likelihood into minimizing the negative log-likelihood, which is exactly the cross-entropy loss of a single sample:

$$-\log L = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
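A small numerical check that the negative log of the Bernoulli likelihood really is the binary cross-entropy sum (the labels and predictions below are made up):

```python
import numpy as np

y     = np.array([1, 0, 1, 0], dtype=float)   # true labels of one sample
y_hat = np.array([0.9, 0.2, 0.7, 0.4])        # predicted P(label = 1) per class

# Likelihood of the sample: prod_i y_hat_i^y_i * (1 - y_hat_i)^(1 - y_i)
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

nll = -np.log(likelihood)                                        # negative log-likelihood
bce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # cross-entropy loss
print(np.isclose(nll, bce))   # True
```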

Origin: blog.csdn.net/hello_dear_you/article/details/128892040