Summary and Derivation of Loss Functions

1. Concept distinction

Loss Function

Loss functions are commonly divided into empirical risk loss functions and structural risk loss functions. The empirical risk loss function measures the difference between the predicted result and the actual result, usually for a single training sample: given a model output ŷ and a true value y, the loss function outputs a real-valued loss based on the difference y - ŷ. It evaluates how far the model's prediction is from the true value; in general, a better-suited loss function leads to better model performance, and different models typically use different loss functions. The structural risk loss function is the empirical risk loss function plus a regularization term.

Cost Function

The Cost Function is usually the total loss for the entire training set (or a mini-batch when using mini-batch gradient descent).

Objective Function

Objective Function is a more general term for any function you wish to optimize; it is used in both machine learning and non-machine-learning fields (such as optimization in operations research).

That is, the loss function and the cost function differ only in whether they are taken over a single sample or over the whole sample set.

2. Common loss functions for regression

2.1 Mean Squared Error Loss (MSE)

  • Mean Squared Error (MSE) loss is the most commonly used loss function in machine learning and deep learning regression tasks, also known as L2 Loss. Its basic form is as follows

    $$J_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y_i})^2$$

  • Principle derivation

    In fact, under certain assumptions we can derive the form of the mean squared error loss from maximum likelihood estimation. Assume that the error between the model prediction and the true value follows a standard Gaussian distribution (zero mean, unit variance). Then, given an input $x_i$, the probability of the model outputting the true value $y_i$ is
    $$p(y_i|x_i) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y_i-\hat{y_i})^2}{2}\right)$$
    Further, assume that the N samples in the data set are independent of each other. Then, given all inputs X, the probability of observing all the true values y, i.e. the likelihood function, is the product of all $p(y_i|x_i)$:
    $$L(x, y) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y_i-\hat{y_i})^2}{2}\right)$$

    For convenience of calculation, we usually maximize the log-likelihood instead (taking the logarithm turns the product into a sum, which is easier to work with):
    $$LL(x, y) = \log L(x, y) = -\frac{N}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y_i})^2$$

    Dropping the first term, which does not depend on $\hat{y_i}$, and converting the maximization of the log-likelihood into the minimization of the negative log-likelihood (which can then be minimized with methods such as gradient descent), we obtain
    $$NLL(x, y) = \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y_i})^2$$

    You can see that this is, up to a constant factor, exactly the form of the mean squared error loss. In other words, under the assumption that the error between the model output and the true value follows a Gaussian distribution, minimizing the mean squared error loss is essentially equivalent to maximum likelihood estimation. So in scenarios where this assumption is reasonable (such as regression), the mean squared loss is a good choice of loss function; when the assumption does not hold (such as classification), it is not a good choice.
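As a quick illustration, here is a minimal NumPy sketch of the MSE loss (the function name and toy arrays are only for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # mean of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
```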

2.2 Mean Absolute Error Loss (MAE)

  • Mean Absolute Error (MAE) is another commonly used loss function, also known as L1 Loss. Its basic form is as follows
    $$J_{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y_i}|$$
    The minimum value of the MAE loss is 0 (when the prediction equals the true value), and it is unbounded above. The MAE loss grows linearly as the absolute error between the prediction and the true value increases.

  • Principle derivation

Similarly, we can obtain the form of the MAE loss via maximum likelihood under certain assumptions. Assume that the error between the model prediction and the true value follows a Laplace distribution (with location 0 and scale 1). Then, given an input $x_i$, the probability of the model outputting the true value $y_i$ is
$$p(y_i|x_i) = \frac{1}{2}\exp\left(-|y_i - \hat{y_i}|\right)$$
Following the same steps as in the MSE derivation above, the negative log-likelihood turns out to be (up to a constant) exactly the form of the MAE loss:
$$NLL(x, y) = \sum_{i=1}^{N}|y_i - \hat{y_i}|$$
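A matching NumPy sketch of the MAE loss, mirroring the MSE example above (names are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    # mean of the absolute differences between targets and predictions
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y_true, y_pred))  # 0.5
```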

Comparison of MAE and MSE

1. MSE generally converges faster than MAE. The gradient of the MSE loss scales with the error, so it is large when the prediction is far from the target and shrinks as the prediction approaches it, which helps gradient descent converge. The gradient of the MAE loss has constant magnitude 1 regardless of how small the error is, which makes precise convergence harder without a decaying learning rate.

2. MAE is more robust to outliers.
We can understand this from two perspectives:

The first perspective is intuitive: the MAE loss is linear in the absolute error, while the MSE loss is quadratic in it, so when the error is very large the MSE loss is much larger than the MAE loss. Therefore, when the data contains an outlier with a very large error, MSE produces a very large loss for it, which has a much stronger influence on training the model.


The second perspective comes from the distributional assumptions behind the two losses: MSE assumes the error follows a Gaussian distribution, while MAE assumes it follows a Laplace distribution. The Laplace distribution has heavier tails and is itself more robust to outliers; when outliers appear, its fit is pulled far less than the Gaussian's. Therefore MAE, which corresponds to the Laplace assumption, is more robust to outliers than MSE, which corresponds to the Gaussian assumption.

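To make the outlier argument concrete, here is a small sketch (illustrative values) of how a single outlier inflates MSE far more than MAE:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 4.0])    # the model ignores the outlier

print(np.mean((y_true - y_pred) ** 2))   # ~1843.2, dominated by the outlier
print(np.mean(np.abs(y_true - y_pred)))  # ~19.3, grows only linearly
```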

2.3 Huber Loss

Above we introduced the MSE and MAE losses and their respective pros and cons: MSE converges quickly but is easily affected by outliers, while MAE is more robust to outliers but converges slowly. Huber Loss, also known as Smooth Mean Absolute Error Loss, combines the advantages of both. The idea is simple: use MSE when the error is close to 0 and MAE when the error is large. For a single sample the loss is
$$L_{\delta}(y_i, \hat{y_i}) = \begin{cases}\frac{1}{2}(y_i - \hat{y_i})^2, & |y_i - \hat{y_i}| \le \delta \\ \delta\,|y_i - \hat{y_i}| - \frac{1}{2}\delta^2, & |y_i - \hat{y_i}| > \delta\end{cases}$$
and the total loss sums (or averages) this over all samples.

  • Features of Huber Loss

Huber Loss combines the MSE and MAE losses. When the error is close to 0, the MSE branch is used, which keeps the loss differentiable and the gradient more stable; when the error is large, the MAE branch is used, which reduces the influence of outliers and makes training more robust to them. The disadvantage is that an additional hyperparameter δ needs to be set.

  • Huber Loss python implementation
```python
import numpy as np

# Huber loss: quadratic branch when |error| < delta, linear branch otherwise
def huber(true, pred, delta):
    loss = np.where(np.abs(true - pred) < delta,
                    0.5 * (true - pred) ** 2,
                    delta * np.abs(true - pred) - 0.5 * delta ** 2)
    return np.sum(loss)
```
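A quick usage check, continuing from the definition above (values are illustrative):

```python
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 6.0])
print(huber(y_true, y_pred, delta=1.0))  # 0.125 + 0.0 + 2.5 = 2.625
```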

3. Common loss functions for classification

3.1 Cross-entropy loss

3.1.1 Binary classification

  • Consider binary classification. In binary classification we usually use the Sigmoid function to compress the model output into the (0, 1) interval, which represents the probability that the given input belongs to the positive class. The binary cross-entropy loss can be visualized as two curves: the loss as a function of the model output when the target value is 0, and the loss when the target value is 1. In both cases the loss is small near the target value and grows rapidly, without bound, as the prediction moves toward the wrong extreme.
    (Figure: binary cross-entropy loss versus model output; blue curve for target 0, yellow curve for target 1.)

  • Principle derivation

Since there are only two classes, knowing the probability of the positive class immediately gives the probability of the negative class, and we can assume that the labels follow a Bernoulli distribution (0-1 distribution).

$$p(y_i=1|x_i) = \hat{y_i}, \qquad p(y_i=0|x_i) = 1-\hat{y_i}$$
Combine the two expressions into one
$$p(y_i|x_i) = (\hat{y_i})^{y_i}(1-\hat{y_i})^{1-y_i}$$

Assuming that the data points are independent and identically distributed, the likelihood function, i.e. the joint probability of observing all the samples, can be expressed as

$$L(x, y)=\prod_{i=1}^N(\hat{y_i})^{y_i}(1-\hat{y_i})^{1-y_i}$$
Taking the logarithm of the likelihood and adding a negative sign, so that maximizing the likelihood becomes minimizing the negative log-likelihood, gives exactly the form of the binary cross-entropy loss:
$$NLL(x, y) = J_{CE} = -\sum_{i=1}^N \Big(y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i})\Big)$$
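A minimal NumPy sketch of this binary cross-entropy loss (the function name and the clipping constant used to avoid log(0) are illustrative choices):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # clip predictions away from 0 and 1 to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # negative log-likelihood under the Bernoulli assumption, summed over samples
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_true, y_prob))  # ~1.20
```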

3.1.2 Multi-class classification

In the multi-class task, the derivation of the cross-entropy loss function follows the same idea as in binary classification, with two changes:

  1. The true value y is now a one-hot vector; each entry represents the probability of the corresponding class, and the entries sum to 1

  2. The activation function of the model output is changed from the original Sigmoid function to the Softmax function

The Softmax function limits the output range of each dimension between 0 and 1, and the sum of the outputs of all dimensions is 1, which is used to represent a probability distribution.
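For concreteness, the standard Softmax over K logits $z_1, \dots, z_K$ is
$$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K$$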

Example: consider a 5-class classification problem in which a sample i has label y=[0,0,0,1,0], meaning that the true class of sample i is class 4 (classes numbered 1 to 5).

Suppose the probability distribution predicted by the model (the Softmax output) is p=[0.1, 0.15, 0.05, 0.6, 0.1]. The prediction is correct, and the corresponding loss is L = -log(0.6) ≈ 0.51: when this sample passes through the network with these parameters and produces this prediction p, its loss is -log(0.6).

Now suppose p=[0.15, 0.2, 0.4, 0.1, 0.15]. This prediction is badly wrong: the true class is 4, yet the model assigns that class a probability of only 0.1, much less than the other classes (at test time the model would predict class 3). The corresponding loss is L = -log(0.1) ≈ 2.30.

Finally, suppose p=[0.05, 0.15, 0.4, 0.3, 0.1]. This prediction is still wrong, but not as badly as the previous one, and the corresponding loss is L = -log(0.3) ≈ 1.20. Since log is an increasing function and is negative for inputs less than 1, we have -log(0.6) < -log(0.3) < -log(0.1). Simply put, a wrong prediction costs more than a correct one, and a wildly wrong prediction costs more than a slightly wrong one.
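The three cases above can be checked with a few lines of NumPy (using the same class-4 target, i.e. index 3 when 0-indexed):

```python
import numpy as np

target_index = 3  # class 4 in the example, 0-indexed
for p in ([0.1, 0.15, 0.05, 0.6, 0.1],
          [0.15, 0.2, 0.4, 0.1, 0.15],
          [0.05, 0.15, 0.4, 0.3, 0.1]):
    print(-np.log(p[target_index]))  # 0.51..., 2.30..., 1.20...
```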

  • Principle derivation

As noted above, the Softmax function limits each output dimension to the range 0-1 with all dimensions summing to 1, so the output represents a probability distribution. The probability of the true class of sample i can then be expressed as

$$p(y_i|x_i) = \prod_{k=1}^K(\hat{y_i}^k)^{y_i^k}$$
For example, if the model output is [0.1, 0.15, 0.05, 0.6, 0.1] and the true one-hot label is [0, 0, 0, 1, 0], then p works out to exactly 0.6, since every factor whose exponent is 0 equals 1.

Here k indexes the K categories. Under the same assumption that the data points are independent and identically distributed, the negative log-likelihood is
$$NLL(x, y) = J_{CE} = -\sum_{i=1}^N\sum_{k=1}^K y_i^k\log(\hat{y_i}^k)$$

Since $y_i$ is a one-hot vector, it equals 1 for the target class and 0 for all other classes, so the above formula can also be written as

$$J_{CE} = -\sum_{i=1}^N y_i^{c_i}\log(\hat{y_i}^{c_i})$$

where $c_i$ is the target class of sample $x_i$. This cross-entropy loss applied to multi-class classification is also commonly called Softmax Loss or Categorical Cross-Entropy Loss.
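A minimal NumPy sketch combining Softmax with the categorical cross-entropy above (the function names, clipping constant, and max-subtraction for numerical stability are implementation choices):

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; it does not change the result
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # -sum_k y_k * log(p_k), summed over samples
    return -np.sum(y_onehot * np.log(np.clip(probs, eps, 1.0)))

logits = np.array([[1.0, 2.0, 0.5, 3.0, 1.5]])   # illustrative logits
y_onehot = np.array([[0, 0, 0, 1, 0]])           # true class is the 4th one
print(categorical_cross_entropy(y_onehot, softmax(logits)))
```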

Why use cross entropy for classification?

Why not use the mean squared error loss for classification? As mentioned when introducing the MSE loss, it implicitly assumes that the error follows a Gaussian distribution, an assumption that cannot be satisfied in classification tasks, so it tends to perform poorly there. Why the cross-entropy loss, then? There are two perspectives that explain it.

One perspective is maximum likelihood, which is the derivation given above; the other is information theory, which can also be used to explain the cross-entropy loss.

The following explains it from the perspective of information theory. Let p denote the true label distribution (the one-hot vector) and q denote the distribution predicted by the model. The KL divergence measures how different the two distributions are:
$$D_{KL}(p\,\|\,q) = \sum_{k=1}^K p_k\log\frac{p_k}{q_k} = \sum_{k=1}^K p_k\log p_k - \sum_{k=1}^K p_k\log q_k = -H(p) + H(p, q)$$
Since the entropy H(p) of the true distribution is fixed (and equal to 0 for one-hot labels), minimizing the KL divergence between the labels and the predictions is equivalent to minimizing the cross entropy
$$H(p, q) = -\sum_{k=1}^K p_k\log q_k$$
which is exactly the loss form derived above.

It can be seen that the result derived from minimizing the cross-entropy is consistent with the result obtained by maximizing the likelihood.

Summary

  • The connection and difference between the cross-entropy function and the maximum likelihood function

Difference: the cross-entropy function describes the gap between the model's predicted values and the actual values; the larger it is, the less similar they are. The likelihood function measures, for a given set of parameters, the probability of observing the data; the larger it is, the closer the estimate is to the real situation.

Connection: the cross-entropy function can be derived from the maximum likelihood function under a Bernoulli distribution; equivalently, minimizing the cross-entropy function is essentially maximizing the log-likelihood function.

  • The cross-entropy loss in classification is the negative log-likelihood when the labels follow a Bernoulli distribution; the mean squared loss is the negative log-likelihood of the data under a Gaussian error distribution; and the absolute value (MAE) loss is the negative log-likelihood under a Laplace error distribution.

  • The loss function usually also includes regularization terms (L1/L2). As part of the loss, they reduce the complexity of the model by constraining the magnitude of the parameters or encouraging parameter sparsity, which helps prevent overfitting; see the sketch below.
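For example, the structural risk mentioned in Section 1 can be written as the empirical loss plus a weighted penalty (here λ is the regularization weight and Ω is a generic penalty, e.g. the L1 norm or the squared L2 norm of the parameters):
$$J_{total}(\theta) = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat{y_i}) + \lambda\,\Omega(\theta), \qquad \Omega(\theta) = \|\theta\|_1 \ \text{or} \ \|\theta\|_2^2$$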


Origin blog.csdn.net/weixin_41744192/article/details/119813803