Deep learning - loss function (loss)

Every learning algorithm in deep learning needs a function to minimize or maximize, called the loss function, also known as the "objective function" or "cost function". The loss function is a measure of how well the model performs. The most commonly used method for finding the minimum of such a function is gradient descent (see: detailed explanation of gradient descent).

(For example: full-batch gradient descent (Batch GD), stochastic gradient descent (SGD), mini-batch gradient descent (mini-batch GD), Adagrad, Adadelta, Adam, etc.). The loss function is like a rolling hill, and gradient descent is like sliding down the hill to its lowest point; a minimal sketch of this idea is given below.
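To make the "sliding down the hill" picture concrete, here is a minimal sketch (my own illustration, not from the original article) of plain gradient descent on a one-dimensional quadratic loss; the learning rate lr and the starting point x are arbitrary choices:

#Minimal gradient descent sketch (illustrative only)
def loss(x):
    return (x - 3.0) ** 2        #a simple "hill" with its minimum at x = 3

def grad(x):
    return 2.0 * (x - 3.0)       #derivative of the loss

x = 0.0                          #arbitrary starting point
lr = 0.1                         #arbitrary learning rate
for step in range(50):
    x = x - lr * grad(x)         #slide a small step downhill
print(x)                         #close to 3.0, the bottom of the hill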

Obviously, no single loss function is suitable for every task. The choice of loss function depends on many factors, including how outliers should be handled, the deep learning algorithm being used, the time efficiency of gradient descent, and so on. The purpose of this article is to introduce common loss functions and the rationale behind them.

Loss function (Loss_function):

In earlier articles we studied the basic machine learning models (classification, regression, clustering, dimensionality reduction): introduction to basic machine learning models. In deep learning, loss functions can be broadly divided into two categories: classification losses and regression losses. Classification losses can be further divided into binary classification losses and multi-class classification losses according to the number of categories. Note that a regression model predicts a number, while a classification model predicts a label.

1. Mean square error loss (MSE)

Mean Squared Error (MSE) loss is the most commonly used loss function for regression tasks in machine learning and deep learning. Intuitively, its minimum value is 0 (when the prediction equals the true value) and it is unbounded above. MSE is based on the squared Euclidean distance between the predicted value and the actual value: the closer the prediction is to the true value, the smaller the mean squared error. The MSE loss is often used in linear regression, i.e. function fitting.

The mean squared error loss is the mean of the squared errors between the predicted values and the true values at each point, and its formula is:

MSE=\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_{i}-y_{i})^{2}

N: the number of samples; \hat{y}_{i}: the true value of the i-th sample; y_{i}: the model output for the i-th sample.

MSE code implementation:

import torch.nn as nn
import torch
import random
#MSE loss parameters
# loss_fun=nn.MSELoss(size_average=None, reduce=None, reduction='mean')
input=torch.randn(10)#define the model output (a random tensor of 10 values); it can be thought of as a score distribution
#tensor([-0.0712,  1.9697,  1.4352, -1.3250, -1.1089, -0.5237,  0.2443, -0.8244,0.2344,  2.0047])
print(input)
target= torch.zeros(10)#define the label
target[random.randrange(10)]=1#one-hot encoding
#tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])
print(target)
loss_fun= nn.MSELoss()
output = loss_fun(input, target)#(output, label)
print(output)#loss:tensor(0.8843)

#=============computing MSE without nn.MSELoss===================
result=(input-target)**2
result =sum(result)/len(result)#sum, then average
print(result)#tensor(0.8843)
#the result is the same as above

Note: If you want to use mean squared error as the loss function in a multi-class classification task, the labels must first be converted into one-hot form (one-hot encoding); this is exactly the opposite of the cross-entropy loss function below, which takes plain class indices. Mean squared error should also not be combined with a sigmoid output: the farther the sigmoid's input is from the origin, the closer its derivative is to 0, so as the output approaches 1 the gradient shrinks and eventually vanishes. A small sketch of the one-hot conversion is shown below.
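As a minimal sketch (my own addition, using torch.nn.functional.one_hot from PyTorch), here is how integer class labels could be turned into one-hot targets before applying MSE in a multi-class setting; the tensor values are arbitrary:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                          #4 samples, 3 classes (arbitrary scores)
labels = torch.tensor([1, 0, 2, 1])                 #integer class indices
one_hot = F.one_hot(labels, num_classes=3).float()  #convert to one-hot, e.g. [0., 1., 0.]
loss = nn.MSELoss()(logits, one_hot)                #MSE now compares tensors of the same shape
print(one_hot)
print(loss)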

2. Cross entropy loss (Cross Entropy)

I have already covered the origin of the cross-entropy loss function in detail in this article: Introduction to Deep Learning - from the perspective of probability and information (postscript to mathematics).

We can now understand it this way: in machine learning, the difference between the true label distribution and the predicted distribution can be measured by the KL divergence. Since the first term of the KL divergence (the entropy of the true distribution) is a fixed value, only the cross-entropy term matters during optimization. This is why most machine learning algorithms choose cross entropy as the loss function.

According to the formula of cross entropy:

H(p,q)=-\sum _{i=1}^{n}p(x_{i})\log q(x_{i})

p(x_{i}) represents the true label distribution. In the true (one-hot) label, every category other than the correct one has probability 0, so the formula can be abbreviated as:

H(p,q)=-\log q(x_{class})

where x_{class} denotes the target class (the label).
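As a quick numerical check of the claim that minimizing KL divergence is equivalent to minimizing cross entropy (my own sketch, not part of the original article; the distributions p and q are arbitrary):

import torch

p = torch.tensor([0.7, 0.2, 0.1])              #"true" distribution (arbitrary)
q = torch.tensor([0.5, 0.3, 0.2])              #predicted distribution (arbitrary)

entropy_p = -(p * torch.log(p)).sum()          #H(p): fixed, does not depend on q
cross_entropy = -(p * torch.log(q)).sum()      #H(p, q)
kl = (p * torch.log(p / q)).sum()              #D_KL(p || q)

print(kl, cross_entropy - entropy_p)           #the two values match: KL = H(p,q) - H(p)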

Cross Entropy code implementation:

#Cross entropy
import torch
import torch.nn as nn
#Cross-entropy loss parameters
# nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
#Assume a batch size of 4 and 3 classes; the hidden layer output is:
input = torch.tensor([[ 0.8082,  1.3686, -0.6107],#class 1
        [ 1.2787,  0.1579,  0.6178],#class 0
        [-0.6033, -1.1306,  0.0672],#class 2
        [-0.7814,  0.1185, -0.2945]])#class 1
target = torch.tensor([1,0,2,1])#label values (not one-hot; each entry is the class index directly)
loss = nn.CrossEntropyLoss()#define the cross-entropy loss (softmax is applied internally)
output = loss(input, target)
print(output)#tensor(0.6172)

#======computing the loss without nn.CrossEntropyLoss(), to explain how it works======
net_out=nn.Softmax(dim=1)(input)#input was defined above
print(net_out)
out= torch.log(net_out)#take the log of each row (each sample)
"""
tensor([[-1.0964, -0.5360, -2.5153],
        [-0.6111, -1.7319, -1.2720],
        [-1.2657, -1.7930, -0.5952],
        [-1.6266, -0.7267, -1.1397]])
"""
print(out)
#according to the formula, pick the log-probability of the target class for each sample
out = out[torch.arange(4), target]
print(out)
"""
tensor([-0.5360, -0.6111, -0.5952, -0.7267])
"""
#each entry is log q(x_class) for one sample; its negative is that sample's cross-entropy loss
#don't forget to divide by the batch size (4): the final result is the average loss over the whole batch
out_Input=-torch.mean(out)#negate and average over the batch
print(out_Input)#tensor(0.6172)

Note: In addition to torch.nn.CrossEntropyLoss(), there is another cross-entropy function, torch.nn.BCELoss(), which differs from the former: it computes the cross entropy of a binomial (0-1) distribution and expects probabilities as input, which are usually produced by a sigmoid output function.
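A minimal sketch of binary cross entropy (my own addition; the tensor values are arbitrary), applying a sigmoid before nn.BCELoss and checking the result against the formula -[y*log(p) + (1-y)*log(1-p)]:

import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 2.3])          #raw scores for 3 samples (arbitrary)
target = torch.tensor([1.0, 0.0, 1.0])           #binary labels
probs = torch.sigmoid(logits)                    #nn.BCELoss expects probabilities in [0, 1]

loss = nn.BCELoss()(probs, target)
manual = -(target * torch.log(probs) + (1 - target) * torch.log(1 - probs)).mean()
print(loss, manual)                              #the two values match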


Summary:

Due to my limited ability at this stage, I cannot give a more detailed comparison of these two functions, and I am not yet familiar with other loss functions; I hope you will forgive me. I will gradually improve this article as I keep learning.

In general, for MSE to produce a linear gradient the output must not pass through an activation function, which essentially limits it to linear regression; so MSE is better suited to regression problems, while CE (cross entropy) is better suited to classification problems. In classification, CE yields a linear gradient with respect to the pre-activation output, which effectively prevents vanishing gradients. When MSE is used as a classification loss, the derivative of the activation function enters the gradient, and repeatedly multiplying numbers between 0 and 1 causes the gradient to vanish; it also makes the error surface rougher, so a local optimum is reached more slowly. A numerical comparison of the two gradients is sketched below.
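As a rough numerical illustration of this point (my own sketch, with an arbitrary saturated score), compare the gradient that reaches the raw score z under sigmoid+MSE versus sigmoid+binary cross entropy when the prediction is badly wrong:

import torch
import torch.nn.functional as F

#a strongly saturated raw score with the wrong sign: the true label is 0
z1 = torch.tensor([6.0], requires_grad=True)
z2 = torch.tensor([6.0], requires_grad=True)
target = torch.tensor([0.0])

mse = (torch.sigmoid(z1) - target).pow(2).mean()          #MSE after a sigmoid
bce = F.binary_cross_entropy(torch.sigmoid(z2), target)   #binary cross entropy

mse.backward()
bce.backward()
print(z1.grad)   #≈ 0.005: sigmoid'(6) ≈ 0.0025 shrinks the update (vanishing gradient)
print(z2.grad)   #≈ 0.998: the error signal sigmoid(z) - target passes through undamped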
