[Deep Learning] Loss functions (mean absolute error, mean square error, smooth L1 loss, cross entropy, cross entropy with weights, Dice loss, Focal loss)

Table of contents

1. Regression Loss

Mean Absolute Error (MAE)

Mean Square Error (MSE)

Smooth L1 Loss

2. Classification loss

Cross Entropy

Cross Entropy with Weights

Dice Loss

Focal Loss


The loss function measures the degree of deviation between the model's predictions and the ground truth. We usually minimize this objective, most commonly with gradient descent. Every loss function has two sides, however: there is no universal loss function that applies to all machine learning tasks, so we need to know the advantages and limitations of each one in order to use it well on practical problems. Loss functions can be roughly divided into two types: regression losses (for continuous variables) and classification losses (for discrete variables).

1. Regression Loss

Figure: comparison of the L1 (MAE), L2 (MSE), and Smooth L1 loss curves.

Mean Absolute Error (MAE)

MAE corresponds to the L1 curve in the figure above.

 

MAE=\frac{1}{n}\sum_{i=1}^{n}|f(x_{i})-y_{i}|

Concept: The mean absolute error (MAE), also called the L1 loss, measures the average absolute distance between the predicted value and the true value; its range is 0 to positive infinity.

Advantages: Relatively robust to outliers; abnormal points cannot dominate the direction of the gradient update.

Disadvantages: The derivative is a constant and is discontinuous at 0, which makes the solution inefficient and convergence slow; moreover, for small loss values the gradient is just as large as for any other loss value, which is not conducive to network learning.
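To make this concrete, here is a minimal sketch, assuming PyTorch (the tensor values are made up for illustration), that computes MAE directly from the definition and checks it against the built-in nn.L1Loss:

```python
import torch
import torch.nn as nn

# Toy predictions and targets (made-up values, for illustration only).
pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

# MAE computed directly from the definition: mean of |f(x_i) - y_i|.
mae_manual = (pred - target).abs().mean()

# The same quantity via PyTorch's built-in L1 loss.
mae_builtin = nn.L1Loss()(pred, target)

print(mae_manual.item(), mae_builtin.item())  # both print 0.5
```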

Mean Square Error (MSE)

MSE corresponds to the L2 curve in the figure above.

MSE=\frac{1}{n}\sum_{i=1}^{n}[f(x_{i})-y_{i}]^{2}

Concept: The mean squared error (MSE), also called the L2 loss, measures the average of the squared distances between the predicted value and the true value; its range is likewise 0 to positive infinity.

Advantages: Fast convergence; the gradient scales with the error, so each sample is given an appropriate penalty weight instead of being treated "equally", and the direction of the gradient update is more accurate. The function is also smooth and differentiable everywhere.

Disadvantages: Very sensitive to outliers; because errors are squared, the direction of the gradient update is easily dominated by abnormal points, so it is not robust.
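A similar sketch (again assuming PyTorch, with made-up values) shows the MSE computation and how a single outlier can dominate the loss:

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
target = torch.tensor([3.0, -0.5, 2.0, 7.0])

# MSE from the definition: mean of (f(x_i) - y_i)^2.
mse_manual = ((pred - target) ** 2).mean()
mse_builtin = nn.MSELoss()(pred, target)
print(mse_manual.item(), mse_builtin.item())  # both print 0.375

# A single outlier dominates the loss (and hence the gradient direction).
target_outlier = torch.tensor([3.0, -0.5, 2.0, 17.0])
print(nn.MSELoss()(pred, target_outlier).item())  # 20.375
```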

Smooth L1 Loss

Smooth L1 corresponds to the red curve in the figure above.

That is, the smooth L1 loss (SLL), from Fast R-CNN [7]. SLL combines the advantages of the L1 and L2 losses: near 0 it uses the quadratic form of the L2 loss, which fixes the non-differentiability of the L1 loss at 0 and makes the function smoother and easier to converge; on the interval |x| > 1 it switches to the linear form of the L1 loss, so that, unlike L2, large errors do not produce excessively large gradients.

Smooth_{L1}(x)=\begin{cases} 0.5x^{2} & \text{ if } |x|<1\\ |x|-0.5& \text{ otherwise} \end{cases}

Taking the derivatives of these three loss functions, we find that the derivative of the L1 loss is a constant: if the learning rate is not adjusted in time, then once the loss becomes small the model finds it hard to converge to higher accuracy and instead tends to fluctuate around a fixed value. Conversely, for the L2 loss, when the error is large at the start of training the derivative is correspondingly large, which makes training unstable. Smooth L1, finally, remains relatively stable when the input value is large in the early stage of training, and its gradient shrinks as the loss approaches convergence in the later stage, which solves both of the previous problems.
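The piecewise definition above is easy to implement directly; the sketch below (assuming PyTorch, with made-up values) also checks it against nn.SmoothL1Loss, whose default setting matches this formula:

```python
import torch
import torch.nn as nn

def smooth_l1(x: torch.Tensor) -> torch.Tensor:
    """Piecewise definition: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

pred = torch.tensor([0.2, -0.4, 3.0, -6.0])
target = torch.zeros(4)

manual = smooth_l1(pred - target).mean()
builtin = nn.SmoothL1Loss()(pred, target)   # default beta = 1.0 matches the formula
print(manual.item(), builtin.item())        # identical values
```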

2. Classification loss

Cross Entropy

The cross-entropy loss measures the similarity between two distributions p and q, where p is the distribution of the true labels (each entry is either 0 or 1) and q is the label distribution predicted by the trained model:

H(p,q)=-\sum_{i=1}^{n}p(x_{i})log(q(x_{i}))

Limitations of cross-entropy: when using the cross-entropy loss, the statistical distribution of the labels plays an important role in training accuracy; the more imbalanced the label distribution, the harder training becomes. In addition, the loss is computed as the average of the individual pixel losses, and each pixel's loss is computed in isolation, without knowing whether its neighbouring pixels are boundaries or not. As a result, the cross-entropy loss only considers the error in a microscopic sense rather than globally, which is insufficient for image-level prediction. Finally, individual samples carry no weight of their own; to treat different categories differently, a weight has to be attached to each class (or pixel), as in the weighted variant below.
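For reference, here is a minimal sketch (assuming PyTorch; shapes and values are made up) that computes H(p, q) from the definition, with one-hot p and q = softmax of the network outputs, and compares it with the built-in cross-entropy, which takes raw logits and class indices:

```python
import torch
import torch.nn.functional as F

# Raw network outputs (logits) for 3 samples and 4 classes (made-up values).
logits = torch.randn(3, 4)
labels = torch.tensor([0, 2, 3])          # ground-truth class indices

# H(p, q) from the definition, with p one-hot and q = softmax(logits).
q = torch.softmax(logits, dim=1)
p = F.one_hot(labels, num_classes=4).float()
ce_manual = -(p * q.log()).sum(dim=1).mean()

# PyTorch's built-in version takes logits and class indices directly.
ce_builtin = F.cross_entropy(logits, labels)
print(ce_manual.item(), ce_builtin.item())  # equal up to float precision
```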

Cross Entropy with Weights

 

E=-\sum_{x}w(x)log(p_{l(x)}(x))

where:

p is the output after softmax processing;

l(x) is the true label of pixel x;

p_{l(x)}(x) is the activation (predicted probability) of the output at pixel x for the class given by its label;

w(x) is the weight assigned to each pixel during training.
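A possible sketch of this pixel-wise weighting (assuming PyTorch; the weight map, shapes, and values are illustrative assumptions) multiplies the unreduced per-pixel cross-entropy by w(x); the built-in weight argument covers the simpler per-class case:

```python
import torch
import torch.nn.functional as F

# Toy segmentation outputs: batch of 1, 3 classes, a 4x4 "image" (made-up shapes).
logits = torch.randn(1, 3, 4, 4)
labels = torch.randint(0, 3, (1, 4, 4))

# Per-pixel weight map w(x), e.g. larger weights for assumed boundary pixels.
w = torch.ones(1, 4, 4)
w[:, 1:3, 1:3] = 5.0

# -w(x) * log(p_{l(x)}(x)), averaged over all pixels.
per_pixel_ce = F.cross_entropy(logits, labels, reduction="none")  # shape (1, 4, 4)
weighted_ce = (w * per_pixel_ce).mean()

# If only per-class (not per-pixel) weights are needed, the built-in
# `weight` argument of cross_entropy does the job directly.
class_weights = torch.tensor([1.0, 2.0, 4.0])
weighted_by_class = F.cross_entropy(logits, labels, weight=class_weights)
print(weighted_ce.item(), weighted_by_class.item())
```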

Dice Loss

Dice loss is a loss function that avoids some of the limitations of the ordinary cross-entropy loss.

The statistical distribution of labels plays an important role in training accuracy when using cross-entropy loss. The more imbalanced the label distribution, the harder it is to train. Although weighted cross-entropy loss can alleviate the difficulty, the improvement is not significant and does not solve the inherent problems of cross-entropy loss.

Dice Loss is the dice loss, from V-Net [3]. It is built on the Dice coefficient, a measure of the similarity between two samples whose value ranges from 0 to 1: the larger the value, the more similar the two samples are. Its basic (binary) definition is as follows:

Dice=\frac{2|X\cap Y|}{|X|+|Y|},\quad DiceLoss=1-\frac{2|X\cap Y|}{|X|+|Y|}

Among them, |X∩Y| denotes the intersection between X and Y, and |X| and |Y| denote the number of pixels in the sets X and Y respectively. The numerator is multiplied by 2 to keep the value in the range 0 to 1, because when the two denominator terms are added, the overlapping region is counted twice.

It can also be seen from the formula that the Dice coefficient is in fact equivalent to the F1 score, so optimizing Dice is equivalent to optimizing the F1 value. In addition, to prevent the denominator from being 0, a small number is usually added to both the numerator and the denominator as a smoothing coefficient, also known as the Laplace smoothing term. The Dice loss has the following main properties:

It is well suited to situations where positive and negative samples are unbalanced, since it focuses on mining the foreground; during training it is prone to oscillation when there are many small targets; and in extreme cases gradient saturation can occur. For these reasons it is generally optimized together with cross-entropy loss or another classification loss.
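Below is a minimal soft Dice loss sketch with a Laplace smoothing term (assuming PyTorch; the dice_loss helper and the shapes/values are illustrative assumptions), combined with a cross-entropy term as suggested above:

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """Soft binary Dice loss: 1 - 2|X∩Y| / (|X| + |Y|), with Laplace smoothing."""
    pred = pred.reshape(pred.shape[0], -1)        # flatten per sample
    target = target.reshape(target.shape[0], -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1 - dice.mean()

# Toy binary segmentation example (made-up shapes and values).
probs = torch.sigmoid(torch.randn(2, 1, 8, 8))   # predicted foreground probabilities
mask = (torch.rand(2, 1, 8, 8) > 0.7).float()    # ground-truth mask

# In practice Dice loss is usually combined with a cross-entropy term.
bce = torch.nn.functional.binary_cross_entropy(probs, mask)
loss = dice_loss(probs, mask) + bce
print(loss.item())
```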

Focal Loss

Focal loss comes from Kaiming He et al.'s "Focal Loss for Dense Object Detection" [4]. Its starting point is the low accuracy of one-stage object detectors such as the YOLO series, which the authors attribute mainly to the class imbalance between samples (for example, foreground versus background). In many input images, when the image is divided into a grid of small windows, most windows contain no target. If the standard cross-entropy loss is used directly, the negative samples make up such a large proportion that they dominate the direction of the gradient, and the network tends to predict the foreground as background. Even the OHEM (Online Hard Example Mining) algorithm, which handles the imbalance by increasing the weight of misclassified samples, easily ignores the samples that are easy to classify. Focal loss instead focuses training on a sparse set of hard samples: it modifies the standard cross-entropy loss directly, introducing two penalty factors that reduce the weight of easy-to-classify samples, so that the model concentrates on hard samples during training. Its basic definition is as follows:

FL(p_{t})=-\alpha_{t}(1-p_{t})^{\gamma}log(p_{t})

where p_{t} denotes the model's estimated probability for the ground-truth class, and:

The parameters α and (1 − α) control the weights of the positive and negative samples respectively, with α in the range [0, 1]; its value is generally chosen through cross-validation. The parameter γ is called the focusing parameter, with range [0, +∞); its purpose is to reduce the weight of easy-to-classify samples so that the model focuses more on hard samples during training. When γ = 0, Focal Loss degenerates into the cross-entropy loss; the larger γ is, the more strongly easy-to-classify samples are down-weighted.

In their experiments the authors found α = 0.25 and γ = 2 to work best, although the values need to be adjusted for each task. Note that applying Focal Loss also introduces two extra hyperparameters to tune, and tuning them well generally takes some experience.
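Here is a possible binary focal-loss sketch (assuming PyTorch; the focal_loss helper and the toy data are illustrative assumptions, not the paper's reference code), following the definition -α_t(1 − p_t)^γ log(p_t):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class.
    p_t = torch.where(target == 1, p, 1 - p)
    alpha_t = torch.where(target == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # binary_cross_entropy_with_logits with reduction="none" equals -log(p_t).
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy detection-style scores and labels (made-up, heavily imbalanced: ~5% positives).
logits = torch.randn(1000)
target = (torch.rand(1000) > 0.95).float()

print(focal_loss(logits, target).item())
# With gamma = 0 and alpha = 0.5 this reduces to half the usual cross-entropy.
```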


Origin blog.csdn.net/weixin_51781852/article/details/125732964