[Machine Learning] Loss Function DLC

1. The concept of loss function

        The loss function (Loss Function) is a formula used to evaluate the gap between the predicted result \hat{y} and the true result y, indicating the direction for model optimization. It is generally written as L(y_i,f(x_i;\theta)) or L(y_i,\hat{y_i})

        Unlike the cost function (Cost Function), which is computed over the entire training set, the loss function usually refers to a single training sample. This can be summarized as: a loss function is a part of a cost function

2. Common loss functions and their detailed explanations

        1. Mean square error loss

                The Mean Squared Error (MSE) loss function, also known as L2 Loss, is generally used for regression tasks

                        J_{MSE}=\frac{1}{N}\sum(y_i-\hat{y_i})^2

                        When using the mean squared error loss, it is implicitly assumed that the error between the model output and the true value follows a Gaussian distribution
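
                        A minimal sketch of the formula above in PyTorch (the tensors y and y_hat are only illustrative; torch.nn.MSELoss computes the same quantity):

import torch

def mse_loss(y_hat, y):
    # J_MSE = (1/N) * sum((y_i - y_hat_i)^2)
    return torch.mean((y - y_hat) ** 2)

y     = torch.tensor([1.0, 2.0, 3.0])   # illustrative values
y_hat = torch.tensor([1.1, 1.9, 3.3])
print(mse_loss(y_hat, y))               # matches torch.nn.functional.mse_loss(y_hat, y)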

        2. Mean absolute error loss

                The Mean Absolute Error (MAE) loss function is also known as L1 Loss

                        J_{MAE}=\frac{1}{N}\sum|y_i-\hat{y_i}|

                        When using the mean absolute error loss, it is implicitly assumed that the error between the model output and the true value follows a Laplace distribution
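
                        Correspondingly, a minimal MAE sketch (same role as torch.nn.L1Loss; the tensor names are illustrative):

import torch

def mae_loss(y_hat, y):
    # J_MAE = (1/N) * sum(|y_i - y_hat_i|)
    return torch.mean(torch.abs(y - y_hat))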

        3. Huber Loss

                Also known as Smooth L1 Loss. The derivative of L1 Loss at 0 is not unique, which may affect convergence; Smooth L1 Loss therefore uses a quadratic function near 0 to make the loss smoother there

                        Smooth\ L_1(x)=\begin{cases}0.5x^2, & |x|<1\\|x|-0.5, & |x|\geq 1\end{cases}
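
                        A short sketch of this piecewise definition (PyTorch also ships torch.nn.SmoothL1Loss, which behaves the same for beta = 1):

import torch

def smooth_l1_loss(y_hat, y):
    x = torch.abs(y - y_hat)
    # quadratic for |x| < 1, linear (|x| - 0.5) otherwise; the two branches meet at |x| = 1
    return torch.mean(torch.where(x < 1, 0.5 * x ** 2, x - 0.5))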

        The difference between MAE and MSE

                ① L2 Loss converges faster than L1 Loss and is generally preferred in most cases

                ② L1 Loss grows more slowly with the error (linearly rather than quadratically), so it is less sensitive to outliers; for bounding-box regression problems (such as Faster RCNN), the gradient changes are smaller and training is less likely to diverge

        4. Cross entropy loss function

                Cross Entropy Loss is generally applied to classification problems and can be divided into the binary and multi-class cases

                4.1 Binary classification

                        For binary classification, the sigmoid function is usually used to compress the model output into (0,1), so the output can be interpreted as a probability. For a given input x_i, the probabilities of it being a positive example and a negative example are:

                                p(y_i=1|x_i)=\hat{y_i}

                                p(y_i=0|x_i)=1-\hat{y_i}

                        Combining these two formulas gives: p(y_i|x_i)=(\hat{y_i})^{y_i}(1-\hat{y_i})^{1-y_i}

                        Assuming that the data points are independent of each other, the likelihood can be expressed as: L(x,y)=\prod (\hat{y_i})^{y_i}(1-\hat{y_i})^{1-y_i}

                        Taking the logarithm of the likelihood and negating it (so that we minimize the negative log-likelihood) yields the cross-entropy loss function

                                J_{CE}=-\sum_i \left[y_i log(\hat{y_i})+(1-y_i)log(1-\hat{y_i})\right]
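
                        A minimal sketch of this formula (torch.nn.BCELoss computes the same quantity; the clamp only guards against log(0)):

import torch

def binary_cross_entropy(y_hat, y, eps=1e-7):
    # y_hat: sigmoid outputs in (0,1); y: labels in {0, 1}
    y_hat = torch.clamp(y_hat, eps, 1 - eps)   # avoid log(0)
    return -torch.mean(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))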

                4.2 Multi-class classification

                        The idea for multi-class classification is similar to the binary case: the true value y_i is a one-hot vector, and the compression function is changed to softmax, which compresses every dimension of the output into (0,1) with the outputs summing to 1. The likelihood can then be expressed as: p(y_i|x_i)=\prod_k (\hat{y}^k_i)^{y_i^k}

                         Taking the logarithm of the likelihood and negating it to minimize the negative log-likelihood yields the cross-entropy loss function

                                J_{CE}=-\sum_i y_i^{c_i} log(\hat{y}_i^{c_i}), where c_i is the true class of sample i
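
                        A minimal sketch, assuming the labels arrive as one-hot vectors and the network outputs raw logits (torch.nn.CrossEntropyLoss fuses the softmax and the log for class-index targets):

import torch

def softmax_cross_entropy(logits, y_onehot, eps=1e-7):
    # softmax compresses each row into (0,1) with the entries summing to 1
    y_hat = torch.softmax(logits, dim=-1)
    # the one-hot vector keeps only the log-probability of the true class c_i
    return -torch.mean(torch.sum(y_onehot * torch.log(y_hat + eps), dim=-1))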

                4.3 Focal Loss

                        Focal Loss is based on the cross-entropy loss function and is used to solve the following problems in the traditional cross-entropy loss function:

                                ① Too many negative samples (negative examples) cause the loss of the positive samples (positive examples) to be overwhelmed

                                ② Too many easy samples (easy examples) dominate the convergence direction of a batch

                        Focal Loss can be expressed as:

                                        FL(p_t)=-\alpha_t(1-p_t)^\gamma log(p_t)

                                where \alpha_t and \gamma are used to address the imbalance between positive and negative samples and the imbalance between hard and easy samples, respectively

                        Taking binary classification as an example, p_t expands to

                                        p_t=\begin{cases}p, & y=1\\1-p, & otherwise\end{cases}

                                4.3.1 α

                                        Used to address the imbalance between positive and negative samples; it assigns different weights to positive and negative samples, with α ∈ [0,1]

                                        \alpha_t=\begin{cases}\alpha, & y=1\\1-\alpha, & otherwise\end{cases}

                                        The value of α usually needs to be tuned experimentally (0.25 in the Focal Loss / RetinaNet paper)

                                4.3.2 γ

                                        Used to address the imbalance between hard and easy samples; each sample's loss is multiplied by (1-p_t)^\gamma. Since the score p_t of an easy sample is generally close to 1, its (1-p_t)^\gamma factor is small, which suppresses the weight of easy samples
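
                        Combining \alpha_t and the modulating factor above, a minimal binary focal-loss sketch (α = 0.25 and γ = 2 are commonly cited defaults; the function and tensor names are illustrative):

import torch

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # p: sigmoid scores in (0,1); y: labels in {0, 1}
    p       = torch.clamp(p, eps, 1 - eps)
    p_t     = torch.where(y == 1, p, 1 - p)              # p_t as defined above
    alpha_t = torch.where(y == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha)) # positive/negative weighting
    return -torch.mean(alpha_t * (1 - p_t) ** gamma * torch.log(p_t))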

3. Implementation of Focal Loss

        Taking YOLO V4 as an example, its loss function consists of three parts: loc (regression loss), conf (object-confidence loss), and cls (classification loss); the object-confidence loss is the one that must distinguish positive from negative samples, and it can be handled in the following way.

        ① Extract the probability p

conf = torch.sigmoid(prediction[..., 4])

        ② Balance positive and negative samples with the parameter α

torch.where(obj_mask, torch.ones_like(conf) * self.alpha, torch.ones_like(conf) * (1 - self.alpha))

        ③ Balance hard and easy samples with the parameter γ

torch.where(obj_mask, torch.ones_like(conf) - conf, conf) ** self.gamma

        ④ Multiply the weights back into the cross-entropy loss

ratio       = torch.where(obj_mask, torch.ones_like(conf) * self.alpha, torch.ones_like(conf) * (1 - self.alpha)) * torch.where(obj_mask, torch.ones_like(conf) - conf, conf) ** self.gamma
loss_conf   = torch.mean((self.BCELoss(conf, obj_mask.type_as(conf)) * ratio)[noobj_mask.bool() | obj_mask])
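
        Putting steps ① to ④ together, a self-contained sketch of the confidence branch (assumptions: obj_mask and noobj_mask are boolean tensors marking positive anchors and the negatives kept for the loss, and self.BCELoss in the snippet above is an element-wise BCE, replaced here by torch.nn.functional.binary_cross_entropy with reduction='none'):

import torch
import torch.nn.functional as F

def focal_conf_loss(prediction, obj_mask, noobj_mask, alpha=0.25, gamma=2.0):
    conf    = torch.sigmoid(prediction[..., 4])                                        # step 1: probability p
    alpha_t = torch.where(obj_mask, torch.ones_like(conf) * alpha,
                          torch.ones_like(conf) * (1 - alpha))                         # step 2: alpha_t for pos/neg
    mod     = torch.where(obj_mask, torch.ones_like(conf) - conf, conf) ** gamma       # step 3: (1 - p_t)^gamma
    bce     = F.binary_cross_entropy(conf, obj_mask.type_as(conf), reduction='none')   # step 4: element-wise BCE
    return torch.mean((bce * alpha_t * mod)[noobj_mask.bool() | obj_mask.bool()])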
