[CV] Common loss functions and application examples: cross-entropy, contrastive, cosine, Dice, Focal Loss

Foreword

The loss function measures the difference between the model's predictions and the ground truth, so that the model's performance can be evaluated and its parameters adjusted through an optimization algorithm (such as gradient descent) to minimize the loss value, thereby improving prediction accuracy.
Specifically, the loss function is usually used in supervised learning. Given the features and label of a sample, the model predicts a label from the features, and the prediction is compared with the ground truth to compute the loss value. Optimization consists of continuously adjusting the model parameters to make the loss smaller and smaller. The loss function is therefore a core component of the optimization procedure: it determines the direction and speed of model optimization.
Different loss functions suit different tasks and scenarios, such as mean squared error for regression problems, cross-entropy loss for classification problems, contrastive loss for similarity measurement problems, and so on. Choosing an appropriate loss function is therefore crucial for both model training and performance.

Mean Squared Error (MSE)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$
where $y_i$ is the true value, $\hat{y_i}$ is the predicted value, and $n$ is the number of samples.

It is applicable to regression problems, where the goal is to minimize the squared difference between the predicted value and the true value; it is not directly related to the other loss functions discussed here.
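As a quick sketch (the tensor values below are purely illustrative), MSE can be computed directly from the formula or with PyTorch's built-in nn.MSELoss:

import torch

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5,  0.0, 2.0, 8.0])

mse_manual = ((y_true - y_pred) ** 2).mean()      # direct implementation of the formula
mse_builtin = torch.nn.MSELoss()(y_pred, y_true)  # equivalent built-in version
print(mse_manual.item(), mse_builtin.item())      # both print 0.375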

Cross-Entropy Loss

$$CE = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log\hat{y_{ij}}$$
where $y_{ij}$ is the true label of the $j$-th category for the $i$-th sample, $\hat{y_{ij}}$ is the predicted probability of the $j$-th category for the $i$-th sample, $n$ is the number of samples, and $m$ is the number of categories.

Equivalent to maximum likelihood estimation: Maximizing the likelihood function is equivalent to minimizing the cross-entropy loss function, so the cross-entropy loss function can also be used for maximum likelihood estimation of model parameters.

It is suitable for classification problems, and the goal is to minimize the cross entropy between the predicted value and the real value. It is often used in combination with the Softmax function to calculate the probability distribution of each category. Unlike contrastive loss, cosine similarity loss, and triplet loss, cross-entropy loss does not involve a similarity measure between samples.

Suppose we want to classify a picture of a handwritten digit. The label of the picture is the number 1. We want to train a model to correctly identify this picture. First, we input this picture into the model, and the model will output a vector of length 10, indicating the probability that this picture belongs to each of the 10 numbers.
Suppose the model outputs the vector [0.2, 0.6, 0.05, 0.02, 0.01, 0.01, 0.01, 0.05, 0.01, 0.04]. The largest element, 0.6, corresponds to the digit 2, so the model's top prediction is the digit 2. However, we know that the true label of the picture is the digit 1, to which the model assigns a probability of only 0.2, so we need to quantify the gap between the prediction and the ground truth using the cross-entropy loss function.
For a single sample, the cross-entropy loss is $L_{CE}=-\sum_{i=1}^{n}y_i\log(p_i)$, where $n$ here denotes the number of categories, $y_i$ is the true label (0 or 1) for category $i$, and $p_i$ is the predicted probability that the sample belongs to category $i$. In this example, the true label is the digit 1, so $y_1=1$ and all other $y_i$ are 0. The predicted probability for digit 1 is $p_1=0.2$, so the cross-entropy loss is $L_{CE}=-(1\times\log(0.2)+0\times\log(0.6)+0\times\log(0.05)+\dots+0\times\log(0.04))=-\log(0.2)\approx 1.61$.
We want the gap between the model's predictions and the ground truth to be as small as possible, so we adjust the model parameters with an optimization algorithm (such as gradient descent) to minimize the cross-entropy loss. During training, the cross-entropy losses of individual samples are accumulated to obtain the average loss over the entire training set, which serves as the model's performance indicator. Through continued iteration, the model gradually learns better feature representations and improves its classification accuracy.
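The worked example can be reproduced with a few lines of PyTorch; the position of the true digit in the probability vector is assumed here for illustration only:

import torch

probs = torch.tensor([0.2, 0.6, 0.05, 0.02, 0.01, 0.01, 0.01, 0.05, 0.01, 0.04])
true_idx = 0  # assumed index of the true digit "1" in this probability vector

# With a one-hot label, the cross-entropy reduces to -log(p_true).
ce = -torch.log(probs[true_idx])
print(ce.item())  # ≈ 1.61

# Note: in a real training loop, torch.nn.CrossEntropyLoss expects raw logits
# (it applies log-softmax internally) together with an integer class index.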

Contrastive Loss

$$L=\frac{1}{2n}\sum_{i=1}^{2n}\left[y_{i}d_{i}^{2}+(1-y_{i})\max(margin-d_{i},0)^{2}\right]$$
where $y_i$ indicates whether the $i$-th pair of samples is similar, $d_i$ is the distance between the $i$-th pair of samples, and $margin$ is the margin value, a preset threshold that usually represents the limit of similarity.

It is applicable to similarity measurement problems; the goal is to encourage the distance between similar samples to be as small as possible and the distance between dissimilar samples to be as large as possible. Like the triplet loss, it measures similarity by comparing distances between samples, but the triplet loss computes distances within triplets of samples, while the contrastive loss computes distances within pairs of samples.

Suppose we want to train a face recognition model: given a face picture, the model needs to judge whether it belongs to a certain person. We can feed each face picture into the model, which outputs a vector representing the facial features. If two faces belong to the same person, their feature vectors should be close; if they belong to different people, their feature vectors should be far apart. We can therefore use the contrastive loss to measure the similarity or difference between two feature vectors.

When $y_i=1$, the $i$-th pair of samples shares the same label (the pair is similar), so the active term of the loss is $d_i^2$; when $y_i=0$, the pair has different labels (the pair is dissimilar), so the active term is $\max(m-d_i,0)^2$. By adjusting the margin $m$, we can control the model's sensitivity to similarity.

Specifically, when $m$ is larger, the model is less sensitive to similarity: it requires dissimilar pairs to be pushed farther apart before they stop contributing to the loss. When $m$ is smaller, the model is more sensitive to similarity: a smaller separation is already enough for a pair to be treated as dissimilar. By properly tuning the margin $m$, the model can judge the similarity or difference between two samples more accurately, improving its performance.

During training, the contrastive losses of individual samples are accumulated to obtain the average loss over the entire training set, which serves as the model's performance indicator. Through continued iteration, the model gradually learns better feature representations and improves face recognition accuracy. Compared with the cross-entropy loss, the contrastive loss is better suited to measuring the similarity or difference between two vectors, so it is widely used in face recognition, image retrieval and related fields.
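A minimal PyTorch sketch of the pairwise contrastive loss described above, assuming the convention that label = 1 marks a similar pair and label = 0 a dissimilar pair (the embedding sizes and margin value are illustrative):

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    d = F.pairwise_distance(emb1, emb2)                         # Euclidean distance per pair
    loss_similar = label * d.pow(2)                             # pull similar pairs together
    loss_dissimilar = (1 - label) * F.relu(margin - d).pow(2)   # push dissimilar pairs apart
    return 0.5 * (loss_similar + loss_dissimilar).mean()

# Illustrative usage with random embeddings
emb1, emb2 = torch.randn(8, 128), torch.randn(8, 128)
label = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(emb1, emb2, label))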

Cosine Similarity Loss

$$L = \frac{1}{n}\sum_{i=1}^{n}(1 - \cos(\theta_i))$$
where $\theta_i$ is the angle between the $i$-th pair of samples and $n$ is the number of samples.

It is applicable to similarity measurement problems; the goal is to encourage the cosine similarity between similar samples to be as close to 1 as possible. Unlike contrastive loss and triplet loss, cosine similarity loss is based on the cosine similarity between samples rather than their distance.
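A short sketch of this loss in PyTorch (random embeddings are used purely for illustration); note that PyTorch's built-in nn.CosineEmbeddingLoss is a related, more general option that also handles dissimilar pairs via a margin:

import torch
import torch.nn.functional as F

def cosine_similarity_loss(emb1, emb2):
    cos = F.cosine_similarity(emb1, emb2, dim=1)  # cos(theta) per pair
    return (1 - cos).mean()                       # average of 1 - cos(theta)

emb1, emb2 = torch.randn(8, 128), torch.randn(8, 128)
print(cosine_similarity_loss(emb1, emb2))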

Dice Loss (often combined with cross-entropy loss)

$$L = -\frac{1}{n}\sum_{i=1}^{n}\frac{2\sum_{j}^{m}y_{ij}\hat{y_{ij}}+c}{\sum_{j}^{m}y_{ij}+\sum_{j}^{m}\hat{y_{ij}}+c}$$
where $y_{ij}$ is the true label of the $j$-th category for the $i$-th sample, $\hat{y_{ij}}$ is the predicted value of the $j$-th category for the $i$-th sample, $n$ is the number of samples, $m$ is the number of categories, and $c$ is a smoothing coefficient.

Applicable to image segmentation problems, the goal is to maximize the overlap between the predicted result and the true result. Unlike cross-entropy loss, Dice loss does not consider the relationship between categories, but only focuses on the overlap between predicted results and real results.

In image segmentation, we need to assign each pixel in an image to a class. For each pixel, the true label can be represented as a one-hot encoded vector, whose $i$-th position indicates whether the pixel belongs to the $i$-th class. Similarly, the model's predicted label can be represented as a vector of per-class probabilities. We can then define the Dice coefficient between the true label and the predicted label as:
$$Dice=\frac{2|X \cap Y|}{|X|+|Y|}$$
where $X$ and $Y$ denote the binary masks of the true label and the predicted label, respectively, and $|\cdot|$ denotes the number of 1s in a mask. The Dice coefficient ranges from 0 to 1, where 0 means no overlap at all and 1 means a perfect match.
To convert the Dice coefficient into a loss function, we simply use one minus the Dice coefficient, namely:
$$Dice\_loss = 1 - Dice$$
With this form, the larger the Dice coefficient, the smaller the Dice Loss. The training goal of the model is therefore to minimize the Dice Loss, which increases the Dice coefficient and thereby improves segmentation accuracy.
Note that Dice Loss is not a convex function, so optimization may get stuck in a local optimum. To mitigate this, regularization techniques such as L1 or L2 regularization are often used, or alternative optimizers such as Adam.
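A minimal sketch of a soft Dice loss for binary segmentation, assuming the predictions are probabilities (after a sigmoid) and the ground truth is a binary mask; the smooth argument plays the role of the coefficient $c$ above:

import torch

def dice_loss(pred, target, smooth=1.0):
    # Flatten each sample's mask so the sums run over all pixels.
    pred = pred.contiguous().view(pred.size(0), -1)
    target = target.contiguous().view(target.size(0), -1)
    intersection = (pred * target).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1.0 - dice.mean()

# Illustrative usage with random tensors of shape (N, H, W)
pred = torch.rand(4, 64, 64)                     # stand-in for sigmoid outputs
target = (torch.rand(4, 64, 64) > 0.5).float()   # stand-in for a binary ground-truth mask
print(dice_loss(pred, target))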

Triplet Loss

$$L = \max(0,\, d_{a,p}-d_{a,n}+margin)$$
where $d_{a,p}$ is the distance between the anchor sample and the positive sample, $d_{a,n}$ is the distance between the anchor sample and the negative sample, and $margin$ is the margin value.

It is applicable to problems such as face recognition; by comparing the distance between different photos of the same person with the distance between photos of different people, the goal is to make the distance between photos of the same person as small as possible and the distance between different people as large as possible. Like the contrastive loss, it measures similarity through distances between samples, but the triplet loss computes distances within triplets of samples, while the contrastive loss computes distances within pairs of samples.

In triplet loss, the anchor sample is the sample for which we want to learn similarity. Specifically, each sample is represented as a vector, and similarity between vectors is measured by computing the distance between them. Each training example is divided into three parts: an anchor sample, a positive sample and a negative sample. The anchor is the sample whose similarity we want to learn, the positive sample belongs to the same category as the anchor, and the negative sample belongs to a different category from the anchor.
Specifically, for each anchor sample $a$, we need to find a positive sample $p$ and a negative sample $n$ such that the distance between the anchor and the positive sample is smaller than the distance between the anchor and the negative sample. The purpose is to bring samples of the same category closer together and push samples of different categories farther apart, thereby improving similarity learning. The triplet loss can therefore be written as:
$$L=\max(d(a,p)-d(a,n)+m,\,0)$$
where $d(a,p)$ is the distance between the anchor sample $a$ and the positive sample $p$, $d(a,n)$ is the distance between the anchor sample $a$ and the negative sample $n$, and $m$ is a hyperparameter, the margin, which controls the required gap between the anchor-positive and anchor-negative distances. If $d(a,p)-d(a,n)+m>0$, the loss is positive, indicating that the model needs to adjust its parameters to make $d(a,p)-d(a,n)+m$ as small as possible; otherwise, the loss is 0, indicating that the triplet already satisfies the constraint and no adjustment is needed.
Note that in practice we usually choose representative anchor samples, such as the center samples of each category or samples that are hard to classify, to improve similarity learning. We can also use techniques such as online mining or offline mining to select suitable positive and negative samples and further improve the model's performance.
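A minimal PyTorch sketch of the triplet loss with Euclidean distances (the embedding sizes and margin value are illustrative); PyTorch's built-in nn.TripletMarginLoss implements the same idea:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)   # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)   # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()     # hinge on the margin

# Illustrative usage with random embeddings
anchor, positive, negative = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(triplet_loss(anchor, positive, negative))
print(torch.nn.TripletMarginLoss(margin=0.2)(anchor, positive, negative))  # built-in equivalent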

Focal Loss (binary form; can be extended to multi-class)

Focal Loss is a loss function designed for class imbalance and is widely used in tasks such as object detection and image segmentation. Its main idea is to give greater weight to hard-to-classify samples (those whose predicted probability for the true class is low), so that optimization concentrates on them. Let's use an example to illustrate how Focal Loss works.
Suppose we have a binary classification problem where the distribution ratio of positive samples to negative samples is 1:9. We use cross-entropy (Cross-Entropy) as the loss function for training, but due to the large number of negative samples, it is easy for the model to pay too much attention to negative samples and ignore positive samples. At this point, we can use Focal Loss to solve this problem.

The α-balanced variant of Focal Loss introduces a coefficient α to balance the contributions of positive and negative samples:
$$FL(p_t) = -\alpha_t(1-p_t)^\gamma \log(p_t)$$
where $p_t$ is the predicted probability of the true class (for a positive sample this is the predicted probability of the positive class; for a negative sample it is one minus that probability), $\alpha_t$ is a weighting coefficient used to balance the number of positive and negative samples, and $\gamma$ is a tuning parameter that controls the relative weight of hard and easy samples. In the binary classification problem, $\alpha_t$ can be defined as:
$$\alpha_t = \begin{cases} \alpha, &\text{if } y=1 \\ 1-\alpha, &\text{if } y=0 \end{cases}$$
where $y$ is the true label of the sample and $\alpha$ is a hyperparameter used to balance the number of positive and negative samples. In practice, $\alpha$ is often set according to the proportion of positive samples, i.e. $\alpha=0.1$ in the example above.

Next, consider the role of $\gamma$. When $\gamma=0$, Focal Loss degenerates into the standard cross-entropy loss. When $\gamma>0$, the modulating factor $(1-p_t)^\gamma$ shrinks the loss of easy samples (large $p_t$), reducing the model's attention to them, while the loss of hard samples (small $p_t$) is reduced far less, so their relative weight increases and optimization concentrates on them. By adjusting $\gamma$, we can therefore control how much attention the model pays to samples of different difficulty.
(Figure: loss curves illustrating how the modulating factor affects easy and hard samples.)
As shown in the figure, although the loss of each easy sample is small, such samples are numerous; after introducing the modulating factor, their total contribution to the loss is reduced. For hard samples, $p_t$ is close to 0, so the modulating factor has little effect on their loss.

In short, the main function of Focal Loss is to give greater weight to hard-to-classify samples so that optimization concentrates on them. In tasks such as object detection and image segmentation, where the proportions of positive and negative samples differ enormously, Focal Loss helps balance their contributions and thereby improves model performance.

Relevant PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFocalLoss(nn.Module):
    "Alpha-balanced binary Focal Loss, matching the formula above"
    def __init__(self, alpha=.25, gamma=2):
        super(WeightedFocalLoss, self).__init__()
        # Index 0 holds the weight for y=0 (1-alpha), index 1 the weight for y=1 (alpha).
        self.alpha = torch.tensor([1 - alpha, alpha])
        self.gamma = gamma

    def forward(self, inputs, targets):
        # Per-element binary cross-entropy on raw logits, kept unreduced.
        BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        targets = targets.type(torch.long)
        # Select alpha_t for each sample according to its label.
        at = self.alpha.to(inputs.device).gather(0, targets.data.view(-1))
        # BCE_loss = -log(p_t), so p_t can be recovered as exp(-BCE_loss).
        pt = torch.exp(-BCE_loss)
        # Modulating factor (1 - p_t)^gamma down-weights easy samples.
        F_loss = at * (1 - pt) ** self.gamma * BCE_loss
        return F_loss.mean()
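A quick illustrative usage of this module, with hypothetical tensor shapes (logits and targets both of shape (N,)):

criterion = WeightedFocalLoss(alpha=0.25, gamma=2)
logits = torch.randn(16, requires_grad=True)    # stand-in for raw model outputs (before sigmoid)
targets = torch.randint(0, 2, (16,)).float()    # binary labels
loss = criterion(logits, targets)
loss.backward()
print(loss.item())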
