Handling Sample Imbalance: Focal Loss

0 Introduction

Focal Loss was proposed to deal with the problems caused by imbalanced samples; it has stood the test of time and works well on a variety of tasks. Before understanding Focal Loss, we first need to go over the cross-entropy loss and the weighted cross-entropy loss. Then, from the perspective of sample weighting, we can see how Focal Loss distributes weight across samples. Focal is the adjective form of the verb Focus, so what exactly does it focus on?

1 Cross Entropy

1.1 Cross-Entropy Loss

Suppose there are \(N\) samples fed into a \(C\)-class classifier, whose output is \(X \in \mathcal{R}^{N \times C}\), i.e. there are \(C\) classes. Denote the output for one of these samples by the row vector \(x \in \mathcal{R}^{1 \times C}\), so \(x[j]\) is the \(j\)-th element of the row vector \(x\). Then the cross-entropy loss can be written as:

\[ \text{loss}(x, \text{class}) = -\log\left(\frac{\exp(x[\text{class}])}{\sum_j \exp(x[j])}\right) = -x[\text{class}] + \log\left(\sum_j \exp(x[j])\right) \tag{1-1} \]

where \(\text{class} \in [0,\ C)\) is the class label of this sample. If each class is given a weight from a weight vector \(W \in \mathcal{R}^{1 \times C}\), then the weighted cross-entropy loss becomes:

\[ \operatorname{loss}(x, \text {class})=W[\text {class}]\left(-x[\text {class}]+\log \left(\sum_{j} \exp (x[j])\right)\right) \tag{1-2} \]

Finally, the losses of these \(N\) samples are summed or averaged:

\[ \ell = \begin{cases} \sum_{i}^{N}{\text{loss}(x^{(i)},\ \text{class}^{(i)})}&\text{, sum}\\ \dfrac{1}{N}\sum_{i}^{N}{\text{loss}(x^{(i)},\ \text{class}^{(i)})}&\text{, mean} \end{cases} \tag{1-3} \]

This is the cross-entropy loss we commonly use.
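As a quick sanity check, here is a small sketch (assuming PyTorch; the tensor values are made up for illustration) that computes Eq. (1-2) per sample and compares the "sum" reduction of Eq. (1-3) against the built-in nn.CrossEntropyLoss. Note that PyTorch's own 'mean' reduction with a weight divides by the sum of the weights rather than by \(N\).

import torch
import torch.nn as nn

N, C = 4, 3                                  # number of samples and classes
x = torch.randn(N, C)                        # classifier output X in R^{N x C}
target = torch.tensor([0, 2, 1, 2])          # class labels in [0, C)
W = torch.tensor([1.0, 2.0, 0.5])            # per-class weights (illustrative)

# Eq. (1-2): W[class] * (-x[class] + log(sum_j exp(x[j])))
per_sample = W[target] * (-x[torch.arange(N), target] + torch.logsumexp(x, dim=1))

# Eq. (1-3), "sum" reduction, against the built-in loss
builtin = nn.CrossEntropyLoss(weight=W, reduction='sum')(x, target)
print(torch.allclose(per_sample.sum(), builtin))   # True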

1.2 Binary Cross-Entropy Loss

The cross-entropy loss above applies to multi-class classification (two or more classes), but it may look different from the formula we usually see in textbooks or papers. The commonly seen cross-entropy loss formula is:

\[ l = -y\log{\hat{y}} - (1-y)\log{(1-\hat{y})} \]

This is the typical binary cross-entropy loss, where \(y \in \{0,\ 1\}\) is the label and \(\hat{y} \in [0,\ 1]\) is the model's predicted value for class 1. The formula above is a combined form that is equivalent to:

\[ l = \begin{cases} -\log{\hat{y}_0} &y=0 \\ -\log{\hat{y}_1} &y=1 \end{cases}; \quad \text{where}\quad \hat{y}_0+\hat{y}_1 = 1 \]

where \(\hat{y}_0, \hat{y}_1\) are the two pseudo-probability values output by the binary classifier.

For example, if the model is a binary classification neural network whose last layer is two neurons followed by a Softmax, then \(\hat{y}_0, \hat{y}_1\) correspond to the output values of these two neurons. Of course, class weights can also be added to this formula.
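To make the equivalence concrete, here is a small sketch (assuming PyTorch; the logits are chosen arbitrarily): the binary cross-entropy formula applied to the two Softmax outputs gives the same value as the ordinary cross-entropy loss with \(C = 2\).

import torch
import torch.nn.functional as F

x = torch.tensor([[0.3, 1.2]])                 # two output neurons (made-up logits)
y = torch.tensor([1])                          # label in {0, 1}

y_hat = torch.softmax(x, dim=1)                # pseudo-probabilities [y_hat_0, y_hat_1]
bce = -(y * torch.log(y_hat[:, 1]) + (1 - y) * torch.log(y_hat[:, 0]))

ce = F.cross_entropy(x, y, reduction='none')   # multi-class CE with C = 2
print(torch.allclose(bce, ce))                 # True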

Likewise, suppose there are \(N\) samples fed into a binary classifier, whose output is \(X \in \mathcal{R}^{N \times 2}\). After the Softmax function, \(\hat{Y} = \sigma(X) \in \mathcal{R}^{N \times 2}\), and the labels are \(Y \in \mathcal{R}^{N \times 2}\). Denote the binary loss of each sample by \(l^{(i)},\ i = 1, 2, \cdots, N\); finally, the losses of these \(N\) samples are summed or averaged:

\[ \ell = \begin{cases} \sum_{i}^{N}l^{(i)}&\text{, sum}\\ \dfrac{1}{N}\sum_{i}^{N}l^{(i)}&\text{, mean} \end{cases}; \ \ \ l^{(i)} = -y^{(i)}\log{\hat{y}^{(i)}}-(1-y^{(i)})\log{(1-\hat{y}^{(i)})} \]

Note: if there is only one training sample, i.e. \(N = 1\), then the class weight in the weighted loss above has no effect. This is because weights are relative: giving one sample a large weight is only meaningful if some other sample gets a small weight, which is how the relative importance of samples within a batch is expressed. When \(N = 1\) the notion of weight disappears; that single sample is both the only one and the most important one. The \(N = 1\) case, i.e. batch_size=1, often occurs when training on video or long-document data: because of GPU memory limits and the large size of such data, only one sample can be processed at a time, so we need to pay attention to this weighting issue.
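A minimal sketch of this effect, assuming PyTorch's F.cross_entropy: with a single sample and the 'mean' reduction, the class weight is divided out again, so the weighted and unweighted losses coincide.

import torch
import torch.nn.functional as F

x = torch.randn(1, 3)                        # batch_size = 1, C = 3
t = torch.tensor([2])

w = torch.tensor([1.0, 1.0, 10.0])           # heavily up-weight class 2
weighted = F.cross_entropy(x, t, weight=w, reduction='mean')
plain = F.cross_entropy(x, t, reduction='mean')
print(torch.allclose(weighted, plain))       # True: with N = 1 the weight cancels out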

2 Focal Loss

2.1 The Basic Idea

In general, Focal Loss (hereafter FL) [1] was proposed to solve the problem of sample imbalance, but more precisely, it addresses the imbalance between hard samples (Hard Sample) and easy samples (Easy Sample). Class imbalance can in fact be handled to some extent by the weighted cross-entropy loss above, but in practice simply re-weighting classes often does not work well. So we should ask: on the surface our samples are imbalanced, but perhaps the real cause of the poor results is not simply class imbalance; rather, some samples are hard while many others are easy. Although each easy sample is easy for the classifier to distinguish and contributes little loss, there are so many of them that their total loss can still exceed that of the hard samples. We therefore need to give hard samples a larger weight and easy samples a smaller weight.

So what exactly is a hard sample, and what is an easy sample? The figures below make this clear.

Figure 2-1 Hard Sample; Figure 2-2 Easy Sample 1; Figure 2-3 Easy Sample 2; Figure 2-4 Sample Space

Suppose our task is to train a classifier that distinguishes humans from horses. For the three images above, Figures 2-2 and 2-3 should be very easy to judge, but Figure 2-1 is not: it has features of a human as well as features of a horse, which makes it very confusing. Although such samples may not appear frequently in the dataset, if we want to improve the classifier's performance we need to put effort into classifying them correctly.

Once hard and easy samples are introduced, the sample space can be divided as shown in Figure 2-4, where the vertical axis separates majority-class samples (Majority Class) from minority-class samples (Minority Class). The weighted cross-entropy loss above only addresses the imbalance between the majority and minority classes; it does not consider hard versus easy samples. Focal Loss was proposed precisely to solve this hard-sample classification problem.

2.2 The Focal Loss Solution

To solve the hard-sample problem, we first need to identify which samples are hard and which are easy. For a neural network, this is relatively straightforward. Figures 2-6 and 2-7 show the last-layer output of a 5-class classification network; after a Softmax or Sigmoid, we obtain pseudo-probability values that represent the model's predicted probability for each class.

Figure 2-6 Easy Sample classifier output; Figure 2-7 Hard Sample classifier output

In Figure 2-6, the sample's label is 1, and the neuron with the largest classifier output is neuron 1 (counting from 0), so the prediction happens to be correct; moreover, its output value of 2 is considerably larger than the outputs of the other neurons, so this can be regarded as an easy sample (Easy Sample). In Figure 2-7, the sample's label is 3, but the neuron with the largest output is neuron 4, and the outputs of the neurons are all close to one another, so the network cannot reliably determine this sample's class; it can be regarded as a hard sample (Hard Sample). To put it plainly, the way to judge whether a sample is easy or hard is to look at the final output of the classification network: if the network predicts the class correctly and with a large probability, it is an easy sample; if the network's output probability is small, it is a hard sample. Below, this idea is expressed rigorously as the formula for Focal Loss.
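A tiny illustration of this (the logits below are hypothetical, not read off the figures): after Softmax, an easy sample has a large \(p_t\) for its true class, while a hard sample does not.

import torch

easy = torch.tensor([[-1.0, 2.0, -0.5, 0.1, -0.8]])   # label 1: neuron 1 clearly dominates
hard = torch.tensor([[0.4, 0.3, 0.5, 0.45, 0.6]])     # label 3: all outputs are close

print(torch.softmax(easy, dim=1))   # probability of class 1 is large  -> Easy Sample
print(torch.softmax(hard, dim=1))   # probability of class 3 is small  -> Hard Sample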

Let the output of a \(C\)-class classifier be \(\boldsymbol{y}\in \mathcal{R}^{C\times 1}\), and define a function \(f\) that converts the output \(\boldsymbol{y}\) into pseudo-probability values \(\boldsymbol{p}=f(\boldsymbol{y})\). Let the class label of the current sample be \(t\), and write \(p_t=\boldsymbol{p}[t]\), the probability the classifier assigns to class \(t\). Combining this with the cross-entropy loss above, Focal Loss is defined as:

\[ \text{FL} = -(1-p_t)\log(p_t) \tag{2-1} \]
This is essentially a cross-entropy loss with a weight in front, except that this weight has a somewhat unusual origin. To better control the magnitude of this weight, an exponent \(\gamma\) can be added to it, changing Eq. (2-1) into:

\[ \text{FL} = -(1-p_t)^\gamma\log(p_t) \tag{2-2} \]

A value of \(\gamma = 2\) usually works well; when \(\gamma = 0\), the loss is equivalent to the cross-entropy loss. The larger \(\gamma\) is, the more the loss of easy samples is suppressed and, relatively, the more the loss of hard samples is amplified. To also address class imbalance, a class weight \(\alpha_t\) can be added to Eq. (2-2) (this is the same class weight already implemented in the weighted cross-entropy loss above):

\[ \text{FL} = -\alpha_t(1-p_t)^\gamma\log(p_t) \tag{2-3} \]

That is all there is to the theory of Focal Loss: very simple, yet effective.
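To get a feel for the numbers, the small sketch below evaluates Eq. (2-2) with \(\gamma = 2\) for a few illustrative values of \(p_t\): an easy sample with \(p_t = 0.9\) keeps only 1% of its cross-entropy loss, while a hard sample with \(p_t = 0.1\) keeps 81%.

import math

for p_t in (0.9, 0.5, 0.1):                  # from easy to hard
    ce = -math.log(p_t)                      # plain cross entropy
    fl = (1 - p_t) ** 2 * ce                 # Eq. (2-2) with gamma = 2
    print(f"p_t={p_t:.1f}  CE={ce:.3f}  FL={fl:.3f}  FL/CE={fl / ce:.2f}")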

3 Focal Loss Implementation (PyTorch)

3.1 Cross-Entropy Loss Implementation (numpy)

To better understand the implementation of Focal Loss, it helps to first understand the implementation of the cross-entropy loss. Here is a simple numpy implementation.

import numpy as np

def cross_entropy(output, target):
    # output: (N, C) raw classifier scores; target: (N,) integer class labels
    out_exp = np.exp(output)
    # pick, for each sample i, the exponentiated score of its true class target[i]
    out_cls = np.array([out_exp[i, t] for i, t in enumerate(target)])
    # Eq. (1-1): -log( exp(x[class]) / sum_j exp(x[j]) )
    ce = -np.log(out_cls / out_exp.sum(1))
    return ce

The line that builds out_cls may be slightly hard to follow; it simply picks out the output value corresponding to each sample's label. For example, if the label of sample 2 is 3, then the value in row 2, column 3 of the classifier output should be selected.
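A short usage example of the function above (the numbers are made up):

output = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.3, 1.5]])           # 2 samples, 3 classes
target = np.array([0, 2])                      # class labels

print(cross_entropy(output, target))           # per-sample losses, Eq. (1-1)
print(cross_entropy(output, target).mean())    # "mean" reduction, Eq. (1-3)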

3.2 Focal Loss Implementation

In the code below, the forward pass first computes a pseudo probability from the output and turns it into focal_weight; then the class weight and focal_weight are applied to the cross-entropy loss to obtain the final focal_loss; finally, the mean and sum reduction methods are implemented. Note that the mean is not a simple average but a weighted average: the loss is divided by the sum of the focal weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2, weight=None, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.gamma = gamma          # focusing parameter, Eq. (2-2)
        self.weight = weight        # per-class weights alpha_t, Eq. (2-3)
        self.reduction = reduction

    def forward(self, output, target):
        # convert the output logit of the target class to a pseudo probability p_t
        out_target = torch.stack([output[i, t] for i, t in enumerate(target)])
        probs = torch.sigmoid(out_target)
        focal_weight = torch.pow(1 - probs, self.gamma)   # (1 - p_t)^gamma

        # apply the focal weight (and optional class weights) to the cross-entropy loss
        ce_loss = F.cross_entropy(output, target, weight=self.weight, reduction='none')
        focal_loss = focal_weight * ce_loss

        if self.reduction == 'mean':
            # weighted average: divide by the sum of the focal weights, not by N
            focal_loss = (focal_loss / focal_weight.sum()).sum()
        elif self.reduction == 'sum':
            focal_loss = focal_loss.sum()

        return focal_loss

Note: in the implementation above, the output must satisfy output.dim() == 2 with shape (batch_size, C), and target.max() < C.
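A minimal usage sketch of the FocalLoss module defined above (random tensors, purely for illustration):

import torch

criterion = FocalLoss(gamma=2, weight=torch.tensor([1.0, 2.0, 1.0, 1.0, 0.5]))
output = torch.randn(8, 5, requires_grad=True)   # (batch_size, C) logits
target = torch.randint(0, 5, (8,))               # labels with target.max() < C

loss = criterion(output, target)
loss.backward()                                  # behaves like any other PyTorch loss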

Summary

Since Focal Loss was proposed in 2017, the paper has been cited more than 2000 times, which is enough to show its effectiveness. In essence, it is just another way of re-assigning weights to samples, much like the class-weighting method; the difference is that it partitions the sample space more finely. Looking at Figure 2-4: the class-weighting method only splits the sample space into two parts at the blue line, while adding the easy/hard distinction also splits the space into left and right parts, so the sample space ends up divided into four regions and the weighting becomes more fine-grained. Following this idea, could we divide the sample space even more finely according to the needs of different tasks, and then assign different weights accordingly?

References

[1] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
