A Summary of Loss Functions in Deep Learning

In deep learning, the loss function plays a vital role: by minimizing the loss, the model reaches a converged state and its prediction error is reduced. Different loss functions therefore have a significant impact on the model. Below is a summary of the loss functions frequently used in my work:

  • Image classification: cross entropy
  • Object detection: Focal Loss, L1/L2 loss, IOU Loss, GIOU, DIOU, CIOU
  • Image recognition: Triplet Loss, Center Loss, Sphereface, Cosface, Arcface

Image Classification

Cross entropy

In image classification, softmax + cross entropy is often used as the loss function; for a detailed derivation, refer to my previous blog.

$$Cross Entropy=-\sum_{i=1}^{n}p(x_i)ln(q(x_i))$$

where $p(x)$ denotes the true probability distribution and $q(x)$ the predicted probability distribution. The cross-entropy loss reduces the difference between the two distributions, pushing the predicted distribution as close as possible to the true one.
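As a quick illustration, here is a minimal NumPy sketch of softmax followed by cross entropy for a single example (the helper names are mine, not from any particular framework):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(p_true, q_pred, eps=1e-12):
    # -sum_i p(x_i) * ln(q(x_i)); eps guards against log(0).
    return -np.sum(p_true * np.log(q_pred + eps), axis=-1)

logits = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
p = np.array([1.0, 0.0, 0.0])        # one-hot true distribution
q = softmax(logits)                  # predicted distribution
print(cross_entropy(p, q))           # loss for this single example
```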

Later, Google proposed label smoothing on top of cross entropy. For details, refer to this blog.

In practice, the predicted probability is fitted to the true probability, but fitting a one-hot true distribution causes two problems:

  1. It cannot guarantee the generalization ability of the model and easily leads to overfitting;
  2. A hard 0/1 target encourages the gap between the true class and all other classes to grow as large as possible, and since the gradient is bounded this is hard to achieve; it makes the model over-confident in its class predictions.

Therefore, to reduce this over-confidence and to soften the impact of labeling errors, $p(x)$ is modified:

$$p'(x)=(1-\epsilon )\delta _{(k,y)}+\epsilon u(k)$$

where $\delta _{(k,y)}$ is the Dirac delta and $u(k)$ is the uniform distribution. In simple terms, the confidence in label $y$ is reduced and the confidence in the remaining classes is raised. The cross entropy then becomes:

$$H(p',q)=-\sum_{i=1}^{n}p'(x_i)ln(q(x_i))=(1-\epsilon )H(p,q)+\epsilon H(u,q)$$
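Label smoothing itself is a one-line transformation of the target distribution; a small sketch under the same NumPy setup as above, with `eps` standing in for $\epsilon$:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # p'(x) = (1 - eps) * delta + eps * u(k), with u uniform over k classes
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

p = np.array([1.0, 0.0, 0.0])
print(smooth_labels(p))   # [0.9333, 0.0333, 0.0333] for eps=0.1
```

The smoothed target can then be fed to the `cross_entropy` helper above in place of the one-hot vector.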



Object Detection

Recently I came across a good blog post introducing the loss functions used in object detection; you can refer to: https://mp.weixin.qq.com/s/ZbryNlV3EnODofKs2d01RA

In object detection, the loss function generally consists of two parts: a classification loss and a bounding-box regression loss. The classification loss aims to classify as correctly as possible; the bounding-box regression loss aims to make the predicted box match the GT box as closely as possible.

Focal loss

Focal Loss comes from the paper "Focal Loss for Dense Object Detection" and mainly addresses the imbalance between positive and negative samples. By reducing the loss of easy examples, it indirectly increases the weight of hard examples in the total loss. Focal Loss is an improvement built on cross entropy:

$$Focal loss=-\alpha _t(1-p_t)^\gamma log(p_t)$$

A modulating factor $(1-p_t)^\gamma $ is added in front of the cross entropy. When an example is misclassified, $p_t$ is small, so $(1-p_t)$ is close to 1 and the loss is barely affected; increasing the parameter $\gamma $ smoothly reduces the weight of easy examples. When $\gamma =0$, Focal Loss degenerates into cross entropy. The paper's figure plots the loss curves for different values of $\gamma $.


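A minimal NumPy sketch of the formula, assuming `p_t` is already the predicted probability of the true class; the defaults $\alpha _t=0.25$, $\gamma =2$ are the values used in the paper:

```python
import numpy as np

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    # FL = -alpha_t * (1 - p_t)^gamma * log(p_t)
    # p_t is the predicted probability of the true class.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# easy example (p_t high) vs hard example (p_t low):
print(focal_loss(np.array([0.9, 0.1])))   # the easy example is heavily down-weighted
```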
L1, L2, smooth L1 loss functions

The L1, L2, or smooth L1 loss is used to regress the four coordinate values. The smooth L1 loss was proposed in Fast R-CNN. The three loss functions are as follows:

$$L1=\left | x \right |$$

$$L2=x^{2}$$

$$smoothL1=\left\{\begin{matrix} 0.5x^2\qquad if\left |x \right |<1\\  \left |x \right |-0.5 \qquad otherwise \end{matrix}\right.$$

Looking at the derivatives of these losses with respect to $x$: the derivative of the $L1$ loss is constant, so late in training, when $x$ is small, the loss fluctuates around a stable value if the learning rate is unchanged, making it hard to converge to higher accuracy. The derivative of the $L2$ loss scales with $x$, so early in training, when $x$ is large, the derivative is also very large and training is unstable. Smooth L1 avoids the drawbacks of both $L1$ and $L2$.

In typical object detection, the differences between the predicted and GT values of the four coordinates are computed, and the four losses are summed to form the regression loss.
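A small NumPy sketch of smooth L1 applied to the four box coordinates, with made-up coordinate values for illustration:

```python
import numpy as np

def smooth_l1(x):
    # 0.5 * x^2 if |x| < 1, |x| - 0.5 otherwise
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

pred = np.array([50.0, 60.0, 200.0, 220.0])   # x1, y1, x2, y2
gt   = np.array([48.0, 60.5, 198.0, 225.0])
# per-coordinate losses are summed into the regression loss:
print(np.sum(smooth_l1(pred - gt)))
```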

However, these three loss functions have the following deficiencies:

  • When computing the bounding-box loss for detection, the loss of each of the four points is computed independently and the results are summed into the final bounding-box loss. This assumes the four points are independent, while in fact they are correlated;
  • The actual evaluation metric for detected boxes is IOU, and the two are not equivalent: several boxes may have the same loss yet widely different IOU values. IOU Loss was introduced to solve this problem.

IOU Loss

IOU Loss was proposed by Megvii in 2016 in "UnitBox: An Advanced Object Detection Network." The main points of the paper are:

  • The Ln-based loss functions rely on Euclidean distance and assume the four coordinate variables are independent, but in reality these coordinates are correlated.
  • Evaluation uses IOU, while regression uses the four coordinate variables; the two are not equivalent.
  • Boxes with the same Euclidean distance do not have a unique IOU value.

Therefore, IOU loss was proposed, using IOU directly as the loss function:

$$Loss_{IOU}=-ln(IOU)$$

In practice, a commonly used alternative is:

$$Loss_{IOU}=1-IOU$$
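A minimal sketch of IOU and the two IOU loss variants, for boxes in $(x_1,y_1,x_2,y_2)$ format (the helper name `iou` and the example boxes are mine):

```python
import numpy as np

def iou(box_a, box_b):
    # boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

a = [0, 0, 100, 100]
b = [50, 50, 150, 150]
v = iou(a, b)
print(-np.log(v + 1e-12), 1 - v)   # the two IOU loss variants
```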

GIOU

The GIOU loss was proposed by Stanford in 2019 in "Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression." The IOU Loss above cannot be optimized for two non-overlapping boxes, and it does not reflect how far apart the two boxes actually are. To solve this problem, the authors proposed GIOU as a loss function:

$$GIOU=IOU-\frac{C-(A\bigcup B)}{C}$$

where $C$ denotes the area of the smallest enclosing box of the two boxes. That is, first compute the IOU of the two boxes; then subtract the area of $A\bigcup B$ from the enclosing area $C$, divide by $C$, and subtract the result from the IOU to obtain GIOU.


GIOU has the following properties:

  • GIOU can be used as a distance measure, with $Loss_{GIOU}=1-GIOU$
  • GIOU is scale-invariant
  • GIOU is a lower bound of IOU: $GIOU(A,B)\leq IOU(A,B)$, and in the limit as the boxes approach each other, $\lim_{A\rightarrow B}GIOU(A,B)=IOU(A,B)$
  • When rectangles A and B coincide, $GIOU(A,B)=IOU(A,B)$
  • When rectangles A and B do not intersect, GIOU approaches $-1$ as the distance between them grows

Overall, GIOU retains all the advantages of IOU while overcoming its shortcomings.
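A sketch of GIOU in the same box format as above; the example shows that, unlike IOU, GIOU still distinguishes disjoint boxes at different distances (the example boxes are mine):

```python
def giou(box_a, box_b):
    # IOU minus the fraction of the enclosing box not covered by the union
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest enclosing box C
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (area_c - union) / area_c

# disjoint boxes: IOU is 0 for both pairs, but GIOU still gives a signal
print(giou([0, 0, 10, 10], [20, 0, 30, 10]))   # about -0.33
print(giou([0, 0, 10, 10], [90, 0, 100, 10]))  # about -0.80: farther apart
```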

DIOU and CIOU

DIOU and CIOU were proposed by Tianjin University in 2019 in "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression." To address the slow convergence of GIOU and to improve regression accuracy, DIOU was proposed to accelerate convergence. Then, considering the three geometric factors of box regression (overlap area, center-point distance, and aspect ratio), CIOU was proposed on top of DIOU to further improve convergence speed and regression accuracy. In addition, DIOU can be combined with NMS to form DIOU-NMS for post-processing the predicted boxes.

When the GT box completely contains the predicted box, IOU and GIOU take the same value: GIOU degenerates into IOU and cannot distinguish the relative positions of the two boxes. Moreover, because it depends heavily on the IOU term, GIOU converges slowly.

Based on these problems, the authors posed two questions:

  1. Is it feasible to directly minimize the normalized distance between the predicted box and the target box, so as to converge faster?
  2. How can regression be made more accurate and faster when the predicted box overlaps with, or is even contained by, the target box?

A good bounding-box regression loss should take three important geometric factors into account: overlap area, center-point distance, and aspect ratio. For question one, the authors proposed DIoU Loss, which converges faster than GIoU Loss; it considers overlap area and center-point distance, but not aspect ratio. For question two, they proposed CIoU Loss, which converges with higher accuracy and takes all three factors into account.


First, define a general loss function based on IOU Loss:

$$Loss=1-IOU+R(B,B^{gt})$$

where $R(B,B^{gt})$ denotes the penalty term between the predicted box and the GT box. In IOU Loss, $R(B,B^{gt})=0$; in GIOU, $R(B,B^{gt})=\frac{C-A\bigcup B}{C}$.

In DIOU, the penalty term is $R(B,B^{gt})=\frac{\rho ^2(b,b^{gt})}{c^2}$, where $b$ and $b^{gt}$ denote the center points of the predicted box and the GT box, $\rho ()$ is the Euclidean distance, and $c$ is the diagonal length of the smallest enclosing box of the predicted box $B$ and the GT box $B^{gt}$.

Therefore, $Loss_{DIOU}$ is defined as:

$$Loss_{DIOU}=1-IOU+\frac{\rho ^2(b,b^{gt})}{c^2}$$

Thus, $Loss_{DIOU}$ has the following properties:

  1. DIOU is still scale-invariant;
  2. DIOU directly minimizes the distance between the two boxes, so it converges faster;
  3. When the target box encloses the predicted box, DIoU Loss can still converge quickly, whereas GIoU Loss degenerates into IoU Loss and converges slowly (see the sketch below).
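A sketch of the DIOU loss for the same box format as before; the example uses a predicted box fully contained in the GT box, where IOU and GIOU alone give no positional signal but the DIOU penalty still reflects the center offset:

```python
def diou_loss(box_p, box_g):
    # Loss_DIOU = 1 - IOU + rho^2(b, b_gt) / c^2, boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    u = inter / (area_p + area_g - inter + 1e-12)
    # squared distance between the two box centers
    pcx, pcy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    gcx, gcy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # squared diagonal of the smallest enclosing box
    cx1, cy1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
    cx2, cy2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    return 1 - u + rho2 / (c2 + 1e-12)

# GT box fully contains the predicted box: the center-distance term still
# separates predictions that IOU/GIOU would score identically.
print(diou_loss([10, 10, 30, 30], [0, 0, 100, 100]))
```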

DIOU considers the overlap area and the center-point distance, but not the aspect ratio. CIOU was further proposed to take all three factors into account, adding a term $\alpha \upsilon $ to the DIOU penalty:

$$R(B,B^{gt})=R_{CIOU}=\frac{\rho ^2(b,b^{gt})}{c^2}+\alpha \upsilon $$

where $\alpha $ is a trade-off parameter and $\upsilon $ measures the consistency of the aspect ratios.

$$\upsilon =\frac{4}{\pi ^2}\left ( arctan\frac{w^{gt}}{h^{gt}}-arctan\frac{w}{h} \right )^2$$

$$\alpha =\frac{\upsilon }{(1-IOU)+\upsilon }$$

Why does $\upsilon $ include the factor $\frac{4}{\pi ^2}$? The range of $arctan$ here is $[0,\frac{\pi }{2})$, so the squared difference is at most $\frac{\pi ^2}{4}$, and the factor normalizes $\upsilon $ into $[0,1)$.

Therefore, the CIOU loss is:

$$Loss_{CIOU}=1-IOU+\frac{\rho ^2(b,b^{gt})}{c^2}+\alpha \upsilon $$

In practice, the gradient of $\upsilon $ with respect to $w$ and $h$ contains a factor $\frac{1}{w^2+h^2}$. Since $w$ and $h$ are normalized, $w^2+h^2$ is a very small number, which easily causes gradient explosion during backpropagation; $w^2+h^2$ is therefore usually replaced by 1.
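Putting the pieces together, a sketch of the CIOU loss that reuses the `iou()` and `diou_loss()` helpers from the earlier sketches; the $\alpha $ and $\upsilon $ terms follow the formulas in this section, and the small epsilon is mine, for numerical safety:

```python
import numpy as np

def aspect_ratio_term(box_p, box_g, u):
    # v measures aspect-ratio consistency; alpha is the trade-off weight.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = 4 / np.pi ** 2 * (np.arctan(wg / hg) - np.arctan(wp / hp)) ** 2
    alpha = v / ((1 - u) + v + 1e-12)
    return alpha * v

def ciou_loss(box_p, box_g):
    # Loss_CIOU = Loss_DIOU + alpha * v
    u = iou(box_p, box_g)   # iou() from the IOU Loss sketch above
    return diou_loss(box_p, box_g) + aspect_ratio_term(box_p, box_g, u)

print(ciou_loss([40, 30, 60, 70], [0, 0, 100, 100]))
```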

One more reminder: GIOU, DIOU, and CIOU are all overlap measures and can substitute for IOU when used. One issue to consider here, though, is the matching rule between predicted boxes and GT boxes: it is not the case that an anchor will be matched to a non-overlapping GT box. As in SSD, an anchor picks the GT box with the largest overlap to predict, and that "largest overlap" can be measured with IOU, GIOU, DIOU, or CIOU.



Image Recognition

Image recognition covers problems such as person re-identification and face recognition. The losses in this category are general-purpose, so they are summarized together. Again, there is a good blog post introducing many face recognition loss functions: https://mp.weixin.qq.com/s/wJ-JNsUv60vXtGIV-mDrTA

Triplet Loss

Triplet Loss was proposed in 2015 in "FaceNet: A Unified Embedding for Face Recognition and Clustering." Its main idea is to pull samples of the same id closer together and push samples of different ids apart. The anchor and the positive belong to the same id, i.e. $y_{anchor}=y_{positive}$, while the anchor and the negative belong to different ids, i.e. $y_{anchor} \ne y_{negative}$. Through training, the Euclidean distance between anchor and positive shrinks while the distance between anchor and negative grows. Here, anchor, positive, and negative are d-dimensional embeddings of the images.

Expressed mathematically, the effect triplet loss aims to achieve is:

$$d(x^a_i,x^p_i)+\alpha \leq d(x^a_i,x^n_i)$$

where $d()$ denotes the Euclidean distance between two vectors and $\alpha $ is the margin between them, which prevents the trivial solution $d(x^a_i,x^p_i)=d(x^a_i,x^n_i)=0$. This goal is achieved by minimizing the triplet loss:

$$triplet\quad loss=[d(x^a_i,x^p_i)-d(x^a_i,x^n_i)+\alpha ]_+$$

In practice, online training is usually used: choose $P$ different ids with $K$ images each, forming a mini-batch of size $batch\_size=P\times K$, and then select hard/easy examples within this mini-batch to form the loss. For details, refer to this blog.
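A minimal NumPy sketch of the triplet loss for a single (anchor, positive, negative) triple, with made-up 3-dimensional embeddings; FaceNet uses squared distances, while plain Euclidean distances are used here to match the $d()$ in this section:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # [ d(a, p) - d(a, n) + margin ]_+ with Euclidean distances
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.1, 0.9, 0.0])
p = np.array([0.2, 0.8, 0.1])    # same id as the anchor
n = np.array([0.9, 0.1, 0.3])    # different id
print(triplet_loss(a, p, n))
```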

Center Loss

Center Loss was proposed in "A Discriminative Feature Learning Approach for Deep Face Recognition." To improve the discriminative power of features, the authors proposed the center loss, which not only reduces intra-class variation but also enlarges inter-class differences.

The authors first experimented on MNIST, reducing the output of the last hidden layer to 2 dimensions, training with softmax + cross entropy, and visualizing the result. Cross entropy separates the classes, and the data spread out in rays, but the features are not discriminative enough: intra-class variation remains large.


Therefore, the authors wanted to further reduce intra-class variation while keeping the classes separable. To this end, they proposed the center loss:

$$L_C=\frac{1}{2}\sum_{i=1}^{m}\left \| x_i-c_{y_i} \right \|^2_2$$

where $c_{y_i}$ denotes the center of class $y_i$. Center loss is therefore usually combined with cross entropy to form a joint loss:

$$L=L_S+\lambda L_C=-\sum_{i=1}^{m}log\frac{e^{W^T_{y_i}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W^T_j x_i+b_j}}+\frac{\lambda }{2}\sum_{i=1}^{m}\left \| x_i-c_{y_i} \right \|^2_2$$

where $\lambda $ controls the penalty strength of the center loss. On MNIST again, as $\lambda $ increases the constraint becomes stronger and each class clusters more tightly around its center.


Using the center loss introduces two hyperparameters: $\alpha $ and $\lambda $. $\lambda $ controls the penalty strength of the center loss, while $\alpha $ controls the learning rate of the class centers $c_{y_i}$. The class centers should shift as the features change, so $c_{y_i}$ is usually updated in each mini-batch:

$$c^{t+1}_j=c^t_j-\alpha \Delta c^t_j$$
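A simplified NumPy sketch of the center loss and the per-batch center update; note that the update below averages over each class's samples in the mini-batch, which is a slight simplification of the paper's normalization:

```python
import numpy as np

def center_loss(features, labels, centers):
    # L_C = 1/2 * sum_i || x_i - c_{y_i} ||^2
    diff = features - centers[labels]
    return 0.5 * np.sum(diff ** 2)

def update_centers(features, labels, centers, alpha=0.5):
    # c_j <- c_j - alpha * delta_c_j, averaging over the batch's class-j samples
    for j in np.unique(labels):
        delta = np.mean(centers[j] - features[labels == j], axis=0)
        centers[j] = centers[j] - alpha * delta
    return centers

feats = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.0]])  # 2-D features
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))                              # one center per class
print(center_loss(feats, labels, centers))
centers = update_centers(feats, labels, centers)
```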

Sphereface

Sphereface was proposed in "SphereFace: Deep Hypersphere Embedding for Face Recognition"; its loss is also called the A-Softmax loss. The authors argue that triplet loss requires carefully constructed triplets and is not flexible enough, while center loss only emphasizes intra-class compactness and pays too little attention to inter-class separability. They therefore ask: are losses based on Euclidean distance suitable for learning discriminative features at all?

First, revisit the softmax loss (i.e. softmax + cross entropy):

$$Loss_i=-log\left ( \frac{e^{W^T_{y_i}x_i+b_{y_i}}}{\sum_{j}e^{W^T_jx_i+b_j}} \right )=-log\left (\frac{e^{||W_{y_i}|| \; ||x_i||cos(\theta _{y_i,i})+b_{y_i}}}{\sum_je^{||W_j|| \; ||x_i||cos(\theta _{j,i})+b_j}}  \right )$$

where $\theta _{j,i} \quad (0 \leq \theta _{j,i}\leq \pi )$ denotes the angle between the vectors $W_j$ and $x_i$. The loss depends on $||W_j||$, $\theta _{j,i}$, and $b_j$; setting $||W_j||=1$ and $b_j=0$ yields the modified-softmax loss, which focuses on the angular information:

$$L_{modified-softmax}=-log\left (\frac{e^{||x_i||cos(\theta _{y_i,i})}}{\sum_je^{||x_i||cos(\theta _{j,i})}}  \right )$$

Although the modified-softmax loss can learn angularly discriminative features, the separation is still not strong enough. Therefore, the angle to the true class, $\theta _{y_i,i}$, is multiplied by an integer greater than 1 to increase the margin:

$$L_{ang}=-log\left (\frac{e^{||x_i||cos(m\theta _{y_i,i})}}{e^{||x_i||cos(m\theta _{y_i,i})}+\sum_{j\neq y_i}e^{||x_i||cos(\theta _{j,i})}}  \right )$$

In this way, inter-class distances are enlarged and intra-class distances reduced.

The paper's experiments visualize the learned features on a hypersphere for different values of $m$, with different colors for different classes. The A-Softmax loss maps the learned feature vectors onto a hypersphere; $m=1$ degenerates into the modified-softmax loss, where each class has a visible distribution but the separation is not obvious. As $m$ increases, the separation becomes clearer, but training also becomes harder.


Finally, for an implementation of this loss, refer to this blog.

Cosface

The Cosface loss was proposed by Tencent in 2018 in "CosFace: Large Margin Cosine Loss for Deep Face Recognition." It is also called the Large Margin Cosine Loss (LMCL). As the name suggests, it maximizes the margin in cosine space to enlarge inter-class distances and reduce intra-class distances.

Starting from softmax (as in Sphereface), the authors find that normalizing the weights, i.e. $||W_j||=1$, is essential for effective feature learning. Meanwhile, at test time the score of a face pair is usually computed from the cosine similarity of the two feature vectors, which suggests that $||x||$ contributes little to the score. Therefore, $||x||$ is fixed to $s$ during training (in the paper, $s=64$):

$$L_{ns}=\frac{1}{N}\sum_{i}-log\frac{e^{s \; cos(\theta _{y_i,i})}}{\sum_{j}e^{s \; cos(\theta _{j,i})}}$$

where ns stands for the normalized version of the softmax loss and $\theta _{j,i}$ is the angle between $W_j$ and $x_i$. To increase separability, a constant margin $m$ is introduced, as in Sphereface:

$$L_{lmc}=\frac{1}{N}\sum_{i}-log\frac{e^{s \; (cos(\theta _{y_i,i})-m)}}{e^{s \; (cos(\theta _{y_i,i})-m)}+\sum_{j\neq y_i}e^{s \; cos(\theta _{j,i})}}$$

where $W=\frac{W^*}{||W^*||}$, $x=\frac{x^*}{||x^*||}$, and $cos(\theta _{j,i})=W^T_jx_i$.

The paper illustrates the decision boundaries of the four losses. With the ordinary softmax loss, the classification boundaries of the two classes overlap, i.e. separability is weak. With the normalized softmax loss, the boundary is clear and the classes no longer overlap, but the margin is still insufficient. With A-Softmax, the coordinates become $\theta $, and two lines form the decision margin; the authors also point out that this loss is discontinuous. With Cosface, two lines in $cos(\theta )$ form the margin and the features of the two classes are disjoint; as $m$ grows the separation becomes more obvious, though training gets harder.


Arcface

The Arcface loss was proposed in "ArcFace: Additive Angular Margin Loss for Deep Face Recognition." Like Sphereface and Cosface, Arcface requires $||W||=1,||x||=s$ and also introduces a constant $m$; unlike the former two, however, $m$ is applied to the angle $\theta $ itself:

$$L_{arcface}=\frac{1}{N}\sum_{i}-log\frac{e^{s \; (cos(\theta _{y_i,i}+m))}}{e^{s \; (cos(\theta _{y_i,i}+m))}+\sum_{j\neq y_i}e^{s \; cos(\theta _{j,i})}}$$

Arcface is computed as follows: first normalize $x$ and $W$ and multiply them to obtain $cos(\theta _{j,i})$; recover the angle via $arccos(cos(\theta _{j,i}))$; add the constant $m$ to the angle of the true class to enlarge the margin, giving $\theta _{y_i,i}+m$; then compute $cos(\theta _{y_i,i}+m)$, multiply by the constant $s$, and finally apply the ordinary softmax loss.


Unifying Sphereface, Cosface, and Arcface gives the general form:

$$L=\frac{1}{N}\sum_{i}-log\frac{e^{s \; (cos(m_1\theta _{y_i,i}+m_2)-m_3)}}{e^{s \; (cos(m_1\theta _{y_i,i}+m_2)-m_3)}+\sum_{j\neq y_i}e^{s \; cos(\theta _{j,i})}}$$

At this point the loss can be tweaked freely. The authors found experimentally that on some datasets, $m_1=1,m_2=0.3,m_3=0.2$ and $m_1=0.9,m_2=0.4,m_3=0.15$ work well.
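A sketch of this unified form in NumPy for a single sample, assuming the features and weights are already normalized so that `cos_theta` holds $cos(\theta _{j,i})$ for every class; the function names, defaults, and example values are mine:

```python
import numpy as np

def unified_margin_logits(cos_theta, label, s=64.0, m1=1.0, m2=0.5, m3=0.0):
    # cos_theta: [num_classes] cosines W_j^T x for one normalized sample.
    # Target logit becomes cos(m1 * theta + m2) - m3; others stay cos(theta).
    theta = np.arccos(np.clip(cos_theta[label], -1.0, 1.0))
    logits = cos_theta.copy()
    logits[label] = np.cos(m1 * theta + m2) - m3
    return s * logits

def softmax_loss(logits, label):
    # -log softmax(logits)[label], computed stably
    z = logits - logits.max()
    return -z[label] + np.log(np.sum(np.exp(z)))

cos_theta = np.array([0.8, 0.3, -0.1])  # sample's cosine to each class weight
# Sphereface-style: m1 > 1; Arcface: m2 > 0; Cosface: m3 > 0
print(softmax_loss(unified_margin_logits(cos_theta, 0, m2=0.5), 0))            # Arcface
print(softmax_loss(unified_margin_logits(cos_theta, 0, m2=0.0, m3=0.35), 0))   # Cosface
```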

The authors also tried integrating Arcface into the triplet loss, but the improvement was not obvious.



This concludes the loss functions I commonly use in my work. I will keep following up and extending this article; thanks for reading.
