[Reading Notes] Training Deep Neural Networks on Imbalanced Data Sets

Published: 2016
This post summarizes how, when the data is imbalanced, the loss function can be adjusted so that the desired evaluation metric improves.
That said, MFE feels quite similar to resampling; it essentially amounts to re-weighting the samples.

Abstract

Current studies on deep learning mainly focus on data sets with balanced class labels, while its performance on imbalanced data is not well examined.
Imbalanced data sets exist widely in real world and they have been providing great challenges for classification tasks.
A novel loss function called mean false error together with its improved version mean squared false error are proposed for the training of deep networks on imbalanced data sets.
The proposed method can effectively capture classification errors from both majority class and minority class equally.

I. INTRODUCTION

Classification can be categorized into binary classification and multi-class classification.
This paper mainly focuses on the binary classification problem, and the experimental data sets are binary-class ones (a multi-class problem can generally be transformed into a binary-class one by binarization).

AN EXAMPLE OF CONFUSION MATRIX

True Class \ Predicted Class     P     N    Total
P'                              86     4       90
N'                               5     5       10
Total                           91     9

mean squared error (MSE): $loss_{MSE} = \frac{4 + 5}{90 + 10} = 0.09$
mean false error (MFE): $loss_{MFE} = \frac{4}{90} + \frac{5}{10} \approx 0.54$
mean squared false error (MSFE): $loss_{MSFE} = \left(\frac{4}{90}\right)^2 + \left(\frac{5}{10}\right)^2 \approx 0.25$
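To make the arithmetic explicit, here is a minimal Python sketch (not from the paper) that recomputes the three example losses from the confusion-matrix counts:

```python
# Counts from the confusion matrix above (illustrative only).
FN, P = 4, 90     # positives misclassified / total positives
FP, N = 5, 10     # negatives misclassified / total negatives

loss_mse = (FN + FP) / (P + N)          # 9 / 100 = 0.09
fne, fpe = FN / P, FP / N               # per-class false error rates
loss_mfe = fne + fpe                    # 4/90 + 5/10  ≈ 0.54
loss_msfe = fne ** 2 + fpe ** 2         # (4/90)^2 + (5/10)^2 ≈ 0.25
print(loss_mse, loss_mfe, loss_msfe)
```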

The contributions of this paper are summarized as:

  • Two novel loss functions are proposed to solve the data imbalance problem in deep networks.
  • The advantages of the proposed loss functions over the commonly used MSE are analyzed theoretically. (Theoretically superior to MSE.)
  • The effect of the proposed loss functions on the back-propagation process of deep learning is analyzed by examining relations for propagated gradients. (Analyzes the back-propagation process.)
  • An empirical study on real-world data sets is conducted to validate the effectiveness of the proposed loss functions. (Validated on real-world data.)

Until now, this issue has been addressed mainly in three ways: sampling techniques, cost-sensitive methods, and hybrid methods combining the two.

A. Sampling techniques

Random oversampling randomly duplicates a certain number of samples from the minority class and then adds them to the original data set. Drawback: may lead to overfitting.
Under-sampling randomly removes a certain number of instances from the majority class to achieve a balanced data set. Drawback: may lose some important information.
The synthetic minority oversampling technique (SMOTE) creates artificial data based on the similarities between existing minority samples. Although SMOTE has shown many promising benefits, some drawbacks still exist, such as over-generalization and high variance.
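For concreteness, a minimal NumPy sketch of random oversampling (the function name and its defaults are illustrative, not from the paper):

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes have the same size."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    n_extra = max(len(majority) - len(minority), 0)
    extra = rng.choice(minority, size=n_extra, replace=True)  # duplicates, hence the overfitting risk
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```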

B. Cost-sensitive learning

Cost sensitive learning assigns different cost values for the misclassification of the samples.
Though cost-sensitive algorithms can significantly improve the classification performance, they are only applicable when the specific cost values of misclassification are known. (Prior knowledge is required: one must know that misclassifying the minority class incurs a larger loss in order to weight the loss accordingly.)
In addition, it would be quite challenging and even impossible to determine the cost of misclassification in some particular domains.
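To illustrate the idea (this is not a method from the paper), a cost-sensitive variant of MSE simply weights each sample's error by a class-dependent cost:

```python
import numpy as np

def cost_weighted_mse(d, y, cost_pos=10.0, cost_neg=1.0):
    """Squared error weighted by a per-class misclassification cost.
    The cost values here are placeholders; in practice they must come
    from domain knowledge, which is exactly the limitation noted above."""
    costs = np.where(d == 1, cost_pos, cost_neg)
    return np.mean(costs * 0.5 * (d - y) ** 2)
```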

C. Imbalance problem in neural networks

Very little literature on the imbalance problem in deep networks exists so far.

III. PROBLEM FORMULATION

A. MSE loss:

$l = \frac{1}{M} \sum_{i} \sum_{n} \frac{1}{2} \left( d_n^{(i)} - y_n^{(i)} \right)^2$

where M is the total number of samples, $d_n^{(i)}$ and $y_n^{(i)}$ are the desired and actual values of the n-th output neuron for sample i, and $o_n^{(i)}$ (used below) denotes the input to that output neuron.

back-propagation: $\frac{\partial l}{\partial o_n^{(i)}} = -\left( d_n^{(i)} - y_n^{(i)} \right) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}}$
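As a reference point, a minimal NumPy sketch of the MSE loss and its output-layer gradient, assuming a sigmoid output unit (so ∂y/∂o = y(1 − y)); this is an illustration, not the authors' code:

```python
import numpy as np

def mse_loss_and_grad(d, y):
    """d: desired targets, y: sigmoid outputs, both of shape (M,).
    Returns the MSE loss and the gradient dl/do for each output unit,
    using dy/do = y * (1 - y) for a sigmoid; the constant 1/M factor
    is left out of the gradient, matching the expression above."""
    loss = np.mean(0.5 * (d - y) ** 2)
    grad_o = -(d - y) * y * (1 - y)
    return loss, grad_o
```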

B. MFE loss:

$FPE = \frac{1}{N} \sum_{i=1}^{N} \sum_{n} \frac{1}{2} \left( d_n^{(i)} - y_n^{(i)} \right)^2$

$FNE = \frac{1}{P} \sum_{i=1}^{P} \sum_{n} \frac{1}{2} \left( d_n^{(i)} - y_n^{(i)} \right)^2$

$l = FPE + FNE$

where FPE and FNE are the mean false positive error and the mean false negative error, respectively; they capture the errors on the negative class and the positive class correspondingly (N and P are the numbers of negative and positive samples).

back-propagation:

$\frac{\partial l}{\partial o_n^{(i)}} = -\frac{1}{N} \left( d_n^{(i)} - y_n^{(i)} \right) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} \quad \text{if } i \in N$

$\frac{\partial l}{\partial o_n^{(i)}} = -\frac{1}{P} \left( d_n^{(i)} - y_n^{(i)} \right) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} \quad \text{if } i \in P$
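A minimal NumPy sketch of MFE in the same spirit (one output per sample and a boolean `pos_mask` are simplifying assumptions, not part of the paper):

```python
import numpy as np

def mfe_loss(d, y, pos_mask):
    """MFE = FPE + FNE: the squared error is averaged separately over the
    negative samples (FPE) and the positive samples (FNE), then summed.
    pos_mask is a boolean array marking the positive samples."""
    err = 0.5 * (d - y) ** 2
    fne = err[pos_mask].mean() if pos_mask.any() else 0.0
    fpe = err[~pos_mask].mean() if (~pos_mask).any() else 0.0
    return fpe + fne
```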

C. MSFE loss:

$l = FPE^2 + FNE^2 = \frac{1}{2} \left[ \left( FPE + FNE \right)^2 + \left( FPE - FNE \right)^2 \right]$

Reducing this loss is thus equivalent to reducing FPE and FNE simultaneously while also shrinking the gap between them.

back-propagation:

$\frac{\partial l}{\partial o_n^{(i)}} = -\frac{2 \, FPE}{N} \left( d_n^{(i)} - y_n^{(i)} \right) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} \quad \text{if } i \in N$

$\frac{\partial l}{\partial o_n^{(i)}} = -\frac{2 \, FNE}{P} \left( d_n^{(i)} - y_n^{(i)} \right) \frac{\partial y_n^{(i)}}{\partial o_n^{(i)}} \quad \text{if } i \in P$
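Building on the same pieces, a sketch of MSFE together with the per-class gradient scaling shown above (again an illustration under the sigmoid-output assumption, not the authors' implementation):

```python
import numpy as np

def msfe_loss_and_grad(d, y, pos_mask):
    """MSFE = FPE^2 + FNE^2, with the output-layer gradient scaled by
    2*FPE/N on negative samples and 2*FNE/P on positive samples, as in
    the expressions above; sigmoid outputs (dy/do = y*(1-y)) are assumed."""
    err = 0.5 * (d - y) ** 2
    P = max(int(pos_mask.sum()), 1)
    N = max(int((~pos_mask).sum()), 1)
    fne = err[pos_mask].sum() / P
    fpe = err[~pos_mask].sum() / N
    loss = fpe ** 2 + fne ** 2

    base = -(d - y) * y * (1 - y)                         # gradient of the squared-error term
    scale = np.where(pos_mask, 2 * fne / P, 2 * fpe / N)  # per-class factor from FPE^2 + FNE^2
    return loss, scale * base
```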

V. EXPERIMENTS AND RESULTS

In terms of AUC, using MSE as the loss performs noticeably worse; MFE and MSFE perform similarly, and both are considerably better than MSE.

VI. CONCLUSIONS

In future work, we will explore the effectiveness of our proposed loss functions on different network structures like DBN and CNN.

Reposted from blog.csdn.net/SrdLaplace/article/details/80906437