A Short Summary of the Data Imbalance Problem

1. On the class (sample) imbalance problem:

Original article: http://blog.csdn.net/heyongluoyao8/article/details/49408131

Concept: take binary classification as an example. Suppose class A has 90 samples and class B has 10. A classifier that predicts A for every sample reaches 90% accuracy, yet in practice it is useless, because it never identifies a single B sample.
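As a quick check of the example above, here is a minimal scikit-learn sketch (the 0/1 label encoding is just an assumption for illustration): the all-A classifier gets 90% accuracy while its recall on class B is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 samples of class A (label 0) and 10 samples of class B (label 1)
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)          # predict class A for every sample

print(accuracy_score(y_true, y_pred))               # 0.9
print(recall_score(y_true, y_pred, pos_label=1))    # 0.0 -- class B is never detected
```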

(1) Oversample the minority class and/or undersample the majority class.
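A minimal NumPy sketch of random over- and undersampling, assuming X and y are NumPy arrays and that label 1 is the minority class (both assumptions for illustration); libraries such as imbalanced-learn provide ready-made versions of these samplers, plus synthetic methods like SMOTE.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label=1):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

def random_undersample(X, y, minority_label=1):
    """Keep all minority samples plus an equal-sized random subset of the majority."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```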

(2) Add a penalty term, i.e. give the large and small classes different weights: when the network misclassifies a minority sample, it incurs an extra cost, so the model pays more attention to the minority class. (Somewhat more involved.)
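A small scikit-learn sketch of the reweighting idea; the 90/10 label vector is the toy example from above, and plugging the computed weights into a framework loss (e.g. a weighted cross-entropy) is only hinted at, not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Option 1: let scikit-learn weight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced")

# Option 2: compute the weights explicitly, e.g. to pass into a weighted loss.
y = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)   # ~[0.56, 5.0]: a mistake on class 1 costs about 9x more than on class 0
```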

(3) Treat the minority class as anomalies, i.e. recast the task as anomaly/outlier detection. (Not yet fully understood; a rough sketch is given after the reference list below.) References:

Recommended survey: Learning from class-imbalanced data: Review of methods and applications

Recommended papers:

Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks

In classification, how do you handle an unbalanced training set?

http://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set

Data Mining for Imbalanced Datasets: An Overview

Learning from Imbalanced Data

Addressing the Curse of Imbalanced Training Sets: One-Sided Selection (PDF) 

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data
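A minimal sketch of idea (3) above: fit an anomaly detector on the majority class only and, at prediction time, treat whatever it flags as an outlier as the minority class. The choice of IsolationForest, the contamination value, and the 0/1 labels are assumptions for illustration, not part of the original note.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_minority_as_anomaly(X, y, majority_label=0):
    """Fit an anomaly detector on majority-class samples only."""
    detector = IsolationForest(contamination=0.1, random_state=0)
    detector.fit(X[y == majority_label])
    return detector

def predict_minority(detector, X_new):
    # IsolationForest.predict returns +1 for inliers and -1 for outliers;
    # points flagged as outliers are predicted as the minority class (label 1).
    return (detector.predict(X_new) == -1).astype(int)
```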

(4) Suppose the majority class has L times as many samples as the minority class. Then, in stochastic gradient descent (SGD), train on each minority sample L times whenever it is encountered (a sketch follows after this list).
(5) Partition the majority-class samples into L clusters and train L classifiers, each one on a single majority cluster together with all minority samples. Classify unseen data by majority vote over the L classifiers; for continuous targets (regression), average the predictions instead (see the sketch after this list).
(6) Let the minority class contain N samples. Cluster the majority class into N clusters, use the N cluster centers as the majority-class training samples, and train on them together with all minority samples (see the sketch after this list).
Whichever of the methods above you use, one class or another is distorted to some degree. To avoid that, you can instead train several classifiers with different algorithms on the full training set and combine them, classifying unseen data by voting; for continuous targets (regression), average the outputs. A voting sketch is given below as well.
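A minimal sketch of method (4): repeating each minority sample L times in the training array means every SGD pass over the data visits it L times. The use of SGDClassifier, the 0/1 labels, and the commented-out X_train/y_train names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def repeat_minority(X, y, minority_label=1):
    """Repeat every minority sample L times, where L = majority size // minority size."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    L = max(1, len(majority) // len(minority))
    idx = np.concatenate([majority, np.repeat(minority, L)])
    return X[idx], y[idx]

# X_bal, y_bal = repeat_minority(X_train, y_train)
# clf = SGDClassifier().fit(X_bal, y_bal)   # SGDClassifier reshuffles every epoch,
#                                           # so the repeats are spread over each pass
```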
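A minimal sketch of method (5), assuming NumPy arrays, binary labels, and scikit-learn's KMeans and LogisticRegression as stand-ins for the clustering step and the L base classifiers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_ensemble_fit(X, y, L=5, minority_label=1):
    """Train one classifier per majority cluster, each paired with all minority samples."""
    X_min, X_maj = X[y == minority_label], X[y != minority_label]
    clusters = KMeans(n_clusters=L, n_init=10).fit_predict(X_maj)
    models = []
    for c in range(L):
        X_c = np.vstack([X_maj[clusters == c], X_min])
        y_c = np.concatenate([np.zeros(np.sum(clusters == c)), np.ones(len(X_min))])
        models.append(LogisticRegression(max_iter=1000).fit(X_c, y_c))
    return models

def cluster_ensemble_predict(models, X_new):
    """Majority vote over the L classifiers (average the raw outputs for regression)."""
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```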
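A minimal sketch of method (6): the N KMeans cluster centers stand in for the majority class (replacing real samples with synthetic centroids, which the original note leaves implicit):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_center_undersample(X, y, minority_label=1):
    """Replace the majority class with the centers of N clusters,
    where N is the number of minority samples."""
    X_min, X_maj = X[y == minority_label], X[y != minority_label]
    N = len(X_min)
    centers = KMeans(n_clusters=N, n_init=10).fit(X_maj).cluster_centers_
    X_bal = np.vstack([centers, X_min])
    y_bal = np.concatenate([np.zeros(N), np.ones(N)]).astype(int)
    return X_bal, y_bal
```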
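Finally, a sketch of the voting ensemble trained on the full, unmodified training set; the particular base models are arbitrary choices, and X_train/X_test are placeholder names:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hard voting over heterogeneous models, each trained on the full training set.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
# ensemble.fit(X_train, y_train)
# ensemble.predict(X_test)
# For continuous targets, VotingRegressor averages the member predictions instead.
```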

Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks

Fully convolutional deep neural networks carry out excellent potential for fast and accurate image segmentation. One of the main challenges in training these networks is data imbalance, which is particularly problematic in medical imaging applications such as lesion segmentation where the number of lesion voxels is often much lower than the number of non-lesion voxels. Training with unbalanced data can lead to predictions that are severely biased towards high precision but low recall (sensitivity), which is undesired especially in medical applications where false negatives are much less tolerable than false positives. Several methods have been proposed to deal with this problem including balanced sampling, two step training, sample re-weighting, and similarity loss functions. In this paper, we propose a generalized loss function based on the Tversky index to address the issue of data imbalance and achieve much better trade-off between precision and recall in training 3D fully convolutional deep neural networks. Experimental results in multiple sclerosis lesion segmentation on magnetic resonance images show improved F2 score, Dice coefficient, and the area under the precision-recall curve in test data. Based on these results we suggest Tversky loss function as a generalized framework to effectively train deep neural networks.
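A minimal PyTorch sketch of a soft binary Tversky loss along the lines described in the abstract; the binary single-foreground setting, the variable names, and the epsilon smoothing term are assumptions rather than the authors' exact multi-class formulation. With alpha = beta = 0.5 it reduces to a Dice-style loss; choosing beta > alpha penalizes false negatives more heavily, trading precision for recall.

```python
import torch

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Soft Tversky loss for binary segmentation.

    pred:   predicted foreground probabilities (any shape)
    target: binary ground-truth mask of the same shape
    alpha weights false positives, beta weights false negatives.
    """
    pred = pred.reshape(-1)
    target = target.reshape(-1).float()
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    tversky_index = tp / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky_index
```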

Reposted from blog.csdn.net/wu_x_j_/article/details/85301799