The Class Imbalance Problem

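Medical data is often highly skewed: for example, only a small fraction of people have heart disease, while the large majority do not. This is the class distribution imbalance problem, in which samples are unevenly distributed across classes.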

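What strategies can be used to handle imbalanced data? And when the data is imbalanced, which metric should be used to judge how good a model is?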


1. When the data set is small, use leave-one-out cross-validation or 10-fold cross-validation.
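As a rough illustration of the two schemes named in item 1, the sketch below builds the train/test index splits for k-fold cross-validation in plain Python. The helper name `k_fold_indices` is hypothetical, and the sample count of 25 is an arbitrary choice. Leave-one-out is treated as the special case where k equals the number of samples. The indices are not shuffled here; in practice you would probably shuffle them first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering 0..n_samples-1."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Deal the indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 10-fold cross-validation on a small data set of 25 samples.
ten_fold = k_fold_indices(25, 10)

# Leave-one-out: one fold per sample, so each test set holds a single index.
leave_one_out = k_fold_indices(25, 25)
```

Each sample lands in exactly one test fold, so every sample is used for testing once and for training in all the other rounds.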

2. When the data set is large: suppose 25% of the samples are positive and 75% are negative. Run the algorithm 10 times, each time randomly selecting from the negatives (in the original example, the patients who were not readmitted) so that the new sample contains positives and negatives at a 1:1 ratio.
For each of the 10 runs:


  • Case 1: If your algorithm has several competing models, divide the sample 50/25/25 into 50% training, 25% validation, and 25% test data. Use the validation set to find the best model, and then evaluate it on the test set.
  • Case 2: If your algorithm does not have several competing models, you only need a training set and a test set (no validation set). In this case divide the sample 70/30.
  • Within either case you can also run 10-fold or leave-one-out cross-validation, but that is only necessary if you have a smaller data set.


  • Average the results across the 10 runs.
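The procedure in item 2 can be sketched roughly as follows, using only the standard library. All names here (`undersample_balanced`, `split_train_test`, `average_over_runs`, `positive_share`) are hypothetical, the 25/75 toy data mirrors the example proportions above, and `positive_share` is only a placeholder for whatever train-and-score step your own algorithm provides. The sketch follows case 2 (a 70/30 split with no validation set) and assumes there are at least as many negatives as positives.

```python
import random


def undersample_balanced(positives, negatives, rng):
    """Keep every positive and draw an equal number of negatives at random."""
    sampled_negatives = rng.sample(negatives, len(positives))
    return positives + sampled_negatives


def split_train_test(samples, train_fraction, rng):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def average_over_runs(positives, negatives, evaluate, n_runs=10, seed=0):
    """Repeat undersample -> 70/30 split -> evaluate, then average the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        balanced = undersample_balanced(positives, negatives, rng)
        train, test = split_train_test(balanced, 0.7, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


def positive_share(train, test):
    """Placeholder score: the share of positives in the test split.

    Replace this with code that fits a model on `train` and scores it on `test`.
    """
    return sum(label for _, label in test) / len(test)


# Toy data: 25% positives, 75% negatives, stored as (sample_id, label) pairs.
positives = [(i, 1) for i in range(25)]
negatives = [(i, 0) for i in range(75)]
mean_score = average_over_runs(positives, negatives, positive_share)
```

Because a different random subset of negatives is drawn in each run, averaging over the 10 runs makes the final number less dependent on which negatives happened to be selected.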




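  • When the data is imbalanced, which metric should be used to judge the model? AUC (Area Under the ROC Curve): the size of the area lying beneath the ROC curve. A larger AUC indicates better performance.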
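For intuition, AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by comparing every positive-negative pair. The function name `auc_score` and the toy labels and scores are made up for illustration. This pairwise form is quadratic in the number of samples, so it is meant for understanding rather than for large data sets.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 positives, 6 negatives.
labels = [1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.4, 0.1, 0.35, 0.05]
auc = auc_score(labels, scores)
```

A model that gives every sample the same score gets 0.5 under this definition, whatever the class ratio, which is one reason AUC is less misleading than plain accuracy on imbalanced data.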


    Reposted from fenglei.iteye.com/blog/2201853