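What strategies can be used to handle imbalanced data? When the data are imbalanced, which metric should be used to judge how good a model is?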
1. When there are too few samples, use leave-one-out cross validation or 10-fold cross validation.
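As a rough illustration of the two schemes, here is a minimal sketch using only the Python standard library. The helper name `k_fold_indices` and the toy size of 25 samples are assumptions made for this example. Leave-one-out is treated as the special case where the number of folds equals the number of samples.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration; k == n_samples gives leave-one-out.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Toy data set of 25 samples (an assumed size, chosen only for the example).
ten_fold = list(k_fold_indices(25, k=10))
leave_one_out = list(k_fold_indices(25, k=25))
```

Every sample lands in exactly one test fold, so the model is evaluated once on each sample while still training on most of the data.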
2. When there are many samples (a large data set):
Suppose 25% of the samples are positive and 75% are negative. Run the algorithm 10 times; each time, randomly select from the negatives (in the source example, the patients who were not readmitted) so that the new sample set has a 1:1 ratio of positives to negatives.
For each of the 10 runs:
- Case 1: If your algorithm has several competing models, split the sample 50/25/25 into 50% training, 25% validation, and 25% test data. Use the validation set to find the best model, then test it on the test set.
- Case 2: If your algorithm does not have several competing models, you only need a training set and a test set (no validation set); in this case, split the sample 70/30.
- Within either case you can run 10-fold CV or leave-one-out cross validation, but that is only necessary if you have a smaller data set.
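The repeated undersampling procedure above could be sketched as follows. This is a minimal standard-library sketch, not a definitive implementation: the helper name `undersample_and_split`, the toy labels, and the use of the run number as the random seed are all assumptions. It also assumes the positives are the minority class, and the splits are plain random splits rather than stratified ones.

```python
import random

def undersample_and_split(labels, fractions, seed):
    """Balance classes 1:1 by sampling negatives, then split the indices.

    Hypothetical helper: `fractions` is e.g. (0.50, 0.25, 0.25) or (0.70, 0.30).
    Assumes there are at least as many negatives (0) as positives (1).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep every positive and draw an equal number of negatives without replacement.
    balanced = pos + rng.sample(neg, len(pos))
    rng.shuffle(balanced)
    parts, start = [], 0
    for frac in fractions[:-1]:
        size = int(round(frac * len(balanced)))
        parts.append(balanced[start:start + size])
        start += size
    parts.append(balanced[start:])  # the last part takes whatever remains
    return parts

# Toy labels with 25% positives and 75% negatives, as in the text.
labels = [1] * 25 + [0] * 75

for run in range(10):
    # Case 1: several competing models, so keep a validation set (50/25/25).
    train, val, test = undersample_and_split(labels, (0.50, 0.25, 0.25), seed=run)
    # Case 2: a single model, so only train and test (70/30).
    train2, test2 = undersample_and_split(labels, (0.70, 0.30), seed=run)
    # Fit and evaluate the model(s) here, then average the scores over the runs.
```

Because each run draws a different subset of negatives, averaging the 10 test scores uses more of the majority class than a single balanced sample would.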
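When the data are imbalanced, which metric should be used to judge how good a model is? AUC (Area Under the ROC Curve): the size of the area lying beneath the ROC curve. A larger AUC indicates better performance; 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.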
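To make the metric concrete, here is a minimal sketch that computes AUC from its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration. The pairwise loop is quadratic, so it is only meant for small examples.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75

# A model that gives every sample the same score cannot rank at all.
constant_auc = roc_auc([0, 0, 0, 1], [0.0, 0.0, 0.0, 0.0])   # 0.5
```

The second call shows why AUC suits imbalanced data: always predicting the majority class would reach 75% accuracy on those labels, yet its AUC stays at 0.5.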