Methods for Handling Imbalanced Data

Oversampling adds samples to the classes that have few of them, so that the sample counts across labels become balanced.

1. Random oversampling

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

2. SMOTE oversampling

from collections import Counter  # needed for the class-count printout below
from imblearn.over_sampling import SMOTE, ADASYN
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

3. ADASYN oversampling

X_resampled, y_resampled = ADASYN().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

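The three methods differ as follows:

Random: for a minority class a, randomly draw existing samples of a, with replacement, until (count of majority class b minus count of a) extra copies have been added.

SMOTE: for a minority-class sample a, randomly pick one of its nearest neighbors b from the same class, then randomly pick a point c on the line segment between a and b as a new minority-class sample.

ADASYN: concentrates on generating new minority-class samples near the original samples that a K-nearest-neighbors classifier misclassifies.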
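The SMOTE interpolation step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not imbalanced-learn's implementation: the helper name `smote_like_sample`, the toy points, and the choice of `k=2` are all assumptions made for this example.

```python
import math
import random

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample a and one of its k nearest minority-class neighbours b."""
    rng = rng or random.Random()
    a = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to a.
    others = [p for p in minority if p is not a]
    others.sort(key=lambda p: math.dist(a, p))
    b = rng.choice(others[:k])
    # Pick c on the segment between a and b: c = a + t * (b - a), 0 <= t < 1.
    t = rng.random()
    c = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
    return a, b, c

# Hypothetical 2-D minority-class points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
a, b, c = smote_like_sample(minority, k=2, rng=random.Random(0))
print("a =", a, "b =", b, "synthetic c =", c)
```

Because c always lies between two existing minority samples, the new points stay inside the region the minority class already covers, unlike random oversampling, which only duplicates existing points.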

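Undersampling trims the over-represented classes in order to balance the samples. With oversampling the total number of samples grows; with undersampling it shrinks.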
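As a rough worked example, the class counts reported for the example dataset at the end of this post give the following totals. The calculation assumes that the resampler fully equalizes all classes (to the majority count when oversampling, to the minority count when undersampling); other sampling strategies would give different numbers.

```python
from collections import Counter

# Class counts reported for the example dataset in this post.
counts = Counter({2: 4674, 1: 262, 0: 64})

n_classes = len(counts)
majority = max(counts.values())
minority = min(counts.values())

# Oversampling: every class is raised to the majority count.
total_after_oversampling = n_classes * majority
# Undersampling: every class is cut down to the minority count.
total_after_undersampling = n_classes * minority

print(sum(counts.values()), total_after_oversampling, total_after_undersampling)
# 5000 14022 192
```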

There are likewise several undersampling methods:

4. Random undersampling

# Random undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

Random undersampling balances the data by randomly selecting a subset of the samples from the targeted classes.
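The selection step can be sketched without any library. The function `random_undersample` and the toy data below are hypothetical names for this illustration; the sketch assumes every class is reduced to the size of the smallest class and that samples are drawn without replacement.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the smallest class is kept whole.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 7 samples of class 0 and 3 samples of class 1.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 7 + [1] * 3
X_small, y_small = random_undersample(X_toy, y_toy)
print(sorted(Counter(y_small).items()))
# [(0, 3), (1, 3)]
```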

5. Prototype generation

# Undersampling with cluster centroids
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

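ClusterCentroids uses K-means to reduce the number of samples. Each undersampled class is therefore represented by K-means centroids rather than by its original samples.

ClusterCentroids offers an efficient way to represent a cluster of data with fewer samples. However, the method assumes that the data actually group into clusters. In addition, the number of centroids should be chosen so that the undersampled clusters remain representative of the original ones.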
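To make the idea concrete, here is a toy sketch of prototype generation: run K-means on the majority class and keep only the centroids. The `kmeans_centroids` helper is a bare-bones Lloyd's iteration written for this example, with a naive initialisation assumed for reproducibility. It is not how imbalanced-learn implements ClusterCentroids, and the points are made up.

```python
import math

def kmeans_centroids(points, k, n_iter=20):
    """Toy Lloyd's algorithm: return k centroids for a list of tuples."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

# Hypothetical data: 6 majority samples in two groups, 2 minority samples.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
minority = [(2.0, 2.0), (2.5, 2.5)]

# Replace the majority samples by as many centroids as there are minority samples.
majority_reduced = kmeans_centroids(majority, k=len(minority))
print(len(majority), "->", len(majority_reduced))
# 6 -> 2
```

Note that the resulting points are generally not members of the original dataset, which is what distinguishes prototype generation from selecting a subset of existing samples.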

6. Combined sampling

# Two methods that combine over- and undersampling
#################################
from collections import Counter
 
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

#################################

from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=0)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

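SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers. This problem can be addressed by cleaning the space that results from oversampling.

Two tools that combine oversampling with this kind of cleaning are SMOTETomek and SMOTEENN.

SMOTEENN tends to remove more noisy samples than SMOTETomek.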
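For intuition about the cleaning step, a Tomek link is a pair of samples from different classes that are each other's nearest neighbor, and cleaning based on Tomek links removes samples that take part in such pairs. The brute-force sketch below only detects the links. The function name `tomek_links` and the toy data are assumptions for illustration, and this is not the library's implementation.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Toy data: samples 2 and 3 sit right next to each other but differ in label.
X_toy = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
y_toy = [0, 0, 0, 1, 1]
print(tomek_links(X_toy, y_toy))
# [(2, 3)]
```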

Appendix: dataset generation code (run this first to define X and y)

from collections import Counter  # needed for the class-count printout below
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print(sorted(Counter(y).items()))

Check the sample count for each label

# Check the sample count for each label
from collections import Counter
Counter(y)
 
# Output
Counter({2: 4674, 1: 262, 0: 64})