Handling Imbalanced Samples

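There are two sampling approaches for imbalanced data: oversampling and undersampling.

Undersampling

First count the abnormal (minority) samples, then randomly select the same number of samples from the normal (majority) data, so that both classes end up the same size. Finally, combine the selected normal samples with the abnormal samples.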

Oversampling

Oversampling uses a sample-generation strategy to expand the class with fewer samples, which likewise balances the dataset.
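The simplest generation strategy is to duplicate randomly chosen minority samples. The following is a minimal sketch using only the standard library; the function name `random_oversample` and the toy data are illustrative assumptions, not part of the original post.

```python
import random
from collections import Counter

def random_oversample(samples, labels, minority_label, target_count, seed=0):
    """Duplicate randomly chosen minority samples until that class has target_count rows."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    n_extra = max(0, target_count - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    new_samples = list(samples) + extra
    new_labels = list(labels) + [minority_label] * n_extra
    return new_samples, new_labels

# Hypothetical toy data: five samples of class 0, one sample of class 1.
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [0, 0, 0, 0, 0, 1]
X_over, y_over = random_oversample(samples, labels, minority_label=1, target_count=5)
print(Counter(y_over))
```

Plain duplication adds no new information, which is one reason interpolation-based methods such as SMOTE are often tried as well.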

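Sampling rules of thumb

1. Consider undersampling a large class (more than ten thousand, a hundred thousand, or even more samples), i.e. deleting some of its samples;

2. Consider oversampling a small class (fewer than ten thousand samples, or even fewer), i.e. adding copies of some of its samples;

3. Consider trying both random and non-random sampling methods;

4. Consider trying different sampling ratios between classes. The ratio does not have to be 1:1, and 1:1 sometimes works worse because it is far from the real-world distribution;

5. Consider using oversampling and undersampling together.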
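Rules 4 and 5 can be sketched as a single step that resamples every class to its own target count. This is a minimal, dependency-free illustration; the helper name `resample_to_counts`, the class sizes, and the 3:1 target are assumptions chosen for the example.

```python
import random
from collections import Counter

def resample_to_counts(labels, target_counts, seed=0):
    """Return row indices so that each class ends up with target_counts[class] rows.

    Classes above their target are undersampled without replacement;
    classes below it keep every row and add random duplicates.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    chosen = []
    for y, idx in by_class.items():
        target = target_counts.get(y, len(idx))  # unlisted classes are left unchanged
        if target <= len(idx):
            chosen.extend(rng.sample(idx, target))
        else:
            chosen.extend(idx + rng.choices(idx, k=target - len(idx)))
    return sorted(chosen)

# Hypothetical label vector: 1000 rows of class 0 and 50 rows of class 1 (20:1).
labels = [0] * 1000 + [1] * 50
# Shrink class 0 to 300 and grow class 1 to 100, giving 3:1 rather than 1:1.
indices = resample_to_counts(labels, {0: 300, 1: 100})
print(Counter(labels[i] for i in indices))
```

The returned indices can be used to select rows from the feature matrix and the label vector alike, so both stay aligned.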

Undersampling methods

1. Manual undersampling

# Use the minority class (target == 1) as the baseline and randomly pick the same number of target == 0 rows to build a balanced sample
import numpy as np

def downsample(data_train):
    # Index labels and row count of the minority class (target == 1)
    buy_indices = np.array(data_train[data_train.target == 1].index)
    number_records_buy = len(buy_indices)

    # Index labels of the majority class (target == 0)
    nobuy_indices = data_train[data_train.target == 0].index

    # Randomly select number_records_buy majority rows without replacement
    random_nobuy_indices = np.random.choice(nobuy_indices, number_records_buy, replace=False)

    # Combine the two sets of index labels
    under_sample_indices = np.concatenate([buy_indices, random_nobuy_indices])

    # Build the undersampled dataset; .loc is required because these are index labels, not positions
    x_under_train = data_train.loc[under_sample_indices, :]
    print(x_under_train.shape)
    return x_under_train

2. Sampling with the imblearn library

Installing imblearn

pip:        pip install -U imbalanced-learn
anaconda:   conda install -c conda-forge imbalanced-learn

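Undersample so that classes 0 and 1 both have 5000 samples (each class must have at least 5000 samples to begin with)
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
# Recent imbalanced-learn releases use sampling_strategy in place of ratio and fit_resample in place of fit_sample
under = RandomUnderSampler(sampling_strategy={0: 5000, 1: 5000}, random_state=0)
X_under, y_under = under.fit_resample(X_t, y_t)
print(Counter(y_under))
X_under1, X_under2, y_under1, y_under2 = train_test_split(X_under, y_under, random_state=0, test_size=0.2)
print(X_under1.shape)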

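For oversampling, this post uses the SMOTE method from imblearn (the library also provides other oversamplers, such as RandomOverSampler and ADASYN)

from imblearn.over_sampling import SMOTE
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(train_data, train_lable)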

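# The sampling_strategy parameter (called ratio in older releases) sets how many samples each class should have after resampling
smo = SMOTE(sampling_strategy={1: 300}, random_state=42)
# Class 1 has 300 samples after resampling, which is a 3:1 ratio of 0 to 1 when class 0 has 900 samples
X_smo, y_smo = smo.fit_resample(X, y)
print(Counter(y_smo))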

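How SMOTE generates samples

(1) For each sample x in the minority class, compute its Euclidean distance to every other sample in the minority class and take its k nearest neighbors.

(2) Set a sampling rate according to the imbalance ratio. For each minority sample x, randomly select several samples from its k nearest neighbors; call a selected neighbor xn.

(3) For each randomly selected neighbor xn, build a new sample from xn and the original sample x using the following formula.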

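$x_{new} = x + rand(0,1) \times (x_n - x)$

$x_n$ is the selected neighbor xn, and $rand(0,1)$ is a random number between 0 and 1, so $x_{new}$ lies on the line segment between x and $x_n$.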
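The three steps can be sketched in a few lines of plain Python. This is an illustrative simplification rather than imblearn's implementation: the function name `smote_sketch`, its parameters, and the toy points are assumptions, and each original sample yields a fixed number of synthetic points.

```python
import math
import random

def smote_sketch(minority, k=3, n_per_sample=1, seed=0):
    """Create synthetic points by moving each minority sample toward one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: rank the other minority samples by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            # Step 2: pick one neighbor x_n at random.
            x_n = rng.choice(neighbors)
            # Step 3: x_new = x + rand(0,1) * (x_n - x)
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic

# Hypothetical 2-D minority samples.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]]
new_points = smote_sketch(minority, k=2, n_per_sample=2)
print(len(new_points))
```

Because every new point is an interpolation between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.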
