Machine learning - learning from imbalanced samples

1. Definition of sample imbalance

Ideally, in a classification task the classes are balanced, that is, the number of samples for each target value is roughly the same. In many real scenarios, however, the data cannot reach this ideal, and some problems are inherently imbalanced:
(1) In many scenarios the data set itself is uneven, with some categories having far more samples than others;
(2) In certain fixed scenarios, such as risk control, negative samples account for a much smaller proportion than positive samples;
(3) When the class sizes differ greatly, gradient descent is dominated by the majority class and the model has difficulty converging to a good solution (a small illustration follows this list).
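
As a toy illustration of point (3) — made-up numbers, not the scorecard data used later in this article — with a 99:1 class ratio a model that always predicts the majority class already reaches 99% accuracy, so an unweighted loss or accuracy metric gives the minority class almost no influence on training:

# Toy illustration (assumed 99:1 ratio, not this article's data)
import numpy as np

y_true = np.array([0] * 990 + [1] * 10)   # 990 majority samples, 10 minority samples
y_pred = np.zeros_like(y_true)            # a "model" that always predicts the majority class

print('accuracy:', (y_true == y_pred).mean())                  # 0.99 despite learning nothing
print('minority recall:', (y_pred[y_true == 1] == 1).mean())   # 0.0: every minority sample is missed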

2. Solutions

The solutions to imbalanced samples have different emphases in different scenarios. Taking financial risk control as an example:
(1) Exploration ("drop-down") method: approve some users who would otherwise be rejected and use their observed performance as negative samples; the drawbacks are obvious, namely high risk and high cost;
(2) Cost-sensitive learning: give the minority samples a larger weight so that the model trains in a more balanced way;
(3) Sampling methods: undersample the abundant positive samples, or oversample the negative samples, to balance the classes;
(4) Semi-supervised learning.

2.1 Cost Sensitive

By increasing the weight of the minority samples, the model can be trained in a more balanced way. Cost-sensitive weighting raises the contribution of the negative samples to the loss, but it adds no extra information to the model, so it cannot solve the problem of selection bias; on the other hand, it generally does not introduce negative effects either.
In logistic regression, the weights of the positive and negative samples can be adjusted with the parameter class_weight='balanced'. Taking the logistic regression scorecard as an example, we adjust the class_weight parameter of the logistic regression and look at the results. Link to this example: Logistic regression scorecard
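
As a small sketch of what class_weight='balanced' actually does (assuming x and y have already been loaded as in the code below), scikit-learn weights each class by n_samples / (n_classes * class_count), so the rarer class receives the larger weight:

# Sketch: reproduce the 'balanced' class weights by hand (y is the training-label Series defined below)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
print(dict(zip(classes, weights)))   # each weight equals 1 / (2 * class frequency) for two classes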

# Import modules
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, auc, recall_score

data = pd.read_csv('Bcard.txt')
feature_lst = ['person_info','finance_info','credit_info','act_info']
# Split the data: one month is held out as an out-of-time validation set
train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()
x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']

# Check the number of positive and negative samples
print('Training set:\n', y.value_counts())
print('Out-of-time validation set:\n', val_y.value_counts())

# Train the model
lr_model = LogisticRegression(C=0.1)
lr_model.fit(x, y)

# Training set
print('KS before adjusting class_weight')
y_pred = lr_model.predict_proba(x)[:,1]               # predicted probabilities on the training set
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)  # compute FPR and TPR
train_ks = abs(fpr_lr_train - tpr_lr_train).max()     # training-set KS
print('train_ks : ', train_ks)

# Validation set
y_pred = lr_model.predict_proba(val_x)[:,1]           # predicted probabilities on the validation set
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)          # compute FPR and TPR
val_ks = abs(fpr_lr - tpr_lr).max()                   # validation-set KS
print('val_ks : ', val_ks)


# Adjust the class_weight parameter of the logistic regression
print('KS after adjusting class_weight')
lr_model = LogisticRegression(C=0.1, class_weight='balanced')
lr_model.fit(x, y)
y_pred = lr_model.predict_proba(x)[:,1]               # predicted probabilities on the training set
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)  # compute FPR and TPR
train_ks = abs(fpr_lr_train - tpr_lr_train).max()     # training-set KS
print('train_ks : ', train_ks)
y_pred = lr_model.predict_proba(val_x)[:,1]           # predicted probabilities on the validation set
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)          # compute FPR and TPR
val_ks = abs(fpr_lr - tpr_lr).max()                   # validation-set KS
print('val_ks : ', val_ks)

As can be seen above, the KS values of both the training set and the out-of-time validation set improve to some extent after adjusting the parameter. (KS here is the Kolmogorov–Smirnov statistic, the maximum gap between TPR and FPR, computed in the code as abs(fpr - tpr).max().)

2.2 Oversampling

Cost-sensitive weighting is useful, but its effect is not guaranteed. Better results can often be achieved by oversampling the negative samples, that is, introducing more negative samples to the model. The common oversampling methods are:

  • Random oversampling: directly duplicate negative samples; the resulting model tends to generalize poorly.
  • SMOTE: the Synthetic Minority Oversampling Technique.

SMOTE oversampling

SMOTE is an oversampling technique that synthesizes new minority samples: it analyzes the existing minority samples, interpolates between them to artificially create new samples, and merges these into the data used for training. The basic steps are as follows (a small numpy sketch follows the list):
(1) Use a kNN search to find the k nearest neighbours of each minority sample;
(2) Randomly select N samples from the k nearest neighbours and perform random linear interpolation;
(3) This constructs the new minority samples;
(4) Merge the new samples with the original data to form the new training set.
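
A minimal numpy sketch of the interpolation step, on made-up toy points rather than the scorecard data: a new sample is placed at x_i + λ·(x_nn − x_i) for a random λ in [0, 1):

# Sketch of the SMOTE interpolation step on toy minority samples (not this article's data)
import numpy as np
from sklearn.neighbors import NearestNeighbors

minority = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]])  # toy minority-class points
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)  # +1 because each point is its own nearest neighbour
_, idx = nn.kneighbors(minority)

rng = np.random.default_rng(0)
i = rng.integers(len(minority))                    # pick a random minority sample
j = rng.choice(idx[i][1:])                         # pick one of its k nearest minority neighbours
lam = rng.random()                                 # random interpolation factor in [0, 1)
synthetic = minority[i] + lam * (minority[j] - minority[i])  # new sample on the segment between the two points
print(synthetic)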

SMOTETomek combined sampling

First oversample, then use the Tomek Links method to remove the borderline point pairs that are left "deadlocked" after the sample is enlarged. Sometimes the strict Tomek-link criterion is not even applied and all mutually close pairs are simply removed, because after oversampling the class 0 and class 1 sample sizes have already reached 1:1.
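
A minimal sketch of these two stages done by hand (assuming the same x and y as in the scorecard example; imblearn's SMOTETomek chains the same steps internally):

# Sketch: SMOTE oversampling followed by Tomek-link cleaning, in two explicit steps
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

x_smote, y_smote = SMOTE(random_state=0).fit_resample(x, y)      # balance the classes first
tl = TomekLinks(sampling_strategy='all')                          # remove both samples of each Tomek link
x_clean, y_clean = tl.fit_resample(x_smote, y_smote)
print(y_smote.value_counts())
print(y_clean.value_counts())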

Sample code for random oversampling, SMOTE sampling, and SMOTETomek combined sampling

Taking the scorecard data above as an example (the code for reading the data is omitted), import the sampling modules:

from imblearn.over_sampling import RandomOverSampler,SMOTE
from imblearn.combine import SMOTETomek

Resample the training data with the three oversampling methods:

# Random oversampling
ros = RandomOverSampler(random_state=0, sampling_strategy='auto')
x_ros, y_ros = ros.fit_resample(x, y)
print('Class counts after random oversampling:', y_ros.value_counts())

# SMOTE oversampling
sot = SMOTE(random_state=0)
x_sot, y_sot = sot.fit_resample(x, y)
print('Class counts after SMOTE oversampling:', y_sot.value_counts())

# SMOTETomek combined sampling
sttk = SMOTETomek(random_state=0)
x_sttk, y_sttk = sttk.fit_resample(x, y)
print('Class counts after SMOTETomek combined sampling:', y_sttk.value_counts())

Compare the KS, recall, and AUC values on the validation set after the three kinds of sampling:

data = [['Original data', x, y],
        ['Random oversampling', x_ros, y_ros],
        ['SMOTE oversampling', x_sot, y_sot],
        ['SMOTETomek combined sampling', x_sttk, y_sttk]]

# Train a logistic regression on each resampled data set and compare the results
lr = LogisticRegression(C=0.1)
for text, train_X, train_y in data:
    lr.fit(train_X, train_y)
    y_pred = lr.predict_proba(val_x)[:,1]         # predicted probabilities on the validation set
    pred_y = lr.predict(val_x)                    # predicted labels on the validation set
    rc_score = recall_score(val_y, pred_y)        # validation recall
    fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)  # validation FPR and TPR
    val_ks = abs(fpr_lr - tpr_lr).max()           # validation KS
    auc_score = auc(fpr_lr, tpr_lr)               # validation AUC
    print(text, 'ks', val_ks, 'auc', auc_score, 'recall', rc_score)

As the results show, all three oversampling methods can improve the model's scores, but no single method is always the right choice; different data sets will behave differently.

Machine Learning and Oversampling

We can also first fit the training data with a machine-learning model, exclude the samples whose predictions disagree most with their labels, and keep them out of the oversampling. The code is as follows; here we use the lightGBM algorithm, and the data is still the data used above.

# Fit the data with lightGBM, drop the samples that are predicted worst, then apply SMOTE oversampling
import lightgbm as lgb
import numpy as np

lgb_clf = lgb.LGBMClassifier(learning_rate=0.05, n_estimators=100)
lgb_clf.fit(x, y, eval_set=[(x, y), (val_x, val_y)], eval_metric='auc')
temp = x.copy()
temp['bad_ind'] = y
temp['pred'] = lgb_clf.predict_proba(x)[:,1]                          # predicted probability of being bad
temp = temp.sort_values(by=['pred'], ascending=False).reset_index()   # sort by predicted risk, highest first
temp['rank'] = np.array(temp.index) / len(temp)                       # percentile rank: 0 = highest risk, 1 = lowest
temp

We define a weight function to exclude the samples whose predictions are clearly inconsistent with their labels: good samples (bad_ind = 0) that fall in the top 20% of predicted risk, and bad samples (bad_ind = 1) that fall in the bottom 20%. Only the samples with weight 1 take part in the oversampling.

def weight(x, y):
    # x is the true label (bad_ind), y is the percentile rank of the predicted risk
    if x == 0 and y < 0.2:
        return 0.1   # good sample predicted as very high risk: down-weight
    elif x == 1 and y > 0.8:
        return 0.1   # bad sample predicted as very low risk: down-weight
    else:
        return 1

temp['weight'] = temp.apply(lambda x: weight(x.bad_ind, x['rank']), axis=1)
smote_sample = temp[temp.weight == 1]          # keep only the consistently predicted samples
print(smote_sample.shape)
train_X_smote = smote_sample[feature_lst]
train_y_smote = smote_sample['bad_ind']

Apply the same three sampling methods to the filtered data and compare them with the unsampled original data:

# Random oversampling
ros = RandomOverSampler(random_state=0, sampling_strategy='auto')
x_ros, y_ros = ros.fit_resample(train_X_smote, train_y_smote)
print('Class counts after random oversampling:', y_ros.value_counts())

# SMOTE oversampling
sot = SMOTE(random_state=0)
x_sot, y_sot = sot.fit_resample(train_X_smote, train_y_smote)
print('Class counts after SMOTE oversampling:', y_sot.value_counts())

# SMOTETomek combined sampling
sttk = SMOTETomek(random_state=0)
x_sttk, y_sttk = sttk.fit_resample(train_X_smote, train_y_smote)
print('Class counts after SMOTETomek combined sampling:', y_sttk.value_counts())

data = [['Original data', x, y],
        ['Random oversampling', x_ros, y_ros],
        ['SMOTE oversampling', x_sot, y_sot],
        ['SMOTETomek combined sampling', x_sttk, y_sttk]]

# Train a logistic regression on each resampled data set and compare the results
lr = LogisticRegression(C=0.1)
for text, train_X, train_y in data:
    lr.fit(train_X, train_y)
    y_pred = lr.predict_proba(val_x)[:,1]         # predicted probabilities on the validation set
    pred_y = lr.predict(val_x)                    # predicted labels on the validation set
    rc_score = recall_score(val_y, pred_y)        # validation recall
    fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)  # validation FPR and TPR
    val_ks = abs(fpr_lr - tpr_lr).max()           # validation KS
    auc_score = auc(fpr_lr, tpr_lr)               # validation AUC
    print(text, 'ks', val_ks, 'auc', auc_score, 'recall', rc_score)

As can be seen, excluding the poorly predicted samples and then oversampling yields a better model than oversampling the full training set directly.

Source: blog.csdn.net/gjinc/article/details/131963606