Summary of BAT machine learning feature engineering work experience (1): how to solve the problem of data imbalance (with Python code)

Many people are curious about what machine learning algorithm engineers at BAT (Baidu, Alibaba, Tencent) actually do day to day. In practice, most of them spend their time running data jobs: map-reduce, Hive SQL, grunt work in the data warehouse, data cleaning, data cleaning, and more data cleaning, plus business analysis, case analysis, and an endless search for features. Only a small number of data scientists work on complex models. At Alibaba, for example, algorithm engineers are expected to dig into business scenarios and find effective features based on the business. A feature iteration can be completed within two weeks, while a small model optimization that improves AUC takes about a month. Features therefore matter a great deal: Tencent's and Alibaba's models perform as well as they do largely because of feature engineering.

A feature iteration means adding new, combined, or revised features to the model in order to improve its results.

In practice, the feature engineering process generally goes as follows:

1. Data collection
2. Data cleaning
3. Data sampling
4. Feature processing
5. Feature selection

This summary focuses on feature processing and feature selection, especially feature processing. Besides handling the data of different feature types, it also covers how to find and build combined features.


Data collection
Before collecting data, clarify which data to collect. The general questions are: which data will help predict the final result? Can we actually collect this data? Can it be retrieved quickly enough for online real-time computation?

For example, if I want to recommend products to users, what information do I need to collect?
- Stores: store ratings, store categories, ...
- Products: product ratings, number of buyers, color, material, collar shape, ...
- Users: historical information (lowest and highest prices of purchased products), spending power, time spent viewing products, ...


Data cleaning
Data cleaning is also an important step. Its purpose is to remove dirty data.

So how do we identify dirty data?

  1. Simple attribute checks: a person who is more than 3 meters tall; a person who spends 10w (100,000) on cards in a single month.
  2. Combined or statistical attribute checks: a news-reading user who claims to be in the United States but whose IP address is always in mainland China; a data set for predicting whether a person will buy basketball shoes in which 85% of the samples are female users.
  3. Handling missing values: fill in missing values where a sensible value exists, discard untrustworthy samples, and consider dropping fields in which most values are missing.
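
The checks above can be sketched with pandas. This is only a minimal illustration: the table, the column names (`height_m`, `monthly_spend`), and the thresholds are assumptions made up for the example, and real rules would come from the business context.

```python
import pandas as pd

# Hypothetical user records; column names and values are illustrative assumptions.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "height_m": [1.72, 3.10, 1.65, None],
    "monthly_spend": [350.0, 420.0, 100000.0, 80.0],
})

def drop_dirty_rows(df, max_height_m=2.5, max_monthly_spend=50000.0):
    """Keep rows whose single-attribute values fall in a plausible range."""
    # A missing height is not treated as dirty here, only an implausible one.
    plausible_height = df["height_m"].isna() | (df["height_m"] <= max_height_m)
    plausible_spend = df["monthly_spend"] <= max_monthly_spend
    return df[plausible_height & plausible_spend]

def drop_sparse_columns(df, max_missing_ratio=0.5):
    """Drop columns in which more than max_missing_ratio of the values are missing."""
    missing_ratio = df.isna().mean()
    return df.loc[:, missing_ratio <= max_missing_ratio]

clean = drop_sparse_columns(drop_dirty_rows(users))
print(clean)
```

Here the 3.10 m user and the user with a 100,000 monthly spend are dropped, while the row with a missing height is kept for later imputation.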

Data sampling
After the data has been collected and cleaned, the positive and negative samples are often unbalanced, so data sampling is required.

Since the proportion of positive samples in the population is too low, the natural fix is to resample from the population so that positive samples make up a larger share of the modeling sample.

As for what counts as a positive or negative sample: in a face recognition application in a specific environment, such as recognizing the faces of middle school students in a classroom, the faces are the positive samples, while the classroom walls, windows, bodies, clothes, and so on are negative samples.

Oversampling and undersampling are the most common methods: the former increases the number of positive samples, and the latter reduces the number of negative samples. If the absolute number of positive samples in the population is too small, all positive samples can be included and only some of the negative samples drawn to build the modeling sample. This idea is actually a combination of oversampling and undersampling.

If the number of positive samples is large, undersampling (also called downsampling) can be used.
If the number of positive samples is small, more options are available: oversampling by directly copying positive samples, or the SMOTE algorithm, which generates new synthetic positive samples.
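
To make the difference between copying and generating samples concrete, here is a minimal NumPy sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The function name `smote_like_samples` and its parameters are hypothetical, and this is a simplified illustration rather than the imblearn implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (requires k < len(X))."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))             # pick a minority sample
        nb = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                      # position on the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic

# Four made-up positive samples with two features each.
X_pos = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(X_pos, n_new=6)
print(new_points.shape)  # (6, 2)
```

Because every new point lies between two existing positive samples, the synthetic data stays inside the region the positives already occupy instead of repeating the same rows.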

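# Undersampling and oversampling implemented with index-based selection
import numpy as np
import pandas as pd

# Downsampling / undersampling
def lower_sample_data(df, percent=1):
    '''
    percent: ratio of positive to negative samples; for example, 0.6 means
    6 positive samples for every 10 negative samples
    '''
    data1 = df[df['Label'] == 1]  # minority class: positive samples
    data0 = df[df['Label'] == 0]  # majority class: negative samples
    # number of negatives to keep so that positives / negatives == percent
    n_keep = min(len(data0), int(len(data1) / percent))
    # randomly pick the row positions of the negatives to keep, without replacement
    index = np.random.choice(len(data0), size=n_keep, replace=False)
    lower_data0 = data0.iloc[index]  # undersampled negatives
    return pd.concat([lower_data0, data1])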

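# Oversampling
def over_sample_data(df, percent=1):
    '''
    percent: ratio of positive to negative samples; for example, 0.6 means
    6 positive samples for every 10 negative samples
    '''
    data1 = df[df['Label'] == 1]  # minority class: positive samples
    data0 = df[df['Label'] == 0]  # majority class: negative samples
    # number of extra positives needed so that positives / negatives == percent
    n_extra = max(0, int(percent * len(data0)) - len(data1))
    # randomly pick the row positions of the positives to copy (with replacement)
    index = np.random.randint(len(data1), size=n_extra)
    over_data1 = data1.iloc[index]  # copied positives
    return pd.concat([over_data1, data1, data0])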

--------------------- 
# Oversampling, undersampling, and SMOTE with the imblearn library

# Oversampling
from sklearn.datasets import make_classification
from collections import Counter

X, y = make_classification(n_samples=5000,
                           n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

Counter(y)
Out[10]: Counter({0: 64, 1: 262, 2: 4674})

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# fit_resample is the current name of the method formerly called fit_sample
X_resampled, y_resampled = ros.fit_resample(X, y)

sorted(Counter(y_resampled).items())
Out[13]:
[(0, 4674), (1, 4674), (2, 4674)]

# SMOTE
from imblearn.over_sampling import SMOTE

X_resampled_smote, y_resampled_smote = SMOTE().fit_resample(X, y)

sorted(Counter(y_resampled_smote).items())
Out[29]:
[(0, 4674), (1, 4674), (2, 4674)]

# Undersampling
# Works the same way as above, using the imblearn.under_sampling module

Here is an application with unbalanced data.
The data set used here contains historical transaction data for customers of a German telecommunications company. It has 4,681 records and 19 variables. The dependent variable churn is binary: yes means the customer churned and no means the customer did not. The remaining independent variables include whether the customer subscribes to an international long-distance plan or a voice plan, the number of text messages, phone charges, the number of calls, and so on. We use this data set to explore what changes when unbalanced data is rebalanced.

# Import third-party packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import tree
from sklearn import metrics
from imblearn.over_sampling import SMOTE

# Read the data
churn = pd.read_excel(r'C:\Users\Administrator\Desktop\Customer_Churn.xlsx')

# Use a font that can display Chinese characters (avoids garbled labels)
plt.rcParams['font.sans-serif']=['Microsoft YaHei']

# Keep the axes aspect ratio equal so that the pie chart is drawn as a circle
plt.axes(aspect = 'equal')
# Count the churned and non-churned customers
counts = churn.churn.value_counts()

# Draw the pie chart
plt.pie(x = counts, labels=pd.Series(counts.index).map({'yes':'Churned','no':'Not churned'}), autopct='%.2f%%')
# Show the figure
plt.show()

The pie chart shows that churned customers account for only 8.3% of the total, far fewer than the customers who stayed, so the two classes can be considered unbalanced. A model trained directly on such data may be biased toward the majority class. To check whether this happens, we first build a decision tree model on the unmodified data.
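
A quick calculation shows why accuracy alone is misleading at this class ratio. The sketch below uses only the 8.3% churn share quoted above; the helper name `majority_baseline_accuracy` is made up for illustration.

```python
def majority_baseline_accuracy(positive_rate):
    """Accuracy of a classifier that always predicts the majority class."""
    return max(positive_rate, 1.0 - positive_rate)

churn_rate = 0.083  # share of churned customers reported above
print(f"{majority_baseline_accuracy(churn_rate):.1%}")  # 91.7%
```

A model that never predicts churn would already be about 91.7% accurate while finding no churned customers at all, which is why per-class recall and AUC are worth checking as well.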

The state and area_code variables in the original table record the state and area code the user belongs to. Intuitively, these are unlikely to be important reasons for churn, so both variables are deleted from the table. In addition, international_plan (whether the user subscribes to the international long-distance service) and voice_mail_plan (whether the user subscribes to the voice mail service) are binary variables stored as strings. They cannot be fed into the model directly and must be converted to 0-1 values.

# Data cleaning
# Delete the state and area_code variables
churn.drop(labels=['state','area_code'], axis = 1, inplace = True)

# Convert the binary variables international_plan and voice_mail_plan to 0-1 dummy variables
churn.international_plan = churn.international_plan.map({'no':0,'yes':1})
churn.voice_mail_plan = churn.voice_mail_plan.map({'no':0,'yes':1})

Once the data is clean, the data set is split into a training set and a test set. A classifier is built on the training set and evaluated on the test set.

# All independent variables used for modeling
predictors = churn.columns[:-1]
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(churn[predictors], churn.churn, random_state=12)

# Build the decision tree
dt = tree.DecisionTreeClassifier()
dt.fit(X_train,y_train)
# Predict on the test set
pred = dt.predict(X_test)

# Prediction accuracy of the model
print(metrics.accuracy_score(y_test, pred))
# Model evaluation report
print(metrics.classification_report(y_test, pred))

As the results above show, the decision tree's prediction accuracy exceeds 93%. Recall for the no class is 97%, but recall for the yes class is only 62%. The large gap between the two indicates that the classifier is indeed biased toward the class with more samples (no).
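
To see how high accuracy and low minority-class recall can coexist, the sketch below computes both from a confusion matrix. The counts are hypothetical, chosen only so that the resulting figures are of the same order as those reported above; they are not the actual confusion matrix of the churn model.

```python
def recall(true_positives, false_negatives):
    """Share of the actual members of a class that the model found."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical confusion-matrix counts, chosen only for illustration.
tn, fp = 970, 30   # actual "no":  predicted no / predicted yes
fn, tp = 38, 62    # actual "yes": predicted no / predicted yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_yes = recall(tp, fn)   # recall for the minority class
recall_no = recall(tn, fp)    # recall for the majority class

print(f"accuracy={accuracy:.3f} recall_no={recall_no:.2f} recall_yes={recall_yes:.2f}")
```

Because the no class is ten times larger in this example, its 97% recall dominates the overall accuracy and hides the fact that more than a third of the yes cases are missed.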

# Plot the ROC curve
# Compute the predicted churn probabilities used to build the ROC curve
y_score = dt.predict_proba(X_test)[:,1]
fpr,tpr,threshold = metrics.roc_curve(y_test.map({'no':0,'yes':1}), y_score)

# Compute the AUC value
roc_auc = metrics.auc(fpr,tpr)

# Draw the area chart
plt.stackplot(fpr, tpr, color='steelblue', alpha = 0.5, edgecolor = 'black')
# Add the outline of the curve
plt.plot(fpr, tpr, color='black', lw = 1)
# Add the diagonal reference line
plt.plot([0,1],[0,1], color = 'red', linestyle = '--')
# Add the text annotation
plt.text(0.5,0.3,'ROC curve (area = %0.3f)' % roc_auc)
# Add the x-axis and y-axis labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
# Show the figure
plt.show()

As the figure shows, the area under the ROC curve (AUC) is 0.795. A common rule of thumb treats a model as acceptable when its AUC exceeds 0.8; since this value falls below that threshold, the model is not considered good enough. Next, we use the SMOTE algorithm to balance the data.
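
For readers who want to see what the AUC number measures, the following sketch computes the area under a ROC curve with the trapezoidal rule. The ROC points are made up for illustration and are not the curve of the churn model.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        # Width of the step times the average height of its two ends.
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# Hypothetical ROC points for illustration only.
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]
print(round(auc_trapezoid(fpr, tpr), 3))  # 0.78
```

The diagonal from (0, 0) to (1, 1) gives an area of 0.5, which corresponds to random guessing, so an AUC is usually read relative to that baseline.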

# Balance the training data set
over_samples = SMOTE(random_state=1234)
over_samples_X, over_samples_y = over_samples.fit_resample(X_train, y_train)

# Class proportions before resampling
print(y_train.value_counts()/len(y_train))
# Class proportions after resampling
print(pd.Series(over_samples_y).value_counts()/len(over_samples_y))

As the results show, the classes in the original training set are still far from balanced, but after SMOTE the two classes reach a 1:1 ratio. We can now use this balanced data to rebuild the decision tree classifier.

# Rebuild the decision tree model on the balanced data
dt2 = tree.DecisionTreeClassifier()
dt2.fit(over_samples_X,over_samples_y)

# Predict on the test set
pred2 = dt2.predict(X_test)

# Prediction accuracy of the model
print(metrics.accuracy_score(y_test, pred2))
# Model evaluation report
print(metrics.classification_report(y_test, pred2))

As the results above show, after retraining on the balanced data the model's accuracy remains high at 92.6%, only about 1 percentage point lower than the model built on the original unbalanced data. Recall for the yes class, however, rose by 10 percentage points to 72%, and that is the benefit of balancing the data.

# Compute the predicted churn probabilities used to build the ROC curve
y_score = dt2.predict_proba(X_test)[:,1]

fpr,tpr,threshold = metrics.roc_curve(y_test.map({'no':0,'yes':1}), y_score)

# Compute the AUC value
roc_auc = metrics.auc(fpr,tpr)

# Draw the area chart
plt.stackplot(fpr, tpr, color='steelblue', alpha = 0.5, edgecolor = 'black')
# Add the outline of the curve
plt.plot(fpr, tpr, color='black', lw = 1)
# Add the diagonal reference line
plt.plot([0,1],[0,1], color = 'red', linestyle = '--')
# Add the text annotation
plt.text(0.5,0.3,'ROC curve (area = %0.3f)' % roc_auc)
# Add the x-axis and y-axis labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')

# Show the figure
plt.show()

Origin blog.csdn.net/weixin_42736194/article/details/83045400