Practical guide | Identify customers' purchase intention in just 5 steps (with code and data)

Do you often run into this scenario at work: the business team wants to boost product sales through marketing campaigns, but the budget is limited. How to maximize the conversion rate within the allowed budget is a problem every data analyst and data miner has to face.

This article uses data from a bank's marketing campaign as an example to show how to identify whether a customer is willing to buy the bank's product, so that precision marketing can target high-intent customers and raise the conversion rate. Without further ado, let's walk through the solution in detail.

Data overview

[Figure: preview of the data set fields]

The data contains basic customer information and campaign information. In a real scenario, if customer preference data, history of past campaign participation, and so on are available, they can be added as well.

Data preprocessing

1. Viewing the data

We can see that there are 25,317 rows of data and no null values. The details are as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with SimHei
train=pd.read_csv('train_set.csv')
test=pd.read_csv('test_set.csv')
train.info()

[Figure: train.info() output]

2. Missing value handling

Looking at the source data, you can see that the categorical fields contain an 'unknown' category. Here we treat that category as a missing value. Let's take a closer look:

# Inspect the unique values of the object-type columns
str_features = []
num_features = []
for col in train.columns:
    if train[col].dtype == 'object':
        str_features.append(col)
        print(col, ':  ', train[col].unique())
    if train[col].dtype == 'int64' and col not in ['ID', 'y']:
        num_features.append(col)
# Percentage of 'unknown' values in each column
train.isin(['unknown']).mean() * 100

In general, the most common ways to handle missing values are deletion, substitution, and imputation:

  • The deletion method removes either the rows containing missing values (provided the proportion of such rows is very low, e.g. under 5%) or the variable itself (provided the variable has a very high proportion of missing values, e.g. around 70%).
  • The substitution method directly replaces missing values with the variable's mean, median, or mode. Its advantage is speed; its disadvantage is that it easily produces biased estimates, reducing accuracy.
  • The imputation method uses supervised machine learning (e.g. regression models, tree models, neural networks) to predict the missing values. Its advantage is higher accuracy; its disadvantage is the heavy computation involved, which greatly slows down the processing.

Here we observe that the 'unknown' share of contact and poutcome reaches 28.76% and 81.67%, respectively, so further processing is worth considering after inspecting the data. The 'unknown' share of job and education is relatively small, so we choose not to process those two features. A sketch of what such processing could look like follows below.
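As an illustration of the methods above, here is a minimal sketch of how the 'unknown' values could be treated, assuming we follow the rules of thumb just stated (drop poutcome for its ~82% missing rate, mode-fill the low-missing columns). The name train_clean is illustrative; this is one possible choice, not the only one:

# Illustrative sketch: treat 'unknown' as a proper missing value first
train_clean = train.replace('unknown', np.nan)
# Deletion: poutcome is ~81.67% 'unknown', above the ~70% rule of thumb,
# so the whole column could be dropped
train_clean = train_clean.drop(columns=['poutcome'])
# Substitution: fill the remaining missing categories with the mode
for col in ['job', 'education', 'contact']:
    train_clean[col] = train_clean[col].fillna(train_clean[col].mode()[0])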

Data analysis

Now let's analyze the source data. The fields divide into discrete variables and continuous variables, and we will analyze them one by one.

1. Discrete variables

plt.figure(figsize=(15, 15))
i = 1
for col in str_features:
    plt.subplot(3, 3, i)
    # mean() works here because the label is binary 0/1:
    # the sum of 1s (buyers) divided by the number of rows = purchase rate
    train.groupby([col])['y'].mean().plot(kind='bar', stacked=True, rot=90, title='Purchase rate of {}'.format(col))
    plt.subplots_adjust(wspace=0.2, hspace=0.7)  # adjust subplot spacing
    i = i + 1
plt.show()

[Figure: purchase rate by each categorical feature]

These charts give us a first look at each feature, making it easier to judge whether a feature affects the purchase rate.

2. Continuous variables

1. age

plt.figure()
sns.boxenplot(x='y', y='age', data=train)
plt.show()

[Figure: boxen plot of age by purchase label]

train[train['y']==0]['age'].plot(kind='kde',label='0')
train[train['y']==1]['age'].plot(kind='kde',label='1')
plt.legend()
plt.show()

[Figure: age density curves for the two classes]

From the figures above, we can see that the age distributions of the two classes of customers differ little.

2. balance: the account's average yearly balance

train['balance'].plot(kind='hist')
plt.show()

[Figure: histogram of balance]

3. duration: the duration of the last contact

plt.figure()
sns.boxplot(y='duration', data=train)
plt.show()

[Figure: box plot of duration]

4. campaign: the number of contacts with the customer during this campaign

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.boxenplot(x='y', y='campaign', data=train)
plt.subplot(1, 2, 2)
sns.boxplot(y='campaign', data=train)
plt.show()

[Figure: boxen and box plots of campaign]

5. pdays: the number of days since the customer was last contacted in a previous campaign (999 means never contacted)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.boxenplot(x='y', y='pdays', data=train)
plt.subplot(1, 2, 2)
sns.boxplot(y='pdays', data=train)
plt.show()

[Figure: boxen and box plots of pdays]

6. previous: the number of contacts with the customer before this campaign

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.boxenplot(x='y', y='previous', data=train)
plt.subplot(1, 2, 2)
sns.boxplot(y='previous', data=train)
plt.show()

[Figure: boxen and box plots of previous]

Feature engineering

Through the feature-by-feature analysis above, we now have a general understanding of the data. Next, we perform feature engineering from the perspectives of class balance, outlier handling, and categorical encoding.

1. Check the class balance of the training set

plt.rc('font', family='SimHei', size=13)
fig = plt.figure()
plt.pie(train['y'].value_counts(), labels=train['y'].value_counts().index, autopct='%1.2f%%', counterclock=False)
plt.title('Purchase rate')
plt.show()

[Figure: pie chart of the class distribution]

We can see that the ratio of non-buyers to buyers is roughly 9:1, so this is an imbalanced data set.
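For an exact figure, the class proportions can also be printed directly:

print(train['y'].value_counts(normalize=True) * 100)  # percentage of each class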

2. Outlier handling for the continuous (numerical) variables

def outlier_processing(dfx):
    # Cap outliers using the 1.5 * IQR rule (Tukey's fences)
    df = dfx.copy()
    q1 = df.quantile(q=0.25)
    q3 = df.quantile(q=0.75)
    iqr = q3 - q1
    Umin = q1 - 1.5 * iqr
    Umax = q3 + 1.5 * iqr
    df[df > Umax] = df[df <= Umax].max()  # clip high outliers to the largest in-range value
    df[df < Umin] = df[df >= Umin].min()  # clip low outliers to the smallest in-range value
    return df
train['age']=outlier_processing(train['age'])
train['day']=outlier_processing(train['day'])
train['duration']=outlier_processing(train['duration'])
train['campaign']=outlier_processing(train['campaign'])
test['age']=outlier_processing(test['age'])
test['day']=outlier_processing(test['day'])
test['duration']=outlier_processing(test['duration'])
test['campaign']=outlier_processing(test['campaign'])

3. Encoding the categorical variables

dummy_train=train.join(pd.get_dummies(train[str_features])).drop(str_features,axis=1).drop(['ID','y'],axis=1)
dummy_test=test.join(pd.get_dummies(test[str_features])).drop(str_features,axis=1).drop(['ID'],axis=1)
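One caveat worth noting: calling get_dummies on train and test separately can produce mismatched columns if some category appears in only one of the two sets. A defensive alignment step (a sketch, assuming the training columns are the reference):

# Align the test dummies to the training columns, filling absent categories with 0
dummy_test = dummy_test.reindex(columns=dummy_train.columns, fill_value=0)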

4. Handling the imbalanced data set

from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek

X = dummy_train
y = train['y']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=2020)
smote_tomek = SMOTETomek(random_state=2020)  # oversample with SMOTE, then clean with Tomek links
X_resampled, y_resampled = smote_tomek.fit_resample(X_train, y_train)  # fit_sample was renamed fit_resample in newer imblearn versions

Data modeling

For ease of explanation, this article uses logistic regression for modeling. In real work scenarios, we can also use random forest, LightGBM, XGBoost, DNN and other models, chosen according to the specific scenario and modeling results.

# Logistic regression with a grid search over penalty, C, and solver
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

param = {"penalty": ["l1", "l2"], "C": [0.1, 1, 10], "solver": ["liblinear", "saga"]}
gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param, cv=2, scoring="roc_auc", verbose=10)
gs.fit(X_resampled, y_resampled)
print(gs.best_params_)
y_pred = gs.best_estimator_.predict(X_valid)
print(classification_report(y_valid, y_pred))
# Confusion matrix on the resampled training set
print(confusion_matrix(y_resampled, gs.best_estimator_.predict(X_resampled), labels=[1, 0]))
# Confusion matrix on the validation set
print(confusion_matrix(y_valid, y_pred, labels=[1, 0]))

# Plot the ROC-AUC curve and return the threshold maximizing Youden's J (recall - FPR)
def get_rocauc(X, y, clf):
    from sklearn.metrics import roc_curve
    FPR, recall, thresholds = roc_curve(y, clf.predict_proba(X)[:, 1], pos_label=1)
    area = roc_auc_score(y, clf.predict_proba(X)[:, 1])
    maxindex = (recall - FPR).tolist().index(max(recall - FPR))
    threshold = thresholds[maxindex]
    plt.figure()
    plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
    plt.plot([0, 1], [0, 1], color='black', linestyle='--')
    plt.scatter(FPR[maxindex], recall[maxindex], c='black', s=30)
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('Recall')
    plt.title('Receiver operating characteristic')
    plt.legend(loc='lower right')
    plt.show()
    return threshold

threshold = get_rocauc(X_resampled, y_resampled, gs.best_estimator_)

[Figure: ROC-AUC curve]
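As mentioned above, other models can be swapped in for logistic regression. A minimal sketch with LightGBM, assuming the lightgbm package is installed (the hyperparameters here are illustrative, not tuned):

from lightgbm import LGBMClassifier

# Illustrative, untuned parameters; in practice these would also go through a grid search
lgb_clf = LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=2020)
lgb_clf.fit(X_resampled, y_resampled)
print(classification_report(y_valid, lgb_clf.predict(X_valid)))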

Above, we trained the model, made predictions, and evaluated model performance.
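To close the loop on the original goal of targeting high-intent customers, the fitted model and the threshold returned by get_rocauc can be applied to the test set. A sketch, assuming dummy_test has been aligned to the training columns as above (purchase_intent is a hypothetical column name introduced here for illustration):

# Score the test set and flag customers whose predicted probability exceeds the threshold
test_proba = gs.best_estimator_.predict_proba(dummy_test)[:, 1]
test['purchase_intent'] = (test_proba >= threshold).astype(int)  # hypothetical flag column
high_will_customers = test.loc[test['purchase_intent'] == 1, 'ID']  # IDs to target with precision marketing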



Original article: blog.csdn.net/weixin_38037405/article/details/108295081