Predicting Bank Customer Behavior with Python: A Practical Case

This is a bank marketing dataset from Kaggle. By studying it, we can predict whether a customer will subscribe to a term deposit (the target variable y). The dataset contains 20 features.

1. Analytical framework

[Figure: analytical framework]

2. Data reading and cleaning

# Import the required packages
import numpy as np
import pandas as pd 
# Read the data
data = pd.read_csv('./1bank-additional-full.csv')
# Check the number of rows and columns
data.shape


Output:

[Output: DataFrame shape and column overview]
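The per-column missing-value counts are not shown in the code above; a quick way to check them (a minimal sketch, reusing the data DataFrame from the reading step) is:

# Count missing values in each column; only nr.employed should show missing entries here
data.isnull().sum()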

Only the nr.employed column has missing data, so let's look at it more closely:

data['nr.employed'].value_counts()

[Output: value counts of nr.employed]

The column contains only the single value 5191.0, with just 7,763 non-missing records. It carries no useful information, so it is treated as anomalous and dropped.

# Drop the nr.employed column
data.drop('nr.employed', axis=1, inplace=True)

3. Exploratory Data Analysis

3.1 Distribution of customers by age

As the plot below shows, the bank's customers are mainly concentrated in the 23-60 age range, with the 29-39 group larger than the other age groups.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20, 8), dpi=256)
sns.countplot(x='age', data=data)
plt.title("Number of customers in each age group")

[Figure: countplot of customer ages]
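The visual impression above can also be checked numerically; a quick sketch on the same data DataFrame:

# Numeric summary of the age column
data['age'].describe()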

3.2 Distributions of other features

plt.figure(figsize=(18, 16), dpi=512)
plt.subplot(221)
sns.countplot(x='contact', data=data)
plt.title("Distribution of contact")

plt.subplot(222)
sns.countplot(x='day_of_week', data=data)
plt.title("Distribution of day_of_week")

plt.subplot(223)
sns.countplot(x='default', data=data)
plt.title("Distribution of default")

plt.subplot(224)
sns.countplot(x='education', data=data)
plt.xticks(rotation=70)
plt.title("Distribution of education")

plt.savefig('./1.png')

[Figure: distributions of contact, day_of_week, default, and education]

plt.figure(figsize=(18, 16), dpi=512)
plt.subplot(221)
sns.countplot(x='housing', data=data)
plt.title("Distribution of housing")

plt.subplot(222)
sns.countplot(x='job', data=data)
plt.xticks(rotation=70)
plt.title("Distribution of job")

plt.subplot(223)
sns.countplot(x='loan', data=data)
plt.title("Distribution of loan")

plt.subplot(224)
sns.countplot(x='marital', data=data)
plt.xticks(rotation=70)
plt.title("Distribution of marital")

plt.savefig('./2.png')

[Figure: distributions of housing, job, loan, and marital]

plt.figure(figsize=(18, 8), dpi=512)
plt.subplot(221)
sns.countplot(x='month', data=data)
plt.xticks(rotation=30)

plt.subplot(222)
sns.countplot(x='poutcome', data=data)
plt.xticks(rotation=30)
plt.savefig('./3.png')

[Figure: distributions of month and poutcome]

3.3 Feature correlations

plt.figure(figsize=(10, 8), dpi=256)
plt.rcParams['axes.unicode_minus'] = False
# Correlation heatmap of the numeric columns
sns.heatmap(data.corr(numeric_only=True), annot=True)
plt.savefig('./4.png')

[Figure: correlation heatmap of numeric features]

4. Feature preprocessing

4.1 Encode the categorical feature values as integer labels

# Label-encode the categorical features
from sklearn.preprocessing import LabelEncoder
features = ['contact', 'day_of_week', 'default', 'education', 'housing',
           'job','loan', 'marital', 'month', 'poutcome']

le_x = LabelEncoder()
for feature in features:
    data[feature] = le_x.fit_transform(data[feature]) 
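Note that a single LabelEncoder instance is reused for every column above, so its classes_ attribute only keeps the mapping of the last column fitted. If the label-to-category mappings need to be inspected or inverted later, one encoder per column can be kept instead; an alternative sketch of the same step:

# Keep one encoder per column so each mapping can be recovered later
from sklearn.preprocessing import LabelEncoder

encoders = {}
for feature in features:
    encoders[feature] = LabelEncoder()
    data[feature] = encoders[feature].fit_transform(data[feature])

# e.g. encoders['education'].classes_ shows the original categories in label order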

4.2 Convert the target y to 0 and 1

# Map 'no' to 0 and any other label (i.e. 'yes') to 1
def parse_y(x):
    if x == 'no':
        return 0
    else:
        return 1

data['y'] = data['y'].apply(parse_y)
data['y'] = data['y'].astype(int)
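The same conversion can also be written as one vectorized line instead of the apply above, assuming the raw labels are the strings 'yes' and 'no':

# Vectorized alternative to parse_y: the boolean comparison becomes 0/1
data['y'] = (data['y'] == 'yes').astype(int)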

4.3 Data standardization

# Standardize the features (zero mean, unit variance)
# Split the data into training and test sets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ss = StandardScaler()
train_x, test_x, train_y, test_y = train_test_split(data.iloc[:,:-1], 
                                                   data['y'], 
                                                   test_size=0.3)
train_x = ss.fit_transform(train_x)
test_x = ss.transform(test_x)
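Before training, it is also worth checking how balanced the target classes are, since plain accuracy can look deceptively high on an imbalanced dataset; a quick check (not part of the original code):

# Proportion of subscribers vs. non-subscribers in the full dataset
data['y'].value_counts(normalize=True)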

5. Model training

5.1 AdaBoost classifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
ada = AdaBoostClassifier()
ada.fit(train_x, train_y)
predict_y = ada.predict(test_x)
print("准确率:", accuracy_score(test_y, predict_y))

[Output: AdaBoost accuracy]

5.2 SVC Classifier

from sklearn.svm import SVC
svc = SVC()
svc.fit(train_x, train_y)
predict_y = svc.predict(test_x)
print("准确率:", accuracy_score(test_y, predict_y))

[Output: SVC accuracy]

5.3 K-Nearest Neighbors classifier

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train_x, train_y)
predict_y = knn.predict(test_x)
print("准确率:", accuracy_score(test_y, predict_y))

[Output: KNN accuracy]

5.4 Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(train_x, train_y)
predict_y = dtc.predict(test_x)
print("准确率:", accuracy_score(test_y, predict_y))

[Output: decision tree accuracy]
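The four fitted models can also be compared side by side in a single loop; a small sketch reusing the objects defined above:

# Compare test-set accuracy of all four classifiers
from sklearn.metrics import accuracy_score

models = {'AdaBoost': ada, 'SVC': svc, 'KNN': knn, 'DecisionTree': dtc}
for name, model in models.items():
    print(name, accuracy_score(test_y, model.predict(test_x)))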

6. Model evaluation

6.1 AdaBoost classifier

from sklearn.metrics import roc_curve
from sklearn.metrics import auc
plt.figure(figsize=(8,6))
fpr1, tpr1, thresholds1 = roc_curve(test_y, ada.predict(test_x))
plt.stackplot(fpr1, tpr1, color='steelblue', alpha=0.5, edgecolor='black')
plt.plot(fpr1, tpr1, linewidth=2, color='black')
plt.plot([0,1], [0,1], ls='-', color='red')
plt.text(0.5, 0.4, auc(fpr1, tpr1))
plt.title('ROC curve of the AdaBoost classifier')

[Figure: ROC curve of the AdaBoost classifier]
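Note that roc_curve is fed hard 0/1 predictions here, which yields only a three-point curve. Passing continuous scores gives a smoother, more informative curve; a sketch for the AdaBoost and SVC models fitted above:

from sklearn.metrics import roc_curve, auc

# AdaBoost exposes class probabilities
score_ada = ada.predict_proba(test_x)[:, 1]
fpr_a, tpr_a, _ = roc_curve(test_y, score_ada)
print('AdaBoost AUC:', auc(fpr_a, tpr_a))

# SVC has no probabilities by default; its decision_function can be used instead
score_svc = svc.decision_function(test_x)
fpr_s, tpr_s, _ = roc_curve(test_y, score_svc)
print('SVC AUC:', auc(fpr_s, tpr_s))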

6.2 SVC Classifier

plt.figure(figsize=(8,6))
fpr2, tpr2, thresholds2 = roc_curve(test_y, svc.predict(test_x))
plt.stackplot(fpr2, tpr2, alpha=0.5)
plt.plot(fpr2, tpr2, linewidth=2, color='black')
plt.plot([0,1], [0,1], ls='-', color='red')
plt.text(0.5, 0.4, auc(fpr2, tpr2))
plt.title('ROC curve of the SVC classifier')

[Figure: ROC curve of the SVC classifier]

6.3 K-Nearest Neighbors classifier

plt.figure(figsize=(8,6))
fpr3, tpr3, thresholds3 = roc_curve(test_y, knn.predict(test_x))
plt.stackplot(fpr3, tpr3, alpha=0.5)
plt.plot(fpr3, tpr3, linewidth=2, color='black')
plt.plot([0,1], [0,1], ls='-', color='red')
plt.text(0.5, 0.4, auc(fpr3, tpr3))
plt.title('ROC curve of the KNN classifier')

[Figure: ROC curve of the KNN classifier]

6.4 Decision Tree Classifier

plt.figure(figsize=(8,6))
fpr4, tpr4, thresholds4 = roc_curve(test_y, dtc.predict(test_x))
plt.stackplot(fpr4, tpr4, alpha=0.5)
plt.plot(fpr4, tpr4, linewidth=2, color='black')
plt.plot([0,1], [0,1], ls='-', color='red')
plt.text(0.5, 0.4, auc(fpr4, tpr4))
plt.title('ROC curve of the decision tree classifier')

[Figure: ROC curve of the decision tree classifier]
