Stacking algorithm predicts bank customer churn rate

Description

To help banks reduce customer churn, this project uses data analysis to identify and visualize the factors that lead to churn, and builds a predictive model that indicates whether a customer will churn and with what probability. The bank's customer service department can then target its retention efforts at the customers most likely to leave.

The practical content of this assignment includes:

1. Learn the principles of the Stacking and Blending algorithms.

2. Use the Stacking algorithm to predict bank customer churn.

Environment

  • Operating system: Windows 10, Ubuntu 18.04

  • Tool software: Anaconda3 2019, Python3.7

  • Hardware environment: no special requirements

  • Dependency library list

    scikit-learn    1.0.2
    numpy           1.19.3
    pandas          1.3.5
    

Analysis

This task involves the following steps:

A) Become familiar with the principles of the Stacking and Blending algorithms

B) Load and inspect the bank customer data

C) Use a decision tree classifier and a KNN classifier to generate prediction results separately

D) Concatenate these prediction results into a new feature set, keeping the original label set unchanged

E) Finally, use logistic regression to classify and predict on the new feature set

Implementation

1. Principle of Stacking/Blending algorithm

1.1 Stacking algorithm

The idea of the Stacking algorithm is to learn several base models on the initial training set, then use the predictions of these base models as the features of a new training set on which a new model is trained. The flow of the Stacking algorithm is shown in the figure below:

[Figure: flow of the Stacking algorithm]

The base models can be of heterogeneous types, such as decision trees, KNN, SVM or neural networks, and can all be combined together.

The specific steps of Stacking are as follows:

(1) Split the training set into K folds (recall the K-fold cross-validation introduced in Lesson 1)

(2) Following the K-fold cross-validation scheme, train the model on K-1 folds and validate it on the remaining fold

(3) After repeating this K times, train the model once more on the training set as a whole to obtain a base model

(4) Use the base model to predict the training set, giving the training-set predictions

(5) Use the base model to predict the test set, giving the test-set predictions

(6) Repeat steps (2) to (5) for every base model to generate all base models and their predictions (for example, CART, KNN, SVM and a neural network give 4 sets of predictions)

(7) Use the training-set predictions as the features of the new training set and the test-set predictions as the features of the new test set, then train the new model on them. The new model does not have to be of the same type as any base model
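The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: it uses synthetic data from `make_classification` instead of the bank data set, and it takes the training-set predictions from `cross_val_predict`, so each training row is predicted by a model that did not see that row (a common way to build the new training features). The names `X_demo`, `meta_train` and `meta_model` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the bank customer data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

train_cols, test_cols = [], []
for base in base_models:
    # K-fold predictions for the training rows (K = 5 here)
    train_cols.append(cross_val_predict(base, X_tr, y_tr, cv=5))
    # Refit on the whole training set, then predict the test set
    test_cols.append(base.fit(X_tr, y_tr).predict(X_te))

meta_train = np.column_stack(train_cols)   # new training features, shape (400, 2)
meta_test = np.column_stack(test_cols)     # new test features, shape (100, 2)

# The new model is trained on the base-model predictions only
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_tr)
meta_accuracy = meta_model.score(meta_test, y_te)
print(meta_accuracy)
```

Each column of `meta_train` comes from one base model, so adding a third base model simply adds a third column.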

1.2 Blending algorithm

The idea of Blending is almost exactly the same as that of Stacking. The only difference is that Blending does not perform K-fold cross-validation. Instead, the original training set is split once into a training set and a validation set, and the base models make predictions only for the validation set. The new training set is therefore built from the predictions for the validation set alone, not from predictions for the entire training set. The process of Blending integration is shown in the figure:

[Figure: Blending integration process]
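As a rough sketch of Blending under the same kind of assumptions (synthetic data, made-up names such as `X_fit`, `X_val` and `blender`), the base models are fitted on one part of the training set and the new model is trained only on their predictions for the held-out validation part:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the bank customer data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Single split of the training set: 300 rows to fit base models, 100 rows held out
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]
val_cols, test_cols = [], []
for base in base_models:
    base.fit(X_fit, y_fit)                    # no K-fold loop in Blending
    val_cols.append(base.predict(X_val))      # predictions for the validation set only
    test_cols.append(base.predict(X_te))

blend_train = np.column_stack(val_cols)       # shape (100, 2)
blend_test = np.column_stack(test_cols)       # shape (100, 2)

blender = LogisticRegression().fit(blend_train, y_val)
blend_accuracy = blender.score(blend_test, y_te)
print(blend_accuracy)
```

The trade-off is visible in the shapes: the new model sees only 100 training rows here, whereas Stacking produces a prediction for every training row.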

2. Load and analyze bank customer data set

import numpy as np # basic linear algebra package
import pandas as pd # data processing toolkit
df_bank = pd.read_csv("../dataset/BankCustomer.csv") # read the file
df_bank.head() # show the first 5 rows of the file

The result is as follows:

[Figure: first five rows of df_bank]

Description of data set characteristics:

  • Name: customer name

  • Gender: customer gender

  • Age: customer age

  • City: city

  • Tenure: how long the customer has been with the bank

  • ProductsNo: number of products used

  • HasCard: whether the customer has a credit card

  • ActiveMember: whether the customer is an active member

  • Credit: credit score

  • AccountBal: account balance

  • Salary: salary

  • Exited (label): whether the customer has churned; 1 means churned, 0 means not churned
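Because `Exited` is a 0/1 label, its mean over any group of customers is that group's churn rate. The sketch below shows the idea on a tiny hand-made table; the six rows are invented for illustration and are not taken from BankCustomer.csv.

```python
import pandas as pd

# Invented toy rows that reuse three of the column names described above
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "ActiveMember": [1, 0, 0, 1, 0, 1],
    "Exited": [1, 0, 1, 0, 0, 0],
})

overall_rate = toy["Exited"].mean()                       # share of churned customers
rate_by_gender = toy.groupby("Gender")["Exited"].mean()   # churn rate per gender
print(overall_rate)
print(rate_by_gender)
```

The same `groupby(...)["Exited"].mean()` pattern can be applied to other categorical columns, such as `City` or `HasCard`, to see which groups churn more often.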

3. Data processing

Convert the binary text category to numbers, one-hot encode the multi-valued category, and build the feature and label sets.

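# Convert the binary text category to numbers
# (assignment instead of inplace replace on a column, which newer pandas versions warn about)
df_bank['Gender'] = df_bank['Gender'].map({"Female": 0, "Male": 1})

# Show the numeric categories
print("Gender unique values",df_bank['Gender'].unique())

# Convert the multi-valued category into binary dummy variables, then attach them to the original data set
d_city = pd.get_dummies(df_bank['City'], prefix = "City")
df_bank = pd.concat([df_bank, d_city], axis = 1)

# Build the feature set and the label set
y = df_bank['Exited']
X = df_bank.drop(['Name', 'Exited', 'City'], axis=1)
X.head() # show the new feature set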

The result is as follows:

[Figure: first rows of the new feature set X]
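To make the effect of this encoding concrete, here is a small self-contained sketch on a three-row table. The city names are hypothetical placeholders, since the actual values in the data set are not listed in this article.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "City": ["Beijing", "Shanghai", "Beijing"],   # hypothetical city values
})

# Binary category: map each text value to 0 or 1
toy["Gender"] = toy["Gender"].map({"Female": 0, "Male": 1})

# Multi-valued category: one dummy column per distinct city
dummies = pd.get_dummies(toy["City"], prefix="City")
toy = pd.concat([toy, dummies], axis=1).drop("City", axis=1)

encoded_columns = list(toy.columns)
print(encoded_columns)
```

Every distinct city becomes its own `City_...` column, so the original text column can be dropped from the features.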

4. Split the dataset

Use the sklearn.model_selection.train_test_split() method to divide the dataset into training and test sets.

from sklearn.model_selection import train_test_split # split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                   test_size=0.2, random_state=0)
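Churn labels are often imbalanced, and `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The call above does not use it, so the following is only an optional variation, shown on invented labels with 20% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 200 churned (1) and 800 retained (0) customers
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(1000).reshape(-1, 1)   # dummy single-column features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

train_rate = y_tr.mean()   # share of positives in the training part
test_rate = y_te.mean()    # share of positives in the test part
print(train_rate, test_rate)
```

With stratification, both rates stay close to the overall 20%, which makes the test accuracy less dependent on a lucky or unlucky split.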

5. Implementation of Stacking algorithm

Define functions to implement the Stacking algorithm process.

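from sklearn.model_selection import StratifiedKFold

'''
model: base model to train
train: training-set features
y: training-set labels
test: test-set features
n_fold: number of folds
'''

def Stacking(model, train, y, test, n_fold):
    folds = StratifiedKFold(n_splits=n_fold, random_state=None)
    test_pred = np.empty((0, 1), float)
    train_pred = np.empty(len(train), float)   # one prediction per training row, in the row order of train

    for train_indices, val_indices in folds.split(train, y.values):   # split the training features and labels into n_fold subsets
        X_train, x_val = train.iloc[train_indices], train.iloc[val_indices]   # X_train: training features, x_val: validation features
        y_train, y_val = y.iloc[train_indices], y.iloc[val_indices]          # y_train: training labels, y_val: validation labels
        model.fit(X=X_train, y=y_train)

        # Store validation predictions at their original row positions so they stay aligned with y
        train_pred[val_indices] = model.predict(x_val)
        test_pred = np.append(test_pred, model.predict(test))   # test-set predictions, appended once per fold

    return test_pred, train_pred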
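Note that the `Stacking` function appends one array of test predictions per fold, so `test_pred` has `n_fold` times as many entries as the test set. A possible alternative, sketched below, averages the per-fold test predictions so that there is exactly one value per test row. The function name `stacking_mean` and the synthetic data are assumptions made for this example; this is not the function used in the rest of the article.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def stacking_mean(model, train, y, test, n_fold):
    """Hypothetical variant: out-of-fold train predictions, fold-averaged test predictions."""
    folds = StratifiedKFold(n_splits=n_fold)
    train_pred = np.zeros(len(train))
    test_pred = np.zeros(len(test))
    for fit_idx, val_idx in folds.split(train, y):
        model.fit(train.iloc[fit_idx], y.iloc[fit_idx])
        # Each training row is predicted by the fold model that did not see it
        train_pred[val_idx] = model.predict(train.iloc[val_idx])
        # Accumulate the mean of the fold models' test predictions
        test_pred += model.predict(test) / n_fold
    return test_pred, train_pred

# Synthetic stand-in data (assumption: not the bank customer data set)
X_arr, y_arr = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo, y_demo = pd.DataFrame(X_arr), pd.Series(y_arr)

test_pred, train_pred = stacking_mean(
    DecisionTreeClassifier(random_state=1),
    X_demo.iloc[:240], y_demo.iloc[:240], X_demo.iloc[240:], n_fold=5)
print(test_pred.shape, train_pred.shape)
```

With this variant the test labels would not need to be repeated before scoring, and the averaged value (the fraction of fold models voting for class 1) carries slightly more information than a single 0/1 prediction.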

6. Training base model

Create a decision tree classifier model and a KNN classifier model, and train the two models with the Stacking function just defined:

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

model1 = DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = Stacking(model=model1, n_fold=10,
                                   train=X_train, test=X_test, y=y_train)
train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)

model2 = KNeighborsClassifier()
test_pred2, train_pred2 = Stacking(model=model2, n_fold=10,
                                   train=X_train, test=X_test, y=y_train)
train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)

7. Classification prediction

Concatenate the above prediction results into a new feature set, keep the labels unchanged, and use the original label set. Finally, use the logistic regression algorithm to classify and predict the new feature set:

from sklearn.linear_model import LogisticRegression

df = pd.concat([train_pred1, train_pred2], axis=1)    # (8000, 2)
df_test = pd.concat([test_pred1, test_pred2], axis=1)   # (20000, 2): 10 folds x 2000 test rows

# Stacking() appends the test predictions of all 10 folds,
# so repeat the test labels 10 times to match (without overwriting y_test)
y_test_rep = pd.concat([y_test] * 10, axis=0)

model = LogisticRegression(random_state=1)
model.fit(df, y_train)
print(model.score(df_test, y_test_rep))

The accuracy printed in the original run was:

0.7915
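For comparison, scikit-learn ships a built-in `StackingClassifier` in `sklearn.ensemble` (available in the scikit-learn 1.0.2 listed in the environment). The sketch below wires up the same two base models and a logistic regression as the final model, but on synthetic data rather than the bank data set, so its accuracy says nothing about the figure reported above. It also shows how a churn probability could be read from `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the bank customer data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=10,                    # 10-fold cross-validation for the base-model predictions
    stack_method="predict",   # feed class predictions, not probabilities, to the final model
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
churn_probability = stack.predict_proba(X_te)[:, 1]   # probability of class 1 per test row
print(stack_accuracy)
```

Here the cross-validation loop and the concatenation of predictions are handled internally, and each test row gets exactly one prediction.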

Origin blog.csdn.net/qq_40186237/article/details/130147809