2022 The 5th "Teddy Cup" Data Analysis Skills Competition - Question B - Analysis of Bank Customer Loyalty (Part 1)

This is my first time taking part in the competition, so please bear with any shortcomings.

Topic link: https://pan.baidu.com/s/1x1U-kobvPjNMm8xnvS9Gdg
Extraction code: 7id3


Table of contents

Task 1 Data Exploration and Cleaning

Task 1.1 Data Exploration and Preprocessing

Task 1.2 Feature Encoding

Task 2 Visual analysis of product marketing data

Task 2.1 

Task 2.2

Task 2.3

Task 2.4

Task 3 Visual analysis of customer churn factors

Task 3.1

Task 3.2

Task 3.3

Task 3.4 


Task 1 Data Exploration and Cleaning

Explore and clean two datasets: the short-term customer product purchase data, "short-customer-data.csv" (referred to as the short-term data), and the training set of the long-term customer resource information, "long-customer-train.csv" (referred to as the long-term data).

Task 1.1 Data Exploration and Preprocessing

(1) Explore the missing values of each indicator in the short-term data and the duplicate values in the "user_id" column, then delete the rows that contain missing or duplicate values. Give the process and the necessary results in the report, and save the complete result to the file "result1_1.xlsx".

1. First check the data through the info function

import pandas as pd
data=pd.read_csv(r"F:\泰迪杯B题\B题:银行客户忠诚度分析赛题数据\B题:银行客户忠诚度分析赛题数据\short-customer-data.csv")  # raw string, so the backslashes are not read as escapes
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41176 entries, 0 to 41175
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      41176 non-null  object
 1   age          41176 non-null  int64 
 2   job          40846 non-null  object
 3   marital      41096 non-null  object
 4   education    39446 non-null  object
 5   default      32580 non-null  object
 6   housing      40186 non-null  object
 7   loan         40186 non-null  object
 8   contact      41176 non-null  object
 9   month        41176 non-null  object
 10  day_of_week  41176 non-null  object
 11  duration     41176 non-null  int64 
 12  poutcome     41176 non-null  object
 13  y            41176 non-null  object
dtypes: int64(2), object(12)
memory usage: 4.4+ MB
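
The non-null counts above already show which columns have gaps (default, for example, has 32580 of 41176). The same information can be read off directly as counts of missing values and of repeated IDs. A minimal sketch on an invented frame (the toy rows are assumptions for illustration, not the competition data):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "user_id": ["BA1", "BA2", "BA2", "BA3"],
    "job": ["admin.", None, "student", "student"],
    "age": [30, 41, 41, 55],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows whose user_id occurs more than once
rows_with_repeated_id = int(toy["user_id"].duplicated(keep=False).sum())

print(missing_per_column)
print(rows_with_repeated_id)  # 2
```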

2. Delete the rows that contain missing values

data.dropna(inplace=True)

3. View the duplicate values in the "user_id" column

(data["user_id"].value_counts()).head(20)
BA2200001    12
BA2200775     7
BA2239485     7
BA2241169     6
BA2200077     2
BA2241172     2
BA2201983     2
BA2240741     2
BA2200107     2
BA2229101     1
BA2220590     1
BA2208766     1
BA2225619     1
BA2204624     1
BA2227831     1
BA2239381     1
BA2224934     1
BA2224984     1
BA2200497     1
BA2203990     1
Name: user_id, dtype: int64

4. Use the drop_duplicates function to remove the rows with a repeated "user_id" (by default the first occurrence of each ID is kept)

data.drop_duplicates(subset=['user_id'],inplace=True)
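
If the task is read as removing every row involved in a duplicate, rather than keeping one copy, drop_duplicates accepts keep=False. A small sketch of the difference, using made-up IDs (the toy frame is an assumption, not the real data):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "user_id": ["BA1", "BA1", "BA2", "BA3", "BA3", "BA3"],
    "age": [30, 31, 40, 50, 51, 52],
})

# Default behaviour: keep the first row of each repeated user_id
keep_first = toy.drop_duplicates(subset=["user_id"])

# keep=False: drop every row whose user_id appears more than once
drop_all = toy.drop_duplicates(subset=["user_id"], keep=False)

print(keep_first["user_id"].tolist())  # ['BA1', 'BA2', 'BA3']
print(drop_all["user_id"].tolist())    # ['BA2']
```

Which reading is right depends on how the competition scores the result, so it is worth stating the choice in the report.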

5. Write excel file

df1=pd.DataFrame(data)
df1.to_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result1_1.xlsx',sheet_name='sheet1',index=None)

(2) In the long-term data, the customer age column "Age" contains the abnormal values -1, 0 and "-"; delete the rows where these occur. The "Age" column also contains abnormal characters such as spaces and "岁" (years); remove these characters but keep the age value, and store the processed value back in the "Age" column. Give the process and the necessary results in the report, and save the complete result to the file "result1_2.xlsx".

1. First keep only the rows whose "Age" value is not -1, 0 or "-"

# load the long-term data; the folder is assumed to be the same one that holds the short-term file
long_data=pd.read_csv(r"F:\泰迪杯B题\B题:银行客户忠诚度分析赛题数据\B题:银行客户忠诚度分析赛题数据\long-customer-train.csv")
long_data=long_data[~long_data["Age"].isin([0,'-',-1])].copy()  # copy() avoids a SettingWithCopyWarning in the next step

2. Data processing

long_data["Age"] = long_data["Age"].str.replace("岁","")  # remove the "岁" (years) suffix
long_data["Age"] = long_data["Age"].str.replace(" ","")   # remove plain spaces
long_data["Age"].unique()
array(['52', '41', '42', '61', '39', '44', '53', '48', '60', '32', '38',
       '51', '56', '28', '57', '45', '46', '40', '30', '59', '33', '35',
       '54', '62', '23', '24', '36', '47', '73', '49', '37', '58', '21',
       '55', '29', '70', '34', '43', '31', '27', '50', '66', '26', '20',
       '64', '63', '71', '22', '25', '73 ', '75', '18', '1', '65', '72',
       '69', '33 ', '25 ', '24 ', '76', '47 ', '74', '19', '77', '0',
       '68', '49 ', '67', '63 ', '34 ', '50 ', '64 ', '29 ', '81', '27 ',
       '83', '79', '78', '23 ', '30 ', '26 ', '45 ', '77 ', '84', '80',
       '92', '32 ', '28 ', '31 ', '66 ', '22 ', '62 ', '46 ', '21 ',
       '57 ', '82', '88'], dtype=object)
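
The output above still contains entries such as '73 ' and '0', so the filter and the two replace calls did not catch every case: the values are strings, so the integers 0 and -1 passed to isin never match, and some trailing whitespace survives. A more defensive sketch, run here on a small invented Series rather than on the real column:

```python
import pandas as pd

# Invented sample values that mimic the problems described in the task
age = pd.Series(["52", "41岁", "73 ", " 33", "-1", "0", "-", "25 岁"])

# 1) remove the "岁" suffix and every whitespace character
cleaned = age.str.replace("岁", "", regex=False).str.replace(r"\s+", "", regex=True)

# 2) drop the abnormal markers, compared as strings because the column is text
cleaned = cleaned[~cleaned.isin(["-1", "0", "-"])]

# 3) convert what is left to integers
cleaned = cleaned.astype(int)

print(cleaned.tolist())  # [52, 41, 73, 33, 25]
```

Applied to the real data, the same boolean mask would be used to filter long_data itself so that the other columns stay aligned.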

3. Write excel file

df2=pd.DataFrame(long_data)
df2.to_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result1_2.xlsx',sheet_name='sheet1',index=None)

Task 1.2 Feature Encoding

Apply feature encoding to the character (text) columns of the short-term data, for example encoding the credit default status {'no', 'yes'} as {0, 1}. Give the approach, the process and the necessary results in the report, and save the complete result to the file "result1_3.xlsx".
1. Observe the data type
data.dtypes
user_id        object
age             int64
job            object
marital        object
education      object
default        object
housing        object
loan           object
contact        object
month          object
day_of_week    object
duration        int64
poutcome       object
y              object
dtype: object

2. Find the distribution of each character data value separately

data["job"].value_counts()
admin.           8724
blue-collar      5670
technician       5459
services         2855
management       2310
retired          1212
self-employed    1092
entrepreneur     1088
unemployed        738
housemaid         688
student           609
Name: job, dtype: int64
data["marital"].value_counts()
married     17472
single       9426
divorced     3547
Name: marital, dtype: int64
data["education"].value_counts()
undergraduate     10397
postgraduate       8037
high school        7688
junior college     4312
illiterate           11
Name: education, dtype: int64
data["housing"].value_counts()
yes    16498
no     13947
Name: housing, dtype: int64
data["loan"].value_counts()
no     25680
yes     4765
Name: loan, dtype: int64
data["contact"].value_counts()
cellular     20420
telephone    10025
Name: contact, dtype: int64
data["month"].value_counts()
may    9718
jul    5075
aug    4667
jun    3608
nov    3492
apr    2112
oct     641
sep     493
mar     482
dec     157
Name: month, dtype: int64
data["day_of_week"].value_counts()
thu    6388
mon    6274
wed    6116
tue    5939
fri    5728
Name: day_of_week, dtype: int64
data["poutcome"].value_counts()
nonexistent    25796
failure         3458
success         1191
Name: poutcome, dtype: int64
data["y"].value_counts()
no     26589
yes     3856
Name: y, dtype: int64

3. Encode the columns whose values are {'no', 'yes'} as {0, 1}

data["housing"] = data["housing"].str.replace("yes","1")
data["housing"] = data["housing"].str.replace("no","0")
data["loan"] = data["loan"].str.replace("yes","1")
data["loan"] = data["loan"].str.replace("no","0")
data["default"] = data["default"].str.replace("yes","1")
data["default"] = data["default"].str.replace("no","0")
data["y"] = data["y"].str.replace("yes","1")
data["y"] = data["y"].str.replace("no","0")
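
Note that str.replace leaves the values as the strings "1" and "0". If integer columns are preferred, a dictionary mapping does the conversion in one step. A small sketch on invented rows (the toy frame only stands in for the short-term data):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "no", "yes"],
    "default": ["no", "yes", "no"],
    "y": ["yes", "no", "no"],
})

binary_cols = ["housing", "loan", "default", "y"]
mapping = {"no": 0, "yes": 1}

# map() yields NaN for any value outside the mapping, which makes stray labels easy to spot
for col in binary_cols:
    toy[col] = toy[col].map(mapping)

print(toy["housing"].tolist())  # [1, 0, 1]
print(toy["y"].tolist())        # [1, 0, 0]
```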

4. Use the one-hot encoder in sklearn to convert the remaining features, whose values are not {yes, no}, into dummy variables that take the values 0 and 1

Note: there are many columns, so to avoid mistakes the data is processed in several passes

from sklearn.preprocessing import OneHotEncoder
index=data.values[:,0:2]
index_data=index # the user_id and age columns
job_marital_education=data.values[:,2:5]
enc=OneHotEncoder() # create the encoder
df1_new=enc.fit_transform(job_marital_education).toarray() # one-hot encode job, marital and education
df1_index_education=pd.concat((pd.DataFrame(index_data),pd.DataFrame(df1_new)),axis=1) # combine into one DataFrame
df1_index_education.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate']
df1_index_education.head()
contact_month_week=data.values[:,8:11]
enc=OneHotEncoder() # create the encoder
df2_new=enc.fit_transform(contact_month_week).toarray() # one-hot encode contact, month and day_of_week
df_index_week=pd.concat((pd.DataFrame(df1_index_education),pd.DataFrame(df2_new)),axis=1) # combine into one DataFrame
df_index_week.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate',
                      'cellular','telephone','apr','aug','dec','jul','jun','mar','may','nov','oct','sep','fri','mon','thu','tue','wed']
df_index_week
default_loan=data.values[:,5:8]
df_now=pd.concat((pd.DataFrame(df_index_week),pd.DataFrame(default_loan)),axis=1) # combine into one DataFrame
df_now.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate',
                      'cellular','telephone','apr','aug','dec','jul','jun','mar','may','nov','oct','sep','fri','mon','thu','tue','wed','default',
       'housing','loan']
df_now
data["y"]=data["y"].astype(dtype='int')
poutcome=data.values[:,12:] # the poutcome and y columns
enc=OneHotEncoder() # create the encoder
df3_new=enc.fit_transform(poutcome).toarray() # one-hot encode both columns; the y dummies are dropped below
df_all=pd.concat((pd.DataFrame(df_index_week),pd.DataFrame(df3_new)),axis=1) # combine into one DataFrame
df_all.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate',
                      'cellular','telephone','apr','aug','dec','jul','jun','mar','may','nov','oct','sep','fri','mon','thu','tue','wed',
               'failure','nonexistent','success','1','0']
df_all.drop("0",axis=1,inplace=True)
df_all.drop("1",axis=1,inplace=True)
df_all
last=data.values[:,11:14:2]
data_last=pd.concat((pd.DataFrame(df_all),pd.DataFrame(last)),axis=1) # combine into one DataFrame
data_last.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate',
                      'cellular','telephone','apr','aug','dec','jul','jun','mar','may','nov','oct','sep','fri','mon','thu','tue','wed','failure','nonexistent','success','duration','y']
data_last

Later, while working on Task 2, I found that a few columns were missing [o(╥﹏╥)o], so I came back and added them:

bb=data.values[:,5:8]
data_data=pd.concat((pd.DataFrame(data_last),pd.DataFrame(bb)),axis=1) # combine into one DataFrame
data_data.columns=['user_id', 'age', 
                             'admin','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student',
                 'technician','unemployed','divorced','married','single',
                             'high school','illiterate','junior college','postgraduate','undergraduate',
                      'cellular','telephone','apr','aug','dec','jul','jun','mar','may','nov','oct','sep','fri','mon','thu','tue','wed',
               'failure','nonexistent','success','duration','y','default','housing','loan']
data_data

5. Finally, write the result to Excel

df3=pd.DataFrame(data_data)
df3.to_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result1_3.xlsx',sheet_name='sheet1',index=None)
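
For comparison, pandas can produce the same kind of dummy columns in a single call with pd.get_dummies, which names the new columns automatically so that no hand-written column list is needed. A minimal sketch on invented rows; note that the generated names carry a prefix (such as job_admin.), so they differ from the names used above:

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "user_id": ["BA1", "BA2", "BA3"],
    "age": [30, 45, 52],
    "job": ["admin.", "student", "admin."],
    "marital": ["single", "married", "single"],
    "y": [1, 0, 0],
})

# One-hot encode only the listed text columns; the others pass through unchanged
encoded = pd.get_dummies(toy, columns=["job", "marital"], dtype=int)

print(encoded.columns.tolist())
# ['user_id', 'age', 'y', 'job_admin.', 'job_student', 'marital_married', 'marital_single']
```

Because the original columns are never sliced by position, this route also avoids the risk of leaving columns behind, which is what happened with default, housing and loan above.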

Task 1 is done!

Task 2  Visual Analysis of Product Marketing Data

Based on the short-term data, analyze how customers' different indicators correlate with their purchases of bank products, and uncover how loyal short-term customers are to the bank.

Task 2.1 

Calculate the correlation between all indicators of short-term data, draw a heat map of the correlation coefficient, and perform necessary analysis on the results in the report.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
plt.style.use('fivethirtyeight')
%matplotlib inline
import shap
plt.rc('figure',figsize=(18,9))


short_data=pd.read_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result1_3.xlsx')
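corr=short_data.select_dtypes(include='number').corr() # numeric columns only; user_id is text and would otherwise raise an error in recent pandas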
plt.figure(figsize=(20,15))
sns.heatmap(corr
           ,annot=True
           ,linewidths=2
           ,linecolor='lightgrey')
plt.show()
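
With this many dummy columns the heat map is dense, so it can help to pull out only the correlations with the target column y and rank them by strength. A sketch under the assumption that y is numeric; the toy values are invented:

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "duration": [100, 200, 300, 400, 500],
    "age":      [50, 40, 45, 30, 35],
    "y":        [0, 0, 1, 1, 1],
})

corr = toy.corr()                 # Pearson correlation between the numeric columns
with_y = corr["y"].drop("y")      # correlation of every feature with the target

# Order the features by absolute correlation, strongest first
ranked = with_y.reindex(with_y.abs().sort_values(ascending=False).index)

print(ranked)
```

In this toy frame duration comes first with a positive coefficient and age second with a negative one; on the real data the ranking gives a short list of indicators to discuss in the report.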

Task 2.2

On the same canvas, draw a grouped histogram showing the proportion of customers in each age group under the two purchase results, with age on the x-axis and the proportion on the y-axis, and analyze the result in the report.
num_20_1=0
num_30_1=0
num_40_1=0
num_50_1=0
num_60_1=0
num_70_1=0
num_80_1=0
num_90_1=0
num_100_1=0
for i in range(len(short_data["age"])):
    if short_data.loc[i,'age']<=20 and short_data.loc[i,'y']==1:
        num_20_1+=1
    if 21<=short_data.loc[i,'age']<=30 and short_data.loc[i,'y']==1:
        num_30_1+=1
    if 31<=short_data.loc[i,'age']<=40 and short_data.loc[i,'y']==1:
        num_40_1+=1
    if 41<=short_data.loc[i,'age']<=50 and short_data.loc[i,'y']==1:
        num_50_1+=1
    if 51<=short_data.loc[i,'age']<=60 and short_data.loc[i,'y']==1:
        num_60_1+=1
    if 61<=short_data.loc[i,'age']<=70 and short_data.loc[i,'y']==1:
        num_70_1+=1
    if 71<=short_data.loc[i,'age']<=80 and short_data.loc[i,'y']==1:
        num_80_1+=1
    if 81<=short_data.loc[i,'age']<=90 and short_data.loc[i,'y']==1:
        num_90_1+=1
    if 91<=short_data.loc[i,'age']<=100 and short_data.loc[i,'y']==1:
        num_100_1+=1
num_20_0=0
num_30_0=0
num_40_0=0
num_50_0=0
num_60_0=0
num_70_0=0
num_80_0=0
num_90_0=0
num_100_0=0
for i in range(len(short_data["age"])):
    if short_data.loc[i,'age']<=20 and short_data.loc[i,'y']==0:
        num_20_0+=1
    if 21<=short_data.loc[i,'age']<=30 and short_data.loc[i,'y']==0:
        num_30_0+=1
    if 31<=short_data.loc[i,'age']<=40 and short_data.loc[i,'y']==0:
        num_40_0+=1
    if 41<=short_data.loc[i,'age']<=50 and short_data.loc[i,'y']==0:
        num_50_0+=1
    if 51<=short_data.loc[i,'age']<=60 and short_data.loc[i,'y']==0:
        num_60_0+=1
    if 61<=short_data.loc[i,'age']<=70 and short_data.loc[i,'y']==0:
        num_70_0+=1
    if 71<=short_data.loc[i,'age']<=80 and short_data.loc[i,'y']==0:
        num_80_0+=1
    if 81<=short_data.loc[i,'age']<=90 and short_data.loc[i,'y']==0:
        num_90_0+=1
    if 91<=short_data.loc[i,'age']<=100 and short_data.loc[i,'y']==0:
        num_100_0+=1
total_1=num_20_1+num_30_1+num_40_1+num_50_1+num_60_1+num_70_1+num_80_1+num_90_1+num_100_1
total_0=num_20_0+num_30_0+num_40_0+num_50_0+num_60_0+num_70_0+num_80_0+num_90_0+num_100_0
import matplotlib  # needed for matplotlib.rc below
from matplotlib import pyplot as plt
font = {'family': 'MicroSoft YaHei',  # a font that can render the Chinese labels
        'weight': 'bold',
        'size': '12'}
matplotlib.rc("font",**font)
x=["<=20","21-30","31-40","41-50","51-60","61-70","71-80","81-90","91-100"]
y_1=[num_20_1/total_1,num_30_1/total_1,num_40_1/total_1,num_50_1/total_1,num_60_1/total_1,num_70_1/total_1,num_80_1/total_1,num_90_1/total_1,num_100_1/total_1]
y_0=[num_20_0/total_0,num_30_0/total_0,num_40_0/total_0,num_50_0/total_0,num_60_0/total_0,num_70_0/total_0,num_80_0/total_0,num_90_0/total_0,num_100_0/total_0]
ax=plt.subplot(1,2,1)

plt.bar(x,y_1,color='orange')
plt.xlabel("年龄分组")
plt.xticks(x)
plt.ylabel("人数占比")
plt.title('客户购买结果为1')
ax=plt.subplot(1,2,2)
plt.bar(x,y_0,color='green')
plt.grid(color='#95a5a6',linestyle='--',linewidth=1,axis='y',alpha=0.5)
plt.xlabel("年龄分组")
plt.xticks(x)
plt.ylabel("人数占比")
plt.title('客户购买结果为0')  
plt.grid(color='#95a5a6',linestyle='--',linewidth=1,axis='y',alpha=0.5)
plt.show()
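
The eighteen counters above can also be produced by binning the ages with pd.cut and tabulating the bins against y. A minimal sketch on invented rows, using the same age bands; the column names follow the short-term data and everything else is assumed:

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "age": [19, 25, 28, 35, 35, 47, 62, 20],
    "y":   [1,  0,  1,  0,  0,  1,  1,  0],
})

# Right-inclusive bins: (0, 20], (20, 30], ..., (90, 100]
edges = [0, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ["<=20", "21-30", "31-40", "41-50", "51-60",
          "61-70", "71-80", "81-90", "91-100"]
toy["age_group"] = pd.cut(toy["age"], bins=edges, labels=labels)

# Rows: age group, columns: purchase result, values: share within each result
share = pd.crosstab(toy["age_group"], toy["y"], normalize="columns")

print(share)
```

Calling share.plot(kind="bar") on the result draws the two purchase results side by side for each age group, which is one way to get a grouped chart on a single set of axes.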

Task 2.3

On the same canvas, draw pie charts of the product purchases of blue-collar customers and students (student), and set the pie chart labels to show the purchase proportions.
blue_1=0
student_1=0
for i in range(len(short_data["age"])):
    if short_data.loc[i,"blue-collar"]==1 and short_data.loc[i,'y']==1:
        blue_1+=1
    if short_data.loc[i,"student"]==1 and short_data.loc[i,'y']==1:
        student_1+=1

blue_0=0
student_0=0
for i in range(len(short_data["age"])):
    if short_data.loc[i,"blue-collar"]==1 and short_data.loc[i,'y']==0:
        blue_0+=1
    if short_data.loc[i,"student"]==1 and short_data.loc[i,'y']==0:
        student_0+=1

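num1=[blue_1,student_1]
num0=[blue_0,student_0]
labels=['blue-collar','student']  # wedge labels, in the same order as num1 and num0
colors = ['blue','yellow']
# explode offset for each wedge (0 means no offset)
expodes = (0,0)
# configure and draw the two pies; autopct prints each wedge's share
ax=plt.subplot(1,2,1)
plt.pie(num1,explode=expodes,labels=labels,autopct='%.1f%%',shadow=True,colors=colors)
plt.title("产品购买结果为1")
plt.axis('equal')
ax=plt.subplot(1,2,2)
plt.pie(num0,explode=expodes,labels=labels,autopct='%.1f%%',shadow=True,colors=colors)
plt.title("产品购买结果为0")
plt.axis('equal')
plt.show()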
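
The task wording can also be read as one pie per occupation, each split into "bought" and "did not buy". The counts for that reading come straight from boolean filters on the dummy columns. A sketch on invented rows (the column names follow the encoded short-term data; the values are made up):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "blue-collar": [1, 1, 1, 0, 0, 0],
    "student":     [0, 0, 0, 1, 1, 0],
    "y":           [1, 0, 0, 1, 1, 0],
})

purchase_counts = {}
for job in ["blue-collar", "student"]:
    bought = int(((toy[job] == 1) & (toy["y"] == 1)).sum())
    not_bought = int(((toy[job] == 1) & (toy["y"] == 0)).sum())
    purchase_counts[job] = [bought, not_bought]

print(purchase_counts)  # {'blue-collar': [1, 2], 'student': [2, 0]}
```

Each list can then be passed to plt.pie in its own subplot, with autopct used to print the purchase share on the wedges.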

Task 2.4

With the product purchase result on the x-axis and the call duration of the contacted customers on the y-axis, draw a box plot of the call duration and analyze the result in the report.
x=["产品购买结果为1","产品购买结果为0"]
y_1=[]
y_0=[]
for i in range(len(short_data["age"])):
    if short_data.loc[i,'y']==1:
        y_1.append(short_data.loc[i,'duration'])  # durations for purchase result 1
    if short_data.loc[i,'y']==0:
        y_0.append(short_data.loc[i,'duration'])  # durations for purchase result 0
# the two groups differ in length, so wrap them in Series; pandas pads the shorter one with NaN
dt = pd.DataFrame({'产品购买结果为1': pd.Series(y_1), '产品购买结果为0': pd.Series(y_0)})
dt.boxplot()  # draws one box per column
plt.show()
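
For the written analysis it helps to have the numbers behind the boxes. The quartiles that a box plot draws can be computed per purchase result with groupby. A small sketch on invented durations:

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "duration": [100, 200, 300, 400, 50, 60, 70, 80, 90],
    "y":        [1,   1,   1,   1,   0,  0,  0,  0,  0],
})

# Lower quartile, median and upper quartile of duration for each purchase result
summary = toy.groupby("y")["duration"].quantile([0.25, 0.5, 0.75]).unstack()

print(summary)
```

Here the median duration is 250 for y = 1 and 70 for y = 0; on the real data the same table shows how far apart the two groups are.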

Task 3 Visual analysis of customer churn factors

Based on the long-term data, analyze the factors that lead to the loss of bank customers and present them visually.

Task 3.1

On the same canvas, draw a line chart showing the proportion of customers in each age group under the two churn outcomes, with age on the x-axis and the proportion on the y-axis.
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
long_data=pd.read_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result1_2.xlsx')
# Age was cleaned as text in Task 1, so make sure it is numeric before comparing
long_data["Age"]=pd.to_numeric(long_data["Age"].astype(str).str.strip())
# age bands with both ends inclusive; the first band has no lower bound
bands=[(float("-inf"),20),(21,30),(31,40),(41,50),(51,60),(61,70),(71,80),(81,90),(91,100)]
def age_shares(exited):
    # share of each age band among the customers with the given Exited flag
    ages=long_data.loc[long_data["Exited"]==exited,"Age"]
    counts=[((ages>=low)&(ages<=high)).sum() for low,high in bands]
    total=sum(counts)
    return [c/total for c in counts]
font = {'family': 'MicroSoft YaHei',  # a font that can render the Chinese labels
        'weight': 'bold',
        'size': '12'}
matplotlib.rc("font",**font)
x=["<=20","21-30","31-40","41-50","51-60","61-70","71-80","81-90","91-100"]

y_1=age_shares(1)  # churned customers (Exited == 1)
y_0=age_shares(0)  # retained customers (Exited == 0)
plt.figure(figsize=(12,5))
ax=plt.subplot(1,2,1)

plt.plot(x,y_1,color='orange')
plt.xlabel("年龄分组")
plt.xticks(x)
plt.ylabel("人数占比")
plt.title('客户流失')
ax=plt.subplot(1,2,2)
plt.plot(x,y_0,color='green')
plt.grid(color='#95a5a6',linestyle='--',linewidth=1,axis='y',alpha=0.5)
plt.xlabel("年龄分组")
plt.xticks(x)
plt.ylabel("人数占比")
plt.title('客户不流失')  
plt.grid(color='#95a5a6',linestyle='--',linewidth=1,axis='y',alpha=0.5)
plt.show()

Task 3.2

On the same canvas, draw a scatter plot showing the distribution of customer credit qualification against age under the two churn outcomes, with age on the x-axis and credit qualification on the y-axis.
# split Age and CreditScore of the long-term data by churn flag
churned=long_data[long_data['Exited']==1]
retained=long_data[long_data['Exited']==0]
x_1=churned['Age']
y_1=churned['CreditScore']
x_0=retained['Age']
y_0=retained['CreditScore']
fig=plt.figure(figsize=(12,5),dpi=100)
ax=plt.subplot(1,2,1)
plt.scatter(x_1,y_1,s=1,)
plt.xlabel("年龄")
plt.ylabel("信用资格")
plt.title('客户流失')
ax=plt.subplot(1,2,2)
plt.scatter(x_0,y_0,s=1,)
plt.xlabel("年龄")
# plt.ylabel("信用资格")
plt.yticks([])
plt.title('客户不流失')
plt.show()

Task 3.3

Construct a pivot table containing the proportion of customers at each account age under the different churn outcomes (see Table 4 for details), and on the same canvas draw a stacked column chart showing the proportion of each account age for the two churn outcomes, with the age of the customer on the x-axis and the proportion on the y-axis.

 

# number of customers at each account age (Tenure 0 to 10), counted separately for each churn flag;
# the lists are built after counting, so they hold the real counts rather than zeros
count_0=[int(((long_data['Exited']==0)&(long_data['Tenure']==t)).sum()) for t in range(11)]  # retained
count_1=[int(((long_data['Exited']==1)&(long_data['Tenure']==t)).sum()) for t in range(11)]  # churned

Get the pivot table:

total_0=0
total_1=0
for i in range(0,11):
    total_0+=count_0[i]
    total_1+=count_1[i]
total_0
total_1
for i in range(0,11):
    count_0[i]=count_0[i]/total_0
    count_1[i]=count_1[i]/total_1
data_1={'count_0':count_0,'count_1':count_1}
df_1=pd.DataFrame(data_1)
df_1.plot(kind='bar',stacked=True,alpha=0.5)
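
The same proportions can be obtained as an actual pivot table with pd.crosstab, which counts and normalizes in one call. A minimal sketch on invented rows (the column names Tenure and Exited follow the long-term data; the values are made up):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Tenure": [0, 1, 1, 2, 2, 2, 0, 1],
    "Exited": [0, 0, 1, 1, 0, 0, 1, 0],
})

# Rows: churn flag, columns: account age, values: share within each churn flag
pivot = pd.crosstab(toy["Exited"], toy["Tenure"], normalize="index")

print(pivot)
```

Each row sums to 1, and pivot.T.plot(kind="bar", stacked=True) draws the stacked columns with account age on the x-axis.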

 

Task 3.4

Following Table 5 and Table 6, bin the account age and the customer's financial assets and encode the bins as new customer features: the customer status is stored in the "Status" column and the asset stage in the "AssetStage" column. Save the encoding result to the file "result3.xlsx".
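Status=[]
AssetStage=[]

# the lists start empty, so values are appended rather than assigned by index
for i in range(len(long_data["Age"])):
    tenure=long_data.loc[i,'Tenure']
    if 0<=tenure<=3:
        Status.append("新客户")      # new customer
    elif 3<tenure<=6:
        Status.append("稳定客户")    # stable customer
    elif tenure>6:
        Status.append("老客户")      # old customer
    else:
        Status.append(None)          # missing or out-of-range value, keeps the rows aligned
for i in range(len(long_data["Age"])):
    balance=long_data.loc[i,'Balance']
    if 0<=balance<=50000:
        AssetStage.append("低资产")      # low assets
    elif 50000<balance<=90000:
        AssetStage.append("中下资产")    # lower-middle assets
    elif 90000<balance<=120000:
        AssetStage.append("中上资产")    # upper-middle assets
    elif balance>120000:
        AssetStage.append("高资产")      # high assets
    else:
        AssetStage.append(None)          # missing or out-of-range value, keeps the rows aligned
data_1={'Status':Status,'AssetStage':AssetStage}
df_1=pd.DataFrame(data_1)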
df_1.to_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\result3.xlsx',sheet_name='sheet1',index=None)
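
The two if-chains above are a binning step, which pd.cut can express directly. A sketch on invented rows, using the same thresholds and labels as the code above (Table 5 and Table 6 themselves are not reproduced here, so the boundaries are taken from that code):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Tenure":  [0, 3, 4, 6, 7, 10],
    "Balance": [0.0, 50000.0, 50000.01, 90000.0, 120000.0, 150000.0],
})

# Tenure: [0, 3] new, (3, 6] stable, above 6 old
toy["Status"] = pd.cut(
    toy["Tenure"], bins=[0, 3, 6, float("inf")],
    labels=["新客户", "稳定客户", "老客户"], include_lowest=True,
)

# Balance: [0, 50000] low, (50000, 90000] lower-middle, (90000, 120000] upper-middle, above high
toy["AssetStage"] = pd.cut(
    toy["Balance"], bins=[0, 50000, 90000, 120000, float("inf")],
    labels=["低资产", "中下资产", "中上资产", "高资产"], include_lowest=True,
)

print(toy)
```

Assigning the result as new columns of the full table also keeps Exited next to Status and AssetStage, which the next step needs.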
(2) Count the number of churned customers among new and old customers in each asset stage, draw a heat map on the same canvas with the maximum and minimum of the color scale set to 1300 and 100, and analyze the result in the report.
test=pd.read_excel('F:\\泰迪杯B题\\B题:银行客户忠诚度分析赛题数据\\任务3.4.2.xlsx')
new_lower=0
new_low=0
new_high=0
new_higher=0
old_lower=0
old_low=0
old_high=0
old_higher=0
for i in range(len(test['Exited'])):
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='新客户'and test.loc[i,'AssetStage']=="低资产":
        new_lower+=1
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='新客户'and test.loc[i,'AssetStage']=="中下资产":
        new_low+=1 
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='新客户'and test.loc[i,'AssetStage']=="中上资产":
        new_high+=1 
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='新客户'and test.loc[i,'AssetStage']=="高资产":
        new_higher+=1 
    # churned customers are the rows with Exited == 1, for old customers as well as new ones
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='老客户'and test.loc[i,'AssetStage']=="低资产":
        old_lower+=1 
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='老客户'and test.loc[i,'AssetStage']=="中下资产":
        old_low+=1 
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='老客户'and test.loc[i,'AssetStage']=="中上资产":
        old_high+=1 
    if test.loc[i,'Exited']==1 and test.loc[i,'Status']=='老客户'and test.loc[i,'AssetStage']=="高资产":
        old_higher+=1 
new=[new_lower,new_low,new_high,new_higher]
old=[old_lower,old_low,old_high,old_higher]
# use the counts computed above instead of typing the numbers in by hand
data_2={'new_lower':[new_lower],'new_low':[new_low],'new_high':[new_high],'new_higher':[new_higher],
'old_lower':[old_lower],'old_low':[old_low],'old_high':[old_high],'old_higher':[old_higher]}
df_2=pd.DataFrame(data_2)
plt.figure(figsize=(10,5))
sns.heatmap(df_2,vmax=1300,vmin=100)
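
The eight counters can also be collected into a two-by-four table, with customer status as rows and asset stage as columns, which is a more natural shape for a heat map. A sketch on invented rows (column names and labels follow the code above; the data is made up):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Exited":     [1, 1, 1, 0, 1, 1, 0],
    "Status":     ["新客户", "新客户", "老客户", "老客户", "稳定客户", "老客户", "新客户"],
    "AssetStage": ["低资产", "高资产", "低资产", "低资产", "高资产", "低资产", "高资产"],
})

# Keep only churned customers that are either new or old
churned = toy[(toy["Exited"] == 1) & toy["Status"].isin(["新客户", "老客户"])]

# Rows: customer status, columns: asset stage, values: number of churned customers
table = pd.crosstab(churned["Status"], churned["AssetStage"])

print(table)
```

The resulting table can be passed to sns.heatmap with vmin=100 and vmax=1300, plus annot=True and fmt="d" to print the counts in the cells.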


PS: This post is already long, so I'll stop here for now. The rest will be covered in another article, after which I will write a summary of the competition.

2022 The 5th "Teddy Cup" Data Analysis Skills Competition - Question B - Bank Customer Loyalty Analysis (Part 2) Link: https://blog.csdn.net/weixin_60200880/article/details/127939604?spm=1001.2014.3001.5502

Origin blog.csdn.net/weixin_60200880/article/details/127917207