Introduction
This article is based on a Jupyter notebook; the complete code and analysis can be found on GitHub: https://github.com/Libra-1023/data-mining/blob/master/Bank_customer_churn/Bank_customer_churn_EDA.ipynb
1. Banking customer groups and product categories
Generally speaking, bank customers can be divided into individual customers and corporate customers.
The bank's business for individual customers is built mainly on the sensible management of their personal finances: it provides deposits and withdrawals, small loans, agency investment and wealth management, information consulting, and various other intermediary services. These services generate returns for customers and help them guard against risk, while also improving the bank's own earnings.
Corporate customers are all the enterprises, institutions, and government agencies that have business relationships with the bank, with enterprises as the mainstay. Corporate customers can bring the bank a large volume of deposits, loans, and fees, and are an important source of bank profits.
Retail customers are generally divided into the following five types:
A bank's business is generally divided into asset business and liability business.
Bank credit asset business:
- Credit loans
- Mortgage loans, divided into inventory mortgage loans and real estate mortgage loans
- Guaranteed loans
- Loan securitization
Bank liability business:
- Demand deposit
- Time deposit
- Savings balance
- Negotiable certificate of deposit
- Other types
2. The business significance of the customer churn early warning model
Strictly speaking, customer churn means that a customer terminates and cancels all of their accounts at the bank. A specific business department, however, may define churn separately as the customer's termination of all or some of that department's businesses.
Research shows that customer churn at commercial banks is relatively serious: the churn rate of domestic commercial banks can reach 20% or even higher, and acquiring a new customer can cost up to 5 times as much as retaining an existing one. It is therefore particularly important to mine the massive volume of customer transaction records for information that affects churn, and to build an efficient customer churn early warning system.
The main reasons for customer churn include:
- Price loss
- Product loss
- Service loss
- Market loss
- Promotion loss
- Technology loss
- Political loss
Basic methods to maintain customer relationships:
- Tracking system
- Product follow-up
- Expand sales
- Maintenance visit
- Mechanism maintenance
Establish a quantitative model to reasonably predict which customer groups are likely to churn.
Common risk factors:
- The number and type of products held by the customer
- The customer's age and gender
- Geographic area
- Product category
- Transaction interval
- Marketing and promotion methods
- The bank's service methods and attitude
3. Data introduction and description
The data set contains 17,241 customer records, including 1,741 churn samples, for an overall churn rate of 10.10%.
The data is divided into two parts: bank-owned fields and external third-party data.
Bank-owned fields:
- Account information
- Personal information
- Deposit information
- Consumption and transaction information
- Financial management and fund information
- Counter service, online banking information
External third-party data:
- Outbound customer data
- Asset data
- Other consumer data
1. Single-factor analysis of continuous variables
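- Percentage of valid records (the complement of the missing rate)
# Keep only the rows where col is not null, with the independent and target variables.
# This works because np.nan != np.nan.
validDf = df.loc[df[col] == df[col]][[col, target]]
# Fraction of non-missing records
validRcd = validDf.shape[0] * 1.0 / df.shape[0]
# Format the fraction as a percentage
validRcdFmt = "%.2f%%" % (validRcd * 100)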
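The `df[col] == df[col]` filter relies on the fact that NaN is never equal to itself. As a minimal, dependency-free sketch of the same idea (the list `balances` and the helper `valid_record_rate` are hypothetical names, not part of the notebook):

```python
def valid_record_rate(values):
    """Return the fraction of entries that are not missing (None or NaN)."""
    def is_valid(v):
        # NaN is the only float that is not equal to itself, so `v == v`
        # is False exactly for NaN; None is handled explicitly.
        return v is not None and v == v

    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


# Hypothetical column with two missing entries out of five.
balances = [1200.0, float("nan"), 350.5, None, 80.0]
rate = valid_record_rate(balances)
rate_fmt = "%.2f%%" % (rate * 100)
print(rate_fmt)
```

In pandas, `Series.notna()` expresses the same filter more explicitly than the self-comparison.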
- Overall distribution
Initial distribution and truncated distribution:
# Truncation
if truncation:
    # Cut the distribution off at the 95th percentile
    pcnt95 = np.percentile(validDf[col], 95)
    # For both churned and non-churned customers, cap deposit amounts above the cut-off at the cut-off
    x = x.map(lambda x: min(x, pcnt95))
    y = y.map(lambda x: min(x, pcnt95))
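To see what the truncation does to a long right tail, here is a small sketch without NumPy. The `percentile` helper is an illustrative re-implementation of linear-interpolation percentiles (the convention `np.percentile` uses by default), and `deposits` is made-up data:

```python
def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def truncate_upper(values, cutoff):
    """Cap every value above `cutoff` at `cutoff`."""
    return [min(v, cutoff) for v in values]


# Hypothetical deposits: twenty ordinary values and one extreme outlier.
deposits = list(range(1, 21)) + [1000]
cutoff = percentile(deposits, 95)
truncated = truncate_upper(deposits, cutoff)
print(cutoff, max(deposits), max(truncated))
```

Only the outlier changes; every value at or below the cut-off is left as it was, so the bulk of the distribution stays readable in a histogram. With pandas, `Series.clip(upper=pcnt95)` performs the same capping.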
- Difference in distribution across the target variable
Analysis of variance (ANOVA): see Wikipedia, "Analysis of variance"
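For a continuous variable, one-way ANOVA compares the variation between the churned and retained groups with the variation inside each group. The sketch below computes the F statistic by hand on made-up values; it is an illustration of the formula, not the notebook's own code:

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of observations around their group mean
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in g)

    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical values of one variable for churned and retained customers.
churned = [1.0, 2.0, 3.0]
retained = [4.0, 5.0, 6.0]
f_stat = one_way_anova_f([churned, retained])
print(f_stat)
```

A large F value suggests the variable is distributed differently in the two groups. If SciPy is available, `scipy.stats.f_oneway` returns the same statistic together with a p-value.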
2. Single-factor analysis of categorical variables
- Percentage of valid records
validDf = df.loc[df[col] == df[col]][[col, target]]
validRcd = validDf.shape[0] * 1.0 / df.shape[0]
recdNum = validDf.shape[0]
validRcdFmt = "%.2f%%" % (validRcd * 100)
- Number of categories
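- Overall distribution
# Run the single-factor analysis on each categorical variable
# (the directory names mean "single-factor analysis / categorical variables")
filepath = path + r'/单因子分析/类别型变量/'
for val in stringCols:
    CharVarPerf(Alldata, val, 'CHURN_CUST_IND', filepath)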
- Difference in distribution across the target variable
Chi-square test: see Wikipedia, "Chi-squared test"
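chisqDf = Alldata[['GENDER_CD', 'CHURN_CUST_IND']]
grouped = chisqDf['CHURN_CUST_IND'].groupby(chisqDf['GENDER_CD'])  # group by gender
count = list(grouped.count())
churn = list(grouped.sum())
chisqTable = pd.DataFrame({'total': count, 'churn': churn})
# 0.101 is the expected (overall) churn rate; multiplying by the group size gives the expected number of churned customers
chisqTable['expected'] = chisqTable['total'].map(lambda x: round(x * 0.101))
# Access the columns by label; positional x[0]/x[1] is deprecated on label-indexed rows
chisqValList = chisqTable[['churn', 'expected']].apply(lambda x: (x['churn'] - x['expected'])**2 / x['expected'], axis=1)
# chisqVal is the chi-square statistic
chisqVal = sum(chisqValList)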
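The snippet above sums the contributions of the churned cells only, using the overall churn rate as the expectation. The textbook Pearson statistic for a contingency table also includes the non-churned cells. A minimal sketch of that full calculation, on made-up counts (the table `observed` is hypothetical, not taken from the data set):

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed_count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical counts: rows are gender groups, columns are [churned, retained].
observed = [[30, 70], [10, 90]]
chi2, dof = pearson_chi_square(observed)
print(chi2, dof)
```

With 1 degree of freedom the 5% critical value is about 3.84, so a statistic well above that suggests churn is not independent of the grouping variable. If SciPy is available, `scipy.stats.chi2_contingency` also returns a p-value; note that for 2x2 tables it applies a continuity correction by default, so its statistic will differ slightly from this uncorrected one.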
3. Multi-factor analysis
Because of business relationships, calculation logic, and other factors, there is a certain degree of collinearity between variables. This collinearity needs to be studied and handled appropriately, because it causes:
- Information redundancy
- Higher data maintenance costs
- A certain impact on some models
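import random
from pandas.plotting import scatter_matrix

# Use short names in place of the original column names, which are too long to display
col_to_index = {numericCols[i]: 'var' + str(i) for i in range(len(numericCols))}
# Sample 15 columns, because a single figure cannot show too many
corrCols = random.sample(numericCols, 15)
# Rename on a new DataFrame rather than in place on a slice, which avoids SettingWithCopyWarning
sampleDf = Alldata[corrCols].rename(columns=col_to_index)
# Draw the scatter matrix
# diagonal='hist' draws histograms on the diagonal; diagonal='kde' draws kernel density estimates
# alpha=0.2 sets the transparency
scatter_matrix(sampleDf, alpha=0.2, figsize=(6, 6), diagonal='kde')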
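A scatter matrix shows collinearity visually; a numeric complement is to compute pairwise Pearson correlations and list the pairs above some threshold. The sketch below is dependency-free and uses made-up columns; the 0.8 threshold and the helper names are assumptions for illustration, not values from the notebook:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical numeric columns; var1 is roughly twice var0.
columns = {
    "var0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var1": [2.1, 3.9, 6.2, 8.0, 9.8],
    "var2": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = highly_correlated_pairs(columns)
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining one of the two variables. In pandas, `DataFrame.corr()` computes the same pairwise Pearson coefficients for all numeric columns at once.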