Financial Scorecard Project—2. Introduction to Bank Customer Churn Early Warning Model (Single-factor and Multi-factor Analysis)

Introduction

  This post walks through a Jupyter notebook. The complete code and analysis are on GitHub: https://github.com/Libra-1023/data-mining/blob/master/Bank_customer_churn/Bank_customer_churn_EDA.ipynb

1. Banking customer groups and product categories

  Generally speaking, bank customers can be divided into individual customers and corporate customers.
  A bank's business with individual customers centers on helping them manage their personal finances. It provides deposits and withdrawals, small loans, agency investment and wealth management, information consulting, and various other intermediary services. These services generate returns for customers and help them guard against risk, while also improving the bank's own earnings.
  Corporate customers are all enterprises, institutions, and government agencies that have business relationships with the bank, with enterprises making up the majority. Corporate customers bring in large amounts of deposits, loans, and fees, and are an important source of bank profit.
  Retail customers are generally divided into the following five types:
[Figure: the five types of retail customers]
  A bank's business is generally divided into asset business and liability business.
Bank credit asset business:

  1. Credit loans
  2. Mortgage loans, divided into inventory mortgage loans and real estate mortgage loans
  3. Guaranteed loans
  4. Loan securitization

Bank liability business:

  1. Demand deposit
  2. Time deposit
  3. Savings deposit
  4. Negotiable certificate of deposit
  5. Other types

2. The business significance of the customer churn early warning model

  Strictly speaking, customer churn means that a customer has closed all of their accounts with the bank. A specific business department, however, can define churn more narrowly, as a customer terminating all or some of that department's products.
  Research shows that customer churn at commercial banks is relatively serious: the churn rate at domestic commercial banks can reach 20% or even higher. Acquiring a new customer can cost up to 5 times as much as retaining an existing one. It is therefore particularly important to extract the information that affects churn from the large volume of customer transaction records and to build an efficient early warning system for customer churn.
The main reasons for customer churn include:

  1. Price loss
  2. Product loss
  3. Service loss
  4. Market loss
  5. Promotion loss
  6. Technology loss
  7. Political loss

Basic methods to maintain customer relationships:

  1. Tracking system
  2. Product follow-up
  3. Expand sales
  4. Maintenance visit
  5. Mechanism maintenance

A quantitative model can be built to predict which customer groups are likely to churn. Common risk factors include:

  • The number and type of products held by the customer
  • Customer's age and gender
  • Geographic area
  • Product category
  • Transaction interval
  • Marketing and promotion methods
  • The bank's service channels and service attitude

3. Data introduction and description

  The data set contains 17,241 customer records, including 1,741 churn samples, for an overall churn rate of 10.10%.
  The data is divided into two parts: bank-owned fields and external third-party data.
Bank-owned fields:

  • Account information
  • Personal (demographic) information
  • Deposit information
  • Consumption and transaction information
  • Financial management and fund information
  • Counter service, online banking information

External third-party data:

  • Outbound-call customer data
  • Asset data
  • Other consumer data

1. Single-factor analysis of continuous variables

  • Percentage of valid records (the complement of the missing rate)

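    # Keep only the rows where col is not null, with the feature and target columns.
    # NaN != NaN, so df[col] == df[col] is False exactly where the value is missing.
    validDf = df.loc[df[col] == df[col]][[col, target]]
    # Fraction of non-missing records
    validRcd = validDf.shape[0] * 1.0 / df.shape[0]
    # Format the fraction as a percentage string
    validRcdFmt = "%.2f%%" % (validRcd * 100)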
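
The NaN self-comparison trick used above can be checked without pandas. The following is a minimal, dependency-free sketch; the helper name `valid_rate` and the sample values are made up for illustration and are not part of the notebook.

```python
def valid_rate(values):
    """Return the fraction of entries that are neither None nor NaN."""
    def is_valid(v):
        # NaN is the only float value that is not equal to itself.
        return v is not None and not (isinstance(v, float) and v != v)
    return sum(1 for v in values if is_valid(v)) / len(values)

# Hypothetical sample: five balances, two of them missing.
balances = [1200.0, float("nan"), 300.5, None, 0.0]
rate = valid_rate(balances)
rate_fmt = "%.2f%%" % (rate * 100)
print(rate_fmt)  # 60.00%
```

Note that a balance of 0.0 counts as a valid record; only None and NaN are treated as missing.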
    
  • Overall distribution
    Initial distribution and truncated distribution

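    # Optional truncation to limit the influence of extreme values
    if truncation:
        # Cap at the 95th percentile of the valid values
        pcnt95 = np.percentile(validDf[col], 95)
        # x and y hold the values for churned and non-churned customers;
        # any value above the cap is set to the cap
        x = x.map(lambda v: min(v, pcnt95))
        y = y.map(lambda v: min(v, pcnt95))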
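
To see what the truncation does to a skewed variable, here is a small dependency-free sketch. The `percentile` helper is written for this example and interpolates linearly between the two closest ranks, which is also the default rule of `np.percentile`; the deposit values are made up.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical deposits: 1..19 plus one extreme outlier.
deposits = list(range(1, 20)) + [10_000]
cap = percentile(deposits, 95)

# Values above the cap are replaced by the cap; the rest are unchanged.
truncated = [min(v, cap) for v in deposits]
print(max(deposits), max(truncated))
```

Only the outlier is changed, so the bulk of the distribution stays readable when plotted.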
    
  • Difference in distribution of target variable

Method: analysis of variance (ANOVA); see the Wikipedia article "Analysis of variance".
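
As a rough illustration of the one-way ANOVA idea, the F statistic compares the variation between group means with the variation inside the groups. The sketch below computes it by hand on made-up numbers; the function name `anova_f` is an assumption for this example, and the notebook's own call is not reproduced here. In practice a library routine such as `scipy.stats.f_oneway` also returns the p-value.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical values of one continuous variable, split by the churn flag.
stayed = [5.0, 6.0, 7.0]
churned = [1.0, 2.0, 3.0]
f_stat = anova_f([stayed, churned])
print(f_stat)  # 24.0
```

A large F value suggests that the variable is distributed differently for churned and non-churned customers.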

2. Single-factor analysis of categorical variables

  • Percentage of valid records

    validDf = df.loc[df[col] == df[col]][[col, target]]
    validRcd = validDf.shape[0]*1.0/df.shape[0]
    recdNum = validDf.shape[0]
    validRcdFmt = "%.2f%%"%(validRcd*100)
    
  • Number of categories

  • Overall distribution

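    # Run the single-factor analysis on each categorical variable
    # (the output folder name means "single-factor analysis / categorical variables")
    filepath = path + r'/单因子分析/类别型变量/'
    for val in stringCols:
        CharVarPerf(Alldata, val, 'CHURN_CUST_IND', filepath)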
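
`CharVarPerf` is defined in the notebook and is not reproduced here. As a hedged sketch of the kind of summary such a function typically produces, the following computes the record count and churn rate per category; the function name `category_summary` and the sample data are hypothetical.

```python
from collections import defaultdict

def category_summary(categories, targets):
    """Return {category: {"count": n, "churn_rate": r}}, skipping missing categories."""
    counts = defaultdict(lambda: [0, 0])  # category -> [total, churned]
    for cat, y in zip(categories, targets):
        if cat is None:
            continue  # missing category values are excluded
        counts[cat][0] += 1
        counts[cat][1] += y
    return {cat: {"count": total, "churn_rate": churned / total}
            for cat, (total, churned) in counts.items()}

# Hypothetical gender codes and churn flags (1 = churned).
gender = ["F", "M", "F", None, "M", "M"]
churn_flag = [0, 1, 0, 1, 0, 1]
summary = category_summary(gender, churn_flag)
for cat, stats in sorted(summary.items()):
    print(cat, stats["count"], round(stats["churn_rate"], 3))
```

Comparing the per-category churn rates with the overall rate gives a first impression of whether the variable separates churned from non-churned customers.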
    
  • Difference in distribution of target variable

Method: chi-square test; see the Wikipedia article "Chi-squared test".

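    chisqDf = Alldata[['GENDER_CD', 'CHURN_CUST_IND']]
    # Group the churn flag by gender
    grouped = chisqDf['CHURN_CUST_IND'].groupby(chisqDf['GENDER_CD'])
    count = list(grouped.count())
    churn = list(grouped.sum())
    chisqTable = pd.DataFrame({'total': count, 'churn': churn})
    # 0.101 is the overall (expected) churn rate; multiplying it by the group size
    # gives the expected number of churned customers in that group
    chisqTable['expected'] = chisqTable['total'].map(lambda x: round(x * 0.101))
    # Per-group contribution: (observed - expected)^2 / expected
    chisqValList = chisqTable[['churn', 'expected']].apply(
        lambda row: (row['churn'] - row['expected']) ** 2 / row['expected'], axis=1)
    # chisqVal is the chi-square statistic
    chisqVal = sum(chisqValList)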
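
The code above sums the contributions of the churned cells only. The textbook Pearson chi-square statistic for a contingency table also includes the non-churned cells. The following dependency-free sketch shows that fuller form on made-up counts; the function name `chi_square` and the numbers are assumptions for illustration, not the notebook's data.

```python
def chi_square(table, overall_rate):
    """Pearson chi-square for {category: (total, churned)} against an overall churn rate."""
    stat = 0.0
    for total, churned in table.values():
        stayed = total - churned
        exp_churned = total * overall_rate
        exp_stayed = total - exp_churned
        # Contribution of the churned cell and of the non-churned cell
        stat += (churned - exp_churned) ** 2 / exp_churned
        stat += (stayed - exp_stayed) ** 2 / exp_stayed
    return stat

# Hypothetical counts: 1,000 customers per gender, 200 churned overall.
table = {"F": (1000, 80), "M": (1000, 120)}
overall_rate = 200 / 2000
stat = chi_square(table, overall_rate)
print(round(stat, 4))  # 8.8889
```

With two genders and two outcomes the test has one degree of freedom, for which the 5% critical value is 3.841, so the difference in this made-up example would be judged significant.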

3. Multi-factor analysis

   Variables are often collinear to some degree because of business relationships, shared calculation logic, and similar causes. This collinearity needs to be examined and handled appropriately, because it leads to:

  • Information redundancy
  • Extra cost of maintaining the data
  • Adverse effects on some models
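
    import random
    from pandas.plotting import scatter_matrix

    # Map the original column names to short names, since the originals are too long to display
    col_to_index = {numericCols[i]: 'var' + str(i) for i in range(len(numericCols))}
    # Sample 15 columns, since a single figure cannot show too many
    corrCols = random.sample(numericCols, 15)
    # rename returns a new DataFrame, which avoids modifying a slice of Alldata in place
    sampleDf = Alldata[corrCols].rename(columns=col_to_index)
    # Draw the scatter-matrix plot
    # diagonal='hist' draws histograms on the diagonal; diagonal='kde' draws kernel density estimates
    # alpha=0.2 sets the point transparency
    scatter_matrix(sampleDf, alpha=0.2, figsize=(6, 6), diagonal='kde')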
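
A scatter matrix shows collinearity visually. A numeric complement is to compute pairwise Pearson correlations and list the pairs above a cutoff. The sketch below is dependency-free and uses made-up columns; the helper names and the 0.8 threshold are assumptions for illustration, not values taken from the notebook.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical columns: var1 is exactly twice var0, var2 is only weakly related.
data = {
    "var0": [1, 2, 3, 4, 5],
    "var1": [2, 4, 6, 8, 10],
    "var2": [5, 1, 4, 2, 3],
}
pairs = correlated_pairs(data)
print(pairs)
```

For each flagged pair, one of the two variables is a candidate for removal or for merging into a derived feature.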

[Figure: scatter matrix of the 15 sampled numeric variables]


Origin blog.csdn.net/weixin_46649052/article/details/114310105