Practical project: Insurance industry user classification

1. Project introduction

Project purpose: build a decision-tree model to classify insurance users, identify those most likely to purchase, and target them with marketing

1.1 Industry Background

This section introduces the industry's development status and trends (from macro, industry, and social perspectives).

Insurance industry metrics (figure omitted)

1.2 Data introduction

Data (extraction code: 1111)
Data dictionary (extraction code: 1111)
Data source: an insurance company that wants to promote a particular product
Business purpose: build a user portrait for this product and find the people most likely to buy it, for targeted marketing

The data has 76 fields to process, grouped into several blocks of information: basic information; financial information; personal habits; family status; city of residence; and so on.

Next, determine which features are likely to relate to whether a user buys insurance: combine business experience with data visualization, feature engineering, and similar techniques to find the most important features. The process:

  1. Import and observe the data:
    understand the number of samples and features and their basic information, and check for duplicate values.

  2. Explore the data with visualization:
    the users' age distribution; the relationship between age, gender, education and insurance purchase;
    a missing-value filling scheme.

  • Decide the transcoding scheme (0-1 transcoding, dummy variables, variables needing special replacement).
  • Special cases: redundant fields and fields to delete (highly correlated fields, useless fields).
    Then null-value filling, variable encoding, and modeling:
    first split into training and test sets, then fill missing values, then transcode.
1. Explore data, data visualization

# make the notebook print every expression's result, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# fix minus signs on axis ticks rendering incorrectly
plt.rcParams['axes.unicode_minus'] = False

# use a font that can display Chinese characters
plt.rcParams['font.sans-serif'] = ['Simhei']

2. Code implementation

Import Data

data_00 = pd.read_csv('data/ma_resp_data_temp.csv')
feature_dict = pd.read_excel('保险案例数据字典.xlsx')
data_01 = data_00.copy()  # keep a backup copy

Explore data

data_01.head()


feature_dict  # the data dictionary holds each field's description


Handle anomalous column label names

Problem: the field names in the data do not match those in the data dictionary.
Solution: check whether each column label in data_01 appears among the variable names in the data dictionary.

# handle anomalous column labels
data_01.columns


feature_dict.变量名
# symmetric difference (find the mismatches)
np.setxor1d(data_01.columns, feature_dict.变量名)
# the output contains the labels that belong to one side but not the other
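np.setxor1d returns the symmetric difference: the items that appear in exactly one of the two inputs. A toy example of what to expect:

import numpy as np

# elements that occur in exactly one of the two arrays, sorted
np.setxor1d(['a', 'b', 'c'], ['b', 'c', 'd'])
# -> array(['a', 'd'], dtype='<U1')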

Speculation: N2029 in the dictionary corresponds to N2N29 in the data, and so on.

# data-table column labels
'NY8Y9', 'N2N29', 'N3N39', 'N4N49', 'N5N59', 'N6N64'
# correspond one-to-one with the data-dictionary labels
'N1819', 'N2029', 'N3039', 'N4049', 'N5059', 'N6064'

meda
# to delete
data_01['meda'].nunique()  # outputs 75: values repeat heavily, so drop the field directly

Replace the anomalous labels

a = ['NY8Y9', 'N2N29', 'N3N39', 'N4N49', 'N5N59', 'N6N64']
b = ['N1819', 'N2029', 'N3039', 'N4049', 'N5059', 'N6064']
# build a mapping dictionary from the labels to be replaced
dic = dict(zip(a, b))
dic

# custom replacement function, to be vectorized
def tran(x):
    if x in dic:
        return dic[x]
    else:
        return x
tran = np.vectorize(tran)  # vectorize it

# use the vectorized function to replace the anomalous headers
data_01.columns = tran(data_01.columns)

dic content: {'NY8Y9': 'N1819', 'N2N29': 'N2029', 'N3N39': 'N3039', 'N4N49': 'N4049', 'N5N59': 'N5059', 'N6N64': 'N6064'}
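For reference, the same replacement can be done without a vectorized function: pandas' built-in rename accepts such a mapping directly and leaves unmapped labels untouched.

# equivalent to the vectorized tran: unmapped labels pass through unchanged
data_01 = data_01.rename(columns=dic)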

Create a custom translation function

To make data exploration more efficient, create a custom translation function that replaces the DataFrame column labels with their Chinese descriptions via a mapping dictionary.

# map each variable name to its description in the data dictionary
dic = {k: v for k, v in feature_dict[['变量名', '变量说明']].values}

def chinese(x):
    y = x.copy()
    # map the incoming column names through the dictionary
    y.columns = pd.Series(y.columns).map(dic)
    return y

chinese(data_01).head()


Explore basic user information

feature_dict.变量名[:5]


feature_dict.变量名[:5].tolist()  # as a list
# ['KBM_INDV_ID', 'resp_flag', 'age', 'GEND', 'c210mys']
data_01[feature_dict.变量名[:5].tolist()].head()


# take columns 0-4 and translate the headers
data0_4 = chinese(data_01[feature_dict.变量名[:5].tolist()])
data0_4.head()


data0_4.info()
data0_4.isnull().sum()

Custom feature-frequency exploration function

Input a DataFrame and output the frequency distribution of each feature

def fre(x):
    for i in x.columns:
        print("Field name:", i)
        print("----------")
        print("Field dtype:", x[i].dtype)
        print("----------------------------")
        print(x[i].value_counts())  # frequencies
        print("----------------------------")
        print("Missing values:", x[i].isnull().sum())
        print("------------------------------------------------\n\n")
fre(data0_4)

Based on this output, note in the data dictionary which operations each field needs: filling, deletion, transcoding, and so on.

# bar chart of the target column
import seaborn as sns

# Chinese-capable fonts for seaborn
sns.set_style("darkgrid", {"font.sans-serif": ['simhei', 'Droid Sans Fallback']})

plt.figure(1, figsize=(6, 2))
sns.countplot(y='是否response', data=data0_4)
plt.show()


# density plot by age,
# drawn separately for buyers, non-buyers, and everyone
sns.kdeplot(data0_4.年龄[data0_4.是否response == 1], label='buy')
sns.kdeplot(data0_4.年龄[data0_4.是否response == 0], label='not buy')
sns.kdeplot(data0_4.年龄.dropna(), label='everyone')

plt.xlim([60, 90])
plt.xlabel('Age')
plt.ylabel('Density')


Explore family-member fields

# take columns 5-22 and translate the headers
data5_22 = chinese(data_01[feature_dict.变量名[5:23].tolist()])
data5_22.head()


fre(data5_22)


Explore disease-related fields

# take columns 23-34 and translate the headers
data23_35 = chinese(data_01[feature_dict.变量名[23:35].tolist()])
data23_35.head()


fre(data23_35)

# 0-1 transcoding
def zero_one(x):
    for i in x.columns:
        if x[i].dtype == 'object':
            # map each category to 0..n-1, in descending frequency order
            dic = dict(zip(list(x[i].value_counts().index), range(x[i].nunique())))
            x[i] = x[i].map(dic)
    return x

zero_one(data23_35).corr()

Fields 23-34 are all yes/no answers, so apply 0-1 transcoding before computing correlations.
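Note that zero_one assigns codes by descending frequency, so which answer becomes 0 depends on the data. For purely binary fields a fixed mapping is more predictable; a minimal sketch, assuming the raw values are literally 'Y' and 'N':

# order-independent alternative for binary fields: 'Y' -> 1, 'N' -> 0
data23_35_01 = data23_35.replace({'Y': 1, 'N': 0})
data23_35_01.corr()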

import matplotlib.pyplot as plt
import seaborn as sns

# draw a correlation heatmap
sns.heatmap(zero_one(data23_35).corr(), cmap='Blues')


Custom function to find fields whose correlation with another field exceeds a threshold

Filter fields with correlation higher than 0.65

def higt_cor(x, y=0.65):
    data_cor = (x.corr() > y)
    a = []
    for i in data_cor.columns:
        # the diagonal is always True, so a column sum >= 2 means at least
        # one *other* field correlates with this one above the threshold
        if data_cor[i].sum() >= 2:
            a.append(i)
    return a  # candidates to consider deleting

higt_cor(data23_35)
# drop these three: 是否有关节炎, 胆固醇含量是否过高, 是否有过敏性鼻炎
# (arthritis, high cholesterol, allergic rhinitis)


Explore investment-related fields

# take columns 35-40 and translate the headers
data35_41 = chinese(data_01[feature_dict.变量名[35:41].tolist()])
data35_41.head()


fre(data35_41)
sns.heatmap(zero_one(data35_41).corr(),cmap='Blues')


sns.countplot(x='N2NCY', hue='resp_flag', data=data_01)
plt.xlabel('County size')
plt.ylabel('Number of purchases')


Explore Household Income

# take columns 51-58 and translate the headers
data51_59 = chinese(data_01[feature_dict.变量名[51:59].tolist()])
data51_59.head()


fre(data51_59)
sns.heatmap(zero_one(data51_59).corr(),cmap='Blues')


higt_cor(data51_59)
# output: ['收入所处排名', '普查家庭有效购买收入', '家庭收入', '家庭房屋价值', '社会经济地位评分']

Explore residence-area fields

# take the columns from 59 onward and translate the headers
data59 = chinese(data_01[feature_dict.变量名[59:].tolist()])
data59.head()


fre(data59)
sns.countplot(x='STATE_NAME', hue='resp_flag', data=data_01)
plt.xlabel('State')
plt.ylabel('Number of purchases')


a = chinese(data_01[["c210apvt","c210blu","c210bpvt","c210mob","c210wht","zhip19"]])
sns.heatmap(a.corr(),cmap='Blues')


higt_cor(data59)
# output: ['贫穷以上人的比例', '已婚人群所占比例', '有房子人所占比例', '独宅住户所占比例']
sns.heatmap(data59.corr(), cmap='brg')

The required processing for each field is recorded in the data dictionary.

Data cleaning

data_02 = data_01.copy()
data_02.shape
#(43666, 76)

Delete features

del_col = ["KBM_INDV_ID","U18","POEP","AART","AHCH","AASN","COLLEGE",
 "INVE","c210cip","c210hmi","c210hva","c210kses","c210blu","c210bpvt","c210poo","KBM_INDV_ID","meda"]

data_02 = data_02.drop(columns=del_col)
data_02.shape  #(43666, 60)

Remove duplicate rows

data_02.drop_duplicates().shape
# (43666, 60) -- the shape is unchanged, so there are no duplicate rows

Split into training and test sets

Be sure to split the data set before filling and transcoding.

from sklearn.model_selection import train_test_split

y = data_02.pop('resp_flag')  # target
X = data_02                   # features

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=100)
Xtrain_01 = Xtrain.copy()
Xtest_01 = Xtest.copy()
Ytrain_01 = Ytrain.copy()
Ytest_01 = Ytest.copy()
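Since the target classes are imbalanced (see the countplot earlier), it can also be worth preserving the class ratio in both splits; a variant using train_test_split's stratify argument:

# stratified split: keeps the buy / not-buy proportions equal in train and test
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y)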

Fill missing values

Fill with the median

fil = ["age","c210mah","c210b200","c210psu","c210wht","ilor"]

Xtrain_01[fil].median()


dic = dict(zip(Xtrain_01[fil].median().index,Xtrain_01[fil].median()))
dic


# fill the training set with the medians
Xtrain_01 = Xtrain_01.fillna(dic)

Fill with the mode

mod = ["N1819","ASKN","MOBPLUS","N2NCY","LIVEWELL","HOMSTAT","HINSUB"]

dic_mod = dict(zip(Xtrain_01[mod].mode().columns,Xtrain_01[mod].iloc[0,:]))

Xtrain_01 = Xtrain_01.fillna(dic_mod) 

Replacement fill

Xtrain_01['N6064'] = Xtrain_01['N6064'].replace('0', 'N')  # replace '0' with 'N'
Xtrain_01.isnull().sum()[Xtrain_01.isnull().sum() != 0]
# Series([], dtype: int64) -- nothing left unfilled

Fill the test set (summary)

# fields to fill
fil = ["age", "c210mah", "c210b200", "c210psu", "c210wht", "ilor"]

# fill medians -- test set
dic = dict(zip(Xtest_01[fil].median().index, Xtest_01[fil].median()))

Xtest_01 = Xtest_01.fillna(dic)

# fill modes -- test set
mod = ["N1819", "ASKN", "MOBPLUS", "N2NCY", "LIVEWELL", "HOMSTAT", "HINSUB"]

dic_mod = dict(zip(mod, Xtest_01[mod].mode().iloc[0]))

Xtest_01 = Xtest_01.fillna(dic_mod)

# replacement fill
Xtest_01['N6064'] = Xtest_01['N6064'].replace('0', 'N')

Xtest_01.isnull().sum()[Xtest_01.isnull().sum() != 0]
# Series([], dtype: int64)
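One caveat: the summary above computes the medians and modes from the test set itself. To keep the two sets consistent and avoid drawing on test-set information, the statistics fitted on the training set can be reused instead; a minimal sketch, assuming dic and dic_mod still hold the dictionaries computed from Xtrain_01 above:

# reuse the training-set statistics on the test set
Xtest_01 = Xtest_01.fillna(dic)      # dic: training-set medians
Xtest_01 = Xtest_01.fillna(dic_mod)  # dic_mod: training-set modes
Xtest_01['N6064'] = Xtest_01['N6064'].replace('0', 'N')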

Transcoding

encod_col = pd.read_excel('保险案例数据字典_清洗.xlsx',sheet_name=2)
encod_col.head()


# object-dtype columns in Xtrain_01
object_tr = Xtrain_01.describe(include='O').columns
object_tr


# check that every object column appears in the transcoding plan
np.setdiff1d(object_tr, encod_col['变量名'])
# array([], dtype=object) -- all covered

0-1 transcoding

# variable names marked for 0-1 transcoding
z_0_list = encod_col[encod_col['转'] == '0-1'].变量名
z_0_list.head()


Xtrain_02 = Xtrain_01[z_0_list]
Xtrain_02.head()


# sklearn preprocessing module
from sklearn.preprocessing import OrdinalEncoder

# fit and transform in one step
new_arr = OrdinalEncoder().fit_transform(Xtrain_02)
new_arr


# restore the original column labels and index
Xtrain_02 = pd.DataFrame(data=new_arr, columns=Xtrain_02.columns, index=Xtrain_02.index)
Xtrain_02.head()

Write the encoded Xtrain_02 columns back into Xtrain_01:

Xtrain_01[z_0_list] = Xtrain_02
Xtrain_01.head()

Dummy variable transcoding

# variable names marked for dummy-variable transcoding
o_h_list = encod_col[encod_col['转'] == '哑变量'].变量名
o_h_list


Xtrain_01[o_h_list].head()
o_h_01 = ['c210mys', 'LIVEWELL']  # non-string variables
o_h_02 = [i for i in o_h_list if i not in o_h_01]  # string variables

# convert o_h_02 first
Xtrain_02 = Xtrain_01.copy()
chinese(Xtrain_02[o_h_02]).head()


Xtrain_02 = pd.get_dummies(chinese(Xtrain_02[o_h_02]))
Xtrain_02.head()


# then convert o_h_01
Xtrain_03 = Xtrain_01.copy()

# cast to string type first
Xtrain_03 = Xtrain_03[o_h_01].astype(str)
# convert and overwrite
Xtrain_03 = pd.get_dummies(chinese(Xtrain_03[o_h_01]))

Xtrain_03.head()

With Xtrain_02 and Xtrain_03 converted, first drop the original transcoded fields and then insert the converted ones back into the data set.

# Xtrain_04: drop the columns that were transcoded
Xtrain_04 = Xtrain_01.copy()
Xtrain_04 = chinese(Xtrain_04.drop(columns=o_h_01 + o_h_02))
Xtrain_04.head()


Xtrain_04.shape  # (30566, 51)

Xtrain_02.shape  # dummies from the string variables
# (30566, 31)

Xtrain_03.shape  # dummies from the non-string variables
# (30566, 14)
# concatenate Xtrain_04, Xtrain_02 and Xtrain_03 (51 + 31 + 14 = 96 columns)
Xtrain_05 = pd.concat([Xtrain_04, Xtrain_02, Xtrain_03], axis=1)
Xtrain_05.shape
# (30566, 96)
Xtrain_05.head()


Transcoding the test set (summary)

0-1 Transcoding Summary

# fields to transcode
encod_col = pd.read_excel('保险案例数据字典_清洗.xlsx', sheet_name=2)

# object-dtype columns in Xtest_01
object_tr = Xtest_01.describe(include='O').columns

# check that every object column appears in the transcoding plan
np.setdiff1d(object_tr, encod_col['变量名'])

# 0-1 transcoding
# variable names marked for 0-1 transcoding
z_0_list = encod_col[encod_col['转'] == '0-1'].变量名

Xtest_02 = Xtest_01[z_0_list]

# sklearn preprocessing module
from sklearn.preprocessing import OrdinalEncoder

# fit and transform in one step
new_arr = OrdinalEncoder().fit_transform(Xtest_02)
# restore the original column labels and index
Xtest_02 = pd.DataFrame(data=new_arr, columns=Xtest_02.columns, index=Xtest_02.index)

Xtest_01[z_0_list] = Xtest_02

Xtest_01.head()
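A related caveat: OrdinalEncoder is fitted separately on the training and test sets here, so if one set contains a category the other lacks, the same category can end up with different integer codes. Fitting once on the training data and reusing the encoder avoids this; a sketch, where train_filled and test_filled stand for the filled but not-yet-encoded frames (hypothetical names):

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
# learn the category-to-integer mapping from the training data only
train_filled[z_0_list] = enc.fit_transform(train_filled[z_0_list])
# apply the identical mapping to the test data
test_filled[z_0_list] = enc.transform(test_filled[z_0_list])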

Dummy variables (summary)

# variables marked for dummy-variable transcoding
o_h_list = encod_col[encod_col['转'] == '哑变量'].变量名

o_h_01 = ['c210mys', 'LIVEWELL']  # non-string variables
o_h_02 = [i for i in o_h_list if i not in o_h_01]  # string variables

# convert o_h_02 (string variables) first
Xtest_02 = Xtest_01.copy()
Xtest_02 = pd.get_dummies(chinese(Xtest_02[o_h_02]))

# then convert o_h_01 (non-string)
Xtest_03 = Xtest_01.copy()
# cast to string type first
Xtest_03 = Xtest_03[o_h_01].astype(str)
# convert and overwrite
Xtest_03 = pd.get_dummies(chinese(Xtest_03[o_h_01]))


# Xtest_04: drop the columns that were transcoded
Xtest_04 = Xtest_01.copy()
Xtest_04 = chinese(Xtest_04.drop(columns=o_h_01 + o_h_02))


# concatenate Xtest_04, Xtest_02 and Xtest_03
Xtest_05 = pd.concat([Xtest_04, Xtest_02, Xtest_03], axis=1)
Xtest_05.shape  # (13100, 96)
Xtest_05.head()
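Likewise, pd.get_dummies is applied to the two sets independently, so a category that appears in only one of them would produce mismatched columns. Aligning the test frame to the training columns guards against this:

# align test columns to the training columns; categories missing from the test set become 0
Xtest_05 = Xtest_05.reindex(columns=Xtrain_05.columns, fill_value=0)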


Preliminary modeling

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(random_state=420,class_weight='balanced')
cvs = cross_val_score(clf,Xtrain_05,Ytrain)
cvs.mean()

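For context, the score this baseline has to beat is the majority-class rate, which a one-liner reveals:

# naive baseline: the accuracy of always predicting the majority class
Ytrain.value_counts(normalize=True).max()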

Grid search to find optimal parameters

from sklearn.model_selection import GridSearchCV

# parameter grid
param_test = {
    'splitter': ('best', 'random'),
    'criterion': ('gini', 'entropy'),  # Gini impurity vs. information entropy
    'max_depth': range(3, 15)          # maximum tree depth
    # ,'min_samples_leaf': range(1, 50, 5)
}

gsearch = GridSearchCV(estimator=clf,          # the model to tune
                       param_grid=param_test,  # the parameters to search
                       scoring='roc_auc',      # evaluation metric
                       n_jobs=-1,              # -1: use as many workers as CPU cores
                       cv=5,                   # 5-fold cross-validation
                       # iid=False,            # this argument was removed in scikit-learn 0.24
                       verbose=2)              # print training progress

gsearch.fit(Xtrain_05, Ytrain_01)


# best cross-validated score observed during the search
gsearch.best_score_

# the winning parameter combination; by default (refit=True) GridSearchCV
# refits this estimator on the whole training set, so gsearch can predict directly
gsearch.best_params_


Model evaluation

from sklearn.metrics import accuracy_score   # accuracy
from sklearn.metrics import precision_score  # precision
from sklearn.metrics import recall_score     # recall
from sklearn.metrics import roc_curve

# sklearn metrics expect (y_true, y_pred)
y_pre = gsearch.predict(Xtest_05)
accuracy_score(Ytest, y_pre)   # 0.609007633

precision_score(Ytest, y_pre)  # 0.5100116264048572
recall_score(Ytest, y_pre)     # 0.7481523
fpr, tpr, thresholds = roc_curve(Ytest, y_pre)  # ROC points
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, c='b', label='ROC curve')
plt.plot(fpr, fpr, c='r', ls='--')  # chance diagonal

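To summarize the curve in a single number, the AUC can be computed from the same fpr/tpr arrays:

from sklearn.metrics import auc

auc(fpr, tpr)  # area under the ROC curve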

Output rules

# best parameters:
# {'criterion': 'entropy', 'max_depth': 6, 'splitter': 'best'}
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

import graphviz

# put the best parameters into the classifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth=6, splitter='best')
clf = clf.fit(Xtrain_05, Ytrain)


features = Xtrain_05.columns
dot_data = tree.export_graphviz(clf,
                                feature_names=features,
                                class_names=['Not Buy', 'Buy'],
                                filled=True,
                                rounded=True,
                                leaves_parallel=False)

graph = graphviz.Source(dot_data)

graph
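If a file copy of the tree is wanted, the graphviz Source object can also render to disk; for instance (the file name is arbitrary):

# writes insurance_tree.pdf to the working directory
graph.render('insurance_tree')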


3. Output analysis

Which characteristics mark the two customer groups with the highest purchase rates?

The first group:

  • lives in an area with a low proportion of medical-insurance coverage
  • has lived there for less than 7 years
  • is 65-72 years old

When advising the business team, we would therefore suggest promoting in areas with low medical-insurance coverage and focusing on people over 65 who have only recently moved into the area; marketing to this group should have a higher success rate.

The second group:

  • lives in an area with a low proportion of medical-insurance coverage
  • has lived there for more than 7 years
  • lives in a higher-value home

These are long-term residents of the area's high-end communities, and they are the other group on which to focus insurance marketing.

Other suggestions:

  1. Understand customer needs
    Marketing should start from customer needs. With personal data (individual characteristics), recommend the products that fit each customer; with family data and life-stage information, recommend wealth-management, life, pension, and education insurance; with external data, offer property and life insurance to high-net-worth customers. Using data this way improves product targeting and raises the return on investment.
  2. Develop new products
    Insurance companies should also work with external channels to develop products suited to different scenarios, such as flight-delay insurance, travel-time insurance, and phone-theft insurance. The aim of such products is not to profit from the premiums themselves but to find potential customers, connect with them, and understand them through data analysis; this lowers marketing costs and directly improves the return on investment.

Source: blog.csdn.net/Sun123234/article/details/128968067