Insurance industry data analysis

A complete data analysis workflow: insurance industry data analysis

1. Business background

1. Business environment

  • Macro
    China is the world's second largest insurance market, but its insurance density still lags well behind the world average.

  • Industry
    The insurance industry's premium income in 2018 was 3.8 trillion yuan, a year-on-year increase of less than 4%. The past "short, flat and fast" development model can no longer meet the industry's needs in the new era. Long-standing pain points for both the industry and its users remain hard to solve and restrict the industry's development.


  • Society
    The growth of the Internet economy has brought an incremental market to the insurance industry. At the same time, as the number of Internet users has grown, user behavior and habits have changed, and these users now need to be reached through the Internet.

  • Insurance technology
    As technology continues to be applied in the insurance industry, the concept of Internet insurance will become highly integrated with that of insurance technology.
    China's insurance market continues to grow rapidly. According to data from the China Insurance Regulatory Commission, national premium income rose from 1.4 trillion yuan in 2011 to 3.8 trillion yuan in 2018, a compound annual growth rate of 17.2%. In 2014, China's premium income exceeded 2 trillion yuan, making it the world's third largest insurance market after the United States and Japan. In 2016, premium income exceeded 3 trillion yuan and China overtook Japan to become the world's second largest insurance market. In 2019, premium income is expected to exceed 4 trillion yuan.
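
As a quick sanity check on growth figures like these, the compound annual growth rate can be recomputed from the two endpoints. The sketch below uses only the rounded 1.4 and 3.8 trillion yuan values given in the text; the helper name `cagr` is an illustration, not part of the original analysis.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# Rounded premium income from the text, in trillions of yuan.
premium_2011, premium_2018 = 1.4, 3.8
rate = cagr(premium_2011, premium_2018, years=2018 - 2011)
print(f"CAGR 2011-2018: {rate:.1%}")  # prints "CAGR 2011-2018: 15.3%"
```

With these rounded endpoints the rate comes out near 15.3%, so the 17.2% figure presumably rests on unrounded data or a different base period.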

2. Development status

  • Overview
    Affected by the structural transformation of the insurance industry, the overall development of Internet insurance has been held back. In 2018, Internet insurance premium income was 188.9 billion yuan, essentially unchanged from the previous year. Growth differed across insurance types: health insurance grew rapidly, up 108% year on year in 2018, driven mainly by short-term medical insurance.

  • Structure
    On the supply side, specialist Internet insurance companies are growing rapidly, but excessively high fixed costs and channel fees have exposed their profitability problems. Beyond sustaining that growth, building their own operating channels and exporting technology are the ways to break the deadlock in the future. On the channel side, a pattern has formed with third-party platforms as the mainstay and official websites as the supplement, and third-party platforms have gradually developed a variety of innovative business models such as B2C, B2A, and B2B2C.
  • Model
    Internet insurance is not limited to channel innovation; its core advantages also show in product design innovation and an improved service experience.

3. Development Trend

  • Competitive landscape
    As more companies enter the market, the competition for traffic becomes more intense. In the end, in-depth cooperation between insurance companies and third-party platforms will become the norm.
  • Insurtech
    As technology continues to be applied in the insurance industry, the concept of Internet insurance will become highly integrated with that of insurance technology.

4. Metrics

5. Business goals

Build user profiles for buyers of an insurance company's health insurance products, then use them for precision insurance marketing.

2. Case data

1. Data source

The data come from an American insurance company that has been a partner for many years. The company now has a new medical insurance product ready to launch.

2. Product introduction

The new product is a supplementary medical insurance policy for people over 65 years old, and it is sold through direct mail.

3. Commercial Purpose

Build a user profile for one of the insurance company's health insurance products and identify the groups most likely to purchase it, so that marketing can target them.

4. Data introduction

The case data contains 76 fields. Because there are so many, group them by category when processing the data so that they are easier to understand and review.

4.1 Basic information

4.2 Basic situation

4.3 Family members

4.4 Status of family members

4.5 Disease history

4.6 Financial Information

4.7 Personal Habits

4.8 Family status

4.9 City of Residence

5. Analysis approach

  • Use experience to make a rough judgment about which features are likely to be related to whether a user purchases insurance.
  • Combine business experience, data visualization, and feature engineering to explore which of these features matter most.
  • After modeling, revisit the features judged important or unimportant here to see whether that judgment was accurate.
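
One simple way to start the exploration described above is to rank numeric features by the strength of their linear correlation with the target. This is only a sketch on synthetic data: the columns `home_value` and `noise` and the purchase rule are invented for illustration, and a plain correlation misses non-linear effects.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the case data (all values are made up).
toy = pd.DataFrame({
    "age": rng.integers(65, 90, size=n),
    "home_value": rng.normal(200, 50, size=n),
    "noise": rng.normal(size=n),
})

# Hypothetical rule: the purchase probability falls as age rises.
purchase_prob = 1 / (1 + np.exp((toy["age"] - 72) / 3))
toy["resp_flag"] = (rng.random(n) < purchase_prob).astype(int)

# Absolute correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()["resp_flag"]
    .drop("resp_flag")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

The ranking is a first impression to compare against the model's own feature importances later, not a substitute for them.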

3. Python code implementation

  • Load the data and check the number of samples and features, the data types, and other basic information.
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv(r'D:\liwork\a\data\ma_resp_data_temp.csv')
pd.set_option('display.max_columns',100)  # show up to 100 columns
df.head()
df.shape
df.info()
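  • Compute summary statistics and count the null values
# Convert the ID to object so it is not treated as a numeric feature
df['KBM_INDV_ID']=df['KBM_INDV_ID'].astype('object')
df.dtypes
df.describe().T
describe = df.describe().T
describe.to_excel('output/describe_var.xlsx')  # the output/ directory must already exist
# Total number of columns
len(df.columns)
# Number of columns that contain null values
len(df.columns)-df.dropna(axis=1).shape[1]
NA=df.isnull().sum() # number of null values in each column
NA
# Reset the index
NA=NA.reset_index()
NA
# Rename the columns
NA.columns=['Var','NA_count']
NA
# Keep only the columns with missing values (NA_count > 0)
NA=NA[NA.NA_count>0].reset_index(drop=True)
NA
# Proportion of null values in each of those columns
NA.NA_count/df.shape[0]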
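
The null-value steps above can also be collected into one helper that returns the count, the ratio, and the data type together. This is a minimal sketch on a made-up frame; the function name `missing_summary` and the toy values are assumptions, and only the column names `age`, `AASN`, and `resp_flag` come from the case data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column that has nulls: count, ratio, and dtype."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "NA_count": counts,
        "NA_ratio": counts / len(frame),
        "dtype": frame.dtypes.astype(str),
    })
    summary = summary[summary["NA_count"] > 0]
    return (
        summary.sort_values("NA_ratio", ascending=False)
        .rename_axis("Var")
        .reset_index()
    )

# Tiny illustrative frame, not the real data.
toy = pd.DataFrame({
    "age": [70.0, np.nan, 68.0, 81.0],
    "AASN": ["E", None, None, "A"],
    "resp_flag": [0, 1, 0, 0],
})
na_table = missing_summary(toy)
print(na_table)
```

Sorting by ratio puts the columns that need the most attention at the top of the table.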
  • Data visualization analysis
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn')  # on matplotlib 3.6 and later this style is named 'seaborn-v0_8'

# Enable Chinese characters in plot labels
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

# plt.rcParams['font.family']='Arial Unicode MS' # on macOS
  • Check whether the two classes are balanced
df.resp_flag.value_counts()

plt.figure(figsize=(10,3))
sns.countplot(y='resp_flag',data=df)
plt.show()

# Proportion of positive samples (purchases)
df.resp_flag.sum()/df.resp_flag.shape[0]
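
If the classes turn out to be unbalanced, it can help to keep the class ratio identical in the training and test sets. The sketch below shows a stratified split on synthetic data; the 30% positive rate is an assumption chosen for illustration, not a figure from the case data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data with roughly 30% positives (made-up ratio).
toy = pd.DataFrame({
    "age": rng.integers(65, 90, size=1000),
    "resp_flag": (rng.random(1000) < 0.3).astype(int),
})

overall_rate = toy["resp_flag"].mean()
print(toy["resp_flag"].value_counts(normalize=True))

# stratify= keeps the share of each class the same in both parts.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["resp_flag"], random_state=420
)
print(round(train["resp_flag"].mean(), 3), round(test["resp_flag"].mean(), 3))
```

Without `stratify`, a random split can leave the test set with noticeably more or fewer positives than the training set, which makes scores harder to interpret.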
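  • Plot the distribution of age
# Histogram plus density curve (histplot replaces sns.distplot, which is deprecated in recent seaborn releases)
sns.histplot(df['age'],bins=20,kde=True,stat='density')

df['age'].min()   # youngest age in the data
df['age'].max()   # oldest age in the data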
  • Plot the age distribution of the two classes separately
# Quick demonstration of kdeplot on random data
x = np.random.randn(100)
sns.kdeplot(x)

# Fill the area under the curve (fill= replaces the deprecated shade= argument)
sns.kdeplot(x,fill=True,color='y')
plt.show()

sns.kdeplot(df.age[df.resp_flag==0],label='0',fill=True)
sns.kdeplot(df.age[df.resp_flag==1],label='1',fill=True)
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()
plt.show()
  • Compare the number of purchases across education levels
# Distribution of education level
plt.figure(figsize=(10,3))
sns.countplot(y='c210mys',data=df)
plt.show()

sns.countplot(x='c210mys',hue='resp_flag',data=df)
  • Number of purchases by county size
sns.countplot(x='N2NCY',hue='resp_flag',data=df)
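
Count plots show how many buyers each category has, but large categories dominate them. A purchase rate per category is often easier to compare. The sketch below computes it with a groupby on a tiny invented frame; only the column names `c210mys` and `resp_flag` come from the case data.

```python
import pandas as pd

# Ten made-up rows: an education level code and the purchase flag.
toy = pd.DataFrame({
    "c210mys": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "resp_flag": [0, 0, 1, 1, 1, 0, 1, 0, 0, 0],
})

# Sample size and purchase rate for each level, highest rate first.
rate_by_level = (
    toy.groupby("c210mys")["resp_flag"]
    .agg(n="count", purchase_rate="mean")
    .sort_values("purchase_rate", ascending=False)
)
print(rate_by_level)
```

Keeping the sample size `n` next to the rate makes it easy to spot categories whose rate rests on very few rows.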

For the columns with null values, look up each column's data type and add it to the NA table

temp=[]
for i in NA.Var:
    temp.append(df[i].dtypes)

NA['dtype']=temp

NA
  • Fill the null values
NA[NA.Var!='age']

df.AASN.mode()[0]

# Fill every column except age with its mode
for i in NA[NA.Var!='age'].Var:
    df[i]=df[i].fillna(df[i].mode()[0])

# Fill age with its mean
df['age']=df['age'].fillna(df['age'].mean())

# Verify the result: all zeros means every null has been filled
df.isnull().sum()
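
The same filling rule (mean for age, mode for everything else) can be wrapped in a function that returns a filled copy and leaves the input untouched, which makes it easy to compare before and after. This is a sketch under that assumption; `fill_missing` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def fill_missing(frame: pd.DataFrame, mean_cols=("age",)) -> pd.DataFrame:
    """Return a copy with nulls filled: mean for mean_cols, mode otherwise."""
    filled = frame.copy()
    for col in filled.columns[filled.isnull().any()]:
        if col in mean_cols:
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

# Tiny illustrative frame, not the real data.
toy = pd.DataFrame({
    "age": [70.0, np.nan, 68.0, 81.0],
    "AASN": ["E", None, "E", "A"],
})
filled = fill_missing(toy)
print(filled)
print(filled.isnull().sum())
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a selected column, avoids the chained-assignment pitfalls of recent pandas versions.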

Variable encoding

df.head()

# Drop the ID column
del df['KBM_INDV_ID']

# Select the object (string) columns
df_object=df.select_dtypes('object')

df_object.shape

from sklearn.preprocessing import OrdinalEncoder

# Preview the encoded values as an array
df_object=OrdinalEncoder().fit_transform(df_object)
df_object

# Convert each string column to numeric codes
for i in df.columns:
    if df[i].dtypes=='object':
        df[i]=OrdinalEncoder().fit_transform(df[[i]]).ravel()
df.head()
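
Encoding all string columns with a single fitted `OrdinalEncoder` keeps the code-to-label mapping available, which helps later when reading tree splits. A minimal sketch on invented data follows; the column names `state` and `gender` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame with two string columns and one numeric column.
toy = pd.DataFrame({
    "state": ["NY", "CA", "NY", "TX"],
    "gender": ["F", "M", "F", "F"],
    "age": [70, 66, 81, 75],
})

text_cols = list(toy.select_dtypes(exclude="number").columns)

encoder = OrdinalEncoder()
codes = pd.DataFrame(
    encoder.fit_transform(toy[text_cols]),
    columns=text_cols,
    index=toy.index,
)
encoded = toy.drop(columns=text_cols).join(codes)

# Which label each integer code stands for, per column.
mapping = {col: list(cats) for col, cats in zip(text_cols, encoder.categories_)}
print(encoded)
print(mapping)
```

The codes follow the sorted order of the labels, so they carry no real ranking. That is usually acceptable for tree models, which only split on thresholds, but it can mislead linear models.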

Modeling

from sklearn import tree
from sklearn.model_selection import train_test_split

# Split the data set: every column except the target is a feature
X=df.drop(columns='resp_flag')
y=df['resp_flag']

Xtrain,Xtest,Ytrain,Ytest=train_test_split(X,y,test_size=0.3,random_state=420)

# Fit a decision tree and report its accuracy on the test set
clf = tree.DecisionTreeClassifier().fit(Xtrain,Ytrain)
clf.score(Xtest,Ytest)
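
`clf.score` reports accuracy, which can look good on unbalanced data even when the model rarely finds buyers. The sketch below, on synthetic data from `make_classification`, compares accuracy with a majority-class baseline and adds ROC AUC and a per-class report. The data shape and the 75/25 class weights are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in for the case data.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=420
)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xtrain, Ytrain)

accuracy = clf.score(Xtest, Ytest)
# Accuracy obtained by always predicting the larger class.
baseline = max(Ytest.mean(), 1 - Ytest.mean())
auc = roc_auc_score(Ytest, clf.predict_proba(Xtest)[:, 1])

print(f"accuracy={accuracy:.3f} majority baseline={baseline:.3f} AUC={auc:.3f}")
print(classification_report(Ytest, clf.predict(Xtest)))
```

For a marketing use case, the recall and precision of the purchase class are usually the numbers to watch, since they describe how many likely buyers the model finds and how many mailings would be wasted.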

Model optimization

from sklearn.model_selection import GridSearchCV

# Grid search over tree depth and minimum leaf size
param_grid={'max_depth':range(3,8),
           'min_samples_leaf':range(1000,3000,100)}
GR = GridSearchCV(tree.DecisionTreeClassifier(),param_grid,n_jobs=-1,cv=5)

GR.fit(Xtrain,Ytrain)

# Best parameter values found for tree.DecisionTreeClassifier
GR.best_params_

# Mean cross-validated score of the best parameter combination
GR.best_score_

# Refit with the chosen parameters and score on the test set
clf=tree.DecisionTreeClassifier(max_depth=7,min_samples_leaf=1000).fit(Xtrain,Ytrain)
clf.score(Xtest,Ytest)
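
Instead of copying the best parameters by hand, the fitted search object exposes the refitted model as `best_estimator_`. The sketch below also sets `scoring='roc_auc'`, which is a swapped-in choice rather than what the original search used (it used the default accuracy). The data is synthetic and the grid is scaled down to match its size.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the case data.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=4, random_state=0
)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=420
)

# Smaller leaf sizes than in the article because this data set is smaller.
param_grid = {"max_depth": range(3, 8), "min_samples_leaf": [20, 50, 100]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(Xtrain, Ytrain)

# With the default refit=True, best_estimator_ is already trained on Xtrain.
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.score(Xtest, Ytest))
```

Using `best_estimator_` avoids the risk of the hard-coded values drifting from what the search actually selected when the data or the grid changes.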

Draw the decision tree

features=list(X.columns)  # the columns, in order, that the model was trained on

import graphviz  # must be installed beforehand

dot_data = tree.export_graphviz(clf,
                               feature_names=features,
                               class_names=['No Purchase','Purchase'],
                               filled=True,
                               rounded=True)

graph = graphviz.Source(dot_data)
graph

# Save the rendered tree to a file (PDF by default)
graph.render('model1')
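
When Graphviz is not available, scikit-learn can print the same tree as indented text with `export_text`, and `feature_importances_` gives a quick ranking to compare against the features judged important before modeling. The sketch below uses synthetic data, and the `feature_0` to `feature_5` names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder feature names for synthetic data.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)
clf = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=50, random_state=0
).fit(X, y)

# Text rendering of the split rules, one line per node.
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Share of the total impurity reduction attributed to each feature.
importances = pd.Series(
    clf.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head())
```

Features that never appear in a split get an importance of zero, which is a direct way to check the earlier guesses about which fields matter.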

4. Output results

Let's look at the characteristics of the two customer groups with the highest purchase ratio.

First category

  • Located in an area with a low percentage of medical insurance coverage
  • Residence period of less than 7 years
  • Aged 65 to 72

Our recommendation to the business team is therefore to promote the product in areas with a low percentage of medical insurance coverage, focusing on relatively recent arrivals aged 65 to 72. The success rate for this group should be higher.

Second category

  • Located in an area with a low percentage of medical insurance coverage
  • Residence period of more than 7 years
  • Higher residential value

These are long-term residents of high-end communities in the area, and they are also a group to focus on for insurance marketing.
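
Customer groups like the two above correspond to leaves of the decision tree, so they can also be found programmatically instead of by reading the diagram. The sketch below assigns each row to its leaf with `clf.apply` and ranks the leaves by purchase rate. The data is synthetic, and the column names `coverage_pct` and `residence_years` and the purchase rule are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 6000

# Synthetic customers (hypothetical columns and values).
toy = pd.DataFrame({
    "coverage_pct": rng.uniform(0, 100, n),
    "residence_years": rng.integers(0, 30, n),
    "age": rng.integers(65, 90, n),
})
# Made-up rule: low coverage plus younger age raises the purchase rate.
purchase_prob = 0.1 + 0.5 * (toy["coverage_pct"] < 30) * (toy["age"] < 73)
toy["resp_flag"] = (rng.random(n) < purchase_prob).astype(int)

X = toy.drop(columns="resp_flag")
clf = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=200, random_state=0
).fit(X, toy["resp_flag"])

# Leaf id for every row, then size and purchase rate per leaf.
toy["leaf"] = clf.apply(X)
leaf_stats = (
    toy.groupby("leaf")["resp_flag"]
    .agg(n="count", purchase_rate="mean")
    .sort_values("purchase_rate", ascending=False)
)
print(leaf_stats.head(2))

# Value ranges of the features inside the best leaf describe the segment.
best_leaf = leaf_stats.index[0]
segment = toy.loc[toy["leaf"] == best_leaf, list(X.columns)]
print(segment.describe().loc[["min", "max"]])
```

The minimum and maximum of each feature inside a leaf give a rough, readable description of the segment that can be handed to the business team.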

Origin blog.csdn.net/qq_36816848/article/details/113591095