"Application of Machine Learning in Auto Insurance Pricing" Experimental Report

 

Table of contents

1. Experimental topic
2. Experimental settings
    1. Operating system
    2. IDE
    3. Python
    4. Libraries
3. Experimental content
    Conjectures before the experiment
4. Experimental results
    1. Data preprocessing and data division
        One-hot encoding results (taking the region feature as an example)
    2. Model training
    3. Draw the initial decision tree
    4. Model evaluation
    5. Model optimization
        Draw the optimized decision tree
    6. Modify the sample and grid search parameters to further optimize the model
5. Experimental analysis


1. Experimental topic

        Application of Machine Learning in Auto Insurance Pricing

2. Experimental settings

1. Operating system:

        Windows 11 Home

2. IDE:

        PyCharm 2022.3.1 (Professional Edition)

3. Python:

        3.8.0

4. Libraries:

        numpy 1.20.0
        matplotlib 3.7.1
        pandas 1.1.5
        scikit-learn 0.24.2

The environment can be created with conda:

conda create -n ML python=3.8 pandas scikit-learn numpy matplotlib

3. Experimental content

        In this experiment, a decision tree model is built to analyze auto insurance data. The data come from the following MTPLdata.csv dataset:

[Figure: preview of the MTPLdata.csv dataset]

        The auto insurance dataset contains 500,000 samples, each with 8 features and 1 label. The label clm (int64) is a binary variable taking the value 0 or 1 and indicates whether the owner has filed an auto insurance claim. The features are the owner's age (age, int64), vehicle age (ac, int64), power (power, int64), fuel type (gas, object), brand (brand, object), owner's area (area, object), residential vehicle density (dens, int64), and car license type (ct, object).
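As a quick sanity check (a minimal sketch, assuming MTPLdata.csv sits in the working directory), the schema and the class balance of clm can be inspected with pandas:

import pandas as pd

# Load the dataset and inspect its schema and label balance.
MTPLdata = pd.read_csv('MTPLdata.csv')
print(MTPLdata.dtypes)   # feature/label dtypes
print(MTPLdata.shape)    # expected: (500000, 9)
print(MTPLdata['clm'].value_counts(normalize=True))  # share of claims vs. no claims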

Conjectures before the experiment:

        See the experiment report for details.

4. Experimental results

1. Data preprocessing and data division

        Read in the data and preprocess it, including dummy-variable (one-hot) encoding, then split it into training and test sets.

import pandas as pd
from sklearn.model_selection import train_test_split

MTPLdata = pd.read_csv('MTPLdata.csv')
# Dummy-variable handling via one-hot encoding.
# Convert the clm column to string so it is treated as a class label.
MTPLdata['clm'] = MTPLdata['clm'].map(str)
# Select columns 1-5 (ac, brand, age, gas, power) as the feature input.
X_raw = MTPLdata.iloc[:, [0, 1, 2, 3, 4]]
# X_raw = MTPLdata.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7]]  # variant with all 8 features
# One-hot encode the categorical features.
X = pd.get_dummies(X_raw)
# Use column 9 (clm) as the label y.
y = MTPLdata.iloc[:, 8]

# Split the data into training and test sets; the test set is 20% of the data,
# stratified on y to preserve the claim rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=1)

 

One-hot encoding results (taking the region feature as an example):
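A minimal sketch of what this encoding produces for the area column (shown for illustration; area is only selected in the commented-out 8-feature variant above):

# One-hot encode the area column alone to see the resulting dummy columns.
area_dummies = pd.get_dummies(MTPLdata['area'], prefix='area')
print(area_dummies.head())  # one 0/1 indicator column per region category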

2. Model training

        We train a decision tree classifier (maximum tree depth set to 2, balanced class weights, and the default Gini impurity as the splitting criterion).

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=2, class_weight='balanced', random_state=123)
model.fit(X_train, y_train)     # fit on the training data
model.score(X_test, y_test)     # mean accuracy on the test set
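For reference, class_weight='balanced' weights each class inversely to its frequency. A small sketch of the formula scikit-learn applies (n_samples / (n_classes * class_count)):

import numpy as np

# Reproduce the 'balanced' class weights used by the classifier.
counts = y_train.value_counts()
weights = len(y_train) / (len(counts) * counts)
print(weights)  # the rare claim class ('1') receives the larger weight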

3. Draw the initial decision tree

        To make the model easier to interpret, we call the plot_tree function to draw the decision tree.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(11, 11))
plot_tree(model, feature_names=X.columns, node_ids=True, rounded=True, precision=2)
plt.show()
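As a text-based alternative (a sketch using scikit-learn's export_text, not part of the original report), the same tree can be printed to the console:

from sklearn.tree import export_text

# Print the fitted tree as indented if/else rules.
print(export_text(model, feature_names=list(X.columns)))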

[Figure: the initial decision tree (max_depth=2)]

 

4. Model evaluation

        The model is evaluated on the test set via a confusion matrix, from which several metrics are computed.

import numpy as np

pred = model.predict(X_test)
table = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
# table

# Compute the model's accuracy, error rate, recall, specificity, and precision.
table = np.array(table)  # convert the pandas DataFrame to a numpy array
Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)      # accuracy
Error_rate = 1 - Accuracy                                   # error rate
Sensitivity = table[1, 1] / (table[1, 0] + table[1, 1])     # recall (sensitivity)
Specificity = table[0, 0] / (table[0, 0] + table[0, 1])     # specificity
Precision = table[1, 1] / (table[0, 1] + table[1, 1])       # precision
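These hand-computed values can be cross-checked against scikit-learn's built-in metrics (a sketch; the labels are the strings '0'/'1' because clm was converted to str earlier):

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, digits=3))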

5. Model optimization

        To find a better model, we use the cost_complexity_pruning_path function to compute the total leaf-node impurity of the decision tree for different values of ccp_alpha, and plot total impurity against ccp_alpha.

model = DecisionTreeClassifier(class_weight='balanced', random_state=123)
path = model.cost_complexity_pruning_path(X_train, y_train)
plt.plot(path.ccp_alphas, path.impurities, marker='o', drawstyle='steps-post')
plt.xlabel('alpha (cost-complexity parameter)')
plt.ylabel('Total Leaf Impurities')
plt.title('Total Leaf Impurities vs alpha for Training Set')
plt.show()

[Figure: total leaf impurity vs. ccp_alpha; left: 10,000-sample (1w) subset, right: 500,000-sample (50w) full set]

        Next, we select the optimal ccp_alpha by cross-validation and retrain the model with it.
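A sketch of that selection step (an assumption of how it could be done; the candidate values come from the pruning path computed above, and searching all of them can be slow on the full training set):

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Cross-validate over the ccp_alpha values returned by the pruning path.
alpha_cv = GridSearchCV(DecisionTreeClassifier(class_weight='balanced', random_state=123),
                        {'ccp_alpha': path.ccp_alphas},
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
alpha_cv.fit(X_train, y_train)
print(alpha_cv.best_params_['ccp_alpha'])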

 

Draw the optimized decision tree

        Hyperparameters are tuned by 10-fold cross-validated grid search (here over max_depth and min_samples_leaf; the ccp_alpha grid is left commented out), after which the optimized tree can be drawn.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

rangeccpalpha = np.linspace(0.000001, 0.0001, 10, endpoint=True)
param_grid = {
    'max_depth': np.arange(3, 7, 1),
    # 'ccp_alpha': rangeccpalpha,
    'min_samples_leaf': np.arange(1, 5, 1)
}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeClassifier(class_weight='balanced', random_state=123),
                     param_grid, cv=kfold)
model.fit(X_train, y_train)
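To read out the winning configuration and actually draw the optimized tree (a sketch; best_estimator_ is the refitted best model that GridSearchCV keeps by default):

print(model.best_params_, model.best_score_)

# Draw the cross-validated best tree.
best_tree = model.best_estimator_
plt.figure(figsize=(16, 12))
plot_tree(best_tree, feature_names=X.columns, node_ids=True, rounded=True, precision=2)
plt.show()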

[Figure: the optimized decision tree found by grid search]

 

In addition, the importance of each feature is calculated and a feature importance chart is plotted.

plt.figure(figsize=(20, 20))
# GridSearchCV itself has no feature_importances_; use the refitted best estimator.
importances = model.best_estimator_.feature_importances_
sorted_index = importances.argsort()
plt.barh(range(X_train.shape[1]), importances[sorted_index])
plt.yticks(np.arange(X_train.shape[1]), X_train.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Decision Tree')
plt.tight_layout()
plt.show()

[Figure: feature importance bar chart]
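As a complementary check (not in the original report), permutation importance on the test set can validate the impurity-based ranking, using scikit-learn's sklearn.inspection module:

from sklearn.inspection import permutation_importance

# Score drop when each column is shuffled; repeated to average out noise.
result = permutation_importance(model.best_estimator_, X_test, y_test,
                                n_repeats=5, random_state=123)
for name, imp in sorted(zip(X_test.columns, result.importances_mean),
                        key=lambda t: -t[1])[:10]:
    print(f'{name}: {imp:.4f}')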

6. Modify the sample and grid search parameters to further optimize the model

 

        See the experiment report for details.

5. Experimental analysis

        Please download the code and experiment report resources accompanying this experiment (the experimental analysis section is 2 pages and 1,162 words).
