Allstate (US) Insurance Claims Prediction Model Case Study

Overview

Founded in 1931, Allstate Insurance Company (Allstate) is the second-largest property and casualty insurer engaged in personal insurance business in the United States, and it also ranks among the 15 largest life insurers in the country. The company is headquartered in the Chicago area.

In December 2018, the World Brand Lab released its "Top 500 World Brands 2018" list, in which Allstate ranked 486th.

Allstate is working to improve its claims service and has released a dataset that can be used to build models that predict claim severity. This case study aims to develop an automated model for predicting the cost (severity) of a specific insurance claim.


The dataset provided by Allstate contains data on accidents that occurred in households (each row corresponds to a specific household), and the target variable is the claim cost, a numeric value.

Since the data concerns personal information, Allstate has heavily anonymized it (the feature names have been changed), which makes the features difficult to interpret. The dataset contains claims records for Allstate customers and is highly anonymized throughout.

The following figure is a screenshot of the dataset

[Figure: screenshot of the dataset]

Problem statement:

The training dataset contains both categorical and continuous features. The target variable is the loss (claim severity), which is numeric. Since the target is a numeric loss value, we can treat this as a regression problem. It is also a supervised machine learning problem because we have target values for the training data. In my opinion, automating this prediction will improve overall customer service, which benefits both the company and the claimants.

Our task is to predict how severe a new household's claim is likely to be, i.e. to predict future losses from the given features.

Metric

For this regression model, the metric we will use is the Mean Absolute Error (MAE).

This is a simple and straightforward metric that compares the predicted values with the actual values. MAE is not heavily influenced by outliers, and it is easy to explain because it is simple to calculate.

[Figure: MAE formula, MAE = (1/n) Σ |y_i − ŷ_i|]
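As a quick illustration, MAE can be computed directly with NumPy or with scikit-learn's mean_absolute_error; the numbers below are made up for illustration only.

import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted claim costs, for illustration only
y_true = np.array([2213.18, 1283.60, 3005.09])
y_pred = np.array([2100.00, 1500.00, 2900.00])

mae_manual = np.mean(np.abs(y_true - y_pred))    # (1/n) * sum of |y_i - y_hat_i|
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)                   # both print the same value (about 144.89 here)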

Part 1:

Data Analysis and Visualization:

1.1 Dataset overview:

 

· Training data: 188,318 rows and 132 columns (features plus the target variable).

· Test data: 129,446 rows with the same feature columns, but without the target variable.

· Apart from the target and id columns, there are 130 distinct features in total, covering both categorical and numeric data types.

· Of the 130 features, 116 are categorical and 14 are numerical. There are no missing values in the training dataset, so Allstate has provided data that is already well preprocessed (a quick inspection sketch follows this list).
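A minimal sketch of how these counts can be reproduced with pandas, assuming the files are named train.csv and test.csv (the variable names below are my own, not necessarily the original notebook's):

import pandas as pd

train_data = pd.read_csv('train.csv')   # assumed file name
test_data = pd.read_csv('test.csv')     # assumed file name

print(train_data.shape)   # (188318, 132) per the overview above
print(test_data.shape)    # 129,446 rows per the overview above

cat_columns = [c for c in train_data.columns if c.startswith('cat')]
cont_columns = [c for c in train_data.columns if c.startswith('cont')]
print(len(cat_columns), len(cont_columns))   # 116 categorical, 14 continuous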

1.2 Missing values

[Figure: missing-value check output]

  • Since there are no missing values in either the training or the test data, we can say this data has already been preprocessed. If there were missing values, we would have to handle them, for example by dropping those rows or imputing them with the mean (a quick check is sketched below).
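A minimal check, together with a sketch of how the unique_categories table used by the plotting code below might have been built (the construction is my assumption; only the variable name appears in the original code):

# No missing values expected in either dataset
print(train_data.isnull().sum().sum(), test_data.isnull().sum().sum())

# Number of distinct levels per categorical feature, as used by the histogram code below
unique_categories = pd.DataFrame({'unique_values': train_data[cat_columns].nunique()})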

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(16, 5)
ax1.hist(unique_categories.unique_values, bins=50)
ax1.set_title('Categorical features vs Unique values')
ax1.set_xlabel('Unique values in a feature')
ax1.set_ylabel('Features')

values = unique_categories[unique_categories.unique_values <= 25].unique_values
ax2.set_xlim(1, 25)
ax2.hist(values, bins=30)
ax2.set_title('Zooming left part of histogram')
ax2.set_xlabel('Unique values in a feature')
ax2.set_ylabel('Features')
ax2.grid(True)

[Figure: histograms of unique value counts per categorical feature]

  • From the plot above, we notice that almost 100 features have fewer than 10 unique values each.

  • One feature contains more than 300 unique values; this also shows up in the counts above.

  • Zooming in on the left part of the histogram shows how these counts are distributed: only very few features have a larger number of unique values.

1.3 Continuous feature analysis:

[Figure: distribution plots of the continuous features]

  • The distribution of each continuous feature varies greatly, with many spikes in each plot, so there is no consistent shape across the probability density functions of these features.

  • The cont2 feature actually looks close to a normal distribution, but we cannot say what it represents because the features are anonymized (a plotting sketch follows this list).
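A sketch of how distribution plots like these can be produced; the use of seaborn and the grid layout are my assumptions:

import seaborn as sns

fig, axes = plt.subplots(7, 2, figsize=(14, 28))
for ax, col in zip(axes.ravel(), cont_columns):
    sns.histplot(train_data[col], bins=50, kde=True, ax=ax)   # histogram with a KDE overlay
    ax.set_title(col)
plt.tight_layout()
plt.show()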

1.4 Box plots for continuous features

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(14, 5)
ax1.boxplot(train_data['loss'])
ax1.set_ylabel('loss')
ax1.set_title('Box plot of loss feature')

ax2.set_ylim(1, 10000)
ax2.boxplot(train_data['loss'])
ax2.set_ylabel('loss')
ax2.set_title('Zoomed version of loss feature')

[Figure: box plots of the loss feature, full and zoomed]

  • Most of the continuous features have mean values around 0.5, and all values lie between 0 and 1, which indicates that the data has been normalized.

  • Since the average of every continuous feature is around 0.5, we can say that Allstate preprocessed the data before releasing it.

  • All continuous features being in the range [0, 1] is very convenient for modeling (a quick numeric check follows this list).
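These observations can be verified quickly with describe():

stats = train_data[cont_columns].describe().T
print(stats[['mean', 'min', 'max']])   # means near 0.5, all values within [0, 1]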

1.5 Checking the correlation between continuous features:

[Figure: correlation heatmap of the continuous features]

  • There are several highly correlated features. For example, cont14 and cont12 have a correlation of 0.99, which is pretty much the maximum possible.

  • Therefore, we can remove some highly correlated features and train the model to see if it experiences any performance drop.

  • On the other hand, we cannot blindly drop features since we don't know anything specific about the data or feature names. It can also lead to worse models.

  • It is therefore a bit difficult to handle this kind of data. We really need to train the model both with and without dropping these features and then decide which approach is right.

  • cont11 and cont12 are highly correlated; I would consider dropping at least one of them (a correlation-check sketch follows this list).
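A minimal sketch of the correlation check and of dropping one feature from each highly correlated pair; the threshold and the drop strategy are illustrative assumptions, not the author's exact procedure:

corr = train_data[cont_columns].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', ax=ax)   # correlation heatmap
plt.show()

# Drop one feature from each pair whose absolute correlation exceeds a threshold
threshold = 0.95
to_drop = set()
for i, c1 in enumerate(cont_columns):
    for c2 in cont_columns[i + 1:]:
        if abs(corr.loc[c1, c2]) > threshold:
            to_drop.add(c2)
print(to_drop)   # candidate features to remove, depending on the threshold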

1.6 Analyzing the target variable:

[Figure: distribution of the target variable (loss)]

1.7 Calculating skewness and seeing how to reduce it:

[Figure: skewness of the loss before and after log transformation]

  • There is a large skewness in the target variable, which can lead to poor predictions.

  • Applying a log transformation to the target variable reduces the skewness considerably (a sketch follows this list).
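A sketch of the skewness check and log transform; np.log1p is a common choice here, though the exact transform the author used is not shown:

import numpy as np
from scipy.stats import skew

print('skewness before:', skew(train_data['loss']))
train_data['log_loss'] = np.log1p(train_data['loss'])   # log(1 + loss)
print('skewness after :', skew(train_data['log_loss']))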

1.8 Feature Importance

We can see that the last few features, such as cat62, cat64, cat70, and cat22, have very low importance in the trained model (a sketch of one way to compute these importances follows the figure).

[Figure: feature importance plot]
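The notebook's importance computation is not shown in this excerpt; a minimal sketch using XGBoost's built-in importances on label-encoded categoricals (the model settings and variable names are my assumptions):

import numpy as np
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

X = train_data.drop(columns=['id', 'loss', 'log_loss'], errors='ignore')
for col in cat_columns:
    X[col] = LabelEncoder().fit_transform(X[col])   # simple integer encoding

model = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, np.log1p(train_data['loss']))

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values().head(10))   # least important features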

1.9 Principal Component Analysis [PCA]:

[Figure: 2-component PCA projection of the train and test data]

I observed that even when projecting onto 2 principal components, the train and test points cannot be easily separated, which suggests that the test data and the training data have similar distributions.
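A sketch of the 2-component PCA projection, restricted here to the continuous features for simplicity (the original may have used all encoded features):

from sklearn.decomposition import PCA

combined = pd.concat([train_data[cont_columns], test_data[cont_columns]], axis=0)
proj = PCA(n_components=2).fit_transform(combined)

n_train = len(train_data)
plt.scatter(proj[:n_train, 0], proj[:n_train, 1], s=2, alpha=0.3, label='train')
plt.scatter(proj[n_train:, 0], proj[n_train:, 1], s=2, alpha=0.3, label='test')
plt.legend()
plt.show()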

1.10 XGBoost hyperparameter tuning

I used GridSearchCV to tune the parameters of the XGBoost model; a minimal sketch of this step follows the results below.

  • Tuning max_depth and min_child_weight gave {'max_depth': 10, 'min_child_weight': 1}

  • Tuning gamma and learning_rate gave {'gamma': 1, 'learning_rate': 0.1}

  • Tuning colsample_bytree and subsample gave {'colsample_bytree': 0.6, 'subsample': 0.8}

  • Final hyperparameter tuning results: {'colsample_bytree': 0.5, 'subsample': 0.8, 'learning_rate': 0.1, 'max_depth': 12, 'min_child_weight': 1, 'gamma': 1}
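A minimal GridSearchCV sketch for the first of these stages; the parameter grid, CV folds, and estimator settings are illustrative, not the author's exact configuration:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [6, 8, 10, 12], 'min_child_weight': [1, 3, 5]}
grid = GridSearchCV(
    estimator=xgb.XGBRegressor(n_estimators=200, learning_rate=0.1),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=3,
    verbose=1,
)
grid.fit(X, np.log1p(train_data['loss']))
print(grid.best_params_)   # the first tuning stage above reported {'max_depth': 10, 'min_child_weight': 1}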

Finding the best n_rounds:

res = xgb.cv(params, xgtrain, num_boost_round=2500, nfold=5,
             stratified=False, early_stopping_rounds=50,
             verbose_eval=500, show_stdv=True,
             feval=log_xgboost_eval_mae, maximize=False)

best_nrounds = res.shape[0] - 1
cv_mean = res.iloc[-1, 0]
cv_std = res.iloc[-1, 1]
print('Ensemble-CV: {0}+{1}'.format(cv_mean, cv_std))
print('Best n rounds are :', best_nrounds)

Ensemble-CV: 0.40487799999999996+0.0002799499955349193
Best n rounds are: 2418
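The custom evaluation function log_xgboost_eval_mae is not defined in this excerpt; a plausible sketch, assuming the target was log-transformed (which is consistent with the ~0.40 CV value above):

import numpy as np

def log_xgboost_eval_mae(preds, dtrain):
    # MAE between predictions and labels, both on the log-transformed scale
    labels = dtrain.get_label()
    return 'mae', float(np.mean(np.abs(preds - labels)))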

Least important feature 0 is cat24
Least important feature 1 is cat60
Least important feature 2 is cat46
Least important feature 3 is cat69
Least important feature 4 is cat34
Least important feature 5 is cat47
Least important feature 6 is cat70
Least important feature 7 is cat20
Least important feature 8 is cat15
Least important feature 9 is cat64
Least important feature 10 is cat62
['cat24', 'cat60', 'cat46', 'cat69', 'cat34', 'cat47', 'cat70', 'cat20', 'cat15', 'cat64', 'cat62']

Feature importance plays an important role in every machine learning project because it helps us understand why the model makes the predictions it does, and it is always good to have a highly interpretable model.

We can see that 'cat14', 'cat35', 'cat97', 'cat20', 'cat48', 'cat86', 'cat70', 'cat62', 'cat69', 'cat15', 'cat64' are the least important features of the trained model.

Summary:

On many datasets, an ensemble model gives better results than a single model. We could still improve this model with more careful hyperparameter tuning, by removing the least important features, or by doing some feature engineering. Trying ensembles and deep learning models could also yield better scores.

This concludes the introduction to the Allstate insurance claims prediction model case. More practical cases from the "Python Financial Risk Control Model Cases Encyclopedia" will be updated regularly for bank training; remember to bookmark the courses.

Copyright statement: This article comes from the official account (python risk control model); plagiarism without permission is prohibited. It follows the CC 4.0 BY-SA copyright agreement; please include the original source link and this statement when reprinting.


Origin: blog.csdn.net/toby001111/article/details/132049593