Hands-On Data Competition (4): Traffic Accident Claims Audit

Foreword

1. Background

  After a minor traffic accident (a fender-bender), a claims adjuster goes to the scene to investigate and gather information, and this often determines whether the owner receives a payout from the insurance company. The training data contains 36 pieces of information gathered at the scene by claims staff (already encoded), together with whether the party ultimately received a claim payout. Our task is to predict, from those 36 fields, the probability that the party involved in the accident receives a payout.

2. Task type

  Build a binary classification model.

3. Data files

train.csv - training set, 15.6 MB

test.csv - prediction set, 6.1 MB

sample_submit.csv - submission example, 1.4 MB

4. Data variables

  The training set has 200,000 samples in total; the prediction set has 80,000 samples.

5. Evaluation method

  Your submission should give, for each test sample, the predicted probability that it passes the audit, i.e., the probability that Evaluation is 1. The evaluation metric is the area under the precision-recall curve (Precision-Recall AUC), hereafter PR-AUC.

  PR-AUC ranges from 0 to 1. The closer it is to 1, the closer the model's predictions are to the true results.

5.1 Definitions and calculation of precision and recall

  For more detail, see my post: Machine Learning Notes: Common Evaluation Methods.

  First, a word about the confusion matrix. A confusion matrix summarizes a classifier's results: for K-class classification it is a k * k table that records the classifier's predictions.

  For the most common case, binary classification, the confusion matrix is 2 * 2, laid out as follows:

               Predicted 1    Predicted 0
  Actual 1         TP             FN
  Actual 0         FP             TN

  TP = True Positive; FP = False Positive

  FN = False Negative; TN = True Negative

Here is an example.

 Suppose a model predicts 15 samples with the following results:

Predicted: 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1

Actual:    0 1 1 0 1 1 0 0 1 0 1 0 1 0 0

   This gives the confusion matrix TP = 5, FP = 4, FN = 2, TN = 4. These four values are often used to define a number of other metrics.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  In the example above, accuracy = (5 + 4) / 15 = 0.6

Precision (also PPV, positive predictive value) = TP / (TP + FP)

  In the example above, precision = 5 / (5 + 4) ≈ 0.556

Recall (also sensitivity, or TPR, True Positive Rate) = TP / (TP + FN)

  In the example above, recall = 5 / (5 + 2) ≈ 0.714

Specificity (also TNR, True Negative Rate) = TN / (TN + FP)

  In the example above, specificity = 4 / (4 + 4) = 0.5

F1-score = 2 * TP / (2 * TP + FP + FN)

  In the example above, F1 = 2 * 5 / (2 * 5 + 4 + 2) = 0.625
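  As a quick check, the same numbers can be reproduced with sklearn (a small sketch using the 15 predictions above):

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1])
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# sklearn's confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 5 4 2 4
print(accuracy_score(y_true, y_pred))   # 0.6
print(precision_score(y_true, y_pred))  # ~0.556
print(recall_score(y_true, y_pred))     # ~0.714
print(f1_score(y_true, y_pred))         # 0.625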

5.2 Precision and recall

  Precision and recall are two metrics widely used in information retrieval and statistical classification to assess the quality of results. Precision is the ratio of retrieved relevant documents to the total number of documents retrieved, and measures how precise the retrieval system is; recall is the ratio of retrieved relevant documents to all relevant documents in the library, and measures how complete it is.

  In general, precision asks how many of the retrieved entries (documents, web pages, etc.) are relevant, while recall asks how many of all the relevant entries were retrieved. In terms of the confusion matrix, Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

  Precision and recall are a pair of conflicting measures. In general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low; only in a few simple tasks can both be high at once.

5.3 Definition of PR-AUC

  Start with an example.

  Suppose there are 100 test samples, and our model assigns each of them a probability of belonging to label 1: y1, y2, y3, ..., y100.

  Next we need a threshold t to turn the probabilities into labels: if y_i ≥ t, the sample is predicted as 1, otherwise 0. Each value of t yields one (precision, recall) pair. Traversing all 102 candidate values of t, namely 0, y1, y2, y3, ..., y100, 1, gives 102 (precision, recall) pairs.

  With recall on the x-axis and precision on the y-axis, we can plot these 102 points in the coordinate system and connect them; the resulting polyline is called the precision-recall curve. The area enclosed by the curve and the coordinate axes is the precision-recall AUC. The closer this AUC is to 1, the better the model.
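  A minimal sketch of this procedure with sklearn (the toy labels and scores here are my own illustration):

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 1, 1, 0, 1])            # illustrative labels
scores = np.array([0.2, 0.6, 0.8, 0.4, 0.7])  # predicted P(label = 1)

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)  # area under the polyline
print(pr_auc)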

   AUC is a metric for classification models, and only for binary classifiers. AUC is short for Area Under Curve, where the curve in question is the ROC (Receiver Operating Characteristic) curve. So first there is a ROC curve, and the AUC is the area under it.

  The ROC curve should deviate from the diagonal reference line as far as possible; the closer it hugs the upper-left corner, the better.

  AUC: the area under the ROC curve. The reference value is 0.5, so the AUC should be greater than 0.5, and the larger the margin, the better.

5.4 What is the AUC?

  AUC is the area covered under the ROC curve; clearly, the larger the AUC, the better the classifier's results.

  AUC = 1: a perfect classifier. With such a model, a perfect prediction is obtained no matter what threshold is set. In the vast majority of real prediction settings, no perfect classifier exists.

  0.5 < AUC < 1: better than random guessing. With a properly chosen threshold, the classifier (model) has predictive value.

  AUC = 0.5: the same as random guessing; the model has no predictive value.

  AUC < 0.5: worse than random guessing, but if you always invert its predictions it becomes better than random guessing.

 

  The physical meaning of AUC: assuming the classifier outputs a score (confidence) that a sample belongs to the positive class, the AUC is the probability that, for a randomly chosen (positive, negative) pair of samples, the positive sample's score is greater than the negative sample's.

  In other words, AUC is the probability that a positive sample's prediction exceeds a negative sample's prediction, so AUC reflects the classifier's ability to rank samples.

  Also worth noting: AUC is insensitive to class imbalance, which is one reason it is commonly used to evaluate classifiers on imbalanced datasets.

5.5 How the AUC is computed

  Method 1: the AUC is the area under the ROC curve, so we can compute that area directly, as a sum of small trapezoids. The precision of the result depends on the granularity of the thresholds.
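  A sketch of Method 1, reusing the toy y_true and scores arrays from the sketch in section 5.3:

import numpy as np
from sklearn.metrics import roc_curve

# Sum of small trapezoids under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_trapezoid = np.trapz(tpr, fpr)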

  Method 2: based on the physical meaning of the AUC, compute the probability that a positive sample's score exceeds a negative sample's. Take all N * M pairs (N positive samples, M negative samples), compare scores, and count; this yields the AUC with time complexity O(N * M).
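  Method 2 as a sketch (tied scores are ignored for brevity; the same toy arrays are assumed):

import numpy as np

pos = scores[y_true == 1]  # N positive scores
neg = scores[y_true == 0]  # M negative scores
# Fraction of the N*M (positive, negative) pairs where the positive scores higher
auc_pairs = np.mean(pos[:, None] > neg[None, :])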

  Method 3: similar to Method 2, computing the probability that positives outscore negatives directly. First sort all samples by score and denote positions by rank: the sample with the largest score gets rank = n (n = M + N), the next gets n - 1, and so on. For the top-ranked positive sample, rank_max, there are N - 1 other positive samples with smaller scores, so rank_max - N negatives rank below it. Summing over all positive samples gives AUC = (sum of the ranks of the positive samples - N(N+1)/2) / (N * M). After the sort, this runs in O(N + M).
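  And a sketch of Method 3 (again ignoring tied scores):

import numpy as np

def rank_auc(y_true, scores):
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # smallest score gets rank 1
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    # Subtract the N(N+1)/2 rank mass the positives would carry among themselves
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)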

  sklearn computes ROC AUC directly:

from sklearn.metrics import roc_auc_score
# y_test: true labels; dataset_pred: predicted probabilities
roc_auc_score(y_test, dataset_pred)

  

  For PR-AUC, use sklearn.metrics.average_precision_score:

>>> import numpy as np
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_predict = np.array([0.1, 0.4, 0.35, 0.8])
>>> average_precision_score(y_true, y_predict)  
0.791666666

  

6. For the complete code, please head over to my GitHub

  Portal: click here

 

Data preprocessing

1. Check whether the data has missing values

print(train.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
city          10000 non-null int64
hour          10000 non-null int64
is_workday    10000 non-null int64
weather       10000 non-null int64
temp_1        10000 non-null float64
temp_2        10000 non-null float64
wind          10000 non-null int64
dtypes: float64(2), int64(5)
memory usage: 547.0 KB
None

  We can see there are 10,000 observations in total and no missing values.

2. Basic descriptive statistics for each variable

print(train.describe())


               city          hour      ...             temp_2          wind
count  10000.000000  10000.000000      ...       10000.000000  10000.000000
mean       0.499800     11.527500      ...          15.321230      1.248600
std        0.500025      6.909777      ...          11.308986      1.095773
min        0.000000      0.000000      ...         -15.600000      0.000000
25%        0.000000      6.000000      ...           5.800000      0.000000
50%        0.000000     12.000000      ...          16.000000      1.000000
75%        1.000000     18.000000      ...          24.800000      2.000000
max        1.000000     23.000000      ...          46.800000      7.000000

[8 rows x 7 columns]

  From these observations we can form some guesses: for example, city 0 and city 1 can basically be ruled out as southern cities; the records span a fairly long period and may include a long holiday, and so on.

3. Check the correlation coefficients

  (For readability, coefficients with absolute value below 0.2 are replaced with NaN.)

    import numpy as np

    feature_data = train  # assuming feature_data is the training table loaded above
    corr = feature_data.corr()
    corr[np.abs(corr) < 0.2] = np.nan
    print(corr)


            city  hour  is_workday  weather    temp_1    temp_2  wind
city         1.0   NaN         NaN      NaN       NaN       NaN   NaN
hour         NaN   1.0         NaN      NaN       NaN       NaN   NaN
is_workday   NaN   NaN         1.0      NaN       NaN       NaN   NaN
weather      NaN   NaN         NaN      1.0       NaN       NaN   NaN
temp_1       NaN   NaN         NaN      NaN  1.000000  0.987357   NaN
temp_2       NaN   NaN         NaN      NaN  0.987357  1.000000   NaN
wind         NaN   NaN         NaN      NaN       NaN       NaN   1.0

  From the correlation standpoint, the hour of use and the temperature at the time have a fairly strong relationship with the rental count y; temperature and apparent temperature are strongly positively correlated (collinear), which matches common sense.

Model training and results

1. Benchmark model: LASSO logistic regression

  The PR-AUC of this model's predictions: 0.714644

# -*- coding: utf-8 -*-

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)

# Pull the target out of the training set
y_train = train.pop('Evaluation')

# Build the LASSO logistic regression model
# (solver='liblinear' is needed for the L1 penalty on recent sklearn versions)
clf = LogisticRegression(penalty='l1', C=1.0, solver='liblinear', random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]

# Write predictions to my_LASSO_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_LASSO_prediction.csv', index=False)

  

2. Benchmark model: random forest classifier

  The PR-AUC of this model's predictions: 0.850897

# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)

# Pull the target out of the training set
y_train = train.pop('Evaluation')

# Build the random forest model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]

# Write predictions to my_RF_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_RF_prediction.csv', index=False)

  My submission result:

   Here I also tried using the random forest to extract key features and then training a model on just those features, but the results were not very good, so I won't paste the feature-extraction code here. If you need it, please refer to my earlier posts.

The K-Means algorithm and traffic accident claims prediction

  K-Means is a partition-based clustering method and one of the top ten data-mining algorithms. Partition-based methods divide the vector space formed by the samples into several regions, each with a center; by building this mapping, every sample can be assigned to its corresponding center.

1. Steps of the classic K-Means clustering algorithm

  • 1. Initialize the cluster centers
  • 2. Assign each sample to the nearest cluster
  • 3. Update the cluster centers based on the result of step 2
  • 4. Stop if the maximum number of iterations is reached or the change between two iterations is below a set threshold; otherwise repeat from step 2 (a numpy sketch of this loop follows).
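  A minimal numpy sketch of that loop (my own illustration; empty clusters are not handled):

import numpy as np

def kmeans_iterate(X, centers, max_iter=100, tol=1e-6):
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned samples
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step 4: stop once the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers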

  The classic K-means algorithm initializes the cluster centers by random sampling, which does not guarantee the desired clustering result. One remedy is to train several models and keep the best-performing one, but is there a better way? The K-means++ algorithm proposed by David Arthur produces good initial centers effectively.

  First one cluster center C1 is chosen at random; then, iteratively, a new center is sampled with probability proportional to each point's squared distance from its nearest existing center, and this repeats until K centers have been chosen.
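  A compact sketch of k-means++ seeding (illustrative only, not the sklearn implementation):

import numpy as np

def kmeans_pp_init(X, k, rng=None):
    rng = rng or np.random.default_rng(0)
    # Pick the first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)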

2. A quick look at the data

  Showing brief information about the data: how many non-null values each column has, and each column's data type.

  For this competition's data the output is as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 37 columns):
Q1            200000 non-null int64
Q2            200000 non-null int64
Q3            200000 non-null int64
Q4            200000 non-null int64
Q5            200000 non-null int64
Q6            200000 non-null int64
Q7            200000 non-null int64
Q8            200000 non-null int64
Q9            200000 non-null int64
Q10           200000 non-null int64
Q11           200000 non-null int64
Q12           200000 non-null int64
Q13           200000 non-null int64
Q14           200000 non-null int64
Q15           200000 non-null int64
Q16           200000 non-null int64
Q17           200000 non-null int64
Q18           200000 non-null int64
Q19           200000 non-null int64
Q20           200000 non-null int64
Q21           200000 non-null int64
Q22           200000 non-null int64
Q23           200000 non-null int64
Q24           200000 non-null int64
Q25           200000 non-null int64
Q26           200000 non-null int64
Q27           200000 non-null int64
Q28           200000 non-null int64
Q29           200000 non-null int64
Q30           200000 non-null int64
Q31           200000 non-null int64
Q32           200000 non-null int64
Q33           200000 non-null int64
Q34           200000 non-null int64
Q35           200000 non-null int64
Q36           200000 non-null int64
Evaluation    200000 non-null int64
dtypes: int64(37)
memory usage: 56.5 MB
None

  To understand the correlations between the features, compute the correlation matrix, then sort by a particular feature.
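  For example (this mirrors the commented-out lines in the full script further down):

corr_matrix = traindata.corr()
# ascending=False sorts in descending order
print(corr_matrix['Evaluation'].sort_values(ascending=False))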

 

   The sorted result:

Evaluation    1.000000
Q28           0.410700
Q30           0.324421
Q36           0.302709
Q35           0.224996
Q34           0.152743
Q32           0.049397
Q21           0.034897
Q33           0.032248
Q13           0.023603
Q8            0.021922
Q19           0.019694
Q20           0.013903
Q4            0.011626
Q27           0.004262
Q23           0.002898
Q7            0.001143
Q31          -0.000036
Q14          -0.000669
Q29          -0.002014
Q10          -0.002711
Q12          -0.005287
Q1           -0.006511
Q16          -0.007184
Q18          -0.007643
Q26          -0.008188
Q11          -0.009252
Q24          -0.010891
Q22          -0.011821
Q25          -0.012660
Q6           -0.016072
Q2           -0.018307
Q15          -0.019570
Q9           -0.021261
Q5           -0.023893
Q3           -0.026349
Q17          -0.028461
Name: Evaluation, dtype: float64

  

3. Training a model with K-Means

  KMeans(): n_clusters is the number of clusters to predict; init is the center-initialization method, which defaults to k-means++ rather than the classic random initialization (you can set it to 'random' if you prefer); n_jobs sets the number of CPU cores to use, with -1 meaning all cores (note that recent scikit-learn versions have removed KMeans's n_jobs parameter).

  The complete code:

import pandas as pd

traindata = pd.read_csv(r'data/train.csv')
testdata = pd.read_csv(r'data/test.csv')

# Drop the meaningless id column
traindata.drop('CaseId', axis=1, inplace=True)
testdata.drop('CaseId', axis=1, inplace=True)

# head() shows the first 5 rows by default; pass a number for more,
# e.g. head(50) shows the first 50 rows
# Count the null values per column
# res = traindata.isnull().sum()
# Show brief info: non-null counts and dtypes per column
# res = traindata.info()

# Get a quick feel for the data graphically
# hist() draws histograms; figsize sets the output figure size
# traindata.hist(figsize=(20, 20))

# # To see feature correlations, compute the correlation matrix and sort by one feature
# corr_matrix = traindata.corr()
# # ascending=False sorts in descending order
# corr_matrix = corr_matrix['Evaluation'].sort_values(ascending=False)
# print(corr_matrix)

# Separate the label from the training set
y = traindata['Evaluation']
traindata.drop('Evaluation', axis=1, inplace=True)

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2, init='k-means++', n_jobs=-1)
clf.fit(traindata, y)  # KMeans ignores y; clustering is unsupervised
y_pred = clf.predict(testdata)

# Save the predictions
submitData = pd.read_csv(r'data/sample_submit.csv')
submitData['Evaluation'] = y_pred
submitData.to_csv("KMeans.csv", index=False)

  The result: 0.485968

   K-means is one of the ten classic data-mining algorithms, but getting satisfactory results with it in practice is genuinely hard; this attempt confirms it really doesn't work here. That is not surprising: clustering is unsupervised, the two cluster labels have no guaranteed correspondence to Evaluation, and submitting hard 0/1 labels instead of probabilities scores poorly under PR-AUC.

 4. Training with XGBoost myself

  Training directly, with the following code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

traindata = pd.read_csv(r'data/train.csv')
testdata = pd.read_csv(r'data/test.csv')

# Drop the meaningless id column
traindata.drop('CaseId', axis=1, inplace=True)
testdata.drop('CaseId', axis=1, inplace=True)

# Separate the label from the training set
trainlabel = traindata['Evaluation']
traindata.drop('Evaluation', axis=1, inplace=True)

traindata1, testdata1, trainlabel1 = traindata.values, testdata.values, trainlabel.values
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(traindata1, trainlabel1,
                                                    test_size=0.3, random_state=123457)
# Train the model
model = xgb.XGBClassifier(max_depth=5,
                          learning_rate=0.1,
                          gamma=0.1,
                          n_estimators=160,
                          silent=True,
                          objective='binary:logistic',
                          nthread=4,
                          seed=27,
                          colsample_bytree=0.8)

model.fit(X_train, y_train)

# Predict on the held-out split
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print('accuracy: %.2f%%' % (accuracy * 100))

# Check the AUC metric
# from sklearn import metrics
# # only defined for binary classification
# print("AUC Score (Train): %f" % metrics.roc_auc_score(y_test, y_pred))

def run_predict():
    y_pred_test = model.predict_proba(testdata1)[:, 1]
    # Save the predictions
    submitData = pd.read_csv(r'data/sample_submit.csv')
    submitData['Evaluation'] = y_pred_test
    submitData.to_csv("xgboost.csv", index=False)


run_predict()

  The result:

   Then I tuned XGBoost's hyperparameters. The best parameters found are shown directly below:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.6, gamma=0.3, learning_rate=0.1,
       max_delta_step=0, max_depth=6, min_child_weight=4, missing=None,
       n_estimators=1000, n_jobs=1, nthread=4, objective='binary:logistic',
       random_state=0, reg_alpha=1, reg_lambda=1, scale_pos_weight=1,
       seed=27, silent=True, subsample=0.9)
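
  The tuning code itself is not shown in the post; here is a minimal sketch of how such a search might look with GridSearchCV (the grid values are my own illustrative assumptions, and X_train/y_train come from the split above):

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid only; in practice, tune in stages to keep the search tractable
param_grid = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [1, 2, 4],
    'gamma': [0.1, 0.3],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.6, 0.8],
}
search = GridSearchCV(
    xgb.XGBClassifier(n_estimators=1000, learning_rate=0.1,
                      objective='binary:logistic', seed=27),
    param_grid,
    scoring='average_precision',  # PR-AUC, the competition metric
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)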

  Running with these parameters gives the following result:

   Compared with the earlier XGBoost run, the score improved somewhat.

  That is all the experimenting for now; XGBoost really does look like a go-to tool for this kind of problem. When I find time I will keep trying other algorithms on these simple problems, since the goal is to keep practicing the algorithms I have learned.

Source: www.cnblogs.com/wj-1314/p/10790197.html