Foreword
1. Background
After a minor traffic accident (a scrape), claims staff go to the scene to investigate and gather information, and what they collect often decides whether the car owner receives a payout from the insurance company. The training data contains the 36 pieces of information that claims staff gathered on site for each incident (all of it already encoded), together with whether the party's claim was finally approved. Our task is to predict, from these 36 pieces of information, the probability that the party's claim is approved.
2. Task type
Binary classification
3. Data files
train.csv: the training set, file size 15.6 MB
test.csv: the prediction set, file size 6.1 MB
sample_submit.csv: a sample submission, file size 1.4 MB
4. Data variables
The training set has 200,000 samples and the prediction set has 80,000 samples.
5. Evaluation method
The result you submit is, for each test sample, the probability that the claim passes review, that is, the probability that Evaluation is 1. Submissions are scored by the area under the precision-recall curve (Precision-Recall AUC), hereafter PR-AUC.
PR-AUC ranges from 0 to 1. The closer it is to 1, the closer the model's predictions are to the true results.
5.1 Precision and recall are defined and calculated as follows:
For background, see the blog post: Machine Learning Notes: Common assessment methods
Let's start with the confusion matrix. A confusion matrix summarizes the results of a classifier; for K-class classification it is a K * K table that records the classifier's predictions against the true classes.
For the most common case, binary classification, the confusion matrix is 2 * 2 and holds the following four counts:
TP = True Positive; FP = False Positive
FN = False Negative; TN = True Negative
Here is an example.
Suppose a model makes predictions for 15 samples, with the following results:
Predicted value: 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1
Real value: 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0
Counting the four outcomes gives TP = 5, FP = 4, FN = 2, TN = 4. These four confusion-matrix values are often used to define a number of other measures.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
In the example above, accuracy = (5 + 4) / 15 = 0.6
Precision (also PPV, positive predictive value) = TP / (TP + FP)
In the example above, precision = 5 / (5 + 4) = 0.556
Recall (also sensitivity, true positive rate, TPR) = TP / (TP + FN)
In the example above, recall = 5 / (5 + 2) = 0.714
Specificity (also true negative rate, TNR) = TN / (TN + FP)
In the example above, specificity = 4 / (4 + 4) = 0.5
F1 score = 2 * TP / (2 * TP + FP + FN)
In the example above, F1 score = 2 * 5 / (2 * 5 + 4 + 2) = 0.625
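The counts and metrics for the 15-sample example can be checked with a few lines of plain Python. This is only a minimal sketch; the helper name confusion_counts is made up for illustration and is not part of any library.

```python
# Predicted and real values of the 15 samples from the example.
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn)  # 5 4 2 4
print(accuracy, precision, recall, specificity, f1)
```

Rounded to three decimals, the second line of output is 0.6, 0.556, 0.714, 0.5 and 0.625.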
5.2 Precision and recall
Precision and recall are two measures widely used in information retrieval and statistical classification to assess the quality of results. Precision is the ratio of relevant documents retrieved to the total number of documents retrieved, and measures how exact the retrieval system is; recall is the ratio of relevant documents retrieved to all relevant documents in the library, and measures how complete the retrieval is.
Put simply, precision is how many of the retrieved items (documents, web pages, and so on) are correct, and recall is how many of all the correct items were retrieved. In the notation of 5.1, precision = TP / (TP + FP) and recall = TP / (TP + FN).
Precision and recall pull against each other. In general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low; only on some simple tasks can both be high at once.
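The trade-off is easy to see by moving the decision threshold. The sketch below uses ten made-up scores and labels (hypothetical data, not from the competition) and a hypothetical helper precision_recall_at; it assumes a sample is labeled positive when its score is at least the threshold.

```python
# Hypothetical scores and labels, invented purely for illustration.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]


def precision_recall_at(threshold, labels, scores):
    """Precision and recall when samples with score >= threshold are labeled 1."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall


for t in (0.85, 0.55, 0.25):
    p, r = precision_recall_at(t, labels, scores)
    print("threshold", t, "precision", p, "recall", r)
```

On this toy data, lowering the threshold from 0.85 to 0.55 to 0.25 raises recall from 0.4 to 0.8 to 1.0 while precision drops from 1.0 to 0.8 to 0.625.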
5.3 PR-AUC is defined as follows:
An example first:
Suppose there are 100 test samples and our model gives, for each of them, the probability that its label is 1: y1, y2, y3, ... y100.
Next we need a threshold t to turn the probabilities into labels: if y_i is at least t the sample is labeled 1, otherwise 0. Clearly, each value of t corresponds to one (precision, recall) pair. We traverse all values of t: 0, y1, y2, y3, ... y100, 1, and obtain 102 (precision, recall) pairs.
With recall on the X axis and precision on the Y axis, we can plot these 102 points in an xOy coordinate system and connect them with line segments; the resulting polyline is called the precision-recall curve. The area enclosed by the curve and the coordinate axes is the precision-recall AUC. The closer it is to 1, the better the model.
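The construction just described can be sketched directly: sweep the threshold, collect one (recall, precision) point per threshold, join the points and add up the trapezoids. The function names and the four-sample data below are assumptions for illustration; a sample is taken to be positive when its score is at least t, and precision is taken to be 1.0 when no sample is predicted positive.

```python
def pr_curve_points(labels, scores):
    """One (recall, precision) point per threshold in {0, y_1, ..., y_n, 1}.

    Thresholds are visited from high to low so that recall never decreases
    along the returned list. Duplicate thresholds are collapsed.
    """
    n_pos = sum(labels)
    thresholds = sorted(set([0.0, 1.0] + list(scores)), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
        points.append((tp / n_pos, precision))
    return points


def area_under_polyline(points):
    """Sum of trapezoids under the polyline through the points, in order."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = pr_curve_points(labels, scores)
pr_auc = area_under_polyline(points)
print(points)
print(pr_auc)  # 19/24, about 0.7917
```

Visiting the thresholds in decreasing order matters: several points can share the same recall, and sorting them by precision instead would connect the wrong ends of the vertical segments.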
AUC is an evaluation metric for classification models, and it applies only to binary classifiers. AUC is short for Area Under Curve; the curve in question is the ROC (Receiver Operating Characteristic) curve. In other words, ROC is a curve and AUC is the value of an area.
The ROC curve should deviate from the reference line as much as possible; the closer it gets to the upper-left corner, the better.
AUC: the area under the ROC curve. The reference area is 0.5; the AUC should be greater than 0.5, and the further above it, the better.
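As a rough illustration, the ROC curve can be traced by lowering the threshold one sample at a time and recording the false positive rate and true positive rate after each step; the area then follows from the trapezoid rule. The names roc_points and trapezoid_area and the toy data are assumptions for this sketch, which also assumes that all scores are distinct.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points, starting from a threshold above every score.

    Assumes distinct scores; ties would need to be grouped into one step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # nothing is predicted positive yet
    tp = fp = 0
    # Lower the threshold past one sample at a time, highest score first.
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under the polyline through the points, taken in order."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(labels, scores)
auc = trapezoid_area(curve)
print(curve)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc)    # 0.75
```

The reference line mentioned above is the diagonal from (0, 0) to (1, 1), whose area is 0.5; this toy curve sits above it.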
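5.4 What is the AUC?
AUC is the area covered under the ROC curve; the larger the AUC, the better the classifier's results.
AUC = 1: a perfect classifier. With such a model there is at least one threshold that yields perfect predictions. In the vast majority of prediction settings, no perfect classifier exists.
0.5 < AUC < 1: better than random guessing. With a properly chosen threshold, the classifier (model) has predictive value.
AUC = 0.5: the same as random guessing; the model has no predictive value.
AUC < 0.5: worse than random guessing, but if you always invert its predictions it becomes better than random guessing.
The physical meaning of AUC: suppose the classifier outputs a score (confidence) that a sample belongs to the positive class. Then, for a randomly chosen (positive, negative) pair of samples, AUC is the probability that the positive sample's score is greater than the negative sample's score.
In other words, AUC is the probability that a positive sample is scored above a negative sample, so AUC reflects the classifier's ability to rank samples.
Also worth noting: AUC is not sensitive to whether the classes are balanced, which is one reason it is commonly used to evaluate classifiers on imbalanced samples.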
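5.5 AUC and PR-AUC are calculated as follows:
Method 1: AUC is the area under the ROC curve, so we can compute the area directly as the sum of many small trapezoids. The precision of the result depends on how finely the thresholds are spaced.
Method 2: following the physical meaning of AUC, we compute the probability that a positive sample's score is greater than a negative sample's score. Take all N * M pairs (N is the number of positive samples, M the number of negative samples), compare the scores, and obtain the AUC. The time complexity is O(N * M).
Method 3: like method 2, this directly computes the probability that a positive sample scores above a negative sample. First sort all samples by score and give each one a rank: the sample with the largest score gets rank n (n = M + N), the next gets n - 1, and so on. For the highest-ranked positive sample, with rank rank_max, N - 1 other positive samples have a smaller score, so rank_max - 1 - (N - 1) negative samples score below it; the same count is made for every other positive sample. Summing over all positive samples gives AUC = (sum of the ranks of the positive samples - N(N + 1) / 2) / (N * M). Once the samples are sorted this takes O(N + M) time; the sort itself costs O((N + M) log(N + M)).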
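Methods 2 and 3 can be compared on a small example. The sketch below is an assumption-laden illustration, not the code used for the competition: the function names and the toy data are made up, and the rank-based version assumes that all scores are distinct (ties would need average ranks).

```python
def auc_pairwise(labels, scores):
    """Method 2: share of (positive, negative) pairs ranked correctly, O(N*M).

    A tied pair counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_by_rank(labels, scores):
    """Method 3: rank-sum formula; rank 1 is the smallest score.

    Assumes distinct scores (no tie handling).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1)
                   if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical toy data: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_pairwise(labels, scores))  # 0.75
print(auc_by_rank(labels, scores))   # 0.75
```

Here the positive samples have ranks 2 and 4, so the rank formula gives (6 - 3) / 4 = 0.75, the same as counting 3 correctly ordered pairs out of 4.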
In practice, AUC can be computed with scikit-learn:
from sklearn.metrics import roc_auc_score

# y_test: the true labels; dataset_pred: the predicted probabilities
roc_auc_score(y_test, dataset_pred)
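For PR-AUC, use sklearn.metrics.average_precision_score:
>>> import numpy as np
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_predict = np.array([0.1, 0.4, 0.35, 0.8])
>>> average_precision_score(y_true, y_predict)
0.83...
(Older scikit-learn releases, which interpolated linearly between the points of the curve, returned 0.79... for this example.)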
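To see where the number comes from, here is a from-scratch sketch of average precision as a step-wise sum, AP = sum over n of (R_n - R_(n-1)) * P_n, which is how I understand current scikit-learn to define it. The function name average_precision is hypothetical and the code assumes distinct scores. For the four samples above this sum is 5/6, about 0.8333, whereas joining the same points with straight lines (trapezoids) gives 19/24, about 0.7917.

```python
def average_precision(labels, scores):
    """Step-wise average precision: sum of (R_n - R_{n-1}) * P_n.

    Thresholds are lowered one sample at a time, highest score first.
    Assumes distinct scores (ties are not grouped).
    """
    n_pos = sum(labels)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        # Only steps where recall increases contribute to the sum.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


y_true = [0, 0, 1, 1]
y_predict = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(y_true, y_predict)
print(ap)  # about 0.8333
```

The two positives are reached at precision 1 and precision 2/3, each adding half of the recall, hence 0.5 * 1 + 0.5 * 2/3.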
6. The complete code is on my GitHub
Data preprocessing
1. Check whether the data has missing values
print(train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
city          10000 non-null int64
hour          10000 non-null int64
is_workday    10000 non-null int64
weather       10000 non-null int64
temp_1        10000 non-null float64
temp_2        10000 non-null float64
wind          10000 non-null int64
dtypes: float64(2), int64(5)
memory usage: 547.0 KB
None
We can see that there are 10000 observations and no missing values.
2. Look at the basic descriptive statistics of each variable
print(train.describe())
               city          hour  ...        temp_2          wind
count  10000.000000  10000.000000  ...  10000.000000  10000.000000
mean       0.499800     11.527500  ...     15.321230      1.248600
std        0.500025      6.909777  ...     11.308986      1.095773
min        0.000000      0.000000  ...    -15.600000      0.000000
25%        0.000000      6.000000  ...      5.800000      0.000000
50%        0.000000     12.000000  ...     16.000000      1.000000
75%        1.000000     18.000000  ...     24.800000      2.000000
max        1.000000     23.000000  ...     46.800000      7.000000

[8 rows x 7 columns]
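These numbers suggest a few guesses: for example, city 0 and city 1 are almost certainly not southern cities, and the observations span a fairly long period that may include a long holiday.
3. Look at the correlation coefficients
(To make the output easier to read, values whose absolute value is below 0.2 are replaced with NaN.)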
corr = feature_data.corr()
corr[np.abs(corr) < 0.2] = np.nan
print(corr)
            city  hour  is_workday  weather    temp_1    temp_2  wind
city         1.0   NaN         NaN      NaN       NaN       NaN   NaN
hour         NaN   1.0         NaN      NaN       NaN       NaN   NaN
is_workday   NaN   NaN         1.0      NaN       NaN       NaN   NaN
weather      NaN   NaN         NaN      1.0       NaN       NaN   NaN
temp_1       NaN   NaN         NaN      NaN  1.000000  0.987357   NaN
temp_2       NaN   NaN         NaN      NaN  0.987357  1.000000   NaN
wind         NaN   NaN         NaN      NaN       NaN       NaN   1.0
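From the correlations, the time of use and the temperature at that time are fairly strongly related to the rental count y; temperature and apparent temperature show a strong positive correlation (collinearity), which matches common sense.
Model training and results
1. Baseline model: LASSO logistic regression
The PR-AUC of this model's predictions is 0.714644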
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)

# Pop the target y out of the training set
y_train = train.pop('Evaluation')

# Build the LASSO (L1-penalized) logistic regression model.
# solver='liblinear' is set explicitly because the default solver in recent
# scikit-learn releases (lbfgs) does not support the L1 penalty.
clf = LogisticRegression(penalty='l1', C=1.0, solver='liblinear', random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]

# Write the predictions to my_LASSO_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_LASSO_prediction.csv', index=False)
2. Baseline model: random forest classifier
The PR-AUC of this model's predictions is 0.850897
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)

# Pop the target y out of the training set
y_train = train.pop('Evaluation')

# Build the random forest model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]

# Write the predictions to my_RF_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_RF_prediction.csv', index=False)
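My submitted result:
I also tried using a random forest to extract the key features and then training the model on those features only, but the results were not very good, so I am not posting the feature extraction code here. If you need it, please refer to my earlier blog posts.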
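The KMeans algorithm and traffic accident claim review prediction
K-Means is a partition-based clustering method and one of the top ten data mining algorithms. Partition-based methods divide the vector space formed by the sample set into several regions, each with its own center; by establishing a mapping, every sample is assigned to its corresponding center.
1. Steps of the classic K-Means clustering algorithm
- 1. Initialize the cluster centers
- 2. Assign each sample to the nearest cluster
- 3. Update the cluster centers based on the result of step 2
- 4. Stop if the maximum number of iterations is reached or the change between two iterations is below a set threshold; otherwise repeat from step 2.
Classic K-means initializes the cluster centers by random sampling, which cannot guarantee the desired clustering result. One option is to train several models and keep the one that performs best, but is there a better way? The K-means++ algorithm proposed by David Arthur produces good initial cluster centers.
First pick one cluster center C1 at random. Then compute, for every sample, its probability of being chosen, which is proportional to the squared distance to the nearest center already selected, draw the next center X according to these probabilities and add it to the set of centers. Repeat this process until K centers have been selected.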
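The four steps and the K-means++ seeding can be put together in a small pure-Python sketch. It is only an illustration under stated assumptions: all function names and the six toy points are made up, distances are squared Euclidean, and an empty cluster simply keeps its old center. scikit-learn's KMeans is far more optimized and should be preferred for real data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: the first center is uniform at random; each later
    center is drawn with probability proportional to its squared distance
    to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)              # step 1
    for _ in range(max_iter):                             # step 4: iteration cap
        # Step 2: assign every point to its nearest center.
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Step 3: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                                   # step 4: converged
            break
    return centers, assign


# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assign = kmeans(points, k=2)
print(centers)
print(assign)
```

Which group ends up as cluster 0 depends on the random seed, which is worth keeping in mind whenever cluster indices are compared with class labels.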
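2. A quick look at the data
Printing a brief summary of the data shows how many non-null values each column has and the data type of each column.
For the data in this article the result is as follows: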
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 37 columns):
Q1            200000 non-null int64
Q2            200000 non-null int64
Q3            200000 non-null int64
Q4            200000 non-null int64
Q5            200000 non-null int64
Q6            200000 non-null int64
Q7            200000 non-null int64
Q8            200000 non-null int64
Q9            200000 non-null int64
Q10           200000 non-null int64
Q11           200000 non-null int64
Q12           200000 non-null int64
Q13           200000 non-null int64
Q14           200000 non-null int64
Q15           200000 non-null int64
Q16           200000 non-null int64
Q17           200000 non-null int64
Q18           200000 non-null int64
Q19           200000 non-null int64
Q20           200000 non-null int64
Q21           200000 non-null int64
Q22           200000 non-null int64
Q23           200000 non-null int64
Q24           200000 non-null int64
Q25           200000 non-null int64
Q26           200000 non-null int64
Q27           200000 non-null int64
Q28           200000 non-null int64
Q29           200000 non-null int64
Q30           200000 non-null int64
Q31           200000 non-null int64
Q32           200000 non-null int64
Q33           200000 non-null int64
Q34           200000 non-null int64
Q35           200000 non-null int64
Q36           200000 non-null int64
Evaluation    200000 non-null int64
dtypes: int64(37)
memory usage: 56.5 MB
None
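To understand how the features are correlated, compute the correlation matrix and then sort it by one particular column.
The sorted result is as follows: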
Evaluation    1.000000
Q28           0.410700
Q30           0.324421
Q36           0.302709
Q35           0.224996
Q34           0.152743
Q32           0.049397
Q21           0.034897
Q33           0.032248
Q13           0.023603
Q8            0.021922
Q19           0.019694
Q20           0.013903
Q4            0.011626
Q27           0.004262
Q23           0.002898
Q7            0.001143
Q31          -0.000036
Q14          -0.000669
Q29          -0.002014
Q10          -0.002711
Q12          -0.005287
Q1           -0.006511
Q16          -0.007184
Q18          -0.007643
Q26          -0.008188
Q11          -0.009252
Q24          -0.010891
Q22          -0.011821
Q25          -0.012660
Q6           -0.016072
Q2           -0.018307
Q15          -0.019570
Q9           -0.021261
Q5           -0.023893
Q3           -0.026349
Q17          -0.028461
Name: Evaluation, dtype: float64
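3. Training a model with K-Means
KMeans() parameters: n_clusters is the number of clusters to form; init is the method used to initialize the centers, which defaults to k-means++ rather than the random sampling of classic K-means (you can set it to random to use random initialization); n_jobs sets the number of CPU cores to use, with -1 meaning all of them (this parameter was removed from KMeans in scikit-learn 1.0).
The complete code is as follows: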
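import pandas as pd
from sklearn.cluster import KMeans

traindata = pd.read_csv(r'data/train.csv')
testdata = pd.read_csv(r'data/test.csv')

# Drop the id column, which carries no information
traindata.drop('CaseId', axis=1, inplace=True)
testdata.drop('CaseId', axis=1, inplace=True)

# head() shows the first 5 rows by default; pass a number to show more,
# for example head(50) shows the first 50 rows

# Count the missing values in each column
# res = traindata.isnull().sum()

# Brief summary: non-null count and data type of every column
# res = traindata.info()

# Get a quick look at the data as plots
# hist() draws histograms; figsize sets the size of the output figure
# traindata.hist(figsize=(20, 20))

# To see how the features are correlated, compute the correlation matrix
# and sort it by one column (ascending=False sorts in descending order)
# corr_matrix = traindata.corr()
# corr_matrix = corr_matrix['Evaluation'].sort_values(ascending=False)
# print(corr_matrix)

# Separate the class label from the training set
y = traindata['Evaluation']
traindata.drop('Evaluation', axis=1, inplace=True)

# n_jobs=-1 from the original code is dropped: KMeans no longer accepts it
# in scikit-learn 1.0 and later.
clf = KMeans(n_clusters=2, init='k-means++')
# KMeans is unsupervised; y is accepted for API consistency but ignored.
clf.fit(traindata, y)
y_pred = clf.predict(testdata)

# Save the predictions
submitData = pd.read_csv(r'data/sample_submit.csv')
submitData['Evaluation'] = y_pred
submitData.to_csv("KMeans.csv", index=False)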
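The result: 0.485968
K-means is one of the ten classic data mining algorithms, but getting satisfactory results from it in practice is very hard. It is unsupervised, so its cluster indices have no built-in relation to the Evaluation label. I gave it a try here, and it really does not work.
4. Training XGBoost myself
Training directly, the code is as follows: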
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score

traindata = pd.read_csv(r'data/train.csv')
testdata = pd.read_csv(r'data/test.csv')

# Drop the id column, which carries no information
traindata.drop('CaseId', axis=1, inplace=True)
testdata.drop('CaseId', axis=1, inplace=True)

# Separate the class label from the training set
trainlabel = traindata['Evaluation']
traindata.drop('Evaluation', axis=1, inplace=True)
traindata1, testdata1, trainlabel1 = traindata.values, testdata.values, trainlabel.values

# Split the training data into a training part and a hold-out part
X_train, X_test, y_train, y_test = train_test_split(traindata1, trainlabel1,
                                                    test_size=0.3, random_state=123457)

# Train the model. The original call used silent=True, nthread=4 and seed=27;
# recent xgboost releases name these verbosity, n_jobs and random_state.
model = xgb.XGBClassifier(max_depth=5,
                          learning_rate=0.1,
                          gamma=0.1,
                          n_estimators=160,
                          verbosity=0,
                          objective='binary:logistic',
                          n_jobs=4,
                          random_state=27,
                          colsample_bytree=0.8)
model.fit(X_train, y_train)

# Predict on the hold-out part
y_pred = model.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print('accuracy:%.2f%%' % (accuracy * 100))

# Check the AUC metric (only defined for binary classification)
# from sklearn import metrics
# print("AUC Score (Train): %f" % metrics.roc_auc_score(y_test, y_pred))


def run_predict():
    y_pred_test = model.predict_proba(testdata1)[:, 1]
    # Save the predictions
    submitData = pd.read_csv(r'data/sample_submit.csv')
    submitData['Evaluation'] = y_pred_test
    submitData.to_csv("xgboost.csv", index=False)


run_predict()
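The result is as follows:
I then tuned the XGBoost parameters. The tuning result is as follows:
Here are the best parameters found for the model: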
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=0.6, gamma=0.3, learning_rate=0.1,
              max_delta_step=0, max_depth=6, min_child_weight=4, missing=None,
              n_estimators=1000, n_jobs=1, nthread=4,
              objective='binary:logistic', random_state=0, reg_alpha=1,
              reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
              subsample=0.9)
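Running with these parameters gives the following result:
Compared with the earlier xgboost run, the result improved somewhat.
That is all I will try for now; xgboost really does seem to be a powerful tool for this kind of problem. When I have time I will keep trying other algorithms. The point of taking on these simple problems is to keep practicing the algorithms I have learned.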