Model Selection & Evaluation

Agenda

Cross Validation
Hyperparameter Tuning
Model Evaluation
Model Persistence
Validation Curves
Learning Curves

Cross Validation

Simple models underfit.
Accuracy on the training data and on the validation data is not much different.
But the accuracy itself is not great.
This is the low-variance, high-bias situation.
Moving towards more complex models, accuracy improves.
But the gap between accuracy on the training data and on the validation data increases.
This is the high-variance, low-bias situation.

We need to compare across models to find the best model.
We need to compare across all hyper-parameter settings for a particular model.
The data that is used for training should not be used for validation.
The validation accuracy is the figure that we report.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline

digits = load_digits()
plt.imshow(digits.images[0],cmap='gray')


from sklearn.model_selection import train_test_split
dt = DecisionTreeClassifier(max_depth=10)
trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)
dt.fit(trainX,trainY)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

dt.score(testX,testY)

0.8355555555555556

dt.score(trainX,trainY)

0.9740163325909429

Decreasing the complexity of the model

dt = DecisionTreeClassifier(max_depth=7)
dt.fit(trainX,trainY)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

dt.score(testX,testY)

0.8155555555555556

dt.score(trainX,trainY)

0.8864142538975501

Observation: as the complexity decreases, the gap between training and validation accuracy also decreases.

Cross Validation API

Split the data into k parts.
Use k - 1 parts for training the model.
Use the kth part for validation.
Repeat the above steps, holding out a different part each time, to get a generalized picture of the model's behaviour.
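The steps above can be sketched as an explicit loop. This is only a minimal illustration: the `KFold` splitter with 5 shuffled folds, the depth-7 tree, the `random_state` values and the variable names are choices made for this example, not something taken from the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Assumed settings for illustration: 5 folds, shuffled with a fixed seed.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    model = DecisionTreeClassifier(max_depth=7, random_state=0)
    # Train on k - 1 parts ...
    model.fit(X[train_idx], y[train_idx])
    # ... and validate on the single held-out part.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

A helper such as cross_val_score wraps essentially this loop, so the explicit version is mostly useful for seeing which rows end up in which fold.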

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt, digits.data, digits.target)

d:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)

scores

array([0.68604651, 0.8096828 , 0.74161074])

scores.mean()

0.7457800181857993

Cross-validate Function: Scores for multiple metrics

from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro', 'accuracy']
results=cross_validate(dt, digits.data, digits.target, scoring=scoring, cv=5)
results

{'fit_time': array([0.01800108, 0.01200056, 0.01400089, 0.01300073, 0.01400089]),
 'score_time': array([0.00300026, 0.00300002, 0.00200009, 0.00300002, 0.00300002]),
 'test_precision_macro': array([0.7732771 , 0.71087424, 0.77524663, 0.78964348, 0.7585891 ]),
 'test_recall_macro': array([0.76278636, 0.66593093, 0.76876662, 0.77198413, 0.74553688]),
 'test_accuracy': array([0.76373626, 0.66574586, 0.76880223, 0.77310924, 0.74366197])}

for k, v in results.items():
    print(k,end=' ')
    print(v)

fit_time [0.01800108 0.01200056 0.01400089 0.01300073 0.01400089]
score_time [0.00300026 0.00300002 0.00200009 0.00300002 0.00300002]
test_precision_macro [0.7732771  0.71087424 0.77524663 0.78964348 0.7585891 ]
test_recall_macro [0.76278636 0.66593093 0.76876662 0.77198413 0.74553688]
test_accuracy [0.76373626 0.66574586 0.76880223 0.77310924 0.74366197]

results.keys()

dict_keys(['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro', 'test_accuracy'])

import pandas as pd
results=pd.DataFrame(results.values(),index=results.keys()).T

results

	fit_time	score_time	test_precision_macro	test_recall_macro	test_accuracy
0	0.018001	0.003	0.773277	0.762786	0.763736
1	0.012001	0.003	0.710874	0.665931	0.665746
2	0.014001	0.002	0.775247	0.768767	0.768802
3	0.013001	0.003	0.789643	0.771984	0.773109
4	0.014001	0.003	0.758589	0.745537	0.743662

Stratification for dealing with imbalanced classes

StratifiedKFold
- Class frequencies are preserved in each data split

import numpy as np

Y = np.append(np.ones(12),np.zeros(6))
X = np.ones((18,3))

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)
list(skf.split(X,Y))

[(array([ 4,  5,  6,  7,  8,  9, 10, 11, 14, 15, 16, 17]),
  array([ 0,  1,  2,  3, 12, 13])),
 (array([ 0,  1,  2,  3,  8,  9, 10, 11, 12, 13, 16, 17]),
  array([ 4,  5,  6,  7, 14, 15])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7, 12, 13, 14, 15]),
  array([ 8,  9, 10, 11, 16, 17]))]

Y[[ 4,  5,  6,  7,  8,  9, 10, 11, 14, 15, 16, 17]]

array([1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])

Hyperparameter Tuning

Model parameters are learnt from the data by the learning algorithm.
Hyper-parameters need to be configured before training.
Hyper-parameters are data dependent and often need experiments to find the best values.
sklearn provides GridSearchCV for finding the best hyper-parameters.

Exhaustive GridSearch

Searches sequentially through all the configured parameter values,
for all possible combinations.
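To see what "all possible combinations" means in practice, the candidate settings can be enumerated with `ParameterGrid` from sklearn.model_selection. The grid below is a hypothetical one chosen for illustration (the `criterion` entry is not part of the search that follows).

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 5 depths x 2 criteria = 10 candidate settings.
param_grid = {
    "max_depth": list(range(5, 30, 5)),
    "criterion": ["gini", "entropy"],
}

candidates = list(ParameterGrid(param_grid))
print(len(candidates))

# Each candidate is a plain dict that could be passed to set_params().
for params in candidates[:3]:
    print(params)
```

The number of candidates is the product of the list lengths, which is why exhaustive search becomes expensive quickly as more hyper-parameters are added.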

trainX, testX, trainY, testY = train_test_split(digits.data, digits.target)
dt = DecisionTreeClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(dt, param_grid={'max_depth':range(5,30,5)}, cv=5)
grid_search.fit(digits.data,digits.target)

d:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None, param_grid={'max_depth': range(5, 30, 5)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

grid_search.best_params_

{'max_depth': 20}

grid_search.best_score_

0.7868670005564831

grid_search.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=20,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

RandomizedSearch

Unlike GridSearch, not all parameter combinations are tried.
Instead, a fixed number of parameter settings is sampled from the specified distributions.
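The sampling step can be looked at on its own with `ParameterSampler`. This is a small sketch under assumed settings: the two-entry search space, the budget of 5 draws and the fixed `random_state` are illustrative choices only.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: a list is sampled uniformly,
# a scipy distribution is sampled through its rvs() method.
param_dist = {
    "max_depth": [3, None],
    "min_samples_split": randint(2, 11),  # integers 2..10
}

# Draw a fixed budget of 5 settings; random_state is set only for repeatability.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

The budget (`n_iter`) is fixed up front, so the cost of the search does not grow with the size of the search space.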

Comparing GridSearchCV and RandomizedSearchCV

from time import time

# randint is a discrete uniform distribution; the search samples integers from the specified range
from scipy.stats import randint
X = digits.data
Y = digits.target

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1,11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

param_dist

{'max_depth': [3, None],
 'max_features': <scipy.stats._distn_infrastructure.rv_frozen at 0xf179248>,
 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen at 0xf1793c8>,
 'bootstrap': [True, False],
 'criterion': ['gini', 'entropy']}

rf = RandomForestClassifier(n_estimators=20)
n_iter_search = 20
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, Y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))

RandomizedSearchCV took 4.52 seconds for 20 candidates parameter settings.


d:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

random_search.best_score_

0.9365609348914858

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(X, Y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))

GridSearchCV took 15.34 seconds for 72 candidate parameter settings.


d:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

grid_search.best_score_

0.9354479688369505

GridSearchCV & RandomizedSearchCV can also tune the hyper-parameters of transformers when those transformers are part of a pipeline.
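A minimal sketch of tuning a transformer inside a pipeline follows. The choice of PCA as the transformer, the step names `pca` and `tree`, and the grid values are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Assumed pipeline for illustration: PCA (a transformer) followed by a tree.
pipe = Pipeline([
    ("pca", PCA(random_state=0)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of one pipeline step.
param_grid = {
    "pca__n_components": [10, 20, 30],
    "tree__max_depth": [5, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)
```

Because the transformer is refitted inside every fold, its settings are tuned without the validation part leaking into the preprocessing step.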

Model Evaluation

Three different ways to evaluate the quality of a model's predictions
- The score method of estimators: a default metric is configured, i.e. r2_score for regression and accuracy for classification
- Model evaluation tools like cross_validate or cross_val_score, which return the estimator's default score unless a scoring argument is given
- The metrics module, which is rich with various prediction error calculation techniques

trainX, testX, trainY, testY = train_test_split(X,Y)
rf.fit(trainX, trainY)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Technique 1 - Using the score function

rf.score(testX,testY)

0.9577777777777777

Technique 2 - Using cross_val_score as discussed above

cross_val_score(rf,X,Y,cv=5)

array([0.92307692, 0.90055249, 0.93871866, 0.94677871, 0.88169014])

Cancer prediction sample for understanding metrics

from sklearn.datasets import load_breast_cancer
dt = DecisionTreeClassifier()
cancer_data = load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(cancer_data.data, cancer_data.target)

dt.fit(trainX,trainY)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

pred = dt.predict(testX)

Technique 3 - Using metrics

Classification metrics

Accuracy Score - Correct classifications / (Correct classifications + Incorrect classifications)

from sklearn import metrics
metrics.accuracy_score(y_pred=pred, y_true=testY)

0.9300699300699301

Confusion Matrix - Shows details of the classification, including TP, FP, TN, FN
- True Positive (TP), actual class is 1 & prediction is also 1
- True Negative (TN), actual class is 0 & prediction is also 0
- False Positive (FP), actual class is 0 & prediction is 1
- False Negative (FN), actual class is 1 & prediction is 0

confusion_result=metrics.confusion_matrix(y_pred=pred, y_true=testY, labels=[0,1])
confusion_result

array([[60,  6],
       [ 4, 73]], dtype=int64)

tp=confusion_result[1][1]
tn=confusion_result[0][0]
fp=confusion_result[0][1]
fn=confusion_result[1][0]

Precision Score
- Ability of a classifier not to label a sample as positive if it is negative
- Calculated as TP/(TP+FP)
- We don’t want a non-spam mail to be marked as spam

precision_result=tp/(tp+fp)
precision_result

0.9240506329113924

metrics.precision_score(y_pred=pred, y_true=testY)

0.9240506329113924

Recall Score
- Ability of the classifier to find all positive samples
- Calculated as TP/(TP+FN)
- It’s OK to predict that a patient's tumor is cancerous so that it undergoes more tests
- But it is not OK to miss a cancer patient without further analysis

metrics.recall_score(y_pred=pred, y_true=testY)

0.948051948051948

F1 score
- Harmonic mean of precision & recall, calculated as 2 * (precision * recall) / (precision + recall)

metrics.f1_score(y_pred=pred, y_true=testY)

0.9358974358974359
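As a cross-check, precision, recall and F1 can be worked out by hand from the confusion matrix printed earlier (TP = 73, FP = 6, FN = 4). The snippet only redoes that arithmetic; the variable names are illustrative.

```python
# Counts taken from the confusion matrix printed earlier in this section.
tp, fp, fn = 73, 6, 4

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

The harmonic mean is pulled towards the smaller of the two values, so a classifier cannot get a high F1 by doing well on only one of precision or recall.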

ROC & AUC
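An ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied, and AUC is the area under that curve. The sketch below is one possible way to compute both on the cancer data; the depth-3 tree, the fixed `random_state` values and the use of predict_proba as the score are assumptions made for this example.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer_data = load_breast_cancer()
trainX, testX, trainY, testY = train_test_split(
    cancer_data.data, cancer_data.target, random_state=0)

# max_depth=3 is an assumed setting so that the leaves give graded probabilities.
dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(trainX, trainY)

# ROC analysis needs a score per sample, not a hard label:
# use the predicted probability of the positive class (label 1).
scores = dt.predict_proba(testX)[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(testY, scores)
auc = metrics.roc_auc_score(testY, scores)
print(auc)
```

Plotting `fpr` against `tpr` with plt.plot gives the curve itself; an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.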

House Price Prediction - Understanding metrics

# fetch_california_housing is importable directly from sklearn.datasets
from sklearn.datasets import fetch_california_housing
house_data = fetch_california_housing()
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(house_data.data, house_data.target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

pred = lr.predict(house_data.data)

Metrics for Regression

mean squared error
- Mean of the squared differences between predicted value & actual value

metrics.mean_squared_error(y_pred=pred, y_true=house_data.target)

0.5243209861846071

mean absolute error
- Mean of the absolute differences between predicted value & actual value

metrics.mean_absolute_error(y_pred=pred, y_true=house_data.target)

0.5311643817546461

r2 score
- Best possible value is 1.0; it can be negative when the model fits worse than always predicting the mean
- It measures goodness of fit for regression models
- Calculated as 1 - (residual sum of squares)/(total sum of squares)
- High r2 means the predictions are close to the targets

metrics.r2_score(y_pred=pred, y_true=house_data.target)

0.6062326851998051
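The three regression metrics above can be written out directly with NumPy. The four target and prediction values are toy numbers chosen for illustration, not taken from the housing data.

```python
import numpy as np

# Toy values chosen for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # mean of the squared errors
mae = np.mean(np.abs(errors))   # mean of the absolute errors

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, r2)
```

Squaring makes the mean squared error more sensitive to large individual errors than the mean absolute error, which is worth keeping in mind when the data contains outliers.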

Metrics for Clustering

Two forms of evaluation
supervised, which uses ground truth class values for each sample.
- completeness_score
- homogeneity_score
unsupervised, which measures the quality of the model itself
- silhouette_score
- calinski_harabasz_score (spelled calinski_harabaz_score in older scikit-learn releases)

completeness_score

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
The score is 1.0 if all data belonging to the same class ends up in the same cluster, even if multiple classes share a cluster.

from sklearn.metrics.cluster import completeness_score

completeness_score( labels_true=[10,10,11,11],labels_pred=[1,1,0,0])

1.0

The score is 1.0 because all the data belonging to the same class belongs to the same cluster.

completeness_score( labels_true=[11,22,22,11],labels_pred=[1,0,1,1])

0.3836885465963443

The score is about 0.38 because class 22 is split across clusters: cluster 1 holds [11, 22, 11] and cluster 0 holds [22].

print(completeness_score([10, 10, 11, 11], [0, 0, 0, 0]))

1.0

homogeneity_score

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

from sklearn.metrics.cluster import homogeneity_score

homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])

1.0

homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])

0.9999999999999999

homogeneity_score([0, 0, 0, 0], [1, 1, 0, 0])

1.0

Every sample belongs to one class, so each cluster is trivially pure and the score is 1.0 even though that class is broken into two clusters.

silhouette_score

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
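The formula can be checked by hand on a tiny example. The sketch assumes four one-dimensional points in two clusters (toy values) and computes the coefficient for a single sample; with only one other cluster present, that cluster is automatically the nearest one.

```python
import numpy as np

# Toy 1-D data: two clusters, values chosen for illustration.
cluster_a = np.array([0.0, 1.0])
cluster_b = np.array([10.0, 11.0])

sample = cluster_a[0]

# a: mean distance to the other members of the sample's own cluster.
a = np.mean(np.abs(cluster_a[1:] - sample))
# b: mean distance to the members of the nearest other cluster.
b = np.mean(np.abs(cluster_b - sample))

s = (b - a) / max(a, b)
print(a, b, s)
```

A value close to 1 means the sample sits well inside its own cluster, a value near 0 means it lies between clusters, and a negative value suggests it may have been assigned to the wrong cluster.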

Selecting the number of clusters with silhouette analysis on KMeans clustering

from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)

plt.scatter(X[:,0],X[:,1],s=10)


range_n_clusters = [2, 3, 4, 5, 6]

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for n_cluster in range_n_clusters:
    kmeans = KMeans(n_clusters=n_cluster)
    kmeans.fit(X)
    labels = kmeans.predict(X)
    print (n_cluster, silhouette_score(X,labels))

2 0.7049787496083262
3 0.5882004012129721
4 0.6505186632729437
5 0.5746932321727457
6 0.49417400746431644

By silhouette score, the best number of clusters is 2, although the data was generated with 4 centers.

calinski_harabasz_score

The score is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; a higher score indicates better-separated clusters.

# named calinski_harabaz_score in older scikit-learn releases
from sklearn.metrics import calinski_harabasz_score

for n_cluster in range_n_clusters:
    kmeans = KMeans(n_clusters=n_cluster)
    kmeans.fit(X)
    labels = kmeans.predict(X)
    print(n_cluster, calinski_harabasz_score(X, labels))

2 1604.112286409658
3 1809.991966958033
4 2704.4858735121097
5 2281.91411035916
6 2040.6320809618921

Model Persistence

Model training is an expensive process.
It is desirable to save the model for future reuse.
This can be achieved using pickle & joblib.

import pickle
s = pickle.dumps(dt)
pickle.loads(s)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

type(s)

bytes

joblib is a more efficient alternative to pickle for objects that hold large NumPy arrays.
It writes straight to a file instead of returning a bytes object.

# sklearn.externals.joblib was removed from scikit-learn; import the joblib package directly
import joblib
joblib.dump(dt, 'dt.joblib')

['dt.joblib']

Loading the file back into model

dt = joblib.load('dt.joblib')
dt

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Validation Curves

To validate a model, we need a scoring function.
Create a grid of possible hyper-parameter configurations.
Select the hyper-parameter value which gives the best score.

from sklearn.model_selection import validation_curve

param_range = np.arange(1, 50, 2)

train_scores, test_scores = validation_curve(RandomForestClassifier(), 
                                             digits.data, 
                                             digits.target, 
                                             param_name="n_estimators", 
                                             param_range=param_range,
                                             cv=3, 
                                             scoring="accuracy", 
                                             n_jobs=-1)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.show()


Learning Curves

Learning curves show how the training & validation scores vary as the number of training samples increases.

from sklearn.model_selection import learning_curve
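The original stops at the import, so what follows is only a sketch of how learning_curve might be used here. The depth-10 tree, the five training-set sizes, the 3-fold split and the fixed `random_state` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Assumed settings: 5 training-set sizes from 10% to 100%, 3-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=10, random_state=0),
    digits.data,
    digits.target,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
    scoring="accuracy",
)

# Average over the CV folds (axis 1) for each training-set size.
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(n, round(tr, 3), round(te, 3))
```

Plotting `train_mean` and `test_mean` against `train_sizes` with plt.plot, in the same way as the validation curve, shows whether collecting more data is likely to narrow the gap between the two scores.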
