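Hyperparameters: most algorithms require choosing many settings; these settings, or tuning knobs, are called hyperparameters. They let us control the behavior of an algorithm while we try to maximize its performance. Ways to evaluate how good a model is include cross-validation, bootstrapping, the t-test, and McNemar's test (the paired chi-squared test).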

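1. Evaluating a model

The three most commonly used model evaluation strategies are listed here. Each one judges a model by comparing its predictions against known ground-truth values.

k-fold cross-validation
Bootstrapping
McNemar's test (paired chi-squared test)

Split the dataset into a training set and a test set. The script that follows uses an 80/20 split; if the dataset is small, a 50/50 split may be more suitable.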

# Evaluate a model
from sklearn.datasets import load_iris
import numpy as np
import cv2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X, cv2.ml.ROW_SAMPLE, y)
_, y_hat = knn.predict(X)
print('Accuracy on the training set:', accuracy_score(y, y_hat))

# Split the dataset into training and test sets (80/20 here; 50/50 may suit a small dataset better)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37, train_size=0.8)
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
_, y_test_hat = knn.predict(X_test)
print('Accuracy on the test set:', accuracy_score(y_test, y_test_hat))

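The optimal model:

The question of the "optimal model" can essentially be seen as finding the balance point between bias and variance.

A low-complexity model (high bias) tends to underfit the training data: it predicts poorly on both the training data and new data.
A high-complexity model (high variance) tends to overfit the training data: it predicts the training data well but predicts new data poorly.
A model of moderate complexity reaches the highest score on the validation curve, which indicates that bias and variance are balanced at that level of complexity.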
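The trade-off described above can be made visible with a validation curve. The following is only a sketch, assuming scikit-learn's KNeighborsClassifier on the iris data (neither the parameter range nor the 5-fold setting comes from the original notes). For k-NN, a small n_neighbors corresponds to a complex, high-variance model and a large n_neighbors to a simple, high-bias one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumed search range for illustration: n_neighbors = 1..30
k_values = np.arange(1, 31)
train_scores, valid_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=k_values, cv=5)

# Average over the 5 folds: one training and one validation score per k
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for k, tr, va in zip(k_values, mean_train, mean_valid):
    print(f"k={k:2d}  train={tr:.3f}  validation={va:.3f}")

best_k = k_values[np.argmax(mean_valid)]
print("k with the highest mean validation score:", best_k)
```

A large gap between the training and validation score suggests overfitting, while two low scores suggest underfitting.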


2. Understanding k-fold (leave-one-out) cross-validation, bootstrapping, and McNemar's test

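2.1 k-fold cross-validation

Cross-validation: a method for evaluating how well a model generalizes. It is more stable and thorough than a single split of the dataset into a training set and a test set.

k-fold cross-validation: the data is split into k subsets (folds). One fold is used for evaluation and the remaining k-1 folds for training. This is repeated k times so that every fold serves as the test set once; k is typically between 5 and 10.

Purpose: cross-validation only assesses how well a given algorithm generalizes when trained on a particular dataset. By looking at how much the accuracy varies across folds, we get a sense of the worst and best performance to expect when the model is applied to new data.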
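To show what happens inside each fold, here is a minimal sketch of the k-fold loop written out by hand. It assumes scikit-learn's KFold and KNeighborsClassifier; the shuffle and random_state settings are my own choices, made because the iris samples are sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Shuffle before splitting, otherwise a fold could miss a whole class
kf = KFold(n_splits=5, shuffle=True, random_state=37)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])                       # train on k-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", fold_scores)
print("Worst fold:", fold_scores.min(), " best fold:", fold_scores.max())
print("Mean and standard deviation:", fold_scores.mean(), fold_scores.std())
```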

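2.2 Leave-one-out cross-validation

Leave-one-out cross-validation: choose the number of folds to equal the number of data points. With N data points, set k = N, so each test set contains exactly one data point.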

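2.3 Bootstrapping

Bootstrapping: from a dataset of N samples, draw N samples at random with replacement to form a bootstrap (so the same data point may appear in the training set more than once). Train the classifier on the bootstrap sample and test it on the out-of-bag samples (all samples that were not drawn into the bootstrap).

The procedure is repeated thousands of times so that the result is representative.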
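A natural question is how many samples end up out of bag. The probability that a given sample is never drawn in N draws with replacement is (1 - 1/N)^N, which approaches 1/e (about 0.368) as N grows. The small simulation below is a sketch that checks this; N = 150 (the size of the iris dataset) and the number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 150        # assumed dataset size (same as iris)
n_iter = 2000  # assumed number of bootstrap repetitions

oob_fractions = []
for _ in range(n_iter):
    idx_boot = rng.integers(0, N, size=N)      # N draws with replacement
    n_unique = np.unique(idx_boot).size        # samples that made it into the bootstrap
    oob_fractions.append(1 - n_unique / N)     # the rest are out of bag

expected = (1 - 1 / N) ** N
print("Simulated mean out-of-bag fraction:", np.mean(oob_fractions))
print("Theoretical (1 - 1/N)^N:", expected)
print("Limit 1/e:", np.exp(-1))
```

So each bootstrap round tests the classifier on roughly a third of the data.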

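2.4 T-test

T-test: in practice, a t-test lets us decide whether two data samples come from underlying distributions with the same mean (expected value). For our purposes, it tells us whether the test scores of two independent classifiers have the same mean.

The t-test is available as the ttest_ind function in SciPy's stats module. We judge the result by the p-value it returns: the p-value is the probability of observing a difference in means at least this large if the two classifiers actually had the same mean score. The smaller p is, the stronger the evidence that the two classifiers give significantly different results; a common threshold is p < 0.05.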
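To make the p-value less of a black box, the sketch below computes the t statistic by hand and compares it with ttest_ind. It assumes the equal-variance (Student) form, which is ttest_ind's default; the two score lists are made-up numbers for illustration, not results from any real classifier.

```python
import numpy as np
from scipy import stats

# Made-up cross-validation scores for two hypothetical classifiers
scores_a = np.array([0.96, 0.94, 0.98, 0.95, 0.97])
scores_b = np.array([0.90, 0.93, 0.91, 0.89, 0.92])

n_a, n_b = len(scores_a), len(scores_b)
df = n_a + n_b - 2

# Pooled variance of the two samples (unbiased estimates, ddof=1)
pooled_var = ((n_a - 1) * scores_a.var(ddof=1)
              + (n_b - 1) * scores_b.var(ddof=1)) / df

# t = difference of means divided by its standard error
t_manual = (scores_a.mean() - scores_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=df)

result = stats.ttest_ind(scores_a, scores_b)
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

A p-value below 0.05 here would lead us to reject the hypothesis that the two sets of scores share the same mean.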

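2.5 McNemar's test (paired chi-squared test)

McNemar's test: determines whether two models produce significantly different classification results. It is based on two counts:

How many data points did model A misclassify that model B classified correctly?
How many data points did model A classify correctly that model B misclassified?

The smaller the p-value, the stronger the evidence that the two classifiers differ (as with the t-test).

Because the test needs the classification outcome for every individual data point, it makes most sense to apply McNemar's test to leave-one-out cross-validation.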
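The two counts can be read directly off the per-sample predictions. The sketch below uses made-up labels and predictions for two hypothetical models, and computes the exact two-sided binomial p-value. This is the plain exact variant of McNemar's test, chosen here for simplicity; other variants (such as mid-p) give slightly different numbers.

```python
import numpy as np
from scipy.stats import binom

# Made-up ground truth and predictions for two hypothetical models
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
pred_a = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
pred_b = np.array([0, 1, 1, 0, 0, 0, 0, 0, 1, 1])

correct_a = pred_a == y_true
correct_b = pred_b == y_true

b = int(np.sum(correct_a & ~correct_b))   # A correct, B wrong
c = int(np.sum(~correct_a & correct_b))   # A wrong, B correct

# Under the null hypothesis, each disagreement is equally likely to favor A or B,
# so min(b, c) follows a Binomial(b + c, 0.5) distribution.
n = b + c
p_exact = min(1.0, 2.0 * binom.cdf(min(b, c), n, 0.5))
print("A correct / B wrong:", b)
print("A wrong / B correct:", c)
print("Exact two-sided p-value:", p_exact)
```

Data points on which both models agree (both right or both wrong) do not enter the test at all.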

# Understanding k-fold cross-validation, leave-one-out cross-validation, bootstrapping, the t-test, and McNemar's test


# Manually implement 2-fold cross-validation with OpenCV
from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import train_test_split
import cv2
from sklearn.metrics import accuracy_score
iris = load_iris()                                     # load the iris dataset
X = iris.data.astype(np.float32)
y = iris.target
X_fold1, X_fold2, y_fold1, y_fold2 = train_test_split(X, y, random_state=37, train_size=0.5)
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X_fold1, cv2.ml.ROW_SAMPLE, y_fold1)         # train on fold 1, test on fold 2
_, y_hat_fold2 = knn.predict(X_fold2)
knn.train(X_fold2, cv2.ml.ROW_SAMPLE, y_fold2)         # train on fold 2, test on fold 1
_, y_hat_fold1 = knn.predict(X_fold1)
print('Accuracy on fold 1 (manual OpenCV implementation):', accuracy_score(y_fold1, y_hat_fold1))
print('Accuracy on fold 2 (manual OpenCV implementation):', accuracy_score(y_fold2, y_hat_fold2))

# Run k-fold cross-validation automatically with scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
model = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(model, X, y, cv=5)     # 5-fold cross-validation
print('5-fold cross-validation with scikit-learn:', scores)
print('5-fold cross-validation with scikit-learn (mean, standard deviation):', scores.mean(), scores.std())


# Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
# print('Leave-one-out cross-validation:', scores)
print('Leave-one-out cross-validation (mean, standard deviation):', scores.mean(), scores.std())


# Use bootstrapping to assess robustness
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
# From a dataset of N samples, draw N samples at random with replacement to form a bootstrap.
# np.random.choice draws len(X) indices from the range [0, len(X)-1] with replacement (replace=True)
# and returns the array of indices that makes up the bootstrap.
idx_boot = np.random.choice(len(X), size=len(X), replace=True)
X_boot = X[idx_boot, :]
y_boot = y[idx_boot]
# Put every sample that did not appear in the bootstrap into the out-of-bag set
# (np.bool was removed in NumPy 1.24, so the built-in bool is used as the dtype)
idx_oob = np.array([x not in idx_boot
                    for x in np.arange(len(X))], dtype=bool)
X_oob = X[idx_oob, :]
y_oob = y[idx_oob]
knn.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)         # train the classifier on the bootstrap sample
_, y_hat = knn.predict(X_oob)                        # test the classifier on the out-of-bag samples
print('Accuracy with a single bootstrap:', accuracy_score(y_oob, y_hat))
# Repeat the bootstrap procedure n_iter times (10,000 by default)
def yield_bootstrap(model, X, y, n_iter=10000):
    for _ in range(n_iter):
        # train the classifier on bootstrap
        idx_boot = np.random.choice(len(X), size=len(X), replace=True)
        X_boot = X[idx_boot, :]
        y_boot = y[idx_boot]
        model.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)

        # test classifier on out-of-bag examples
        idx_oob = np.array([x not in idx_boot
                            for x in np.arange(len(X))],
                           dtype=bool)
        X_oob = X[idx_oob, :]
        y_oob = y[idx_oob]
        _, y_hat = model.predict(X_oob)

        # return accuracy
        yield accuracy_score(y_oob, y_hat)     # yield turns the function into a generator, so no list append is needed
np.random.seed(42)
acc = list(yield_bootstrap(knn, X, y, n_iter=10))
print('Accuracy over 10 bootstraps (mean, standard deviation):', np.mean(acc), np.std(acc))
# acc = list(yield_bootstrap(knn, X, y, n_iter=1000))
# print('Accuracy over 1000 bootstraps (mean, standard deviation):', np.mean(acc), np.std(acc))
# acc = list(yield_bootstrap(knn, X, y, n_iter=10000))
# print('Accuracy over 10000 bootstraps (mean, standard deviation):', np.mean(acc), np.std(acc))


# T-test
from scipy.stats import ttest_ind
# Quick sanity check of the t-test
scores_a = [1, 1, 1, 1, 1]   # suppose we ran 5-fold cross-validation on two classifiers: model A scores 100%, model B scores 0%
scores_b = [0, 0, 0, 0, 0]
print('Model A 100% correct, model B 0% correct, t-test:', ttest_ind(scores_a, scores_b))   # p=0
scores_a = [0.9, 0.9, 0.9, 0.8, 0.8]
scores_b = [0.8, 0.8, 0.9, 0.9, 0.9]
print('Both models have the same set of scores, t-test:', ttest_ind(scores_a, scores_b))    # p=1
# Using the t-test in practice
k1 = KNeighborsClassifier(n_neighbors=1)          # model A
scores_k1 = cross_val_score(k1, X, y, cv=10)
print('Accuracy of model A (mean, standard deviation):', np.mean(scores_k1), np.std(scores_k1))
k3 = KNeighborsClassifier(n_neighbors=3)          # model B
scores_k3 = cross_val_score(k3, X, y, cv=10)
print('Accuracy of model B (mean, standard deviation):', np.mean(scores_k3), np.std(scores_k3))
print('T-test on model A vs. model B:', ttest_ind(scores_k1, scores_k3))   # a large p-value (e.g. 0.77) means no evidence that A and B have different mean scores


# McNemar's test (paired chi-squared test)
from scipy.stats import binom


def mcnemar_midp(b, c):
    """
    Compute McNemar's test using the "mid-p" variant suggested by:

    M.W. Fagerland, S. Lydersen, P. Laake. 2013. The McNemar test for 
    binary matched-pairs data: Mid-p and asymptotic are better than exact 
    conditional. BMC Medical Research Methodology 13: 91.

    `b` is the number of observations correctly labeled by the first---but 
    not the second---system; `c` is the number of observations correctly 
    labeled by the second---but not the first---system.
    """
    n = b + c
    x = min(b, c)
    dist = binom(n, .5)
    p = 2. * dist.cdf(x)
    midp = p - dist.pmf(x)
    return midp
# Quick sanity check of McNemar's test
scores_a = np.array([1, 1, 1, 1, 1])
scores_b = np.array([0, 0, 0, 0, 0])
a1_b0 = scores_a * (1 - scores_b)
print('Data points where A is correct and B is wrong:', a1_b0)
a0_b1 = (1 - scores_a) * scores_b
print('Data points where B is correct and A is wrong:', a0_b1)
print('McNemar test on models A and B:', mcnemar_midp(a1_b0.sum(), a0_b1.sum()))    # the smaller p is, the more the two classifiers differ

# Apply McNemar's test to leave-one-out cross-validation
scores_k1 = cross_val_score(k1, X, y, cv=LeaveOneOut())   # leave-one-out results for the k=1 model
scores_k3 = cross_val_score(k3, X, y, cv=LeaveOneOut())   # leave-one-out results for the k=3 model
print('k=1 correct, k=3 wrong:', np.sum(scores_k1 * (1 - scores_k3)))
print('k=3 correct, k=1 wrong:', np.sum((1 - scores_k1) * scores_k3))   # fixed: the original multiplied (1 - scores_k3) by scores_k3
print('McNemar test on leave-one-out results, k=1 vs. k=3:', mcnemar_midp(np.sum(scores_k1 * (1 - scores_k3)),
             np.sum((1 - scores_k1) * scores_k3)))

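3. Hyperparameter tuning with grid search

3.1 Grid search

Grid search: the most commonly used tool for hyperparameter tuning. In its simplest form it is a for loop that tries every possible parameter combination.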
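With more than one hyperparameter, "every possible combination" means the Cartesian product of the candidate values. Here is a minimal sketch using itertools.product; the choice of the weights parameter and the candidate values are my own assumptions for illustration, and a held-out validation split is used to score each combination.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=37)

# Assumed candidate values for two k-NN hyperparameters
grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}

best_score, best_params = -1.0, None
for values in product(*grid.values()):          # 5 x 2 = 10 combinations
    params = dict(zip(grid.keys(), values))
    model = KNeighborsClassifier(**params).fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:                      # keep the first best combination
        best_score, best_params = score, params

print("Best validation accuracy:", best_score)
print("Best parameters:", best_params)
```

The number of combinations grows multiplicatively with every added parameter, which is why grids are usually kept coarse.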

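3.2 Understanding the value of a validation set

When we ran the grid search, we used the test set to select the parameter, so we can no longer use the test set for the final evaluation. The solution is to split the data into three sets:

Training set: used to build the model
Validation set: used to select the model's parameters
Test set: used to evaluate the performance of the final model

Finally, we take the value of k found by the grid search, retrain the model on the combined training and validation data, and evaluate it on the test data. This way we use as much data as possible to build the model.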

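3.3 Combining grid search with cross-validation

We may have happened to choose a split that put most of the easy-to-classify data points in the validation set. To address this, we combine grid search with cross-validation: the data is split into training and validation sets multiple times, and cross-validation is run at every step of the grid search to evaluate each parameter combination.

[Figure: grid search combined with cross-validation]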
One potential danger of the grid search we just implemented is that the outcome might be relatively sensitive to how exactly we split the data. After all, we might have accidentally chosen a split that put most of the easy-to-classify data points in the test set, resulting in an overly optimistic score. Although we would be happy at first, as soon as we tried the model on some new held-out data, we would find that the actual performance of the classifier is much lower than expected.
Instead, we can combine grid search with cross-validation. This way, the data is split multiple times into training and validation sets, and cross-validation is performed at every step of the grid search to evaluate every parameter combination.
Because grid search with cross-validation is such a commonly used method for hyperparameter tuning, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator.
We can specify all the parameters we want GridSearchCV to search over by using a dictionary. Every entry of the dictionary should be of the form {name: values}, where name is a string that should be equivalent to the parameter name usually passed to the classifier, and values is a list of values to try.
For example, in order to search for the best value of the parameter n_neighbors of the KNeighborsClassifier class, we would design the parameter dictionary as follows:

# Hyperparameter tuning with grid search

from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import train_test_split
import cv2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# A simple grid search
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)
best_acc = 0
best_k = 0
for k in range(1, 20):
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
    _, y_test_hat = knn.predict(X_test)
    acc = accuracy_score(y_test, y_test_hat)
    if acc > best_acc:
        best_acc = acc
        best_k = k
print('Best accuracy and best k:', best_acc, best_k)

# Understanding the value of a validation set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=37)  # split into (training + validation) and (test)
print('(training + validation):', X_trainval.shape)
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=37) # split into training and validation
print('training:', X_train.shape)
best_acc = 0.0
best_k = 0
for k in range(1, 20):                                 # grid search for the hyperparameter, scored on the validation set
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
    _, y_valid_hat = knn.predict(X_valid)
    acc = accuracy_score(y_valid, y_valid_hat)
    if acc >= best_acc:
        best_acc = acc
        best_k = k
print('Best validation accuracy and best k:', best_acc, best_k)
# Retrain with the best k on training + validation data (as much data as possible), then evaluate on the test set
knn = cv2.ml.KNearest_create()
knn.setDefaultK(best_k)
knn.train(X_trainval, cv2.ml.ROW_SAMPLE, y_trainval)
_, y_test_hat = knn.predict(X_test)
print('Test accuracy with the best k, and the best k:', accuracy_score(y_test, y_test_hat), best_k)

# Grid search combined with cross-validation
param_grid = {'n_neighbors': range(1, 20)}     # n_neighbors: parameter name; range(1, 20): values to try
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)    # scikit-learn's grid search with built-in cross-validation
grid_search.fit(X_trainval, y_trainval)
print('Grid search with cross-validation, best cross-validated accuracy and best k:', grid_search.best_score_, grid_search.best_params_)
print('Accuracy on the test set:', grid_search.score(X_test, y_test))

3.4 Combining grid search with nested cross-validation

[Figure: grid search combined with nested cross-validation]
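The notes give no code for this section, so the following is only a sketch of how nested cross-validation is commonly set up in scikit-learn: an inner loop (GridSearchCV) picks the hyperparameter, and an outer loop (cross_val_score) estimates how well the whole tuning procedure generalizes. The fold counts and random seeds are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumed settings: 5 inner folds for tuning, 5 outer folds for evaluation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=37)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: grid search over n_neighbors on each outer training split
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 20))},
                      cv=inner_cv)

# Outer loop: each outer test fold is never seen by the grid search
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested cross-validation scores:", nested_scores)
print("Mean and standard deviation:", nested_scores.mean(), nested_scores.std())
```

The result is an estimate of the tuning procedure's performance, not a single best k; each outer fold may select a different value.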

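4. Scoring models with different evaluation metrics

In practice, we are often interested not only in making accurate predictions, but also in using those predictions as part of a larger decision-making process. For example, minimizing the number of false positives may be just as important as maximizing accuracy.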

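4.1 Choosing the right classification metric:

Accuracy: the number of correct predictions divided by the total number of test samples
Precision: the ability of the classifier not to label a negative sample as positive
Recall (sensitivity): the ability of the classifier to retrieve all positive samples

F-score: the harmonic mean of precision and recall. Formula: 2 × (precision × recall) / (precision + recall)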
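The definitions above can be checked on a tiny example. The labels below are made up for illustration; the sketch counts true/false positives by hand and compares the results with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("manual: ", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```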

Setting the operating point: guarantee a required level of recall while making sure precision does not drop too low.
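One way to pick an operating point is to scan the decision threshold and keep the highest threshold that still meets the recall requirement. The sketch below does this with scikit-learn's precision_recall_curve; the scores and the 0.8 recall target are made-up values, not taken from the notes.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and classifier scores (higher score = more likely positive)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds; entry i belongs
# to thresholds[i], and recall decreases as the threshold increases.
target_recall = 0.8   # assumed requirement
candidates = np.where(recall[:-1] >= target_recall)[0]
idx = candidates[-1]  # highest threshold that still reaches the target recall

print("Chosen threshold:", thresholds[idx])
print("Precision at that threshold:", precision[idx])
print("Recall at that threshold:", recall[idx])
```

Raising the threshold beyond this point would improve precision further but violate the recall requirement.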

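4.2 Choosing the right regression metric:

Mean squared error: the squared error between the predicted and the true target value of each data point, averaged over all data points.
Explained variance: the degree to which a model can explain the variance (dispersion) of the test data. It is often expressed using the correlation coefficient.
R^2 score: closely related to explained variance, but uses an unbiased variance estimate. Also called the coefficient of determination. (It is the default score for scikit-learn regressors.)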
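The three metrics can be compared side by side on a few made-up numbers. The manual formulas below follow scikit-learn's definitions, under which explained variance and R^2 differ only when the residuals have a nonzero mean; the data is invented for illustration.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Made-up true targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# Explained variance: 1 - Var(residuals) / Var(y_true)
ev_manual = 1 - np.var(residuals) / np.var(y_true)

# R^2: 1 - (sum of squared residuals) / (total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE:               ", mse_manual, mean_squared_error(y_true, y_pred))
print("Explained variance:", ev_manual, explained_variance_score(y_true, y_pred))
print("R^2:               ", r2_manual, r2_score(y_true, y_pred))
```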

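4.3 Chaining algorithms into a pipeline:

As we combine elaborate evaluation metrics with carefully designed grid searches, our model selection code can become very complicated. The remedy is the pipeline, a structure that simplifies model selection.

The most common use of the Pipeline class is to chain preprocessing steps together with a supervised model such as a classifier.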

# Chaining algorithms into a pipeline

from sklearn.datasets import load_breast_cancer
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# Build a pipeline with scikit-learn
cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)
svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of the support vector machine:', svm.score(X_test, y_test))
# "scaler" is an instance of MinMaxScaler(), "svm" is an instance of SVC()
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
print('Accuracy with the pipeline:', pipe.score(X_test, y_test))


# Use the pipeline in a grid search
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}      # build the parameter grid

grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print('Best cross-validated score:', grid.best_score_)
print('Best parameters:', grid.best_params_)
print('Accuracy on the test set:', grid.score(X_test, y_test))

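After defining the parameter grid to search over, we build a GridSearchCV from the pipeline and the parameter grid. When defining the parameter grid, we need to specify which step of the pipeline each parameter belongs to. The two parameters we want to tune, C and gamma, are both parameters of SVC, the second step. We named that step "svm", so the syntax for a pipeline parameter grid is the step name, followed by __ (a double underscore), followed by the parameter name, for example svm__C.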
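If you are unsure which names are valid in the grid, the pipeline can list them itself. This is a small sketch using the same step names as above ("scaler" and "svm"); the values passed to set_params are arbitrary examples.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# get_params() exposes every tunable parameter as <step name>__<parameter name>
param_names = sorted(pipe.get_params().keys())
svm_params = [name for name in param_names if name.startswith("svm__")]
print(svm_params)

# The same names work with set_params, which is what GridSearchCV relies on
pipe.set_params(svm__C=10, svm__gamma=0.1)
print(pipe.named_steps["svm"].C, pipe.named_steps["svm"].gamma)
```

A misspelled step or parameter name makes set_params (and therefore GridSearchCV) raise an error, which is a quick way to catch typos in the grid.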

Selecting the right model with hyperparameter tuning: study notes on OpenCV and Python (23)