Machine Learning: Model Selection, Evaluation, and Hyperparameter Tuning

How do we choose a model, evaluate it, and tune its hyperparameters? Model evaluation should be performed on a test set, not the training set; otherwise the evaluation is overly optimistic, since a sufficiently flexible model can score close to 100% on the data it was trained on. Hyperparameter tuning, in turn, should be iterated on a validation set. Therefore, after preparing a dataset we generally split it into a training set and a test set, with a ratio usually between 5:5 and 8:2, that is, at most 80% for training and at least 20% for testing; the validation set is carved out of the training set. sklearn provides the sklearn.model_selection.train_test_split method to split a dataset.
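As a minimal sketch of such a split (an added illustration; the 8:2 ratio and the iris data match the later examples, while the second split that carves a validation set out of the training set is shown only for demonstration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
# 8:2 split into training and test sets; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
# optionally carve a validation set out of the training set for tuning
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, train_size=0.75, random_state=42)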

1. How to choose a suitable model?

The selection of a model needs to weigh many factors, such as the model's flexibility or complexity (roughly, the degrees of freedom controlled by its hyperparameters), and more flexibility is not automatically better. The core consideration is the balance between the bias and the variance of the model: bias corresponds to the accuracy of the model's predictions, while variance corresponds to the stability or robustness of its predictions across the whole test data. Generally speaking, underfitting means the model is too simple to capture the structure in the data, resulting in poor prediction accuracy (large bias, small variance); overfitting means the model is too flexible and has also learned the fluctuations (noise) of the training set, resulting in good accuracy on the training data but unstable predictions on new data (small bias, large variance). Neither case is good; the best model balances the two.
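To make the trade-off concrete, here is a minimal added sketch (not part of the original examples) using sklearn's validation_curve with a kNN classifier, the same model family the later sections use. Small k gives a flexible model that tends to overfit (high variance); very large k gives a rigid model that tends to underfit (high bias):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X, y = iris.data, iris.target
# training vs. cross-validated accuracy for k = 1..50
k_range = np.arange(1, 51)
train_scores, valid_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=k_range, cv=5)
# a large gap (training score >> validation score) suggests overfitting (high variance);
# low scores on both suggest underfitting (high bias)
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))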

2. The four most commonly used model evaluation methods

1. Simple hold-out validation : Split the dataset into a training set and a test set, train the model on the training set, predict on the test set, and then evaluate the model's accuracy. This brings two problems. Problem 1: because there is no standard split ratio, the model may end up underfitting or overfitting. Problem 2: because the test set is chosen at random, it may happen to contain data that suits the model, yielding a misleadingly high (good) accuracy, or the opposite. Examples are as follows:

# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
import numpy as np
X = iris.data.astype(np.float32)
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37, train_size=0.8)  # 8:2 split, random
import cv2
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)  # use K=1 nearest neighbor
knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
_, y_test_hat = knn.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_hat)  # ~96.7%

2. K-fold cross-validation : Divide the N samples of the dataset into K folds of roughly N/K samples each; in each round, K-1 folds are used for training and the remaining fold for testing. The advantages are that the data is used more efficiently and that averaging over the K rounds gives a more reliable accuracy estimate. Examples are as follows:

# load the iris dataset
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target
from sklearn.model_selection import train_test_split
# split the data into two equal halves (50% each), equivalent to two folds
X_fold1, X_fold2, y_fold1, y_fold2 = train_test_split(X, y, random_state=37, train_size=0.5)
# OpenCV version
import cv2
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)  # K=1
knn.train(X_fold1, cv2.ml.ROW_SAMPLE, y_fold1)  # train on fold 1
_, y_hat_fold2 = knn.predict(X_fold2)           # predict on fold 2
knn.train(X_fold2, cv2.ml.ROW_SAMPLE, y_fold2)  # train on fold 2
_, y_hat_fold1 = knn.predict(X_fold1)           # predict on fold 1
from sklearn.metrics import accuracy_score
accuracy_score(y_fold1, y_hat_fold1)  # evaluate fold 1
accuracy_score(y_fold2, y_hat_fold2)  # evaluate fold 2
# sklearn version
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)  # cv sets the number of folds (5 here); no manual split is needed, cross_val_score splits automatically
scores.mean(), scores.std()  # mean and standard deviation of the scores
# Leave-one-out cross-validation, a special case of cross-validation with K=N:
# in each of the N iterations a single data point is held out for testing
from sklearn.model_selection import LeaveOneOut
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores.mean(), scores.std()

3. Bootstrapping validation : Draw N samples from the dataset with replacement to form a bootstrap training set, and test on the samples that were never drawn (the out-of-bag samples); this is used to evaluate the robustness of the model. Examples are as follows:

# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
import numpy as np
import cv2
from sklearn.metrics import accuracy_score
X = iris.data.astype(np.float32)
y = iris.target
idx_boot = np.random.choice(len(X), size=len(X), replace=True)  # draw N samples with replacement
X_boot = X[idx_boot, :]
y_boot = y[idx_boot]
idx_oob = np.array([x not in idx_boot for x in np.arange(len(X))], dtype=bool)  # out-of-bag samples
X_oob = X[idx_oob, :]
y_oob = y[idx_oob]
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)
_, y_hat = knn.predict(X_oob)
accuracy_score(y_oob, y_hat)

# generator that repeats the train/predict/score cycle n_iter times
def yield_bootstrap(model, X, y, n_iter=10000):
    for _ in range(n_iter):
        # train the classifier on the bootstrap sample
        idx_boot = np.random.choice(len(X), size=len(X), replace=True)
        X_boot = X[idx_boot, :]
        y_boot = y[idx_boot]
        model.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)
        # test the classifier on the out-of-bag examples
        idx_oob = np.array([x not in idx_boot for x in np.arange(len(X))], dtype=bool)
        X_oob = X[idx_oob, :]
        y_oob = y[idx_oob]
        _, y_hat = model.predict(X_oob)
        # yield the out-of-bag accuracy
        yield accuracy_score(y_oob, y_hat)

acc = list(yield_bootstrap(knn, X, y, n_iter=10))
print(acc)
acc = list(yield_bootstrap(knn, X, y, n_iter=1000))  # 1000 iterations
np.mean(acc), np.std(acc)

4. T-test

The t-test determines whether two data samples come from the same underlying distribution, that is, whether they have the same mean or expected value. Here it is used to check whether the cross-validation scores of two models (kNN with k=1 versus k=3) differ significantly; a large p-value means the observed difference is not statistically significant. Examples are as follows:

# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
import numpy as np
X = iris.data.astype(np.float32)
y = iris.target
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
k1 = KNeighborsClassifier(n_neighbors=1)
scores_k1 = cross_val_score(k1, X, y, cv=10)
np.mean(scores_k1), np.std(scores_k1)
k3 = KNeighborsClassifier(n_neighbors=3)
scores_k3 = cross_val_score(k3, X, y, cv=10)
np.mean(scores_k3), np.std(scores_k3)
from scipy.stats import ttest_ind
ttest_ind(scores_k1, scores_k3)  # returns the t statistic and the p-value

3. Model hyperparameter selection and tuning

Model hyperparameter tuning is generally done with grid search. Grid search is essentially a set of nested for loops, typically one loop per hyperparameter, trying every combination of candidate values. Examples are as follows:

# load the iris dataset
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)
best_acc = 0
best_k = 0
import cv2
from sklearn.metrics import accuracy_score
for k in range(1, 20):
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
    _, y_test_hat = knn.predict(X_test)
    acc = accuracy_score(y_test, y_test_hat)
    if acc > best_acc:
        best_acc = acc
        best_k = k
print(best_acc, best_k)
# Splitting into training, validation, and test sets: during a grid search, if the data is only split into
# a training set and a test set, and the test set is used both to evaluate the model and to update the
# hyperparameters, information about the test set leaks into the model and the final evaluation becomes
# unreliable. The dataset is therefore split into three parts: the training set is used to fit the model,
# the validation set to select the best hyperparameters, and the test set to evaluate the final model.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=37)  # split into a train+validation set and a test set
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=37)  # further split train+validation into training and validation sets
best_acc = 0.0
best_k = 0
for k in range(1, 20):
    knn = cv2.ml.KNearest_create()
    knn.setDefaultK(k)
    knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train)    # train the model on the training set
    _, y_valid_hat = knn.predict(X_valid)             # predict on the validation set
    acc = accuracy_score(y_valid, y_valid_hat)        # evaluate accuracy on the validation set, keeping the best hyperparameter
    if acc >= best_acc:
        best_acc = acc
        best_k = k
print(best_acc, best_k)  # 1.0, 7

knn = cv2.ml.KNearest_create()
knn.setDefaultK(best_k)  # best_k=7, the best k found in the loop above
knn.train(X_trainval, cv2.ml.ROW_SAMPLE, y_trainval)  # retrain the model on the combined train+validation set
_, y_test_hat = knn.predict(X_test)
print(accuracy_score(y_test, y_test_hat))
print(best_k)
# Grid search combined with cross-validation: sklearn's GridSearchCV class adds a cross-validation mechanism to grid search
param_grid = {'n_neighbors': range(1, 20)}  # search n_neighbors over 1..19; other parameters can be added the same way
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_trainval, y_trainval)  # fit on the train+validation set; the cross-validated splits are made internally
print(grid_search.best_score_, grid_search.best_params_)  # best cross-validated score and best hyperparameter value
print(grid_search.score(X_test, y_test))  # evaluation on the test set

4. Using Pipeline to chain the steps of machine learning

A machine learning workflow involves many steps, such as data preprocessing, training, prediction, and evaluation. sklearn's Pipeline class itself has fit, predict, and score methods, so it can chain preprocessing steps and a classifier together, much like an assembly line. Examples are as follows:

# load the breast cancer dataset
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data.astype(np.float32)
y = cancer.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37)
# use an SVM classifier
from sklearn.svm import SVC
svm = SVC()
#svm.fit(X_train, y_train)     # fitting without scaling, kept for comparison
#svm.score(X_test, y_test)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])  # scale the features, then fit the SVM
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

# combined with sklearn's grid search; the step__parameter naming addresses the parameters of the "svm" step
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_params_)
print(grid.score(X_test, y_test))

5. What are the main evaluation metrics of a model?

1. Classification model

1) Accuracy: the metric the previous examples focused on; it is the proportion of samples in the test set that are predicted correctly.

2) Precision: indicates the model's ability not to label negative samples as positive.

3) Recall: indicates the model's ability to find all the positive samples. There is also the F1 score = 2 * (precision * recall) / (precision + recall), the harmonic mean of the two. A short sketch computing these metrics follows this list.
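As a minimal added sketch of these classification metrics (the labels below are hypothetical, chosen only for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# hypothetical true and predicted labels for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred))   # proportion of correct predictions
print(precision_score(y_true, y_pred))  # of the samples predicted positive, how many really are
print(recall_score(y_true, y_pred))     # of the truly positive samples, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall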

2. Regression model

1) Mean squared error: the average of the squared differences between predicted and true values; it is a metric often used with regression algorithms.

2) Explained variance: measures how much of the dispersion (variance) of the targets the predictions account for; see sklearn.metrics.explained_variance_score.

3) R2 score: the coefficient of determination; see sklearn.metrics.r2_score for details. It is also a metric often used with regression algorithms. A short sketch computing these regression metrics follows this list.
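A minimal added sketch of these regression metrics, again with hypothetical values:

from sklearn.metrics import mean_squared_error, explained_variance_score, r2_score
# hypothetical true and predicted values for a regression problem
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_squared_error(y_true, y_pred))        # average squared error
print(explained_variance_score(y_true, y_pred))  # share of target variance explained
print(r2_score(y_true, y_pred))                  # coefficient of determination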
