ML model selection

In machine learning, an algorithm learns the parameters of a model by minimizing the value of a loss function. In addition, many algorithms (such as support vector machines and random forests) also have hyperparameters, which must be set outside the learning process.

The process of selecting the best learning algorithm and the best hyperparameters is called model selection.

1. Exhaustive search to select the best model

Select the best model by searching over a range of candidate hyperparameter values.

Use scikit-learn's GridSearchCV:

import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a logistic regression object
logistic = linear_model.LogisticRegression()
# Create candidate values for the regularization penalty
# (note: in recent scikit-learn versions the 'l1' penalty requires a solver
# such as 'liblinear' or 'saga')
penalty = ["l1", "l2"]
# Create candidate values for the regularization strength C
C = np.logspace(0, 4, 10)
# Create a dictionary of candidate hyperparameters
hyperparameters = dict(C=C, penalty=penalty)
# Create the grid search object
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)
# Fit the grid search
best_model = gridsearch.fit(features, target)
# View the best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('C:', best_model.best_estimator_.get_params()['C'])
--->
Best Penalty: l2
C: 7.742636826811269
# Predict the target vector
best_model.predict(features)
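
GridSearchCV also records the mean cross-validated score achieved by the best hyperparameter combination. A minimal check, reusing the best_model object fitted above:

# View the cross-validated score of the best hyperparameter combination
print('Best CV score:', best_model.best_score_)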

2. Random search to select the best model

Using scikit-learn's RandomizedSearchCV to select the model consumes fewer computational resources than an exhaustive search.

from scipy.stats import uniform
from sklearn import linear_model, datasets
from sklearn.model_selection import RandomizedSearchCV

# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a logistic regression object
logistic = linear_model.LogisticRegression()
# Create candidate values for the regularization penalty
penalty = ['l1', 'l2']
# Create a distribution of candidate values for the regularization strength C
C = uniform(loc=0, scale=4)
# Create a dictionary of candidate hyperparameters
hyperparameters = dict(C=C, penalty=penalty)
# Create the randomized search object
randomizedsearch = RandomizedSearchCV(logistic, hyperparameters, random_state=1,
                                      n_iter=100, cv=5, verbose=0, n_jobs=-1)
# Fit the randomized search
best_model = randomizedsearch.fit(features, target)

RandomizedSearchCV is often more efficient than GridSearchCV: instead of trying every combination, it samples a specified number of random hyperparameter combinations from the distributions (e.g., normal, uniform) or lists of values provided by the user.

When using RandomizedSearchCV, if you specify a continuous distribution, scikit-learn samples hyperparameter values from it with replacement; if all candidates are given as lists of values, the combinations are sampled without replacement.
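
As a quick illustration of what such a distribution supplies, you can draw a few values from it directly. A small sketch, assuming the same uniform distribution on [0, 4] defined above:

from scipy.stats import uniform
# Draw five candidate values for C from the uniform distribution on [0, 4]
uniform(loc=0, scale=4).rvs(5, random_state=1)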

# View the best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('C:', best_model.best_estimator_.get_params()['C'])
--->
Best Penalty: l2
C: 3.730229437354635
# Predict the target vector
best_model.predict(features)

The number of hyperparameter combinations that are sampled (i.e., the number of candidate models) is set by the n_iter parameter (the number of iterations).

3. Choose the best model from a variety of learning algorithms

Create a search space of candidate learning algorithms and their hyperparameters, and select the best model by searching over the learning algorithms and their hyperparameters:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Set the random seed
np.random.seed(0)
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create the pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])
# Create the search space of candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression()],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]
# Create the GridSearchCV object
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit the grid search
best_model = gridsearch.fit(features, target)

Scikit-learn allows the learning algorithm itself to be part of the search space. In the solution above, the search space contains two learning algorithms: logistic regression and random forest. Each algorithm has its own hyperparameters, and candidate values are defined with the classifier__[hyperparameter_name] format.
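
If you are not sure which parameter names are valid in such a search space, the pipeline can list them itself. A small check, assuming the pipe object created above:

# List the parameter names the pipeline accepts, e.g. 'classifier__C'
sorted(pipe.get_params().keys())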

After the search is complete, use best_estimator_ to view the model and its hyperparameters:

best_model.best_estimator_.get_params()
--->
{'C': 7.742636826811269,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

4. Add data preprocessing to the model selection process

Create a pipeline containing data preprocessing steps and their parameters, and incorporate data preprocessing steps into the model selection process:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set the random seed
np.random.seed(0)
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a preprocessing object that combines StandardScaler and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])
# Create a pipeline
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression())])
# Create the search space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
                 "classifier__penalty": ["l1", "l2"],
                 "classifier__C": np.logspace(0, 4, 10)}]
# Create the grid search object
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)
# Fit the model
best_model = clf.fit(features, target)

FeatureUnion can combine multiple preprocessing operations. In the solution above, FeatureUnion combines two preprocessing steps, feature standardization (StandardScaler) and principal component analysis (PCA), into a single object called preprocess that contains both of them. preprocess is then placed in a pipeline together with the learning algorithm. As a result, fitting, transforming, and training the model with the various hyperparameter combinations are all handled by scikit-learn.
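
To see what FeatureUnion produces, you can fit the preprocessing object on its own. A quick sketch, reusing the preprocess and features objects from above:

# With default PCA settings: 4 standardized columns + 4 principal components = 8 output columns
preprocess.fit_transform(features).shape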

After model selection is complete, check the preprocessing parameters of the best model. In this solution, you can view the number of principal components:

best_model.best_estimator_.get_params()['preprocess__pca__n_components']
--->
2

5. Use parallelization to speed up model selection

Set n_jobs=-1 to use all CPU cores and speed up the process of model selection:

import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a logistic regression object
logistic = linear_model.LogisticRegression()
# Create candidate values for the regularization penalty
penalty = ["l1", "l2"]
# Create candidate values for the regularization strength C
C = np.logspace(0, 4, 10)
# Create a dictionary of candidate hyperparameters
hyperparameters = dict(C=C, penalty=penalty)
# Create the grid search object
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=0)
# Fit the grid search
best_model = gridsearch.fit(features, target)

Without going into too much technical detail, scikit-learn can train at most as many models simultaneously as the machine has CPU cores. The default value of n_jobs is 1, which means only one core is used.
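
To see the effect, you can fit the same search on a single core and compare the wall-clock time. A rough sketch reusing the objects defined above; the actual speedup depends on your machine:

# Grid search restricted to a single core, for comparison
clf_single_core = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=1, verbose=1)
best_model_single_core = clf_single_core.fit(features, target)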

6. Use algorithm-specific methods to accelerate model selection

If you are committed to a specific learning algorithm, you can use scikit-learn's model-specific cross-validation classes to speed up model selection. For example, use LogisticRegressionCV:

from sklearn import linear_model, datasets

# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a LogisticRegressionCV object
logistic = linear_model.LogisticRegressionCV(Cs=100)
# Train the model
logistic.fit(features, target)

--->

LogisticRegressionCV(Cs=100, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

Sometimes, exploiting the characteristics of a learning algorithm allows the best hyperparameters to be found much faster than brute-force or random search. In scikit-learn, many learning algorithms (such as ridge regression and elastic net regression) have dedicated cross-validation classes that take advantage of this to find the best hyperparameters. For example, LogisticRegression implements a standard logistic regression classifier, while LogisticRegressionCV implements an efficient cross-validated logistic regression classifier that can identify the best value of the hyperparameter C.
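
Other estimators follow the same pattern. For example, RidgeCV evaluates a list of regularization strengths with an efficient built-in cross-validation while fitting. A small sketch on the diabetes regression dataset, used here only for illustration:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
X, y = load_diabetes(return_X_y=True)
# RidgeCV selects the best alpha from the supplied candidates during fitting
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0])
ridge.fit(X, y)
print(ridge.alpha_)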

7. Performance evaluation after model selection

Problem description: Evaluate the performance of the model found through model selection

Use nested cross-validation to avoid evaluation bias:

import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a logistic regression object
logistic = linear_model.LogisticRegression()
# Create 10 candidate values for the hyperparameter C
C = np.logspace(0, 4, 10)
# Create a dictionary of candidate hyperparameters
hyperparameters = dict(C=C)
# Create the grid search object
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=0)
# Run the nested cross-validation and report the mean score
cross_val_score(gridsearch, features, target).mean()

--->

0.9800000000000001

In the general model selection methods (i.e., GridSearchCV and RandomizedSearchCV), cross-validation is used to evaluate which hyperparameters produce the best model. However, there is a problem: because that data has been used to select the best hyperparameters, it can no longer be used to evaluate the performance of the resulting model. The solution is to wrap the cross-validation used for the model search inside another cross-validation. In nested cross-validation, the "inner" cross-validation selects the best model, while the "outer" cross-validation provides an unbiased estimate of model performance. In this solution, the inner cross-validation is the GridSearchCV object, which is then wrapped in the outer cross-validation by cross_val_score.
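
The nesting is easier to see if both cross-validation loops are created explicitly. A sketch reusing the logistic, hyperparameters, features, and target objects from the code above:

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
# Inner loop: selects the best hyperparameters
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
# Outer loop: estimates the performance of the whole selection procedure
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
gridsearch = GridSearchCV(logistic, hyperparameters, cv=inner_cv, n_jobs=-1)
scores = cross_val_score(gridsearch, features, target, cv=outer_cv)
print(scores.mean())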
