Machine Learning (14): Advanced Hyperparameter Tuning: RandomizedSearchCV and HalvingSearchCV

The full text runs to more than 19,000 words, with an expected reading time of about 40 to 60 minutes. It is dense with practical material, so consider bookmarking it.


1. Hyperparameter Optimization and Theoretical Limits of Enumeration Grids

1.1 Hyperparameter optimization (HPO, HyperParameter Optimization)

Hyperparameter optimization (HPO, HyperParameter Optimization) is a key task in machine learning, and its main goal is to find the optimal model hyperparameter configuration to optimize the performance of the model on specific tasks.

In machine learning, there are two types of parameters:

  1. Model parameters: parameters the model learns during training, such as the slope and intercept in linear regression. They are updated by an optimization algorithm such as gradient descent.
  2. Hyperparameters: parameters that cannot be learned during training and must be set in advance, such as the learning rate or the number of training epochs. Their settings have a significant impact on model performance.

Hyperparameter optimization searches for the settings of these hyperparameters that maximize the model's performance on the validation set. In theory, given sufficient computing power and data, HPO should match or exceed manual tuning by humans. HPO also reduces manual workload, and its results are easier to reproduce than those of a manual search, so it can greatly improve the reproducibility and fairness of scientific research. At present, hyperparameter optimization algorithms can be mainly divided into:

[Figure not reproduced: categories of hyperparameter optimization algorithms]

1.2 Theoretical Limits of Enumerated Grids

The previous article, "Introduction to Hyperparameter Tuning: One Article to Understand Enumeration Grid Search", explains what hyperparameters are and how to optimize them with grid search.

Among all hyperparameter optimization algorithms, enumeration grid search is the most basic and classic. Before the search starts, the candidate values of each hyperparameter are listed by hand, and the values of the different hyperparameters are combined to form a parameter space. The enumeration grid search algorithm trains the model with every parameter combination in this space and selects the combination with the strongest generalization ability as the model's final hyperparameters.

For grid search, if some point in the parameter space corresponds to the true minimum of the loss function, then enumeration grid search is certain to capture that minimum and its parameters. Conversely, if no point in the parameter space corresponds to the true minimum, grid search cannot find the parameter combination that attains it.

The larger and denser the parameter space, the more likely it is that some combination in it covers the minimum of the loss function. In the extreme case where the parameter space exhausts all possible values, grid search is certain to find the optimal combination, and the generalization ability of that combination will be at least as strong as anything found by manual tuning.

However, the disadvantages of grid search are also obvious. When the parameter space is large, grid search needs a lot of time and computing resources, and as the number of parameters grows, the amount of computation rises exponentially. Take random forest as an example:

With only the parameter n_estimators and the candidate values [50, 100, 150, 200, 250, 300], 6 models must be built.
Adding the parameter max_depth with the candidate values [2,3,4,5,6] raises this to 30 models.
Adding the parameter min_samples_split with the candidate values [2,3,4,5] raises this to 120 models.

At the same time, the goal of parameter optimization is to find the combination with the strongest generalization ability of the model. Therefore, cross-validation is required to reflect the generalization ability of the model. Assuming that the number of cross-validations is 5, the three parameters need to be modeled 600 times.
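The counts above can be checked with a short sketch. It assumes scikit-learn's `ParameterGrid` helper, which enumerates the same Cartesian product that a grid search walks through; the variable names are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values from the random forest example above
param_grid = {
    "n_estimators": [50, 100, 150, 200, 250, 300],
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 3, 4, 5],
}

n_combinations = len(ParameterGrid(param_grid))  # size of the Cartesian product
n_folds = 5                                      # assumed 5-fold cross-validation
n_fits = n_combinations * n_folds                # one model fit per combination per fold

print(n_combinations, n_fits)  # 120 600
```

Adding one more parameter with k candidate values multiplies both numbers by k, which is where the exponential growth comes from.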

For artificial neural networks, fusion models, and ensemble models, which have many hyperparameters and possibly infinitely many hyperparameter values, the time required for grid search grows sharply as the data and the model become more complex. A single grid search can take days.

To solve this problem, researchers have proposed two strategies: random search and halving search. This article continues the theme of the previous one, introduces these two efficient search strategies in depth, and compares their performance.

If you are not clear about enumeration grid search, I suggest reading this article:
Introduction to Hyperparameter Tuning: One article to understand enumeration grid search

1.3 Practical operation: Looking at GridSearchCV from <Kaggle Competition Case: House Price Prediction>

Competition objectives and data description

The main goal of this competition is to predict the final price of a house, which is a typical regression problem because a continuous output (house price) needs to be predicted.

The dataset contains 79 explanatory variables describing nearly every aspect of homes in Ames, Iowa, including quality, condition, square footage, number of garages, basement condition, and more. These variables can all be used to predict the final selling price of a home.

The dataset is divided into training set and test set. The training set is used to build and train the model, and the test set is used to evaluate the performance of the model. The training set contains the characteristics of the houses and the corresponding sales prices, while the test set only contains the characteristics of the houses, and the contestants need to predict the sales prices of these houses.

The evaluation metric for this competition is the root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price. That is, the difference between the two logarithms is squared, averaged over all houses, and the square root of the average is taken. Using logarithms instead of raw prices means that errors on expensive houses and errors on cheap houses affect the score equally.
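As a rough illustration of this metric, here is a minimal sketch. The function name `rmse_of_logs` and the prices are made up for the example and are not part of the competition code.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); all prices must be positive."""
    log_diff = np.log(np.asarray(y_pred, dtype=float)) - np.log(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up prices: each prediction is 10% too high
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])

# Both errors are log(1.1), even though the raw errors differ by a factor of 10
print(cheap, expensive)
```

The same relative error gives the same score regardless of the price level of the house.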

Build benchmark: use random forest to do enumeration grid search

Step 1: Import the basic library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, KFold, GridSearchCV

import time

Step 2: Import the data set (the training set that has completed the basic processing can be used directly)

data = pd.read_csv("../datasets/House Price/train_encode.csv")

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

data


Step 3: Basic Data Exploration

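# Basic information about the dataset (column names, dtypes, non-null counts);
# info() prints its report directly and returns None, so no print() is needed
data.info()

# Summary statistics
print(data.describe())

# Check for missing values
print(data.isnull().sum())

# Inspect the target column (for supervised learning)
print(data['SalePrice'].describe())

# Look at the distribution, e.g. with a histogram or box plot (needs matplotlib or seaborn)
plt.hist(data['SalePrice'])
plt.show()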

Step 4: Construct parameter space

This parameter grid is used to do a brute force search on the input random forest parameters, perform model training on every possible parameter combination, and then select the optimal set of parameters according to predetermined scoring criteria.

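# Parameter space
param_grid_simple = {"criterion": ["squared_error","poisson"]
                     , 'n_estimators': [*range(20,100,5)]
                     , 'max_depth': [*range(10,25,2)]
                     # 1.0 (all features) replaces "auto", which scikit-learn 1.3 removed
                     , "max_features": ["log2","sqrt",16,32,64,1.0]
                     # note: np.arange(0,5,10) yields the single value 0
                     , "min_impurity_decrease": [*np.arange(0,5,10)]
                    }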

Step 5: Instantiate the model and build the grid search

Set up the model and the grid search. The scoring method is "neg_mean_squared_error", the negative mean squared error. The sign is flipped because sklearn always selects the model with the highest score, so with the negative MSE a smaller error gives a higher score.

model_rf = RandomForestRegressor(random_state=24, verbose=True,)
cv = KFold(n_splits=5, shuffle=True, random_state=24)
search = GridSearchCV(estimator=model_rf,
                     param_grid=param_grid_simple,
                     scoring = "neg_mean_squared_error",
                     verbose = True,
                     cv = cv,
                     n_jobs=-1)

Step 6: Train and calculate time

Train the model and print the elapsed time in "minutes + seconds" format. Note that the divmod() function returns two values: the quotient (minutes) and the remainder (seconds).

start = time.time()
search.fit(X, y)
end = time.time()

elapsed_time = end - start # elapsed time in seconds
minutes, seconds = divmod(elapsed_time, 60) # split into minutes and seconds
print(f"Elapsed time: {int(minutes)} minutes {int(seconds)} seconds")


In the process above, GridSearchCV with five-fold cross-validation searched 1536 different parameter combinations, so a total of 7680 models were trained and evaluated. This took 3 minutes and 2 seconds. Finally, the model was refit on all the data with the best parameters found, which took 0.1 seconds.
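The figures 1536 and 7680 can be reproduced without training anything. The sketch below assumes scikit-learn's `ParameterGrid` helper; the last `max_features` entry is written as `1.0`, an assumption made for recent scikit-learn versions that no longer accept the string "auto" (the count is the same either way).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid_simple = {
    "criterion": ["squared_error", "poisson"],           # 2 values
    "n_estimators": [*range(20, 100, 5)],                # 16 values
    "max_depth": [*range(10, 25, 2)],                    # 8 values
    "max_features": ["log2", "sqrt", 16, 32, 64, 1.0],   # 6 values
    "min_impurity_decrease": [*np.arange(0, 5, 10)],     # 1 value: 0
}

n_combinations = len(ParameterGrid(param_grid_simple))  # 2 * 16 * 8 * 6 * 1
n_cv_fits = n_combinations * 5                          # five-fold cross-validation

print(n_combinations, n_cv_fits)  # 1536 7680
```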

Step 7: Rebuild the model according to the optimal parameters and evaluate the performance

The following code retrieves the best model found by GridSearchCV and prints the RMSE of its best cross-validation score. It then cross-validates this best model again and prints the RMSE on the training and test folds to evaluate the model's performance.

from sklearn.metrics import mean_squared_error
import warnings

# Get the best model
best_estimator = search.best_estimator_

# Print the best model
print("Best estimator:")
print(best_estimator)

# Get the best score from GridSearchCV (note: this is the negative MSE)
best_score = search.best_score_

# Convert the negative MSE to RMSE
rmse = np.sqrt(-best_score)

# Print the RMSE
print(f"RMSE of the best estimator found by GridSearchCV: {rmse:.4f}")

# Cross-validate the best model, also returning the training scores
from sklearn.model_selection import cross_validate
scores = cross_validate(best_estimator, X, y, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

# Compute the training and test RMSE
train_rmse = np.sqrt(-scores['train_score'].mean())
test_rmse = np.sqrt(-scores['test_score'].mean())

# Print the training and test RMSE
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")

One thing to note: the RMSE is computed as np.sqrt(-score). GridSearchCV and cross_validate report the "neg_mean_squared_error" score as a negative value, because in these functions a higher score means better performance, whereas for MSE a lower value is better. Therefore, to compute the RMSE, first negate the score and then take the square root.
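A tiny numeric sketch of this conversion, using made-up fold scores rather than results from the house price data:

```python
import numpy as np

# Hypothetical per-fold scores, as returned under scoring="neg_mean_squared_error"
neg_mse_scores = np.array([-4.0, -9.0, -16.0, -9.0, -12.0])

mean_mse = -neg_mse_scores.mean()   # flip the sign back to a positive MSE
rmse = np.sqrt(mean_mse)            # root of the mean MSE, as in the code above

# Averaging the per-fold RMSE values is a different (never larger) quantity
mean_of_fold_rmse = np.sqrt(-neg_mse_scores).mean()

print(mean_mse, rmse, mean_of_fold_rmse)
```

Taking the square root of a negative score directly would produce `nan`, which is why the sign flip has to come first.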


Step 8: Conclusion output

[Table image not reproduced: summary of the GridSearchCV results]

Explanation: GridSearchCV is based on cross-validation (CV). For each set of parameters it runs several validations (usually 5 or 10) and averages the scores, which reduces the likelihood of overfitting to some extent. The best score obtained (RMSE 27984) is therefore the mean over all cross-validation folds.

When the model is then rebuilt with these optimal parameters and cross-validated again, the results may differ, because the data can be split into training and test folds differently each time. Here the search used a shuffled KFold, while cross_validate was called with cv=5, which does not shuffle. So the RMSE obtained at this step (29731) can differ from the search score.

This is very common: the performance of a machine learning model is affected by many factors, including the distribution of the data, the complexity of the model, and the way the training and test data are split. Even the same model may give different results on different data splits.

2. Overview of RandomizedSearchCV

2.1 Basic concept of RandomizedSearchCV

As mentioned in the previous sections, traditional grid search (GridSearchCV) searches a preset parameter space for the optimal parameters. It is guaranteed to find the best solution within that space, but as the parameter space grows, the computing time and resources required grow exponentially. For example, with 10 parameters and 5 candidate values each, there are $5^{10} = 9,765,625$ parameter combinations to try, which is very time consuming.

Looking carefully at the enumeration grid search process above, two factors determine its running speed:

1. The size of the parameter space: the larger the parameter space, the more models must be built

2. The amount of data: the more data, the more computing power and time each model fit requires

Accordingly, the optimized grid search methods in sklearn fall into two categories: one adjusts the search space, the other adjusts the data used for each training run. Adjusting the search space means giving up the requirement to search the entire global hyperparameter space: instead, some parameter combinations are selected to form a hyperparameter subspace, and only that subspace is searched.

This is how RandomizedSearchCV came about. Unlike GridSearchCV, RandomizedSearchCV does not try every possible parameter combination; it randomly samples some of the combinations in the parameter space and tries only those. This greatly reduces search time and computing resources, and in many cases the results of RandomizedSearchCV are not inferior to those of GridSearchCV.

Look at the code:

# Hypothetical candidate values
n_estimators = np.array([50,100,150,200,250,300])
max_depth = np.array([2,3,4,5,6])

# Build the full grid of parameter combinations
param_grid = np.array(np.meshgrid(n_estimators, max_depth)).T.reshape(-1,2)

# Randomly pick some of the combinations to mimic random search
np.random.seed(0)
param_grid_random = param_grid[np.random.choice(param_grid.shape[0], size=8, replace=False), :]

# Create the subplots
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Left panel: grid search
ax[0].scatter(param_grid[:, 0], param_grid[:, 1], color='blue')
ax[0].set_title('Grid Search')
ax[0].set_xlabel('n_estimators')
ax[0].set_ylabel('max_depth')

# Right panel: random search
ax[1].scatter(param_grid[:, 0], param_grid[:, 1], color='blue', alpha=0.3)  # all parameter combinations
ax[1].scatter(param_grid_random[:, 0], param_grid_random[:, 1], color='red')  # the randomly selected combinations
ax[1].set_title('Randomized Search')
ax[1].set_xlabel('n_estimators')
ax[1].set_ylabel('max_depth')

plt.tight_layout()
plt.show()

Take the two-dimensional space plotted by the code above as an example. In the parameter space formed by n_estimators and max_depth, n_estimators takes the values [50,100,150,200,250,300] and max_depth takes the values [2,3,4,5,6]. Enumeration grid search must search all 30 parameter combinations. When the search space is adjusted, only the highlighted combinations (red in the code above) are sampled as a "subspace", and only they are searched. This greatly reduces the total amount of computation: 30 models were needed originally, but now only 8.

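[Figure: "Grid Search" (left) and "Randomized Search" (right), with n_estimators on the x-axis and max_depth on the y-axis]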

2.2 Working principle of RandomizedSearchCV

In sklearn, the method that randomly draws a parameter subspace and searches within it is called RandomizedSearchCV. Because the search space shrinks, fewer parameter combinations need to be enumerated and compared, and the overall search time falls accordingly. Therefore:

With the same global space, random search is much faster than enumeration grid search.

With the same number of training runs, random search can cover a much larger space than enumeration grid search.

Better still, the minimum loss found by random grid search is usually very close to the minimum loss found by enumeration grid search.

In other words, computation speed improves without much loss of search accuracy.

The working principle of RandomizedSearchCV is simple. Given a preset parameter space and a preset number of attempts, it randomly selects some parameter combinations from the space for training and validation, and returns the best-performing combination among those tried. Note, however, that random grid search does not first sample the whole subspace and then search it. It works like a loop: one set of parameters is randomly drawn and modeled in this iteration, and another set is randomly drawn and modeled in the next. When every parameter is given as a list of values, this sampling is done without replacement, so the same set of parameters is never drawn twice. The number of iterations controls the overall size of the sampled subspace. This approach is often described as giving the random grid search a fixed computation budget: when the budget is used up, the search stops.

This approach rests on the assumption that not all parameters matter equally to model performance: changes in a few important parameters affect performance more than changes in the others. With random sampling, there is a good chance of finding near-optimal values of these important parameters, and therefore a model that performs well.
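The sampling step can be looked at in isolation. The sketch below assumes scikit-learn's `ParameterSampler`, which generates random parameter settings from the same kind of dictionary; the two-parameter grid is the illustrative one from Section 2.1.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "n_estimators": [50, 100, 150, 200, 250, 300],
    "max_depth": [2, 3, 4, 5, 6],
}

# Draw 8 of the 30 combinations; with list-only inputs no combination repeats
sampled = list(ParameterSampler(param_grid, n_iter=8, random_state=0))
unique = {(p["n_estimators"], p["max_depth"]) for p in sampled}

print(len(ParameterGrid(param_grid)), len(sampled), len(unique))
for params in sampled:
    print(params)
```

Changing `n_iter` is the only thing needed to grow or shrink the subspace, which is the "fixed computation budget" described above.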

2.3 Interpretation of RandomizedSearchCV parameters in Sklearn

Let's first look at the parameters of RandomizedSearchCV in Sklearn:

[Images not reproduced: the RandomizedSearchCV parameter listing and its explanation]
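As a stand-in for the parameter listing, here is a minimal sketch that names the most commonly used arguments of `RandomizedSearchCV` in comments. The synthetic data, the tiny parameter space, and the `_demo` variable names are assumptions chosen to keep the example fast; they are not the settings used elsewhere in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Small synthetic regression problem (a stand-in, not the house price data)
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),      # model to tune
    param_distributions={                                 # lists or scipy.stats distributions
        "n_estimators": [10, 20, 30],
        "max_depth": [2, 4, 6],
    },
    n_iter=4,                            # number of parameter combinations to sample
    scoring="neg_mean_squared_error",    # evaluation metric; higher is better
    cv=KFold(n_splits=3, shuffle=True, random_state=0),   # cross-validation scheme
    refit=True,                          # retrain the best combination on all the data
    random_state=0,                      # makes the sampling reproducible
    n_jobs=None,                         # parallel jobs; None normally means a single job
    verbose=0,                           # amount of logging
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(len(search_demo.cv_results_["params"]))  # number of sampled combinations
```

Compared with GridSearchCV, the two arguments that matter most are `n_iter`, which sets the size of the subspace, and `random_state`, which fixes which combinations are drawn.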

The fundamental reasons why random grid search works are:

The sampled subspace reflects the distribution of the global space to some extent, and the larger the subspace (the more parameter combinations it contains), the closer its distribution is to that of the global space.

When the global space itself is sufficiently dense, even a small subspace can have a distribution similar to that of the global space.

If the global space contains the theoretical minimum of the loss function, then a subspace whose distribution closely resembles the global space is likely to contain that minimum too, or a series of near-minimal values very close to it.

Therefore, as long as the subspace is large enough, the results of random grid search will be very similar to those of enumeration grid search. With a fixed global parameter space, random grid search trades off efficiency against accuracy: the larger the subspace, the higher the accuracy; the smaller the subspace, the higher the efficiency.
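This trade-off can be illustrated with a toy simulation. The losses below are random numbers standing in for the loss of each parameter combination; they are an assumption for illustration, not results from the house price data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loss for each of 1536 parameter combinations (random, not real results)
global_losses = rng.normal(loc=30000.0, scale=2000.0, size=1536)
global_min = global_losses.min()

# One random visiting order; a subspace of size k is the first k entries of that order
order = rng.permutation(global_losses.size)
subspace_sizes = [50, 200, 800, 1536]
best_in_subspace = [global_losses[order[:k]].min() for k in subspace_sizes]

for k, best in zip(subspace_sizes, best_in_subspace):
    print(f"k={k:4d}  best loss={best:9.1f}  gap to global minimum={best - global_min:8.1f}")
```

The gap to the global minimum can only shrink as the subspace grows, and it reaches zero once the subspace covers the whole space, while the number of model fits grows in proportion to k.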

2.4 Practical operation: Looking at RandomizedSearchCV from <Kaggle Competition Case: House Price Prediction>

To see the difference between RandomizedSearchCV and GridSearchCV concretely, take Kaggle's house price prediction as an example, again with a random forest model. Random forest has many tunable parameters, such as n_estimators (number of trees), max_features (maximum number of features considered at each split), and max_depth (maximum depth of each tree). If 5 candidate values are set for each of these parameters, GridSearchCV has to train and validate 5^3 = 125 combinations. With RandomizedSearchCV and the number of attempts set to 30, only 30 combinations are trained and validated, which greatly reduces the computation while still possibly giving a result similar to GridSearchCV.

We use the same data as in Section 1.3.

Step 1: Import the basic library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, KFold, GridSearchCV, RandomizedSearchCV

import time

Step 2: Import the data set (the training set that has completed the basic processing can be used directly)

data = pd.read_csv("../datasets/House Price/train_encode.csv")

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

data


Step 3: Construct the same global parameter space

# Parameter space: identical to the one used for grid search, for a fair comparison
param_grid_simple = {"criterion": ["squared_error","poisson"]
                     , 'n_estimators': [*range(20,100,5)]
                     , 'max_depth': [*range(10,25,2)]
                     # 1.0 (all features) replaces "auto", which scikit-learn 1.3 removed
                     , "max_features": ["log2","sqrt",16,32,64,1.0]
                     # note: np.arange(0,5,10) yields the single value 0
                     , "min_impurity_decrease": [*np.arange(0,5,10)]
                    }

Step 4: Instantiate the model and build the random grid search, with a subspace size of only 800 (the grid search covered 1536 combinations)

model_rf1 = RandomForestRegressor(random_state=24, verbose=True,)
cv = KFold(n_splits=5, shuffle=True, random_state=24)

# Define the random search
search1 = RandomizedSearchCV(estimator=model_rf1
                            ,param_distributions=param_grid_simple
                            ,n_iter = 800 # the subspace is about half the size of the global space
                            ,scoring = "neg_mean_squared_error"
                            ,verbose = True
                            ,cv = cv
                            ,random_state=24
                            ,n_jobs=-1
                           )

Step 5: Train and calculate time

Train the model and print the elapsed time in "minutes + seconds" format. Note that the divmod() function returns two values: the quotient (minutes) and the remainder (seconds).

start = time.time()
search1.fit(X, y)
end = time.time()

elapsed_time = end - start # elapsed time in seconds
minutes, seconds = divmod(elapsed_time, 60) # split into minutes and seconds
print(f"Elapsed time: {int(minutes)} minutes {int(seconds)} seconds")


In the process above, RandomizedSearchCV with five-fold cross-validation evaluated 800 sampled parameter combinations, so a total of 4000 models were trained and evaluated. This took 1 minute and 33 seconds. Finally, the model was refit on all the data with the best parameters found, which took 0.1 seconds.

Step 6: Rebuild the model according to the optimal parameters and evaluate the performance

The following code retrieves the best model found by RandomizedSearchCV and prints the RMSE of its best cross-validation score. It then cross-validates this best model again and prints the RMSE on the training and test folds to evaluate the model's performance.

from sklearn.metrics import mean_squared_error
import warnings

# Get the best model
best_estimator = search1.best_estimator_

# Print the best model
print("Best estimator:")
print(best_estimator)

# Get the best score from RandomizedSearchCV (note: this is the negative MSE)
best_score = search1.best_score_

# Convert the negative MSE to RMSE
rmse = np.sqrt(-best_score)

# Print the RMSE
print(f"RMSE of the best estimator found by RandomizedSearchCV: {rmse:.4f}")

# Cross-validate the best model, also returning the training scores
scores = cross_validate(best_estimator, X, y, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

# Compute the training and test RMSE
train_rmse = np.sqrt(-scores['train_score'].mean())
test_rmse = np.sqrt(-scores['test_score'].mean())

# Print the training and test RMSE
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")


Step 7: Conclusion output

[Table image not reproduced: comparison of the GridSearchCV and RandomizedSearchCV results]

Overall, random grid search finds a model about as good as the one from enumeration grid search in considerably less time, which shows its advantage when searching large parameter spaces. For small parameter spaces, however, enumeration grid search may be more accurate. In addition, the current best model may be overfitting somewhat, so it may be necessary to tune the parameters further or to apply strategies that limit overfitting, such as increasing the regularization strength of the model or using a simpler model structure.

3. Excellent features of RandomizedSearchCV

3.1 The larger the subspace, the higher the accuracy, and the smaller the subspace, the higher the efficiency

Section 2.3 concluded that, as long as the subspace is large enough, the results of random grid search will be very similar to those of enumeration grid search, and that with a fixed global parameter space, random grid search trades off efficiency against accuracy.

To verify this, define a larger search space and run the random grid search again. The rest of the code stays the same; only Step 3 (constructing the global parameter space) changes:

# Parameter space: make the overall space denser
param_grid_simple = {'n_estimators': [*range(80,100,1)]
                     , 'max_depth': [*range(10,25,1)]
                     , "max_features": [*range(10,20,1)]
                     , "min_impurity_decrease": [*np.arange(0,5,10)]
                    }

model_rf1 = RandomForestRegressor(random_state=24, verbose=True,)
cv = KFold(n_splits=5, shuffle=True, random_state=24)

# Define the random search
search1 = RandomizedSearchCV(estimator=model_rf1
                            ,param_distributions=param_grid_simple
                            ,n_iter = 1536 # same number of combinations as the enumeration grid search
                            ,scoring = "neg_mean_squared_error"
                            ,verbose = True
                            ,cv = cv
                            ,random_state=24
                            ,n_jobs=-1
                           )

start = time.time()
search1.fit(X, y)
end = time.time()

elapsed_time = end - start # elapsed time in seconds
minutes, seconds = divmod(elapsed_time, 60) # split into minutes and seconds
print(f"Elapsed time: {int(minutes)} minutes {int(seconds)} seconds")

Look at the execution results and compare again:

[Images not reproduced: execution output for the denser space, and a comparison table with the earlier searches]

When the global parameter space is enlarged, random grid search can use similar or less time than grid search on a small space to explore a denser/larger space and obtain better results.

3.2 Accept continuous parameter spaces

In grid search, the points in the parameter space are fixed in advance and typically evenly spaced, because grid search cannot draw values from a "distribution"; it can only combine explicitly listed values. Random search, by contrast, accepts a distribution as input.

If the lowest point of the loss function lies between two grid points, enumeration grid search cannot find it. Random grid search, however, draws its parameter points from a distribution, so within the same parameter space it has a chance of landing on a better value.
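Before the full experiment, here is a minimal sketch of what sampling from a distribution looks like. It assumes scikit-learn's `ParameterSampler` and `scipy.stats.uniform`, whose two arguments are `loc` and `scale`, so `uniform(0, 50)` covers the interval [0, 50]; the small `max_depth` list is illustrative.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "max_depth": [10, 12, 14],                                 # discrete list of candidates
    "min_impurity_decrease": stats.uniform(loc=0, scale=50),   # continuous on [0, 50]
}

# Each draw picks one list entry and one real number from the distribution
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

The sampled `min_impurity_decrease` values are arbitrary real numbers in the interval, which a grid with a fixed step could only hit by coincidence. When any parameter is a distribution, scikit-learn samples with replacement.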

Look at the code:

from scipy import stats # use scipy to build the distribution

param_grid_simple = {'n_estimators': [*range(80,100,1)]
                     , 'max_depth': [*range(10,25,1)]
                     , "max_features": [*range(10,20,1)]
                     , "min_impurity_decrease": stats.uniform(0,50) # uniform on [0, 50]
                    }

model_rf1 = RandomForestRegressor(random_state=24, verbose=True,)
cv = KFold(n_splits=5, shuffle=True, random_state=24)

# Define the random search
search1 = RandomizedSearchCV(estimator=model_rf1
                            ,param_distributions=param_grid_simple
                            ,n_iter = 1536 # same number of combinations as the enumeration grid search
                            ,scoring = "neg_mean_squared_error"
                            ,verbose = True
                            ,cv = cv
                            ,random_state=24
                            ,n_jobs=-1
                           )

start = time.time()
search1.fit(X, y)
end = time.time()

elapsed_time = end - start # elapsed time in seconds
minutes, seconds = divmod(elapsed_time, 60) # split into minutes and seconds
print(f"Elapsed time: {int(minutes)} minutes {int(seconds)} seconds")

Look at the results and compare again:

[Images not reproduced: execution output for the continuous parameter space, and a comparison table with the earlier searches]

In theory, when the global parameter space used by enumeration grid search is large enough and dense enough, the optimum found by enumeration grid search is an upper bound for random grid search over the same space, so random grid search cannot give better results than enumeration grid search.

In practice, however, enumeration grid search is so slow that its global parameter space can rarely be made very large or very dense, so its results are hard to bring close to the theoretical optimum. Random grid search can afford a larger and denser space, captures the distribution of a wider region, and therefore has a better chance of reaching the theoretical optimum.

4. Overview of HalvingSearchCV

4.1 Basic concept of HalvingSearchCV

To address the slowness of enumeration grid search, sklearn offers two kinds of optimization: one adjusts the search space, the other adjusts the data used for each training run. Adjusting the search space gives random grid search; adjusting the training data gives halving grid search.

HalvingSearchCV first uses a small part of the data to quickly evaluate all parameter combinations, then keeps only the best-performing combinations for further evaluation with more data. Poorly performing combinations are ruled out quickly, so more resources go to the promising ones.

For example:

Suppose there is a data set $D$, and we randomly sample a subset $d$ from $D$. If a set of parameters performs poorly on the full data set $D$, then with high probability it will not perform well on the subset $d$ either. Conversely, if a set of parameters performs poorly on the subset $d$, we would not expect it to perform well on the full data set $D$.

That is, the performance of a set of parameters on the subset is consistent with its performance on the full data set. If this assumption holds, grid search does not need all the data to validate every set of parameters; it can filter hyperparameters using only a subset of the training data, which greatly speeds up the computation.

On real data, though, this assumption holds only on a condition: the distribution of the subset must be similar to the distribution of the full data set $D$. The closer the two distributions, the more likely it is that a set of parameters performs consistently on both. From the earlier discussion of random grid search, the larger the subset, the closer its distribution is to that of the full data set. But a large subset means longer training time, so for the sake of efficiency the subset cannot grow without limit. This creates a tension: results on large subsets are more reliable, but computation on large subsets is slower.

6.2 Working principle of HalvingSearchCV

The half-grid search algorithm uses a carefully designed process that balances subset size against computational efficiency. The specific process is:

  1. In the first stage, HalvingSearchCV evaluates all parameter combinations using a small part of the training data. This phase can be done quickly, but since only a small portion of the data is used, the results of the evaluation may be less accurate.
  2. In the second stage, HalvingSearchCV only retains a part of the best-performing parameter combinations according to the evaluation results of the first stage. This phase will use more data to evaluate these parameter combinations, resulting in more accurate results.
  3. HalvingSearchCV will repeat the above process, and each stage will eliminate half of the parameter combinations until only one parameter combination remains. This parameter combination is the best parameter combination selected by HalvingSearchCV.
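The three stages above can be sketched in a few lines of plain Python. This is a toy illustration, not sklearn's implementation: `successive_halving` and `toy_score` are hypothetical names, and the toy score ignores the subset size, whereas a real score would be a cross-validated metric computed on that many samples.

```python
import math

def successive_halving(candidates, score, first_size, total_size, factor=2):
    """Keep the top 1/factor of the candidates each round while the subset grows."""
    survivors = list(candidates)
    size = first_size
    while len(survivors) > 1 and size <= total_size:
        # Rank the current survivors by their score on a subset of `size` samples
        ranked = sorted(survivors, key=lambda c: score(c, size), reverse=True)
        survivors = ranked[: math.ceil(len(ranked) / factor)]
        print(f"subset size {size}: kept {len(survivors)} of {len(ranked)}")
        size *= factor
    return survivors

def toy_score(depth, size):
    # Hypothetical score that peaks at depth 6; `size` is unused in this toy
    return -abs(depth - 6)

winner = successive_halving([2, 4, 6, 8, 10, 12, 14, 16], toy_score,
                            first_size=100, total_size=1000)
print(winner)
```

If the data runs out before a single candidate remains, the function returns all remaining survivors rather than one winner.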

For example:

Suppose there is a data set D from which subsets are randomly sampled:

  1. First, a small subset $d_0$ is randomly sampled from the full data set without replacement, and the performance of all parameter combinations is validated on $d_0$. Based on the validation results on $d_0$, the half of the parameter combinations with the lowest scores is eliminated.
  2. Then a subset $d_1$, twice as large as $d_0$, is sampled from the full data set without replacement, and the performance of the remaining half of the parameter combinations is validated on $d_1$. Based on the validation results on $d_1$, the half with the lowest scores is eliminated.
  3. Then a subset $d_2$, twice as large as $d_1$, is sampled from the full data set without replacement, and the performance of the remaining 1/4 of the parameter combinations is validated on $d_2$. Based on the validation results on $d_2$, the half with the lowest scores is eliminated...

The cycle continues in this way. If S denotes the sample size of the subset in the first iteration and C denotes the total number of parameter combinations, then as the iterations proceed, the subset used for validation grows larger and larger while the number of parameter combinations to validate becomes smaller and smaller:

| Iteration | Subset sample size | Number of parameter combinations |
|---|---|---|
| 1 | $S$ | $C$ |
| 2 | $2S$ | $\frac{1}{2}C$ |
| 3 | $4S$ | $\frac{1}{4}C$ |
| 4 | $8S$ | $\frac{1}{8}C$ |
| ... | ... | ... |
(When C is not divisible, round up)

The loop stops when only one candidate parameter combination is left, or when there is not enough data available.

Specifically, if $n$ is the multiplier reached in the current iteration (1, 2, 4, ...), the search stops when $\frac{1}{n}C \le 1$ or when $nS$ exceeds the total sample size.
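The table and the stopping rule can be reproduced with a short function. This is a sketch of the arithmetic described above rather than sklearn's internal logic; `halving_schedule` is a hypothetical helper, and it assumes the row with a single remaining combination is listed as the final row.

```python
import math

def halving_schedule(S, C, N, factor=2):
    """Rows of (iteration, subset size, combinations) until a stop condition."""
    rows = []
    n = 1  # multiplier: 1, 2, 4, ... when factor=2
    while n * S <= N:                     # stop when nS exceeds the sample size
        combos = math.ceil(C / n)         # round up when C is not divisible
        rows.append((len(rows) + 1, n * S, combos))
        if combos <= 1:                   # stop when one combination is left
            break
        n *= factor
    return rows

# Stops because the data runs out (the next subset would need 1600 samples)
print(halving_schedule(S=100, C=10, N=1000))
# Stops because only one combination is left
print(halving_schedule(S=50, C=8, N=10_000))
```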

In this mode, only the parameter combination that continuously obtains excellent results on different subsets can be retained to the later stage of the iteration, and the finally selected parameter combination must be the parameter combination that performs well on all subsets. The likelihood that such a combination of parameters will perform well on the full data is very high, and may also exhibit greater generalization capabilities than those derived from grid/random search.

The advantage of this approach is that poorly performing parameter combinations can be quickly ruled out without having to train and evaluate them completely. Therefore, HalvingSearchCV is usually faster than traditional grid search and random search, especially when the parameter space is large or the training data set is large.

6.3 Interpretation of HalvingSearchCV parameters in Sklearn

Let's first look at the parameters of HalvingGridSearchCV in sklearn. The parameters that need to be understood are:

factor

The multiplier applied to the sample size in each iteration, and also the reciprocal of the fraction of parameter combinations kept after each iteration. For example, when factor=2, the sample size of the next iteration is twice that of the previous iteration, and 1/2 of the parameter combinations are kept after each iteration. When factor=3, the sample size of the next iteration is three times that of the previous iteration, and 1/3 of the parameter combinations are kept after each iteration. This parameter usually works well when set to 3.

resource

The type of resource that is increased in each iteration, given as a string. The default is the sample size, "n_samples". It can also be any parameter of the estimator that accepts positive integers, for example the number of weak learners in an ensemble algorithm, "n_estimators", or "n_iterations".

min_resources

The sample size r0 used to validate parameter combinations in the first iteration. It accepts a positive integer or one of the two strings "smallest" and "exhaust".
Enter a positive integer n to use n samples in the first iteration.
Enter "smallest" to calculate r0 according to the following rules:

When the resource type is the sample size, for regression algorithms, r0 = number of cross-validation folds n_splits * 2

When the resource type is the sample size, for classification algorithms, r0 = number of classes n_classes_ * number of cross-validation folds n_splits * 2

When the resource type is not the sample size, r0 = 1

Enter "exhaust" to derive r0 backwards from the maximum resource available in the last iteration. For example, with factor=2, a sample size of 1000, and 3 iterations in total, the last iteration can use at most 1000 samples, the second-to-last 500, and the third-to-last (the first) 250, so r0 = 250. The "exhaust" mode is the most likely to give good results, but it is slightly more expensive and takes slightly longer to compute.
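The rules for "smallest" and "exhaust" amount to a few lines of arithmetic. The sketch below follows the rules as stated in this section, with the sample size as the resource; `smallest_r0` and `exhaust_r0` are hypothetical helper names, not sklearn functions.

```python
def smallest_r0(n_splits, n_classes=None):
    """r0 for min_resources="smallest" when the resource is the sample size."""
    if n_classes is None:                # regression
        return n_splits * 2
    return n_classes * n_splits * 2      # classification

def exhaust_r0(max_resources, factor, n_iterations):
    """r0 for min_resources="exhaust": work backwards from the last iteration."""
    return max_resources // factor ** (n_iterations - 1)

print(smallest_r0(n_splits=5))               # regression with 5-fold CV
print(smallest_r0(n_splits=5, n_classes=3))  # 3-class classification, 5-fold CV
print(exhaust_r0(max_resources=1000, factor=2, n_iterations=3))
```

The last call reproduces the example in the text: 1000 samples in the last round, 500 in the round before, and 250 in the first round.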

6.4 Limitations of HalvingSearchCV

The HalvingSearchCV process has one problem: the larger the subset, the more similar its distribution is to that of the full data set D, yet the algorithm uses the smallest subset to screen out the largest number of parameter combinations at the very beginning. If the distribution of the initial subset differs greatly from the full data set, many parameter combinations that are effective on the full data set D may be eliminated in the first few iterations. The initial subset of a half-grid search therefore must not be too small.

6.5 Practical operation: Look at HalvingSearchCV from <Kaggle Competition Case: House Price Prediction>

Given the limitation above, the initial subset must not be too small, and because half-grid search samples without replacement, the overall data set must be large.

In practical modeling experience, half-grid search often performs worse on small data sets than random grid search and ordinary grid search. For example, on the Kaggle house price data set used above, half-grid search gives worse results than enumeration grid search and also takes a long time. On large data sets (for example, more than 10,000 samples), however, half-grid search shows great advantages in both computing speed and accuracy.

Therefore, the half-grid search below is run on an expanded house price data set with about 29,000 samples.

Step 1: Import the basic library

import numpy as np
import pandas as pd
import matplotlib as mlp
import matplotlib.pyplot as plt
import time
from sklearn.ensemble import RandomForestRegressor
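# The halving estimators are experimental: this import must come before HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import KFold, HalvingGridSearchCV, cross_validate, RandomizedSearchCV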

import re
import sklearn

Step 2: Import the data set (the training set that has completed the basic processing can be used directly)

data1 = pd.read_csv("./datasets/House Price/big_train.csv",index_col=0)

data1 

X = data1.iloc[:,:-1]
y = data1.iloc[:,-1]


Step 3: Construct the same global parameter space

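# Create the parameter space: identical to the one used for grid search, for easy comparison
# Note: "auto" is only accepted by older scikit-learn releases (it was removed in 1.3)
# Note: np.arange(0,5,10) yields the single value 0
param_grid_simple = {"criterion": ["squared_error","poisson"]
                     , 'n_estimators': [*range(20,100,5)]
                     , 'max_depth': [*range(10,25,2)]
                     , "max_features": ["log2","sqrt",16,32,64,"auto"]
                     , "min_impurity_decrease": [*np.arange(0,5,10)]
                    }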
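The size of this grid is the product of the number of values under each key. A minimal check using only the standard library, with the value lists copied from the grid above (the single value 0 stands in for the output of np.arange(0,5,10)):

```python
from math import prod

grid = {
    "criterion": ["squared_error", "poisson"],            # 2 values
    "n_estimators": list(range(20, 100, 5)),              # 16 values
    "max_depth": list(range(10, 25, 2)),                  # 8 values
    "max_features": ["log2", "sqrt", 16, 32, 64, "auto"], # 6 values
    "min_impurity_decrease": [0],                         # 1 value
}

# Total number of parameter combinations in the grid
space = prod(len(values) for values in grid.values())
print(space)
```

This is the value assigned to space in Step 4.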

Step 4: Determine factor and parameter combination

For semi-grid search applications, the hardest part is deciding on the complex combination of parameters for the search itself.

When tuning parameters, if you want all alternative combinations in the parameter space to be fully verified, the number of iterations cannot be too small (for example, only 3 iterations), so the factor cannot be too large. However, if the factor is too small, the number of iterations will be increased, and the running time of the entire search will be lengthened. At the same time, the number of iterations will also affect the amount of data that can be used in the end, and the number of parameter combinations that need to be further verified after the iteration is completed, neither of which should be too small. Therefore, in general, when using a half-grid search, the following three points need to be considered:

1. The value of min_resources should not be too small, and as much data as possible should be in use by the time all iterations finish.

2. After the iterations finish, the number of remaining parameter combinations should not be too large. Ideally it is below 10; if that is not possible, below 30 is also acceptable.

3. The number of iterations should not be too large, otherwise the search may take too long.

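factor = 1.5
n_samples = X.shape[0]
min_resources = 500
space = 1536  # total number of parameter combinations in param_grid_simple

for i in range(100):
    # Stop once the subset would exceed the data or fewer than one combination remains
    if (min_resources*factor**i > n_samples) or (space/factor**i < 1):
        break
    print(i+1,"Samples in this iteration: {}".format(min_resources*factor**i)
          ,"Parameter combinations validated in this iteration: {}".format(space//factor**i + 1))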


Step 5: Instantiate and build a half grid search

model_rf2 = RandomForestRegressor(random_state=24, verbose=True,)
cv = KFold(n_splits=5, shuffle=True, random_state=24)

# Define the halving search
search2 = HalvingGridSearchCV(estimator=model_rf2
                            ,param_grid=param_grid_simple
                            ,factor=1.5
                            ,min_resources=500
                            ,scoring = "neg_mean_squared_error"
                            ,verbose = True
                            ,random_state=1412
                            ,cv = cv
                            ,n_jobs=-1)

Step 6: Train and calculate time

Train the model and print the elapsed time in the format "minutes + seconds". Note that the divmod() function returns two values: the first is the quotient (minutes) and the second is the remainder (seconds).

start = time.time()
search2.fit(X, y)
end = time.time()

elapsed_time = end - start  # elapsed time in seconds
minutes, seconds = divmod(elapsed_time, 60)  # convert seconds into minutes and seconds
print(f"Elapsed time: {int(minutes)} minutes {int(seconds)} seconds")


Since different data sets are used, it is no longer comparable to GridSearchCV and RandomizedSearchCV.
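After fitting, the search object records how many samples and candidates each iteration used. The sketch below is a small self-contained illustration, not a rerun of the house price experiment: it assumes synthetic data from make_regression and a tiny 4-combination grid, and the names X_demo, y_demo and demo_search are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, KFold

# Synthetic regression data standing in for a real data set
X_demo, y_demo = make_regression(n_samples=400, n_features=5,
                                 noise=10.0, random_state=0)

demo_search = HalvingGridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 4], "n_estimators": [5, 10]},  # 4 combinations
    factor=2,
    min_resources=100,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=2, shuffle=True, random_state=0),
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.n_resources_)   # samples used in each iteration
print(demo_search.n_candidates_)  # combinations evaluated in each iteration
print(demo_search.best_params_)   # combination selected at the end
```

The same attributes are available on search2 once Step 6 has finished.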

V. Conclusion

This paper analyzes the limitations of traditional enumeration grid search from a theoretical point of view, and introduces two more efficient hyperparameter search methods: RandomizedSearchCV and HalvingSearchCV. It explains their working principles, parameter settings and usage scenarios in detail, and helps readers understand the advantages and limitations of these two methods more intuitively through specific Kaggle competition cases.

After reading this article, you should understand the basic concepts and working principles of RandomizedSearchCV and HalvingSearchCV, and be able to choose and apply a suitable hyperparameter search method for a specific machine learning problem according to your own needs and the limits of your computing resources. You will also have learned how to balance search accuracy and efficiency by choosing different parameter spaces.

The next article will introduce a more advanced and intelligent hyperparameter optimization method - Bayesian optimization. This method uses prior knowledge and Bayesian inference to find better parameter combinations in a smaller search space.

Finally, thank you for reading this article! If you feel that you have gained something, don't forget to like, bookmark and follow me, this is the motivation for my continuous creation. If you have any questions or suggestions, you can leave a message in the comment area, I will try my best to answer and accept your feedback. If there's a particular topic you'd like to know about, please let me know and I'd be happy to write an article about it.

Thank you for your support and look forward to growing up with you!


Origin blog.csdn.net/Lvbaby_/article/details/131666028