Machine Learning Tuning Parameters


Recently I have been using machine learning to make predictions, and the results were consistently unsatisfactory, so I decided to study hyperparameter tuning. These notes borrow from write-ups by others online, combined with my own understanding. If anything here is inaccurate, please point it out in the comment section.

Why do you need to tune parameters?

One of the most difficult parts of machine learning is finding the best hyperparameters for a model, and the hyperparameters have a large influence on the model's performance.

What parameters are tuned?

Random forest:
==n_estimators:== the number of decision trees; too few makes it easy to underfit, while adding more trees eventually stops improving the model
==bootstrap:== whether to sample with replacement when building each tree, True or False
==oob_score:== whether to use out-of-bag samples to evaluate the quality of the model, True or False
==max_depth:== maximum depth of each decision tree
==min_samples_leaf:== minimum number of samples required at a leaf node
==min_samples_split:== minimum number of samples required to split a node
==min_weight_fraction_leaf:== minimum weighted fraction of samples required at a leaf node
==max_leaf_nodes:== maximum number of leaf nodes
==criterion:== the criterion used to split nodes
==max_features:== maximum number of features considered when looking for the best split
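
For reference, here is a minimal sketch of how these parameters map onto scikit-learn's RandomForestRegressor constructor; the specific values below are illustrative only, not recommendations, and the criterion name assumes a recent sklearn version.

from sklearn.ensemble import RandomForestRegressor

# Illustrative values only; suitable settings depend on the data.
rf = RandomForestRegressor(
    n_estimators=500,              # number of trees
    bootstrap=True,                # sample with replacement
    oob_score=True,                # evaluate with out-of-bag samples
    max_depth=20,                  # maximum tree depth
    min_samples_leaf=2,            # minimum samples at a leaf
    min_samples_split=5,           # minimum samples to split a node
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction at a leaf
    max_leaf_nodes=None,           # unlimited leaf nodes
    criterion="squared_error",     # node-splitting criterion
    max_features="sqrt",           # features considered per split
)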

How to adjust parameters?

manual tuning

As the name implies, you pick hyperparameter values by hand, train the model, and compare the results. This is very time-consuming, and there is no guarantee that the best parameters will be found.
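
As a rough illustration of my own (not from the original post), manual tuning usually amounts to a loop like this, assuming x_train, y_train, x_val, y_val have already been split:

from sklearn.ensemble import RandomForestRegressor

best_score, best_params = None, None
for n_estimators in (100, 300, 500):            # hand-picked candidates
    for max_depth in (10, 30, None):
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=42)
        model.fit(x_train, y_train)
        score = model.score(x_val, y_val)       # R^2 on a held-out validation set
        if best_score is None or score > best_score:
            best_score = score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}
print(best_params, best_score)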

grid search

Grid search is a basic hyperparameter tuning technique: a model is built for every permutation of the hyperparameter values specified in a grid, and the best-performing model is selected.
It is slow because it tries out every combination of hyperparameters and picks the best one based on the cross-validation score.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

clf = RandomForestRegressor()
# GridSearchCV takes param_grid (not param_distributions) and has no n_iter:
# it exhaustively tries every combination in the grid.
# random_grid is the dictionary of candidate values (defined in the random search example below).
grid = GridSearchCV(estimator=clf, param_grid=random_grid,
                    cv=3, verbose=2, n_jobs=1)
grid.fit(x_train, y_train)
print(grid.best_params_)

random search

Random search often works better than grid search because, in many cases, not all hyperparameters are equally important; random search randomly samples combinations of parameters from the hyperparameter space.
However, random search is not guaranteed to find the best combination of parameters.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for each hyperparameter.
# Recent sklearn uses 'squared_error'/'absolute_error' instead of the old 'mse'/'mae'.
criterion = ['squared_error', 'absolute_error']
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = [1.0, 'sqrt']   # 1.0 = all features (replaces the removed 'auto')
max_depth = [int(x) for x in np.linspace(10, 100, num=10)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'criterion': criterion,
               'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

clf = RandomForestRegressor()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=10,
                                cv=3, verbose=2, random_state=42, n_jobs=1)
clf_random.fit(x_train, y_train)
print(clf_random.best_params_)

Bayesian search

Bayesian optimization belongs to a class of optimization algorithms known as sequential model-based optimization (SMBO) algorithms, which use previous observations of the loss to decide the next best point to sample. The algorithm can be summarized as follows:

  1. Using the previously evaluated points x_1, ..., x_n, compute the posterior expectation of the loss f.
  2. Sample the loss f at a new point x that maximizes some utility of that expectation; this utility specifies which regions of the domain of f are most worth sampling next.

These steps are repeated until a convergence criterion is met.

A practical tip: if the hyperparameter space is very large, use random search first to find promising combinations of hyperparameters, and then use grid search locally around them to select the optimal values, as sketched below.
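
A minimal sketch of that coarse-to-fine idea (my own illustration, assuming the random_grid, x_train and y_train from the random search example above): run RandomizedSearchCV first, then grid-search a small neighbourhood around its best n_estimators and max_depth.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Stage 1: coarse random search over the large space
coarse = RandomizedSearchCV(RandomForestRegressor(), param_distributions=random_grid,
                            n_iter=10, cv=3, random_state=42, n_jobs=1)
coarse.fit(x_train, y_train)
best = coarse.best_params_

# Stage 2: fine grid search in a small neighbourhood of the best point
md = best['max_depth']
local_grid = {
    'n_estimators': [max(10, best['n_estimators'] - 100),
                     best['n_estimators'],
                     best['n_estimators'] + 100],
    'max_depth': [md] if md is None else [max(1, md - 10), md, md + 10],
}
fine = GridSearchCV(RandomForestRegressor(), param_grid=local_grid, cv=3, n_jobs=1)
fine.fit(x_train, y_train)
print(fine.best_params_)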

# BayesSearchCV comes from scikit-optimize (skopt), not sklearn
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor()
# the search space is passed via search_spaces; list values are treated as categorical
Bayes = BayesSearchCV(estimator=clf, search_spaces=random_grid,
                      n_iter=10,
                      cv=3, verbose=2, random_state=42, n_jobs=1)
Bayes.fit(x_train, y_train)
print(Bayes.best_params_)

K-fold cross-validation

If the model is too complex it will overfit, and if it is too simple it will underfit. To prevent an unlucky split of the data from biasing the final result, K-fold cross-validation is generally used: the original data is divided into K parts, one part is used as the test set and the rest as the training set, and this is repeated K times so that each part is held out once. This yields K models and K evaluation scores, and their average is taken as the final performance estimate.
A common choice is K=10. sklearn already provides an encapsulated method that can be called directly.

from sklearn.model_selection import cross_val_score

# pipe_lr is whatever estimator or pipeline you want to evaluate (e.g. the clf above);
# note that the data arguments are named X and y
scores = cross_val_score(estimator=pipe_lr, X=x_train, y=y_train, cv=5, n_jobs=1)
print('CV accuracy scores: %s' % scores)

How to measure whether the parameters are appropriate

We generally compare models using a few metrics; computing the same metrics before and after tuning tells us whether the tuned model is actually better.

  1. Accuracy
  2. Recall
  3. F1-score
  4. ROC curve
  5. Confusion matrix

These are generally computed by calling sklearn directly:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# precision/recall/F1 are classification metrics, so a classifier is used here
rf = RandomForestClassifier(max_depth=100, min_samples_split=5, min_samples_leaf=4)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_pred))
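
The list above also mentions the ROC curve and the confusion matrix; as a small supplementary sketch (assuming the same rf, x_test, y_test and a binary classification task), they can be computed like this:

from sklearn.metrics import confusion_matrix, roc_auc_score

# confusion matrix of true vs. predicted labels
print(confusion_matrix(y_true=y_test, y_pred=y_pred))

# ROC AUC needs predicted probabilities for the positive class
y_proba = rf.predict_proba(x_test)[:, 1]
print('ROC AUC: %.3f' % roc_auc_score(y_test, y_proba))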


Reprinted from: blog.csdn.net/qq_45699150/article/details/123545046