Tuning model hyperparameters

Ridge regression improves on linear regression by adding an L2 regularization term to the loss function, trading away unbiasedness in order to reduce variance. But have you ever asked yourself: which value should the regularization coefficient take? 0.01, 0.1, or 1? So far we have had to rely on experience or guesswork. Can we find the optimal value in a principled way? Finding the best parameter is itself an optimization problem, so the algorithms that first come to mind are unconstrained (or constrained) optimization methods such as gradient descent or Newton's method. Before checking whether that idea is feasible, however, we need to understand the difference between two essential concepts.
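For reference, the objective being discussed can be written in standard notation (design matrix X, targets y, coefficients w, regularization strength λ); this is the usual formulation, not a quote from the original post:

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
```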

  • Parameters and hyperparameters:
    A natural question is: what is the difference between the parameter λ in ridge regression and the parameter w? In fact, w is only optimized, by methods such as least squares or gradient descent, after a specific λ has been fixed; we always ask "given this λ, what is the optimal w?". Quantities like w, which are optimized by least squares or gradient descent, are called parameters; quantities like λ, which cannot be optimized by least squares or gradient descent, are called hyperparameters.

Model parameters are configuration variables inside the model, and their values can be estimated from data.

  • Parameters are required when making predictions.
  • Their values define the model that is actually used.
  • Parameters are estimated or learned from data.
  • The parameters are usually not manually set by the programmer.
  • The parameters are usually saved as part of the learning model.
  • Parameters are the key to machine learning algorithms, and they are usually summarized from past training data.

Model hyperparameters are configurations external to the model, and their values cannot be estimated from the data.

  • Hyperparameters are often used to help estimate model parameters.
  • Hyperparameters are usually specified manually.
  • Hyperparameters can usually be set using heuristics.
  • Hyperparameters are often adjusted to a given predictive modeling problem.
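To make the distinction concrete, here is a minimal sketch using synthetic data (not the original post's example): in sklearn's Ridge, alpha plays the role of λ and is set by hand before training, while coef_ and intercept_ are the parameters w estimated from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

model = Ridge(alpha=0.1)  # alpha (our lambda) is a hyperparameter, chosen before training
model.fit(X, y)           # w is a parameter, estimated from the data during fitting

print(model.coef_, model.intercept_)  # learned parameters
print(model.alpha)                    # the hyperparameter, unchanged by fit()
```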

Grid Search GridSearchCV():
The idea of grid search is very simple. Suppose you have 2 hyperparameters to choose, each with several candidate values; you simply enumerate every combination. For example, with λ = 0.01, 0.1, 1.0 and α = 0.01, 0.1, 1.0, the combinations are {[0.01, 0.01], [0.01, 0.1], [0.01, 1.0], [0.1, 0.01], [0.1, 0.1], [0.1, 1.0], [1.0, 0.01], [1.0, 0.1], [1.0, 1.0]}. A model is then trained for each combination of hyperparameters, and the combination with the smallest test error is selected. In other words, we look for the optimal hyperparameters in the hyperparameter space, much like looking for the best node in a grid, hence the name grid search. A sketch with sklearn's GridSearchCV is given below.
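A minimal sketch of this idea with GridSearchCV, using KernelRidge and synthetic data purely for illustration (the estimator, grid values, and data are assumptions, not from the original post):

```python
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Two hyperparameters with three candidates each: 3 x 3 = 9 combinations in total.
param_grid = {"alpha": [0.01, 0.1, 1.0], "gamma": [0.01, 0.1, 1.0]}

grid = GridSearchCV(
    estimator=KernelRidge(kernel="rbf"),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # pick the combination with the smallest error
    cv=5,                              # 5-fold cross-validation
)
grid.fit(X, y)

print(grid.best_params_)   # the best of the 9 combinations
print(-grid.best_score_)   # its cross-validated mean squared error
```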

Randomized search RandomizedSearchCV():
Grid search amounts to brute-force trying every point in the parameter space and then picking the best combination. This is clearly not efficient enough, because the number of combinations grows exponentially with the number of hyperparameters. Is there a more efficient way to tune? Yes: randomized search. This method is not only efficient; experiments have shown that random search is often slightly better than a sparse grid search (although it can occasionally be much worse, so there is a trade-off to weigh). In randomized search, each hyperparameter is sampled from a distribution over its possible values. Compared with grid search, this has two main advantages (see the sketch after the list below):

  • The computational budget can be chosen independently of the number of parameters and their possible values.
  • Adding parameters that do not affect performance will not reduce efficiency.
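A comparable sketch with RandomizedSearchCV, sampling the same two hyperparameters from log-uniform distributions (again, the estimator, ranges, and data are illustrative assumptions):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Each hyperparameter is drawn from a distribution instead of a fixed list.
param_distributions = {
    "alpha": loguniform(1e-3, 1e1),
    "gamma": loguniform(1e-3, 1e1),
}

search = RandomizedSearchCV(
    KernelRidge(kernel="rbf"),
    param_distributions=param_distributions,
    n_iter=20,                         # the computation budget: 20 sampled settings
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)
```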

Below we use SVR as an example of hyperparameter tuning:
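The original code was shown only as screenshots, so here is a comparable sketch under stated assumptions: a StandardScaler + SVR pipeline on a synthetic regression dataset, tuned first with GridSearchCV and then with RandomizedSearchCV (the parameter ranges and data are illustrative, not necessarily those of the original post):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# SVR is sensitive to feature scales, so standardize inside the pipeline.
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Grid search: enumerate every combination of the listed values.
grid_params = {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.01, 0.1, 1.0]}
grid = GridSearchCV(pipe, grid_params, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print("grid search best:", grid.best_params_, -grid.best_score_)

# Randomized search: sample the same hyperparameters from continuous distributions.
rand_params = {"svr__C": loguniform(1e-1, 1e2), "svr__epsilon": loguniform(1e-2, 1e0)}
rand = RandomizedSearchCV(pipe, rand_params, n_iter=20, cv=5,
                          scoring="neg_mean_squared_error", random_state=0)
rand.fit(X, y)
print("randomized search best:", rand.best_params_, -rand.best_score_)
```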
Through sustained effort, going from collecting a data set and selecting appropriate features, to choosing metrics that measure model performance, to selecting and training a specific model, and finally to evaluating the model and tuning its hyperparameters, we have learned how to use sklearn to build a simple regression model.

Thanks to the Datawhale team for their contributions to open source learning!
Reference:
https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning

Origin blog.csdn.net/weixin_43595036/article/details/115160758