Advanced hyperparameter tuning techniques

The following are notes from the Coursera course "How to Win a Data Science Competition: Learn from Top Kagglers".

Hyperparameter Optimization

  • List most important hyperparameters in major models; describe their impact
  • Understand the hyperparameter tuning process in general
  • Arrange hyperparameters by their importance

Hyperparameter tuning I

Plan for the lecture

  • Hyperparameter tuning in general
  • General pipeline
  • Manual and automatic tuning
  • What should we understand about hyperparameters?
  • Models, libraries and hyperparameter optimization
  • Tree-based models
  • Neural networks
  • Linear models

Plan for the lecture: models

  • Tree-based models
  • GBDT: XGBoost, LightGBM, CatBoost
  • RandomForest/ExtraTrees
  • Neural nets
  • Pytorch, Tensorflow, Keras...
  • Linear models
  • SVM, logistic regression
  • Vowpal Wabbit, FTRL
  • Factorization Machines (out of scope)
  • libfm, libFFM

How do we tune hyperparameters

  • 1. Select the most influential parameters
  • a. There are tons of parameters and we can't tune all of them
  • 2. Understand how exactly they influence the training
  • 3. Tune them
  • a. Manually (change and examine)
  • b. Automatically (hyperopt, etc.)

  • 1. In any case we never have time to tune all the parameters, so we need to pick a good subset to tune. If we are xgboost novices and do not know which parameters need tuning, we can look at which parameters are usually set in GitHub repositories or Kaggle Kernels.
  • 2. Understand what will happen when a particular parameter is changed.
  • 3. Most people tune manually, referring to previous work. You can also use hyperparameter optimization tools, but doing it manually is often faster.

Hyperparameter optimization software for automatic tuning

Running a tuning tool may take a long time, so the best strategy is to run it overnight; a minimal Hyperopt example is sketched after the list below.

  • A lot of libraries to try:
  • Hyperopt
  • Scikit-optimize
  • Spearmint
  • GPyOpt
  • RoBO
  • SMAC3
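As an illustration, here is a minimal sketch of automatic tuning with Hyperopt for an XGBoost classifier. The search-space ranges, max_evals=50 and the synthetic dataset are assumptions made for the example, not recommendations from the course.

```python
# Minimal sketch: tuning XGBoost with Hyperopt's TPE algorithm (assumes xgboost and hyperopt are installed).
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Example search space; these ranges are illustrative assumptions.
space = {
    'max_depth': hp.quniform('max_depth', 2, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'min_child_weight': hp.loguniform('min_child_weight', np.log(1), np.log(300)),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
}

def objective(params):
    params['max_depth'] = int(params['max_depth'])
    model = XGBClassifier(n_estimators=200, **params)
    # Hyperopt minimizes, so return the negative cross-validated AUC.
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return {'loss': -score, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```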

Broadly speaking, different parameter values lead to one of three outcomes

  • 1. Underfitting (bad)
  • 2. Good fit and generalization (good)
  • 3. Overfitting (bad)

So we divide the parameters we want to tune into two groups: the first group constrains the model, and the second group has the opposite effect.

  • A parameter in red
  • Increasing it impedes fitting
  • Increase it to reduce overfitting
  • Decrease it to allow the model to fit more easily
  • A parameter in green
  • Increasing it leads to a better fit (overfit) on the train set
  • Increase it if the model underfits
  • Decrease it if the model overfits

The red/green colors above refer to the tags used in the lecture video.

Hyperparameter tuning II

Hyperparameter optimization for tree-based models

  • Tree-based models

    Model                     Where
    GBDT                      XGBoost - dmlc/xgboost
                              LightGBM - Microsoft/LightGBM
                              CatBoost - catboost/catboost
    RandomForest/ExtraTrees   scikit-learn
    Others                    RGF - baidu/fast_rgf

GBDT

    XGBoost                               LightGBM
    max_depth                             max_depth / num_leaves
    subsample                             bagging_fraction
    colsample_bytree, colsample_bylevel   feature_fraction
    min_child_weight, lambda, alpha       min_data_in_leaf, lambda_l1, lambda_l2
    eta                                   learning_rate
    num_round                             num_iterations
    Others: seed                          Others: *_seed
  • max_depth :
    The deeper the tree, the better it can fit the training set, but this can lead to overfitting. The optimal depth varies a lot from task to task, sometimes 2, sometimes 27. It is recommended to start with max_depth around 7 and increase it until the model starts to overfit. Note that the deeper the tree, the longer the training takes.
  • num_leaves :
    In LightGBM you can control the number of leaves instead of the maximum depth. The tree can then be deep, but with a small number of leaves it will not overfit.
  • subsample, bagging_fraction :
    This parameter controls the fraction of data fed to the model on each iteration, with values between 0 and 1. Feeding only part of the data each time makes the model overfit less and can improve generalization, but training becomes slower. It acts a bit like regularization.
  • colsample_bytree, colsample_bylevel :
    These parameters control the fraction of features used per tree and per level when choosing split points. If the model overfits, try reducing these values.
  • min_child_weight, lambda, alpha :
    regularization parameters.
  • min_child_weight :
    In my experience this is one of the most important parameters. Increasing it makes the model more conservative; decreasing it makes the model less constrained. Depending on the task, I have found the optimal value to be 0, 5, 15 or even 300, so do not hesitate to try a wide range of values.
  • eta, num_round : eta is essentially a learning rate, just as in gradient descent, and num_round is the number of boosting steps we want, in other words how many trees we build. Each iteration builds a new tree, which is added to the model with learning rate eta.
  • Once we have found the right number of rounds, there is a trick that often improves the score slightly: multiply num_round by α and divide eta by α; the resulting model is usually better. Other parameters may then need slight re-tuning, but can usually be left as they are. A sketch of this setup is shown after this list.
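A minimal sketch of this setup with the xgboost Python API follows; the toy dataset, the concrete parameter values and the scaling factor alpha = 2 are illustrative assumptions.

```python
# Sketch: the XGBoost parameters discussed above, plus the num_round / eta trick.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain, dvalid = xgb.DMatrix(X_tr, label=y_tr), xgb.DMatrix(X_va, label=y_va)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 7,            # start around 7, increase until overfitting
    'subsample': 0.8,          # row subsampling, acts like regularization
    'colsample_bytree': 0.8,   # feature subsampling per tree
    'min_child_weight': 5,     # try very different values: 0, 5, 15, 300, ...
    'eta': 0.1,
    'seed': 0,
}
num_round = 300

# The trick: multiply num_round by alpha and divide eta by alpha.
alpha = 2
params['eta'] /= alpha
model = xgb.train(params, dtrain, num_boost_round=num_round * alpha,
                  evals=[(dvalid, 'valid')], verbose_eval=False)
```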

Other

  • seed :
    the random seed normally has little impact on the model. However, if the random seed has a very large effect on your score, it is a hint that you should either submit multiple times and average over the randomness, or adjust your validation scheme.

sklearn.RandomForest/ExtraTrees

  • n_estimators :
    In RandomForest each tree is built independently of the others, which means that a model with a large number of trees will not overfit, in contrast to Gradient Boosting. We usually first set n_estimators to a very small number, e.g. 10, and see how long it takes; if it is not too long, we set it to a relatively large value, e.g. 300.
  • max_depth :
    controls the depth of the tree. Unlike in XGBoost, it can be set to None, which corresponds to unlimited depth. This is very useful in practice when the dataset has duplicated values and important feature interactions. In other cases a model with unconstrained depth will overfit immediately. It is recommended to start random forest depth around 7. The optimal depth of a random forest is usually higher than for Gradient Boosting, so do not hesitate to try 10, 20 or higher values.
  • max_features :
    same as the corresponding XGBoost parameter (column subsampling).
  • min_samples_leaf :
    a similar regularization parameter, analogous to min_child_weight in XGBoost and min_data_in_leaf in LightGBM.

Other

  • criterion :
    In my experience Gini is more often better, but sometimes Entropy works better.
  • random_state :
    the random seed parameter.
  • n_jobs : sets the number of cores to use. For some reason sklearn's RandomForest uses only one core by default. A short sketch with these settings follows below.
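A minimal sketch of these settings with scikit-learn; the toy dataset and the specific values are illustrative assumptions.

```python
# Sketch: RandomForest with the parameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=300,        # more trees do not overfit, they only cost time
    max_depth=None,          # or try 7, 10, 20, ...
    max_features='sqrt',     # column subsampling, like colsample in XGBoost
    min_samples_leaf=5,      # regularization, like min_data_in_leaf / min_child_weight
    criterion='gini',        # sometimes 'entropy' works better
    random_state=0,
    n_jobs=-1,               # use all cores; sklearn defaults to a single core
)
print(cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean())
```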

Hyperparameter tuning III

  • Neural nets
  • Pytorch, Tensorflow, Keras...
  • Linear models
  • SVM, logistic regression
  • Vowpal Wabbit, FTRL

Neural Nets

Here we discuss dense neural nets, i.e. networks containing only fully connected layers.


  • Number of neurons per layer
  • Number of layers
  • Optimizers
  • SGD + momentum
  • Adam / Adadelta / Adagrad / ..
    • In practice these lead to more overfitting
  • Batch size
  • Learning rate
  • Regularization
  • L2/L1 for weights
  • Dropout/Dropconnect
  • Static DropConnect

  • Suggestion: start simple, for example with one or two layers, and debug the code to make sure the training loss goes down.
  • Then try to find a configuration that is able to overfit, and only after that adjust the network.
  • A key part of training neural networks is the optimization method.
  • Adaptive optimization methods really do let you fit the data faster, but in my experience they can lead to severe overfitting. Plain SGD converges more slowly, but the trained model usually generalizes better. Adaptive methods are useful, but in settings other than classification and regression.
  • Batch size: in practice a large batch size leads to more overfitting. As a rule of thumb, a batch size of 500 can already be considered large. It is recommended to start with a value around 32 or 64; if the network still overfits, try decreasing the batch size, and increase it in the opposite case. The batch size should not be too small, otherwise the gradients become too noisy. After changing the batch size, the number of epochs may need to be re-tuned.
  • Learning rate: the learning rate should be neither too high nor too low, and the best learning rate depends on the other parameters. Usually one starts with a fairly large learning rate, e.g. 0.1, and then gradually decreases it. There is a rule of thumb: if you increase the batch size by a factor of alpha, you can also increase the learning rate by a factor of alpha.
  • Earlier, most people used L1 and L2 regularization on the weights. Nowadays most people use dropout. For me that means adding dropout right after the first few layers.
  • Static DropConnect: usually we connect the input to a dense layer with, say, 128 units. Instead, we change the first hidden layer to a very large one, e.g. 4096 units; for most competitions this is a huge network and it would overfit severely. To regularize it, we randomly drop 99% of the connections to this layer. This is a very strong regularization, and it has been shown to work. A minimal Keras sketch of a dense network with these ideas is shown below.
    [Figure: DropConnect illustration]
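As an illustration, here is a minimal sketch of a dense network in Keras with SGD + momentum and dropout; the layer sizes, dropout rate, batch size, learning rate and the toy data are illustrative assumptions, and ordinary dropout is used as a simple stand-in for DropConnect.

```python
# Sketch: a dense (fully connected) network with dropout, trained with SGD + momentum.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: 20 input features, binary target.
X = np.random.rand(2000, 20).astype('float32')
y = (X.sum(axis=1) > 10).astype('float32')

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(4096, activation='relu'),   # very wide first hidden layer
    layers.Dropout(0.5),                     # ordinary dropout as a stand-in for DropConnect
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

# Plain SGD with momentum usually generalizes better than adaptive optimizers.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, batch_size=64, epochs=10, validation_split=0.2, verbose=0)
```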

Linear models

  • Scikit-learn
  • SVC/SVR
    • Sklearn wraps libLinear and libSVM
    • Compile yourself for multicore support
  • LogisticRegression/LinearRegression + regularizers
  • SGDClassifier/SGDRegressor

  • Vowpal Wabbit
  • FTRL

  • SVM requires almost no parameter tuning, which is its biggest benefit
  • The latest versions of libLinear and libSVM support multi-core processing, but the wrappers in Sklearn do not. So we need to compile these libraries ourselves to use this option.
  • Almost no one uses kernel SVC, so here we only discuss linear SVM.
  • For data that does not fit in memory, we can use Vowpal Wabbit, which implements online learning of linear models. It reads data line by line directly from the hard drive and never loads the whole dataset into memory, which allows learning on very large datasets.
  • Online learning of linear models with FTRL was especially popular some time ago; it is implemented in Vowpal Wabbit.

Linear models

  • Regularization parameter (C, alpha, lambda, ...)
  • Start with a very small value and increase it.
  • SVC starts to work slower as C increases
  • Regularization type
  • L1/L2/L1+L2 -- try each
  • L1 can be used for feature selection

  • C: for SVM I usually start with a very small value, such as $10^{-6}$, and multiply it by 10 each time. Start from small values, because the larger C is, the longer the training takes.
  • L1 or L2? The answer is to try both; in my opinion they are very similar. L1 has one extra benefit: it gives us sparse weights, which can be used for feature selection. A small scikit-learn sketch follows below.
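A minimal sketch of this C sweep with scikit-learn's LinearSVC and L1-regularized LogisticRegression; the toy data and the exact grid are illustrative assumptions.

```python
# Sketch: sweeping C from a very small value upward, multiplying by 10 each time.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for C in [10 ** p for p in range(-6, 3)]:          # 1e-6, 1e-5, ..., 1e2
    svm_score = cross_val_score(LinearSVC(C=C), X, y, cv=3).mean()
    # L1 penalty gives sparse weights, usable for feature selection.
    lr_l1 = LogisticRegression(C=C, penalty='l1', solver='liblinear')
    lr_score = cross_val_score(lr_l1, X, y, cv=3).mean()
    print(f'C={C:g}  LinearSVC={svm_score:.3f}  LogReg(L1)={lr_score:.3f}')
```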

Tips

  • Don't spend too much time tuning hyperparameters
  • Only if you don't have any more ideas or you have spare computational resources

  • Be patient
  • It can take thousands of rounds for GBDT or neural nets to fit.

  • Average everything
  • Over random seeds (see the sketch after this list)
  • Or over small deviations from the optimal parameters
    • e.g. average max_depth = 4, 5, 6 for an optimal 5
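A minimal sketch of averaging predictions over random seeds, here with a RandomForest; the number of seeds and the model are illustrative assumptions.

```python
# Sketch: average predicted probabilities over several random seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

preds = []
for seed in range(5):                       # several seeds, same hyperparameters
    model = RandomForestClassifier(n_estimators=300, random_state=seed, n_jobs=-1)
    model.fit(X_tr, y_tr)
    preds.append(model.predict_proba(X_te)[:, 1])

avg_pred = np.mean(preds, axis=0)           # the averaged prediction is usually more stable
```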


Source: www.cnblogs.com/ishero/p/11136374.html