Advanced hyperparameter tuning techniques

The following are notes from the Coursera course "How to Win a Data Science Competition: Learn from Top Kagglers".

Hyperparameter Optimization

  • List most important hyperparameters in major models; describe their impact
  • Understand the hyperparameter tuning process in general
  • Arrange hyperparameters by their importance

Hyperparameter tuning I

Plan for the lecture

  • Hyperparameter tuning in general
  • General pipeline
  • Manual and automatic tuning
  • What should we understand about hyperparameters?
  • Models, libraries and hyperparameter optimization
  • Tree-based models
  • Neural networks
  • Linear models

Plan for the lecture: models

  • Tree-based models
  • GBDT: XGBoost, LightGBM, CatBoost
  • RandomForest/ExtraTrees
  • Neural nets
  • Pytorch, Tensorflow, Keras...
  • Linear models
  • SVM, logistic regression
  • Vowpal Wabbit, FTRL
  • Factorization Machines (out of scope)
  • libfm, libFFM

How do we tune hyperparameters

  • 1. Select the most influential parameters
  • a. There are tons of parameters and we can't tune all of them
  • 2. Understand how exactly they influence the training
  • 3. Tune them
  • a. Manually (change and examine)
  • b. Automatically (hyperopt, etc.)

  • 1. In any case we never have time to tune all the parameters, so we need to pick a good subset to tune. If we are xgboost novices and do not know which parameters need tuning, we can look at which parameters are usually set in GitHub repositories or Kaggle Kernels.
  • 2. Understand what will happen when a particular parameter is changed.
  • 3. Most people tune manually, referring to previous work. You can also use hyperparameter optimization tools, but doing it manually is often faster.

Hyperparameter optimization software for automatic tuning

Running a tuning tool may take a long time, so the best strategy is to run it overnight; a minimal Hyperopt example is sketched after the list below.

  • A lot of libraries to try:
  • Hyperopt
  • Scikit-optimize
  • Spearmint
  • GPyOpt
  • RoBO
  • SMAC3
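As an illustration, here is a minimal sketch of automatic tuning with Hyperopt for an XGBoost classifier. The search-space ranges, max_evals=50 and the synthetic dataset are assumptions made for the example, not recommendations from the course.

```python
# Minimal sketch: tuning XGBoost with Hyperopt's TPE algorithm (assumes xgboost and hyperopt are installed).
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Example search space; these ranges are illustrative assumptions.
space = {
    'max_depth': hp.quniform('max_depth', 2, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'min_child_weight': hp.loguniform('min_child_weight', np.log(1), np.log(300)),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
}

def objective(params):
    params['max_depth'] = int(params['max_depth'])
    model = XGBClassifier(n_estimators=200, **params)
    # Hyperopt minimizes, so return the negative cross-validated AUC.
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return {'loss': -score, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```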

Broadly speaking, different parameter values lead to one of three outcomes

  • 1. Underfitting (bad)
  • 2. Good fit and generalization (good)
  • 3. Overfitting (bad)

So we divide the parameters we want to tune into two groups: the first group constrains the model, and the second group has the opposite effect.

  • A parameter in red
  • Increasing it impedes fitting
  • Increase it to reduce overfitting
  • Decrease it to allow the model to fit more easily
  • A parameter in green
  • Increasing it leads to a better fit (overfit) on the train set
  • Increase it if the model underfits
  • Decrease it if the model overfits

The red/green colors above refer to the tags used in the lecture video.

Hyperparameter tuning II

Hyperparameter optimization for tree-based models

  • Tree-based models

    Model                     Where
    GBDT                      XGBoost - dmlc/xgboost
                              LightGBM - Microsoft/LightGBM
                              CatBoost - catboost/catboost
    RandomForest/ExtraTrees   scikit-learn
    Others                    RGF - baidu/fast_rgf

GBDT

    XGBoost                               LightGBM
    max_depth                             max_depth / num_leaves
    subsample                             bagging_fraction
    colsample_bytree, colsample_bylevel   feature_fraction
    min_child_weight, lambda, alpha       min_data_in_leaf, lambda_l1, lambda_l2
    eta                                   learning_rate
    num_round                             num_iterations
    Others: seed                          Others: *_seed
  • max_depth :
    The deeper the tree, the better it can fit the training set, but this can lead to overfitting. The optimal depth varies a lot from task to task, sometimes 2, sometimes 27. It is recommended to start with max_depth around 7 and increase it until the model starts to overfit. Note that the deeper the tree, the longer the training takes.
  • num_leaves :
    In LightGBM you can control the number of leaves instead of the maximum depth. The tree can then be deep, but with a small number of leaves it will not overfit.
  • subsample, bagging_fraction :
    This parameter controls the fraction of data fed to the model on each iteration, with values between 0 and 1. Feeding only part of the data each time makes the model overfit less and can improve generalization, but training becomes slower. It acts a bit like regularization.
  • colsample_bytree, colsample_bylevel :
    These parameters control the fraction of features used per tree and per level when choosing split points. If the model overfits, try reducing these values.
  • min_child_weight, lambda, alpha :
    regularization parameters.
  • min_child_weight :
    In my experience this is one of the most important parameters. Increasing it makes the model more conservative; decreasing it makes the model less constrained. Depending on the task, I have found the optimal value to be 0, 5, 15 or even 300, so do not hesitate to try a wide range of values.
  • eta, num_round : eta is essentially a learning rate, just as in gradient descent, and num_round is the number of boosting steps we want, in other words how many trees we build. Each iteration builds a new tree, which is added to the model with learning rate eta.
  • Once we have found the right number of rounds, there is a trick that often improves the score slightly: multiply num_round by α and divide eta by α; the resulting model is usually better. Other parameters may then need slight re-tuning, but can usually be left as they are. A sketch of this setup is shown after this list.
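A minimal sketch of this setup with the xgboost Python API follows; the toy dataset, the concrete parameter values and the scaling factor alpha = 2 are illustrative assumptions.

```python
# Sketch: the XGBoost parameters discussed above, plus the num_round / eta trick.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain, dvalid = xgb.DMatrix(X_tr, label=y_tr), xgb.DMatrix(X_va, label=y_va)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 7,            # start around 7, increase until overfitting
    'subsample': 0.8,          # row subsampling, acts like regularization
    'colsample_bytree': 0.8,   # feature subsampling per tree
    'min_child_weight': 5,     # try very different values: 0, 5, 15, 300, ...
    'eta': 0.1,
    'seed': 0,
}
num_round = 300

# The trick: multiply num_round by alpha and divide eta by alpha.
alpha = 2
params['eta'] /= alpha
model = xgb.train(params, dtrain, num_boost_round=num_round * alpha,
                  evals=[(dvalid, 'valid')], verbose_eval=False)
```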

Other

  • seed :
    the random seed normally has little impact on the model. However, if the random seed has a very large effect on your score, it is a hint that you should either submit multiple times and average over the randomness, or adjust your validation scheme.

sklearn.RandomForest/ExtraTrees

  • n_estimators :
    In RandomForest each tree is built independently of the others, which means that a model with a large number of trees will not overfit, in contrast to Gradient Boosting. We usually first set n_estimators to a very small number, e.g. 10, and see how long it takes; if it is not too long, we set it to a relatively large value, e.g. 300.
  • max_depth :
    controls the depth of the tree. Unlike in XGBoost, it can be set to None, which corresponds to unlimited depth. This is very useful in practice when the dataset has duplicated values and important feature interactions. In other cases a model with unconstrained depth will overfit immediately. It is recommended to start random forest depth around 7. The optimal depth of a random forest is usually higher than for Gradient Boosting, so do not hesitate to try 10, 20 or higher values.
  • max_features :
    same as the corresponding XGBoost parameter (column subsampling).
  • min_samples_leaf :
    a similar regularization parameter, analogous to min_child_weight in XGBoost and min_data_in_leaf in LightGBM.

Other

  • criterion :
    In my experience Gini is more often better, but sometimes Entropy works better.
  • random_state :
    the random seed parameter.
  • n_jobs : sets the number of cores to use. For some reason sklearn's RandomForest uses only one core by default. A short sketch with these settings follows below.
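A minimal sketch of these settings with scikit-learn; the toy dataset and the specific values are illustrative assumptions.

```python
# Sketch: RandomForest with the parameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=300,        # more trees do not overfit, they only cost time
    max_depth=None,          # or try 7, 10, 20, ...
    max_features='sqrt',     # column subsampling, like colsample in XGBoost
    min_samples_leaf=5,      # regularization, like min_data_in_leaf / min_child_weight
    criterion='gini',        # sometimes 'entropy' works better
    random_state=0,
    n_jobs=-1,               # use all cores; sklearn defaults to a single core
)
print(cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean())
```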

Hyperparameter tuning III

  • Neural nets
  • Pytorch, Tensorflow, Keras...
  • Linear models
  • SVM, logistic regression
  • Vowpal Wabbit, FTRL

Neural Nets

Here we discuss dense neural nets, i.e. networks containing only fully connected layers.


  • Number of neurons per layer
  • Number of layers
  • Optimizers
  • SGD + momentum
  • Adam / Adadelta / Adagrad / ..
    • In practice these lead to more overfitting
  • Batch size
  • Learning rate
  • Regularization
  • L2/L1 for weights
  • Dropout/Dropconnect
  • Static DropConnect

  • Suggestion: start simple, for example with one or two layers, and debug the code to make sure the training loss goes down.
  • Then try to find a configuration that is able to overfit, and only after that adjust the network.
  • A key part of training neural networks is the optimization method.
  • Adaptive optimization methods really do let you fit the data faster, but in my experience they can lead to severe overfitting. Plain SGD converges more slowly, but the trained model usually generalizes better. Adaptive methods are useful, but in settings other than classification and regression.
  • Batch size: in practice a large batch size leads to more overfitting. As a rule of thumb, a batch size of 500 can already be considered large. It is recommended to start with a value around 32 or 64; if the network still overfits, try decreasing the batch size, and increase it in the opposite case. The batch size should not be too small, otherwise the gradients become too noisy. After changing the batch size, the number of epochs may need to be re-tuned.
  • Learning rate: the learning rate should be neither too high nor too low, and the best learning rate depends on the other parameters. Usually one starts with a fairly large learning rate, e.g. 0.1, and then gradually decreases it. There is a rule of thumb: if you increase the batch size by a factor of alpha, you can also increase the learning rate by a factor of alpha.
  • Earlier, most people used L1 and L2 regularization on the weights. Nowadays most people use dropout. For me that means adding dropout right after the first few layers.
  • Static DropConnect: usually we connect the input to a dense layer with, say, 128 units. Instead, we change the first hidden layer to a very large one, e.g. 4096 units; for most competitions this is a huge network and it would overfit severely. To regularize it, we randomly drop 99% of the connections to this layer. This is a very strong regularization, and it has been shown to work. A minimal Keras sketch of a dense network with these ideas is shown below.
    [Figure: DropConnect illustration]
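As an illustration, here is a minimal sketch of a dense network in Keras with SGD + momentum and dropout; the layer sizes, dropout rate, batch size, learning rate and the toy data are illustrative assumptions, and ordinary dropout is used as a simple stand-in for DropConnect.

```python
# Sketch: a dense (fully connected) network with dropout, trained with SGD + momentum.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: 20 input features, binary target.
X = np.random.rand(2000, 20).astype('float32')
y = (X.sum(axis=1) > 10).astype('float32')

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(4096, activation='relu'),   # very wide first hidden layer
    layers.Dropout(0.5),                     # ordinary dropout as a stand-in for DropConnect
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

# Plain SGD with momentum usually generalizes better than adaptive optimizers.
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, batch_size=64, epochs=10, validation_split=0.2, verbose=0)
```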

Linear models

  • Scikit-learn
  • SVC/SVR
    • Sklearn wraps libLinear and libSVM
    • Compile yourself for multicore support
  • LogisticRegression/LinearRegression + regularizers
  • SGDClassifier/SGDRegressor

  • Vowpal Wabbit
  • FTRL

  • SVM requires almost no parameter tuning, which is its biggest benefit
  • The latest versions of libLinear and libSVM support multi-core processing, but the wrappers in Sklearn do not. So we need to compile these libraries ourselves to use this option.
  • Almost no one uses kernel SVC, so here we only discuss linear SVM.
  • For data that does not fit in memory, we can use Vowpal Wabbit, which implements online learning of linear models. It reads data line by line directly from the hard drive and never loads the whole dataset into memory, which allows learning on very large datasets.
  • Online learning of linear models with FTRL was especially popular some time ago; it is implemented in Vowpal Wabbit.

Linear models

  • Regularization parameter (C, alpha, lambda, ...)
  • Start with a very small value and increase it.
  • SVC starts to work slower as C increases
  • Regularization type
  • L1/L2/L1+L2 -- try each
  • L1 can be used for feature selection

  • C: for SVM I usually start with a very small value, such as $10^{-6}$, and multiply it by 10 each time. Start from small values, because the larger C is, the longer the training takes.
  • L1 or L2? The answer is to try both; in my opinion they are very similar. L1 has one extra benefit: it gives us sparse weights, which can be used for feature selection. A small scikit-learn sketch follows below.
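A minimal sketch of this C sweep with scikit-learn's LinearSVC and L1-regularized LogisticRegression; the toy data and the exact grid are illustrative assumptions.

```python
# Sketch: sweeping C from a very small value upward, multiplying by 10 each time.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for C in [10 ** p for p in range(-6, 3)]:          # 1e-6, 1e-5, ..., 1e2
    svm_score = cross_val_score(LinearSVC(C=C), X, y, cv=3).mean()
    # L1 penalty gives sparse weights, usable for feature selection.
    lr_l1 = LogisticRegression(C=C, penalty='l1', solver='liblinear')
    lr_score = cross_val_score(lr_l1, X, y, cv=3).mean()
    print(f'C={C:g}  LinearSVC={svm_score:.3f}  LogReg(L1)={lr_score:.3f}')
```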

Tips

  • Don't spend too much time tuning hyperparameters
  • Only if you don't have any more ideas or you have spare computational resources

  • Be patient
  • It can take thousands of rounds for GBDT or neural nets to fit.

  • Average everything
  • Over random seeds (see the sketch after this list)
  • Or over small deviations from the optimal parameters
    • e.g. average max_depth = 4, 5, 6 for an optimal 5
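A minimal sketch of averaging predictions over random seeds, here with a RandomForest; the number of seeds and the model are illustrative assumptions.

```python
# Sketch: average predicted probabilities over several random seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

preds = []
for seed in range(5):                       # several seeds, same hyperparameters
    model = RandomForestClassifier(n_estimators=300, random_state=seed, n_jobs=-1)
    model.fit(X_tr, y_tr)
    preds.append(model.predict_proba(X_te)[:, 1])

avg_pred = np.mean(preds, axis=0)           # the averaged prediction is usually more stable
```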


Source: www.cnblogs.com/ishero/p/11136374.html