Common problems in machine learning (1)

This article collects some common questions that come up in machine learning, covering basic concepts, model selection in different scenarios, and the reasoning behind some common results, along with ideas for improvement.

1. Is it necessary to do a lot of feature engineering for xgboost?

Feature engineering is a broad concept that includes feature selection, feature transformation, feature synthesis, feature extraction, and so on.

xgboost handles feature selection well on its own, so we do not have to worry too much about this part.

As for feature transformation (discretization, normalization, standardization, taking logs, etc.), we do not need to do much of it, because xgboost is built on decision trees, and decision trees handle these naturally: a tree-based model does not rely on linearity assumptions, whereas linear models often require discretization or log transforms. For categorical features, however, xgboost needs them to be one-hot encoded; otherwise the model cannot be trained on them.

xgboost also spares us some feature synthesis work: an interaction term such as a:b in linear regression is captured automatically by tree models. However, synthesized features such as the sum a+b, the difference a-b, or the ratio a/b still need to be constructed manually.

One step that most models cannot automate is feature extraction. Many NLP or image problems come with no ready-made features, so these features have to be extracted manually.

To sum up, xgboost does save many feature engineering steps compared with linear models, but feature engineering is still very much necessary.
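
As a small illustration of the one-hot encoding point above, a minimal sketch in Python (the toy data and column names are made up):

import pandas as pd
from xgboost import XGBClassifier

# Toy data with one categorical column (purely illustrative)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["beijing", "shanghai", "beijing", "shenzhen", "shanghai", "beijing"],
    "label": [0, 1, 0, 1, 1, 0],
})

# One-hot encode the categorical feature before handing the data to xgboost
X = pd.get_dummies(df[["age", "city"]], columns=["city"], dtype=int)
y = df["label"]

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)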

2. When should I use LASSO and when should I use Ridge?

Both are forms of regularization. LASSO penalizes the L1 norm of the regression coefficients, while Ridge penalizes the squared L2 norm. According to the classic textbook by Hastie, Tibshirani, and Friedman (The Elements of Statistical Learning), if the model has many variables that each have a small effect, use Ridge; if only a few variables have a large effect on the model, use LASSO.

LASSO can make the coefficients of many variables 0 (equivalent to dimensionality reduction), but Ridge cannot.

Because Ridge is faster to compute, it tends to be preferred when the amount of data is very large.

The most versatile method is to try both LASSO and Ridge, and compare the results of Cross Validation between the two.

If there are many multicollinear variables, Ridge works better than LASSO.

As a final addition, you can also try a mix of the two - Elastic Net.
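
A minimal sketch of the "try both and cross-validate" advice, using scikit-learn on synthetic data (the dataset and the alpha / l1_ratio grids are only illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

# Sparse ground truth: only a few of the 50 features actually matter
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)

models = {
    "lasso": LassoCV(cv=5),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "elastic_net": ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8]),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, round(scores.mean(), 3))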

3. How to simply understand regularization?

From the perspective of model complexity, regularization is a concrete implementation of Occam's razor: it reduces the complexity of the model while keeping its predictive ability as unchanged as possible.

4. Why does the L2 norm make the model simpler?

A set of Stanford lecture notes makes the point that, when the features are all standardized, larger regression coefficients mean a more complex model. So when a coefficient is driven to 0 the complexity of the model decreases, and when coefficients merely shrink the complexity also decreases, which is exactly what the L2 penalty does.
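
A tiny sketch of this shrinkage effect (synthetic data, illustrative alpha values): as the L2 penalty alpha grows, the norm of the standardized coefficients shrinks toward zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize so coefficient sizes are comparable

for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, round(np.linalg.norm(coef), 2))  # the coefficient norm decreases as alpha grows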

5. An example where two variables are uncorrelated but not independent?

Suppose X is a random variable following the standard normal distribution, and another random variable Y satisfies Y = X^2. Then their covariance is

Cov(X, Y) = E(XY) − E(X)E(Y) = E(X^3) − E(X)E(X^2) = 0 − 0 × E(X^2) = 0

The covariance is 0, so X and Y are uncorrelated. But clearly X and Y are not independent: knowing X determines Y completely.
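
A quick numerical sanity check of this example (a simulation sketch only; the sample correlation will only be approximately zero):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2

# Sample correlation between X and Y is close to 0, consistent with Cov(X, Y) = 0
print(np.corrcoef(x, y)[0, 1])

# Yet Y is a deterministic function of X, so they are clearly not independent:
# P(Y > 1 | |X| > 1) = 1, while P(Y > 1) is only about 0.32
print(np.mean(y[np.abs(x) > 1] > 1), np.mean(y > 1))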

6. How does xgboost achieve regularization?

The objective function of xgboost is the loss function plus a penalty term. As can be seen from the formula below, the more complex the tree, the heavier the penalty.

Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)

The complexity of a tree is defined as follows:

Ω(f) = γT + (1/2) λ Σ_j w_j^2, where T is the number of leaf nodes and w_j are the leaf scores.

The more leaf nodes a tree has, and the larger its leaf scores, the more complex the tree and the heavier the penalty.
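
In the XGBClassifier interface, these quantities map (roughly) onto the gamma parameter (the per-leaf penalty γ), reg_lambda (the L2 penalty λ on leaf scores), and reg_alpha (an additional L1 penalty on leaf scores). A minimal sketch with illustrative values:

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    gamma=1.0,       # penalty per extra leaf: the gamma * T term
    reg_lambda=1.0,  # L2 penalty on leaf scores: the lambda * sum(w_j^2) term
    reg_alpha=0.0,   # optional L1 penalty on leaf scores
)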

7. Why is lightgbm faster than xgb?

LightGBM adopts Gradient-based One-Side Sampling (GOSS).

When filtering data samples to find split values, LightGBM uses this new technique, GOSS, whereas XGBoost finds the optimal split through a pre-sorted algorithm and a histogram algorithm. In AdaBoost, the sample weights are a good indicator of sample importance. In gradient boosting decision trees (GBDT), however, there are no natural sample weights, so AdaBoost's sampling method cannot be applied directly, which is why a gradient-based sampling method is needed.

The gradient characterizes the slope of the tangent of the loss function, so it is natural to reason that if a data point's gradient is large in some sense, that sample is important for finding the optimal split point, because its loss is higher.

GOSS keeps all examples with large gradients and randomly samples the examples with small gradients. For example, suppose there are 500,000 rows of data, of which 10,000 rows have large gradients. The algorithm then keeps those 10,000 rows and randomly selects x% of the remaining 490,000 rows. If x is 10%, the split value is determined from 59,000 rows selected out of the 500,000. The basic assumption here is that training examples with small gradients have small training error and are already well learned. To keep the data distribution unchanged, GOSS multiplies the small-gradient samples by a constant factor when computing the information gain. GOSS therefore strikes a good balance between reducing the number of data samples and preserving the accuracy of the learned decision trees. The explanation above is adapted from Synced (Heart of the Machine); the original paper is linked below.

https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
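
The sampling step itself can be sketched in a few lines. The following is only a simplified illustration of the idea (keep the top fraction a of rows by gradient magnitude, sample a fraction b of the rest, and up-weight the sampled small-gradient rows by (1 - a) / b); it is not LightGBM's actual implementation:

import numpy as np

def goss_sample(gradients, a=0.02, b=0.1, seed=0):
    # Simplified GOSS-style sampling: keep all large-gradient rows,
    # randomly sample a fraction b of the remaining (small-gradient) rows.
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]   # indices sorted by |gradient|, descending
    top_k = int(a * n)
    large = order[:top_k]
    rest = order[top_k:]
    small = rng.choice(rest, size=int(b * len(rest)), replace=False)

    weights = np.ones(n)
    weights[small] = (1 - a) / b                  # compensate so the data distribution is preserved
    selected = np.concatenate([large, small])
    return selected, weights[selected]

grads = np.random.default_rng(1).normal(size=500_000)
idx, w = goss_sample(grads)                       # about 10,000 + 49,000 = 59,000 rows out of 500,000
print(len(idx))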

8. How to adjust parameters in xgboost?

Take XGBClassifier in Python as an example. First, decide which parameters to tune; the following are the more important and commonly used ones.

max_depth: the maximum depth of each tree. Too small underfits, too large overfits. Typical values are 3 to 10.

learning_rate: the learning rate, i.e., the step size of the gradient-boosting updates. If it is too small, training is slow and more trees are needed; if it is too large, the model may fail to converge well. Usually between 0.0001 and 0.1.

n_estimators: the number of trees. More is not always better; usually between 50 and 1000.

colsample_bytree: the fraction of features used when training each tree. 1 means all features, 0.5 means half of them.

subsample: the fraction of samples used when training each tree. Similar to the above, 1 means the full sample and 0.5 means half of it.

reg_alpha: L1 regularization weight. Used to prevent overfitting. Usually between 0 and 1.

reg_lambda: L2 regularization weight. Used to prevent overfitting. Usually between 0 and 1.

min_child_weight: the minimum (weighted) sum of instance weights required in a child node. Setting it above 1 has a pruning effect and helps prevent overfitting.

The above is for reference only; usually, adjusting just a few of these parameters is enough for the model to achieve good results, as in the sketch below.
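
Putting the parameters above together, a starting configuration might look like this (the values are illustrative, not recommendations, and X_train / y_train are assumed to exist):

from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=5,
    learning_rate=0.05,
    n_estimators=300,
    colsample_bytree=0.8,
    subsample=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    min_child_weight=3,
)
# model.fit(X_train, y_train)  # training data assumed to already be prepared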

The next step is to optimize these parameters; the most commonly used method is Grid Search. Say we want to optimize max_depth and learning_rate, with candidate values [3, 4, 5, 6, 7] for max_depth and [0.0001, 0.001, 0.01, 0.1] for learning_rate. We then try every possible combination of the two parameters (5 * 4 = 20 combinations in total), evaluate each combination with cross-validation, and finally select the best of the 20.
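
These 20 combinations can be searched with scikit-learn's GridSearchCV; a minimal sketch on stand-in data (the synthetic dataset and the scoring metric are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data

param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
}

# 5 * 4 = 20 combinations, each evaluated with 5-fold cross-validation
search = GridSearchCV(XGBClassifier(n_estimators=100), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)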

The idea of Grid Search is simple and easy to implement, but the drawback is that with many parameters the number of combinations to traverse explodes. For example, with 5 parameters to optimize and 5 candidate values each, there are 5*5*5*5*5 = 3,125 different combinations to try, which is very time-consuming.

Another common method, Random Search, addresses this problem. Random Search randomly selects k combinations from all possible combinations, runs cross-validation on each, compares their performance, and picks the best one. Although Random Search is not guaranteed to match an exhaustive Grid Search, it usually achieves comparable results at a much smaller cost. In practice, especially for high-dimensional parameter tuning, Random Search is therefore often the more practical choice.
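
The random counterpart in scikit-learn is RandomizedSearchCV, which evaluates only n_iter randomly drawn combinations instead of the full grid; a sketch along the same lines as above (the distributions are illustrative):

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data

param_distributions = {
    "max_depth": randint(3, 11),             # integers 3..10
    "learning_rate": uniform(0.001, 0.099),  # continuous range 0.001..0.1
    "subsample": uniform(0.5, 0.5),          # continuous range 0.5..1.0
}

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=100),
    param_distributions,
    n_iter=20,           # only 20 random combinations, however large the full grid would be
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)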

It is worth noting that this article only represents the views of some of its contributors. If you have other ideas, you are welcome to discuss and share them.
