The basic idea of machine learning tuning




The right way to think about tuning: the first step of model tuning is to identify the right objective: what are we trying to improve? Generally, this objective is some model evaluation metric. For random forest, for example, what we want to improve is the model's accuracy on unknown data (measured by score or oob_score_). Once this objective is set, we need to ask: what factors affect the model's accuracy on unknown
data? In machine learning, the measure of how well a model performs on unknown data is called the generalization error.
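For instance, here is a minimal sketch of reading both metrics in scikit-learn (the breast-cancer dataset is used purely as a stand-in for "unknown data"; any dataset works):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Any dataset works here; load_breast_cancer is just a convenient stand-in.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True lets the forest evaluate itself on out-of-bag samples.
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rfc.fit(X_train, y_train)

print("test-set accuracy (score):", rfc.score(X_test, y_test))
print("out-of-bag accuracy (oob_score_):", rfc.oob_score_)
```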

1. Generalization error

When the model performs poorly on unknown data (the test set or out-of-bag data), we say that the model does not generalize well: the generalization error is large and the model is not effective. The generalization error is driven by the structure (complexity) of the model. Look at the figure below, which depicts the relationship between generalization error and model complexity. When the model is too complex, it overfits and its ability to generalize is insufficient, so the generalization error is large. When the model is too simple, it underfits and its ability to fit the data is insufficient, so the error is again large. Only when the complexity of the model is just right can the generalization error be minimized.

[Figure: generalization error versus model complexity, a U-shaped curve with underfitting on the left and overfitting on the right]
What does model complexity have to do with our parameters? For tree models, the bushier the tree, the deeper its depth, and the more branches and leaves it has, the more complex the model. Tree models therefore naturally sit in the upper-right corner of the figure, and since a random forest is built from tree models, a random forest is also a naturally high-complexity model. The parameters of a random forest all work toward one goal: reducing the complexity of the model, moving it to the left of the figure, and preventing overfitting. Of course, tuning is never absolute; there are also random forests that start out on the left side of the figure, so before tuning we must first judge which side of the figure the model is currently on.

Behind the generalization error lies the "bias-variance dilemma", whose theory is quite involved. We only need to remember these four points:

  • 1) A model that is too complex or too simple will both raise the generalization error; what we pursue is the balance point in between.
  • 2) A model that is too complex overfits; a model that is too simple underfits.
  • 3) For tree models and tree ensembles, the deeper the trees and the more branches and leaves they have, the more complex the model.
  • 4) The goal for tree models and tree ensembles is to reduce model complexity and move the model toward the left of the figure.



2. Tuning sequence table

How does each specific parameter affect complexity and the model? We have always tuned parameters one at a time, finding the best value on a learning curve in the hope of pushing accuracy to a relatively high level. Now that we understand the direction of random forest tuning (reduce complexity), we can pick out the parameters that have a huge impact on complexity, study their monotonicity, and focus on tuning the ones that can reduce complexity the most. For parameters that are not monotonic, or that increase complexity, we use them case by case, and most of the time we can even leave them alone. When tuning, you can refer to the following order-of-impact table (a learning-curve sketch for the highest-impact parameter follows the table):

| Parameter | Effect on the model's performance on unknown data | Influence level |
| --- | --- | --- |
| n_estimators | Performance rises and then levels off as n_estimators increases; it does not affect the complexity of any single tree | ⭐⭐⭐⭐ |
| max_depth | Can move performance in either direction. By default depth is unlimited, i.e. the highest complexity; lowering max_depth reduces complexity, making the model simpler and moving it to the left of the figure | ⭐⭐⭐ |
| min_samples_leaf | Can move performance in either direction. The default minimum of 1 is the highest complexity; raising min_samples_leaf reduces complexity, making the model simpler and moving it to the left of the figure | ⭐⭐ |
| min_samples_split | Can move performance in either direction. The default minimum of 2 is the highest complexity; raising min_samples_split reduces complexity, making the model simpler and moving it to the left of the figure | ⭐⭐ |
| max_features | Can move performance in either direction. The default "auto" (the square root of the number of features) sits at intermediate complexity; lowering max_features makes the model simpler (moves it left), raising it makes the model more complex (moves it right). max_features is the only parameter that can make the model either simpler or more complex, so we must keep the tuning direction in mind when adjusting it | ⭐ |
| criterion | Can move performance in either direction; gini is generally used | Depends on the situation |
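As mentioned above, the usual way to tune the highest-impact parameter is to draw a learning curve over it. A minimal sketch for n_estimators (again using load_breast_cancer as a stand-in dataset; the parameter grid is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Learning curve over n_estimators: the score rises and then plateaus.
n_grid = list(range(10, 201, 10))
scores = []
for n in n_grid:
    rfc = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(rfc, X, y, cv=5).mean())

best_n = n_grid[scores.index(max(scores))]
print("best n_estimators:", best_n, "cv accuracy:", max(scores))

plt.plot(n_grid, scores)
plt.xlabel("n_estimators")
plt.ylabel("cross-validated accuracy")
plt.show()
```

The same loop, with a different grid, works for max_depth, min_samples_leaf and the other parameters in the table.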



3. Bias vs. variance

The generalization error E(f;D) of an ensemble model f on an unknown data set D is jointly determined by the variance (var), the bias, and the noise (ε).

[Formula: decomposition of the generalization error E(f;D) into bias, variance, and noise]
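In its standard form the decomposition reads as follows (a sketch of the usual formula; the original figure may use slightly different notation):

```latex
E(f;D) = \mathrm{bias}^2(x) + \mathrm{var}(x) + \varepsilon^2
```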

♦ The concepts of bias and variance

Bias: the difference between the model's predicted values and the true values, i.e. the distance from each red dot to the blue line. In an ensemble algorithm, every base estimator has its own bias, and the bias of the ensemble estimator is the mean of the biases of all base estimators. The more accurate the model, the lower the bias.

Variance: the error between each individual output of the model and the average of the model's predictions, i.e. the distance from each red dot to the red dashed line; it
measures the stability of the model. The more stable the model, the lower the variance.

Observe the figure below. Each point is the prediction produced by one base estimator of the ensemble algorithm. The red dashed line represents the average of these predictions, and the blue line represents the true values of the data.
[Figure: predictions of the individual base estimators (red dots), their average (red dashed line), and the true values (blue line)]
The bias measures whether the model's predictions are accurate: the smaller the bias, the more "precise" the model. The variance measures whether the model's individual predictions are close to one another: the smaller the variance, the more "stable" the model. Noise is the part that machine learning cannot control, so we will not study it here. A good model must predict most unknown data both "accurately" and "stably". In other words, when both the bias and the variance are low, the generalization error of the model is small and its accuracy on unknown data is high.
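To make the picture concrete, here is a small sketch (a toy regression with made-up data) that computes, for each sample, the gap between the average prediction of the base trees and the true signal (bias) and the spread of the base trees' predictions (variance):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # true signal + noise

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Each column: one base tree's predictions (the "red dots" for each sample).
per_tree = np.column_stack([tree.predict(X) for tree in forest.estimators_])

mean_pred = per_tree.mean(axis=1)        # the red dashed line
bias = mean_pred - np.sin(X).ravel()     # gap to the blue line (true signal)
variance = per_tree.var(axis=1)          # spread of the base estimators

print("mean |bias|:", np.abs(bias).mean())
print("mean variance:", variance.mean())
```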

|  | Large bias | Small bias |
| --- | --- | --- |
| Large variance | The model is not suited to this data; change the model | Overfitting; the model is complex; it predicts very accurately on some data sets and badly on others |
| Small variance | Underfitting; the model is relatively simple; predictions are stable but not accurate on any data | Small generalization error; this is our goal |

Generally speaking, if either the variance or the bias is large, the generalization error will be large. However, variance and bias trade off against each other; they cannot both reach their minimum at the same time. How should we understand this? Take a look at the figure below:

[Figure: bias and variance as functions of model complexity; bias falls and variance rises as complexity grows, and the total generalization error is minimized in between]
As the figure shows, when model complexity is high, the variance is high and the bias is low. A low bias means the model is required to predict "precisely". The model has to work harder to learn more information, which ties it to the training data: it performs well on one part of the data but poorly on another. The model generalizes poorly and behaves unstably on different data, so the variance is large. To learn the training set as thoroughly as possible, the model must capture more details, so its complexity inevitably grows. Therefore: high complexity, high variance, and a high total generalization error.

Conversely, when complexity is low, the variance is low and the bias is high. A low variance requires the model to predict "stably" and to generalize more strongly; to achieve this, the model does not need to learn the data in depth, it only needs to build a relatively simple, broad model. As a result, the model cannot reach high accuracy on any particular kind or set of data, so the bias will be large. Therefore: low complexity, high bias, and a high total generalization error.

Our tuning goal is to strike a balance between variance and bias. Although they cannot both be minimal at the same time, the generalization error they compose does have a minimum point, and that minimum is what we are looking for. For highly complex models we should reduce the variance; for relatively simple models we should reduce the bias. The base estimators of a random forest all have low bias and high variance, because a decision tree is itself a relatively "precise" prediction model that overfits easily, and the bagging method also requires each base classifier to have an accuracy above 50%. Therefore, the training process of bagging methods such as random forest aims to reduce variance, i.e. to reduce model complexity, and the default parameter settings of a random forest assume that the model itself sits to the right of the lowest point of the generalization error curve. So when we reduce complexity, we are essentially reducing the variance of the random forest, and all of the random forest's parameters work toward this goal of reducing variance.
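A quick way to see this variance reduction in practice is to compare the spread of cross-validation scores of a single decision tree with that of a random forest built from such trees (a sketch, again using the breast-cancer dataset as a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10
)

# The forest should score higher on average and vary less from fold to fold.
print("tree   mean=%.3f std=%.3f" % (tree_scores.mean(), tree_scores.std()))
print("forest mean=%.3f std=%.3f" % (forest_scores.mean(), forest_scores.std()))
```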



Source: blog.csdn.net/qq_45797116/article/details/113803466