Traditional Machine Learning Notes 5 - Random Forest

Preface

  In the last blog post we introduced the algorithm principle of the decision tree. If you are not familiar with it, you can go back and read: Traditional Machine Learning Notes 4 - Decision Tree. In this post we continue with traditional machine learning algorithms and look at the random forest. A random forest is a (parallel) ensemble algorithm composed of decision trees and belongs to the Bagging family. By combining multiple weak classifiers and voting or averaging their outputs, the overall model achieves high accuracy and generalization performance as well as good stability, and it is widely used in all kinds of business scenarios. As the name suggests, random forest has two main ingredients: randomness and a forest. One makes it resistant to overfitting, the other makes it more accurate. Before introducing the random forest algorithm, let's first look at what ensemble learning is.

1. Ensemble Learning

1.1. Ensemble learning

  When training a model, we train multiple individual learners and combine them with some combination strategy to form a stronger learner; this is ensemble learning. A popular analogy: a class may have one all-round student who is good at every subject, and several students who each excel in only one or a few subjects. We can let each of the specialized students take the exam they are best at and then gather all their answers together. This is the idea of ensemble learning, captured by the saying "three cobblers with their wits combined equal Zhuge Liang" (many ordinary minds together can outdo a single master).

1.2. Individual learner

  We mentioned the individual learner above, which is a concept defined relative to ensemble learning. The models we introduced before, such as decision trees, logistic regression, and Naive Bayes, are all individual learners. An individual learner is a single model, while ensemble learning combines multiple learners.

  • If the ensemble contains only individual learners of the same type, it is called a homogeneous (同质) ensemble, and the individual learners are called base learners (基学习器). For example, a random forest is an ensemble made entirely of decision trees.
  • If the ensemble contains individual learners of different types, it is called a heterogeneous (异质) ensemble, and the individual learners are called component learners (组件学习器). For example, an ensemble might include both decision trees and neural networks.

1.3. The core problem of ensemble learning

1.3.1. What kind of individual learner to use

  • Individual learners cannot be too weak and need to have a certain degree of accuracy.
  • There must be diversity among the individual learners, i.e., their predictions should differ.

1.3.2. How to choose an appropriate combination strategy to build a strong learner

  • Parallel combinations, such as random forests.
  • Serial (sequential) combination methods, such as boosting tree models.

1.4. Bagging

1.4.1. Bootstrap Sampling

  Random forest is a parallel model and the most famous representative of Bagging, the parallel family of ensemble learning methods, so let's introduce Bagging first. Before that, we need to understand something called bootstrap sampling (Bootstrap Sampling). What is bootstrap sampling? See the explanation below:

  • Given a data set containing m samples, we first randomly draw one sample and put it into the sampling set, then put the sample back into the original data set, so that it may be selected again in the next draw.
  • The above process is repeated m rounds, giving a sampling set of m samples. Some samples of the initial training set appear multiple times in the sampling set, while others never appear. About 63.2% of the samples appear in the sampling set, and the roughly 36.8% that never appear can be used as a validation set for an out-of-bag (OOB) estimate of the subsequent generalization performance, as the sketch below illustrates.
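As a quick sanity check of the 63.2% / 36.8% figures, here is a minimal Python sketch (the helper `bootstrap_sample`, the toy data, and the seed are illustrative choices, not from the original post):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw m samples with replacement from a dataset of size m."""
    m = len(X)
    idx = rng.integers(0, m, size=m)          # sample indices with replacement
    oob_mask = ~np.isin(np.arange(m), idx)    # samples never drawn = out-of-bag
    return X[idx], y[idx], oob_mask

rng = np.random.default_rng(42)
X = rng.normal(size=(10000, 5))               # toy data: 10000 samples, 5 features
y = rng.integers(0, 2, size=10000)

X_boot, y_boot, oob_mask = bootstrap_sample(X, y, rng)
print("out-of-bag fraction:", oob_mask.mean())  # close to 0.368, i.e. (1 - 1/m)^m -> 1/e
```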

1.4.2. Bagging

  Bagging is an abbreviation of Bootstrap aggregating, and it is built on top of Bootstrap Sampling. We repeat the sampling process above T times to obtain T sampling sets, each containing m training samples, then train a base learner on each sampling set and combine these base learners. When combining the predicted outputs, Bagging typically uses simple voting for classification tasks and simple averaging for regression tasks. This is the basic flow of Bagging, as shown in the figure below:
[Figure: the Bagging workflow, i.e. bootstrap sampling, one base learner per sampling set, and aggregation of their outputs]
  From the perspective of the bias-variance decomposition, Bagging mainly focuses on reducing variance, so it is especially effective for learners that are sensitive to sample perturbations, such as unpruned decision trees and neural networks.
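To make the Bagging flow concrete, here is a hedged scikit-learn sketch using `BaggingClassifier` with decision trees as base learners and out-of-bag scoring. The synthetic data and parameter values are arbitrary demonstration choices, and the `estimator` argument name assumes scikit-learn >= 1.2 (older versions call it `base_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# T = 50 base learners, each trained on a bootstrap sample the same size as X
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner: an unpruned decision tree
    n_estimators=50,
    bootstrap=True,      # bootstrap sampling of the training set
    oob_score=True,      # estimate generalization from the out-of-bag samples
    random_state=0,
)
bagging.fit(X, y)
print("OOB estimate of accuracy:", bagging.oob_score_)
```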

2. Random Forest

  With the preliminaries covered, we can now formally study the random forest algorithm.

2.1. Random Forest

  Random forest, abbreviated RF (Random Forest), is an optimized version of Bagging built on tree models. Its core idea is still Bagging, but RF makes some unique improvements, namely using CART decision trees as the base learners. The random forest algorithm is implemented as follows:

  1. The input is the sample set $D=\{(x_1, y_1),(x_2, y_2),\ldots,(x_m, y_m)\}$.
  2. For $t = 1, 2, \ldots, T$:
  • Perform the $t$-th round of random sampling on the training set, drawing $m$ times in total, to obtain a sampling set $D_t$ containing $m$ samples.
  • Use the sampling set $D_t$ to train the $t$-th decision tree model $D_t(x)$. When training each node of the decision tree, randomly select a subset of the features available at that node, and choose the optimal feature within this subset to split the node into left and right subtrees.
  3. In a classification scenario, the category that receives the most votes from the $T$ base models (decision trees) is the final category.

The process is illustrated in the figure below:
[Figure: the random forest training process, i.e. bootstrap sampling, per-tree training with random feature selection, and voting over the tree outputs]
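As a minimal illustration of the procedure above (not the original author's code), scikit-learn's `RandomForestClassifier` already implements these steps; the dataset and parameter values below are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# T = 100 trees; max_features="sqrt" gives the per-node random feature subset
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # randomly select a subset of features at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))  # prediction = majority vote over the trees
```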

2.2. Characteristics of Random Forest

  As the name suggests, random forest has two defining characteristics: randomness + forest. The randomness is mainly reflected in the following two aspects:
Sample perturbation:

Based directly on bootstrap sampling (Bootstrap Sampling), roughly 63.2% of the samples in the initial training set appear in any given sampling set. This introduces differences among the data sets used to train the base learners.

Attribute perturbation:

In a random forest, for each node of a base decision tree, we first randomly select k attributes from the node's feature set, and then choose the optimal attribute among these k for splitting. This extra layer of randomness also introduces differences among the base models.

The ensemble aspect is mainly reflected in:
  multiple (diverse) decision trees are trained on multiple (diverse) sampling sets, and simple voting or averaging is used to improve the model's stability and generalization ability. The toy sketch below puts both sources of randomness and the voting step together.
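The following sketch is my own illustration, not from the original post: it reproduces sample perturbation (bootstrapping the rows) and attribute perturbation (each tree picks a random feature subset at every split), then aggregates the trees with a majority vote:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# A toy "random forest by hand"; a real implementation would also track OOB samples etc.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                                     # T = 25 base learners
    idx = rng.integers(0, len(X), size=len(X))          # sample perturbation: bootstrap rows
    tree = DecisionTreeClassifier(
        max_features="sqrt",                            # attribute perturbation per split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])         # shape: (T, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the ensemble:", (majority == y).mean())
```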

Random Forest Decision Boundary Visualization

  Below is the result of classifying the same data set (the iris data set) with a single decision tree and with random forests containing different numbers of trees, with the decision boundaries visualized. [Figure: decision boundaries of a decision tree and of random forests with increasing numbers of trees on the iris data set]
  From the figure it is clear that as the number of decision trees in the random forest increases, the generalization ability of the model improves and the decision boundary becomes smoother, i.e., the model becomes more robust. A sketch of how such a comparison could be produced follows.
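This is a hedged sketch of how such a comparison could be reproduced; the plotting details, the choice of two iris features, and the tree counts are my own assumptions, not taken from the original figure:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                   # keep two features so the boundary is 2D

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "RF (5 trees)": RandomForestClassifier(n_estimators=5, random_state=0),
    "RF (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
}

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
fig, axes = plt.subplots(1, len(models), figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)          # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)    # training points
    ax.set_title(name)
plt.show()
```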

2.3. Advantages and disadvantages of random forest algorithm

Advantages:

  • It works well on high-dimensional (many-feature) dense data, without requiring dimensionality reduction or feature selection.
  • The process of building a random forest model also helps determine the importance of features (see the sketch after this list).
  • Composite features can be built with the help of the model.
  • Parallel ensembling effectively controls overfitting.
  • It parallelizes easily in engineering practice and trains quickly.
  • It is friendly to imbalanced data sets and can balance the errors.
  • It is robust even when some feature values are missing and can maintain good accuracy.
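As a brief illustration of the feature-importance point in the list above (a sketch using the iris data set; the impurity-based `feature_importances_` attribute is standard scikit-learn, but the rest is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity-based importances accumulated while building the trees
for name, score in sorted(zip(data.feature_names, rf.feature_importances_),
                          key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```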

Disadvantages:

  • Overfitting may still occur on noisy classification and regression datasets.
  • Compared to a single decision tree, model interpretation is a bit more complicated due to its randomness.

2.4. Parameters affecting random forest and their tuning

  Above we have systematically covered the principles and mechanisms of random forests. Next, let's look at some key points in engineering practice, such as the many tunable parameters of the random forest model, what effect they have, and how to tune them.

2.4.1. Influencing parameters

The number of features considered when growing a single decision tree, max_features:

  • Increasing max_features generally improves the performance of a single decision tree model, but reduces tree-to-tree variability and may slow down the algorithm.
  • Too small max_features will affect the performance of a single tree, thereby affecting the overall integration effect.
  • The best max_features need to be properly balanced and selected.

The number of decision trees, n_estimators:

  • More subtrees give the model better stability and generalization ability, but also slow down training.
  • We choose a somewhat larger number of subtrees if the computing resources allow it.

The maximum depth of the decision trees, max_depth:

  • If the tree depth is too large, each subtree may overfit its sample, so the ensemble may run into overfitting problems.
  • If the data has many samples and features, we limit the maximum tree depth to improve the generalization ability of the model.

2.4.2. Parameter tuning

The maximum number of features RF considers when splitting, max_features:

  • Expressed as a fraction of the total number of features; a common range is [0.5, 0.9].

The number of decision trees, n_estimators:

  • It may be set to a value >50, which can be adjusted according to computing resources.

Decision tree maximum depth max_depth:

  • Common choices are between 4-12.

The minimum number of samples required to split an internal node, min_samples_split:

  • If the sample size is not large, there is no need to adjust this value.
  • If the sample size is of very large order of magnitude, we might set this value to 16, 32, 64, etc.

The minimum number of samples in a leaf node, min_samples_leaf:

  • To improve generalization, we may set this value > 1. A tuning sketch covering the parameters above follows.
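Here is a hedged sketch of how these rules of thumb could be searched automatically with scikit-learn's `GridSearchCV`; the grid below simply mirrors the ranges above, and the synthetic data is an arbitrary stand-in, so none of it is prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],      # more trees: more stable, but slower
    "max_features": [0.5, 0.7, 0.9],     # fraction of features considered per split
    "max_depth": [4, 8, 12],             # limit depth to control overfitting
    "min_samples_split": [2, 16, 32],
    "min_samples_leaf": [1, 2, 4],
}
# Exhaustive search over this grid is already 243 combinations x 5 folds;
# RandomizedSearchCV is a cheaper alternative for larger grids.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```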

  This concludes the introduction to random forests. Comments and corrections are welcome.
