Random Forest Summary

Focus: Bagging + Decision Tree = Random Forest

1. Algorithm principle: Random forest is a parallel ensemble learning method based on bagging that can be used for both classification and regression. A random forest is a classifier made up of multiple decision trees, and its output class is the mode of the classes output by the base learners.
Using the bootstrap resampling technique, N samples are repeatedly drawn at random, with replacement, from the original training set of N samples to form a new training set; k classification trees are then grown on these bootstrap samples to form the random forest.
When constructing the i-th decision tree, m features (usually log2(d) + 1, where d is the total number of features) are randomly selected at each node as the candidate features for that split. The class of a new sample is determined by the votes of the classification trees. In effect, this is an improvement on the decision tree algorithm: multiple decision trees are combined, and each tree is built on an independently drawn bootstrap sample. The classification ability of a single tree may be weak, but after randomly generating a large number of decision trees, a test sample is assigned the most likely class by aggregating the predictions of all the trees.
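A minimal sketch of the "mode of the base learners" rule (Python is used here for illustration; the function name and labels are made up for the example, not taken from the post):

from collections import Counter

def majority_vote(tree_predictions):
    # tree_predictions: class labels predicted by the k trees for one test sample.
    # The forest outputs the mode, i.e. the most common class among the votes.
    return Counter(tree_predictions).most_common(1)[0][0]

print(majority_vote(["A", "B", "A", "A", "C"]))  # -> "A"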

(The construction of a random forest involves two kinds of randomness: row sampling and column sampling, i.e. the random selection of samples and the random selection of candidate features.

Randomness of the samples: randomly draw n samples from the sample set with the bootstrap (sampling with replacement);

Randomness of the features: randomly select K attributes out of all attributes, pick the best of them as the splitting attribute at the node, and grow a CART decision tree. A small sketch of both follows.)
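A hedged sketch of these two kinds of randomness (NumPy is assumed; the array sizes and variable names are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))        # 100 samples, d = 16 features
n, d = X.shape

# Row sampling: bootstrap n samples with replacement
boot_idx = rng.integers(0, n, size=n)
X_boot = X[boot_idx]

# Column sampling: at a node, draw K candidate features, e.g. K = log2(d) + 1
K = int(np.log2(d)) + 1
candidate_features = rng.choice(d, size=K, replace=False)
print(K, candidate_features)          # the columns examined when choosing this split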


2. Advantages: Random forest is simple, easy to implement, has low computational overhead, and performs strongly, so it is widely used.
(1) As an ensemble method, it achieves high prediction accuracy
(2) The introduced randomness makes random forests hard to overfit and gives them good robustness to noise
(3) It can handle very high-dimensional data without prior feature selection
(4) It can handle both discrete and continuous data; tree models consider only one variable at each split, so the dataset does not need to be normalized
(5) Training is fast, and a ranking of variable importance can be obtained (see the sketch after this list)
(6) The implementation is relatively simple, highly parallelizable, and easy to distribute
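A hedged illustration of points (5) and (6), assuming scikit-learn (the post itself does not name a library, and the dataset here is only an example): variable importance ranking and parallel training via n_jobs.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)  # n_jobs=-1: train trees in parallel
rf.fit(data.data, data.target)

# Variable importance ranking, highest first
for name, score in sorted(zip(data.feature_names, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")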
3. Disadvantages:
(1) It ignores correlations between attributes
(2) Random forests can still overfit on some noisy classification or regression problems;
(3) For data whose attributes take different numbers of values, attributes with more distinct values have a greater influence on the random forest. (How can this be fixed? The information gain ratio can be used; see the formula after this list.)
(4) When the number of decision trees in the forest is large, training requires more space and time.
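As a reference for point (3): the information gain ratio normalizes the information gain of an attribute a by its intrinsic value IV(a), which penalizes attributes with many distinct values (the standard C4.5 definition):

\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad
\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^{v}|}{|D|} \log_2 \frac{|D^{v}|}{|D|}

where D^v is the subset of samples in D that take value v on attribute a, and V is the number of distinct values of a.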
4. The construction process of a random forest:
(1) From the original training set, use bootstrapping to randomly draw m samples with replacement; repeat this n_tree times to generate n_tree training sets
(2) Train one decision tree model on each of the n_tree training sets, giving n_tree trees
(3) For a single decision tree, assuming the training samples have n features, each split selects the best feature for splitting according to the information gain, information gain ratio, or Gini index
(4) Each tree is split in this way until the training examples at a node all belong to the same class; no pruning is performed while the decision tree is grown

(5) The generated decision trees together form the random forest. For classification problems, the final result is decided by a vote of the tree classifiers; for regression problems, the final prediction is the mean of the predictions of the individual trees. A minimal sketch of these steps follows.
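A minimal sketch of steps (1)-(5) for classification, using scikit-learn decision trees as the base learners (the library choice and names such as fit_forest are assumptions for illustration, not from the post): bootstrap each training set, grow unpruned trees with a random feature subset per split, then take a majority vote.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_tree=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_tree):
        idx = rng.integers(0, n, size=n)                     # (1) bootstrap sample with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")   # (3) random feature subset at each split
        tree.fit(X[idx], y[idx])                             # (2)/(4) grow the tree fully, no pruning
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    all_preds = np.array([t.predict(X) for t in trees])      # shape (n_tree, n_samples)
    # (5) majority vote over the trees for each test sample
    return np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])

For regression, the same loop would use a regression tree and predict_forest would return the mean of the trees' predictions instead of a vote.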

5. Random forest tuning parameters:

https://www.cnblogs.com/gczr/p/7141712.html
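Complementing the link above, here is a hedged sketch of searching the usual random forest hyperparameters with cross-validation (the parameter grid, values, and dataset are illustrative choices, not recommendations taken from that post):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {
    "n_estimators": [100, 300],        # number of trees
    "max_features": ["sqrt", "log2"],  # candidate features per split
    "max_depth": [None, 8],            # maximum tree depth
    "min_samples_leaf": [1, 5],        # minimum samples in a leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)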

(A side note on some differences:

What is the difference between a tree model and a linear model? The most important one is that a tree model processes one feature at a time, whereas a linear model weights all features together to produce a new value.

The classification difference between decision trees and logistic regression also lies here. Logistic regression maps all features to a probability and assigns samples above a certain probability threshold to one class and the rest to the other; a decision tree instead partitions on one feature at a time. Moreover, logistic regression can only find a linear split (linear between the input features x and the logit, unless x is mapped into a higher-dimensional space), while decision trees can find nonlinear splits.

Tree models are closer to the human way of thinking, can generate visual classification rules, and the resulting model is interpretable (rules can be extracted). The function fitted by a tree model is in fact a step function over the partitions. The sketch below illustrates the linear vs. nonlinear point.)
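An illustrative sketch of that last point (the XOR-style dataset and default model settings below are chosen purely for demonstration): logistic regression cannot separate these classes with a single linear split, while a decision tree, partitioning one feature at a time, fits the training data almost perfectly.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like labels: not linearly separable

print("logistic regression:", LogisticRegression().fit(X, y).score(X, y))    # close to 0.5
print("decision tree:      ", DecisionTreeClassifier().fit(X, y).score(X, y))  # close to 1.0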

References:

https://blog.csdn.net/qq547276542/article/details/78304454

http://www.cnblogs.com/liuwu265/p/4690486.html

https://baike.baidu.com/item/%E9%9A%8F%E6%9C%BA%E6%A3%AE%E6%9E%97/1974765?fr=aladdin

https://www.cnblogs.com/fionacai/p/5894142.html

https://www.cnblogs.com/gczr/p/7141712.html

Thanks also to my boyfriend, whose summary I used as a reference ⁄(⁄ ⁄•⁄ω⁄•⁄ ⁄)⁄
