Random Forest (RF)


Random forest is essentially a special case of bagging that uses decision trees as the base model.

Random forest is an ensemble of decision trees, but it introduces two kinds of randomness:

(1) Randomness in sampling: from a data set containing m samples, sampling with replacement produces a bootstrap sample, also of m samples, that is used for training. This ensures that the training samples of the individual decision trees are not exactly the same.

First, draw samples with replacement from the original data set to construct a sub-data set whose size equals that of the original data set; elements may be repeated within a sub-data set and across different sub-data sets. Second, build a sub-decision tree on each sub-data set; given an input, each sub-decision tree outputs a result. Finally, to classify new data with the random forest, the predictions of the sub-decision trees are combined by voting, and the majority vote is the output of the random forest.

(2) Randomness in feature selection: the n candidate features considered by each decision tree are randomly selected from all features (n is a hyperparameter that we tune ourselves).

Each split of a sub-tree in the random forest does not consider all candidate features; instead, a subset of features is drawn at random, and the optimal feature is then chosen from that random subset. In this way the decision trees in the forest differ from one another, which increases the diversity of the ensemble and thereby improves classification performance.
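The two sources of randomness can be sketched in a few lines; simple_random_forest and forest_predict below are only illustrative helper names, and DecisionTreeClassifier with max_features="sqrt" stands in for the per-split feature subsampling:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, n_trees=20, seed=0):
    """Sketch of the two sources of randomness: bootstrap sampling plus
    a random feature subset at every split (max_features='sqrt')."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # (1) sampling with replacement: a bootstrap sample as large as X
        idx = rng.integers(0, len(X), size=len(X))
        # (2) feature randomness: each split considers only sqrt(n_features) features
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    # each sub-decision tree votes; the majority vote is the forest's output
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

RandomForestClassifier, used in the example that follows, performs both steps internally.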

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)

rfc = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0)
# n_estimators: the number of trees in the forest
rfc = rfc.fit(Xtrain, Ytrain)
score_r = rfc.score(Xtest, Ytest)
rfc.predict(Xtest)        # predicted class labels
rfc.predict_proba(Xtest)  # predicted probability of each class

print("Random Forest:{}".format(score_r))

The probability that a given sample is drawn into a particular bootstrap set is (1 minus the probability of not being drawn in any of the n draws):
$1-(1-\frac{1}{n})^n$

$\displaystyle\lim_{n\to\infty}\left(1-\left(1-\frac{1}{n}\right)^n\right)=1-\frac{1}{e}\approx 0.632$
so that a bootstrap set contains, on average, about 63% of the original data. Roughly 37% of the training data never participates in building a given tree; these samples are called out-of-bag (oob) data. Besides the test set split off at the beginning, the out-of-bag data can also be used as a test set for the ensemble. In other words, when using a random forest we may skip splitting off a test set and simply evaluate the model on its out-of-bag data.
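A quick numeric check of this limit (a minimal sketch in plain Python):

# probability that a given sample appears in a bootstrap sample of size n
for n in (10, 100, 1000, 100000):
    print(n, 1 - (1 - 1 / n) ** n)
# tends to 1 - 1/e ≈ 0.632 as n grows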

rfc = RandomForestClassifier(n_estimators=25, oob_score=True)  # oob_score defaults to False
rfc = rfc.fit(wine.data, wine.target)
# important attribute oob_score_: the model's score on the out-of-bag data
rfc.oob_score_
# if n (the sample size) and n_estimators are both small, it may happen that no sample
# ever falls out of bag, in which case oob data cannot be used to evaluate the model

Advantages:

  1. It performs well on many data sets, and the two sources of randomness make the random forest unlikely to overfit;

  2. It can handle high-dimensional data without feature selection;

  3. Strong adaptability to data sets: it can handle both discrete and continuous data, and the data do not need to be standardized;

  4. After training, it can report which features are more important (feature importances); see the sketch after this list;

  5. Training is fast and easy to parallelize;

  6. The two sources of randomness give the random forest good robustness to noise;

  7. It implicitly creates joint features and can solve nonlinear problems;

  8. It comes with a built-in out-of-bag (oob) error estimate.
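To illustrate advantages 4 and 5, here is a short sketch on the wine data using feature_importances_ and parallel training via n_jobs (the particular settings are just examples):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
# n_jobs=-1 trains the trees in parallel on all CPU cores (advantage 5)
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rfc.fit(wine.data, wine.target)
# feature_importances_ gives the importance of each feature (advantage 4)
for name, imp in sorted(zip(wine.feature_names, rfc.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")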

Disadvantages:

  1. Overfitting can still occur on noisy classification or regression problems;

  2. For data whose attributes have different numbers of levels, attributes with more levels exert a larger influence on the random forest, so the attribute importance values RF produces on such data are unreliable;

  3. Not suitable for small samples, only suitable for large samples;

  4. Accuracy can be comparatively low;

  5. It suits problems whose decision boundaries are axis-aligned (rectangular) and handles diagonal decision boundaries poorly.

Condition:

  1. The base classifiers of the random forest should be independent of and different from each other.

Comparison of RF and GBDT

Similarities:

  1. Both are composed of multiple trees;
  2. The final result is jointly determined by the multiple trees;

Differences:

  1. RF is based on the bagging idea, while GBDT is based on the boosting idea; that is, the sampling methods differ;

  2. RF can be generated in parallel, while GBDT can only be serial;

  3. For the output, RF takes a majority vote, while GBDT accumulates the outputs of all trees;

  4. RF is not sensitive to outliers, GBDT is sensitive

  5. Which is more prone to overfitting, GBDT or RF? Answer: RF, because each decision tree in a random forest tries to fit the data set fully, which carries a latent risk of overfitting; the decision trees in boosting-style GBDT instead fit the residuals of the data set, the residuals are then updated, and a new tree fits the new residuals. Although this is slow, it is hard to overfit.

  6. The trees that make up an RF can be classification trees or regression trees, while GBDT is composed only of regression trees;

  7. RF treats all training samples equally, while GBDT is a weight-based ensemble of weak classifiers;

  8. RF improves performance mainly by reducing the variance of the model, while GBDT improves performance mainly by reducing its bias; a side-by-side sketch of the two follows this list.
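A side-by-side sketch on the same wine data; GradientBoostingClassifier is used here as scikit-learn's GBDT, and the hyperparameters are just illustrative defaults:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
# RF: bagging; independent trees, majority vote, mainly reduces variance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# GBDT: boosting; sequential trees fitting residuals, summed output, mainly reduces bias
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
for name, model in (("RF", rf), ("GBDT", gbdt)):
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    print(name, scores.mean())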

ET or Extra-Trees (Extremely randomized trees)

The ET algorithm is very similar to the random forest algorithm, and both are composed of many decision trees.
The main differences between Extra-Trees and random forest:

  1. RandomForest uses bagging (bootstrap samples), whereas ExtraTrees trains each tree on all the samples; features are still randomly selected. Because the splits themselves are random, it can to some extent perform even better than random forest.

  2. Random forest finds the best split attribute within a random subset of features, while ET draws the split value completely at random and uses it to split the decision tree node.

Regarding the second difference, take a binary tree as an example. When the feature is categorical, a randomly chosen subset of the categories is sent to the left branch and the remaining categories to the right branch. When the feature is numerical, an arbitrary value is drawn between the feature's minimum and maximum; samples whose feature value is greater than this value go to the left branch, and the others go to the right branch.
In this way, the samples under this feature are randomly assigned to the two branches. The split score is then computed (if the feature is categorical, the Gini index can be used; if it is numerical, the mean squared error can be used). All candidate features of the node are traversed and scored in this way, and the feature whose random split has the best score is used to actually split the node. As this description shows, the method is even more random than random forest. A rough sketch of one such random split follows.
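A rough sketch of one such random split for a numerical feature (the helper name random_split_score is hypothetical, and the variance-reduction score stands in for the mean squared error criterion mentioned above):

import numpy as np

def random_split_score(x, y, rng=None):
    """One Extra-Trees style split on a numeric feature: draw the threshold
    uniformly between min and max, then score it by the reduction in variance
    (an MSE-style criterion)."""
    if rng is None:
        rng = np.random.default_rng(0)
    threshold = rng.uniform(x.min(), x.max())
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return threshold, 0.0
    score = y.var() - (len(left) * left.var() + len(right) * right.var()) / len(y)
    return threshold, score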

For a single decision tree, because its split attribute and value are chosen at random, its own predictions are often inaccurate, but combining many such trees achieves a good prediction performance.

After the ET is constructed, all training samples can also be used to estimate its prediction error. Even though the same training set is used both to build the trees and to make the predictions, the randomly chosen splits mean the trees still produce predictions that differ from the training responses, and comparing the predictions with the true response values yields a prediction error. By analogy with random forest, in ET all training samples can be treated as OOB samples, so computing the prediction error of the ET amounts to computing this OOB error.
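In scikit-learn this algorithm is available as ExtraTreesClassifier; a minimal usage sketch on the same wine data is shown below (note that ExtraTreesClassifier uses bootstrap=False by default, i.e. each tree is trained on all samples, matching the first difference above):

from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
# ExtraTrees: random split thresholds; bootstrap=False by default, so each tree sees all samples
et = ExtraTreesClassifier(n_estimators=100, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("ET:", cross_val_score(et, wine.data, wine.target, cv=5).mean())
print("RF:", cross_val_score(rf, wine.data, wine.target, cv=5).mean())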


Origin blog.csdn.net/weixin_42764932/article/details/111405355