[SkLearn classification, regression algorithm] Random Forest Classifier RandomForestClassifier



RandomForestClassifier

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False,
n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)

Random forest is a representative bagging ensemble algorithm whose base estimators are all decision trees. A forest built from classification trees is a random forest classifier; a forest built from regression trees is a random forest regressor.


Ⅰ. Basic parameters

parameter: meaning
criterion: the impurity measure; two options, Gini impurity or information entropy
max_depth: the maximum depth of the tree; branches deeper than this are cut off
min_samples_leaf: after a split, each child node must contain at least min_samples_leaf training samples, otherwise the split does not happen
min_samples_split: a node must contain at least min_samples_split training samples before it is allowed to split, otherwise the split does not happen
max_features: limits the number of features considered at each split; features beyond the limit are discarded. The default is the square root of the total number of features
min_impurity_decrease: limits the impurity decrease (information gain); splits whose gain falls below this value do not happen

These parameters have exactly the same meaning in a random forest as they do in the decision tree explained earlier. The higher the accuracy of each single decision tree, the higher the accuracy of the random forest, because bagging relies on averaging or majority voting to determine the ensemble result.
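As a quick illustration (a minimal sketch; the values below are chosen only for demonstration and are not tuned for any particular dataset), these parameters are passed directly to the constructor:

from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; in practice tune them with a learning curve or grid search
rfc = RandomForestClassifier(
    criterion="gini",            # impurity measure: "gini" or "entropy"
    max_depth=5,                 # branches deeper than 5 levels are cut off
    min_samples_leaf=3,          # each child node must keep at least 3 training samples
    min_samples_split=6,         # a node needs at least 6 samples before it may split
    max_features="sqrt",         # consider sqrt(n_features) features at each split
    min_impurity_decrease=0.0,   # only split when impurity decreases by at least this much
)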



Ⅱ. Important parameter: n_estimators

This is the number of trees in the forest, i.e., the number of base estimators. This parameter has a monotonic effect on the accuracy of the random forest model: the larger n_estimators, the better the model tends to perform. However, any model has a performance ceiling; once n_estimators reaches a certain level, the accuracy of the random forest stops rising or starts to fluctuate. Moreover, the larger n_estimators, the more computation and memory are needed and the longer training takes. For this parameter, we want to strike a balance between training cost and model performance.



Ⅲ. Exploring the wine dataset with a random forest

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
wine = load_wine()
x = wine.data
y = wine.target

# Split into training and test sets
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.3)

# Build the models and train them on the training set
clf = DecisionTreeClassifier(random_state = 0)
rfc = RandomForestClassifier(random_state = 0)
clf = clf.fit(xtrain,ytrain)
rfc = rfc.fit(xtrain,ytrain)

# Evaluate the models
score_clf = clf.score(xtest,ytest)
score_rfc = rfc.score(xtest,ytest)

print("决策树的分类评分:",score_clf)
print("随机森林的分类评分:",score_rfc)
决策树的分类评分: 0.8888888888888888
随机森林的分类评分: 0.9814814814814815

The results show that the random forest scores higher than the single decision tree. This is because a random forest is built from many different decision trees, and its output aggregates the results of all those trees, which tends to be more accurate.



Ⅳ. Cross-validation: plotting curves to compare the decision tree and the random forest

Cross-validation is a method that divides the dataset into n folds, uses each fold in turn as the test set and the remaining n-1 folds as the training set, and trains the model multiple times in order to observe its stability.

# Cross-validation
from sklearn.model_selection import cross_val_score

rfc = RandomForestClassifier(n_estimators=25)
rfc_cross = cross_val_score(rfc,x,y,cv=10)

clf = DecisionTreeClassifier()
clf_cross = cross_val_score(clf,x,y,cv=10)

plt.plot(range(1,11),rfc_cross,label="RandomForestClassifier")
plt.plot(range(1,11),clf_cross,label="DecisionTreeClassifier")
plt.legend()
plt.show()

[Figure: 10-fold cross-validation scores of the random forest vs. the decision tree]
One round of cross-validation is not enough to draw a conclusion, so we keep going: ten rounds of 10-fold cross-validation, taking the mean score of each round, and again compare the random forest with the decision tree.

rfc_l = []
clf_l = []
for i in range(10):
    rfc = RandomForestClassifier(n_estimators=25)
    rfc_cross = cross_val_score(rfc,x,y,cv=10).mean()
    rfc_l.append(rfc_cross)
    clf = DecisionTreeClassifier()
    clf_cross = cross_val_score(clf,x,y,cv=10).mean()
    clf_l.append(clf_cross)  
plt.plot(range(1,11),rfc_l,label="RandomForestClassifier")
plt.plot(range(1,11),clf_l,label="DecisionTreeClassifier")
plt.legend()
plt.show()

The figure below makes it clear that the random forest scores consistently much higher than the decision tree, and that the two curves follow very similar trends. The reason: a random forest is the ensemble of many decision trees, so the higher the accuracy of each single tree, the higher the accuracy of the forest, and bagging relies on averaging or majority voting to determine the ensemble result. Hence the trend of a single base estimator also influences the trend of the overall random forest result.
[Figure: mean 10-fold cross-validation scores over ten repetitions, random forest vs. decision tree]



Ⅴ. Drawing the learning curve of n_estimators

# Learning curve of n_estimators
superpa = []
for i in range(200):
    rfc = RandomForestClassifier(n_estimators=i+1,n_jobs=-1)
    rfc_s = cross_val_score(rfc,x,y,cv=10).mean()
    superpa.append(rfc_s)
print("最高准确率:",max(superpa),"此时的列表索引为:"superpa.index(max(superpa)))
plt.figure(figsize=[20,5])
plt.plot(range(1,201),superpa)
plt.show()

Highest accuracy: 0.9888888888888889 at list index: 44

It can be seen that once n_estimators reaches a certain value, the score fluctuates roughly between 0.96 and 0.98 and stabilizes. The highest score is reached at index 44 of superpa, i.e., at n_estimators=45 the random forest classifier performs best.
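As a minimal follow-up sketch (reusing the x and y arrays loaded earlier), we can refit at that value and check the mean cross-validated score:

# Refit with the best value found on the learning curve
rfc = RandomForestClassifier(n_estimators=45, n_jobs=-1)
print(cross_val_score(rfc, x, y, cv=10).mean())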

Any model has a performance ceiling. Once n_estimators reaches a certain level, the accuracy of the random forest stops rising or starts to fluctuate. Moreover, the larger n_estimators, the more computation and memory are needed and the longer training takes; this plot alone took more than 3 minutes to run.
[Figure: learning curve of cross-validated accuracy against n_estimators from 1 to 200]



Ⅵ. Important parameters, attributes and interfaces

A random forest is essentially a bagging algorithm: bagging averages the predictions of the base estimators or applies majority voting to determine the ensemble result. In the wine example above we built 25 trees, so for any sample, under averaging or majority voting, the random forest misjudges if and only if 13 or more trees misjudge. A single decision tree reaches an accuracy of about 0.85 on the wine dataset. Assuming the probability that one tree misjudges is ε = 0.2, the probability that 13 or more of the 25 trees misjudge is:
e_random_forest = Σ_{i=13}^{25} C(25, i) · ε^i · (1 − ε)^(25 − i)
where i is the number of trees that misjudge, ε is the probability that a single tree misjudges, (1 − ε) is the probability that a tree judges correctly, and the remaining 25 − i trees judge correctly. The binomial coefficient C(25, i) is used because any i of the 25 trees could be the ones that misjudge.

import numpy as np
from scipy.special import comb

# Probability that 13 or more of the 25 trees misjudge, with per-tree error rate 0.2
np.array([comb(25,i)*(0.2**i)*((1-0.2)**(25-i)) for i in range(13,26)]).sum()

0.00036904803455582827

It can be seen that the probability of the ensemble misjudging is tiny, which is why the random forest performs far better on the wine dataset than a single decision tree.
This raises a question: bagging relies on majority voting or on averaging the base classifiers' results, which implicitly assumes that the trees in the forest are different and return different results. Imagine if every tree in the random forest gave the same judgment (all correct or all wrong): then no matter which ensembling rule is applied, the forest could not do better than a single decision tree. Yet we use the same class DecisionTreeClassifier, the same parameters, and the same training and test sets, so why do the trees in the random forest give different judgments?

Many readers will think of this: the classification tree DecisionTreeClassifier in sklearn has built-in randomness, so the trees in a random forest are naturally all different. As mentioned when explaining the classification tree, a decision tree randomly picks one feature among the most important features to split on, so a different tree is generated each time; this behavior is controlled by the parameter random_state. The random forest also has a random_state, used much like in the classification tree, except that in the classification tree random_state controls the generation of a single tree, whereas in the random forest it controls the pattern by which the whole forest is generated, rather than producing a forest containing only one tree.


Parameter random_state, attribute estimators_

# Important attributes and interfaces
rfc = RandomForestClassifier(n_estimators=25,random_state=2)
rfc = rfc.fit(x,y)

# estimators_: an important attribute of the random forest, used to inspect the trees in the forest
rfc.estimators_

This attribute lets us inspect the trees in the random forest. The trees' settings are essentially identical; the main difference is that each tree has a different random_state.
Traversing the trees gives 25 different random seeds:
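A minimal sketch of that traversal (the original post showed the output only as screenshots):

# Each base estimator is a DecisionTreeClassifier with its own random_state
for tree in rfc.estimators_:
    print(tree.random_state)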
It can be observed that when random_state is fixed, the random forest generates a fixed set of trees, yet every tree in that set is still different; this is randomness obtained through the "randomly select a feature to split on" approach. It can also be shown that bagging generally works better as the randomness increases: when bagging is used for ensembling, the base classifiers should be independent of one another and not identical. But this approach has a strong limitation: when we need thousands of trees, the data may not offer thousands of features with which to build that many different trees. Therefore, besides random_state, we need other sources of randomness.



Parameter bootstrap, parameter oob_score, attribute oob_score_

To make the base classifiers as different as possible, an easy-to-understand approach is to train them on different training sets, and bagging achieves this through random sampling with replacement to form different training data; bootstrap is the parameter that controls this sampling technique.

From an original training set containing n samples we sample at random, drawing one sample at a time and returning it to the original training set before the next draw, so the same sample may be drawn again. After n such draws we end up with a bootstrap set of n samples, the same size as the original training set. Because the sampling is random, each bootstrap set differs from the original dataset and from every other bootstrap set. In this way we can freely create an unlimited number of different bootstrap sets; training our base classifiers on these bootstrap sets naturally makes the classifiers different.

The bootstrap parameter defaults to True, meaning this random sampling with replacement is used; we normally do not set it to False.
However, sampling with replacement has its own problems. Because of the replacement, some samples may appear several times in the same bootstrap set while others may never be drawn. On average a bootstrap set contains about 63% of the original data, because the probability that a given sample is drawn into the bootstrap set is:
P(a sample is drawn at least once) = 1 − (1 − 1/n)^n
When n is large enough, this probability converges to 1 − 1/e, approximately 0.632. Therefore about 37% of the training data is left out of modeling; these samples are called out-of-bag (oob) data. Besides the test set we split off at the beginning, this data can also serve as a test set for the ensemble. In other words, when using a random forest we may skip splitting into training and test sets and simply evaluate the model on the out-of-bag data. Of course, this is not absolute: when n and n_estimators are not large enough, it may happen that no data falls out of the bag, and then oob data cannot be used to test the model.
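A quick numerical check of this limit (a minimal sketch, independent of the wine example):

import numpy as np

n = 10000
rng = np.random.default_rng(0)
bootstrap_idx = rng.integers(0, n, size=n)    # bootstrap: draw n indices with replacement
print(len(np.unique(bootstrap_idx)) / n)      # fraction of samples drawn at least once, close to 0.632
print(1 - (1 - 1 / n) ** n)                   # the formula above, also close to 1 - 1/e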

If you want to evaluate with out-of-bag data, set the parameter oob_score to True when instantiating. After training, another important random forest attribute, oob_score_, lets us view the result of testing on the out-of-bag data:

# Important attribute oob_score_
rfc = RandomForestClassifier(n_estimators=25,oob_score=True)
rfc = rfc.fit(x,y)
out_score = rfc.oob_score_
out_score

0.9606741573033708



Attribute feature_importances_, important interfaces

The attribute feature_importances_ has the same usage and meaning as feature_importances_ in the decision tree: it returns the importance of each feature.
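For example (a minimal sketch reusing the wine dataset and the rfc fitted just above on x and y):

# Pair each feature name with its importance and sort, most important first
importances = sorted(zip(wine.feature_names, rfc.feature_importances_),
                     key=lambda t: t[1], reverse=True)
for name, score in importances:
    print(name, score)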

The random forest's interface is exactly the same as the decision tree's: the four common interfaces apply, fit, predict and score are all available. In addition, pay attention to the random forest's predict_proba interface, which returns, for each test sample, the probability of being assigned to each label class; if there are k classes, k probabilities are returned. For a binary classification problem, a sample is assigned to class 1 when the value returned by predict_proba is greater than 0.5, and to class 0 when it is less than 0.5.

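A minimal sketch of these interfaces (reusing xtrain, xtest, ytrain and ytest from the earlier split; the original post showed the output as a screenshot):

rfc = RandomForestClassifier(n_estimators=25)
rfc = rfc.fit(xtrain, ytrain)
print(rfc.score(xtest, ytest))        # mean accuracy on the test set
print(rfc.apply(xtest)[:2])           # leaf index reached by each sample in every tree
print(rfc.predict(xtest)[:5])         # predicted class labels
print(rfc.predict_proba(xtest)[:5])   # one probability per class for each sample; rows sum to 1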

A traditional random forest applies the bagging rule, averaging or majority voting, to determine the ensemble result, whereas the random forest in sklearn averages the probabilities returned by predict_proba for each sample across the trees and uses this average probability to decide the test sample's class.
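This can be checked directly (a minimal sketch reusing the rfc and xtest from the block above): predict returns the class whose averaged probability is highest.

import numpy as np

proba = rfc.predict_proba(xtest)                    # probabilities averaged over the 25 trees
print(np.array_equal(rfc.predict(xtest),
                     rfc.classes_[np.argmax(proba, axis=1)]))   # expected to be True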




Origin blog.csdn.net/qq_45797116/article/details/113766822