RandomForestClassifier
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False,
n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
Random forest is a highly representative bagging ensemble algorithm. All of its base estimators are decision trees. A forest composed of classification trees is called a random forest classifier, and a forest composed of regression trees is called a random forest regressor.
Ⅰ. Basic parameters
Parameter | Meaning |
---|---|
criterion | The impurity measure; there are two options: Gini impurity and information entropy |
max_depth | The maximum depth of the tree; branches beyond the maximum depth are cut off |
min_samples_leaf | After a split, each child node must contain at least min_samples_leaf training samples; otherwise the split does not happen |
min_samples_split | A node must contain at least min_samples_split training samples before it is allowed to split; otherwise the split does not happen |
max_features | Limits the number of features considered at each split; features beyond the limit are discarded. The default value is the square root of the total number of features |
min_impurity_decrease | Limits the size of the information gain; a split whose information gain is smaller than the set value does not happen |
These parameters mean exactly the same in the random forest as they did when we covered the decision tree. The higher the accuracy of a single decision tree, the higher the accuracy of the random forest, because the bagging method relies on averaging or on majority voting to determine the result of the ensemble.
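As a minimal sketch of how these tree-level parameters carry over to the forest, the example below sets a few of them on RandomForestClassifier and inspects the fitted trees. The specific values (max_depth=3 and so on) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

x, y = load_wine(return_X_y=True)

# The parameter values below are illustrative assumptions only.
rfc = RandomForestClassifier(
    n_estimators=10,
    criterion="entropy",
    max_depth=3,
    min_samples_leaf=2,
    min_samples_split=4,
    max_features="sqrt",
    random_state=0,
).fit(x, y)

# Every tree in the forest is grown under the same constraints.
depths = [tree.get_depth() for tree in rfc.estimators_]
print(depths)
```

Each entry of depths should be at most 3, since max_depth is handed to every tree in the forest.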
Ⅱ. Important parameter n_estimators
This is the number of trees in the forest, that is, the number of base estimators. Its effect on the accuracy of the random forest is broadly monotonic: the larger n_estimators, the better the model tends to perform. However, every model has a decision boundary: once n_estimators reaches a certain level, the accuracy of the random forest often stops rising or begins to fluctuate. Moreover, the larger n_estimators, the more computation and memory are required and the longer training takes. For this parameter, we want to strike a balance between training cost and model performance.
Ⅲ. Exploring the wine data set with a random forest
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Load the data set
wine = load_wine()
x = wine.data
y = wine.target
# Split into training and test sets
# (no random_state here, so the split and the scores change from run to run)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
# Build the models and fit them on the training set
clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)
clf = clf.fit(xtrain, ytrain)
rfc = rfc.fit(xtrain, ytrain)
# Evaluate the models on the test set
score_clf = clf.score(xtest, ytest)
score_rfc = rfc.score(xtest, ytest)
print("Decision tree score:", score_clf)
print("Random forest score:", score_rfc)
Decision tree score: 0.8888888888888888
Random forest score: 0.9814814814814815
The results show that the random forest scores higher than the decision tree. This is because the random forest is composed of many different decision trees and its prediction combines their results, so the combined result tends to be more accurate.
Ⅳ. Comparing the decision tree and the random forest with cross-validation
Cross-validation divides the data set into n parts, uses each part in turn as the test set and the remaining n-1 parts as the training set, and trains the model several times to observe how stable it is.
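The procedure described above can be sketched by hand with KFold. This is only an illustration of the idea, assuming 5 shuffled folds and a decision tree; cross_val_score does the equivalent work internally (with an integer cv and a classifier it uses stratified folds).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

x, y = load_wine(return_X_y=True)

# shuffle=True because the wine samples are stored sorted by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
test_counts = np.zeros(len(y), dtype=int)  # how often each sample is tested
for train_idx, test_idx in kf.split(x):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(x[train_idx], y[train_idx])
    fold_scores.append(model.score(x[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

print("score per fold:", fold_scores)
print("mean score:", np.mean(fold_scores))
```

Every sample lands in the test set exactly once, which is what lets the spread of the fold scores say something about the model's stability.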
# Cross-validation
from sklearn.model_selection import cross_val_score
rfc = RandomForestClassifier(n_estimators=25)
rfc_cross = cross_val_score(rfc,x,y,cv=10)
clf = DecisionTreeClassifier()
clf_cross = cross_val_score(clf,x,y,cv=10)
plt.plot(range(1,11),rfc_cross,label="RandomForestClassifier")
plt.plot(range(1,11),clf_cross,label="DecisionTreeClassifier")
plt.legend()
plt.show()
A single round of cross-validation is not enough to draw a conclusion, so we repeat it: we run ten rounds of 10-fold cross-validation, take the mean score of each round, and compare the random forest with the decision tree again.
rfc_l = []
clf_l = []
for i in range(10):
    # Mean 10-fold score of a freshly built random forest
    rfc = RandomForestClassifier(n_estimators=25)
    rfc_cross = cross_val_score(rfc, x, y, cv=10).mean()
    rfc_l.append(rfc_cross)
    # Mean 10-fold score of a freshly built decision tree
    clf = DecisionTreeClassifier()
    clf_cross = cross_val_score(clf, x, y, cv=10).mean()
    clf_l.append(clf_cross)
plt.plot(range(1, 11), rfc_l, label="RandomForestClassifier")
plt.plot(range(1, 11), clf_l, label="DecisionTreeClassifier")
plt.legend()
plt.show()
As the figure shows, the random forest scores substantially higher than the decision tree, and the two curves move in very similar ways. This is because a random forest is the ensemble of multiple decision trees: the higher the accuracy of a single decision tree, the higher the accuracy of the random forest, and the bagging method relies on averaging or majority voting to decide the result of the ensemble. The trend of a single base estimator may therefore also affect the trend of the overall random forest result.
Ⅴ. Drawing the learning curve of n_estimators
# Learning curve of n_estimators
superpa = []
for i in range(200):
    rfc = RandomForestClassifier(n_estimators=i + 1, n_jobs=-1)
    rfc_s = cross_val_score(rfc, x, y, cv=10).mean()
    superpa.append(rfc_s)
print("Highest accuracy:", max(superpa), "at list index:", superpa.index(max(superpa)))
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201), superpa)
plt.show()
Highest accuracy: 0.9888888888888889 at list index: 44
It can be seen that once n_estimators reaches a certain value, the overall result fluctuates between 0.96 and 0.98 and tends to be stable. The highest score sits at index 44 in superpa, that is, the random forest classifier performed best in this run at n_estimators=45.
Any model has a decision boundary. After n_estimators reaches a certain level, the accuracy of the random forest often stops rising or begins to fluctuate. Moreover, the larger n_estimators, the more computation and memory are required and the longer training takes: this plot took more than 3 minutes to produce.
Ⅵ. Important parameters, attributes and interfaces
A random forest is in essence a bagging algorithm: it averages the predictions of the base estimators, or applies majority voting, to determine the result of the ensemble. In the wine example above we built 25 trees. For any sample, under majority voting, the random forest misclassifies it if and only if 13 or more trees get it wrong. The accuracy of a single decision tree on the wine data set is about 0.85. Assuming that the probability of one tree making a wrong judgment is ε = 0.2, and that the trees err independently of one another, the probability that 13 or more trees are wrong is:
$$e_{\text{forest}} = \sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i}$$
where i is the number of trees that judge wrongly, ε is the probability that one tree judges wrongly, (1-ε) is the probability that it judges correctly, and 25-i trees judge correctly. The binomial coefficient appears because any i of the 25 trees may be the ones that are wrong.
import numpy as np
from scipy.special import comb
np.array([comb(25,i)*(0.2**i)*((1-0.2)**(25-i)) for i in range(13,26)]).sum()
0.00036904803455582827
It can be seen that the probability of a wrong judgment is very small, which is why the random forest performs far better on the wine data set than a single decision tree.
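To see how this result depends on the quality of the individual trees, the same sum can be wrapped in a small function. majority_error is a hypothetical helper introduced here for illustration, and the calculation still assumes that the trees err independently, which real forests only approximate.

```python
from scipy.special import comb

def majority_error(n_trees, eps):
    """Probability that more than half of n_trees independent trees are wrong.

    eps is the error probability of a single tree (assumed identical for all).
    """
    first_majority = n_trees // 2 + 1  # 13 when n_trees is 25
    return sum(
        comb(n_trees, i) * eps**i * (1 - eps) ** (n_trees - i)
        for i in range(first_majority, n_trees + 1)
    )

for eps in (0.2, 0.4, 0.5, 0.6):
    print(eps, majority_error(25, eps))
```

With eps=0.2 this reproduces the value computed above. Once a single tree is wrong more often than right (eps above 0.5), the majority vote becomes worse than a single tree, which is why the accuracy of the base estimators matters.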
This raises a question. We said that the bagging method applies majority voting or averages the results of the base classifiers, which means we are assuming that each tree in the forest is different and returns different results. If all trees in the random forest gave the same judgment (all correct or all wrong), then whatever rule the forest used to combine them, it could not do better than a single decision tree. Yet we use the same class DecisionTreeClassifier, the same parameters, and the same training and test sets, so why do the trees in the random forest give different judgments?
Faced with this question, many readers might think: the classification tree DecisionTreeClassifier in sklearn has built-in randomness, so the trees in a random forest are naturally all different. As we mentioned when we covered the classification tree, the decision tree randomly selects one feature from the most important features to branch on, so the tree generated is different each time, and this behavior is controlled by the parameter random_state. The random forest also has a random_state parameter, and its usage is similar to that in the classification tree, except that in a classification tree random_state controls the generation of a single tree, whereas in a random forest it controls the pattern by which the whole forest is generated, rather than leaving the forest with only one kind of tree.
Parameter random_state, attribute estimators_
# Important attributes and interfaces
rfc = RandomForestClassifier(n_estimators=25, random_state=2)
rfc = rfc.fit(x, y)
# One of the random forest's important attributes: estimators_, used to inspect the trees in the forest
rfc.estimators_
It can be seen that this attribute is used to inspect the trees in the random forest. The information shown for each tree is largely the same; the main difference is that each tree has a different random_state.
Iterating over the trees yields 25 random seeds, one per tree.
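A minimal sketch of that traversal, reading the random_state of each fitted tree in estimators_ (the variable name seeds is introduced here for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

x, y = load_wine(return_X_y=True)
rfc = RandomForestClassifier(n_estimators=25, random_state=2).fit(x, y)

# Each fitted tree carries its own integer seed, derived from the forest's random_state
seeds = [tree.random_state for tree in rfc.estimators_]
print(seeds)
```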
It can be observed that when random_state is fixed, the random forest generates a fixed set of trees, but the trees still differ from one another. This randomness comes from the "randomly select features to branch on" approach. It can also be shown that the bagging method generally performs better the greater the randomness. When bagging is used for an ensemble, the base classifiers should be independent of one another and should not be identical. However, this approach is quite limited: when we need thousands of trees, the data may not provide enough features to build that many different trees. Therefore, in addition to random_state, we need other sources of randomness.
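The effect of fixing random_state can be checked with a small sketch: two forests built with the same seed should contain the same set of tree seeds and give the same predictions, while a different seed should give a different set. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

x, y = load_wine(return_X_y=True)

rfc_a = RandomForestClassifier(n_estimators=25, random_state=2).fit(x, y)
rfc_b = RandomForestClassifier(n_estimators=25, random_state=2).fit(x, y)
rfc_c = RandomForestClassifier(n_estimators=25, random_state=3).fit(x, y)

seeds_a = [tree.random_state for tree in rfc_a.estimators_]
seeds_b = [tree.random_state for tree in rfc_b.estimators_]
seeds_c = [tree.random_state for tree in rfc_c.estimators_]

# Same forest seed: the same fixed set of (mutually different) trees
same_seeds = seeds_a == seeds_b
same_proba = np.allclose(rfc_a.predict_proba(x), rfc_b.predict_proba(x))
print(same_seeds, same_proba, seeds_a == seeds_c)
```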
Parameter bootstrap, parameter oob_score, attribute oob_score_
To make the base classifiers as different as possible, an easy-to-understand method is to train them on different training sets. The bagging method uses random sampling with replacement to form these different training sets, and bootstrap is the parameter that controls this sampling technique.
From an original training set containing n samples, we sample at random, one sample at a time, and return each sample to the original training set before drawing the next, which means the same sample may be drawn again. After n draws we obtain a bootstrap set of n samples, the same size as the original training set. Because the sampling is random, each bootstrap set differs from the original data set and from the other bootstrap sets. In this way we can create an inexhaustible supply of different bootstrap sets, and base classifiers trained on them will naturally differ from one another.
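The sampling procedure described above can be sketched with NumPy. This is an illustration of sampling with replacement under an assumed n of 178 (the size of the wine data set), not sklearn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 178  # assumed size of the original training set
indices = np.arange(n)

# Draw n sample indices with replacement: one bootstrap set
bootstrap_idx = rng.choice(indices, size=n, replace=True)

in_bag = np.unique(bootstrap_idx)            # samples drawn at least once
out_of_bag = np.setdiff1d(indices, in_bag)   # samples never drawn

print("bootstrap set size:", len(bootstrap_idx))
print("distinct samples in the bag:", len(in_bag))
print("samples left out of the bag:", len(out_of_bag))
```

Repeating the draw with a different seed gives a different bootstrap set, which is what makes the trees trained on them differ.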
The bootstrap parameter defaults to True, which means this random sampling with replacement is used. We usually do not set this parameter to False (as shown in the ball model below).
However, sampling with replacement has its own problem. Because of replacement, some samples may appear several times in the same bootstrap set while others are never drawn. In general, a bootstrap set contains on average about 63% of the original data, because the probability that a given sample is drawn into the bootstrap set is:
$$1 - \left(1 - \frac{1}{n}\right)^{n}$$
When n is large enough, this probability converges to 1 - (1/e), which is approximately 0.632. Therefore about 37% of the training data is wasted: it takes no part in building the model, and it is called out-of-bag (oob) data. Besides the test set we split off at the beginning, this data can also serve as a test set for the ensemble. In other words, when using a random forest we can skip the train/test split and simply test the model on the out-of-bag data. This is not absolute, of course: when n and n_estimators are not large enough, it is quite possible that no data falls out of the bag, and then the oob data cannot be used to test the model (as in ball model 2 in the figure above).
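The 63.2% figure can be checked numerically. The sketch below evaluates the probability that a given sample is drawn at least once in n draws with replacement, for a few assumed values of n, and compares it with the limit 1 - 1/e. in_bag_probability is a hypothetical helper name.

```python
import math

def in_bag_probability(n):
    """Probability that one given sample is drawn at least once in n draws."""
    return 1 - (1 - 1 / n) ** n

limit = 1 - 1 / math.e

for n in (10, 100, 178, 10000):
    print(n, in_bag_probability(n))
print("limit:", limit)
```

The probability decreases toward the limit as n grows, so for small data sets slightly more than 63.2% of the samples are expected in each bag.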
If you want to test with out-of-bag data, set the oob_score parameter to True when instantiating the model. After training, another important attribute of the random forest, oob_score_, shows the result of testing on the out-of-bag data:
# Important attribute oob_score_
rfc = RandomForestClassifier(n_estimators=25,oob_score=True)
rfc = rfc.fit(x,y)
out_score = rfc.oob_score_
out_score
0.9606741573033708
Attribute feature_importances_ and important interfaces
feature_importances_ has the same usage and meaning as .feature_importances_ in the decision tree: it returns the importance of each feature.
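A minimal sketch of reading feature_importances_ on the wine data, pairing each value with its feature name. The forest settings (25 trees, random_state=0) are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
rfc = RandomForestClassifier(n_estimators=25, random_state=0)
rfc = rfc.fit(wine.data, wine.target)

importances = rfc.feature_importances_

# Print the features from most to least important
ranked = sorted(zip(wine.feature_names, importances), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name:30s} {importance:.3f}")
```

There is one value per feature, and the values are normalized so that they sum to 1.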
The interfaces of the random forest are exactly the same as those of the decision tree, so there are still four common interfaces: apply, fit, predict and score. In addition, pay attention to the random forest's predict_proba interface, which returns the probability of each test sample being assigned to each class label. There are as many probabilities per sample as there are classes. For a binary problem, a sample whose predict_proba value for class 1 is greater than 0.5 is assigned to class 1, and one whose value is less than 0.5 is assigned to class 0.
The traditional random forest follows the rules of the bagging method, using averaging or majority voting to determine the result of the ensemble, whereas the random forest in sklearn averages the probabilities returned by predict_proba for each sample and uses that average probability to decide the class of the test sample.
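The averaging described above can be checked with a short sketch: the forest's predict_proba is compared with the mean of the per-tree predict_proba outputs, and apply is called to show the shape of its result. The split and seed values are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_wine(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(n_estimators=25, random_state=0).fit(xtrain, ytrain)

# One row per test sample, one column per class
forest_proba = rfc.predict_proba(xtest)

# Average the class probabilities predicted by the individual trees
tree_mean = np.mean([tree.predict_proba(xtest) for tree in rfc.estimators_], axis=0)
print(np.allclose(forest_proba, tree_mean))

# apply returns, for each sample, the index of the leaf it reaches in every tree
leaves = rfc.apply(xtest)
print(forest_proba.shape, leaves.shape)
```

The predicted class is then the one with the highest averaged probability in each row.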