sklearn machine learning library (2) random forest in sklearn

An ensemble algorithm combines the modeling results of multiple estimators and aggregates them into one comprehensive result, in order to achieve better regression or classification performance than any single model.

The model built from multiple models is called an ensemble estimator, and each model that makes up the ensemble is called a base estimator. Generally speaking, there are three families of ensemble algorithms: bagging, boosting and stacking.

  • Bagging (装袋法): the core idea is to construct multiple independent estimators and then average their predictions, or take a majority vote, to determine the result of the ensemble. The representative bagging model is the random forest.

  • Boosting (提升法): the base estimators are correlated and are built one by one, in order. The core idea is to combine the power of weak estimators that predict the hard-to-fit samples again and again, thereby forming a strong estimator. Representative boosting models are AdaBoost and gradient boosted trees (GBDT). A small sketch contrasting the two families follows this list.
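As a quick illustration of the two families, here is a minimal sketch (the dataset and settings are arbitrary, not tuned) comparing one bagging ensemble and one boosting ensemble from sklearn.ensemble on the wine dataset:

# Minimal sketch: bagging vs. boosting on the wine dataset (illustrative settings only).
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()

# Bagging: independent base estimators, results aggregated by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: base estimators built sequentially, each focusing on previously misclassified samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("Bagging  CV accuracy:", cross_val_score(bagging, wine.data, wine.target, cv=10).mean())
print("Boosting CV accuracy:", cross_val_score(boosting, wine.data, wine.target, cv=10).mean())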

The ensemble algorithm module in sklearn (sklearn.ensemble)

Class                                   Function
ensemble.AdaBoostClassifier             AdaBoost classification
ensemble.AdaBoostRegressor              AdaBoost regression
ensemble.BaggingClassifier              Bagging classification
ensemble.BaggingRegressor               Bagging regression
ensemble.ExtraTreesClassifier           Extra-trees classification (extremely randomized trees)
ensemble.ExtraTreesRegressor            Extra-trees regression
ensemble.GradientBoostingClassifier     Gradient boosting classification
ensemble.GradientBoostingRegressor      Gradient boosting regression
ensemble.IsolationForest                Isolation Forest
ensemble.RandomForestClassifier         Random forest classification
ensemble.RandomForestRegressor          Random forest regression
ensemble.RandomTreesEmbedding           Ensemble of completely random trees
ensemble.VotingClassifier               Soft-voting / majority-rule classifier for unfitted estimators
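All of these classes live under sklearn.ensemble; a quick way to check what your installed version actually exposes (the exact list varies between sklearn versions):

# List the estimator classes exposed by sklearn.ensemble in the installed version.
from sklearn import ensemble

print([name for name in dir(ensemble) if name[0].isupper()])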

Among the ensemble algorithms, more than half are tree ensemble models.

There are two core issues in the decision tree, one is how to find the correct features for branching, and the other is when the tree should stop growing.

  • For the first question, we define an impurity metric to measure the quality of a split: the impurity of classification trees is measured by the Gini coefficient or information entropy, and the impurity of regression trees is measured by the MSE (mean squared error). At each split, the decision tree computes the impurity of every feature, selects the feature with the lowest impurity to branch on, and then, within the new branches, again computes the impurity of each feature under its different values and keeps choosing the lowest-impurity feature to branch on.
  • Decision trees overfit very easily. To prevent overfitting, we need to prune the decision tree, and sklearn provides a rich set of pruning parameters; a short sketch of them follows.
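A minimal sketch of a few of those pruning parameters in action (the specific values here are arbitrary, chosen only for illustration):

# Sketch: restricting tree growth with pruning parameters (values are arbitrary examples).
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

pruned_tree = DecisionTreeClassifier(
    criterion="entropy",      # impurity measure for classification trees
    max_depth=4,              # cut off branches deeper than 4 levels
    min_samples_leaf=5,       # each leaf must keep at least 5 training samples
    min_samples_split=10,     # a node needs at least 10 samples before it may split
    random_state=0,
)
pruned_tree.fit(wine.data, wine.target)
print("depth after pruning:", pruned_tree.get_depth())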

1 RandomForestClassifier

Random forest is a very representative bagging ensemble algorithm. All of its base estimators are decision trees: a forest composed of classification trees is a random forest classifier, and a forest composed of regression trees is a random forest regressor.

sklearn.ensemble.RandomForestClassifier(
    n_estimators=100, 
    *, 
    criterion='gini', 
    max_depth=None, 
    min_samples_split=2, 
    min_samples_leaf=1, 
    min_weight_fraction_leaf=0.0, 
    max_features='sqrt', 
    max_leaf_nodes=None, 
    min_impurity_decrease=0.0, 
    bootstrap=True, 
    oob_score=False, 
    n_jobs=None, 
    random_state=None, 
    verbose=0, 
    warm_start=False, 
    class_weight=None, 
    ccp_alpha=0.0, 
    max_samples=None
)

1.1 Important parameters

1.1.1 Control parameters of the base estimator

Parameter               Meaning
criterion               Impurity measure, one of {"gini", "entropy", "log_loss"}; default "gini"
max_depth               The maximum depth of the tree; branches beyond the maximum depth are cut off
min_samples_leaf        After a split, each child node must contain at least min_samples_leaf training samples, otherwise the split does not happen
min_samples_split       A node must contain at least min_samples_split training samples before it is allowed to split, otherwise the split does not happen
max_features            Limits the number of features considered at each split; features beyond the limit are not considered for that split. The default is the square root of the total number of features ("sqrt"); the options are {"sqrt", "log2", None}
min_impurity_decrease   A split whose impurity decrease (information gain) is less than this value does not happen

For detailed explanation, please refer to the official website. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

1.1.2 n_estimators

  • This is the number of trees in the forest, i.e. the number of base estimators. This parameter has a monotonic effect on the accuracy of the random forest: the larger n_estimators is, the better the model tends to perform.

  • Correspondingly, however, every model has a decision boundary. After n_estimators reaches a certain level, the accuracy of the random forest often stops rising or starts to fluctuate, and the larger n_estimators is, the more computation and memory are needed and the longer training takes. For this parameter, we want to strike a balance between training cost and model performance.

  • The default value of n_estimators was changed from 10 to 100 in sklearn version 0.22. This change reflects users' tuning experience: a larger n_estimators is usually wanted.

from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split,cross_val_score
%matplotlib inline


# Load the wine dataset
wine = load_wine()

Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data,wine.target,test_size=0.3)

# Train the models
clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)

clf = clf.fit(Xtrain, Ytrain)
rfc = rfc.fit(Xtrain, Ytrain)


tree_score = clf.score(Xtest,Ytest)
forest_score = rfc.score(Xtest,Ytest)

print('Single tree: {}'.format(tree_score),'Random forest: {}'.format(forest_score))
# Single tree: 0.9444444444444444 Random forest: 1.0
# Cross-validation
rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc,wine.data,wine.target,cv=10)

clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf,wine.data,wine.target,cv=10)

plt.plot(range(1,11),rfc_s,label = "RandomForest")
plt.plot(range(1,11),clf_s,label = "Decision Tree")
plt.legend()
plt.show()

[Figure: 10-fold cross-validation scores, random forest vs. single decision tree]

# Learning curve for n_estimators
forest_scores = []

for i in range(100):
    rfc = RandomForestClassifier(n_estimators=i+1)
    s = cross_val_score(rfc,wine.data,wine.target,cv=10).mean()
    forest_scores.append(s)


# Print the best score and its index (n_estimators = index + 1)
print(max(forest_scores),forest_scores.index(max(forest_scores)))
plt.figure(figsize=(10,8))
plt.plot(range(1,101),forest_scores,label="RandomForest")
plt.legend()
plt.show()

[Figure: learning curve of mean cross-validation accuracy versus n_estimators, from 1 to 100]

1.1.3 random_state

  • The essence of a random forest is a bagging ensemble algorithm: bagging averages the predictions of the base estimators, or applies majority voting, to determine the result of the ensemble.

  • In the red wine example just now, we built 25 trees. For any sample, under averaging or majority voting, the random forest misjudges it if and only if 13 or more trees misjudge it. The classification accuracy of a single decision tree on the wine dataset fluctuates around 0.85. Assuming the probability of one tree being wrong is ε = 0.2, the probability that 13 or more of the 25 trees are wrong is about 0.000369. Since the chance of such a wrong judgment is so small, the random forest performs far better than a single decision tree on the wine dataset.

$$e_{forest}=\sum_{i=13}^{25}\binom{25}{i}\,\varepsilon^{i}(1-\varepsilon)^{25-i}\approx 0.000369 \qquad (\varepsilon = 0.2)$$
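A quick numerical check of that figure, assuming each of the 25 trees errs independently with ε = 0.2 (a sketch using scipy's binomial coefficient):

# Probability that the 25-tree forest is wrong, i.e. that 13 or more trees err,
# assuming each tree errs independently with probability eps = 0.2.
from scipy.special import comb

eps, n_trees = 0.2, 25
forest_error = sum(comb(n_trees, i) * eps**i * (1 - eps)**(n_trees - i)
                   for i in range(13, n_trees + 1))
print(forest_error)  # ~0.000369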

  • random_state also exists in random forests, and its usage is similar to that in classification trees.
    • In a classification tree, one random_state only controls the generation of a single tree.
    • In a random forest, random_state controls the mode of generating the forest, rather than fixing a single tree.
rfc = RandomForestClassifier(n_estimators=20,random_state=2)
rfc = rfc.fit(Xtrain, Ytrain)

# One of the important attributes of a random forest: estimators_, used to inspect the trees in the forest
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)
1872583848
794921487
111352301
1853453896
213298710
1922988331
......
  • When random_state is fixed, the random forest generates a fixed set of trees, but each tree is still different. This randomness comes from randomly selecting features for splitting. It can also be shown that, in general, the greater the randomness, the better bagging tends to work: when using bagging, the base classifiers should be independent of each other and different from one another.

  • However, this approach has strong limitations. When we need thousands of trees, the data may not provide thousands of features that would let us build that many different trees. Therefore, besides random_state, we need other sources of randomness.

1.1.4 bootstrap & oob_score

  • To make the base classifiers as different as possible, another easy-to-understand approach is to train them on different training sets. Bagging uses random sampling with replacement (bootstrap sampling) to form different training data, and the bootstrap parameter controls this sampling technique.

  • From an original training set containing n samples, we draw one sample at a time at random and put it back before drawing the next one, which means the same sample may be drawn again. Because of this random sampling, each bootstrap set differs from the original dataset and from the other bootstrap sets. Training our base classifiers on these bootstrap sets naturally makes the base classifiers differ.

  • The bootstrap parameter defaults to True, meaning this random sampling with replacement is used.

  • Generally speaking, a bootstrap set contains on average about 63% of the original data. The remaining roughly 37% of the training data is not involved in modeling; this data is called out-of-bag data (oob).

The probability that a given sample is drawn at least once into a bootstrap set of size n is
$$1-\left(1-\frac{1}{n}\right)^{n}\;\xrightarrow{\;n\to\infty\;}\;1-\frac{1}{e}\approx 0.632$$
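A small numerical sketch of this limit: the analytic in-bag probability for an arbitrary n, plus one simulated bootstrap draw (n = 506 here is an arbitrary choice):

# Check the in-bag fraction analytically and by simulating one bootstrap draw.
import numpy as np

n = 506  # any reasonably large sample size works here
print(1 - (1 - 1 / n) ** n)   # ~0.632: probability that a given sample ends up in the bootstrap set

rng = np.random.default_rng(0)
bootstrap = rng.integers(0, n, size=n)   # sampling with replacement
print(len(np.unique(bootstrap)) / n)     # empirical fraction of distinct samples in the draw, ~0.63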

  • Therefore, when using a random forest we do not have to split off a separate test set; we can test the model with out-of-bag data alone.
    • Of course, this is not absolute. When n and n_estimators are not large enough, it is quite possible that no data falls out of the bag, and then oob data cannot be used to test the model.
    • If you want to test with out-of-bag data, you need to set oob_score=True when instantiating the model.
    • After training, you can use the oob_score_ attribute to view the result of testing on the out-of-bag data.
# No need to split training and test sets
rfc = RandomForestClassifier(n_estimators=25,oob_score=True)
rfc = rfc.fit(wine.data, wine.target)

# Important attribute oob_score_
print(rfc.oob_score_) 

1.2 Important interfaces

  • The interface of the random forest is exactly the same as that of the decision tree, so there are still the four common interfaces: apply, fit, predict and score.

  • In addition, pay attention to the random forest's predict_proba interface. It returns, for each test sample, the probability of being assigned to each label class; if the label has k classes, it returns k probabilities. For a binary classification problem, predict assigns class 1 when the probability returned by predict_proba is greater than 0.5 and class 0 when it is less than 0.5.

  • A traditional random forest follows the bagging rule, averaging or taking a majority vote to determine the ensemble result. The random forest in sklearn instead averages, over the trees, the predict_proba probabilities for each sample and uses this averaged probability to decide the classification of the test sample.

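A minimal sketch of these interfaces plus predict_proba, on a fresh wine-data split (parameter values are illustrative):

# Sketch: the common interfaces of a fitted random forest on the wine data.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(n_estimators=25, random_state=0).fit(Xtrain, Ytrain)

print(rfc.score(Xtest, Ytest))        # mean accuracy on the test set
print(rfc.predict(Xtest)[:3])         # predicted labels for the first three samples
print(rfc.predict_proba(Xtest)[:3])   # averaged class probabilities, one column per class
print(rfc.apply(Xtest)[:3])           # leaf indices, one column per tree in the forest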

Note:

When the error rate of the base classifier is less than 0.5, i.e. when its accuracy is greater than 0.5, the ensemble performs better than the base classifier.

On the contrary, when the error rate of the base classifier is greater than 0.5, the bagging ensemble algorithm fails.

Therefore, before using a random forest, make sure that the classification trees composing it each have at least 50% prediction accuracy. This is visualised in the sketch below.
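The claim can be visualised by plotting the error of a 25-tree majority-vote ensemble against the error rate ε of a single base classifier, following the binomial reasoning from earlier (a sketch, assuming independent base classifiers):

# Bagged ensemble error as a function of the base classifier's error rate eps,
# assuming 25 independent trees and majority voting (13 or more must be wrong).
import numpy as np
from scipy.special import comb
import matplotlib.pyplot as plt

eps = np.linspace(0, 1, 100)
ensemble_error = np.array([
    sum(comb(25, i) * e**i * (1 - e)**(25 - i) for i in range(13, 26))
    for e in eps
])

plt.plot(eps, eps, "--", label="single classifier")
plt.plot(eps, ensemble_error, label="25-tree ensemble")
plt.axvline(0.5, color="gray", linewidth=0.8)   # bagging only helps to the left of this line
plt.xlabel("error rate of base classifier")
plt.ylabel("error rate of ensemble")
plt.legend()
plt.show()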

2 RandomForestRegressor

sklearn.ensemble.RandomForestRegressor(
    n_estimators=100,
    *,
    criterion='squared_error',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=1.0,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    ccp_alpha=0.0,
    max_samples=None
)

All parameters, attributes and interfaces are consistent with those of the random forest classifier. The only difference is that regression trees differ from classification trees, so the impurity metric, controlled by the parameter criterion, is different.

2.1 Criterion

Criterion is the metric used to measure the quality of regression-tree splits:

  • Mean squared error, squared_error: the difference in MSE between the parent node and its leaf nodes is used as the criterion for feature selection. This method minimizes the L2 loss by using the mean value of the leaf nodes.

  • friedman_mse: uses Friedman's modified mean squared error to evaluate potential splits.

  • Mean absolute error, absolute_error: minimizes the L1 loss by using the median of the leaf nodes.

  • poisson: uses the reduction in Poisson deviance to find splits.

Although the mean squared error is always positive, when MSE is used as a scoring criterion in sklearn it is reported as a negative mean squared error.

This is because sklearn takes the nature of the metric into account when computing model evaluation scores: the mean squared error is an error, so sklearn treats it as a loss of the model and therefore reports it as a negative number. The true MSE is simply the value of neg_mean_squared_error with the negative sign removed.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the Boston housing dataset from file.
# Note: the built-in loader has been removed from recent versions of sklearn.
data = pd.read_csv("boston_housing.data", sep='\s+', header=None)

x = data.iloc[:, :-1]
y = data.iloc[:, -1]


regressor = RandomForestRegressor(n_estimators=100,random_state=0)
cross_val_score(regressor, x, y, cv=10,scoring = "neg_mean_squared_error")

[Output: an array of ten negative MSE values, one per cross-validation fold]

This returns the results of the ten cross-validation folds. Note that if you do not pass scoring="neg_mean_squared_error", the default metric for cross-validating a regressor is R squared, so the cross-validation results may be positive or negative.

If scoring is set to negative MSE, the cross-validation results can only be negative. A small side-by-side sketch follows.
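A minimal sketch of the two scoring options next to each other, reusing the regressor, x and y defined in the code above:

# Compare the default scoring (R^2) with explicit negative MSE (continuing the code above).
r2_scores = cross_val_score(regressor, x, y, cv=10)   # R^2: may be positive or negative
neg_mse_scores = cross_val_score(regressor, x, y, cv=10, scoring="neg_mean_squared_error")   # always <= 0

print(r2_scores.mean())
print(-neg_mse_scores.mean())   # flip the sign to recover the actual MSE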

2.2 Use Random Forest to Fill Missing Values

Data collected from the real world often contains missing values. Faced with missing values, many people simply delete the samples that contain them, which can be an effective approach.

However, sometimes imputing the missing values works better than discarding the samples. In sklearn, sklearn.impute.SimpleImputer makes it easy to impute the mean, the median, or another commonly used value into the data.

In this case study, we use the mean, the constant 0, and random forest regression to impute missing values, and compare the model fit under the four conditions (full data, mean imputation, zero imputation, regression imputation) to find the best imputation method for this dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the Boston housing dataset from file.
# Note: the built-in loader has been removed from recent versions of sklearn.
data = pd.read_csv("boston_housing.data", sep='\s+', header=None)

x = data.iloc[:, :-1]
y = data.iloc[:, -1]

print(x.shape)
x.info()  # no missing values in the original data

[Output: (506, 13); DataFrame info showing 13 non-null float columns]

# Artificially introduce missing values
# The missing entries should be scattered randomly across the rows and columns of the dataset,
# and each missing entry needs one row index and one column index.
# If we create an array of 3289 row indices in the range 0-506 and 3289 column indices in the range 0-13,
# we can use them to assign NaN to any 3289 positions in the data.

n_samples = x.shape[0]
n_features = x.shape[1]

rng = np.random.RandomState(0)
miss_rate = 0.5 # 50% of the entries missing, i.e. 3289 values

n_missing_samples = int(np.floor(n_samples * n_features * miss_rate))

missing_features = rng.randint(0,n_features,n_missing_samples)
missing_samples = rng.randint(0,n_samples,n_missing_samples)
X_missing = x.copy().to_numpy()
y_missing = y.copy()
X_missing[missing_samples,missing_features] = np.nan
X_missing = pd.DataFrame(X_missing)


X_missing.head(10)

[Output: first ten rows of X_missing, with NaN scattered across the columns]

# Impute missing values with 0 and with the mean

# 1. Impute with the mean
s_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

x_miss_mean = s_mean.fit_transform(X_missing)


# 2. Impute with the constant 0
zero_mean = SimpleImputer(missing_values=np.nan, strategy='constant',fill_value=0)

x_miss_zero = zero_mean.fit_transform(X_missing)

# 3. Impute missing values with random forest regression
'''
For a feature T, the rows where T is not missing: the other n-1 features plus the original label form X_train,
and the non-missing values of T form Y_train.

For the rows where T is missing: the other n-1 features plus the original label form X_test,
and the missing values of T are the unknown Y_test we need to predict.

This approach works very well when one feature has many missing values while the other features are complete.

What if features other than T also have missing values?
The answer is to iterate over all features, starting with the one that has the fewest missing values
(because imputing the feature with the fewest missing values requires the least accurate information).
When imputing one feature, the missing values of the other features are temporarily replaced with 0.
After each regression prediction, the predicted values are written back into the original feature matrix,
and we move on to the next feature. With every round of imputation one more feature becomes complete,
so fewer and fewer features need the temporary 0 fill.
'''
X_missing_reg = X_missing.copy()
sortindex = np.argsort(X_missing_reg.isnull().sum(axis=0))

for i in sortindex:
    # Build the new feature matrix and the new label
    df = X_missing_reg
    fillc = df.iloc[:,i]
    df = pd.concat([df.iloc[:,df.columns != i],y],axis=1)

    # In the new feature matrix, fill columns that contain missing values with 0
    df_0 =SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0).fit_transform(df)

    # Split into training and test sets
    Ytrain = fillc[fillc.notnull()]
    Ytest = fillc[fillc.isnull()]

    Xtrain = df_0[Ytrain.index,:]
    Xtest = df_0[Ytest.index,:]

    # Use random forest regression to impute the missing values
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(Xtrain, Ytrain)
    Ypredict = rfc.predict(Xtest)

    # Write the imputed values back into the original feature matrix
    X_missing_reg.loc[X_missing_reg.iloc[:,i].isnull(),i] = Ypredict
# Model each version of the data (order matches the plot labels below)
X = [x, x_miss_zero, x_miss_mean, X_missing_reg]


mse = []
for x_imputed in X:
    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
    scores = cross_val_score(estimator, x_imputed, y, scoring='neg_mean_squared_error', cv=5).mean()
    mse.append(scores * -1)
# Plot the results
x_labels = ['Full data','Zero Imputation','Mean Imputation','Regressor Imputation']
colors = ['r', 'g', 'b', 'orange']

plt.figure(figsize=(12, 6))

ax = plt.subplot(111)
for i in np.arange(len(mse)):
    ax.barh(i, mse[i],color=colors[i], alpha=0.6, align='center')


ax.set_title('Imputation Techniques with Boston Data')
ax.set_xlim(left=np.min(mse) * 0.9,right=np.max(mse) * 1.1)
ax.set_yticks(np.arange(len(mse)))
ax.set_xlabel('MSE')
ax.set_yticklabels(x_labels)
plt.show()

[Figure: horizontal bar chart of MSE for the four imputation techniques on the Boston data]

3 Detailed explanation of generalization error

Model tuning: the first step is to identify the target. What are we trying to achieve?

Generally speaking, the goal is to improve some model evaluation metric. For random forests, for example, we want to improve the model's accuracy on unknown data (measured by score or oob_score_).

Having identified this goal, we need to think about which factors affect the model's accuracy on unknown data.

In machine learning, the measure of a model's accuracy on unknown data is called the generalization error.

3.1 The relationship between generalization error and model structure

When the model performs poorly on unknown data (a test set or out-of-bag data), we say that the model does not generalize well enough, the generalization error is large, and the model's performance is poor. The generalization error is influenced by the structure (complexity) of the model.

As shown below:

  • When the model is too complex, the model will overfit and the generalization ability is not enough, so the generalization error is large.

  • When the model is too simple, the model will be underfitted and the fitting ability will be insufficient, so the error will be large.

  • Only when the complexity of the model is just right can the goal of minimizing the generalization error be achieved

[Figure: generalization error versus model complexity, a U-shaped curve with its minimum at moderate complexity]

The relationship between model complexity and parameters

  • For a tree model, the more luxuriant the tree (the deeper it is and the more branches and leaves it has), the more complex the model.

  • Tree models are therefore naturally located in the upper-right corner of the figure, and since random forest is based on tree models, it is also an inherently complex model. The parameters of the random forest all work toward one goal: reducing the complexity of the model, moving it toward the left of the figure, and preventing overfitting.

  • Of course, nothing in parameter tuning is absolute, and some random forests naturally sit on the left side of the figure. So before tuning, we must first determine which side of the figure the model is currently on.

3.2 Random Forest and Generalization Error

We now know:

1) If the model is too complex or too simple, the generalization error will be high; what we are after is the balance point in between.
2) A model that is too complex overfits; a model that is too simple underfits.
3) For tree models and tree ensemble models, the deeper the tree and the more branches and leaves it has, the more complex the model.
4) The goal for tree models and tree ensembles is to reduce model complexity and move the model toward the left of the figure.

The direction of random forest tuning: reduce the complexity of the model.

We can pick out the parameters that have a large impact on complexity, study their monotonicity, and then focus on tuning the ones that can reduce complexity the most. Once complexity can no longer be reduced, we do not need to keep tuning. A sketch of how to locate the model on the complexity curve follows the table below.

Parameter, its effect on the model's performance on unknown data, and its influence level:

  • n_estimators: improves performance until it plateaus; increasing n_estimators does not raise the complexity of any single tree. Influence: ⭐⭐⭐⭐
  • max_depth: can move the model in either direction. The default (no limit) gives the highest complexity; lowering max_depth reduces complexity and moves the model toward the left of the figure. Influence: ⭐⭐⭐
  • min_samples_leaf: can move the model in either direction. The default minimum of 1 gives the highest complexity; raising min_samples_leaf reduces complexity and moves the model toward the left of the figure. Influence: ⭐⭐
  • min_samples_split: can move the model in either direction. The default minimum of 2 gives the highest complexity; raising min_samples_split reduces complexity and moves the model toward the left of the figure. Influence: ⭐⭐
  • max_features: can move the model in either direction. The default (the square root of the total number of features) sits at medium complexity; lowering max_features makes the model simpler (moves it left), raising max_features makes it more complex (moves it right). max_features is the only parameter that can make the model either simpler or more complex, so when tuning it we must consider which direction we want to go.
  • criterion: can move the model in either direction; gini is generally used.
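One practical way to see which side of the generalization-error curve the model is on is to draw a quick complexity curve for a single parameter such as max_depth; the sketch below uses the wine data purely for illustration:

# Sketch: cross-validated score versus max_depth, to locate the model on the complexity curve.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
depths = range(1, 11)
scores = [cross_val_score(RandomForestClassifier(n_estimators=100, max_depth=d, random_state=0),
                          wine.data, wine.target, cv=10).mean()
          for d in depths]

plt.plot(depths, scores)
plt.xlabel("max_depth")
plt.ylabel("mean CV accuracy")
plt.show()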

3.3 Variance and bias

The generalization error E of an ensemble model on an unknown dataset is jointly determined by the variance (var), the bias, and the noise (ε).

Bias: the difference between the model's predictions and the true values. The more accurate the model, the lower the bias.

Variance: the error between each individual prediction of the model and the average of the model's predictions. The more stable the model, the lower the variance.

Bias measures whether the model predicts accurately: the smaller the bias, the more "accurate" the model. Variance measures whether the model's predictions are close to one another: the smaller the variance, the more "stable" the model.

  • Large bias, large variance: the model is not suited to the data at all.
  • Small bias, large variance: overfitting; the model is complex, predicts well on some datasets and poorly on others.
  • Large bias, small variance: underfitting; the model is relatively simple, its predictions are stable but inaccurate on all data.
  • Small bias, small variance: small generalization error; this is our goal.

The relationship between generalization error, variance, and bias

[Figure: bias and variance as functions of model complexity, with total generalization error minimized at intermediate complexity]

  • When model complexity is high, variance is high and bias is low.

    • Low bias means the model is required to predict "accurately". The model works harder to learn the data and becomes more specific to the training data, which results in performing well on some data and poorly on other data.
    • The model then generalizes poorly and behaves unstably on different data, so the variance is large. To learn as much as possible from the training set, the model must be built in more detail and its complexity must rise. Hence high complexity comes with high variance, and the total generalization error is high.

  • In contrast, when complexity is low, variance is low and bias is high.

    • Low variance requires the model to predict "stably" and generalize more strongly. For that, the model does not need to dig too deeply into the data; it only needs to build a relatively simple model with broader judgments.
    • The result is that the model cannot achieve high accuracy on any particular kind of data, so the bias is large. Hence low complexity comes with high bias, and the total generalization error is again high.

Variance and bias in random forests

The base estimators of a random forest all have relatively low bias and high variance, because a decision tree is itself a model that predicts fairly "accurately" and overfits easily, and the bagging method itself requires the accuracy of the base classifiers to be above 50%.

Therefore, the training process of bagging methods such as random forest aims to reduce variance, that is, to reduce model complexity. Accordingly, the default parameter settings of the random forest assume that the model itself sits to the right of the lowest point of the generalization-error curve.

4 Parameter adjustment of random forest on breast cancer data set

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = load_breast_cancer()

# The breast cancer dataset has 569 records and 30 features. The dimensionality is not
# especially high, but the sample size is very small, so overfitting is possible.
data.data.shape
# Fit a simple baseline model to see how it performs on the dataset
rfc = RandomForestClassifier(n_estimators=100,random_state=20)
score_pre = cross_val_score(rfc,data.data,data.target,cv=10).mean()

print(score_pre)  # 0.9648809523809524
# Step one of random forest tuning: always start by tuning n_estimators

scorel = []
for i in range(0,200,10):
    rfc = RandomForestClassifier(n_estimators=i+1,
                                 n_jobs=-1,
                                 random_state=20)
    score = cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)

# Print the best score and the corresponding n_estimators
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()

[Figure: coarse learning curve of CV accuracy versus n_estimators, in steps of 10]

# Refine the learning curve within the promising range
scorel = []
for i in range(45,80):
    rfc = RandomForestClassifier(n_estimators=i,
                                 n_jobs=-1,
                                 random_state=20)
    score = cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)

print(max(scorel),([*range(45,80)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])
plt.plot(range(45,80),scorel)
plt.show()

[Figure: refined learning curve of CV accuracy for n_estimators from 45 to 79]

# Next we move to grid search, adjusting the parameters one at a time with GridSearchCV.
# Why not adjust several parameters at once?
# 1) Adjusting several parameters at the same time runs very slowly.
# 2) Adjusting several parameters at once makes it hard to understand how a parameter combination came about,
#    so even if the grid search result is poor, we would not know where to make changes.
# Here, to apply the complexity/generalization-error (variance-bias) reasoning, we adjust the parameters one by one.


# Prepare the parameter grids for grid search
"""
Some parameters have no obvious reference range. In that case we use a learning curve, look at the trend,
pick a narrower interval from the results, and run the curve again:

param_grid = {'n_estimators':np.arange(0, 200, 10)}
param_grid = {'max_depth':np.arange(1, 20, 1)}
param_grid = {'max_leaf_nodes':np.arange(25,50,1)}


Other parameters have a findable range, or we know how the model's overall accuracy changes with their values,
so we can run a grid search on them directly:

param_grid = {'criterion':['gini', 'entropy']}
param_grid = {'min_samples_split':np.arange(2, 2+20, 1)}
param_grid = {'min_samples_leaf':np.arange(1, 1+10, 1)}
param_grid = {'max_features':np.arange(1,30,1)}
"""

# Tune the parameters in order of their influence on overall accuracy, starting with max_depth

# Tune max_depth
param_grid = {'max_depth':np.arange(1, 20, 1)}

# Probe a range based on the size of the data: the breast cancer data is small, so 1-10 or 1-20 is enough.
# For large datasets, depths of 30-50 may be needed (and even that may not be enough; a learning curve
# would show how depth affects the model).
rfc = RandomForestClassifier(n_estimators=59,random_state=20)

GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)
print(GS.best_params_)
print(GS.best_score_) # The score is exactly the same as before, meaning the parameter is already optimal and needs no further tuning

#{'max_depth': 9}
#0.968421052631579
# Tune max_features
param_grid = {'max_features':np.arange(1,30,1)}
"""
max_features is the only parameter that can push the model both to the left (lower variance, higher bias)
and to the right (higher variance, lower bias).

The default minimum of max_features is sqrt(n_features).
"""
rfc = RandomForestClassifier(n_estimators=59
                                ,random_state=20
                                )

GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)

print(GS.best_params_)
print(GS.best_score_)  # The best score is exactly the same as before, meaning the model has reached its ceiling
# {'max_features': 5}
# 0.968421052631579

Origin blog.csdn.net/qq_44665283/article/details/132326251