[SkLearn Exercise] Random Forest Tuning Application: Breast Cancer Dataset




The tuning process below follows the general approach described in the companion post on the basic idea of machine learning parameter tuning.

Ⅰ. Obtain the data set

# 1. Imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 2. Load the dataset
breast = load_breast_cancer()
x = breast.data    # feature matrix
y = breast.target  # labels



Ⅱ. Modeling and baseline evaluation

A preliminary cross-validation evaluation shows that the model already performs quite well.

# 3. Build the model and do a quick evaluation
rfc = RandomForestClassifier(n_estimators=100, random_state=90)
score = cross_val_score(rfc, x, y, cv=10).mean()  # 10-fold cross-validation with the default scorer (accuracy)
score  # mean cross-validation score
0.9648809523809524



Ⅲ. Tuning parameters — n_estimators

Here we run a preliminary tuning pass, focused on the n_estimators parameter.
We traverse 20 values within range(1, 201) in steps of 10, evaluate each n_estimators with 10-fold cross-validation (taking the mean score), and finally plot the score curve so the trend is easier to see.

# 4. Tuning --- n_estimators
scores = []
for i in range(0, 200, 10):
    rfc = RandomForestClassifier(n_estimators=i + 1,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, x, y, cv=10).mean()
    scores.append(score)

print("Best score:", max(scores), "reached at n_estimators =",
      scores.index(max(scores)) * 10 + 1)
# Visualization
plt.figure(figsize=(10, 6))
plt.plot(range(1, 201, 10), scores)
# Set the axis tick colors to white (for a dark background)
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
plt.show()

[Figure: mean 10-fold cross-validation score vs. n_estimators, for n_estimators = 1 to 191 in steps of 10]



Ⅳ. Refined parameter adjustment ---- n_estimators

From the preliminary evaluation we can conclude that the maximum occurs near n_estimators = 71. Next we narrow the range and refine the learning curve to see whether an even better n_estimators can be picked out.
Here we take range(65, 75).

# 5. Refined tuning --- n_estimators
scores = []
for i in range(65, 75):
    rfc = RandomForestClassifier(n_estimators=i,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, x, y, cv=10).mean()
    scores.append(score)

print("Best score:", max(scores), "reached at n_estimators =",
      [*range(65, 75)][scores.index(max(scores))])
# Visualization
plt.figure(figsize=(10, 6))
plt.plot(range(65, 75), scores)
# Set the axis tick colors to white (for a dark background)
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
plt.show()

[Figure: mean 10-fold cross-validation score vs. n_estimators, for n_estimators in range(65, 75)]

Tuning n_estimators has a clear effect: the cross-validation score rises noticeably above the original 0.9648809523809524. Next we move to grid search and use it to adjust the remaining parameters one by one.

Why don't we adjust multiple parameters at the same time? There are two reasons:

  • 1) Adjusting multiple parameters at the same time makes the search run very slowly (a rough cost comparison follows this list).
  • 2) Adjusting multiple parameters at the same time makes it hard to understand how a particular combination was arrived at, so even if the grid search result is poor, we would not know what to change. Here, in order to apply the complexity-generalization error method (the variance-bias trade-off), we adjust the parameters one at a time.
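
To make the first point concrete, here is a rough cost comparison. The grid sizes are illustrative assumptions (roughly the sizes of the grids used later in this post), not an exact accounting:

# Rough cost comparison: one joint grid search vs. tuning one parameter at a time
# (grid sizes are illustrative assumptions; 10-fold cross-validation)
cv = 10
grid_sizes = [20, 19, 25, 10]            # e.g. n_estimators, max_depth, max_features, min_samples_leaf

joint_fits = cv * np.prod(grid_sizes)    # every combination: 950,000 model fits
sequential_fits = cv * sum(grid_sizes)   # one grid after another: 740 model fits
print(joint_fits, sequential_fits)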



Ⅴ. Grid search tuning

Some parameters have no obvious reference point, and it is hard to state a range for them in advance. For these we use a learning curve to see the trend, pick a smaller interval from the curve's results, and then run the curve again (a sketch of such a curve follows the grids below). Typical examples:

param_grid = {'n_estimators': np.arange(0, 200, 10)}
param_grid = {'max_depth': np.arange(1, 20, 1)}
param_grid = {'max_leaf_nodes': np.arange(25, 50, 1)}
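
As a sketch of what such a learning curve might look like for one of these parameters, here is a max_depth curve written in the same style as the n_estimators curve above. This block is an illustration continuing the script above, not output from the original post; it reuses n_estimators=73, the value kept for the grid searches below.

# Sketch: learning curve for max_depth, same pattern as the n_estimators curve above
scores = []
depths = np.arange(1, 20, 1)
for d in depths:
    rfc = RandomForestClassifier(n_estimators=73, max_depth=d,
                                 n_jobs=-1, random_state=90)
    scores.append(cross_val_score(rfc, x, y, cv=10).mean())

plt.figure(figsize=(10, 6))
plt.plot(depths, scores)
plt.show()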

For a large data set, you can try starting from 1000: enter 1000 first, take an interval of every 100 leaves, and then gradually narrow the range. Other parameters have a natural range, or we already know how the model's overall accuracy changes with their values. For such parameters we can run a grid search directly:

param_grid = {'criterion': ['gini', 'entropy']}
param_grid = {'min_samples_split': np.arange(2, 2 + 20, 1)}
param_grid = {'min_samples_leaf': np.arange(1, 1 + 10, 1)}
param_grid = {'max_features': np.arange(5, 30, 1)}

The remaining parameters are tuned below with reference to the generalization error graph:
[Figure: generalization error vs. model complexity, showing the bias-variance trade-off]
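
For reference, the generalization error that the graph describes can be written with the standard bias-variance decomposition (a textbook identity, stated here only to make the left/right reasoning below easier to follow):

    E[(y - f̂(x))²] = Bias²[f̂(x)] + Var[f̂(x)] + σ²

Moving left on the graph (a simpler model) raises the bias term, moving right (a more complex model) raises the variance term, and σ² is the irreducible noise.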


• Adjust max_depth

When max_depth is tuned here, there is a clear improvement over the original 0.9648809523809524, indicating that this parameter contributes a meaningful improvement to the overall model.

# 6.1 Tune max_depth
# Probe a range based on the data size; the breast cancer data set is small, so 1~10 or 1~20 is enough
param_grid = {'max_depth': np.arange(1, 20, 1)}

rfc = RandomForestClassifier(n_estimators=73, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(x, y)

print("Best parameters:", GS.best_params_)
print("Corresponding score:", GS.best_score_)

Best parameters: {'max_depth': 8}
Corresponding score: 0.9666353383458647

Limiting max_depth simplifies the model and pushes it to the left on the complexity axis. Generally speaking, a random forest sits to the right of the lowest point of the generalization error curve: tree models lean towards overfitting rather than underfitting, and the current results are consistent with that. However, since we are pursuing the lowest generalization error, we keep the previously found best n_estimators unless other factors can help us reach higher accuracy.

When the model sits on the right side of the graph, what we need is to reduce model complexity (lower the variance, increase the bias). In that case max_depth should be as small as possible, and min_samples_leaf and min_samples_split should be as large as possible. Besides max_features, we could also try min_samples_leaf and min_samples_split, because max_depth, min_samples_leaf and min_samples_split are all pruning parameters, i.e. parameters that reduce complexity.

Here we can predict that we are already very close to the upper limit of the model, but it may still improve a little further. Next, let's adjust max_features and see how the model changes.



• Adjust max_features

After adjusting max_features, the score rises further. The best number of features is 24, which is less than the maximum of 30 (the number of features in the data set itself). This suggests that keeping every original feature is too many and may introduce some overfitting. On the generalization error graph above, the model's complexity shifts slightly from its original position towards the best complexity in the middle.

# 6.2 Tune max_features
"""
max_features is the only parameter that can push the model either to the left (lower variance,
higher bias) or to the right (higher variance, lower bias). We decide which way to push it based
on where the model sits before tuning (to the left or the right of the lowest point of the
generalization error curve). The model is currently on the left side of the graph, so we need
more complexity; therefore max_features should be adjusted upwards - the more features available,
the more complex the model. The default minimum of max_features is sqrt(n_features), so we use
that value as the lower end of the search range.
"""

param_grid = {'max_features': np.arange(5, 30, 1)}

rfc = RandomForestClassifier(n_estimators=73, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(x, y)

print("Best parameters:", GS.best_params_)
print("Corresponding score:", GS.best_score_)

Best parameters: {'max_features': 24}
Corresponding score: 0.9666666666666668



• Adjust min_samples_leaf, min_samples_split, criterion

Adjusting any of these parameters still gives an improvement over the original model, but the scores are slightly lower than the best max_features result, which suggests that the upper limit of the model is basically set.

[Figures: grid-search results for min_samples_split, min_samples_leaf, and criterion]
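
The figures for these three searches are not reproduced here. As a sketch, they can be run one at a time in the same pattern as the max_depth and max_features searches above, using the grids listed earlier (the exact scores may differ slightly from the original figures):

# Sketch: grid searches for the remaining parameters, one at a time (same pattern as above)
for param_grid in [{'min_samples_split': np.arange(2, 2 + 20, 1)},
                   {'min_samples_leaf': np.arange(1, 1 + 10, 1)},
                   {'criterion': ['gini', 'entropy']}]:
    rfc = RandomForestClassifier(n_estimators=73, random_state=90)
    GS = GridSearchCV(rfc, param_grid, cv=10)
    GS.fit(x, y)
    print(param_grid, "->", GS.best_params_, GS.best_score_)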
So overall our best model is as follows:

# Best model
rfc = RandomForestClassifier(n_estimators=73, max_features=24, random_state=90)
score = cross_val_score(rfc, x, y, cv=10).mean()
score

0.9666666666666668
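
As a final usage sketch (not part of the original post), the tuned model could also be fit on a train/test split to check its accuracy on held-out data; the split below is an illustrative assumption:

# Sketch: fit the tuned model on a train/test split and score it on held-out data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=90)
rfc = RandomForestClassifier(n_estimators=73, max_features=24, random_state=90)
rfc.fit(x_train, y_train)
print("Held-out accuracy:", rfc.score(x_test, y_test))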



Origin blog.csdn.net/qq_45797116/article/details/113808101