Learning sklearn systematically - random forest tuning (with a case study and complete Python code)

Table of contents

1. The core problem of parameter tuning

2. Random forest tuning direction

3. Random forest parameter tuning methods

  1. Draw a learning curve

  2. Grid search

4. Detailed code


       Before tuning parameters, you first need to understand what the core problem of tuning is, then organize your thinking, and only then start adjusting. Tuning is not easy: experts rely on years of accumulated experience and a clear workflow. The rest of us should at least understand the reasoning and direction behind tuning, and then keep experimenting.

1. The core problem of parameter tuning

1. What is the purpose of parameter tuning?

2. What factors affect the accuracy of the model on unknown data?

Generalization error : measures how well the model performs on unknown data (the higher the accuracy, the smaller the generalization error). It is driven by the complexity of the model.

The relationship between model complexity and accuracy is like the relationship between stress and exam scores: too much or too little stress both lead to poor scores, and scores are highest only when the stress is just right. In the same way, a model that is too complex or too simple will usually give unsatisfactory results. So our goal is clear: keep the model neither too complex nor too simple. For example, if increasing the model's complexity raises accuracy (and lowers the generalization error), the model was a bit too simple; conversely, if reducing the complexity raises accuracy, the model was too complex and should be dialed back. It's as simple as that.

For tree models, and for ensembles of trees, the deeper the tree and the more branches and leaves it has, the more complex the model. Tree models and tree ensembles tend to start out on the complex side, so what we usually need to do is reduce complexity while improving accuracy.
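To see this concretely, here is a minimal sketch (not part of the original tuning code; it uses the same breast cancer dataset as the case below) that plots cross-validated accuracy against tree depth - as depth (complexity) grows, accuracy typically rises and then flattens or dips:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

data = load_breast_cancer()
depths = range(1, 21)
# fix the forest size so only depth (complexity) varies
scores = [cross_val_score(RandomForestClassifier(max_depth=d, n_estimators=50, random_state=42),
                          data.data, data.target, cv=5).mean()
          for d in depths]
plt.plot(depths, scores)
plt.xlabel('max_depth')
plt.ylabel('cv accuracy')
plt.show()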

2. Random forest tuning direction

       To reduce complexity, pick out the parameters that have the greatest impact on complexity, study whether the model responds to each of them monotonically, and focus on tuning the ones that can reduce complexity the most. Parameters that are not monotonic, or that increase complexity, can be handled case by case, and much of the time can even be skipped. (The recommended tuning priority decreases from top to bottom; in this post the order is n_estimators, max_depth, max_features, min_samples_leaf, min_samples_split, criterion, which is also the order followed in the code below.)

3. Random forest parameter tuning methods

1. Draw a learning curve

       Some parameters have no obvious reference point, so it is hard to pin down a range in advance. In that case, draw a learning curve to see the trend, pick a smaller interval based on the curve, run the curve again over that interval, and so on (it is recommended to print the maximum score together with the parameter value that produced it).

# Tuning step 1: n_estimators
cross = []
for i in range(0, 200, 10):
    rf = RandomForestClassifier(n_estimators=i+1, n_jobs=-1, random_state=42)
    cross_score = cross_val_score(rf, xtest, ytest, cv=5).mean()
    cross.append(cross_score)
plt.plot(range(1, 201, 10), cross)
plt.xlabel('n_estimators')
plt.ylabel('acc')
plt.show()
# print the best score and the n_estimators value that produced it
print((cross.index(max(cross))*10)+1, max(cross))

2. Grid search

       Some parameters have a known range, or we know roughly how the model's accuracy changes with their values, so we can hand the candidates to a grid search. One thing worth noting: if you put several parameters and their candidate values into the parameter grid at once, GridSearchCV will not discard any of them - it will try every combination, which sometimes does not help much and is time-consuming. The recommended practice is to tune one or two parameters at a time.

from sklearn.model_selection import GridSearchCV
# Tune max_depth
param_grid = {'max_depth': np.arange(1, 20, 1)}
# Pick the range based on the size of the data; for a dataset like this one, starting with 1-10 or 1-20 works
rf = RandomForestClassifier(n_estimators=11, random_state=42)
GS = GridSearchCV(rf, param_grid, cv=5)
GS.fit(data.data, data.target)
GS.best_params_  # best parameter combination
GS.best_score_   # best score
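To see why combining everything into one grid gets expensive, sklearn's ParameterGrid (the same expansion GridSearchCV performs internally) can count the candidate combinations. A quick sketch, using grids similar to the ones below:

import numpy as np
from sklearn.model_selection import ParameterGrid

combined = {'max_depth': np.arange(1, 20, 1),
            'max_features': np.arange(5, 30, 1),
            'min_samples_leaf': np.arange(1, 11, 1)}
# 19 * 25 * 10 = 4750 combinations, each fitted cv=5 times -> 23750 model fits
print(len(ParameterGrid(combined)))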

4. Detailed code

It is recommended to run the code section by section in a Jupyter notebook, since that at least guarantees the train/test split stays the same between runs, which is what makes the tuning meaningful.
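If you also want the split to be reproducible across notebook restarts (a small addition on my part, not in the original code below), you can pin it with random_state:

from sklearn.model_selection import train_test_split
# assumes data = load_breast_cancer() as in the listing below
xtrain, xtest, ytrain, ytest = train_test_split(data.data, data.target,
                                                test_size=0.3, random_state=42)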

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()  # breast cancer dataset
print(data.data.shape)

xtrain,xtest,ytrain,ytest = train_test_split(data.data,data.target,test_size=0.3)
# baseline model before tuning
rf = RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(xtrain,ytrain)
score = rf.score(xtest,ytest)
cross_s = cross_val_score(rf,xtest,ytest,cv=5).mean()
print('rf:',score)
print('cv:',cross_s)

# Tuning step 1: n_estimators
cross = []
for i in range(0, 200, 10):
    rf = RandomForestClassifier(n_estimators=i+1, n_jobs=-1, random_state=42)
    cross_score = cross_val_score(rf, xtest, ytest, cv=5).mean()
    cross.append(cross_score)
plt.plot(range(1, 201, 10), cross)
plt.xlabel('n_estimators')
plt.ylabel('acc')
plt.show()
# print the best score and the n_estimators value that produced it
print((cross.index(max(cross))*10)+1, max(cross))
# Narrow down the range for n_estimators
cross = []
for i in range(0, 25):
    rf = RandomForestClassifier(n_estimators=i+1, n_jobs=-1, random_state=42)
    cross_score = cross_val_score(rf, xtest, ytest, cv=5).mean()
    cross.append(cross_score)
plt.plot(range(1, 26), cross)
plt.xlabel('n_estimators')
plt.ylabel('acc')
plt.show()
print(cross.index(max(cross))+1, max(cross))

# Tune max_depth
param_grid = {'max_depth': np.arange(1, 20, 1)}
# Pick the range based on the size of the data; for a dataset like this one, starting with 1-10 or 1-20 works
rf = RandomForestClassifier(n_estimators=11, random_state=42)  # keep the n_estimators value chosen from the learning curve
GS = GridSearchCV(rf, param_grid, cv=5)
GS.fit(data.data, data.target)
GS.best_params_
GS.best_score_

# Tune max_features
param_grid = {'max_features' : np.arange(5,30,1)}
rf = RandomForestClassifier(n_estimators=11,random_state=42)
GS = GridSearchCV(rf,param_grid,cv=5)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

# Tune min_samples_leaf
param_grid = {'min_samples_leaf': np.arange(1, 1+10, 1)}
# Usually start from the minimum value and work upward by 10 or 20
# For high-dimensional, high-sample-count data you can step up by 50; very large datasets may need 200-300
# If accuracy just won't improve while tuning, feel free to try a very large value to strongly limit the model's complexity
rf = RandomForestClassifier(n_estimators=11,random_state=42)
GS = GridSearchCV(rf,param_grid,cv=5)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

# Tune min_samples_split
param_grid = {'min_samples_split': np.arange(2, 2+20, 1)}
# Usually start from the minimum value and work upward by 10 or 20
# For high-dimensional, high-sample-count data you can step up by 50; very large datasets may need 200-300
# If accuracy just won't improve while tuning, feel free to try a very large value to strongly limit the model's complexity
rf = RandomForestClassifier(n_estimators=11,random_state=42)
GS = GridSearchCV(rf,param_grid,cv=5)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

# Tune criterion
param_grid = {'criterion': ['gini', 'entropy']}
rf = RandomForestClassifier(n_estimators=11,random_state=42)
GS = GridSearchCV(rf,param_grid,cv=5)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_
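As a possible final step (an addition of mine, not in the original post), you can refit one model with the best values found above and compare it against the baseline; GridSearchCV also exposes the refitted model directly as GS.best_estimator_. The numbers below are placeholders - substitute whatever your own searches returned:

# placeholder values: replace them with your own best_params_ results
best = {'max_depth': 8, 'max_features': 10, 'min_samples_leaf': 1}
rf_final = RandomForestClassifier(n_estimators=11, random_state=42, **best)
print('tuned rf:', cross_val_score(rf_final, data.data, data.target, cv=5).mean())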

I hope you got something out of this - feel free to leave a comment~

Original post: blog.csdn.net/weixin_44904136/article/details/126221854