[Machine Learning] Random Forest and Parameter Adjustment Study Notes

Table of contents

Ensemble learning
  Boosting
    Representative algorithms
  Bagging (bootstrap aggregating)
    Random forest
    Random forest parameters
Random forest parameter tuning in practice
Summary of the advantages and disadvantages of random forest

Ensemble learning

Ensemble learning completes a task by building multiple learners, combining many weak learners into a single strong learner.
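
As a minimal illustration of the idea (not part of the notes below), scikit-learn's VotingClassifier combines several weak learners by majority vote:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Three individually weak learners, combined by majority (hard) vote.
ensemble = VotingClassifier(estimators=[
    ('stump', DecisionTreeClassifier(max_depth=1)),
    ('tree', DecisionTreeClassifier(max_depth=3)),
    ('logreg', LogisticRegression(max_iter=1000)),
], voting='hard')
ensemble.fit(X, y)
print(ensemble.score(X, y))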

Boosting:

There are strong dependencies between the individual learners, so they are generated serially: the weak classifiers (base estimators) are trained one after another and combined linearly.

Each round increases the weights of the samples that the previous round's weak classifier misclassified, and decreases the weights of the samples it classified correctly.
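
As a rough numpy sketch of that reweighting rule in its AdaBoost form (w, y_true and y_pred are assumed names for the sample weights and the ±1 true and predicted labels):

import numpy as np

def update_weights(w, y_true, y_pred):
    err = np.sum(w * (y_true != y_pred)) / np.sum(w)  # weighted error of the weak classifier
    alpha = 0.5 * np.log((1 - err) / err)             # weight of this classifier in the linear combination
    w = w * np.exp(-alpha * y_true * y_pred)          # misclassified samples (y_true*y_pred = -1) get larger weights
    return w / w.sum(), alpha                         # renormalize so the weights sum to 1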

 Representative algorithms

  • Adaboost
  • GBDT
  • XGBoost
  • LightGBM

Bagging (bootstrap aggregating):

There are no strong dependencies between the individual learners, so they can be generated in parallel.

In a random forest, each tree additionally considers only a randomly selected subset of the features at each split.

 Random forest

All base estimators of a random forest are decision trees. A forest composed of classification trees is called a random forest classifier, and a forest composed of regression trees is called a random forest regressor.
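
In scikit-learn these correspond to two classes (a minimal sketch; max_features is the knob for the random feature subset mentioned above):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# A forest of classification trees ...
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
# ... and a forest of regression trees.
reg = RandomForestRegressor(n_estimators=100)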

 Random forest parameters

 Parameters of the trees in a random forest:
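
A sketch of the main tree-level parameters, using scikit-learn's RandomForestClassifier (the values shown are roughly sklearn's defaults, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=None,        # maximum depth of each tree; None grows nodes until leaves are pure
    min_samples_split=2,   # minimum number of samples required to split an internal node
    min_samples_leaf=1,    # minimum number of samples required at a leaf node
    max_features='sqrt',   # number of features considered when looking for the best split
    oob_score=False,       # whether to use out-of-bag samples to estimate accuracy
    random_state=None,     # seed that controls the randomness
)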

Dataset:

Link: https://pan.baidu.com/s/1wUK0-u6pTsMqKy5dqcfQew?pwd=ectd 
Extraction code: ectd

Random forest parameter tuning in practice:

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")
data = df.iloc[:, 1:31]   # drop the Time column; keep V1-V28, Amount and Class
data.head()

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

num_record_fraud = len(data[data.Class == 1])           # number of fraud samples
fraud_indices = np.array(data[data.Class == 1].index)   # indices of samples with Class == 1
normal_indices = np.array(data[data.Class == 0].index)  # indices of samples with Class == 0
# randomly sample as many normal samples as there are fraud samples
# (without replacement, so no row is drawn twice)
random_normal_indices = np.random.choice(normal_indices, num_record_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
# concatenate the positive and negative sample indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
# select the rows by index
under_sample_data = data.iloc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size=0.3)

rf0 = RandomForestClassifier(oob_score=True, random_state=666)
rf0.fit(X_train, y_train.values.ravel())   # ravel to pass the labels as a 1-d array
print(rf0.oob_score_)   # out-of-bag estimate of the generalization accuracy
'''predict returns the predicted label; predict_proba returns the predicted probability
of each class. predict_proba returns an n-by-k array whose entry (i, j) is the predicted
probability that sample i belongs to class j, and each row sums to 1.'''
y_pred = rf0.predict_proba(X_test)[:, 1]   # predicted probability that the label is 1
print('AUC Score (Test): %f' % roc_auc_score(y_test, y_pred))
# 0.936046511627907
# AUC Score (Test): 0.976516


# grid search over the number of trees
param1 = {'n_estimators': range(10, 101, 10)}
search1 = GridSearchCV(estimator=RandomForestClassifier(oob_score=True, random_state=666, n_jobs=2),
                       param_grid=param1, scoring='roc_auc', cv=5)
search1.fit(X_train, y_train.values.ravel())
search1.cv_results_, search1.best_params_, search1.best_score_
# {'n_estimators': 70},
#  0.9707089376393281)

# grid search over the maximum tree depth
param2 = {'max_depth': range(2, 12, 2)}
search2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=70, oob_score=True, random_state=666, n_jobs=2),
                       param_grid=param2, scoring='roc_auc', cv=5)
search2.fit(X_train, y_train.values.ravel())
search2.cv_results_, search2.best_params_, search2.best_score_
# {'max_depth': 10},
#  0.9710400329003688)

# grid search over the minimum number of samples required to split a node
param3 = {'min_samples_split': range(2, 8, 1)}
search3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=70,
                                                        max_depth=10, oob_score=True,
                                                        random_state=666, n_jobs=2),
                       param_grid=param3, scoring='roc_auc', cv=5)
search3.fit(X_train, y_train.values.ravel())
search3.cv_results_, search3.best_params_, search3.best_score_
# {'min_samples_split': 4},
#  0.972142760065589)

# retrain with the tuned parameters
rf1 = RandomForestClassifier(n_estimators=70, max_depth=10, oob_score=True,
                             min_samples_split=4,
                             random_state=666, n_jobs=2)
rf1.fit(X_train, y_train.values.ravel())
print(rf1.oob_score_)
y_pred = rf1.predict_proba(X_test)[:, 1]   # predicted probability that the label is 1
print('AUC Score (Test): %f' % roc_auc_score(y_test, y_pred))
# 0.9433139534883721
# AUC Score (Test): 0.987851

Summary of the advantages and disadvantages of random forest

RF advantages
1. It is not prone to overfitting, because each tree is trained on a bootstrap sample rather than on all of the training data.
2. It can handle discrete-valued attributes (as when building trees with the ID3 algorithm) as well as continuous-valued attributes (as with the C4.5 algorithm).
3. It handles high-dimensional data sets well: it can process thousands of input variables and identify the most important ones, so it is also considered a useful dimensionality-reduction method. In addition, the model can output variable importances, which is very convenient (see the sketch after this list).
4. When the classes are imbalanced, random forests provide an effective way to balance the data set's error.
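
A minimal sketch of reading the variable importances mentioned in point 3, assuming the fitted rf1 and the X_train DataFrame from the tuning code above:

import pandas as pd

# feature_importances_ is available on any fitted forest
importances = pd.Series(rf1.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  # ten most important features
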
RF disadvantages
1. Random forests do not perform as well on regression problems as on classification, because they cannot produce a truly continuous output. When doing regression, a random forest cannot predict beyond the range of the training data, which can lead to overfitting when modeling noisy data.
2. To many statistical modelers, a random forest feels like a black box: you have almost no control over its internal workings and can only try different parameters and random seeds.

