Decision Tree, Random Forest, and Gradient Boosting Tree for Model Training


This section covers:

  • Using tree models in sklearn: Decision Tree / Random Forest / GBDT
  • Classification evaluation metrics: accuracy, TPR and FPR, ROC curve, PR curve, AP, F1 score
  • Cross-validation: k-fold cross-validation and leave-one-out cross-validation
  • Hyperparameter search: grid search, random search, and hyperopt automated search

1. Import the toolkits

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

2. Experimental data generation

To generate experimental data we use make_blobs, which produces clustered samples according to the total number of samples, the number of cluster centers, and the degree of dispersion specified by the user.
Parameters of make_blobs:

  • n_samples: the total number of samples;
  • centers: the number of cluster centers;
  • random_state: the random seed, which ensures the same data is generated on every run;
  • cluster_std: the standard deviation of each cluster; the larger the value, the more dispersed the points.
from sklearn.datasets import make_blobs
X,y=make_blobs(n_samples=500,centers=2,random_state=21,cluster_std=3.5)
# Plot a scatter plot of the data
plt.scatter(X[:,0],X[:,1],c=y,s=50)
plt.show()

(Figure: scatter plot of the generated data, colored by cluster label)

3. Dataset splitting

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=32)
print(X_train.shape,X_test.shape)
(350, 2) (150, 2)  # printed output

4. Using the tree models

The machine learning models used include:

  • Decision Tree
  • Random Forest
  • Gradient Boosting Decision Tree (GBDT); the most commonly used framework is LightGBM

4.1 Decision Tree

from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(
	criterion='entropy',
	splitter='best',
	max_depth=4,
	max_leaf_nodes=12,
	min_samples_leaf=30,
	presort=True,  # optional, defaults to False; deprecated and later removed in newer sklearn versions
)
dt.fit(X_train,y_train)
y_pred_dt=dt.predict(X_test)
# Plot the test samples colored by the predicted class
plt.scatter(X_test[:,0],X_test[:,1],c=y_pred_dt,s=20)
plt.title('Decision tree classification result')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

(Figure: decision tree classification result)
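To see the split rules the tree actually learned, the fitted model can be drawn directly. A minimal sketch, assuming sklearn >= 0.21 (the feature names 'x' and 'y' are placeholder labels for the two generated features):

from sklearn.tree import plot_tree
plt.figure(figsize=(10,6))
# Each node shows its split condition, entropy, sample count and class counts
plot_tree(dt,feature_names=['x','y'],class_names=['0','1'],filled=True)
plt.show()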

4.2 Random Forest

from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(
	n_estimators=10,
	criterion='entropy',
	max_depth=4,
	max_leaf_nodes=12,
	min_samples_leaf=30,
	bootstrap=True,
	n_jobs=1,
	max_samples=0.8
)
rf.fit(X_train,y_train)
y_pred_rf=rf.predict(X_test)
# Plot the test samples colored by the predicted class
plt.scatter(X_test[:,0],X_test[:,1],c=y_pred_rf,s=20)
plt.title('Random forest classification result')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

(Figure: random forest classification result)
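As an aside, a fitted forest exposes how much each feature contributed to its splits; a minimal sketch (the feature names 'x' and 'y' are again placeholders):

# feature_importances_ holds the normalized impurity reduction per feature
for name,importance in zip(['x','y'],rf.feature_importances_):
    print('{}: {:.3f}'.format(name,importance))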

4.3 LightGBM

  • LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision trees, developed by Microsoft as an engineering implementation of the GBDT algorithm. It has the following advantages:
    • Fast training
    • Low memory usage
    • High accuracy
    • Support for parallel learning
    • Ability to handle large-scale data
  • Run the following command to install lightgbm:
    • pip3 install lightgbm
  • The main general parameters to set are:
    • boosting_type: the boosting algorithm type; the default 'gbdt' is the traditional GBDT algorithm. For large datasets it can be set to 'goss', which speeds up training at the cost of some accuracy.
    • objective: specifies the learning task:
      • For LGBMRegressor (regression tasks), use 'regression';
      • For LGBMClassifier (classification tasks), use 'binary' for binary classification and 'multiclass' for multi-class classification;
      • For LGBMRanker (ranking tasks), use 'lambdarank'.
import lightgbm as lgb
params={
	'boosting_type':'gbdt',
	'objective':'binary',
	'n_estimators':200,
	'learning_rate':0.1,
	'max_depth':5,
	'num_leaves':25,
	'min_child_samples':14,
	'subsample':0.8,
	'colsample_bytree':0.7,
	'subsample_freq':10,
	'reg_alpha':1.0,
	'reg_lambda':0.1,
}
model=lgb.LGBMClassifier(**params,random_state=50)
# Train the model
model.fit(X_train,y_train,eval_metric='auc',eval_set=[(X_test,y_test)],eval_names=['test'])
y_pred_lgb=model.predict(X_test)
y_pred_proba_lgb=model.predict_proba(X_test)
# Plot the test samples colored by the predicted class
plt.scatter(X_test[:,0],X_test[:,1],c=y_pred_lgb,s=20)
plt.title('LightGBM classification result')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

(Figure: LightGBM classification result)
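Since an eval_set is already supplied to fit, early stopping is a natural extension. A sketch assuming a recent lightgbm version (older versions pass early_stopping_rounds to fit instead of a callback); the 20-round patience is an illustrative choice:

model_es=lgb.LGBMClassifier(**params,random_state=50)
model_es.fit(
    X_train,y_train,
    eval_metric='auc',
    eval_set=[(X_test,y_test)],
    # stop when the test AUC has not improved for 20 consecutive rounds
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
)
print('best iteration:',model_es.best_iteration_)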

5. Classification model evaluation metrics

The evaluation metrics used include: confusion matrix, accuracy,
true positive rate and false positive rate, ROC, AUC, PR curve, AP, and F1 score.

5.1 Confusion matrix

from sklearn.metrics import confusion_matrix
c_matric_dt=confusion_matrix(y_test,y_pred_dt)
c_matric_rf=confusion_matrix(y_test,y_pred_rf)
c_matric_lgb=confusion_matrix(y_test,y_pred_lgb)
print('Decision tree confusion_matrix:\n{}\n'.format(c_matric_dt))
print('Random Forest confusion_matrix:\n{}\n'.format(c_matric_rf))
print('LightGBM confusion_matrix:\n{}\n'.format(c_matric_lgb))
# Output:
Decision tree confusion_matrix:
[[77  2]
 [ 3 68]]

Random Forest confusion_matrix:
[[79  0]
 [ 6 65]]

LightGBM confusion_matrix:
[[77  2]
 [ 3 68]]

5.2 Accuracy

from sklearn.metrics import accuracy_score
accuracy_dt=accuracy_score(y_test,y_pred_dt)
accuracy_rf=accuracy_score(y_test,y_pred_rf)
accuracy_lgb=accuracy_score(y_test,y_pred_lgb)
print('Decision tree accuracy:\n{}\n'.format(accuracy_dt))
print('Random Forest accuracy:\n{}\n'.format(accuracy_rf))
print('LightGBM accuracy:\n{}\n'.format(accuracy_lgb))
# Output:
Decision tree accuracy:
0.9666666666666667

Random Forest accuracy:
0.96

LightGBM accuracy:
0.9666666666666667

5.3 True positive rate and false positive rate

import pandas as pd
from sklearn.metrics import roc_curve
# Compute fpr and tpr
fpr,tpr,thresholds=roc_curve(y_test,y_pred_proba_lgb[:,1])
# Store thresholds, tpr and fpr in a DataFrame for easy display
result=pd.DataFrame([thresholds,tpr,fpr],index=['thresholds','tpr','fpr'])
print(result)
# Output:
                  0         1         2         3         4         5   \
thresholds  1.992467  0.992467  0.943519  0.937773  0.813253  0.407940   
tpr         0.000000  0.887324  0.915493  0.929577  0.957746  0.957746   
fpr         0.000000  0.000000  0.012658  0.025316  0.025316  0.037975   

                  6         7         8         9        10  
thresholds  0.343985  0.185725  0.132040  0.044848  0.00497  
tpr         0.971831  0.985915  0.985915  0.985915  1.00000  
fpr         0.037975  0.075949  0.101266  0.227848  1.00000 
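
In sklearn's binary confusion matrix the layout is [[TN, FP], [FN, TP]], so as a sanity check the TPR and FPR at the default 0.5 threshold can be read straight off the LightGBM matrix from section 5.1:

tn,fp,fn,tp=c_matric_lgb.ravel()
print('TPR = TP/(TP+FN) = {:.3f}'.format(tp/(tp+fn)))
print('FPR = FP/(FP+TN) = {:.3f}'.format(fp/(fp+tn)))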

5.4 ROC

import matplotlib.pyplot as plt
plt.figure()
# Scatter plot to mark the individual operating points
plt.scatter(fpr,tpr)
# Plot the ROC curve
plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve')
plt.xlim([-0.05,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LightGBM ROC')
plt.legend(loc='lower right')
plt.show()

(Figure: LightGBM ROC curve)

5.5 AUC

from sklearn.metrics import roc_auc_score
auc_dt=roc_auc_score(y_test,y_pred_dt)
auc_rf=roc_auc_score(y_test,y_pred_rf)
auc_lgb=roc_auc_score(y_test,y_pred_lgb)
print('Decision tree AUC:{:.3f}\n'.format(auc_dt))
print('Random Forest AUC:{:.3f}\n'.format(auc_rf))
print('LightGBM AUC:{:.3f}\n'.format(auc_lgb))
# Output:
Decision tree AUC:0.966
Random Forest AUC:0.958
LightGBM AUC:0.966
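
Note that the AUC values above are computed from hard 0/1 predictions. For a threshold-independent AUC, pass the positive-class probability instead (shown here only for LightGBM, the one model whose probabilities we kept):

auc_lgb_proba=roc_auc_score(y_test,y_pred_proba_lgb[:,1])
print('LightGBM AUC (from probabilities): {:.3f}'.format(auc_lgb_proba))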

5.6 Precision and recall

from sklearn.metrics import precision_recall_curve
# Compute precision and recall
precision,recall,thresholds=precision_recall_curve(y_test,y_pred_proba_lgb[:,1])
# Store thresholds, precision and recall in a DataFrame for easy display
result=pd.DataFrame([thresholds,precision,recall],index=['thresholds','precision','recall'])
print(result)
# Output:
                  0         1         2         3         4         5   \
thresholds  0.004970  0.044848  0.132040  0.185725  0.343985  0.407940   
precision   0.473333  0.795455  0.897436  0.921053  0.958333  0.957746   
recall      1.000000  0.985915  0.985915  0.985915  0.971831  0.957746   

                  6         7         8         9         10   11  
thresholds  0.813253  0.897236  0.937773  0.943519  0.992467  NaN  
precision   0.971429  0.971014  0.970588  0.984848  1.000000  1.0  
recall      0.957746  0.943662  0.929577  0.915493  0.887324  0.0

5.7 PR curve

# Set the figure size
plt.figure(figsize=(12,8))
# Fill the area under the curve
plt.fill_between(recall,precision,alpha=0.2,color='b',step='post')
# Scatter plot to highlight the coordinate points
plt.scatter(recall,precision,alpha=0.8,color='r')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0,1.05])
plt.xlim([0.0,1.05])
plt.show()

(Figure: LightGBM PR curve)

5.8 AP

from sklearn.metrics import average_precision_score
ap_dt=average_precision_score(y_test,y_pred_dt)
ap_rf=average_precision_score(y_test,y_pred_rf)
ap_lgb=average_precision_score(y_test,y_pred_lgb)
print('Decision tree AP:{:.6f}\n'.format(ap_dt))
print('Random Forest AP:{:.6f}\n'.format(ap_rf))
print('LightGBM AP:{:.6f}\n'.format(ap_lgb))
# Output:
Decision tree AP:0.950382
Random Forest AP:0.955493
LightGBM AP:0.950382

5.9 F1 score

from sklearn.metrics import f1_score
f1_score_dt=f1_score(y_test,y_pred_dt)
f1_score_rf=f1_score(y_test,y_pred_rf)
f1_score_lgb=f1_score(y_test,y_pred_lgb)
print('Decision tree f1_score:{:.6f}\n'.format(f1_score_dt))
print('Random Forest f1_score:{:.6f}\n'.format(f1_score_rf))
print('LightGBM f1_score:{:.6f}\n'.format(f1_score_lgb))
# Output:
Decision tree f1_score:0.964539
Random Forest f1_score:0.955882
LightGBM f1_score:0.964539
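
F1 is the harmonic mean of precision and recall, F1 = 2PR/(P+R); a quick consistency check against the LightGBM confusion matrix from section 5.1:

tn,fp,fn,tp=c_matric_lgb.ravel()
p=tp/(tp+fp)   # precision
r=tp/(tp+fn)   # recall
print('F1 = 2PR/(P+R) = {:.6f}'.format(2*p*r/(p+r)))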

6. Cross validation

  • Question: splitting off part of the data as a validation set reduces the data available for training the model, which hurts especially when the dataset is small. Is there a better way to use the dataset effectively for training?
  • Answer: cross-validation.
  • Definition of cross-validation:
    • The idea of cross-validation is to reuse the data: split the sample data and recombine the splits into different training and test sets. The training set is used to train the model, and the test set is used to evaluate its predictions.
  • When to use cross-validation:
    • Cross-validation is usually used when data is limited. If the sample size is under 10,000, cross-validation is a good way to evaluate the model; if it is over 10,000, simply holding out part of the data as a validation set is usually sufficient.
    • Depending on how the data is split, two cross-validation schemes are commonly used (a code sketch follows in section 6.1):
      • K-fold cross-validation
      • Leave-one-out cross-validation

6.1 K-fold cross validation

  • A full treatment of the validation workflow is a substantial piece of work in itself, and this article will continue to be updated. In the meantime, a minimal sketch of both schemes follows.
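
A minimal sketch using sklearn's cross_val_score with the LightGBM model from section 4.3 (cross_val_score clones and refits the estimator per fold); the fold count cv=5 and the scoring choices are illustrative assumptions, not values from the original article:

from sklearn.model_selection import cross_val_score,LeaveOneOut
# 5-fold: each sample lands in the validation fold exactly once
scores=cross_val_score(model,X,y,cv=5,scoring='roc_auc')
print('5-fold AUC: {:.3f} +/- {:.3f}'.format(scores.mean(),scores.std()))
# Leave-one-out is the k = n extreme; accuracy is used because AUC is
# undefined on a single-sample fold. Slow: it fits one model per sample.
loo_scores=cross_val_score(model,X,y,cv=LeaveOneOut(),scoring='accuracy')
print('Leave-one-out accuracy: {:.3f}'.format(loo_scores.mean()))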

Origin blog.csdn.net/weixin_42961082/article/details/113815145