Implementing decision trees and random forests in scikit-learn

Tree-based learning algorithms are popular and widely used non-parametric supervised learning methods that can be applied to both classification and regression. This blog summarizes how to implement tree models in scikit-learn; it does not go deeply into the underlying theory.

1. Train a decision tree classifier

Use DecisionTreeClassifier in scikit-learn to train a decision tree classifier.

from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0)
# Train the model
model = decisiontree.fit(features,target)

The decision tree learner looks for the decision rule that yields the greatest reduction in impurity at each node. The Gini index is used by default.
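
For reference, the Gini impurity of a node is

G = 1 - \sum_{k=1}^{K} p_k^2

where p_k is the proportion of samples of class k at the node and K is the number of classes; a pure node has G = 0.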

# Create a new observation
observation = [[5,4,3,2]]
# Predict the observation's class
model.predict(observation)
--->
array([1])

You can also use the predict_proba method to view the predicted probability that the observation belongs to each class:

# View the probabilities that the observation belongs to each of the three classes
model.predict_proba(observation)
--->
array([[0., 1., 0.]])

If you want to use a different impurity measure, you can change it with the criterion parameter.
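
For example, a minimal sketch that splits on entropy (information gain) instead of the Gini index, reusing the iris features and target from above:

# Create a decision tree classifier that uses entropy as the impurity measure
decisiontree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0)
model_entropy = decisiontree_entropy.fit(features, target)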

2. Train a decision tree regression model

Use DecisionTreeRegressor to train a decision tree regression model:

from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets
# Load the data, keeping only two features
# Note: load_boston was removed in scikit-learn 1.2, so this example requires an older version
boston = datasets.load_boston()
features = boston.data[:,0:2]
target = boston.target
# Create a decision tree regressor object
decisiontree = DecisionTreeRegressor(random_state=0)
# Train the model
model = decisiontree.fit(features,target)

By default, the decision tree regression model uses the reduction in mean squared error (MSE) as the measure of split quality:

MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

# Create a new observation
observation = [[0.02,16]]
# Predict the observation's target value
model.predict(observation)
--->
array([33.])

The criterion parameter can be used to select a different measure of split quality, such as the mean absolute error (MAE).
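
As a sketch of how that looks, reusing the boston features and target from above (note the criterion string changed across scikit-learn versions):

# Use mean absolute error as the split criterion
# (scikit-learn >= 1.0 spells this criterion "absolute_error" instead of "mae")
decisiontree_mae = DecisionTreeRegressor(criterion="mae", random_state=0)
model_mae = decisiontree_mae.fit(features, target)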

3. Visualize a decision tree

Export the decision tree model to DOT format and render it:

import pydotplus
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from IPython.display import Image
from sklearn import tree
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0)
# Train the model
model = decisiontree.fit(features,target)
# Create the DOT data
dot_data = tree.export_graphviz(decisiontree,
                                out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names)
# Build the graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Display the graph
Image(graph.create_png())

[Figure: the rendered decision tree for the iris data]
If you want to use the visualization in other applications, you can export it to PDF or PNG:

# Create a PDF
graph.write_pdf("iris.pdf")
# Create a PNG
graph.write_png("iris.png")

4. Train a random forest classifier

Use RandomForestClassifier to train a random forest classifier model:

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a random forest classifier object
RF = RandomForestClassifier(random_state=0)
# Train the model
model = RF.fit(features,target)

Prediction works just as with DecisionTreeClassifier. For example, to view the predicted probability of each class for a new observation:

# Create a new observation
observation = [[5,4,3,2]]
# View the probabilities that the observation belongs to each of the three classes
model.predict_proba(observation)
--->
array([[0., 1., 0.]])

However, as a forest rather than a single decision tree, RandomForestClassifier has some unique and important parameters. First, max_features determines the maximum number of features considered at each node. Allowed values include an integer (a number of features), a float (a fraction of the features), and "sqrt" (the square root of the number of features). By default max_features is set to "auto", which for classifiers is equivalent to "sqrt".
Second, the bootstrap parameter sets whether the sample subset used to build each tree is drawn with replacement or without replacement. Third, n_estimators sets the number of decision trees in the forest. A sketch with all three set explicitly follows.
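
A minimal sketch setting these three parameters explicitly (the values here are illustrative, not recommendations):

# Forest of 100 trees, each split considering sqrt(n_features) candidate features,
# with bootstrap sampling (sampling with replacement) enabled
RF = RandomForestClassifier(n_estimators=100,
                            max_features="sqrt",
                            bootstrap=True,
                            random_state=0)
model = RF.fit(features, target)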

5. Train a random forest regression model

Use RandomForestRegressor to train a random forest regression model:

from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
# Load the data, keeping only two features
# Note: load_boston was removed in scikit-learn 1.2, so this example requires an older version
boston = datasets.load_boston()
features = boston.data[:,0:2]
target = boston.target
# Create a random forest regressor object
RF = RandomForestRegressor(random_state=0)
# Train the model
model = RF.fit(features,target)

Just as with the random forest classifier, each tree in a random forest regression model is trained on a bootstrap sample of the data, and only a subset of the features is considered by the decision rule at each node. Like RandomForestClassifier, the random forest regression model has several important parameters (a sketch follows the list):

  • max_features sets the maximum number of features considered at each node. For regression the historical default ("auto") uses all p features, where p is the total number of features; for classification the default is √p.
  • bootstrap sets whether to sample with replacement; the default is True.
  • n_estimators sets the number of decision trees; the default was 10 (raised to 100 in scikit-learn 0.22).
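
A corresponding sketch for the regressor (values illustrative; it reuses the boston features and target from above):

# 100 trees, each split considering all features, with bootstrap sampling enabled
# (a float max_features is read as a fraction of the feature count, so 1.0 means all)
RF = RandomForestRegressor(n_estimators=100,
                           max_features=1.0,
                           bootstrap=True,
                           random_state=0)
model = RF.fit(features, target)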

6. Identify important features in a random forest

Calculate and visualize the importance (feature importance) of each feature:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a random forest classifier object
randomforest = RandomForestClassifier(random_state=0,n_jobs=-1)
# Train the model
model = randomforest.fit(features,target)
# Calculate the feature importances
importances = model.feature_importances_
# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
# Rearrange the feature names to match the sorted importances
names = [iris.feature_names[i] for i in indices]
# Create the plot
plt.figure()
# Add a title
plt.title("Feature Importance")
# Add the bars
plt.bar(range(features.shape[1]),importances[indices])
# Add the feature names as x-axis labels
plt.xticks(range(features.shape[1]),names,rotation = 90)
# Show the plot
plt.show()

[Figure: bar chart of iris feature importances, sorted in descending order]

In scikit-learn, the decision tree and random forest models for both classification and regression expose the importance of each feature through the feature_importances_ attribute:

# View the feature importances
model.feature_importances_
--->
array([0.09090795, 0.02453104, 0.46044474, 0.42411627])

The larger the value, the more important the feature (the importances of all features sum to 1). Plotting these values helps interpret the random forest model.
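
A quick check of that property on the model trained above:

# The importances are non-negative and sum to 1
print(model.feature_importances_.sum())  # -> 1.0, up to floating-point rounding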

7. Select important features in a random forest

Identify the important features first, then retrain the model using only those features:

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a random forest classifier
RF = RandomForestClassifier(random_state=0,n_jobs=-1)
# Create a selector that keeps features with importance greater than or equal to the threshold
selector = SelectFromModel(RF,threshold=0.3)
# Use the selector to create a new feature matrix
features_important = selector.fit_transform(features,target)
# Train a random forest model using the important features
model = RF.fit(features_important,target)

Two caveats apply when selecting features this way: first, the importance of a one-hot-encoded nominal categorical feature gets diluted across its binary features; second, the importance of a pair of highly correlated features tends to be concentrated in one of them rather than split evenly between the two.
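
To see exactly which features the selector kept, you can inspect its boolean support mask (a small sketch using the fitted selector from above):

# Boolean mask over the original features: True means the feature was kept
mask = selector.get_support()
# Names of the features whose importance met the threshold
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected_names)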

8. Handle imbalanced classes

If you need to train a random forest on highly imbalanced data, you can pass class_weight="balanced" when creating the decision tree or random forest model:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Drop the first 40 observations to make the classes highly imbalanced
features = features[40:,:]
target = target[40:]
# Create a target vector indicating whether an observation is class 0 or class 1
target = np.where((target==0),0,1)
# Create a random forest classifier object
randomforest = RandomForestClassifier(random_state=0,n_jobs=-1,class_weight="balanced")
# Train the model
model = randomforest.fit(features,target)

Imbalanced classes are common in practice, and if they are not addressed they can degrade a model's performance. Setting class_weight="balanced" increases the weight of the minority class (and decreases the weight of the majority class).
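
Under the hood, "balanced" computes each class weight as n_samples / (n_classes * bincount(y)). A small sketch reproducing that computation by hand for the target vector above:

# Reproduce the "balanced" class weights manually
n_samples = len(target)
n_classes = len(np.unique(target))
weights = n_samples / (n_classes * np.bincount(target))
print(weights)  # the minority class (0) receives the larger weight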

9. Control the size of the decision tree

Manually control the structure and size of a decision tree:

from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a decision tree classifier object, spelling out the size-related parameters
DT = DecisionTreeClassifier(random_state=0,
                            max_depth=None,
                            min_samples_split=2,
                            min_samples_leaf=1,
                            min_weight_fraction_leaf=0,
                            max_leaf_nodes=None,
                            min_impurity_decrease=0)
# Train the model
model = DT.fit(features,target)

  • max_depth: the maximum depth of the tree. If None, the tree keeps growing until all leaf nodes are pure. If an integer is provided, the tree is effectively "pruned" to that depth (see the sketch after this list).
  • min_samples_split: the minimum number of samples a node must contain before it can be split. An integer gives the minimum count directly; a float gives it as a fraction of the total number of samples.
  • min_samples_leaf: the minimum number of samples required at a leaf node, in the same format as min_samples_split.
  • max_leaf_nodes: the maximum number of leaf nodes.
  • min_impurity_decrease: the minimum impurity decrease required to perform a split.
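
For instance, a small sketch of the pruning effect of max_depth (get_depth is available in scikit-learn >= 0.21):

# Train a shallow tree and compare its depth with the unrestricted tree above
DT_small = DecisionTreeClassifier(random_state=0, max_depth=3)
model_small = DT_small.fit(features, target)
print(model.get_depth(), model_small.get_depth())  # e.g. 5 vs. 3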

10. Improve performance through boosting

This topic belongs to ensemble learning and involves xgboost, lightgbm, and related libraries, so it is not covered here; I will organize the ensemble learning material separately later.

11. Use out-of-bag error to evaluate a random forest model

Evaluate the random forest model without using cross-validation:

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create a random forest classifier object
randomforest = RandomForestClassifier(random_state=0,n_jobs=-1,n_estimators=1000,oob_score=True)
# Train the model
model = randomforest.fit(features,target)
# View the out-of-bag score
randomforest.oob_score_
--->
0.9533333333333334

In a random forest, each decision tree is trained on a bootstrap sample. This means that for each tree there is a subset of observations that did not participate in its training; these are called out-of-bag (OOB) observations, and they can serve as a test set to evaluate the performance of the random forest.

For each observation, the algorithm compares its true value with the prediction produced by the subset of trees that were not trained on it. Aggregating these comparisons over all observations yields an overall performance score, so OOB evaluation can be used as an alternative to cross-validation.
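
As a quick sanity check, the OOB score is typically close to a cross-validated accuracy estimate; the sketch below reuses the fitted classifier from above:

from sklearn.model_selection import cross_val_score

# Compare the OOB score with 5-fold cross-validated accuracy
cv_scores = cross_val_score(randomforest, features, target, cv=5)
print(randomforest.oob_score_, cv_scores.mean())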

  • Reference: Python Machine Learning Manual

Origin: blog.csdn.net/weixin_44127327/article/details/109161102