[Machine Learning] Decision Trees in Practice




This hands-on part builds on the theory to reinforce understanding through practical operation (the following code is meant to be run in a Jupyter notebook).




1. Preparatory work (setting font sizes and other plot styles in the Jupyter notebook)

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')



2. Visual display of tree model


Decision trees are not only easy to understand in theory (arguably the "friendliest" machine-learning algorithm), but their construction process can also be visualized (algorithms such as neural networks are essentially black-box models, whose construction is much harder to visualize). Another great advantage of decision trees, therefore, is that related packages can be used to inspect the constructed tree model. One such package for visualizing decision trees is Graphviz:

Download link: Graphviz.

Note: To be able to call the dot command from cmd, the PATH environment variable needs to be configured during installation (download the .exe installer and check "Add to PATH" during the installation process).
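If you want to confirm from Python that the dot executable is actually reachable on PATH (a quick sanity check, not part of the original walkthrough), shutil.which can be used:

import shutil

# Prints the full path of the dot executable if it is on PATH, otherwise None
print(shutil.which("dot"))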


1. Build a decision tree model on the iris dataset

# Import the iris dataset and the decision tree classifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Select features from the iris dataset
# The 4 features of the iris dataset are: sepal length, sepal width, petal length, petal width
# Below, petal length and petal width are used as the experimental features
X = iris.data[:, 2:]

# Extract the labels
y = iris.target

# Limit the maximum depth of the decision tree to 2 (other attributes can be restricted as well)
tree_clf = DecisionTreeClassifier(max_depth=2)

# Train the classifier
tree_clf.fit(X, y)

[Output: the fitted DecisionTreeClassifier(max_depth=2)]
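As an optional sanity check (not in the original post, and get_depth() assumes a reasonably recent scikit-learn), the depth and feature importances of the fitted tree can be inspected directly:

# Optional checks on the fitted tree (assumes the cell above has been run)
print(tree_clf.get_depth())            # 2, matching max_depth
print(tree_clf.feature_importances_)   # importance of petal length and petal width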


2. Specific steps for visualizing the decision tree

# Import the function for exporting the decision tree for visualization
from sklearn.tree import export_graphviz

export_graphviz(
    # Pass in the trained decision tree model
    tree_clf,
    
    # Set the output file (must be a .dot file, which is converted to jpg or png afterwards)
    out_file="iris_tree.dot",
    
    # Set the feature names
    feature_names=iris.feature_names[2:],
    
    # Set the names of the different classes (labels)
    class_names=iris.target_names,
    
    rounded=True,
    filled=True
)

# After this code runs, a .dot file is generated in the directory of this .ipynb file
# (its name is set by out_file; here it is iris_tree.dot)

Next, open cmd and use the dot command from the Graphviz package to convert the .dot file into other formats such as .png or .jpg. The general form of the command is:

dot -T<image format, e.g. jpg> <source .dot file, e.g. iris_tree.dot> -o <output file name, e.g. iris_tree.jpg>

For example: dot -Tjpg iris_tree.dot -o iris_tree.jpg

dot command format

After entering the command and pressing Enter, if nothing is printed (as shown in the figure above), the conversion succeeded. Open the directory where iris_tree.dot is stored and you will find the image file iris_tree.jpg. It looks as follows:

iris_tree
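As a side note (not part of the original walkthrough), newer versions of scikit-learn can draw the same tree directly with matplotlib via sklearn.tree.plot_tree, which avoids the external dot command entirely:

# Alternative visualization without Graphviz (requires a reasonably recent scikit-learn)
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 6))
plot_tree(tree_clf,
          feature_names=iris.feature_names[2:],
          class_names=list(iris.target_names),
          rounded=True,
          filled=True)
plt.show()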


Image interpretation

  • In the figure above, the first row of the root node shows the split feature. The root node splits on "petal length", and the condition is "petal length not greater than 2.45";
  • The second row of the root node is the Gini impurity, which is 0.667 at the root (a quick check of this value is sketched after this list);
  • The third row of the root node shows the class distribution of the current dataset: initially there are 3 kinds of flowers, with 50 samples each;
  • The fourth row of the root node indicates which class the data at the current node is assigned to.
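As a quick check of the Gini value above (a small verification sketch, not in the original post), the root-node impurity can be reproduced from the class counts 50/50/50:

# Gini impurity of the root node: 1 - sum(p_k^2) over the three classes
counts = np.array([50, 50, 50])
p = counts / counts.sum()
print(1 - np.sum(p ** 2))   # 0.666..., matching the 0.667 shown at the root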

3. Probability estimation

Estimating class probabilities: suppose the input is a flower with petal length 5 cm and petal width 1.5 cm. From the figure above, it should be classified as versicolor. Reading its leaf node in the tree, the class probabilities are:

  • Iris setosa: 0% (0/54);
  • Iris versicolor: 90.7% (49/54);
  • Iris virginica: 9.3% (5/54).

Let's call the function to view each probability value:

# Call the function to check the estimated probabilities
tree_clf.predict_proba([[5,1.5]])
Out
	array([[0.        , 0.90740741, 0.09259259]])

The probabilities returned by the function match the values we computed by hand.


# Call the function to view the predicted class
tree_clf.predict([[5,1.5]])
Out
	array([1])

Since the three kinds of flowers are indexed as:

0: Iris setosa;
1: Iris versicolor;
2: Iris virginica.

the output array([1]) means Iris versicolor.
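If you prefer the name instead of the index, the prediction can be mapped back through iris.target_names (a small convenience, not in the original post):

# Map the predicted class index back to the flower name
print(iris.target_names[tree_clf.predict([[5, 1.5]])])   # e.g. ['versicolor']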




3. Decision Boundary Display


Define a function to draw the decision boundary:

from matplotlib.colors import ListedColormap

# Define the function that plots the decision boundary
def plot_decision_boundary(clf, X, y, axes=[0,7,0,3], iris=True, legend=False, plot_training=True):
    
    # Build a grid of coordinates
    # Pick 100 evenly spaced points between axes[0] and axes[1]
    x1s = np.linspace(axes[0], axes[1], 100)
    # x1s.shape = (100,)
    
    # Pick 100 evenly spaced points between axes[2] and axes[3]
    x2s = np.linspace(axes[2], axes[3], 100)
    # x2s.shape = (100,)
    
    # Build the grid
    x1, x2 = np.meshgrid(x1s, x2s)
    # x1.shape = x2.shape = (100,100)
    
    # Use the two flattened grids as the two coordinates of new test points
    # x1.ravel() flattens the grid, so x1.ravel().shape = (10000,)
    # The flattened x1 and x2 become the two columns of the new data
    # np.c_[] concatenates the two arrays column-wise into one matrix
    X_new = np.c_[x1.ravel(), x2.ravel()]
    # Now X_new.shape = (10000, 2)
    
    # Predict on the newly built test data
    y_pred = clf.predict(X_new).reshape(x1.shape)
    
    # Choose the background colors
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    
    # Fill the regions
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:,0][y==0], X[:,1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:,0][y==1], X[:,1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:,0][y==2], X[:,1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

Draw the decision boundary:

# Plot the decision boundary
plt.figure(figsize=(8,4))
plot_decision_boundary(tree_clf, X, y)

# For readability, also draw the split lines of the decision boundary (based on the tree obtained above)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)

# Annotate the depth at which each region is split off
plt.text(1, 1.0, "Depth=1", fontsize=15)
plt.text(3.2, 2, "Depth=2", fontsize=13)
plt.text(3.2, 0.5, "Depth=2", fontsize=13)
plt.title('Decision Tree decision boundaries')

plt.show()

[Figure: decision boundaries of the depth-2 tree on the iris petal features, with the Depth=1 and Depth=2 split lines]




4. Regularization of decision trees (pre-pruning)


The DecisionTreeClassifier class has the following parameters that limit the shape of the decision tree (they can be used for pre-pruning); a cross-validated search over such parameters is sketched right after this list:

  • max_depth (the maximum depth of the decision tree)
  • min_samples_split (the minimum number of samples a node must have before it can be split)
  • min_samples_leaf (the minimum number of samples each leaf node must have after a split)
  • max_leaf_nodes (the maximum number of leaf nodes)
  • max_features (the maximum number of features evaluated when splitting each node; usually left unrestricted)
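As referenced above, here is a minimal sketch (the grid values are illustrative and not from the original post) of how such pruning parameters could be tuned with cross-validation, assuming the iris X and y defined earlier:

# Cross-validated search over a few pre-pruning parameters (illustrative values)
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 3, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)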

Next, use an experiment to test the regularization effect of restricting min_samples_leaf on the tree model (the other parameters can be tested in the same way, so we will not go through them one by one here):

# Use a slightly more difficult dataset here
from sklearn.datasets import make_moons

# Build the dataset: X.shape = (100,2), y.shape = (100,)
X, y = make_moons(n_samples=100, noise=0.25, random_state=43)

# Build the decision trees
tree_clf1 = DecisionTreeClassifier(random_state=6)
tree_clf2 = DecisionTreeClassifier(min_samples_leaf=5, random_state=16)
tree_clf1.fit(X, y)
tree_clf2.fit(X, y)

# Plot the decision boundaries
plt.figure(figsize=(12,4))
plt.subplot(121)
plot_decision_boundary(tree_clf1, X, y, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title('No Restrictions')
plt.subplot(122)
plot_decision_boundary(tree_clf2, X, y, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title('min_samples_leaf = 5')

Its output is as follows:

[Figure: decision boundaries on the moons data, left: no restrictions, right: min_samples_leaf = 5]

The following conclusions can be drawn from the figure above:

  • The left plot places no restrictions on the tree, so it can classify every training point (no point is left misclassified) and the resulting model is quite complicated; however, it overfits severely.
  • Because of the restriction in the right plot, the resulting tree is not overly complicated (Occam's razor: the simpler, the better), although it naturally does not fit the training data as tightly as the left plot. A rough held-out accuracy check is sketched below.
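As mentioned above, a rough way to quantify this is to compare training and test accuracy; the sketch below assumes we hold out part of the moons data (the original post only compares the plots):

# Compare training vs. held-out accuracy for the two trees (illustrative split)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unrestricted = DecisionTreeClassifier(random_state=6).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=5, random_state=16).fit(X_train, y_train)

print("no restrictions :", unrestricted.score(X_train, y_train), unrestricted.score(X_test, y_test))
print("min_samples_leaf:", regularized.score(X_train, y_train), regularized.score(X_test, y_test))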



5. Experiment: explore the sensitivity of the tree model to data


The following is an experiment to explore whether the constructed model is still stable when the sample data changes.

# Build random test data
np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32) * 2

# Define the rotation angle for the data
angle = np.pi / 4

# Rotate the original data matrix
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

# Build classifier tree_clf_s and train it on the original data
tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)

# Build classifier tree_clf_sr and train it on the rotated data
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)

# Plot the decision boundaries of the two classifiers
plt.figure(figsize=(11,4))

plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7,0.7,-0.7,0.7], iris=False)
plt.title('Sensitivity to training set rotation')

plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7,0.7,-0.7,0.7], iris=False)
plt.title('Sensitivity to training set rotation')
plt.show()

Its output is as follows:

[Figure: decision boundaries before (left) and after (right) rotating the training set by 45°]

The left side of the figure above shows the decision boundary learned on the original dataset. Next, the dataset is rotated by 45°. Intuitively, the decision boundary should simply rotate by 45° as well; however, as the right side of the figure shows, the boundary does not rotate with the data but is redrawn in the quite different way shown on the right.

This shows that the decision tree model is very sensitive to the data.




6. Experiment: Solving regression problems with decision trees


When decision trees are used for classification the targets are discrete, while for regression they are continuous, so different criteria are needed to evaluate splits: classification uses entropy or Gini impurity (Gini is the more common choice because it is slightly cheaper to compute), while regression uses variance (variance is the simplest measure of how similar a group of values is). A small sketch of how a regression split is scored follows.
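The sketch below uses toy numbers (not from the original post) to show the idea: CART picks the split that most reduces the weighted variance of the two children.

# Score a candidate regression split by the weighted variance of the children
def weighted_child_variance(y_left, y_right):
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * np.var(y_left) + (len(y_right) / n) * np.var(y_right)

y_node = np.array([0.10, 0.20, 0.15, 0.90, 1.00, 0.95])   # toy target values at a node
print(np.var(y_node))                                      # variance before splitting
print(weighted_child_variance(y_node[:3], y_node[3:]))     # much smaller after this split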

# Import the relevant class
from sklearn.tree import DecisionTreeRegressor

# Construct the data (one-dimensional)
m = 200
X = np.random.rand(m, 1)
y = 4*(X-0.5)**2 + np.random.randn(m, 1)/10

# Train the regressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

[Output: the fitted DecisionTreeRegressor(max_depth=2)]

# Export the tree model for display
export_graphviz(
    tree_reg,
    out_file="regression_tree.dot",
    feature_names=["x"],
    rounded=True,
    filled=True
)

Next, use the dot command in the graphviz package to get the following decision tree picture:

regression_tree

Note: In Sklearn, the decision tree generated by default is a binary tree (CART).




7. Experiment: Explore the effect of the depth of the decision tree on its fitting ability


The following experiment compares the fitting ability of two decision trees that differ only in depth, keeping everything else fixed.

Set up two trees that differ only in depth:

# Build two decision trees with different depths
tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)

# Train the regressors
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

Define related functions:

# Define a function that computes the regression predictions and plots them
def plot_regression_predictions(tree_reg, X, y, axes=[0,1,-0.2,1], ylabel="$y$"):
    # Build the test coordinates (one-dimensional here)
    # The -1 in reshape() lets numpy infer that dimension, turning shape (500,) into (500,1)
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    
    # Get the predictions
    y_pred = tree_reg.predict(x1)
    
    # Plot
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

View the differences:

# Plot the regression fits of the two regressors
plt.figure(figsize=(11,4))

plt.subplot(121)
plot_regression_predictions(tree_reg1, X, y)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, "Depth=0", fontsize=15)
plt.text(0.01, 0.2, "Depth=1", fontsize=13)
plt.text(0.65, 0.8, "Depth=1", fontsize=13)
plt.legend(loc="upper center", fontsize=18)
plt.title("max_depth=2", fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2, X, y, ylabel=None)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
plt.text(0.3, 0.5, "Depth=2", fontsize=13)
plt.title("max_depth=3", fontsize=14)

plt.show()

The resulting output is as follows:

[Figure: regression fits and split lines, left: max_depth=2, right: max_depth=3]

It can be seen from the figure above that the larger max_depth is, the finer the splits of the resulting tree and the closer its fit to the training data.

Next, remove the depth limit entirely and compare it with a tree whose leaf size is restricted, to see what happens:

# No restrictions at all
tree_reg1 = DecisionTreeRegressor(random_state=42)

# Require at least 10 samples per leaf
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)

# Train the regressors
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

# Set up the coordinates (used as test data)
x1 = np.linspace(0, 1, 500).reshape(-1, 1)

# Get the models' predictions
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

Plot the results:

# Plot the regression fits of the two regressors
plt.figure(figsize=(11,4))

plt.subplot(121)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred1, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("No restrictions", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred2, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)

plt.show()

The resulting output is as follows:

[Figure: regression fits, left: no restrictions, right: min_samples_leaf=10]

Because the left plot has no restrictions, it overfits severely (in order to pass through every point, the model becomes very complicated); the right plot, thanks to the restriction on the minimum number of samples per leaf, avoids that overfitting risk.


END


Origin: blog.csdn.net/the_ZED/article/details/130089515