[Machine Learning] Practical Cases of Decision Trees

Foreword

In the previous section, we briefly introduced the basic principles and use of decision trees. Now we will introduce decision trees in detail through practical cases.

1. Building and visualizing a regression decision tree

This time we use the California housing dataset to build a regression tree.

  • Import packages and prepare the dataset
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
#%%
from sklearn.datasets import fetch_california_housing  # the public import path

housing = fetch_california_housing(as_frame=True)

  • Check the data.
The dataset contains 20,640 samples and 8 features. We use all of the features for the regression task.
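A quick sketch to confirm the shape and the feature names (assuming the objects from the blocks above; the exact preview in the original screenshot may differ):

print(housing.data.shape)      # expected: (20640, 8)
print(housing.feature_names)   # the 8 feature names, e.g. MedInc, HouseAge, ...
print(housing.data.head())     # first rows of the feature DataFrame (as_frame=True)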

from sklearn import tree

dtr=tree.DecisionTreeRegressor(max_depth=2)
dtr.fit(housing.data,housing.target)

To draw the decision tree with Graphviz you would need to install graphviz (tutorials are easy to find online); for simplicity, we use the plotting utility built into sklearn instead.

from sklearn.tree import plot_tree
plot_tree(dtr,filled=True)
plt.show()

(Figure: the depth-2 regression tree drawn by plot_tree.)
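If you do want to go the Graphviz route mentioned above, a minimal sketch (assuming the graphviz Python package and the Graphviz binaries are installed) could look like this:

from sklearn.tree import export_graphviz
import graphviz  # assumes the graphviz Python package is installed

# Export the fitted tree to DOT source and render it to a PNG file
dot_data = export_graphviz(
    dtr,
    out_file=None,                      # return the DOT source as a string
    feature_names=housing.feature_names,
    filled=True,
    rounded=True,
)
graphviz.Source(dot_data).render('housing_tree', format='png', cleanup=True)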

2. Classifier construction

Here we again use the iris dataset, which has 4 features. We pair the four features into 6 different combinations to see which pair of features builds the best classifier.

  • Import Data
from sklearn.datasets import load_iris

iris=load_iris()
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.style as ms
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use('TkAgg')
import numpy as np

DecisionBoundaryDisplay is a class in the sklearn.inspection module of scikit-learn. It is used to visualize the decision boundary of a machine learning model.
The class provides convenience methods for drawing a classifier's decision boundary alongside the data points: given a fitted classifier, it displays the classifier's predictions over the region covered by the dataset.

  • Model training and drawing
n_classes = 3  # the iris dataset has three classes
ms.use('seaborn-bright')
for index, pair in enumerate([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]):
    X = iris.data[:, pair]  # keep only the two features in this pair
    y=iris.target

    clf=DecisionTreeClassifier().fit(X,y)

    ax=plt.subplot(2,3,index+1)
    plt.tight_layout(h_pad=0.5,w_pad=0.5,pad=2.0)


    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        response_method='predict',
        ax=ax,

        xlabel=iris.feature_names[pair[0]],
        ylabel=iris.feature_names[pair[1]],
    )
    for i, color in zip(range(n_classes), 'rbg'):

        idx=np.where(y==i)
        plt.scatter(
            X[idx,0],
            X[idx,1],
            c=color,
             label=iris.target_names[i],
            edgecolors='black',
            s=12
        )

plt.suptitle("Decision surface of decision trees trained on pairs of features")
plt.legend()
    
plt.show()

(Figure: decision surfaces of the trees trained on each pair of features, with the training points overlaid.)
It can be seen that, in general, the classes are easiest to separate when petal length and petal width are used as features.
Let's take these two features and build a decision tree with them.
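To put a rough number on that impression, a minimal sketch that cross-validates a shallow tree on each feature pair (the depth and the 5-fold setting are arbitrary choices):

from sklearn.model_selection import cross_val_score

# Mean cross-validated accuracy of a depth-2 tree for every pair of features
for pair in [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=2), iris.data[:, pair], iris.target, cv=5)
    print(pair, scores.mean())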

#%%
from sklearn.tree import plot_tree


clf=DecisionTreeClassifier(max_depth=2).fit(iris.data[:,2:4],iris.target)
plot_tree(clf,filled=True, class_names=iris.target_names,)
plt.title("Decision tree trained on all the iris features")
plt.show()

(Figure: the depth-2 classification tree drawn by plot_tree.) It can be seen that the splits are chosen by Gini impurity, and a maximum depth of 2 is already enough to separate the classes here. Of course, this is the case because there are only a few features; with more features, the depth of the tree will generally grow as well.
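If you prefer a text view of the same splits, a small sketch using export_text (available in recent sklearn versions):

from sklearn.tree import export_text

# Print the fitted tree as nested if/else rules, using the two petal feature names
print(export_text(clf, feature_names=iris.feature_names[2:4]))
print(clf.feature_importances_)  # relative importance of petal length vs. petal width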

  • Probability estimation
    For a sample with feature values [5, 3], the model predicts the third class.
    With predict_proba we can also see the probability of [5, 3] belonging to each class; predict returns the class with the highest probability.
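A minimal sketch of both calls (assuming clf is the tree fitted on the two petal features above; [5, 3] means petal length 5 and petal width 3):

# Predicted class for a single sample
print(clf.predict([[5, 3]]))        # the third class, as noted above

# Class-membership probabilities for the same sample
print(clf.predict_proba([[5, 3]]))  # one probability per class; predict picks the argmax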

3. Sensitivity of decision trees to the orientation of the data

  • Build the dataset
np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32) * 2
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)
Xsr

Xs is a two-dimensional array of 100 samples with two features each, drawn from a uniform distribution and shifted by 0.5 so that the range becomes [-0.5, 0.5]. ys is a binary label derived from the first feature of Xs: when the first feature is greater than 0 the label is 2, otherwise it is 0. We then define a rotation angle angle, convert it to the rotation matrix rotation_matrix, and multiply Xs by it with dot() to obtain a new matrix Xsr, i.e. the rotated version of Xs.
In short, this code generates a two-dimensional dataset Xs and a rotated copy Xsr for the comparison that follows.
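As a quick sanity check, the rotation matrix is orthogonal, so rotating Xsr back with its transpose should recover Xs, while the labels ys are untouched:

# The rotation only changes the orientation of the points, not the labels
print(np.allclose(Xsr.dot(rotation_matrix.T), Xs))  # expected: True
print(np.degrees(angle))                            # the rotation is 45 degrees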

  • Compare

def plot_show(clf,X,ax):

    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        response_method='predict',
        ax=ax,
    )

    plt.scatter(X[:,0],X[:,1],c=ys,edgecolors='black')

from sklearn.inspection import DecisionBoundaryDisplay
tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)

plt.figure(figsize=(11, 4))
ax=plt.subplot(121)
plot_show(tree_clf_s,Xs,ax)


plt.title('Original data')

ax=plt.subplot(122)
plot_show(tree_clf_sr,Xsr,ax)

plt.title('Rotated by 45°')

plt.show()

(Figure: decision boundaries learned on the original data vs. the rotated data.) We can see that the decision tree is very sensitive to rotation of the data: even a modest rotation can change the learned boundary and degrade the model. This is because a decision tree partitions the feature space into rectangular regions with axis-aligned splits, and rotating the data changes the shape and position of the regions the tree can carve out, which affects its performance. In practice, we should either keep the data in an orientation that suits axis-aligned splits or use a model that is less affected by the orientation of the data, such as a support vector machine (SVM).
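One common mitigation is to rotate the data onto its principal axes with PCA before fitting the tree; a minimal sketch using a pipeline:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Rotate the features onto their principal axes, then fit the tree on the rotated data
pca_tree = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(random_state=42))
pca_tree.fit(Xsr, ys)
print(pca_tree.score(Xsr, ys))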

4. Regression Model Construction

  • Dataset construction
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 2 * (0.5 - np.random.rand(16))
plt.scatter(X,y,c='r',edgecolor="black")

plt.show()

We added some noise to every fifth point of the data; the scatter plot above shows the resulting dataset.
Next we test the effect of different tree depths on model performance, comparing depths from 2 to 10.

max_depths = [i for i in range(2, 11)]
plt.figure(figsize=(10, 8))
for depth in max_depths:
    regr = DecisionTreeRegressor(max_depth=depth)
    regr.fit(X, y)
    x_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
    y_predict = regr.predict(x_test)
    plt.subplot(331 + depth - 2)
    plt.scatter(X, y, s=15, edgecolors='black', c='r', label='data')
    plt.plot(x_test, y_predict, color="cornflowerblue",
             label=f"max_depth={depth}")
    plt.title(f"max_depth={depth}")
    plt.axis('off')
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.0)
# plt.legend()
plt.show()

(Figure: the fitted curves for max_depth = 2 through 10.) From the figure we can see that the deeper the tree, the more it learns the noise, so the model becomes worse at capturing the true underlying pattern.
In regression problems, the depth of the decision tree typically has the following effects:

Overfitting: when the tree is very deep, it becomes very complex and easily overfits the training data, so the model performs poorly on new data. This is because the model learns the noise in the training data rather than capturing its true pattern.

Bias: when the tree is too shallow, the model may fail to capture the non-linear relationship between the input features and the output, leading to high bias and poor performance on both the training and test sets.

Computational cost: as the depth of the tree increases, so do its size and the computational cost. Training and testing take longer and require more resources.
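One simple way to see this trade-off numerically is to hold out part of the data and compare training and test scores across depths; a minimal sketch (the split ratio and seed are arbitrary choices):

from sklearn.model_selection import train_test_split

# Compare train vs. test R^2 as the tree gets deeper
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for depth in range(2, 11):
    regr = DecisionTreeRegressor(max_depth=depth, random_state=42)
    regr.fit(X_train, y_train)
    print(depth, regr.score(X_train, y_train), regr.score(X_test, y_test))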
  • Not only the depth of the tree affects our regression model; the minimum number of samples per leaf node can also have an impact.
    We again construct a regression dataset:
np.random.seed(42)
m=200
X=np.random.rand(m,1)
y = 3*(X-0.4)**2
y = y + np.random.randn(m,1)/10

We train two models: one with the default settings (no min_samples_leaf restriction) and one with min_samples_leaf=10, and observe the difference.

tree_reg1 = DecisionTreeRegressor(random_state=42)
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

Then we construct new input points and predict on them:

x1 = np.linspace(0, 1, 600).reshape(-1, 1)
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

plt.figure(figsize=(10, 4))

plt.subplot(121)
plt.plot(X, y, "r.")
plt.plot(x1, y_pred1, "g.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("No restrictions", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "r.")
plt.plot(x1, y_pred2, "b.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)

plt.show()

(Figure: predictions with no restrictions vs. with min_samples_leaf=10.) It can be seen that when we do not set min_samples_leaf, the minimum number of samples per leaf, the model overfits the training data.

In the decision tree algorithm, the min_samples_leaf parameter controls the minimum number of samples allowed on a leaf node. The smaller its value, the easier it is for the tree to overfit, because nodes become more finely subdivided and more likely to capture the noise in the training data. Conversely, a large value pushes the tree toward a simpler model, but some detail is lost.
Specifically, when min_samples_leaf is small, each leaf contains fewer samples, which may cause unimportant splits to be selected and lead to overfitting. When min_samples_leaf is larger, the model tends to be simpler and the risk of overfitting decreases, but if it is set too large the model may underfit, reducing its accuracy and generalization ability.
Therefore, adjusting min_samples_leaf controls the complexity of the decision tree and affects the accuracy and generalization ability of the model. In practice, methods such as cross-validation can be used to determine a good value for min_samples_leaf.
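A minimal sketch of that cross-validation step with GridSearchCV (the candidate values below are just illustrative):

from sklearn.model_selection import GridSearchCV

# Search over a few candidate leaf sizes with 5-fold cross-validation
param_grid = {'min_samples_leaf': [1, 5, 10, 20, 50]}
search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5)
search.fit(X, y.ravel())
print(search.best_params_, search.best_score_)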

Summary

This article briefly worked through decision-tree cases for classification and regression. We should pay attention to the following issues:

  • Data preprocessing: Decision trees cope reasonably well with missing values, but outliers and noisy data will hurt the model's predictive ability. Therefore, before training the model, the data should be cleaned and preprocessed to remove outliers and noise and improve accuracy.
  • Feature Selection: Feature selection is an important step in building a decision tree model. During feature selection, features that are highly correlated with the target variable should be selected to improve the accuracy of the model.
  • Model evaluation: To evaluate the quality of the model, you can use cross-validation methods, such as k-fold cross-validation, to ensure the generalization ability of the model.
  • Parameter setting: There are many parameters to set in a decision tree model, such as the depth of the tree and the minimum number of samples required to split a node. In practice, these parameters need to be tuned to the characteristics of the dataset to improve the accuracy and generalization ability of the model.
    In terms of parameter setting, common parameters include (see the sketch after this list):
    criterion: the split-quality criterion; Gini impurity and information entropy are the common choices.
    max_depth: the maximum depth of the tree, which controls the complexity of the decision tree.
    min_samples_split: the minimum number of samples required to split a node; a node with fewer samples will not be split.
    min_samples_leaf: the minimum number of samples required in a leaf; a split is only accepted if it leaves at least this many samples in each child.
    max_leaf_nodes: the maximum number of leaf nodes, which controls the complexity of the decision tree.
    random_state: the random seed, which makes results consistent across runs.
  • Prevent overfitting: The decision tree model is prone to overfitting problems, so measures need to be taken to prevent overfitting, such as pruning, setting the minimum number of samples for leaf nodes, etc.
  • Visualization: Decision tree models are highly interpretable. In practice, visualizing the tree makes the model's decision process easier to understand.
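As referenced above, a minimal configuration sketch that sets the common parameters together (the values are illustrative examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Example of setting the common parameters listed above; values are illustrative only
clf = DecisionTreeClassifier(
    criterion='gini',        # or 'entropy'
    max_depth=4,
    min_samples_split=10,
    min_samples_leaf=5,
    max_leaf_nodes=20,
    random_state=42,
)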

For data preprocessing and model evaluation, please refer to previous articles.
Due to my limited ability, please correct me if there are any mistakes in the above.
I hope you will support me, and let's keep working hard together.

Origin blog.csdn.net/qq_61260911/article/details/130181282