Machine Learning (17) Hands-On: The Basic Process of Implementing CART Trees in Sklearn

The full text runs to more than 8,000 words, with an expected reading time of about 16 to 27 minutes | Packed with practical content (code included), bookmarking is recommended!


Click here to download the code.

1. Introduction

CART (Classification and Regression Trees) is an important machine learning algorithm that can be used for both classification and regression problems, so it has high flexibility in practical applications. However, to understand and use CART trees proficiently, some key concepts and skills need to be mastered.

For the theoretical part of the CART tree, please see here:

Machine Learning (16): Decision Tree

This article explains in detail how to implement the CART regression tree and classification tree in Sklearn, including how to import data, preprocess data, build models, train models, make predictions, and evaluate models.

2. Detailed explanation of the CART classification tree estimator's parameters

The CART classification tree is a binary tree: each node either splits into exactly two child nodes or has no children (i.e., it is a leaf node). This differs from other decision tree algorithms (such as ID3 and C4.5), which may split a node into more than two children. In the CART classification tree, each internal node corresponds to one input feature and one split point, and samples are routed to the left or right child depending on whether the feature value exceeds the split point. Each leaf node corresponds to a predicted class, namely the most common class among the training samples it contains.

The process of constructing the CART classification tree is a recursive process.

First, among all possible features and all possible split points, the algorithm selects the feature and split point that give the divided subsets the highest purity (i.e., the largest reduction in impurity). It then uses that feature and split point to divide the data into two subsets, corresponding to the left and right child nodes. The algorithm repeats this process on each of the two child nodes until some stopping condition is met, such as the tree reaching a maximum depth or the number of samples in a node falling below a threshold.

The DecisionTreeClassifier estimator has many parameters, most of which relate to the structure of the decision tree model. As before, they can be inspected in Sklearn like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier

# In a Jupyter/IPython environment, appending ? prints the estimator's docstring
DecisionTreeClassifier?

The output shows the estimator's docstring, including its constructor signature and the full parameter list, which can then be examined in detail.
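For a quick programmatic view, the parameters and their defaults can also be listed with get_params(); this is a minimal sketch, and the exact parameter set varies with the installed sklearn version:

from sklearn.tree import DecisionTreeClassifier

# Print each parameter name and its default value for the installed sklearn version
for name, value in DecisionTreeClassifier().get_params().items():
    print(f'{name} = {value}')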

These parameters can be understood in three groups.

2.1 Model Evaluation Parameters

  • criterion: impurity measure

In sklearn, the tree model is a CART tree by default, and the default impurity measure of the CART tree is "gini" (the Gini coefficient), although information entropy can also be used to measure impurity.

In most cases, the choice of indicator will not substantially affect the structure of the tree model, but compared with information entropy the Gini coefficient is computationally simpler and faster, so the Gini coefficient is generally recommended. If a difference between the two must be named, it is generally believed that in some cases Gini impurity tends to isolate the majority class in the dataset into its own branch, while information entropy tends to produce a more balanced tree.
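To make the two indicators concrete, here is a minimal sketch (not from the original article) that computes both impurity measures by hand, using the small label array that appears later in this article:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Information entropy: -sum(p * log2(p))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 0, 1, 0, 1, 1, 0])
print(gini(labels), entropy(labels))   # roughly 0.469 and 0.954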

  • ccp_alpha: structural risk weight

CCP is short for Cost-Complexity Pruning. It was added in sklearn version 0.22 and is the only parameter that implements the pruning procedure described in the original CART formulation.

This parameter is optional. Pruning with the ccp term is also called minimal cost-complexity pruning. Its principle is to add a structural risk term to the loss function of the decision tree, similar to the regularization term in the loss function of a linear model.

For example, let $T$ be a decision tree and $R(T)$ the overall impurity of the tree on the training set, which represents the empirical risk of the model. Let $\alpha|\widetilde{T}|$ represent the structural risk of the model, where $\alpha$ is a parameter and $|\widetilde{T}|$ is the number of leaf nodes of the tree. The loss function of the model is then:

$$R_\alpha(T) = R(T) + \alpha|\widetilde{T}| \tag{1}$$

where $R_\alpha(T)$ is the loss function after adding the structural risk term and $\alpha$ is the coefficient of that term. It follows that the larger the value of $\alpha$, the heavier the penalty on the model's structural risk, the simpler the resulting tree, and the stronger the suppression of overfitting.
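As an illustration of the effect of ccp_alpha (a sketch using the iris dataset, which is not part of this article's data), sklearn exposes the pruning path directly, and increasing ccp_alpha shrinks the tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cost_complexity_pruning_path returns the effective alpha values along the
# minimal cost-complexity pruning path, together with the corresponding impurities
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# A larger ccp_alpha means a heavier penalty on the number of leaves, hence a smaller tree
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f'ccp_alpha={alpha:.4f}, leaves={clf.get_n_leaves()}')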

2.2 Tree Structure Control Parameters

Parameters that control the structure of the tree model make up the largest group.

  • max_depth, max_leaf_nodes : limit the overall structure of the model

  • min_samples_split, min_samples_leaf : limit tree growth from the perspective of the number of node samples

  • min_impurity_split and min_impurity_decrease : parameters that limit tree growth from the perspective of the reduction in the loss value (note that min_impurity_split has been deprecated and removed in newer sklearn versions in favor of min_impurity_decrease)

Acting together, these parameters restrict the growth of the tree from several angles. For a tree model, too many leaf nodes, too few samples in a single leaf node, or only a tiny drop in the Gini coefficient from further splitting an internal node are all possible signs of overfitting, and deserve special attention when modeling.
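A minimal sketch of how these structure-control parameters might be combined; the specific values are purely illustrative, not recommendations:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                 # cap the overall depth of the tree
    max_leaf_nodes=8,            # cap the total number of leaf nodes
    min_samples_split=10,        # a node needs at least 10 samples before it can be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # a split must reduce the weighted impurity by at least 0.01
)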

2.3 Iterative stochastic process control parameters

  • splitter

When this parameter is set to "random", the split rule used to divide the current dataset is chosen at random rather than by exhaustively searching for the best split.

  • max_features

This parameter limits how many features at most are brought in when mining candidate split rules. As long as it is set to fewer than all the features, it effectively draws a random subset of candidate features, which adds a certain amount of randomness to the training of the tree model.

These two parameters can speed up model training: selecting the best split rule from only a few features, or generating a split rule at random and using it without comparison, greatly reduces the amount of computation. However, this trades accuracy for efficiency, and such settings will generally lower the accuracy of the model's results.
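A minimal sketch of how these two parameters might be set (the values are illustrative):

from sklearn.tree import DecisionTreeClassifier

# splitter='random' draws the split threshold of each candidate feature at random,
# and max_features='sqrt' only considers a random subset of features at each node;
# random_state fixes the randomness so the results are reproducible
clf = DecisionTreeClassifier(splitter='random', max_features='sqrt', random_state=42)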

2.4 Sklearn call example

Go directly to the code:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

def train_and_plot_decision_tree(X, y):
    # Instantiate the decision tree estimator and train it
    clf = DecisionTreeClassifier().fit(X, y)

    # Print the model's accuracy on the training set
    score = clf.score(X, y)
    print(f'The accuracy score of the model on the training set: {score}')

    # Plot the decision tree
    plt.figure(figsize=(6, 2), dpi=150)
    plot_tree(clf)
    plt.show()

# Usage
X = np.array([[1, 1], [2, 2], [2, 1], [1, 2], [1, 1], [1, 2], [1, 2], [2, 1]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 0])

train_and_plot_decision_tree(X, y)

The result is as follows:

(Figure: the fitted classification tree visualized with plot_tree)

According to the output, the classification tree in sklearn first splits the dataset on the values of the second feature and then on the values of the first feature, finally forming a two-level decision tree with three leaf nodes.
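Once trained, the classifier predicts in the usual way. As a small sketch reusing X and y from above (with the estimator refit outside the plotting function):

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(np.array([[1, 1], [1, 2], [2, 2]])))   # expected: [0 1 0] for this training data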

3. CART regression tree modeling process

3.1 Basic modeling process of CART regression tree

The following experiment simulates the modeling process of the CART regression tree.

Step 1: Suppose you have the following dataset, which contains only one feature and one continuous label:

data = np.array([[1, 1], [2, 3], [3, 3], [4, 6], [5, 6]])
plt.scatter(data[:, 0], data[:, 1])

(Figure: scatter plot of the dataset)

The distribution of the data is shown above: the x-axis represents the feature of the dataset and the y-axis represents the label.

Step 2: Generate candidate rules

The procedure of the CART regression tree is basically the same as that of the classification tree: feature by feature, the midpoints between adjacent distinct values are taken as candidate split points. For the dataset above, since there is only one feature with 5 distinct values, there are 4 candidate split points.

y_range = np.arange(1, 6, 0.1)

def plot_scatter_and_line(ax, data, line_position):
    ax.scatter(data[:, 0], data[:, 1])
    ax.plot(np.full_like(y_range, line_position), y_range, 'r--')

line_positions = [1.5, 2.5, 3.5, 4.5]
fig, axes = plt.subplots(2, 2)

for ax, line_position in zip(axes.flatten(), line_positions):
    plot_scatter_and_line(ax, data, line_position)

(Figure: the four candidate split points at x = 1.5, 2.5, 3.5 and 4.5)

Step 3: Select the optimal segmentation rule

After determining the alternative division rules, the next step is to find the best division method according to some evaluation criteria.

This step is where the regression tree differs most from the classification tree. The classification tree uses the drop in the Gini coefficient or information entropy of the labels after the split to select the best division, whereas the regression tree uses the drop in the MSE of the sub-datasets after the split. The overall MSE of the sub-datasets is computed in a way similar to the CART classification tree: first compute the MSE of each subset individually, then combine them into an overall MSE by a weighted sum.

Computing the MSE is not complicated, but it has one prerequisite: a predicted value must be given first, and the MSE is then computed from the predicted values and the true values.

After the CART regression tree divides the data into sub-datasets, it assigns one predicted value to each sub-dataset (note: one predicted value for all the data in a sub-dataset, not one prediction per sample). This predicted value is chosen to minimize the MSE of the corresponding sub-dataset, and the optimal predicted value of each sub-dataset is the mean of the true labels in that sub-dataset.

For example, for the first case of dividing the data set above, the predicted value and MSE calculation results of each sub-data set are as follows:

plt.scatter(data[:, 0], data[:, 1])
plt.plot(np.full_like(y_range, 1.5), y_range, 'r--')

(Figure: the dataset with the candidate split point at x = 1.5)

The corresponding division is expressed in the following form:

(Figure: dataset A divided at x = 1.5 into sub-datasets B1 and B2)

Now compute the MSE of the sub-datasets B1 and B2. First compute the predicted value of each, i.e. the mean label of each sub-dataset:

# Predicted value of the B1 dataset (the mean of its labels; B1 contains only the first sample)
y_1 = np.mean(data[:1, 1])
y_1

# Predicted value of the B2 dataset (the mean of its labels)
y_2 = np.mean(data[1:, 1])
y_2

# Plot the model's predictions on each side of the split
plt.scatter(data[:, 0], data[:, 1])
plt.plot(np.full_like(y_range, 1.5), y_range, 'r--')
plt.plot(np.arange(0, 1.5, 0.1), np.full_like(np.arange(0, 1.5, 0.1), y_1), 'r-')
plt.plot(np.arange(1.7, 5.1, 0.1), np.full_like(np.arange(1.7, 5.1, 0.1), y_2), 'r-')

(Figure: the predicted values y_1 and y_2 on each side of the split at x = 1.5)

Then calculate the final score of this split rule, using MSE as the evaluation indicator:

def calculate_mse_reduction(data, split_position):
    # This illustration hard-codes the split at split_position = 1.5,
    # so B1 contains only the first sample and B2 the remaining four

    # Predicted value of the B1 dataset (its mean label)
    y_1 = np.mean(data[:1, 1])

    # Predicted value of the B2 dataset (its mean label)
    y_2 = np.mean(data[1:, 1])

    # MSE of B1: since B1 contains a single sample, its MSE is 0
    mse_b1 = 0

    # MSE of B2
    mse_b2 = np.power(data[1:, 1] - y_2, 2).mean()

    # Weighted MSE of B1 and B2 combined
    mse_b = 1/5 * mse_b1 + 4/5 * mse_b2

    # MSE of the parent node
    mse_a = np.power(data[:, 1] - data[:, 1].mean(), 2).mean()

    # Reduction in MSE achieved by this split
    mse_reduction = mse_a - mse_b

    return mse_reduction


mse_reduction = calculate_mse_reduction(data, 1.5)
print(f"MSE reduction: {mse_reduction}")

The result is MSE reduction: 1.9599999999999993; that is, splitting the data this way reduces the MSE by about 1.96.

Following the same process, calculate the scores of the other candidate splits:

# MSE of the parent dataset A, needed to measure the reduction from each split
mse_a = np.power(data[:, 1] - data[:, 1].mean(), 2).mean()

impurity_decrease = []

for i in range(4):
    # Candidate split point: the midpoint between two adjacent feature values
    splitting_point = data[i: i+2, 0].mean()

    # Split the dataset into the two sub-datasets
    data_b1 = data[data[:, 0] <= splitting_point]
    data_b2 = data[data[:, 0] > splitting_point]

    # MSE of each sub-dataset
    mse_b1 = np.power(data_b1[:, 1] - data_b1[:, 1].mean(), 2).sum() / data_b1[:, 1].size
    mse_b2 = np.power(data_b2[:, 1] - data_b2[:, 1].mean(), 2).sum() / data_b2[:, 1].size

    # Weighted overall MSE of the two sub-datasets
    mse_b = data_b1[:, 1].size/data[:, 1].size * mse_b1 + data_b2[:, 1].size/data[:, 1].size * mse_b2

    # Reduction in MSE under the current split
    impurity_decrease.append(mse_a - mse_b)

impurity_decrease

The results are [1.9599999999999993, 2.1599999999999993, 3.226666666666666, 1.2099999999999999]; that is, the third split (at the split point 3.5) reduces the MSE the most, so the first split of the tree model is as follows:

(Figure: the first split of the regression tree, at x = 3.5)

Step 4: Perform multiple iterations

Next, B1 and B2 are considered for further division. The MSE of B2 is already 0, so it needs no further splitting, while the MSE of B1 is about 0.89, so it can be split further.

The procedure for splitting B1 is the same as that for dataset A: find the split that reduces the MSE of the subsets the most, giving the final tree shown below:

(Figure: the final regression tree after B1 is split again)

Step 5: The build is complete

Once the model is built, the prediction process of a regression tree is very similar to that of a classification tree: new data is routed through the split rules to a region of the sample space, and the model's prediction for that region is the prediction for the data.
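The tree built in this experiment can be restated as a small hand-written prediction function. This is a sketch: the second split point of 1.5 is what the same MSE criterion gives when applied to B1, and the leaf values are the sub-dataset means:

def predict(x):
    # Route a single feature value through the split rules found above
    if x <= 3.5:
        if x <= 1.5:
            return 1.0   # mean label of {(1, 1)}
        return 3.0       # mean label of {(2, 3), (3, 3)}
    return 6.0           # mean label of {(4, 6), (5, 6)}

print([predict(x) for x in [1, 2, 3, 4, 5]])   # [1.0, 3.0, 3.0, 6.0, 6.0]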

At this point, the construction of the CART regression tree in this experiment is complete. The construction processes of the regression tree and the classification tree are roughly the same, with essentially the same iterative procedure; they can be regarded as two implementations of the same modeling idea.

3.2 Different values of the criterion parameter

Although the CART tree can handle both classification and regression problems, the two types of problems differ somewhat in nature, so the CART modeling process differs slightly depending on the type of problem, and the corresponding estimators in sklearn differ as well.

On its own, the regression tree is a model for solving regression problems, but it is also the base learner used to build gradient boosting trees (GBDT): whether GBDT is solving a regression problem or a classification problem, the CART regression tree is its only base learner. The relevant methods of the CART regression tree therefore need to be mastered, laying the foundation for the later study of ensemble algorithms.

The CART regression tree is implemented in Sklearn as DecisionTreeRegressor, and it can be inspected in the same way:

from sklearn.tree import DecisionTreeRegressor

DecisionTreeRegressor?

(Output: the DecisionTreeRegressor docstring and parameter list)

Most of the parameters are consistent with DecisionTreeClassifier; the difference lies in the criterion parameter, the index used to select among candidate split rules. For the CART classification tree the default is the Gini coefficient, with information entropy as an option; for the CART regression tree the default is mse, with mae, poisson and friedman_mse also available (in newer sklearn versions, mse and mae have been renamed squared_error and absolute_error). The different values behave as follows:

  • When criterion='mse':

In decision tree regression, setting the evaluation index criterion to mse uses the Mean Squared Error (MSE) as the evaluation standard for node splitting. It is computed as the sum of squared errors divided by the total number of samples; the mathematical expression is as follows:
$$MSE = \frac{1}{m}\sum^m_{i=1}(y_i-\hat y_i)^2 \tag{2}$$
In this case, the decision tree will try to minimize the MSE of each subset. For each leaf node of the decision tree, the predicted value is the average of the target values of all training samples contained in that node, because this choice minimizes the sum of squared errors between the predicted value and each actual value, i.e. it minimizes the MSE.

This is also the default evaluation index of the regression decision tree, because it has good mathematical properties that keep the calculation simple (note, however, that MSE is more sensitive to outliers than MAE).

  • When criterion='mae':

When the evaluation index criterion is set to mae, the Mean Absolute Error (MAE) is used as the evaluation standard. Unlike MSE, MAE is the sum of the absolute differences between the predicted values and the true values divided by the total number of samples; the mathematical expression is as follows:
$$MAE = \frac{1}{m}\sum^m_{i=1}|y_i-\hat y_i| \tag{3}$$
In this case, the decision tree will try to minimize the MAE of each subset when performing each split. The predicted value that minimizes MAE is not the mean, but the median. This is because, for any set of numbers, choosing the median as the predicted value can minimize the sum of the absolute errors from the predicted value to each actual value.

In general, MSE is based on the Euclidean distance between the predicted and actual values, while MAE is based on the Manhattan (city-block) distance between the two; MSE is therefore often called the L2 loss and MAE the L1 loss. When criterion is mae, in order to minimize the MAE within each subset at each split, the model's predicted value for each subset is no longer the mean but the median.

The criterion of the CART regression tree is not only the evaluation standard for selecting the split rule, but also the factor that determines the predicted value of each sub-dataset. In other words, for the regression tree the value of criterion determines two things: how the loss value is calculated, and how the predicted value of each sub-dataset is calculated, the predicted value being whatever minimizes the criterion. If criterion=mse, the predicted value must minimize the MSE on the current data, so the mean of the sub-dataset's labels is used; if criterion=mae, the predicted value must minimize the MAE, so the median of the sub-dataset's labels is used. The sketch below verifies this numerically.
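The following sketch (using an arbitrary made-up label array) scans candidate predicted values and confirms that the mean minimizes MSE while the median minimizes MAE:

import numpy as np

y = np.array([1, 2, 3, 10, 20])           # labels of a hypothetical sub-dataset
candidates = np.linspace(0, 20, 2001)     # candidate constant predicted values

mse = [(np.mean((y - c) ** 2), c) for c in candidates]
mae = [(np.mean(np.abs(y - c)), c) for c in candidates]

print(min(mse)[1], np.mean(y))     # best constant under MSE is about 7.2, the mean
print(min(mae)[1], np.median(y))   # best constant under MAE is 3.0, the median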

  • When criterion='friedman_mse':

friedman_mse is an improved indicator based on mse, a residual calculation method designed by Friedman, the proposer of GBDT (gradient boosting decision trees, an ensemble algorithm). It is the default criterion for the trees inside sklearn's gradient boosting estimators, and it is generally not recommended for a standalone decision tree model.

3.3 Sklearn call example

Look at the code:

from sklearn.tree import DecisionTreeRegressor, plot_tree

data = np.array([[1, 1], [2, 3], [3, 3], [4, 6], [5, 6]])

clf = DecisionTreeRegressor().fit(data[:, 0].reshape(-1, 1), data[:, 1])

# The result can likewise be visualized with plot_tree
plt.figure(figsize=(6, 2), dpi=150)
plot_tree(clf)

The result is as follows:


(Figure: the fitted regression tree visualized with plot_tree)

4. Summary

This article introduced the basic process of implementing the CART classification tree and regression tree with the Sklearn library. The parameters of the CART classification tree were explained in detail, covering model evaluation parameters, tree structure control parameters and iterative stochastic process control parameters, together with an example of calling the CART classification tree in Sklearn. The modeling process of the CART regression tree and the different values of the criterion parameter were then explained, again with a Sklearn calling example.

Finally, thank you for reading this article! If you feel that you have gained something, don't forget to like, bookmark and follow me, this is the motivation for my continuous creation. If you have any questions or suggestions, you can leave a message in the comment area, I will try my best to answer and accept your feedback. If there's a particular topic you'd like to know about, please let me know and I'd be happy to write an article about it. Thank you for your support and look forward to growing up with you!


Origin blog.csdn.net/Lvbaby_/article/details/131767830