Machine Learning (16): Decision Tree

The full article runs to more than 18,000 words and takes roughly 36~60 minutes to read | Packed with practical content, worth bookmarking!


1. Introduction

The tree model is currently one of the most important models in machine learning, and it is also the most commonly used base classifier in ensemble learning.

Different from algorithms such as linear regression and logistic regression, the tree model is not just a specific algorithm, but a model family covering a variety of algorithms.

The principle of the tree model is simple and easy to understand, with high calculation efficiency and strong discrimination ability. More importantly, it has excellent interpretability and can provide a clear and intuitive decision path. In addition, it can output important additional information, such as feature importance and optimal binning of continuous variables, further enhancing its practical value.

A decision tree is a type of tree model.

2. Using logistic regression to reproduce a decision tree

Let's start with an experiment: model the iris dataset with logistic regression. The code is as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load the data
iris_features, iris_target = load_iris(return_X_y=True)

# Split into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    iris_features, iris_target, random_state=24)

# Create the logistic regression instance (saga supports both l1 and l2 penalties)
logreg_model = LogisticRegression(max_iter=int(1e6), solver='saga')

# Define the parameter space for the grid search
logreg_param_grid = {'penalty': ['l1', 'l2'],
                     'C': [1, 0.5, 0.1, 0.05, 0.01]}

# Create the grid search estimator
grid_search_estimator = GridSearchCV(estimator=logreg_model,
                                     param_grid=logreg_param_grid)

# Fit on the training data
grid_search_estimator.fit(features_train, target_train)

# Print the best parameters
best_params = grid_search_estimator.best_params_
print("Best parameters: \n", best_params)

# Print the coefficients and intercept of the best model
best_estimator = grid_search_estimator.best_estimator_
best_estimator_coefficients = best_estimator.coef_
best_estimator_intercept = best_estimator.intercept_
print("Best estimator coefficients: \n", best_estimator_coefficients)
print("Best estimator intercept: \n", best_estimator_intercept)

Here comes the point! Let's analyze the output:

[Figure: grid search output, showing the best parameters, the 3x4 coefficient matrix and the three intercepts]

When logistic regression deals with multi-category problems, it generates an equation for each category, which is used to distinguish this category from other categories.

There are 3 categories in the iris data set, so there are 3 equations, and the corresponding coefficients are a 3x4 matrix.

For the first category (the category corresponding to the first row of coefficients), only the coefficient of the third feature is non-zero (-3.47343992); all other coefficients are 0. This means the model relies almost entirely on the third feature when separating the first class from the others. In other words, according to this model, whether a sample belongs to the first category can be judged by looking at the third feature alone.

Let's check whether this idea actually works, straight to the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the data
iris = load_iris(as_frame=True)

# Get the features and the target variable
features = np.array(iris.data)
target = np.array(iris.target)

# Merge classes 2 and 3 into a single class
target[50:] = 1

# Create a 1x3 grid of subplots
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Compare the 3rd feature with the 1st, 2nd and 4th features
for i, idx in enumerate([0, 1, 3]):
    ax[i].scatter(features[:, 2], features[:, idx], c=target)
    ax[i].set_title(f'3rd feature vs feature {idx + 1}')
    ax[i].set_xlabel('3rd feature')
    ax[i].set_ylabel(f'feature {idx + 1}')
    ax[i].plot([2.5] * 50, np.linspace(min(features[:, idx]), max(features[:, idx]), 50), 'r--')

plt.tight_layout()
plt.show()

Explain a key operation of the above code: why should categories 2 and 3 be classified into one category?

This is done to check whether the third feature is effective in distinguishing the first type ("setosa") of irises from the remaining two types ("versicolor" and "virginica") of irises. Therefore, the second and third categories of irises should be combined into one category, thus turning multi-category into two categories.

Look at the effect:

[Figure: scatter plots of the 3rd feature against the 1st, 2nd and 4th features, with the 2.5 threshold marked by a dashed line]

The conclusion: the third feature (the x-axis) does separate the first category (purple) from the other two (yellow points). In other words, judging from the classification result, a single classification rule is enough to distinguish the first type of iris from the other two.

Petal length (cm) <= 2.5 can be used as the classification condition: when it is satisfied the iris belongs to the first category, otherwise it belongs to the second or third category. The resulting split looks like this:

[Figure: the decision tree after the first split, petal length (cm) <= 2.5]

Next, for the not-yet-separated second and third types of iris, we look for another classification rule of the same kind to separate them effectively.

The idea is as follows: L1 regularization (also known as Lasso) can be used for feature selection. The regularization parameter C controls model complexity; a smaller C means stronger regularization, which drives some coefficients to zero. In this way we can select the feature that is most effective for separating the second and third categories (the most important feature). Then, based on an L1-regularized logistic regression in which only that feature's coefficient is non-zero, we find the decision boundary, and that decision boundary is the best place to split the class-2/class-3 sub-dataset on this single feature.

Go directly to the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def select_feature_and_train_model():
    # Load the data
    iris = load_iris(as_frame=True)
    X = np.array(iris.data)
    y = np.array(iris.target)

    # Extract the sub-dataset that still needs to be classified (classes 2 and 3)
    X_sub = X[y > 0]
    y_sub = y[y > 0]

    # Try a range of C values and record how the coefficients change
    C_values = np.linspace(1, 0.1, 100)
    coef_list = []
    for C in C_values:
        model = LogisticRegression(penalty='l1', C=C, max_iter=int(1e6), solver='saga')
        model.fit(X_sub, y_sub)
        coef_list.append(model.coef_.flatten())

    # Use GridSearchCV to find the best C value
    param_grid = {'C': np.linspace(0.1, 1, 100)}
    grid_search = GridSearchCV(LogisticRegression(penalty='l1', max_iter=int(1e6), solver='saga'),
                               param_grid, cv=5)
    grid_search.fit(X_sub[:, 2].reshape(-1, 1), y_sub)
    selected_C = grid_search.best_params_['C']

    # Train a model on the third feature only, using the selected C
    selected_model = LogisticRegression(penalty='l1', C=selected_C, max_iter=int(1e6), solver='saga')
    selected_model.fit(X_sub[:, 2].reshape(-1, 1), y_sub)

    # Print the model parameters and score
    print("Best C: ", selected_C)
    print("Model Coefficients: ", selected_model.coef_)
    print("Model Intercept: ", selected_model.intercept_)
    print("Model Score: ", selected_model.score(X_sub[:, 2].reshape(-1, 1), y_sub))

    # Compute the decision boundary and plot it
    decision_boundary = -selected_model.intercept_[0] / selected_model.coef_[0][0]
    plt.plot(X_sub[:, 2][y_sub == 1], X_sub[:, 3][y_sub == 1], 'ro')
    plt.plot(X_sub[:, 2][y_sub == 2], X_sub[:, 3][y_sub == 2], 'bo')
    plt.plot([decision_boundary] * 20, np.arange(0.5, 2.5, 0.1), 'r--')
    plt.show()

    return selected_model, coef_list

# Use the function
model, coef_list = select_feature_and_train_model()

Take a look at the results:

[Figure: model output and the scatter plot with the decision boundary near petal length (cm) = 4.879]

Although the accuracy is not 100%, this is still a very good classification rule: the condition is petal length (cm) <= 4.879. When the condition is satisfied the iris belongs to the second category; when it is not, it belongs to the third. The classification accuracy under this condition is about 93%, which gives the following figure:

[Figure: the decision tree after the second split, petal length (cm) <= 4.879]

This process is in fact the basic process of building a decision tree: two classification rules were mined with a regularized logistic regression model, and the two rules form a progressive relationship, with the second rule continuing the classification within the result of the first. In the end, the rules and the data subsets they divide form a tree, and the rules at different levels make up the layers of that tree.

In general, the core idea of ​​a decision tree is to mine effective classification rules and present them in tree form.

3. The basic concept of decision tree

3.1 The nature of decision trees

A decision tree is essentially a superposition of a series of classification rules. The essence of its construction is to mine effective classification rules and finally present them in the form of a tree.

3.2 Basic structure of decision tree

The basic structure of a decision tree is a tree structure, which is a directed acyclic graph as a whole.

From the perspective of the graph structure, different types of nodes can be defined with the help of edge direction. If an edge leads from node A to node B, that edge is an outgoing edge for A and an incoming edge for B, and A is the parent node of B. Accordingly, all nodes in a decision tree fall into the following categories:

  1. Root node: the topmost node of the tree. It has no incoming branch but two or more outgoing branches. Its feature is used for the initial split of the data.
  2. Internal node: each internal node has one incoming branch and two or more outgoing branches. Each internal node corresponds to a feature and splits the data further according to some rule (for example, whether the feature value is greater than a threshold).
  3. Leaf node: also called a terminal node, a leaf is a bottom node of the tree. It has an incoming branch but no outgoing branch. Each leaf node corresponds to a prediction result, that is, a class in classification problems or a value in regression problems.

Each path from the root node to a leaf node corresponds to a decision rule. For a given input, start at the root, follow the decision rules along a path, and finally arrive at a leaf; the prediction stored at that leaf is the model's prediction for the input.

In the process of repeatedly dividing the dataset, the original complete dataset corresponds to the root node of the decision tree, the sub-datasets produced by splitting the root node become the internal nodes, and the datasets left when the iteration stops are the leaf nodes of the decision tree.

Each dataset is ultimately carved out by a series of classification rules; equivalently, each node corresponds to a series of classification rules. For example, in the case in section 2, node E is the dataset for which both petal length (cm) <= 2.5 and petal length (cm) <= 4.879 are False.

3.3 The growth process of the decision tree

When building a tree model, the dataset is divided layer by layer. Every time a classification rule is set, the dataset is split according to whether the rule is satisfied, and the mining of subsequent rules is determined by the sub-datasets produced at the previous layer. Splitting the data layer by layer, finding a rule for each dataset and then dividing it again, is the growth process of the tree model.

The popular understanding is that the process of decision tree classification is like answering a series of "yes or no" questions until the most suitable category is found.

For a new data sample, start from the root node of the decision tree. At each step, check the sample against the node's decision rule (a threshold test on some feature), send it to the left child if the rule is satisfied or to the right child if not, and repeat the process at the new node until a leaf node is reached. The class label of that leaf is the prediction result.

In the case of the second section, if the petal length (cm) of an iris flower is <= 2.5, then it is predicted to be the first category. Otherwise, check whether the petal length (cm) is <= 4.879, if yes, predict it to be the second type, and if not, predict it to be the third type. This is the process of decision tree classification.
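To make the traversal concrete, here is a minimal sketch that applies the two rules mined in section 2 to the whole iris dataset; `predict_iris` is just an illustrative helper, and the thresholds 2.5 and 4.879 come from the experiment above.

import numpy as np
from sklearn.datasets import load_iris

def predict_iris(petal_length):
    """Traverse the two-rule tree mined in section 2."""
    if petal_length <= 2.5:          # rule at the root node
        return 0                     # first category (setosa)
    elif petal_length <= 4.879:      # rule at the second layer
        return 1                     # second category (versicolor)
    else:
        return 2                     # third category (virginica)

X, y = load_iris(return_X_y=True)
preds = np.array([predict_iris(pl) for pl in X[:, 2]])   # the 3rd feature is petal length (cm)
print("Accuracy of the hand-built tree:", (preds == y).mean())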

This is an iterative process: the dataset at the current layer determines which rule is mined, and the mined rule in turn determines how the data is divided for the next layer.

4. Evaluation indicators of classification rules

4.1 What is the partition rule evaluation index

In the experiment in section 2, where logistic regression reproduced the core idea of the decision tree, the following decision tree was constructed:

[Figure: the two-rule decision tree built in section 2]

The modeling process of the decision tree is actually the process of mining effective classification rules, and whether the classification rules are effective or not requires an evaluation standard .

Finding classification rules based on the conclusions of the logistic regression model can be seen as finding classification rules based on the accuracy of the classification results. At this time, the accuracy rate is the evaluation index for evaluating the classification conditions.

For example, at the first layer of tree growth, petal length (cm) <= 2.5 is chosen as the classification condition to split the original dataset in two: one of the resulting subsets is 100% first-class data, and the other contains only second- and third-class data. If accuracy is used to judge whether this rule separates the first class from the other two, the accuracy is 100% and the classification error is 0.

This way of defining the evaluation metric is not general and easily becomes ambiguous in multi-class problems. For example, in a four-class problem, suppose condition 1 separates class A from classes B, C, D with 100% accuracy, and condition 2 separates classes A, B from classes C, D with 100% accuracy. If accuracy is still the criterion for judging which condition is better, choosing between them becomes a problem.

Therefore, in general, the metric a tree model uses to select classification rules is not the accuracy after each split, but the label purity of the child-node datasets produced when a parent node is split.

4.2 Purity and impurity

In decision tree models, "purity" and "impurity" are measures used to measure the degree of class confusion in a data set.

  • Purity : A dataset is "pure" if all its instances belong to the same class. In other words, high purity means that a node contains samples that mostly belong to the same class.

  • Impurity : As opposed to purity, a dataset is "impure" if its instances belong to multiple categories. Impurity is a measure of how many categories are mixed together in the same dataset.

The direction in which the decision tree grows is the direction in which the purity of each divided subset becomes higher and higher.

In a decision tree, the ultimate goal is to divide the dataset into as "pure" subsets as possible by choosing appropriate features and split points . That is, it is desirable that the instances in each subset belong to the same class as much as possible.

There are many impurity metrics used in decision trees, among which there are three most commonly used: classification error (Classification Error), information entropy (Entropy) and Gini coefficient (Gini).

For each node of the decision tree, calculate the impurity corresponding to all possible segmentation methods, and select the segmentation method with the largest reduction in impurity to divide the data set.

4.3 Measures of impurity

4.3.1 Classification error

Classification error is a common indicator to measure the performance of classifiers. In a binary classification problem, classification error can be simply defined as the ratio of the number of misclassified samples to the total number of samples. For multi-classification problems, the classification error can also be extended to the ratio of the number of misclassified samples to the total number of samples.

If applied to a decision tree, the classification error can be defined at each node. For a given node, it is the proportion of samples on that node that do not belong to the majority class. In other words, if every sample on the node is predicted as the majority class, the fraction of wrong predictions is the classification error. Its mathematical expression is:

$$Classification\ Error(t) = 1 - \max_{1 \leq i \leq c}[p(i|t)] \tag{1}$$

  • $i$ denotes the $i$-th class;
  • $c$ denotes the total number of classes in the current dataset;
  • $p(i|t)$ denotes the proportion of class-$i$ samples in dataset $t$.

If it is measured by classification error, it is 1 minus the proportion of the majority class.

For example, in a dataset of 10 pieces of data, there are 6 pieces of category 0 data and 4 pieces of category 1 data. At this time, the classification error of the dataset is 1-6/10 = 0.4. The classification error takes a value in the range [0, 0.5]. The smaller the classification error, the higher the purity of the data set label.
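As a quick numeric check of this example, a minimal sketch (the `classification_error` helper is only for illustration):

import numpy as np

def classification_error(class_counts):
    """1 minus the proportion of the majority class."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    return 1 - p.max()

print(classification_error([6, 4]))   # 0.4, matching the example above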

In the simplest greedy tree-building procedure, classification error can serve directly as the evaluation metric for partition rules.

But in practice, it's not a good measure of the purity of a node. Because the classification error is not sensitive to the distribution of samples in the node, its value is often only determined by the largest number of classes. Therefore, in the construction of decision trees, more sensitive metrics, such as Gini impurity or information entropy, are often used.

4.3.2 Information Entropy (Entropy)

Entropy has also been mentioned in previous articles, so it is not repeated in detail here. It is used to measure the uncertainty or confusion of information, and is defined as:

$$Entropy(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t) \tag{2}$$

  • $Entropy(t)$: the information entropy of the target variable on node $t$ (for example, a node of a decision tree); this is the quantity to be computed.
  • $p(i|t)$: the probability that a sample on node $t$ belongs to class $i$, i.e. the number of class-$i$ samples on $t$ divided by the total number of samples on $t$.
  • $\sum_{i=1}^{c}$: summation over all classes, where $c$ is the total number of classes.
  • $\log_2$: the base-2 logarithm; base 2 is used because, in information theory, information is usually measured in bits.

Therefore, when information entropy is applied to a decision tree, the formula computes the entropy of the class distribution on a given node $t$. It measures the uncertainty or confusion of the data on that node.

Again using the dataset of 10 samples, with 6 samples of class 0 and 4 of class 1, the information entropy is computed as follows.

First, compute the probability $p(i|t)$ of each class:

  • for class 0:

$$p(0|t) = \frac{6}{10} = 0.6 \tag{3}$$

  • for class 1:

$$p(1|t) = \frac{4}{10} = 0.4 \tag{4}$$

Next, plug these probabilities into the information entropy formula:

$$Entropy(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t) \tag{5}$$

Substituting the specific values:

$$Entropy(t) = -[0.6 \log_2(0.6) + 0.4 \log_2(0.4)] \tag{6}$$

After calculation:

$$Entropy(t) \approx 0.97 \tag{7}$$

This value is the information entropy of the data set. The smaller the information entropy, the higher the purity of the data set.

The intuitive meaning of information entropy is: if there are more possible outcomes of an event and the more uniform the probability distribution of the outcome, then the greater the information entropy of the event, otherwise the smaller the information entropy. For example, for a fair coin, the probability of its heads and tails appearing is equal, so its information entropy is the largest, which is 1 bit. And for a coin with heads on both sides, its information entropy is 0 because its outcome is deterministic.
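The numbers above, including the two coin examples, can be verified with a few lines of NumPy (a minimal sketch; the `entropy` helper here is illustrative, not part of any library):

import numpy as np

def entropy(class_counts):
    """Information entropy of a label distribution, in bits."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    p = p[p > 0]              # drop empty classes to avoid log2(0)
    if len(p) == 1:           # a pure dataset has zero entropy
        return 0.0
    return -np.sum(p * np.log2(p))

print(round(entropy([6, 4]), 3))   # ~0.971, the 6/4 dataset above
print(entropy([1, 1]))             # 1.0, a fair coin
print(entropy([2, 0]))             # 0.0, a two-headed coin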

4.3.3 Gini coefficient (Gini)

The Gini coefficient (Gini index) is another way to measure impurity. The larger the Gini coefficient, the higher the impurity of the dataset. It is defined as follows:

$$Gini(t) = 1 - \sum_{i=1}^{c} p(i|t)^2 \tag{8}$$

  • $c$ denotes the total number of classes in the dataset;
  • $p(i|t)$ denotes the proportion of class-$i$ samples in the dataset.

Calculation of the Gini coefficient involves subtracting the sum of squared probabilities for each class from 1. The Gini coefficient reaches a maximum when all classes are distributed with equal probabilities, which is when the data set is most impure. When there is only one type of data, the probability of all other categories is 0. At this time, the Gini coefficient is equal to 0, indicating that the data set is completely pure.

The calculation of the Gini coefficient is more intuitive and more efficient because it does not require logarithmic calculations. In addition, compared with information entropy, Gini coefficient is more sensitive to unbalanced class distribution.

Again using the dataset of 10 samples, with 6 samples of class 0 and 4 of class 1, the Gini coefficient is computed as follows.

First, compute the proportion of each class in the dataset:

  • for class 0:

$$p(0|t) = \frac{6}{10} = 0.6 \tag{9}$$

  • for class 1:

$$p(1|t) = \frac{4}{10} = 0.4 \tag{10}$$

Substitute the proportions into the Gini formula:

$$Gini(t) = 1 - \sum_{i=1}^{c} [p(i|t)]^2 \tag{11}$$

Plugging in the specific values:

$$Gini(t) = 1 - [p(0|t)^2 + p(1|t)^2] = 1 - [(0.6)^2 + (0.4)^2] = 1 - [0.36 + 0.16] = 1 - 0.52 = 0.48 \tag{12}$$

Unlike information entropy, the Gini coefficient takes values in the range [0, 0.5], and the smaller the Gini coefficient, the higher the purity of the dataset's labels.
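A quick check of equation (12) and of the value range, as a minimal sketch (the `gini` helper is only for illustration):

import numpy as np

def gini(class_counts):
    """Gini coefficient of a label distribution."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    return 1 - np.sum(p ** 2)

print(round(gini([6, 4]), 3))    # 0.48, matching equation (12)
print(round(gini([5, 5]), 3))    # 0.5, the most impure two-class case
print(round(gini([10, 0]), 3))   # 0.0, a completely pure dataset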

4.4 Evaluation Metrics for Multiple Datasets

The procedure above applies to a single dataset. In most cases, however, we need to measure not only the label purity of an individual dataset but the overall purity of several datasets together, for example the overall score of the two child nodes produced when a parent node is split.

As an example, consider the following dataset:

  • Dataset A has two features and one label; the label has only two classes, 0 and 1. The features are income and credit rating (credit_rating).

    [Figure: dataset A, 8 rows with features income and credit_rating and a 0/1 label]

For the overall dataset, the Gini coefficient is calculated as follows.

Dataset A contains 8 samples, of which 5 belong to class 0 and 3 to class 1. First compute its overall Gini coefficient:

  • proportion of class 0: $p(0|A) = \frac{5}{8} = 0.625$
  • proportion of class 1: $p(1|A) = \frac{3}{8} = 0.375$

$$Gini(A) = 1 - [p(0|A)^2 + p(1|A)^2] = 1 - [(0.625)^2 + (0.375)^2] = 1 - [0.390625 + 0.140625] = 1 - 0.53125 = 0.46875 \tag{13}$$

Now suppose a classification condition is set arbitrarily, say income <= 1.5; the dataset can then be divided into two sub-datasets B1 and B2:

[Figure: dataset A split by income <= 1.5 into sub-datasets B1 and B2]

Sub-dataset B1 contains 5 samples, 2 of class 0 and 3 of class 1, so its Gini coefficient follows from the Gini formula:

  • proportion of class 0: $p(0|B_1) = \frac{2}{5} = 0.4$
  • proportion of class 1: $p(1|B_1) = \frac{3}{5} = 0.6$

$$Gini(B_1) = 1 - [p(0|B_1)^2 + p(1|B_1)^2] = 1 - [(0.4)^2 + (0.6)^2] = 1 - [0.16 + 0.36] = 1 - 0.52 = 0.48 \tag{14}$$

Sub-dataset B2 contains only one label, so its Gini coefficient is 0:

$$Gini(B_2) = 0 \tag{15}$$

To compute the overall Gini coefficient of B1 and B2, each sub-dataset's Gini coefficient must be weighted by the fraction of the parent dataset's samples it contains, i.e.:

$$Gini(B) = \frac{|B_1|}{|A|}Gini(B_1) + \frac{|B_2|}{|A|}Gini(B_2) \tag{16}$$

where $\frac{|B_i|}{|A|}$ is the ratio of the number of samples in sub-dataset $B_i$ to the number of samples in parent dataset A. The overall Gini coefficient of $B_1$ and $B_2$ is therefore:

$$Gini(B) = \frac{5}{8}Gini(B_1) + \frac{3}{8}Gini(B_2) = 0.3 \tag{17}$$

So far, a method for describing the overall purity of the two subsets after the data set is divided has been constructed.
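The weighted summation of equations (16) and (17) can also be sketched in a few lines; the helper names below are illustrative only.

import numpy as np

def gini(class_counts):
    """Gini coefficient of a label distribution."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    return 1 - np.sum(p ** 2)

def weighted_gini(child_class_counts):
    """Size-weighted Gini coefficient of several child nodes, as in equation (16)."""
    sizes = np.array([np.sum(c) for c in child_class_counts])
    weights = sizes / sizes.sum()
    return sum(w * gini(c) for w, c in zip(weights, child_class_counts))

print(gini([5, 3]))                     # Gini(A) = 0.46875, equation (13)
print(weighted_gini([[2, 3], [0, 3]]))  # Gini(B) = 5/8 * 0.48 + 3/8 * 0 = 0.3, equation (17)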

5. Decision Tree Algorithm

The core idea of the tree model comes from the greedy algorithm's pursuit of locally optimal solutions.

Dozens of tree models have been developed so far. The ones you must understand are the ID3, C4.5, and C5.0 decision trees, and the one you must master is the CART decision tree. They are all concrete algorithms for building decision trees; with their different construction strategies they can be regarded as different schools, and each has its own strengths, weaknesses, and suitable problem types.

5.1 Basic modeling process of ID3 algorithm

ID3 (Iterative Dichotomiser 3) is the earliest and most classic decision tree algorithm, and the one that really launched the tree-model family. Proposed by Ross Quinlan (first described in the late 1970s), it can only model discrete variables and only for classification problems; that is, ID3 can handle neither continuous features nor regression. If the training data contains continuous variables, they must be discretized first (for example by binning).

Take a look at an example:

This is a set of personal consumption data; every feature is a discrete variable (the age and income columns are the result of binning continuous variables, for example the age column is binned with 30 and 40 as boundaries).

[Figure: the consumption dataset with columns age, income, student, credit_rating and the class label]

To model such data with ID3, the idea is: fix the split-evaluation metric, find the split that reduces information entropy the fastest (rule extraction), divide the dataset accordingly, and keep iterating until convergence.

ID3 divides the dataset (extracts rules) column by column, that is, the data are split according to the distinct values of some column. For example, splitting the original dataset by the distinct values of the age column gives the following result:

[Figure: the dataset split into sub-datasets by the three values of the age column]

Then the overall impurity reduction obtained by splitting the dataset on the distinct values of age can be calculated.

In ID3, information entropy is used as the evaluation index , and the specific calculation process is as follows:

import numpy as np

def entropy(p):
    """Binary information entropy for a node where one class has proportion p."""
    if p != 0 and p != 1:
        ent = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    else:
        ent = 0
    return ent

def calculate_entropy(A, B1, B2, B3, weights):
    """Compute the entropy of the parent node and of the child nodes,
    then the information gain of the split."""
    ent_A = entropy(A)

    ent_B1 = entropy(B1)
    ent_B2 = entropy(B2)
    ent_B3 = entropy(B3)

    # weighted entropy of the three child nodes
    ent_B = weights[0]*ent_B1 + weights[1]*ent_B2 + weights[2]*ent_B3

    gain = ent_A - ent_B

    print(f"Information entropy of parent node A: {ent_A:.3f}")
    print(f"Information entropy of child node B1: {ent_B1:.3f}")
    print(f"Information entropy of child node B2: {ent_B2:.3f}")
    print(f"Information entropy of child node B3: {ent_B3:.3f}")
    print(f"Information entropy of child nodes B: {ent_B:.3f}")
    print(f"Information gain from A to B: {gain:.3f}")

    return gain

# class proportions (share of one of the two label values) in the parent node
# and in the three age subsets; weights = subset size / total size
A = 5/14
B1 = 2/5
B2 = 2/5
B3 = 0
weights = [5/14, 5/14, 4/14]

calculate_entropy(A, B1, B2, B3, weights)

This code reproduces the weighted-impurity calculation of section 4.4, with information entropy in place of the Gini coefficient. See the result:

[Figure: code output, parent entropy ≈ 0.94, weighted child entropy ≈ 0.69, information gain ≈ 0.25]

Information gain is to calculate the change value of information entropy before and after dividing the data set, that is, the information entropy of the original data set minus the information entropy of the divided data set.

The calculation above gives the impurity reduction obtained when the dataset is split on the distinct values of the age column. Expanding on age, however, is only one candidate rule for the first step of tree growth; the impurity reduction after expanding on the income, student, or credit_rating column must be tested in the same way. The calculation is analogous to the one for age, and the results are as follows:

[Figure: comparison of the impurity reduction when splitting on age, income, student and credit_rating]

Expanding on the age column reduces the impurity of the dataset most effectively, so the first layer of tree growth divides the dataset according to the distinct values of the age column.
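To carry out this comparison programmatically, a generic helper along the following lines could compute the information gain of any column. This is only a sketch: the tiny DataFrame is a hypothetical stand-in for the table above, included just to show the call pattern.

import numpy as np
import pandas as pd

def entropy_of_labels(labels):
    """Information entropy of a label column, in bits."""
    p = labels.value_counts(normalize=True).to_numpy()
    return -np.sum(p * np.log2(p))

def information_gain(df, feature, target):
    """Parent entropy minus the size-weighted entropy of the subsets
    obtained by splitting on every distinct value of `feature` (ID3-style)."""
    parent = entropy_of_labels(df[target])
    children = sum(len(g) / len(df) * entropy_of_labels(g[target])
                   for _, g in df.groupby(feature))
    return parent - children

# hypothetical toy data, only to demonstrate the call pattern
toy = pd.DataFrame({"age": ["<=30", "<=30", "31-40", ">40", ">40"],
                    "label": ["no", "no", "yes", "yes", "no"]})
print(information_gain(toy, "age", "label"))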

The next step is the process of continuous iterative calculation. The final tree growth form of the model is as follows:

[Figure: the fully grown ID3 tree]

So far, the whole process of modeling the ID3 decision tree has been completed.

A few points need to be clarified:

1. ID3 extracts rules column by column; the number of branches produced at each step of growth is entirely determined by the number of levels of the chosen column.

2. ID3 consumes one column per expansion, so columns are "used up" very quickly, and the number of features in the dataset bounds the maximum depth of the tree.

3. Because ID3 expands by column, it can only handle datasets whose features are all discrete variables.

4. During growth, ID3 tends to prefer categorical variables with many distinct values, which makes the model more prone to overfitting.

5.2 Basic modeling process of C4.5 algorithm

C4.5 is an improved version of ID3, which was also proposed by Ross Quinlan, and has been optimized in three aspects based on ID3:

1. It introduces the gain ratio (Gain Ratio) as the feature-selection criterion, overcoming ID3's tendency to prefer features with many values.

2. It can handle continuous features, by taking the midpoint of adjacent values as a candidate split point, as well as missing values.

3. It adds a pruning strategy for the decision tree to prevent overfitting.

  • Information Value (IV)

The information value (hereinafter IV) in C4.5 is a metric that measures how many branches a split produces: the more branches a split creates, the higher its IV. It is calculated as:

$$Information\ Value = -\sum_{i=1}^{K} P(v_i)\log_2 P(v_i) \tag{18}$$

where $K$ is the total number of branches of the split, $v_i$ denotes the $i$-th branch (child node) after the split, and $P(v_i)$ is the proportion of the parent node's samples that fall into that branch.

The IV formula has the same form as the information entropy formula. The difference lies in what the proportions refer to: information entropy measures the confusion of the label values, using class proportions, while IV measures the confusion of the feature's values, using the proportion of samples falling into each child node of the split.

In actual modeling, ID3 selects split rules by information gain, while C4.5 uses the IV value to correct the information gain and builds a new split-evaluation metric, the gain ratio (Gain Ratio, GR), to guide the selection of split rules. GR is computed as:

$$Gain\ Ratio = \frac{Information\ Gain}{Information\ Value} \tag{19}$$
For example, the information gain obtained by expanding on the age column is:

$$IG = ent_A - ent_B = 0.94 - 0.69 = 0.25$$
[Figure: entropy results for the age split (parent ≈ 0.94, children ≈ 0.69)]

The split information value (Split Information, IV) of the feature is calculated according to the three subsets (B1, B2, B3).

The weight of each subset (the proportion of the total sample) is 5/14, 5/14, 4/14 respectively, and the calculation formula is as follows:

$$IV = -\left(\frac{5}{14}\log_2\frac{5}{14} + \frac{5}{14}\log_2\frac{5}{14} + \frac{4}{14}\log_2\frac{4}{14}\right) = 1.577 \tag{20}$$

Therefore, the resulting GR value is:

$$Gain\ Ratio = \frac{Information\ Gain}{Information\ Value} = \frac{0.25}{1.577} \approx 0.16 \tag{21}$$
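Equations (18) to (21) can be reproduced numerically with a small sketch; the branch sizes 5/5/4 and the rounded entropies 0.94 and 0.69 come from the age example above, and `information_value` is an illustrative helper.

import numpy as np

def information_value(branch_sizes):
    """Split information of a partition, equation (18)."""
    p = np.asarray(branch_sizes) / np.sum(branch_sizes)
    return -np.sum(p * np.log2(p))

info_gain = 0.94 - 0.69                # information gain of the age split, section 5.1
iv = information_value([5, 5, 4])      # the three branches of the age split
print(round(iv, 3))                    # ~1.577, equation (20)
print(round(info_gain / iv, 2))        # ~0.16, the gain ratio of equation (21)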

C4.5 handles continuous variables by taking the midpoints of adjacent values as candidate split points, then selecting the final split by computing the GR of each candidate.

For example, if the age column of the dataset above were a continuous variable, the GR values to compute would become GR(income), GR(student), GR(credit_rating), GR(age<=26.5), GR(age<=27.5), and so on.

[Figure: candidate split points when age is treated as a continuous variable]

5.3 Basic modeling process of CART algorithm (very important)

CART (Classification and Regression Trees) is a type of decision tree that can be used for both classification and regression tasks . It only uses a binary tree for decision-making, that is, each internal node will only generate two child nodes. Compared with ID3 and C4.5, the CART algorithm is more convenient for processing continuous and discrete data.

The CART classification tree uses the Gini coefficient to select the optimal division feature and division point; the CART regression tree uses the variance or absolute deviation of the sample as the division criterion.

Still using the same dataset A, this is how CART creates candidate rules:

[Figure: dataset A with features income and credit_rating and a 0/1 label]

First, sort each column numerically: extract the income and credit_rating columns and sort their values separately. The results are as follows:

[Figure: the sorted values of the income and credit_rating columns]

Then, find the intermediate point between different values ​​of these features as the cut point to construct alternative rules.

For example, income takes only the values 1 and 2, so there is a single cut point, 1.5, and the rule income <= 1.5 can be created to split the dataset: the rows with income 1 go to one subset and the rows with income 2 go to the other. If a feature took three values, two cut points could be found, giving two ways to split the dataset, and so on.

If a feature in the data is a continuous variable, every row may have a different value, for example:

[Figure: a version of the data where age is a continuous variable with 8 distinct values]

It can be treated like a categorical variable with 8 levels, and the split rules are constructed by taking the midpoint between adjacent values as the cut point. In this example, 7 candidate ways of dividing the dataset can be constructed (age <= 36, age <= 34.5...).

In general, whether it is a continuous variable or a categorical variable, as long as there are N values, N-1 division conditions can be created to divide the original data set into two parts.
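A minimal sketch of generating those N-1 candidate cut points for a numeric column (the list of ages below is hypothetical, used only to show the call pattern):

import numpy as np

def candidate_cut_points(values):
    """Midpoints between adjacent distinct values of a numeric feature."""
    v = np.unique(values)          # sorted distinct values
    return (v[:-1] + v[1:]) / 2

print(candidate_cut_points([1, 2, 1, 2]))                      # income -> [1.5]
print(candidate_cut_points([25, 27, 28, 30, 33, 34, 35, 37]))  # 8 hypothetical ages -> 7 cut points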

After understanding how to create candidate rules, it is necessary to select the best classification rules for tree building.

For dataset A above there are two features, each with a single candidate split rule, so there are actually two ways to split the root node. If income <= 1.5 is used, the result is:

[Figure: dataset A split by income <= 1.5]

However, if credit_rating <= 1.5 is used to divide the data set, the following results will appear:

[Figure: dataset A split by credit_rating <= 1.5]

Judging from the results, both conditions carve out a subset containing only one class of label, and the two outcomes do not differ much. So which rule should be used for the first split, letting the decision tree complete its first step of growth?

This is where the split-evaluation metric comes in. In general, for several candidate rules, first compute the Gini coefficient of the parent node (Gini(A)), then the overall Gini coefficient of the two resulting child nodes (Gini(B)); whichever rule makes the difference between the two larger, that is, makes the children's Gini coefficient drop faster, is the one chosen. The code is as follows:

import numpy as np

def gini(p):
    """Gini coefficient from a vector of class probabilities."""
    return 1 - np.sum(np.power(p, 2))

# Gini coefficient of dataset A
A = np.array([5/8, 3/8])
gini_A = gini(A)
print(f"Gini index of dataset A: {gini_A:.3f}")

# Gini coefficient of sub-dataset B1
B1 = np.array([2/5, 3/5])
gini_B1 = gini(B1)
print(f"Gini index of dataset B1: {gini_B1:.3f}")

# Gini coefficient of sub-dataset B2
B2 = np.array([3/3, 0/3])
gini_B2 = gini(B2)
print(f"Gini index of dataset B2: {gini_B2:.3f}")

# Overall (size-weighted) Gini coefficient after the split
gini_total = 5/8 * gini_B1 + 3/8 * gini_B2
print(f"Overall Gini index after splitting: {gini_total:.3f}")

# Decrease of the Gini coefficient
gini_decrease = gini_A - gini_total
print(f"Gini decrease: {gini_decrease:.3f}")

The result is as follows:

[Figure: code output, Gini(A)=0.469, Gini(B1)=0.480, Gini(B2)=0.000, overall Gini after split=0.300, Gini decrease=0.169]

If the second division rule is used to divide the data set, the result of the decline of the Gini coefficient at this time is:

# Gini coefficient of sub-dataset B1
B1 = np.array([4/4, 0/4])
gini_B1 = gini(B1)
print(f"Gini index of dataset B1: {gini_B1:.3f}")

# Gini coefficient of sub-dataset B2
B2 = np.array([1/4, 3/4])
gini_B2 = gini(B2)
print(f"Gini index of dataset B2: {gini_B2:.3f}")

# Overall (size-weighted) Gini coefficient after the split
gini_total = 4/8 * gini_B1 + 4/8 * gini_B2
print(f"Overall Gini index after splitting: {gini_total:.3f}")

# Decrease of the Gini coefficient (gini and gini_A come from the previous snippet)
gini_decrease = gini_A - gini_total
print(f"Gini decrease: {gini_decrease:.3f}")

The result is as follows:

[Figure: code output, Gini(B1)=0.000, Gini(B2)=0.375, overall Gini after split=0.188, Gini decrease=0.281]

The second rule can make the Gini coefficient of the parent node drop faster, that is, credit_rating <= 1.5, so this rule should be used in the first data set division.

After this round of rule selection and tree growth, the next question is whether further classification rules need to be searched for to divide the newly created datasets B1 and B2.

[Figure: dataset A split by credit_rating <= 1.5]

First of all, for the data set B1, its Gini coefficient is already 0, no further calculation is needed;

The Gini coefficient of dataset B2 is 0.375, so effective classification rules can still be extracted to reduce it. At this point the division procedure used for dataset A is repeated in full: first enumerate the candidate rules for B2. For B2 the only candidate rule is income <= 1.5, so the dataset is divided by this rule:

[Figure: sub-dataset B2 split by income <= 1.5 into C1 and C2]

The resulting C1 and C2 both have a Gini coefficient of 0, and at this point the decision tree stops growing.

In general, the growth of a decision tree is essentially an iterative operation: taking the conclusion of the previous round (the current division of the data) as the starting condition, find the best classification rule for each sub-dataset, divide it again, and repeat.
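This whole procedure is what the CART implementation in scikit-learn automates. Below is a minimal sketch on the iris data from section 2, with the depth capped at 2 so that the learned tree mirrors the two hand-mined rules; the parameter choices are only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# print the learned splits: the root rule isolates the first class with a single
# petal-based threshold, much like the rule mined with logistic regression in section 2
print(export_text(clf, feature_names=iris.feature_names))
print("Training accuracy:", clf.score(iris.data, iris.target))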

6. Pruning strategy

A decision tree is a greedy algorithm that selects at each node the best splitting attribute (i.e., the attribute with the greatest information gain or the smallest Gini coefficient) to classify the training data as accurately as possible. However, this method may lead to overfitting, that is, the decision tree model is too complex and fits the training data so well that it loses its generalization ability to unknown data (test data).

The decision tree is a model that is particularly prone to overfitting: the more layers it grows, the more complex the tree and the higher the structural risk of the model. Therefore, if no constraints are put on the tree's growth, for instance if the convergence condition is strict (say, requiring the final Gini coefficient of every leaf to be 0) and the number of iterations is not limited, the model is very likely to overfit. Limiting the decision tree's tendency to overfit is therefore a very important part of the modeling process.

The pruning strategies of different decision tree algorithms are also different. In general, the pruning of tree models is divided into two types.

  • Limit the model's growth while it grows, known as pre-pruning. The idea is to evaluate each node before splitting it during tree construction; if splitting the current node cannot improve the generalization performance of the decision tree, stop splitting and mark the node as a leaf.
  • Let the tree grow fully first and prune afterwards, known as post-pruning. The idea is to build a complete decision tree first, then examine the non-leaf nodes bottom-up; if replacing a subtree with a leaf improves the generalization performance, prune it.

Both C4.5 and CART trees adopt the method of post-pruning. Whether it is pre-pruning or post-pruning, the goal is to simplify the model, improve the generalization ability of the model, and avoid the occurrence of over-fitting.
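As a concrete example of post-pruning, scikit-learn exposes cost-complexity pruning for its CART trees through the `ccp_alpha` parameter. A minimal sketch (the dataset and parameter values are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=24)

# grow a full tree first, then inspect its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# refit with increasing ccp_alpha: a larger alpha prunes more aggressively
# (max(alpha, 0.0) guards against tiny negative alphas from floating-point error)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0)).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")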

7. Summary

In this article, the basic concept, construction and growth process of decision tree are introduced in detail. The evaluation indicators of classification rules, including the calculation and function of classification error, information entropy and Gini coefficient are introduced. Finally, three main decision tree algorithms: ID3, C4.5, and CART, and their modeling processes are analyzed in depth. I hope it can help everyone to have a clear understanding of the decision tree.

Finally, thank you for reading this article! If you feel that you have gained something, don't forget to like, bookmark and follow me, this is the motivation for my continuous creation. If you have any questions or suggestions, you can leave a message in the comment area, I will try my best to answer and accept your feedback. If there's a particular topic you'd like to know about, please let me know and I'd be happy to write an article about it. Thank you for your support and look forward to growing up with you!

Looking forward to growing together with you in future studies.


Origin blog.csdn.net/Lvbaby_/article/details/131725481