In-depth exploration of decision trees: from basic construction to machine learning applications

Overview

Decision Tree is a basic classification and regression method and the cornerstone of many advanced machine learning algorithms. A decision tree models the decision-making process with a tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a classification result. Decision trees are intuitive and easy to understand, which makes them a common tool in data mining and machine learning. This article walks through how decision trees work and shows their power and wide range of applications.

Basic concepts of decision trees

Decision Tree makes decisions based on a tree structure. The basic idea comes from the human decision-making process. For example, when we buy a mobile phone, we first consider whether the price is within budget, and then consider the brand, performance, appearance, and other factors. A decision tree makes decisions from top to bottom through a similar logical process. Each node of the tree contains a conditional judgment; according to the outcome of that condition, the data is routed to a different child node (Child Node), and this repeats until a leaf node (Leaf Node) is reached, which gives the final classification result.

The importance of decision trees lies mainly in their intuitiveness and interpretability. Compared with algorithms such as neural networks (Neural Network) or support vector machines (SVM, Support Vector Machine), decision trees are much easier to understand. This matters a great deal in fields such as medicine, finance, and business, because these fields not only require accurate results but also need to show how the results were obtained.

Applications of decision trees

Decision Tree is used for classification problems:

  • Risk assessment, predicting the borrower's default risk through income, liabilities, age, occupation, education, etc.
  • Spam detection, predicting whether the email is spam based on the email content, sender, sending frequency, etc.
  • User churn prediction, predicting whether customers are likely to churn in the future by analyzing purchase history, service usage, satisfaction surveys and other data

Decision Tree is used for regression problems:

  • House price prediction. By analyzing house characteristics, such as area, location, age, etc., decision trees can help us predict house prices.
  • Stock price prediction, by analyzing historical price data, such as company financial reports, macroeconomic and other information, decision trees can help us predict the future price of stocks.
  • Sales forecasting, by analyzing past sales data, promotions, seasonality and other factors, decision trees can help us predict future sales

Basic construction of decision tree

As we delve deeper into the construction of decision trees, let's first look at their basic building blocks, namely nodes and branches; later we will also implement a decision tree by hand.

Node

Each node (Node) in the decision tree contains a conditional judgment or a classification output.

According to the different functions of nodes, we usually divide them into 3 categories:

  • Root Node: the starting point of the decision tree
  • Internal Node: performs a feature test on the data
  • Leaf Node: holds the final classification result

Branch

Branches represent the transition from one node (Node) to another node. That is, based on the conditions in the node, the data will move along different branches to different child nodes (Child Node).

Through nodes and branches, the decision tree simulates a step-by-step decision-making process, starting from the root node, passing through a series of conditional judgments, and finally reaching the leaf nodes to obtain the decision result.
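
As a rough illustration (not code from the original article; the names Node and is_leaf are made up for this sketch), nodes and branches can be represented in Python roughly like this:

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a decision tree: either an internal test or a leaf."""
    attribute: Optional[str] = None                             # attribute tested at an internal node
    branches: Dict[str, "Node"] = field(default_factory=dict)   # attribute value -> child node
    label: Optional[str] = None                                  # class label stored at a leaf


def is_leaf(node: Node) -> bool:
    # A node with no outgoing branches is a leaf
    return not node.branches


# A tiny hand-built tree echoing the phone-buying example: the root tests the price
root = Node(attribute="price", branches={
    "within budget": Node(label="consider buying"),
    "over budget": Node(label="do not buy"),
})
print(is_leaf(root), is_leaf(root.branches["over budget"]))  # False True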

The construction process of decision tree

Building a decision tree mainly includes two key steps: selecting attributes and splitting the data set.

Select properties:

  • Selecting the appropriate attribute as the judgment condition of a node is the core of decision tree construction. We usually use an indicator (such as information gain or the Gini index) to evaluate how well each attribute splits the data, and select the best attribute as the judgment condition of the current node

Create a branch:

  • After selecting the attribute, we divide the data into branches according to the attribute's different values; each subset corresponds to a branch in the tree. For example, if the selected attribute is "color" and its values are "red", "blue", and "green", then we create three branches

Split the dataset:

  • According to the different values of the splitting attribute, the data set is divided into several subsets, each corresponding to a branch in the tree. All data in a subset share the same attribute value.

Build subtree:

  • Recursively build a subtree for each subset until all data is correctly classified or a preset stopping condition is reached.

For example:
We have the following data to predict whether to go for a picnic based on weather conditions:

Weather  | Temperature | Go for a picnic?
Sunny    | High        | Yes
Sunny    | Low         | Yes
Rain     | High        | No
Rain     | Low         | No
Overcast | High        | Yes
Overcast | Low         | Yes

Decision tree construction process:

  1. Select attributes: choose an attribute to split the data set. With the data above we can choose between two attributes: weather and temperature. By calculating the information gain we find that "weather" is the best splitting attribute.
  2. Create branches: based on the three possible values of the "weather" attribute (sunny, rainy, overcast), we create three branches
  3. Split the dataset: split into three subsets based on “weather”
    • Sunny subset: {(high, yes), (low, yes)}
    • Rainy subset: {(high, no), (low, no)}
    • Overcast subset: {(high, yes), (low, yes)}
  4. Recursively build subtrees: for each subset, we again select the best splitting attribute and split the data until the stopping condition is met (a minimal Python sketch of this recursive procedure follows after this list)
    • For the sunny and overcast subsets: all data belong to the same category (going for a picnic), so we can stop splitting and mark these nodes as leaf nodes with the value "yes"
    • For the rainy subset: all data also belong to the same category (not going for a picnic), so we can also stop splitting and mark this node as a leaf node with the value "no"
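
The recursive construction described above can be sketched in a few lines of Python. This is a minimal illustration added for this walk-through (it is not the article's own implementation, which appears later), using plain dictionaries and the picnic data:

import math
from collections import Counter


def entropy(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def build_tree(rows, labels, attributes):
    # Stop splitting when the node is pure or there is nothing left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def info_gain(attr):
        # Information gain = entropy of the parent minus the weighted entropy of the children
        cond = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            cond += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - cond

    # Select the attribute with the highest information gain and branch on its values
    best = max(attributes, key=info_gain)
    branches = {}
    for value in set(row[best] for row in rows):
        sub_rows = [row for row in rows if row[best] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        branches[value] = build_tree(sub_rows, sub_labels, [a for a in attributes if a != best])
    return (best, branches)


rows = [
    {"weather": "sunny", "temperature": "high"},
    {"weather": "sunny", "temperature": "low"},
    {"weather": "rain", "temperature": "high"},
    {"weather": "rain", "temperature": "low"},
    {"weather": "overcast", "temperature": "high"},
    {"weather": "overcast", "temperature": "low"},
]
labels = ["yes", "yes", "no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["weather", "temperature"]))
# Prints something like: ('weather', {'sunny': 'yes', 'rain': 'no', 'overcast': 'yes'})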

Information gain

Information Gain is a common way to evaluate the importance of attributes; it is derived from the concept of entropy (Entropy) in information theory. In decision tree learning, information gain is used to select the attribute that best distinguishes the classes in the data set. The basic concepts and calculation of information gain are as follows.


Entropy

Entropy refers to a measure of data uncertainty. The greater the value of entropy, the higher the uncertainty of the data.

The formula for entropy:
H(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  • m: the number of different classes
  • p_i: the proportion of samples belonging to class i

Conditional Entropy

Conditional Entropy is the entropy of a data set under given conditions.

The formula of conditional entropy:
H(D|A) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v)

  • V: the number of possible values of attribute A
  • D^v: the subset of D in which attribute A takes its v-th value; |D^v| and |D| are the sizes of the subset and the original data set respectively

Information Gain

Information gain is the difference between the entropy of the data set and the conditional entropy given a certain attribute.

Formula:
IG(D, A) = H(D) - H(D|A)

Information gain reflects the amount of information obtained by splitting data set D on attribute A. The greater the information gain, the better attribute A separates the classes in data set D.

Information gain calculation

Let's work through an example using the picnic data from before:

Weather  | Temperature | Go for a picnic?
Sunny    | High        | Yes
Sunny    | Low         | Yes
Rain     | High        | No
Rain     | Low         | No
Overcast | High        | Yes
Overcast | Low         | Yes

Calculate the entropy of a data set

We have two categories: "yes" and "no". In this data set there are 4 "yes" and 2 "no", so:

p(\text{yes}) = \frac{4}{6}

p(\text{no}) = \frac{2}{6}

Therefore, the entropy of the data set is:

H(D) = -\left[p(\text{yes}) \log_2 p(\text{yes}) + p(\text{no}) \log_2 p(\text{no})\right]

H(D) = -\left[\frac{4}{6}\log_2\left(\frac{4}{6}\right) + \frac{2}{6}\log_2\left(\frac{2}{6}\right)\right]

H(D) \approx 0.918

Calculate the conditional entropy given an attribute

Take the weather attribute as an example.

When the weather is sunny, there are 2 "yes" and 0 "no"; the rainy and overcast subsets are likewise pure (all "no" and all "yes" respectively):

H(D|\text{weather}=\text{sunny}) = -\left[\frac{2}{2}\log_2\left(\frac{2}{2}\right) + 0\right] = 0

H(D|\text{weather}=\text{rain}) = -\left[0 + \frac{2}{2}\log_2\left(\frac{2}{2}\right)\right] = 0

H(D|\text{weather}=\text{overcast}) = -\left[\frac{2}{2}\log_2\left(\frac{2}{2}\right) + 0\right] = 0

Then we calculate the conditional entropy of the weather attribute:

H(D|\text{weather}) = \frac{2}{6} H(D|\text{weather}=\text{sunny}) + \frac{2}{6} H(D|\text{weather}=\text{rain}) + \frac{2}{6} H(D|\text{weather}=\text{overcast}) = 0

Calculate Information Gain

IG(D, \text{weather}) = H(D) - H(D|\text{weather}) = 0.918 - 0 = 0.918

Similar calculations can be applied to the temperature attribute. By comparing the information gains we can determine which attribute is more suitable for splitting the data set. In the example above, the weather attribute provides the largest information gain, so we use weather as the splitting attribute at the root of the decision tree.
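
As a quick check of these numbers, here is a small Python snippet (added for illustration; not part of the original article) that recomputes the entropies and the information gains:

import math


def entropy(counts):
    # Entropy computed from the class counts of a (sub)set
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Entropy of the whole data set: 4 "yes" and 2 "no"
H_D = entropy([4, 2])                                                             # ~0.918

# Conditional entropy of "weather": each value covers 2 of the 6 rows and is pure
H_weather = (2/6) * entropy([2]) + (2/6) * entropy([2]) + (2/6) * entropy([2])    # 0.0

# Conditional entropy of "temperature": "high" -> 2 yes / 1 no, "low" -> 2 yes / 1 no
H_temp = (3/6) * entropy([2, 1]) + (3/6) * entropy([2, 1])                        # ~0.918

print(f"IG(D, weather)     = {H_D - H_weather:.3f}")   # 0.918
print(f"IG(D, temperature) = {H_D - H_temp:.3f}")      # 0.000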

Commonly used decision tree algorithms

Several common decision tree algorithms are: ID3, C4.5 and CART.

ID3 (Iterative Dichotomiser 3) algorithm

The ID3 algorithm uses a top-down, greedy strategy to construct a decision tree. Select an attribute at each node to split the data in order to obtain the maximum information gain.

The ID3 algorithm is susceptible to noise and prone to overfitting, and it cannot handle continuous attributes or missing values directly.

ID3 example (this uses the third-party decision-tree-id3 package; the six shim at the top of the code works around its dependency on the removed sklearn.externals.six module):

import sys
import six
sys.modules['sklearn.externals.six'] = six
from id3 import Id3Estimator
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Load the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create an ID3 classifier instance
clf = Id3Estimator()

# Fit the model
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output result:

Accuracy: 1.00

C4.5 Algorithm

The C4.5 algorithm is an extension of the ID3 algorithm with several optimizations that overcome ID3's limitations. C4.5 uses the information gain ratio rather than the information gain to select attributes, which reduces the bias toward attributes with many distinct values.

C4.5 introduces pruning to avoid overfitting: after the tree is constructed, unnecessary nodes are removed to simplify the model. C4.5 can also handle continuous attributes and missing values, which makes it more practical.
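
For reference, C4.5's gain ratio normalizes the information gain by the attribute's intrinsic value (its split information), which penalizes attributes that take many distinct values:

GainRatio(D, A) = \frac{IG(D, A)}{IV(A)}, \qquad IV(A) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}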

CART algorithm

The CART algorithm builds a binary tree and can be used for both classification and regression tasks. In classification problems, CART uses the Gini index to select attributes. The Gini index measures the impurity of the data set: the smaller the Gini index, the lower the impurity and the better the split.
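
For a data set D with m classes, where p_i is the proportion of samples in class i, the Gini index is defined as:

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2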

Unlike the multi-way trees of ID3 and C4.5, CART uses a binary tree structure in which each internal node has exactly two children, which makes the model more concise and efficient. In regression problems, CART selects attributes and split points by minimizing the squared error.

CART code example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Create a decision tree classifier instance
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Output result:

Accuracy: 1.00

Decision tree evaluation and pruning

Building a decision tree is an iterative process. To obtain an efficient and reliable model, we need to measure the performance of the decision tree with evaluation metrics and optimize the model with pruning. This is a vital part of decision tree learning.

Evaluation metrics for decision trees

Evaluation metrics are the basis for measuring model performance. For decision trees, commonly used metrics include precision, recall, and the F1 score.

Commonly used evaluation metrics:

  • Precision: the proportion of samples predicted as positive that are actually positive. Precision reflects the accuracy of the model's positive predictions.
  • Recall: the proportion of actually positive samples that the model correctly predicts as positive. Recall reflects the completeness of the model.
  • F1 score (F1 Score): the harmonic mean of precision and recall. The F1 score balances the two and is a good indicator of the model's overall performance (the formulas follow this list).
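
Writing TP, FP, and FN for true positives, false positives, and false negatives, these metrics are:

Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}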

Through these metrics we can evaluate the decision tree's performance from different perspectives and identify the model's strengths and weaknesses.

Decision tree pruning technology

To prevent overfitting (Overfitting) and improve the model's ability to generalize, pruning techniques are essential.

Pruning is divided into two types: pre-pruning and post-pruning:

  • Pre-pruning: pruning performed while the decision tree is being built. Common pre-pruning techniques include setting a maximum depth or a minimum number of samples required to split a node. Pre-pruning lets us control the complexity of the decision tree
  • Post-pruning: pruning performed after the decision tree has been fully built. Common post-pruning techniques include error-rate pruning and cost-complexity pruning. Post-pruning can usually produce a more accurate model, but its computational cost is higher

Code comparison:

"""
@Module Name: 决策树 预剪枝vs后剪枝.py
@Author: CSDN@我是小白呀
@Date: October 19, 2023

Description:
Decision tree: pre-pruning vs. post-pruning
"""
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# Create a decision tree classifier with a maximum depth of 3 (pre-pruning)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Evaluate the model
print("Pre-pruning:")
print(f'Training accuracy: {clf.score(X_train, y_train)}')
print(f'Test accuracy: {clf.score(X_test, y_test)}')

# Set the ccp_alpha parameter for cost-complexity pruning (post-pruning)
clf_cost_complexity_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
clf_cost_complexity_pruned.fit(X_train, y_train)

# Evaluate the model
print("Post-pruning:")
print(f'Training accuracy (pruned): {clf_cost_complexity_pruned.score(X_train, y_train)}')
print(f'Test accuracy (pruned): {clf_cost_complexity_pruned.score(X_test, y_test)}')

Output result:

Pre-pruning:
Training accuracy: 0.9821428571428571
Test accuracy: 0.9736842105263158
Post-pruning:
Training accuracy (pruned): 0.9821428571428571
Test accuracy (pruned): 0.9736842105263158

Application of decision trees in machine learning

Decision trees are a very common model in machine learning and can be used for classification, regression, ensemble learning, and other tasks.

Classification tasks

In classification tasks, the decision tree learns the relationship between the features and the labels of the data and builds a tree structure for classification. Each internal node represents a feature test and each leaf node represents a category. Starting from the root node, the feature values are tested step by step until a leaf node is reached, which gives the predicted category.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt


# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Visualize the tree with plot_tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

Output result:
[Figure: visualization of the decision tree trained on the classification task]

Regression tasks

In regression tasks, the goal of a decision tree is to predict a continuous value. This differs slightly from a classification tree: the leaf nodes of a regression tree contain a real value instead of a category label.

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt


# Load the data (the original article used load_boston, which was removed in scikit-learn 1.2;
# the California housing dataset is used here instead)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regressor
reg = DecisionTreeRegressor(random_state=42)

# Fit the model
reg.fit(X_train, y_train)

# Predict on the test set
y_pred = reg.predict(X_test)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')

# Visualize the tree with plot_tree (only the top levels, for readability)
plt.figure(figsize=(12, 8))
plot_tree(reg, feature_names=housing.feature_names, max_depth=3, filled=True)
plt.show()

Output result:
[Figure: visualization of the decision tree trained on the regression task]

Advantages and Disadvantages of Decision Trees

As a simple and easy-to-use machine learning model, decision trees have achieved good results in many practical applications. Let's look at their advantages and disadvantages.

Advantages of decision trees

  • Simple and intuitive: the structure of a decision tree is very simple and mimics the human decision-making process, which makes the model easy to interpret. By visualizing the decision tree we can clearly see every decision and branch condition
  • High computational efficiency: both training and prediction are efficient. Decision trees require little data preprocessing or standardization and can handle categorical and continuous features directly, which makes them convenient and fast in practice

Disadvantages of decision trees

  • Easy to overfit: Decision trees are easy to overfit, especially when the depth of the tree is large. Overfitting may cause the model to perform well on the training data, but perform poorly on the test set
  • Sensitive to noise: Decision trees are very sensitive to noise and outliers in the data. A slight amount of noise may lead to the generation of a completely different tree.

Implementing a decision tree by hand

"""
@Module Name: 手把手教你实现决策树.py
@Author: CSDN@我是小白呀
@Date: October 20, 2023

Description:
A step-by-step decision tree implementation
"""
import numpy as np


class TreeNode:
    # One tree node: split statistics plus, for internal nodes, the split feature and threshold
    def __init__(self, gini, num_samples, num_samples_per_class, predicted_class):
        self.gini = gini
        self.num_samples = num_samples
        self.num_samples_per_class = num_samples_per_class
        self.predicted_class = predicted_class
        self.feature_index = 0
        self.threshold = 0
        self.left = None
        self.right = None


def gini(y):
    # Gini impurity of a label array, using the global num_classes set by train_tree
    m = len(y)
    return 1.0 - sum((np.sum(y == c) / m) ** 2 for c in range(num_classes))


def grow_tree(X, y, depth=0, max_depth=None):
    # Recursively grow the tree: create a node for this subset, then try to split it
    num_samples_per_class = [np.sum(y == i) for i in range(num_classes)]
    predicted_class = np.argmax(num_samples_per_class)
    node = TreeNode(
        gini=gini(y),
        num_samples=len(y),
        num_samples_per_class=num_samples_per_class,
        predicted_class=predicted_class,
    )

    if max_depth is None or depth < max_depth:
        idx, thr = best_split(X, y)
        if idx is not None:
            indices_left = X[:, idx] < thr
            X_left, y_left = X[indices_left], y[indices_left]
            X_right, y_right = X[~indices_left], y[~indices_left]
            node.feature_index = idx
            node.threshold = thr
            node.left = grow_tree(X_left, y_left, depth + 1, max_depth)
            node.right = grow_tree(X_right, y_right, depth + 1, max_depth)
    return node


def best_split(X, y):
    # Find the (feature, threshold) pair that minimizes the weighted Gini impurity of the two children
    m, n = X.shape
    if m <= 1:
        return None, None

    num_parent = [np.sum(y == c) for c in range(num_classes)]
    best_gini = 1.0 - sum((num / m) ** 2 for num in num_parent)
    best_idx, best_thr = None, None

    for idx in range(n):
        thresholds, classes = zip(*sorted(zip(X[:, idx], y)))
        num_left = [0] * num_classes
        num_right = num_parent.copy()
        for i in range(1, m):
            c = classes[i - 1]
            num_left[c] += 1
            num_right[c] -= 1
            gini_left = 1.0 - sum(
                (num_left[x] / i) ** 2 for x in range(num_classes)
            )
            gini_right = 1.0 - sum(
                (num_right[x] / (m - i)) ** 2 for x in range(num_classes)
            )
            gini = (i * gini_left + (m - i) * gini_right) / m
            if thresholds[i] == thresholds[i - 1]:
                continue
            if gini < best_gini:
                best_gini = gini
                best_idx = idx
                best_thr = (thresholds[i] + thresholds[i - 1]) / 2
    return best_idx, best_thr


def predict_tree(node, X):
    # Route each sample down the tree; a node without children is a leaf
    if node.left is None and node.right is None:
        return node.predicted_class * np.ones(X.shape[0], dtype=int)

    left_idx = (X[:, node.feature_index] < node.threshold)
    right_idx = ~left_idx

    y = np.empty(X.shape[0], dtype=int)
    y[left_idx] = predict_tree(node.left, X[left_idx])
    y[right_idx] = predict_tree(node.right, X[right_idx])

    return y


def train_tree(X, y, max_depth=None):
    # Entry point: record the number of classes, then grow the tree recursively
    global num_classes
    num_classes = len(set(y))
    tree = grow_tree(X, y, max_depth=max_depth)
    return tree

if __name__ == '__main__':
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load the data
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the hand-written decision tree
    tree = train_tree(X_train, y_train, max_depth=3)

    # Predict on the test set with the hand-written tree
    y_pred = predict_tree(tree, X_test)

    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy * 100:.2f}%')


Origin blog.csdn.net/weixin_46274168/article/details/133932852