"Machine Learning Formula Derivation and Code Implementation"-chapter7 Decision Tree

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

Decision Tree

A decision tree repeatedly partitions data instances according to feature-based conditions, ultimately achieving classification or regression.

This chapter mainly introduces how to use decision trees to build classification models.
The prediction process of a decision tree model can be regarded either as a set of if-then rules or as a conditional probability distribution defined over the feature space and class space.
The core concepts of the decision tree model are the feature selection method, the tree construction process, and tree pruning. Common feature selection methods include information gain, information gain ratio, and the Gini index, and the corresponding three classic decision tree algorithms are ID3, C4.5, and CART.

The decision tree model can be understood from two perspectives. The first regards the decision tree as a set of if-then rules: each path from the root node to a leaf node is turned into one rule, where the internal node features along the path form the rule's conditions and the leaf node gives the rule's conclusion. The if-then rules of a decision tree are mutually exclusive and collectively exhaustive. These rules are essentially a set of classification rules, and the goal of decision tree learning is to induce such a set of rules from the data.

The second perspective understands decision trees as conditional probability distributions. Assume the feature space is partitioned into mutually disjoint regions, and that each region is assigned a probability distribution over the classes; together these define a conditional probability distribution. The conditional probability distribution represented by the decision tree is composed of the class distributions conditioned on each region. Let X be the random variable of the features and Y the random variable of the class; the corresponding conditional probability distribution is written P(Y|X). When the conditional probability distribution at a leaf node is biased towards a certain class, the probability of belonging to that class is relatively large.

Our learning goal is to find a decision tree that classifies the training data as correctly as possible; but to ensure generalization, the tree should not fit the training data too exactly, so we minimize the empirical error while regularizing the complexity of the tree.

The goal of decision tree learning is to minimize the following loss function:
$$L_{\alpha}(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|$$
where $t$ ranges over the leaf nodes of the tree $T$, $|T|$ is the number of leaf nodes, each leaf node $t$ contains $N_t$ samples, $H_t(T)$ is the empirical entropy on leaf node $t$, and $\alpha \ge 0$ is the regularization parameter.

1 Feature selection

In order to build a decision tree with good classification performance, we need to keep selecting features with strong classification ability from the training set. If splitting the data set on a feature performs no better than a random split, we can consider that feature's ability to classify the data set to be weak; conversely, if a feature makes the samples in each resulting branch node belong, as far as possible, to the same class, i.e. the nodes have high purity (purity), then the feature has strong classification ability for the data set.
In decision trees, we have three ways to select the optimal feature: information gain, information gain ratio, and the Gini index.

1.1 Information entropy (information entropy)

In information theory and probability statistics, entropy is a measure to describe the uncertainty of random variables. It can also be used to describe the purity of a sample set. The lower the information entropy, the smaller the sample uncertainty, and the higher the corresponding purity.
Assume that the proportion of samples of the k-th class in the current data set D is $p_k$ $(k = 1, 2, \ldots, Y)$. The entropy of the sample data set D can then be defined as:
$$E(D) = -\sum_{k=1}^{Y} p_k \log p_k$$

## 信息熵计算定义
from math import log
import pandas as pd

# 信息熵计算函数
def entropy(ele): # ele 包含类别取值的列表
    probs = [ele.count(i) / len(ele) for i in set(ele)] # 计算列表中取值的概率分布
    return -sum([prob * log(prob, 2) for prob in probs]) # 计算信息熵

# 试运行
df = pd.read_csv('./golf_data.csv')
entropy(df['play'].to_list())
0.9402859586706309

1.2 Information Gain

Assume that the joint probability distribution of the discrete random variables $(X, Y)$ is
$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n$$
The conditional entropy $E(Y \mid X)$ expresses the uncertainty of the random variable $Y$ given the random variable $X$; it is defined as the mathematical expectation, over $X$, of the entropy of the conditional distribution of $Y$ given $X = x_i$:
$$E(Y \mid X) = \sum_{i=1}^{m} p_i E(Y \mid X = x_i), \quad p_i = P(X = x_i)$$
When using actual data for calculation, the probability calculation in entropy and conditional entropy is based on maximum likelihood estimation, and the corresponding entropy and conditional entropy are also called empirical entropy and empirical conditional entropy.
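
As a reference (not from the book), the empirical conditional entropy E(D|A) can also be computed directly with pandas; the sketch below assumes the golf DataFrame df loaded above together with a feature column and a label column, and the info_gain function defined later computes the same quantity internally.

# Hedged sketch (assumption: not the book's code): empirical conditional entropy E(D|A)
from math import log
import pandas as pd

def conditional_entropy(df, feature, label):
    total = len(df)
    ent = 0.0
    for _, subset in df.groupby(feature): # split D by the values of feature A
        weight = len(subset) / total # p_i = |D_i| / |D|
        probs = subset[label].value_counts(normalize=True) # class distribution within D_i
        ent += weight * (-sum(p * log(p, 2) for p in probs)) # p_i * E(D_i)
    return ent

# usage: conditional_entropy(df, 'outlook', 'play')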

Information gain (information gain) is defined as the reduction in the uncertainty of the class Y brought by the information of feature X; in other words, information gain measures how much more certain the target class becomes once a feature is known, and the larger a feature's information gain, the more certainty it provides about the target class. Assume the empirical entropy of the training set D is E(D), and the empirical conditional entropy of D given feature A is E(D|A). The information gain is then defined as the difference between the empirical entropy E(D) and the empirical conditional entropy E(D|A):
$$g(D, A) = E(D) - E(D \mid A)$$
Information gain can be used for feature selection when building a decision tree. Given the training set D and feature A, the empirical entropy E(D) expresses the uncertainty of classifying the data set D, the empirical conditional entropy E(D|A) expresses the uncertainty of classifying D once feature A is given, and the difference between the two uncertainties is the information gain of feature A for the data set D. The ID3 algorithm performs feature selection based on information gain.

# 划分数据集
def df_split(df, col): # 数据,特征
    unique_col_val = df[col].unique() # 获取依据特征的不同取值
    res_dict = {elem: pd.DataFrame for elem in unique_col_val}
    for key in res_dict.keys():
        res_dict[key] = df[:][df[col] == key]
    return res_dict
res_dict = df_split(df, 'outlook')
res_dict


# 信息增益计算
def info_gain(df, col):
    res_dict = df_split(df, col)
    entropy_D = entropy(df['play'].to_list()) # 计算数据集的经验熵
    entropy_DA = 0 # 天气特征的经验条件熵
    for key in res_dict.keys():
        entropy_DA += len(res_dict[key]) / len(df) * entropy(res_dict[key]['play'].tolist()) # p * 经验条件熵
    return entropy_D - entropy_DA # 天气特征的信息增熵

# 增益越大,代表对应的特征分类能力越强
print(f"humility:{
      
      info_gain(df, 'humility')}")
print(f"outlook:{
      
      info_gain(df, 'outlook')}")
print(f"windy:{
      
      info_gain(df, 'windy')}")
print(f"temp:{
      
      info_gain(df, 'temp')}")
humility:0.15183550136234136
outlook:0.2467498197744391
windy:0.04812703040826927
temp:0.029222565658954647

1.3 Information Gain Ratio

Information gain is a good feature selection method, but it has a problem: when a feature has many distinct values, its computed information gain tends to be large. For example, suppose we add a "number" feature to the data set that numbers the records from the first to the last; it takes 14 distinct values, so this feature would produce 14 decision tree branches, each containing only one sample, every node would have very high purity, and the resulting information gain would be much larger than that of the other features. In reality, however, a feature like "number" has no real classification ability, and a decision tree built this way is useless. Therefore, feature selection based on information gain is biased towards features with many values.

print(f"humility:{
      
      info_gain(df, 'humility')}")
# 为数据集加一个“编号”特征
df['counter'] = range(len(df))
print(f"counter:{
      
      info_gain(df, 'counter')}")
df

The above problem can be corrected by using the information gain ratio. The information gain ratio of a feature A with respect to a data set D is defined as the ratio of the information gain g(D, A) to the entropy $E_A(D)$ of the data set D with respect to the values of feature A, where n is the number of distinct values of A:
$$g_R(D, A) = \frac{g(D, A)}{E_A(D)}$$
$$E_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

# 信息增益比计算
def information_gain_ratio(df, col):
    g = info_gain(df, col)
    entropy_EAD = entropy(df[col].to_list())
    return g / entropy_EAD

# 试运行
print(f"outlook:{
      
      information_gain_ratio(df, 'outlook')}")
print(f"counter:{
      
      information_gain_ratio(df, 'counter')}")
outlook:0.15642756242117517
counter:0.2469656698468429

1.4 Gini index

In addition to information gain and information gain ratio, the Gini index (Gini index) is also a good feature selection method. The Gini index is defined for a probability distribution. Suppose a sample may belong to one of K classes, and the probability that a sample belongs to class k is $p_k$; then the Gini index of the class probability distribution is defined as:
$$Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
For a given training set D, where $C_k$ is the set of samples belonging to class k, the Gini index of the training set can be defined as:
$$Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$
If the training set D is divided into two parts $D_1$ and $D_2$ according to whether feature A takes a certain value a, then, under the condition of feature A, the Gini index of the training set D can be defined as:
$$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$$
Similar to information entropy, the Gini index $Gini(D)$ of the training set D expresses the uncertainty of the set, and $Gini(D, A)$ expresses the uncertainty of the training set D after it is split by A = a. For classification tasks, we want the uncertainty of the training set to be as small as possible, which corresponds to features with stronger classification ability for the training samples. The CART algorithm performs feature selection based on the Gini index.

# 计算基尼指数
import numpy as np
def calculate_gini(y): # y 包含类别取值的列表
    # 将数组转化为列表
    y = y.tolist()
    probs = [y.count(i)/len(y) for i in np.unique(y)]
    gini = sum([p*(1-p) for p in probs])
    return gini

# 划分数据集并计算基尼指数
def gini_da(df, col, key): # g根据天气特征取值为晴与非晴划分为两个子集
    col_val = [key, 'other']
    new_dict = {elem: pd.DataFrame for elem in col_val} # 创建划分结果的数据框字典
    new_dict[key] = df[:][df[col] == key]
    new_dict['other'] = df[:][df[col] != key]
    gini_DA = 0
    for key in new_dict.keys():
        gini_DA += len(new_dict[key]) / len(df) * calculate_gini(new_dict[key]['play'])
    return gini_DA

# 计算天气特征条件下数据集的基尼指数
print(f"sunny:{
      
      gini_da(df, 'outlook','sunny')}")
print(f"rainy:{
      
      gini_da(df, 'outlook','rainy')}")
print(f"overcast:{
      
      gini_da(df, 'outlook','overcast')}")
sunny:0.3936507936507937
rainy:0.4571428571428572
overcast:0.35714285714285715
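
As a hypothetical extension (not in the original notebook), the helper below scans every feature and every candidate value and returns the binary split with the lowest Gini index Gini(D, A=a), which is essentially how CART chooses its splits; it assumes the gini_da function and the golf DataFrame df defined above.

# Hypothetical helper (not in the original code): find the binary split with the lowest Gini index
def best_gini_split(df, label='play'):
    best_feature, best_value, best_gini = None, None, float('inf')
    for col in [c for c in df.columns if c != label]: # candidate features
        for val in df[col].unique(): # candidate split values
            g = gini_da(df, col, val) # Gini(D, A=a) for the split "col == val" vs "col != val"
            if g < best_gini:
                best_feature, best_value, best_gini = col, val, g
    return best_feature, best_value, best_gini

# usage: best_gini_split(df)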

2 Decision tree model

Corresponding to the three feature selection methods of information gain, information gain ratio, and Gini index, there are three classic decision tree algorithms: ID3, C4.5, and CART. These three algorithms construct classification decision trees in essentially the same way: each recursively selects the optimal feature according to its feature selection method. Among them, ID3 and C4.5 only generate the decision tree and do not include a pruning step, so these two algorithms can be prone to overfitting. The CART algorithm can be used for regression as well as classification, and it includes decision tree pruning.

2.1 ID3

The full name of the ID3 algorithm is Iterative Dichotomiser 3. Its core idea is to recursively select the optimal feature based on information gain and use it to construct the decision tree.

The specific procedure is as follows: first set up a root node for the decision tree, then compute the information gain of every feature and select the feature with the largest information gain as the optimal feature; create child nodes according to the different values of this feature, and recursively apply the same procedure to each child node until the information gain is very small or no features remain to choose from. This yields the final ID3 decision tree.

# ID3算法的核心步骤-选择最优特征

def choose_best_feature(df, label):
    '''
    思想:根据训练集和标签选择信息增益最大的特征作为最优特征
    输入:
    df:待划分的训练数据
    label:训练标签
    输出:
    max_value:最大信息增益值
    best_feature:最优特征
    max_splited:根据最优特征划分后的数据字典
    '''

    entropy_D = entropy(df[label].tolist()) # 计算训练标签的信息熵
    cols = [col for col in df.columns if col not in [label]] # 特征集
    max_value, best_feature, max_splited = -999, None, None # 初始化最大信息增益、最优特征和划分后的数据集
    for col in cols: # 遍历特征并根据特征取值进行划分
        splited_set = df_split(df, col)
        entropy_DA = 0 # 初始化经验条件熵
        for subset_col, subset in splited_set.items():
            entropy_DA += len(subset) / len(df) * entropy(subset[label].tolist()) # 计算当前特征的经验条件熵
        info_gain = entropy_D - entropy_DA # 计算当前特征的特征增益
        if info_gain > max_value: # 获取最大信息增熵,并保存对应的特征和划分结果 
            max_value, best_feature = info_gain, col
            max_splited = splited_set
    return max_value, best_feature, max_splited

# 试运行
df = df.drop(labels='counter', axis=1)
choose_best_feature(df, 'play')


# 封装构建ID3决策树的算法类

class ID3Tree: # ID3算法类

    class TreeNode: # 定义树结点
        
        def __init__(self, name): # 定义
            self.name = name
            self.connections = dict()
            
        def connect(self, label, node):
            self.connections[label] = node

    def __init__(self, df, label):
        self.columns = df.columns
        self.df = df
        self.label = label
        self.root = self.TreeNode('Root')
    
    def construct_tree(self): # 构建树的调用
        self.construct(self.root, '', self.df, self.columns)
    
    def construct(self, parent_node, parent_label, sub_df, columns): # 决策树构建方法
        max_value, best_feature, max_splited = choose_best_feature(sub_df[columns], self.label) # 选择最优特征
        if not best_feature: # 如果选不到最优特征,则构造单结点树
            node = self.TreeNode(sub_df[self.label].iloc[0])
            parent_node.connect(parent_label, node)
            return
        
        # 根据最优特征以及子结点构建树
        node = self.TreeNode(best_feature)
        parent_node.connect(parent_label, node)

        new_columns = [col for col in columns if col != best_feature] # 以A-Ag为新的特征集

        # 递归的构造决策树
        for splited_value, splited_data in max_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)
    
    # 打印树
    def print_tree(self, node, tabs):
        print(tabs + node.name)
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + str(connection) + ")")
            self.print_tree(child_node, tabs + "\t\t")
# 构建id3决策树
id3_tree = ID3Tree(df, 'play')
id3_tree.construct_tree()
id3_tree.print_tree(id3_tree.root, '')

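The ID3Tree class above only builds and prints the tree. As a hedged addition (not part of the book's implementation), prediction could be done by walking the tree from the root and, at each internal node, following the branch whose edge label matches the sample's value for that node's feature:

# Hypothetical prediction sketch for the ID3Tree above (not in the original code)
def id3_predict(tree, sample): # tree: fitted ID3Tree; sample: dict mapping feature name -> value
    node = tree.root.connections[''] # skip the artificial 'Root' node added in construct_tree
    while node.connections: # internal node: its name is a feature name
        node = node.connections[sample[node.name]] # follow the branch labelled with the sample's value
    return node.name # leaf node: its name is the predicted class label

# usage: id3_predict(id3_tree, df.drop(columns=['play']).iloc[0].to_dict())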

2.2 CART

The full name of the CART algorithm is classification and regression tree. CART can be understood as a learning algorithm that outputs the conditional probability distribution of a random variable Y given a random variable X. The decision trees generated by CART are binary trees whose internal nodes split on "yes" and "no". This way of splitting nodes recursively bisects each feature, dividing the feature space into a finite number of units, and determines the predicted probability distribution on these units, i.e. the aforementioned conditional probability distribution.
The CART algorithm can also be used to build regression trees. A regression tree corresponds to a partition of the feature space together with the output value on each partition unit. Assume the feature space is divided into $M$ units $R_1, R_2, \ldots, R_M$, and each unit $R_m$ has a fixed output value $c_m$; the regression tree model can then be expressed as:
$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$
As in linear regression, the regression tree is trained by minimizing the mean squared loss in order to obtain the optimal output value $\hat{c}_m$ on each unit. Minimizing the squared error shows that the optimal output value on each unit is the mean of the outputs of all input instances that fall into that unit:
$$\hat{c}_m = \mathrm{average}\left( y_i \mid x_i \in R_m \right)$$
Assume the regression tree selects the $j$-th feature $x^{(j)}$ and a corresponding value $s$ as the splitting feature and splitting point, and defines two regions:
$$R_1(j, s) = \{ x \mid x^{(j)} \le s \}, \qquad R_2(j, s) = \{ x \mid x^{(j)} > s \}$$
The optimal splitting feature $j$ and splitting point $s$ are then found by solving:
$$\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]$$

# 定义树结点
class TreeNode:
    def __init__(self, feature_i=None, threshold=None, leaf_value=None, left_branch=None, right_branch=None):
        
        self.feature_i = feature_i # 特征索引
        self.threshold = threshold # 特征划分阈值
        self.leaf_value = leaf_value # 叶子节点取值
        self.left_branch = left_branch # 左子树
        self.right_branch = right_branch # 右子树
# 定义二叉特征分裂函数
def feature_split(X, feature_i, threshold):
    split_func = None
    if isinstance(threshold, int) or isinstance(threshold, float):
        split_func = lambda sample: sample[feature_i] >= threshold
    else:
        split_func = lambda sample: sample[feature_i] == threshold

    X_left = np.array([sample for sample in X if split_func(sample)])
    X_right = np.array([sample for sample in X if not split_func(sample)])

    return np.array([X_left, X_right])
# 定义二叉决策树
class BinaryDecisionTree:
    def __init__(self, min_samples_split=3, min_gini_impurity=999, max_depth=float('inf'), loss=None): # 决策树初始参数
        
        self.root = None  # 根结点
        self.min_samples_split = min_samples_split # 节点最小分裂样本数
        self.mini_gini_impurity = min_gini_impurity # 节点初始化基尼不纯度
        self.max_depth = max_depth # 树最大深度
        self.impurity_calculation = None # 基尼不纯度计算函数
        self._leaf_value_calculation = None # 叶子节点值预测函数
        self.loss = loss # 损失函数
    
    def fit(self, X, y, loss=None): # 决策树拟合函数
        self.root = self._build_tree(X, y) # 递归构建决策树
        self.loss=None
    
    def _build_tree(self, X, y, current_depth=0): # 决策树构建函数
        init_gini_impurity = 999 # 初始化最小基尼不纯度
        best_criteria = None # 初始化最佳特征索引和阈值
        best_sets = None # 初始化数据子集

        Xy = np.concatenate((X, y), axis=1) # 合并输入和标签
        n_samples, n_features = X.shape # 获取样本数和特征数
        
        # 设定决策树构建条件
        if n_samples >= self.min_samples_split and current_depth <= self.max_depth: # 训练样本数量大于节点最小分裂样本数且当前树深度小于最大深度
            
            for feature_i in range(n_features):
                unique_values = np.unique(X[:, feature_i]) # 获取第i个特征的唯一取值
                
                for threshold in unique_values: # 遍历取值并寻找最佳特征分裂阈值
                    Xy1, Xy2 = feature_split(Xy, feature_i, threshold) # 特征节点二叉分裂

                    if len(Xy1) > 0 and len(Xy2) > 0: # 如果分裂后的子集大小都不为0
                        y1, y2 = Xy1[:, n_features:], Xy2[:, n_features:] # 获取两个子集的标签值
                        impurity = self.impurity_calculation(y, y1, y2) # 计算基尼不纯度

                        if impurity < init_gini_impurity:
                            init_gini_impurity = impurity # 获取最小基尼不纯度
                            best_criteria = {"feature_i": feature_i, "threshold": threshold} # 最佳特征索引和分裂阈值
                            best_sets = {
                                "leftX": Xy1[:, :n_features],
                                "lefty": Xy1[:, n_features:],
                                "rightX": Xy2[:, :n_features],
                                "righty": Xy2[:, n_features:]
                                }

        if init_gini_impurity < self.mini_gini_impurity: # 如果计算的最小不纯度小于设定的最小不纯度
            
            # 分别构建左右子树
            left_branch = self._build_tree(best_sets["leftX"], best_sets["lefty"], current_depth + 1)
            right_branch = self._build_tree(best_sets["rightX"], best_sets["righty"], current_depth + 1)
            return TreeNode(feature_i=best_criteria["feature_i"], threshold=best_criteria["threshold"], left_branch=left_branch, right_branch=right_branch) 

        # 计算叶子计算取值
        leaf_value = self._leaf_value_calculation(y)
        return TreeNode(leaf_value=leaf_value)

    def predict_value(self, x, tree=None): # 定义二叉树值预测函数
        if tree is None:
            tree = self.root

        if tree.leaf_value is not None: # 如果叶子节点已有值,则直接返回已有值
            return tree.leaf_value
        
        feature_value = x[tree.feature_i] # 选择特征并获取特征值

        # 判断落入左子树还是右子树
        branch = tree.right_branch
        if isinstance(feature_value, int) or isinstance(feature_value, float):
            if feature_value >= tree.threshold:
                branch = tree.left_branch
        elif feature_value == tree.threshold:
            branch = tree.left_branch
        
        return self.predict_value(x, branch) # 测试子集

    def predict(self, X): # 数据集预测函数
        y_pred = [self.predict_value(sample) for sample in X]
        return y_pred
# CART分类树
class ClassificationTree(BinaryDecisionTree): # 分类树
    
    def _calculate_gini_impurity(self, y, y1, y2): # 定义基尼不纯度计算过程
        p = len(y1) / len(y)
        gini_impurity = p * calculate_gini(y1) + (1-p) * calculate_gini(y2)
        return gini_impurity
    
    def _majority_vote(self, y): # 多数投票
        most_common = None
        max_count = 0
        for label in np.unique(y):
            # 统计多数
            count = len(y[y == label])
            if count > max_count:
                most_common = label
                max_count = count
        return most_common
    
    def fit(self, X, y): # 分类树拟合
        self.impurity_calculation = self._calculate_gini_impurity
        self._leaf_value_calculation = self._majority_vote
        super(ClassificationTree, self).fit(X, y)
# 测试CART分类树
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = datasets.load_iris() # 鸢尾花数据集
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y.reshape(-1,1), test_size=0.3)
clf = ClassificationTree()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))
0.9777777777777777
# CART回归树
class RegressionTree(BinaryDecisionTree): # 回归树
    def _calculate_variance_reduction(self, y, y1, y2):
        var_tot = np.var(y, axis=0)
        var_y1 = np.var(y1, axis=0)
        var_y2 = np.var(y2, axis=0)
        frac_1 = len(y1) / len(y)
        frac_2 = len(y2) / len(y)
        
        variance_reduction = var_tot - (frac_1 * var_y1 + frac_2 * var_y2) # 计算方差减少量
        return 1/sum(variance_reduction) # 方差减少量越大越好,所以取倒数

    def _mean_of_y(self, y): # 节点值取平均
        value = np.mean(y, axis=0)
        return value if len(value) > 1 else value[0]

    def fit(self, X, y):
        self.impurity_calculation = self._calculate_variance_reduction
        self._leaf_value_calculation = self._mean_of_y
        super(RegressionTree, self).fit(X, y)
# 测试CART回归树
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = RegressionTree()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
Mean Squared Error: 18.304654605263156

A complete decision tree algorithm includes not only a tree generation algorithm but also a pruning algorithm. The generation algorithm recursively produces a decision tree; the resulting tree is large and detailed, but it easily overfits. Decision tree pruning (pruning) simplifies the generated tree: some subtrees or leaf nodes are cut off, and their root node or parent node becomes a new leaf node, thereby simplifying the decision tree.

Decision tree pruning generally includes two methods: pre-pruning (pre-pruning) and post-pruning (post-pruning). Pre-pruning stops the growth of the tree early, during the generation process: before a node is split, we check whether the split improves the generalization ability of the model; if it does not, the tree stops growing at that node. Because pre-pruning stops growth early, it carries some risk of underfitting.
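
As an illustration (not from the book), pre-pruning can be approximated with sklearn's built-in growth constraints such as max_depth, min_samples_split, and min_impurity_decrease, which stop a node from being split once the corresponding condition fails; the parameter values below are only examples.

# Hedged sketch: pre-pruning via sklearn growth constraints (illustrative parameter values)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pre_pruned = DecisionTreeClassifier(
    max_depth=3, # stop growing below depth 3
    min_samples_split=5, # a node needs at least 5 samples to be split
    min_impurity_decrease=0.01 # a split must reduce impurity by at least 0.01
)
pre_pruned.fit(X_train, y_train)
print(accuracy_score(y_test, pre_pruned.predict(X_test)))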

In practical applications, the post-pruning method is used more often. Post-pruning works by minimizing the overall loss function of the decision tree. As mentioned earlier, decision tree learning minimizes the following loss function:
$$L_{\alpha}(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|$$
The empirical entropy in the first term can be written as:
$$H_t(T) = -\sum_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}$$
so the first term of $L_{\alpha}(T)$ can be expressed as:
$$L(T) = \sum_{t=1}^{|T|} N_t H_t(T) = -\sum_{t=1}^{|T|} \sum_{k=1}^{K} N_{tk} \log \frac{N_{tk}}{N_t}$$
and the decision tree optimization objective can be rewritten as:
$$L_{\alpha}(T) = L(T) + \alpha |T|$$
where $L(T)$ is the empirical error term of the model, $|T|$ is the complexity of the decision tree, and $\alpha$ is the regularization parameter.

Post-pruning selects, for a given $\alpha$, the decision tree model with the smallest loss function $L_{\alpha}(T)$. Given the decision tree $T$ produced by the generation algorithm and a regularization parameter $\alpha$, the post-pruning algorithm proceeds as follows: (1) compute the empirical entropy $H_t(T)$ of every leaf node; (2) recursively retract leaf nodes from the bottom up: let $T_{before}$ and $T_{after}$ denote the tree before and after a group of leaf nodes is retracted into its parent node, with loss functions $L_{\alpha}(T_{before})$ and $L_{\alpha}(T_{after})$; if $L_{\alpha}(T_{after}) \le L_{\alpha}(T_{before})$, prune, so that the parent node becomes a new leaf node; (3) repeat step (2) until the subtree with the smallest loss function is obtained.

CART post-pruning computes the loss functions of subtrees to obtain a sequence of subtrees, and then selects the optimal subtree from this sequence by cross-validation.
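
For a concrete reference (not from the book), sklearn implements CART-style cost-complexity post-pruning through the ccp_alpha parameter: cost_complexity_pruning_path returns the sequence of effective alphas that generate the nested subtree sequence, and the optimal subtree can then be chosen by cross-validation. A minimal sketch, assuming the iris data used elsewhere in this chapter:

# Hedged sketch: CART cost-complexity post-pruning with sklearn
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# effective alphas for the nested subtree sequence
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# pick the alpha (i.e. the subtree) with the best cross-validated accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X_train, y_train, cv=5).mean() for a in ccp_alphas]
best_alpha = ccp_alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(pruned.score(X_test, y_test))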

2.3 Implement classification tree and regression tree based on sklearn

# 基于sklearn实现分类树
from sklearn.tree import DecisionTreeClassifier
data = datasets.load_iris() # 鸢尾花数据集
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y.reshape(-1,1), test_size=0.3)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.9333333333333333
# 基于sklearn实现回归树
from sklearn.tree import DecisionTreeRegressor
X, y = load_boston(return_X_y=True)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 14.449605263157896
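
As a brief illustrative follow-up (not from the book), the fitted sklearn classification tree can be inspected as a set of text rules with sklearn.tree.export_text, which mirrors the if-then view of decision trees described at the beginning of this chapter; clf and data are the iris classifier and dataset defined above.

# Hedged sketch: print the learned tree as if-then rules
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(data.feature_names)))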

Notebook_Github address

Origin blog.csdn.net/cjw838982809/article/details/131205405