Machine Learning - Decision Tree Supplement

The previous post covered how to build a decision tree and how to visualize it. Here I add an explanation of how continuous and missing values are handled, together with a code implementation of pruning.

Continuous and missing values

Continuous value processing

What is a continuous value?

Definition: a variable that can take any value within a certain interval is called a continuous variable. Its values are continuous: between any two adjacent values it can be divided indefinitely, i.e. it can take infinitely many values. -- Baidu Encyclopedia

As the definition makes clear, a continuous attribute has infinitely many possible values, unlike a discrete attribute with a finite set of values, so a node cannot be split directly on each possible value of the attribute. This is where continuous-attribute discretization comes in; the simplest strategy is the bi-partition (dichotomy) mechanism used by the C4.5 decision tree.

Given a sample set D and a continuous attribute a, suppose a takes n different values on D. Sort these values in ascending order and denote them a^1, a^2, ..., a^n. Based on a split point t, D can be divided into two subsets D_{t}^{-} and D_{t}^{+}, where D_{t}^{-} contains the samples whose value on attribute a is not greater than t, and D_{t}^{+} contains the samples whose value on attribute a is greater than t. For any two adjacent values a^i and a^{i+1}, every t in the interval [a^i, a^{i+1}) produces the same partition, so for a continuous attribute a the midpoint of each interval [a^i, a^{i+1}) is taken as a candidate split point:
T_{a} = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1\le i \le n-1 \right\}
Then, just as for discrete attributes, compute the information gain at each candidate split point and select the optimal split point to divide the sample set:

Gain(D,a) = \max_{t\in T_a}Gain(D,a,t) = \max_{t\in T_a}\left( Ent(D) - \sum_{\lambda \in\{-,+\}}\frac{|D_{t}^{\lambda}|}{|D|}Ent(D_{t}^{\lambda}) \right)
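
To make the dichotomy concrete, here is a minimal sketch of picking the best split point for a single continuous attribute. It is not the code used later in this post; the helper names (entropy, best_split_continuous) and the toy height data are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_continuous(values, labels):
    # Candidate split points T_a: midpoints of adjacent sorted attribute values
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_gain, best_t = -1.0, None
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue
        t = (xs[i] + xs[i + 1]) / 2
        left = [y for x, y in pairs if x <= t]    # D_t^-
        right = [y for x, y in pairs if x > t]    # D_t^+
        gain = base - (len(left) / len(ys)) * entropy(left) \
                    - (len(right) / len(ys)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# toy example: heights and whether the member may join the flag-raising team
heights = [1.65, 1.70, 1.72, 1.78, 1.80, 1.83]
may_join = ['no', 'no', 'yes', 'yes', 'yes', 'yes']
print(best_split_continuous(heights, may_join))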

Missing value handling

In practice, data sets often contain incomplete samples (some feature value is empty); such samples are said to have missing attribute values. In this case it is neither practical to detect and re-label the values by hand nor acceptable to simply discard the samples: manual completion wastes a great deal of time, and discarding the samples wastes a great deal of information.

The watermelon book raises two problems that must be solved in order to handle missing values:

  1. How do we choose the splitting attribute when attribute values are missing?
  2. Given a splitting attribute, how do we partition a sample whose value on that attribute is missing?

Solution:

For question 1:

Given a training set D and an attribute a, let \widetilde{D} denote the subset of samples in D that have no missing value on attribute a. We can use \widetilde{D} to judge how good attribute a is. Suppose attribute a has V possible values \{a^1, a^2, \dots, a^V\}; let \widetilde{D}^{v} denote the subset of \widetilde{D} whose value on attribute a is a^v, and let \widetilde{D}_{k} denote the subset of \widetilde{D} belonging to the k-th class (k = 1, 2, \dots, |y|). Then

\widetilde{D}=\cup_{k=1}^{|y|}\widetilde{D}_{k},\qquad \widetilde{D}=\cup_{v=1}^{V}\widetilde{D}^{v}

Suppose we assign a weight w_x to each sample x, and define

\rho =\frac{\sum_{x\in\widetilde{D}}{w_{x}}}{\sum_{x \in D}{w_{x}}}

\widetilde{p}_{k}=\frac{\sum_{x\in\widetilde{D}_{k}}{w_{x}}}{\sum_{x \in \widetilde{D}}{w_{x}}}\quad (1\le k\le |y|)

\widetilde{r}_{v}=\frac{\sum_{x\in\widetilde{D}^{v}}{w_{x}}}{\sum_{x \in \widetilde{D}}{w_{x}}}\quad (1\le v\le V)

Intuitively, \rho is the proportion of samples with no missing value on attribute a, \widetilde{p}_{k} is the proportion of the k-th class among the non-missing samples, and \widetilde{r}_{v} is the proportion of non-missing samples that take the value a^{v} on attribute a.

Based on the above definitions, the information gain formula is extended to:

Gain(D,a)=\rho \times Gain(\widetilde{D},a) =\rho \times \left(Ent(\widetilde{D})-\sum_{v=1}^{V}\widetilde{r}_{v}\,Ent(\widetilde{D}^{v})\right)

where

Ent(\widetilde{D})=-\sum_{k=1}^{|y|} \widetilde{p}_{k}\log_2\widetilde{p}_{k}
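
To see how \rho, \widetilde{p}_{k} and \widetilde{r}_{v} fit together, here is a small sketch of the extended information gain. The data layout (samples as dicts with a 'w' weight field and None marking a missing value) and the function name gain_with_missing are my own assumptions for illustration; this is not the code from the download link below.

import math
from collections import defaultdict

def weighted_entropy(subset, weight_key='w', label_key='label'):
    # entropy of a weighted sample subset
    by_class = defaultdict(float)
    for s in subset:
        by_class[s[label_key]] += s[weight_key]
    total = sum(by_class.values())
    return -sum((w / total) * math.log2(w / total)
                for w in by_class.values() if w > 0)

def gain_with_missing(samples, attr, weight_key='w'):
    # samples: list of dicts such as {'height_ok': 'Y' or None, 'label': ..., 'w': 1.0};
    # None marks a missing value on attribute attr
    total_w = sum(s[weight_key] for s in samples)
    known = [s for s in samples if s[attr] is not None]          # D~
    known_w = sum(s[weight_key] for s in known)
    rho = known_w / total_w                                      # rho
    by_value = defaultdict(list)
    for s in known:
        by_value[s[attr]].append(s)                              # D~^v
    weighted_child_ent = 0.0
    for subset in by_value.values():
        r_v = sum(s[weight_key] for s in subset) / known_w       # r~_v
        weighted_child_ent += r_v * weighted_entropy(subset)
    return rho * (weighted_entropy(known) - weighted_child_ent)

# toy usage: one sample has a missing value (None) on 'height_ok'
data = [{'height_ok': 'Y', 'label': 'join', 'w': 1.0},
        {'height_ok': 'Y', 'label': 'join', 'w': 1.0},
        {'height_ok': 'N', 'label': 'stay', 'w': 1.0},
        {'height_ok': None, 'label': 'join', 'w': 1.0}]
print(gain_with_missing(data, 'height_ok'))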

For question 2:

If the value of sample x on the splitting attribute a is known, x is assigned to the child node corresponding to that value, and its weight w_x stays unchanged in that child node. If the value of x on attribute a is unknown, x is assigned to all child nodes at once, and in the child node corresponding to the value a^v its weight is adjusted to \widetilde{r}_{v}\cdot w_x. Intuitively, the same sample is sent into different child nodes with different probabilities.
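
As a tiny illustration of question 2 (again with made-up names), a sample whose value on the splitting attribute is missing can be copied into every child node with its weight scaled by \widetilde{r}_{v}, while a sample with a known value simply keeps its weight w_x:

def distribute_missing(sample, r_by_value, weight_key='w'):
    # r_by_value: {attribute value a^v: r~_v} computed from the non-missing samples.
    # Returns one weighted copy of the sample per child node.
    copies = {}
    for value, r_v in r_by_value.items():
        c = dict(sample)
        c[weight_key] = sample[weight_key] * r_v
        copies[value] = c
    return copies

print(distribute_missing({'label': 'join', 'w': 1.0}, {'Y': 0.75, 'N': 0.25}))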

Code

The definition and concept of pruning were explained in my previous article, so I won't repeat them here; see that post for details.

Here the task is to decide whether a member of the Jimei University flag guard can take part in the October 1st flag-raising ceremony, based on three features: height, years on the team, and whether the member is injured (the data is made up).

Data Display:

(screenshot of the data set omitted)

The main code (using post-pruning as the example):

import copy
import re

# Pruning strategy: post-pruning
def postPruningTree(inputTree, dataSet, data_test, labels, labelProperties):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    classList = [example[-1] for example in dataSet]
    featkey = copy.deepcopy(firstStr)
    if '<' in firstStr:  # for a continuous feature, use a regex to get the feature label and split value
        featkey = re.compile("(.+<)").search(firstStr).group()[:-1]
        featvalue = float(re.compile("(<.+)").search(firstStr).group()[1:])
    labelIndex = labels.index(featkey)
    temp_labels = copy.deepcopy(labels)
    temp_labelProperties = copy.deepcopy(labelProperties)
    if labelProperties[labelIndex] == 0:  # discrete feature
        del (labels[labelIndex])
        del (labelProperties[labelIndex])
    for key in secondDict.keys():  # for each branch
        if type(secondDict[key]).__name__ == 'dict':  # not a leaf node
            if temp_labelProperties[labelIndex] == 0:  # discrete feature
                subDataSet = splitDataSet_c(dataSet, labelIndex, key)
                subDataTest = splitDataSet_c(data_test, labelIndex, key)
            else:
                if key == 'Y':
                    subDataSet = splitDataSet_c(dataSet, labelIndex, featvalue,
                                               'L')
                    subDataTest = splitDataSet_c(data_test, labelIndex,
                                                featvalue, 'L')
                else:
                    subDataSet = splitDataSet_c(dataSet, labelIndex, featvalue,
                                               'R')
                    subDataTest = splitDataSet_c(data_test, labelIndex,
                                                featvalue, 'R')
            if len(subDataTest) > 0:
                inputTree[firstStr][key] = postPruningTree(secondDict[key],
                                                       subDataSet, subDataTest,
                                                       copy.deepcopy(labels),
                                                       copy.deepcopy(
                                                           labelProperties))
    print(testing(inputTree,  data_test, temp_labels,
               temp_labelProperties))
    print(testingMajor(majorityCnt(classList), data_test))
    # Keep the subtree if its error count on the test set is no larger than that
    # of a majority-vote leaf; otherwise prune the subtree into a leaf.
    if testing(inputTree, data_test, temp_labels,
               temp_labelProperties) <= testingMajor(majorityCnt(classList),
                                                     data_test):
        return inputTree
    return majorityCnt(classList)
# Count the classification errors of the decision tree on the test set
def testing(myTree, data_test, labels, labelProperties):
    error = 0.0
    for i in range(len(data_test)):
        classLabelSet = classify(myTree, labels, labelProperties, data_test[i])
        
        # pick the class with the largest accumulated weight
        maxWeight = 0.0
        classLabel = ''
        for item in classLabelSet.items():
            if item[1] > maxWeight:
                maxWeight = item[1]
                classLabel = item[0]
        if classLabel !=  data_test[i][-1]:
            error += 1
    return float(error)


# Count the classification errors of a majority-vote (leaf) node on the test set
def testingMajor(major, data_test):
    error = 0.0
    for i in range(len(data_test)):
        if major[0] != data_test[i][-1]:
            error += 1
    return float(error)
# Test algorithm: classify a single test sample with the decision tree
def classify(inputTree,featLabels, featLabelProperties, testVec):
    firstStr = list(inputTree.keys())[0]  # root node
    firstLabel = firstStr
    lessIndex = str(firstStr).find('<')
    if lessIndex > -1:  # the feature is continuous
        firstLabel = str(firstStr)[:lessIndex]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstLabel)  # feature corresponding to the root node
    classLabel = {}  # per-class weight accumulator; the class labels must already be present as keys for the '+=' updates below to take effect

    for key in secondDict.keys():  # loop over each branch
        if featLabelProperties[featIndex] == 0:  # discrete feature
            if testVec[featIndex] == key:  # the test sample enters this branch
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf node, recurse
                    classLabelSub = classify(secondDict[key],  featLabels,
                                          featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: accumulate its result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
            elif testVec[featIndex] == 'N':  # the test sample's value is missing, enter every branch
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf node, recurse
                    classLabelSub = classify(secondDict[key],  featLabels,
                                          featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: accumulate its result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
        else:  # continuous feature
            partValue = float(str(firstStr)[lessIndex + 1:])
            if testVec[featIndex] == 'N':  # the test sample's value is missing, sum the results of every branch
                # walk into this branch as well
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf node, recurse
                    classLabelSub = classify(secondDict[key],  featLabels,
                                          featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: accumulate its result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
            elif float(testVec[featIndex]) <= partValue and key == 'Y':  # enter the left subtree
                if type(secondDict['Y']).__name__ == 'dict':  # not a leaf node, recurse
                    classLabelSub = classify(secondDict['Y'], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: accumulate its result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict['Y'][1]
                        else:
                            classLabel[classKey] += secondDict['Y'][2]
            elif float(testVec[featIndex]) > partValue and key == 'N':  # enter the right subtree
                if type(secondDict['N']).__name__ == 'dict':  # not a leaf node, recurse
                    classLabelSub = classify(secondDict['N'], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: accumulate its result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict['N'][1]
                        else:
                            classLabel[classKey] += secondDict['N'][2]

    return classLabel
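
For reference, a hypothetical call of the pruning routine might look like the sketch below. The createTree helper and the trainSet/testSet variables are assumed to come from the previous post and the downloaded data, not from this excerpt; the feature names are just placeholders.

# Hypothetical usage sketch; createTree, trainSet and testSet are assumptions.
labels = ['height', 'team age', 'injured']
labelProperties = [1, 0, 0]   # 1 = continuous feature, 0 = discrete feature
fullTree = createTree(trainSet, labels[:], labelProperties[:])
prunedTree = postPruningTree(fullTree, trainSet, testSet,
                             labels[:], labelProperties[:])
print(prunedTree)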

Run result:
(screenshot of the run output omitted)

After pruning:

(screenshot of the pruned tree omitted)

Analysis

Seeing the pruning result above, you may have the following questions:

  1. Is this for real? After pruning, the decision can be made from a single feature, which is quite different from the original partition.
  2. Why does it turn out this way? Other people's trees, and the ones in textbooks, have several layers.

Regarding these questions:

  1. The data set is one I designed myself; both the amount of data and the number of features are small, so pruning collapses the tree this far. If you test on a data set an order of magnitude larger, I am sure the pruned result will not look like mine.
  2. Post-pruning tends to generalize better, but it takes longer to train, and the outcome also depends on the amount of data and the number of features.

The purpose of pruning is to cut away decision paths whose judgments are uncertain, so that fewer paths remain while the accuracy of the judgments stays as high as possible. It is much like ourselves: after making a decision we may rethink it later and discard the criteria that turned out to be useless or even harmful (post-pruning).

Code download

Link: https://pan.baidu.com/s/1BynAi2uPx1eyZLhR6cHnww
Extraction code: 1234

Origin blog.csdn.net/weixin_51961968/article/details/127982576