Decision Tree Supplement
The previous post covered how to build a decision tree and visualize it. Here I supplement it with an explanation of how continuous and missing values are handled, together with a code implementation of pruning.
Continuous and missing values
Continuous value processing
What is a continuous value?
Definition: A variable that can take any value within some interval is called a continuous variable. Its values are continuous: between any two adjacent values the interval can be divided infinitely, i.e. the variable can take infinitely many values. -- Baidu Encyclopedia
As the definition shows, a continuous attribute has infinitely many possible values, unlike the finite value set of a discrete attribute, so a node cannot be split directly on each possible value. Instead, the continuous attribute is discretized. The simplest strategy is the one adopted by the C4.5 decision tree: processing continuous values by bi-partition (dichotomy).
Given a sample set $D$ and a continuous attribute $a$, suppose $a$ takes $n$ distinct values on $D$; sort them in ascending order and denote them $\{a^1, a^2, \ldots, a^n\}$. Based on a split point $t$, $D$ can be divided into subsets $D_t^-$ and $D_t^+$, where $D_t^-$ contains the samples whose value on attribute $a$ is not greater than $t$, and $D_t^+$ contains the samples whose value on $a$ is greater than $t$. For adjacent values $a^i$ and $a^{i+1}$, any choice of $t$ in $[a^i, a^{i+1})$ produces the same partition, so for a continuous attribute $a$ the midpoint of each interval $[a^i, a^{i+1})$ is taken as a candidate split point:
$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \le i \le n-1 \right\}$$
These split points can then be evaluated just like discrete attribute values: compute the information gain of each and select the optimal split point to divide the sample set:

$$Gain(D,a) = \max_{t\in T_a} Gain(D,a,t) = \max_{t\in T_a}\; Ent(D) - \sum_{\lambda \in \{-,+\}}\frac{|D_{t}^{\lambda}|}{|D|}\,Ent(D_{t}^{\lambda})$$
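To make the bi-partition concrete, here is a minimal, self-contained sketch (independent of the blog's own code; the names `entropy` and `best_split` are mine) that enumerates the midpoint candidates $T_a$ and picks the split with the largest information gain:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def best_split(values, labels):
    """Return (best_t, best_gain) over the midpoint candidates T_a.

    values: the samples' values on continuous attribute a (assumed distinct)
    labels: the corresponding class labels
    """
    pairs = sorted(zip(values, labels))
    labs = [l for _, l in pairs]
    base = entropy(labs)                         # Ent(D)
    best_t, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint candidate
        left = [l for v, l in pairs if v <= t]   # D_t^- (a <= t)
        right = [l for v, l in pairs if v > t]   # D_t^+ (a > t)
        gain = base - (len(left) / len(labs)) * entropy(left) \
                    - (len(right) / len(labs)) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For example, `best_split([1.0, 2.0, 3.0, 4.0], ['N', 'N', 'Y', 'Y'])` tries the candidates 1.5, 2.5 and 3.5 and returns `(2.5, 1.0)`, since splitting at 2.5 separates the classes perfectly.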
Missing value handling
In practice, data sets often contain incomplete samples (the value of some feature is empty); such samples are said to have missing attribute values. Neither re-collecting and labeling the values by hand nor discarding these samples is a good option: manual completion wastes a great deal of time, while discarding causes a huge waste of data information.
The watermelon book poses two problems that must be solved when handling missing values:
- How do we select the partition attribute when attribute values are missing?
- Given a partition attribute, how do we partition a sample whose value on that attribute is missing?
Solutions:
For question 1:
Given a training set $D$ and attribute $a$, let $\tilde{D}$ denote the subset of samples in $D$ that have no missing value on attribute $a$. We can use $\tilde{D}$ alone to judge the quality of attribute $a$. Suppose attribute $a$ has $V$ possible values $\{a^1, a^2, \ldots, a^V\}$; let $\tilde{D}^v$ denote the subset of $\tilde{D}$ taking value $a^v$ on attribute $a$, and $\tilde{D}_k$ the subset of $\tilde{D}$ belonging to the $k$-th class ($k = 1, 2, \ldots, |\mathcal{Y}|$). Then $\tilde{D} = \bigcup_{v=1}^{V}\tilde{D}^v$ and $\tilde{D} = \bigcup_{k=1}^{|\mathcal{Y}|}\tilde{D}_k$.

Suppose we assign a weight $w_x$ to each sample $x$, and define

$$\rho = \frac{\sum_{x\in\tilde{D}} w_x}{\sum_{x\in D} w_x},\qquad \tilde{p}_k = \frac{\sum_{x\in\tilde{D}_k} w_x}{\sum_{x\in\tilde{D}} w_x}\;(1\le k\le|\mathcal{Y}|),\qquad \tilde{r}_v = \frac{\sum_{x\in\tilde{D}^v} w_x}{\sum_{x\in\tilde{D}} w_x}\;(1\le v\le V)$$

Here $\rho$ is the proportion of samples without missing values, $\tilde{p}_k$ is the proportion of class $k$ among the non-missing samples, and $\tilde{r}_v$ is the proportion of samples taking value $a^v$ on attribute $a$ among the non-missing samples.

Based on these definitions, the information gain formula is extended to:

$$Gain(D,a) = \rho \times Gain(\tilde{D},a) = \rho \times \left( Ent(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v\, Ent(\tilde{D}^v) \right)$$

where

$$Ent(\tilde{D}) = -\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k \log_2 \tilde{p}_k$$
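The weighted quantities $\rho$, $\tilde{p}_k$ and $\tilde{r}_v$ can be computed directly. The following sketch (the function names are mine, not from the post's code) takes samples whose attribute value may be `None` and returns the extended gain $\rho \times Gain(\tilde{D}, a)$:

```python
import math

def weighted_entropy(samples):
    """Entropy under sample weights; samples is a list of (weight, label)."""
    total = sum(w for w, _ in samples)
    ent = 0.0
    for c in set(l for _, l in samples):
        p = sum(w for w, l in samples if l == c) / total
        ent -= p * math.log2(p)
    return ent

def gain_with_missing(data, weights):
    """Extended information gain for an attribute with missing values.

    data: list of (attribute value or None, class label)
    weights: the weight w_x of each sample
    """
    total_w = sum(weights)
    # D_tilde: samples whose value on the attribute is present
    known = [(w, a, l) for w, (a, l) in zip(weights, data) if a is not None]
    rho = sum(w for w, _, _ in known) / total_w       # proportion without missing values
    d_tilde = [(w, l) for w, _, l in known]
    ent = weighted_entropy(d_tilde)                    # Ent(D_tilde)
    sub = 0.0
    for v in set(a for _, a, _ in known):
        dv = [(w, l) for w, a, l in known if a == v]   # D_tilde^v
        r_v = sum(w for w, _ in dv) / sum(w for w, _ in d_tilde)
        sub += r_v * weighted_entropy(dv)
    return rho * (ent - sub)
```

With four unit-weight samples, one of which is missing the attribute, `gain_with_missing([('x', 'Y'), ('x', 'N'), ('y', 'Y'), (None, 'N')], [1.0] * 4)` gives $\rho = 3/4$ and a gain of about 0.189.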
For question 2:
If the value of sample $x$ on the partition attribute $a$ is known, $x$ is placed into the child node corresponding to that value, keeping its weight $w_x$ there. If the value of $x$ on attribute $a$ is unknown, $x$ is placed into every child node simultaneously, and its weight in the child node corresponding to value $a^v$ is adjusted to $\tilde{r}_v \cdot w_x$. Intuitively, the same sample is sent into different child nodes with different probabilities.
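This weight-adjustment rule can be sketched as a small helper (hypothetical, not part of the post's code):

```python
def distribute(weight, attr_value, branch_ratios):
    """Return {branch value: weight} for one sample entering a split node.

    attr_value: the sample's value on the partition attribute, or None if missing
    branch_ratios: maps each branch value a^v to r_v, the weighted share of the
                   non-missing samples that went down that branch
    """
    if attr_value is not None:
        # known value: the sample enters one child with its full weight w_x
        return {attr_value: weight}
    # missing value: the sample enters every child with weight r_v * w_x
    return {v: weight * r for v, r in branch_ratios.items()}
```

For instance, a sample with weight 2.0 and a missing value, at a node whose branches received 60% and 40% of the non-missing weight, enters the two children with weights 1.2 and 0.8.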
Code
The definition and concept of pruning were explained in my previous article, so I won't repeat them here.
Here, I use three features of the Jimei University flag guard team's members (height, years on the team, and whether they are injured) to judge whether a member can take part in the October 1st flag-raising squad (the data is fictitious).
Data Display:
The main code (taking post-pruning as an example):
```python
import copy
import re

# pruning strategy
def postPruningTree(inputTree, dataSet, data_test, labels, labelProperties):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    classList = [example[-1] for example in dataSet]
    featkey = copy.deepcopy(firstStr)
    if '<' in firstStr:  # continuous feature: extract the label and value with a regex
        featkey = re.compile("(.+<)").search(firstStr).group()[:-1]
        featvalue = float(re.compile("(<.+)").search(firstStr).group()[1:])
    labelIndex = labels.index(featkey)
    temp_labels = copy.deepcopy(labels)
    temp_labelProperties = copy.deepcopy(labelProperties)
    if labelProperties[labelIndex] == 0:  # discrete feature
        del (labels[labelIndex])
        del (labelProperties[labelIndex])
    for key in secondDict.keys():  # for each branch
        if type(secondDict[key]).__name__ == 'dict':  # not a leaf node
            if temp_labelProperties[labelIndex] == 0:  # discrete
                subDataSet = splitDataSet_c(dataSet, labelIndex, key)
                subDataTest = splitDataSet_c(data_test, labelIndex, key)
            else:
                if key == 'Y':
                    subDataSet = splitDataSet_c(dataSet, labelIndex, featvalue, 'L')
                    subDataTest = splitDataSet_c(data_test, labelIndex, featvalue, 'L')
                else:
                    subDataSet = splitDataSet_c(dataSet, labelIndex, featvalue, 'R')
                    subDataTest = splitDataSet_c(data_test, labelIndex, featvalue, 'R')
            if len(subDataTest) > 0:
                inputTree[firstStr][key] = postPruningTree(secondDict[key],
                                                           subDataSet, subDataTest,
                                                           copy.deepcopy(labels),
                                                           copy.deepcopy(labelProperties))
    print(testing(inputTree, data_test, temp_labels, temp_labelProperties))
    print(testingMajor(majorityCnt(classList), data_test))
    # keep the subtree if it makes no more errors than a majority-vote leaf
    if testing(inputTree, data_test, temp_labels,
               temp_labelProperties) <= testingMajor(majorityCnt(classList), data_test):
        return inputTree
    return majorityCnt(classList)
```
```python
# evaluate the decision tree on the test set (returns the error count)
def testing(myTree, data_test, labels, labelProperties):
    error = 0.0
    for i in range(len(data_test)):
        classLabelSet = classify(myTree, labels, labelProperties, data_test[i])
        maxWeight = 0.0
        classLabel = ''
        for item in classLabelSet.items():
            if item[1] > maxWeight:
                maxWeight = item[1]  # bug fix: the running maximum was never updated
                classLabel = item[0]
        if classLabel != data_test[i][-1]:
            error += 1
    return float(error)
```
```python
# evaluate a majority-vote node on the test set (returns the error count)
def testingMajor(major, data_test):
    error = 0.0
    for i in range(len(data_test)):
        if major[0] != data_test[i][-1]:
            error += 1
    return float(error)
```
```python
# classify a test sample
def classify(inputTree, featLabels, featLabelProperties, testVec):
    firstStr = list(inputTree.keys())[0]  # root node
    firstLabel = firstStr
    lessIndex = str(firstStr).find('<')
    if lessIndex > -1:  # a continuous feature
        firstLabel = str(firstStr)[:lessIndex]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstLabel)  # feature at the root node
    classLabel = {}  # class label -> accumulated weight
    for key in secondDict.keys():  # loop over each branch
        if featLabelProperties[featIndex] == 0:  # discrete feature
            if testVec[featIndex] == key:  # the test sample enters this branch
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf, recurse
                    classLabelSub = classify(secondDict[key], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: collect the result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
            elif testVec[featIndex] == 'N':  # attribute value missing: enter every branch
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf, recurse
                    classLabelSub = classify(secondDict[key], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        # bug fix: index by classKey, not by the branch key
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: collect the result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
        else:
            partValue = float(str(firstStr)[lessIndex + 1:])
            if testVec[featIndex] == 'N':  # value missing: sum the results of every branch
                if type(secondDict[key]).__name__ == 'dict':  # not a leaf, recurse
                    classLabelSub = classify(secondDict[key], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: collect the result
                    for classKey in classLabel.keys():
                        if classKey == secondDict[key][0]:
                            classLabel[classKey] += secondDict[key][1]
                        else:
                            classLabel[classKey] += secondDict[key][2]
            elif float(testVec[featIndex]) <= partValue and key == 'Y':  # go into the left subtree
                if type(secondDict['Y']).__name__ == 'dict':  # not a leaf, recurse
                    classLabelSub = classify(secondDict['Y'], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: collect the result
                    for classKey in classLabel.keys():
                        if classKey == secondDict['Y'][0]:
                            classLabel[classKey] += secondDict['Y'][1]
                        else:
                            classLabel[classKey] += secondDict['Y'][2]
            elif float(testVec[featIndex]) > partValue and key == 'N':  # go into the right subtree
                if type(secondDict['N']).__name__ == 'dict':  # not a leaf, recurse
                    classLabelSub = classify(secondDict['N'], featLabels,
                                             featLabelProperties, testVec)
                    for classKey in classLabel.keys():
                        classLabel[classKey] += classLabelSub[classKey]
                else:  # a leaf: collect the result
                    for classKey in classLabel.keys():
                        if classKey == secondDict['N'][0]:
                            classLabel[classKey] += secondDict['N'][1]
                        else:
                            classLabel[classKey] += secondDict['N'][2]
    return classLabel
```
Run results:
After pruning:
Analysis
After seeing the pruning result above, you may notice the following issues:
- Can that be right? After pruning, the classification is decided by a single feature, which differs from the original partition.
- Why does it come out this way? The trees in other people's posts and in textbooks have several layers.
Regarding the issues above:
- I designed the data set myself; the data size and the number of features are both small, so the pruning looks drastic. If you test on a data set that is an order of magnitude larger, the tree after pruning will surely not collapse to a single feature as mine did.
- Post-pruning tends to generalize better, but training takes longer; the effect also depends on the data size and the number of features.
The purpose of pruning is to remove the decision paths whose judgments are unclear, so that fewer paths remain while the accuracy of the judgments stays as high as possible. Just like ourselves: after making a decision our thinking may later change, and we discard the judgment criteria that turned out to be useless or harmful (post-pruning).
code acquisition
Link: https://pan.baidu.com/s/1BynAi2uPx1eyZLhR6cHnww
Extraction code: 1234