Table of contents
1. What is a decision tree?
   1. Decision tree concept
   2. Decision tree example
2. ID3 algorithm for decision tree construction
   1. Construction process of a decision tree
   2. Splitting on features with the ID3 algorithm
3. Implementing the decision tree
4. Summary
1. What is a decision tree?
1. Decision tree concept:
Simply put, a decision tree is a tree. It contains one root node, several internal nodes, and several leaf nodes; the leaf nodes hold the decision results for the problem. Each internal (parent) node is split into child nodes, and each child node in turn becomes a parent and is split further, until leaf nodes carrying the final results are reached.
2. Decision tree example:
Suppose we are given an animal's name and some of its characteristics, and we need to judge from those characteristics whether the animal is a mammal.
| animal | body temperature | breathing | viviparous | hair | mammal |
| --- | --- | --- | --- | --- | --- |
| cat | warm-blooded | lungs | yes | yes | yes |
| snake | cold-blooded | gills | no | no | no |
The corresponding decision tree is as follows (figure not reproduced here):
2. ID3 algorithm for decision tree construction
1. Construction process of decision tree
| animal | food | body temperature | breathing | viviparous | hair | habitat | mammal |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tiger | meat | warm-blooded | lungs | yes | yes | grassland | yes |
| snake | meat | cold-blooded | gills | no | no | forest | no |
| fish | omnivore | cold-blooded | gills | no | no | water | no |
| sheep | grass | warm-blooded | lungs | yes | yes | grassland | yes |
| giraffe | grass | warm-blooded | lungs | yes | yes | grassland | yes |
| panda | omnivore | warm-blooded | lungs | yes | yes | forest | yes |
| elephant | grass | warm-blooded | lungs | yes | no | forest | yes |
| frog | meat | cold-blooded | gills | no | no | water | no |
| turtle | meat | cold-blooded | gills | no | no | water | no |
| blue whale | meat | warm-blooded | lungs | yes | no | water | yes |
Step 1: Treat every feature as a candidate node (features: body temperature, breathing, viviparity, hair, etc.).
Step 2: For the current feature, partition the samples by its values and compute the purity of the resulting child nodes. (For example, body temperature splits the samples into warm-blooded and cold-blooded.)
Step 3: Repeat Step 2 for every feature, choose the feature whose split is best, and recurse on the resulting child nodes.
This process raises two questions:
1. Which feature gives the best split?
2. When should splitting stop?
2. Splitting on features with the ID3 algorithm
Information entropy: entropy measures the degree of disorder; the larger the entropy, the more mixed the set. Suppose a set D contains samples from X classes, and the proportion of samples belonging to the i-th class is $p_i$. Then

$$\mathrm{Ent}(D) = -\sum_{i=1}^{X} p_i \log_2 p_i$$

For a binary problem with classes A and B: when A and B occur equally often, the purity of the node is at its minimum and the entropy reaches its maximum of 1; when all samples of the node belong to A (or all to B), the entropy is 0.
Information gain: the reduction in uncertainty about the class once the value of a feature a is known. Splitting D on a discrete feature a with V possible values yields subsets $D^1, \dots, D^V$, and the gain is calculated as:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
For a data set, ID3 tries splitting on every feature attribute, compares the purity of the resulting subsets, and selects the feature with the highest information gain (i.e., the purest split) as the current division node.
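As a concrete check against the animal table above: D has 6 mammals and 4 non-mammals, and splitting on body temperature produces a warm-blooded subset (6 samples, all mammals) and a cold-blooded subset (4 samples, all non-mammals), both pure:

$$\mathrm{Ent}(D) = -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10} \approx 0.971$$

$$\mathrm{Gain}(D,\text{body temperature}) = 0.971 - \left(\tfrac{6}{10}\cdot 0 + \tfrac{4}{10}\cdot 0\right) = 0.971$$

so body temperature would be an excellent first split.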
3. Implementing the decision tree
1. Build the data set: each row is one sample; the first six values are the six features and the seventh is the sample's class. The labels list names the six features. The six encoded samples are tiger, snake, fish, sheep, elephant and blue whale.
```python
def createDataSet():
    # Feature encodings:
    # body temperature (0 warm-blooded, 1 cold-blooded), breathing (0 lungs, 1 gills),
    # viviparous (0 yes, 1 no), hair (0 yes, 1 no),
    # food (0 meat, 1 omnivore, 2 grass), habitat (0 grassland, 1 forest, 2 water)
    # Rows: tiger, snake, fish, sheep, elephant, blue whale; '是' = mammal, '否' = not
    dataSet = [[0,0,0,0,0,0,'是'],
               [1,1,1,1,0,1,'否'],
               [1,1,1,1,1,2,'否'],
               [0,0,0,0,2,0,'是'],
               [0,0,0,1,2,1,'是'],
               [0,0,0,1,0,2,'是']]
    labels = ['体温','呼吸方式','胎生','毛发','食物','生活环境']
    return dataSet, labels
```
2. Compute the information entropy from the formula: featVec[-1] is the last value of each sample, i.e. the class label, one of '是' (yes) or '否' (no).
```python
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)        # number of samples
    labelCounts = {}                 # occurrences of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]   # class label is the last element
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:          # Ent(D) = -sum(p_i * log2(p_i))
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt
```
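As a quick sanity check on this function: the dataset above has four '是' samples and two '否' samples, so its entropy should be $-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918$. A minimal self-contained sketch (repeating the function so the snippet runs on its own):

```python
import math

def calcShannonEnt(dataSet):
    labelCounts = {}                       # class label -> count
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:                # -sum(p * log2(p))
        prob = float(labelCounts[key]) / len(dataSet)
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

# 4 mammals ('是') and 2 non-mammals ('否'); the feature values are irrelevant here
toy = [[0, '是'], [0, '是'], [0, '是'], [0, '是'], [1, '否'], [1, '否']]
print(round(calcShannonEnt(toy), 4))  # 0.9183
```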
3. Split the data set on one value of one feature:
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]          # values before the split column
            reducedFeatVec.extend(featVec[axis+1:])  # values after the split column
            retDataSet.append(reducedFeatVec)
    # rows whose feature `axis` equals `value`, with that column removed
    return retDataSet
```
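For instance, splitting a few toy rows on feature index 0 with value 0 keeps only the matching rows and drops that column (a standalone copy of the function for illustration):

```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]          # values before the split column
            reducedFeatVec.extend(featVec[axis+1:])  # values after it
            retDataSet.append(reducedFeatVec)
    return retDataSet

rows = [[0, 0, '是'], [1, 1, '否'], [0, 1, '是']]
print(splitDataSet(rows, 0, 0))  # [[0, '是'], [1, '是']]
```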
4. Choose the best feature to split on: dataSet[0] is the first sample, [0,0,0,0,0,0,'是'], whose length is 7; 7 - 1 = 6 is the number of features per sample.
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # last column is the class label
    baseEntropy = calcShannonEnt(dataSet)  # entropy before splitting
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)         # distinct values of feature i
        newEntropy = 0.0
        for value in uniqueVals:           # weighted entropy after splitting on i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print("Information gain of feature %d: %.4f" % (i, infoGain))
        if infoGain > bestInfoGain:        # keep the feature with the largest gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
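On the six-sample dataset from createDataSet, body temperature (feature 0) splits the samples into two pure subsets, so the function should return index 0. A condensed, self-contained check (the debug print is left out of this copy):

```python
import math

def calcShannonEnt(dataSet):
    counts = {}
    for featVec in dataSet:
        counts[featVec[-1]] = counts.get(featVec[-1], 0) + 1
    return -sum((c / len(dataSet)) * math.log(c / len(dataSet), 2)
                for c in counts.values())

def splitDataSet(dataSet, axis, value):
    return [row[:axis] + row[axis+1:] for row in dataSet if row[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / len(dataSet) * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:            # keep the largest gain
            bestInfoGain, bestFeature = infoGain, i
    return bestFeature

dataSet = [[0,0,0,0,0,0,'是'], [1,1,1,1,0,1,'否'], [1,1,1,1,1,2,'否'],
           [0,0,0,0,2,0,'是'], [0,0,0,1,2,1,'是'], [0,0,0,1,0,2,'是']]
print(chooseBestFeatureToSplit(dataSet))  # 0
```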
5. Majority vote: use a dictionary to count how often each class value appears and return the most frequent one.
```python
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # iteritems() is Python 2 only; items() works in Python 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
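A quick check of the majority vote (standalone copy, using Python 3's items()):

```python
import operator

def majorityCnt(classList):
    # Count votes per class and return the most frequent class.
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['是', '否', '是', '是', '否']))  # 是
```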
6. Finally, build the tree:
```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # all samples share one class: return it
    if len(dataSet[0]) == 1:           # no features left to split on: majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])              # a chosen feature is not reused below this node
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]          # copy so recursive calls do not clobber the list
        # recurse on the subset of samples that have this value of the best feature
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
7. Results: one caveat. Whether the gains are printed to 1 or 4 decimal places, the information gains of the first three features are identical, so they cannot be distinguished: in this dataset, body temperature alone perfectly determines whether an animal is a mammal, and so do breathing style and viviparity. The cause is the very small number of samples together with how the ID3 algorithm selects features.
8. Complete code:
```python
import math
import operator

def createDataSet():
    # Feature encodings:
    # body temperature (0 warm-blooded, 1 cold-blooded), breathing (0 lungs, 1 gills),
    # viviparous (0 yes, 1 no), hair (0 yes, 1 no),
    # food (0 meat, 1 omnivore, 2 grass), habitat (0 grassland, 1 forest, 2 water)
    # Rows: tiger, snake, fish, sheep, elephant, blue whale; '是' = mammal, '否' = not
    dataSet = [[0,0,0,0,0,0,'是'],
               [1,1,1,1,0,1,'否'],
               [1,1,1,1,1,2,'否'],
               [0,0,0,0,2,0,'是'],
               [0,0,0,1,2,1,'是'],
               [0,0,0,1,0,2,'是']]
    labels = ['体温','呼吸方式','胎生','毛发','食物','生活环境']
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)        # number of samples
    labelCounts = {}                 # occurrences of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]   # class label is the last element
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:          # Ent(D) = -sum(p_i * log2(p_i))
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]          # values before the split column
            reducedFeatVec.extend(featVec[axis+1:])  # values after the split column
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # last column is the class label
    baseEntropy = calcShannonEnt(dataSet)  # entropy before splitting
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)         # distinct values of feature i
        newEntropy = 0.0
        for value in uniqueVals:           # weighted entropy after splitting on i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print("Information gain of feature %d: %.4f" % (i, infoGain))
        if infoGain > bestInfoGain:        # keep the feature with the largest gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # all samples share one class: return it
    if len(dataSet[0]) == 1:           # no features left to split on: majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])              # a chosen feature is not reused below this node
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]          # copy so recursive calls do not clobber the list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

if __name__ == '__main__':
    myDat, labels = createDataSet()
    # prints the per-feature gains, then: {'体温': {0: '是', 1: '否'}}
    print(createTree(myDat, labels))
```
4. Summary
1. Decision tree
Advantages: easy to understand and explain; classifying with a built tree is fast; irrelevant features do little harm.
Disadvantages: hard to handle data sets with missing values; construction is recursive, so a stopping condition must be chosen or the process will not terminate; prone to overfitting.
2. ID3 algorithm
ID3 is only suitable for small-scale data sets; it handles only discrete features, and its information-gain criterion is biased toward features with many distinct values.
3. This experiment does not yet visualize the decision tree.
Reason: a problem with the matplotlib (plt) installation.