Machine Learning - Decision Tree Creation

 

Table of contents

1. What is a decision tree?
    1. Decision tree concept
    2. Decision tree example
2. ID3 algorithm for decision tree construction
    1. Construction process of a decision tree
    2. Using the ID3 algorithm to split on features
3. Implementing the decision tree
4. Summary
    1. Decision tree
    2. ID3 algorithm
    3. The decision tree could not be visualized in this experiment


1. What is a decision tree?

1. Decision tree concept:

       Simply put, a decision tree is a tree. A decision tree contains a root node, several internal nodes, and several leaf nodes; the leaf nodes hold the decision results of the problem. In other words, the tree consists of a root node, parent nodes, child nodes, and leaf nodes: each child node is split off from its parent node, then acts as a new parent node and continues to split, until the final results (the leaves) are obtained.

2. Decision tree example:

           Suppose we are given an animal's name and some of its characteristics, and we need to judge whether the animal is a mammal based on those characteristics.

animal name | body temperature     | breathing | viviparous | hair | mammal
cat         | constant temperature | lungs     | yes        | yes  | yes
snake       | cold-blooded         | gills     | no         | no   | no

             The corresponding decision tree is as follows (the figure is not reproduced here):
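Since the figure cannot be shown, here is a text sketch of one tree consistent with the two rows above, written in the nested-dict style used by the code later in this article. Choosing body temperature as the root feature is an assumption for illustration, since any of the four features separates these two rows:

{'body temperature': {'constant temperature': 'yes (mammal)', 'cold-blooded': 'no (not a mammal)'}}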

2. ID3 algorithm for decision tree construction

        1. Construction process of a decision tree

animal name | food     | body temperature     | breathing | viviparous | hair | living environment | mammal
tiger       | meat     | constant temperature | lungs     | yes        | yes  | grassland          | yes
snake       | meat     | cold-blooded         | gills     | no         | no   | forest             | no
fish        | omnivore | cold-blooded         | gills     | no         | no   | in the water       | no
sheep       | grass    | constant temperature | lungs     | yes        | yes  | grassland          | yes
giraffe     | grass    | constant temperature | lungs     | yes        | yes  | grassland          | yes
panda       | omnivore | constant temperature | lungs     | yes        | yes  | forest             | yes
elephant    | grass    | constant temperature | lungs     | yes        | no   | forest             | yes
frog        | meat     | cold-blooded         | gills     | no         | no   | in the water       | no
turtle      | meat     | cold-blooded         | gills     | no         | no   | in the water       | no
blue whale  | meat     | constant temperature | lungs     | yes        | no   | in the water       | yes

Step 1: Treat every feature as a candidate node. (Features: body temperature, way of breathing, viviparous, hair, etc.)

Step 2: For the current feature, traverse its possible values to find the most suitable split, and compute the purity of the child nodes that the split produces. (For example, the body temperature feature splits the data into constant temperature and cold-blooded.)

Step 3: Repeat step 2 over all the features, select the feature whose split is optimal, and obtain the resulting child nodes.

 There are two problems with the above construction process:

  1. Which feature should be selected to give the best split?

  2. When should we stop splitting?

        2. Using the ID3 algorithm to split on features

        Information entropy: entropy measures the degree of disorder; the greater the entropy, the more disordered (impure) the set. Suppose a data set D contains x classes, and the probability that a sample belongs to the i-th class is p_i. Then:

H(D) = -\sum_{i=1}^{x} p_i \log_{2} p_i

      where p_i is the proportion of samples belonging to class i. For a binary classification problem with classes A and B, if the number of A samples equals the number of B samples, the purity of the node is at its lowest and the entropy reaches its maximum value of 1; only when all of the node's samples belong to A, or all belong to B, does the entropy equal 0.
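       As a worked check against the 10-animal table above (6 mammals, 4 non-mammals):

H(D) = -\frac{6}{10}\log_{2}\frac{6}{10} - \frac{4}{10}\log_{2}\frac{4}{10} \approx 0.971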

       Information gain: the amount by which knowing the value of a feature A reduces the uncertainty (entropy) about the class label. It is calculated as follows:

Gain(D, A) = H(D) - H(D|A) = H(D) - \sum_{v=1}^{V} \frac{|D^{v}|}{|D|} H(D^{v})

      For a given data set, we try splitting on every feature attribute, compare the purity of the subsets each split produces, and select the feature that yields the purest subsets (i.e. the largest information gain) as the current split node.
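      For example, splitting the 10-animal table on body temperature gives a subset of 6 constant-temperature animals (all mammals) and a subset of 4 cold-blooded animals (all non-mammals); both subsets have entropy 0, so

Gain(D, \text{body temperature}) = 0.971 - \left( \frac{6}{10}\cdot 0 + \frac{4}{10}\cdot 0 \right) = 0.971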

3. Implementing the decision tree

 1. Build the data set: each row is one sample; the first six values of a sample are its six features, and the seventh value is its class ('是' = yes, it is a mammal; '否' = no). The labels list names the six features. The six samples created here are tiger, snake, fish, sheep, panda and blue whale.

def createDataSet():
    # Each row: six feature values followed by the class label ('是' = mammal, '否' = not a mammal)
    dataSet = [[0,0,0,0,0,0,'是'],
               [1,1,1,1,0,1,'否'],
               [1,1,1,1,1,2,'否'],
               [0,0,0,0,2,0,'是'],
               [0,0,0,1,2,1,'是'],
               [0,0,0,1,0,2,'是']]
    # Feature encodings:
    # 体温 body temperature: 0=constant temperature, 1=cold-blooded
    # 呼吸方式 breathing: 0=lungs, 1=gills
    # 胎生 viviparous: 0=yes, 1=no
    # 毛发 hair: 0=has hair, 1=no hair
    # 食物 food: 0=meat, 1=omnivore, 2=grass
    # 生活环境 living environment: 0=grassland, 1=forest, 2=in the water
    labels = ['体温','呼吸方式','胎生','毛发','食物','生活环境']
    return dataSet, labels

2. Calculate the information entropy according to the formula: featVec[-1] is the last value of each sample, i.e. its class label, one of the two categories '是' (yes) and '否' (no).

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples in the data set
    labelCounts = {}  # dictionary counting how many times each class label appears
    for featVec in dataSet:
        # count the occurrences of each class (here '是' / '否')
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:   # compute the information entropy
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * math.log(prob, 2)   # accumulate -p * log2(p)
    return shannonEnt
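A quick sanity check (a sketch, assuming import math and the two functions above are in the same module): with 4 '是' and 2 '否' samples, the entropy should come out to about 0.9183.

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))  # expected: about 0.9183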

 3. Split the data set on a feature value: keep the samples whose feature at position axis equals value, and remove that feature column from them.

def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # keep only samples whose feature `axis` equals `value`
            # featVec is a flat list; copy the elements before index axis ...
            reducedFeatVec = featVec[:axis]
            # ... then append the elements after index axis, dropping the used feature
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    # return the reduced data set for this branch
    return retDataSet
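For example (a sketch, assuming createDataSet from above): splitting on feature 0 (体温) with value 0 keeps the four constant-temperature samples and drops that column.

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 0))
# [[0, 0, 0, 0, 0, '是'], [0, 0, 0, 2, 0, '是'], [0, 0, 1, 2, 1, '是'], [0, 0, 1, 0, 2, '是']]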

4. Choose the best feature to split on: dataSet[0] is the first sample, [0,0,0,0,0,0,'是']; its length is 7, and 7 - 1 = 6 is the number of features per sample (the last value is the class label, not a feature).

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0   # best information gain found so far
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)  # distinct values of feature i
        newEntropy = 0.0   # conditional entropy after splitting on feature i
        for value in uniqueVals: # weighted entropy of each subset
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print("Information gain of feature %d: %.1f" % (i, infoGain))
        if (infoGain > bestInfoGain):  # keep the feature with the largest information gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
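A usage sketch on the six-sample data set (gain values computed by hand from the formula above): features 0-2 all have gain of about 0.9183, features 3 and 5 about 0.2516, and feature 4 about 0.4591, so the function returns 0 (体温).

myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))  # expected: 0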

5. Majority vote: use a dictionary to count how many times each class appears in classList, and return the most frequent class. This is used when there are no features left to split on.

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # sort the classes by count in descending order and return the most frequent one
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
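For example (a sketch): '是' occurs twice and '否' once, so the majority class is returned.

print(majorityCnt(['是', '否', '是']))  # '是'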

 6. The last step: build the tree recursively.

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]   # if all samples have the same class, return that class
    if len(dataSet[0]) == 1:  # if no features are left to split on, take a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # a feature that has already been used is not considered again
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # recursively build a subtree for each value of the chosen feature
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

7. Results: There is one issue. Whether the information gain is printed to 1 or to 4 decimal places, the gains of the first three features (body temperature, way of breathing, viviparous) are identical, so they cannot be ranked against each other. In other words, on this data set body temperature alone is enough to decide whether an animal is a mammal, but breathing mode or viviparity would work just as well. The likely reasons are the small number of samples and the way the ID3 algorithm selects features.
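For reference, a sketch of the expected console output of the complete program below (gain values computed by hand; the tree is the nested dict returned by createTree, and both branches of the first split are already pure, so the recursion stops there):

Information gain of feature 0: 0.9183
Information gain of feature 1: 0.9183
Information gain of feature 2: 0.9183
Information gain of feature 3: 0.2516
Information gain of feature 4: 0.4591
Information gain of feature 5: 0.2516
{'体温': {0: '是', 1: '否'}}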

 

8. Complete code:

import math
import operator


def createDataSet():
    # Each row: six feature values followed by the class label ('是' = mammal, '否' = not a mammal)
    dataSet = [[0,0,0,0,0,0,'是'],
               [1,1,1,1,0,1,'否'],
               [1,1,1,1,1,2,'否'],
               [0,0,0,0,2,0,'是'],
               [0,0,0,1,2,1,'是'],
               [0,0,0,1,0,2,'是']]
    # Feature encodings:
    # 体温 body temperature: 0=constant temperature, 1=cold-blooded
    # 呼吸方式 breathing: 0=lungs, 1=gills
    # 胎生 viviparous: 0=yes, 1=no
    # 毛发 hair: 0=has hair, 1=no hair
    # 食物 food: 0=meat, 1=omnivore, 2=grass
    # 生活环境 living environment: 0=grassland, 1=forest, 2=in the water
    labels = ['体温','呼吸方式','胎生','毛发','食物','生活环境']
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples in the data set
    labelCounts = {}  # dictionary counting how many times each class label appears
    for featVec in dataSet:
        # count the occurrences of each class (here '是' / '否')
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:   # compute the information entropy
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * math.log(prob, 2)   # accumulate -p * log2(p)
    return shannonEnt


def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # keep only samples whose feature `axis` equals `value`
            # featVec is a flat list; copy the elements before index axis ...
            reducedFeatVec = featVec[:axis]
            # ... then append the elements after index axis, dropping the used feature
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    # return the reduced data set for this branch
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0   # best information gain found so far
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)  # distinct values of feature i
        newEntropy = 0.0   # conditional entropy after splitting on feature i
        for value in uniqueVals: # weighted entropy of each subset
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print("Information gain of feature %d: %.4f" % (i, infoGain))
        if (infoGain > bestInfoGain):  # keep the feature with the largest information gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # sort the classes by count in descending order and return the most frequent one
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]   # if all samples have the same class, return that class
    if len(dataSet[0]) == 1:  # if no features are left to split on, take a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # a feature that has already been used is not considered again
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # recursively build a subtree for each value of the chosen feature
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree



if __name__ == '__main__':

    myDat,labels = createDataSet()
    print(createTree(myDat, labels))

 4. Summary

 1. Decision tree

Advantages: easy to understand and explain, classification with a built tree is fast, and it can handle irrelevant features.

Disadvantages: it is difficult to deal with data sets that have missing values; the construction process is recursive, so a stopping condition must be specified or the process will not terminate; and it is prone to overfitting.

2. ID3 algorithm

 ID3 is only suitable for use on small-scale data sets.

3. The decision tree could not be visualized in this experiment

Reason: there is a problem with the installed matplotlib (plt) package.

Origin: blog.csdn.net/Gucciwei/article/details/127867733