Machine Learning in Action: building, drawing, and applying an ID3 decision tree to predict contact lens type

Statement

        This article follows the code in the book "Machine Learning in Action" (机器学习实战), combining the book's explanations with my own understanding and elaboration.

Machine Learning in Action series blog posts

 

Building a decision tree

Definition of decision tree

        In layman's terms, a decision tree splits the data layer by layer using the values of several features. Like a real tree it branches out, and its leaves are the class labels.

        First of all, we need to explain the criterion for growing the tree (you can look up the concepts of information and information entropy on Baidu if they are unfamiliar). The goal of the decision tree is that, after splitting on some feature, the information entropy of the resulting subsets is as small as possible, in other words the information gain is as large as possible. Information entropy is also called Shannon entropy, and its formula is:

                                H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

        where p(x_i) is the proportion of samples belonging to class x_i.

Computing information entropy

        Count the occurrences of each class label in the data set, turn the counts into probabilities, and then evaluate the formula above.

from math import log


def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # count how many samples carry each class label (the label is the last column)
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # H = -sum(p * log2(p)) over all class labels
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*log(prob,2)
    return shannonEnt
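
        As a quick sanity check, here is the entropy of a small toy data set (my own example; the same five samples appear later in createDataSet()), with two 'yes' and three 'no' labels:

toyData = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
# -(2/5)*log2(2/5) - (3/5)*log2(3/5), about 0.971
print(calcShannonEnt(toyData))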

        Next we need a routine that splits the data set on a given feature: it returns the samples whose value in column axis equals value, with that column removed from each returned sample.

def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        # keep only the samples whose feature `axis` equals `value`,
        # and drop that feature column from the copy we return
        if featVec[axis]==value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
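
        For example, splitting on feature 0 with value 1 keeps the three samples whose first column is 1 and removes that column (reusing toyData from the check above):

print(splitDataSet(toyData, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]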

        We then simply try every possible split: each feature is used to split the data once, the weighted entropy of the resulting subsets is computed, and the feature that yields the lowest entropy, i.e. the highest information gain, becomes the basis for the split.

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0])-1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)    # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # weighted entropy of the subsets produced by splitting on feature i
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob*calcShannonEnt(subDataSet)
        infoGain = baseEntropy-newEntropy
        if(infoGain>bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
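
        On toyData, feature 0 gives an information gain of about 0.42 versus about 0.17 for feature 1, so the function returns index 0:

print(chooseBestFeatureToSplit(toyData))   # 0, i.e. split on the first feature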

        Now consider the case where all features have been used up but the samples on some branch still do not belong to a single class. In that case we fall back on the simplest possible rule, a majority vote, to decide the class of that branch.

import operator


def majorityCnt(classList):
    classCount = {}
    # tally the votes for each class label
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort by count, descending, and return the most common label
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
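
        For instance (illustrative values):

print(majorityCnt(['yes', 'no', 'no']))   # 'no' wins the vote 2 to 1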

Build a decision tree

        With the preparation done, we can now formally build the decision tree.

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    # all samples belong to the same class: stop splitting
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # all features have been used: decide the class by majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    # recursively build one subtree per value of the chosen feature
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(
            dataSet,bestFeat,value),subLabels)
    return myTree

        The recursion may take a moment to digest. Suppose the tree selects feature t and t takes m distinct values: the node then splits into m branches, one per value, and each branch is itself a decision tree built on the remaining data. In this way the whole tree is built recursively.

        Let's verify it on a small example: five animals, each described by two features ('no surfacing' and 'flippers') and labelled 'yes' or 'no' according to whether it is a fish. The data set is built by the following code:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    return dataSet, labels

        Then we build a decision tree on the data set

myData ,labels = createDataSet()
myTree = createTree(myData,labels)
print(myTree)

        The result is: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

        Judged by eye, such a classification is already quite reasonable.

 

Drawing the decision tree

        The drawing is just the tree rendered with annotated wireframe boxes and arrows, with text labels added; the code is given directly below. The original book's code is Python 2, and I have converted it to Python 3 here:

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict is an internal node, anything else is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict is an internal node, anything else is a leaf
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)


def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)


def plotTree(myTree, parentPt, nodeTxt):  # if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  # this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict is an internal node, anything else is a leaf
            plotTree(secondDict[key], cntrPt, str(key))  # recursion
        else:  # it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


# if you do get a dictonary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks shown, for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

        This drawing code does not involve many algorithms, so I won't explain it further; to use it, just call createPlot directly, and if you are interested you can read through the code. For example, the decision tree we computed in the previous section can be drawn like this:
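
        A minimal call, assuming the plotting code above and the myTree dictionary built earlier are in the same module:

createPlot(myTree)   # opens a window showing the 'no surfacing' / 'flippers' tree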

        The result is a bit rough; in practice there are many Python GUI and plotting libraries that can draw trees, so I won't say more about it here.

 

Example: Predicting the type of contact lens

Problem Description

        The contact lens data set is a very well-known data set. It contains observations of many patients' eye conditions together with the type of contact lens recommended by the doctor; the lens types are hard, soft, and "no lenses" (not suited to wearing contact lenses). The data comes from the UCI repository, and the book makes small changes to it so that it is easier to display.

        The task: from the first four attributes (age, prescription, astigmatic, tear rate), determine which of the three classes (hard, soft, no lenses) applies.

Decision tree classification

# load the data file
fr = open("lenses.txt")
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
# feature labels
lensesLabels = ['age','prescript','astigmatic','tearRate']
# build the decision tree
lensesTree = createTree(lenses,lensesLabels)

This gives the following tree:

{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'presbyopic': 'no lenses', 'pre': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'pre': 'soft', 'young': 'soft'}}}}}}

        This nested dictionary is hard to read, so we draw it with createPlot(lensesTree):

        With the tree in hand, whenever we get a new piece of data we can walk down the tree, testing the feature at each node, until we reach a leaf that tells us which class the data belongs to.
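
        As a minimal sketch of such a tree walk (the classify helper and the sample patient below are illustrative, not part of the code above):

def classify(inputTree, featLabels, testVec):
    # the single key at this level is the feature tested at this node
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                # still an internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # reached a leaf: this is the predicted class
                classLabel = secondDict[key]
    return classLabel

# hypothetical patient: pre-presbyopic, myope, not astigmatic, normal tear rate
print(classify(lensesTree, ['age', 'prescript', 'astigmatic', 'tearRate'],
               ['pre', 'myope', 'no', 'normal']))   # -> 'soft'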
