Introduction to Machine Learning - Decision Tree (1)

In the previous article I covered the kNN algorithm, which can handle many classification tasks, but its biggest drawback is that it cannot reveal the underlying meaning of the data. Decision trees, by contrast, have the following advantages:

  • The learned model is stored in a form that is very easy to understand
  • The trained classifier can be persisted and reused, whereas kNN has to process the full training set every time it classifies

Decision trees are another common family of machine learning algorithms; the goal is to learn, from a given training dataset, a model that can classify new examples.
The mainstream decision tree algorithms include ID3, C4.5 and CART.
Before discussing how each algorithm chooses the attribute to split the sample set on, let's introduce a few definitions:

Information entropy

  • Information entropy is the most commonly used measure of the purity of a sample set. Assuming the proportion of samples of the k-th class in the current sample set D is pk, the information entropy of D is defined as:
    Ent(D) = -∑ pk * log2(pk)   (k = 1, 2, ..., n, where n is the number of classes)
  • When computing information entropy, it is agreed that if p = 0 then p * log2(p) = 0
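  • For example, if D contains 2 positive and 3 negative samples, Ent(D) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971, while a set containing only a single class has Ent(D) = 0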

Information gain

  • The change in information before and after splitting the dataset is called the information gain
  • Assume a discrete attribute a has V possible values {a1, a2, ..., aV}. If a is used to split the sample set D, V branch nodes are produced; the v-th branch node contains all the samples in D whose value on attribute a is av, denoted Dv. Giving each branch node the weight |Dv|/|D|, the "information gain" obtained by splitting D on attribute a can be computed as
    Gain(D, a) = Ent(D) - ∑ (|Dv|/|D|) * Ent(Dv)   (v = 1, 2, ..., V)
  • Generally speaking, the greater the information gain, the greater the improvement in purity obtained by splitting on attribute a
  • Information gain is biased toward attributes with a larger number of possible values
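  • For example, if D contains 2 positive and 3 negative samples (Ent(D) ≈ 0.971) and a binary attribute a splits it into D1 = {2 positive, 1 negative} and D2 = {2 negative}, then Gain(D, a) = 0.971 - (3/5)*0.918 - (2/5)*0 ≈ 0.420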

Gain ratio

  • Gain_ratio(D, a) = Gain(D, a) / IV(a)
  • where IV(a) = -∑ (|Dv|/|D|) * log2(|Dv|/|D|)   (v = 1, 2, ..., V) is called the "intrinsic value" of a
  • The gain ratio is biased toward attributes with fewer possible values; a minimal sketch of this criterion follows
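
Below is a minimal sketch of how the gain-ratio criterion could be computed for the dataSet format used later in this article (each record is a list of feature values with the class label in the last column). It only illustrates the formulas above and is not part of the ID3 code that follows.

from collections import Counter
from math import log

def entropy(dataSet):
    # Ent(D) = -sum(pk * log2(pk)) over the class proportions pk
    counts = Counter(row[-1] for row in dataSet)
    n = float(len(dataSet))
    return -sum(c / n * log(c / n, 2) for c in counts.values())

def gainRatio(dataSet, axis):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a) for the feature at index `axis`
    n = float(len(dataSet))
    baseEnt = entropy(dataSet)
    newEnt, iv = 0.0, 0.0
    for value in set(row[axis] for row in dataSet):
        subset = [row for row in dataSet if row[axis] == value]
        weight = len(subset) / n
        newEnt += weight * entropy(subset)   # weighted entropy after the split
        iv -= weight * log(weight, 2)        # intrinsic value IV(a)
    return (baseEnt - newEnt) / iv if iv > 0 else 0.0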

Gini

  • The purity of dataset D can also be measured by the Gini value:
    Gini(D) = 1 - ∑ pk^2   (k = 1, 2, ..., n)
    Intuitively, Gini(D) reflects the probability that two samples drawn at random from D have different class labels; therefore, the smaller Gini(D) is, the higher the purity of dataset D
  • The Gini index of attribute a is defined as
    Gini_index(D, a) = ∑ (|Dv|/|D|) * Gini(Dv)   (v = 1, 2, ..., V)
    Generally, the attribute with the smallest Gini index after splitting is selected as the optimal splitting attribute; a minimal sketch follows
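
And a similar minimal sketch of the Gini-index criterion used by CART, again only as an illustration under the same assumed dataSet format:

from collections import Counter

def gini(dataSet):
    # Gini(D) = 1 - sum(pk^2) over the class proportions pk
    counts = Counter(row[-1] for row in dataSet)
    n = float(len(dataSet))
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def giniIndex(dataSet, axis):
    # Gini_index(D, a) = sum(|Dv|/|D| * Gini(Dv)) for the feature at index `axis`
    n = float(len(dataSet))
    total = 0.0
    for value in set(row[axis] for row in dataSet):
        subset = [row for row in dataSet if row[axis] == value]
        total += len(subset) / n * gini(subset)
    return total   # the attribute with the smallest value is preferred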

So the splitting criteria of the three decision tree algorithms mentioned above are:
ID3 – information gain, C4.5 – gain ratio, CART – Gini index
In practice, decision trees also involve pruning, handling of continuous values, handling of missing values, multivariate decision trees and other topics; those will be covered in Decision Tree (2) and (3). The focus of today's hands-on part is ID3. The source code and test samples are on github.
github link address

Calculate information entropy

# imports used by the functions in this article
from math import log
import operator

def calcShannonEnt(dataSet):
    numEntries=len(dataSet)
    labelCounts={}
    # count how many samples belong to each class (the last column is the class label)
    for featVec in dataSet:
        currentLabel=featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    # Ent(D) = -sum(pk * log2(pk))
    shannonEnt=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        shannonEnt-=prob*log(prob,2)
    return shannonEnt

Note: dataSet is a list of records like the following, where each record contains the feature values followed by the class label in the last column:

dataSet = [[1, 1, 'yes'],
           [1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]
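
With 2 'yes' and 3 'no' labels, calcShannonEnt(dataSet) should return roughly 0.971, matching the hand calculation in the information entropy section above.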

Split the dataset on a given feature value

# axis is the index of the feature chosen for the split; value is the value of that feature kept in the returned subset
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
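
For example, with the toy dataSet above, splitDataSet(dataSet, 0, 1) returns [[1, 'yes'], [1, 'yes'], [0, 'no']]: the three records whose first feature equals 1, with that feature column removed.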

Choose the best partitioning attribute

def chooseBestFeatureToSplit(dataSet):
    # number of feature attributes in the current dataset (the last column is the class label)
    numFeatures=len(dataSet[0])-1

    # entropy of the whole dataset before any split
    baseEntropy=calcShannonEnt(dataSet)

    bestInfoGain=0.0
    bestFeature=-1

    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals=set(featList)
        newEntropy=0.0
        for value in uniqueVals:
            subDataset=splitDataSet(dataSet,i,value)
            prob=len(subDataset)/float(len(dataSet))
            newEntropy+=prob*calcShannonEnt(subDataset)
        infoGain=baseEntropy-newEntropy
        if(infoGain>bestInfoGain):
            bestInfoGain=infoGain
            bestFeature=i
    return bestFeature
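
For the toy dataSet above, the information gain of feature 0 is about 0.420 and that of feature 1 about 0.171, so chooseBestFeatureToSplit(dataSet) returns 0.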

Given a list, return the element with the most occurrences

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
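
For example, majorityCnt(['yes', 'no', 'no']) returns 'no'. createTree below falls back on this when it has run out of features but the remaining samples still carry mixed class labels.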

Recursively construct decision tree (represented in dictionary form)

def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    # stop splitting when all samples in this subset belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # when all features have been used up, return the most common class
    if len(dataSet[0])==1:
        return majorityCnt(classList)

    bestFeat=chooseBestFeatureToSplit(dataSet)
    bestFeatLabel=labels[bestFeat]
    mytree={bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLabels=labels[:]
        mytree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return mytree
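
For the toy dataSet above with labels ['no surfacing', 'flippers'], createTree produces the nested dictionary {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}. Note that createTree deletes the chosen feature from the labels list it is given, so pass in a copy if you still need the full list afterwards (for example for classify below).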

If the plain dictionary form is not intuitive enough, you can also draw the tree with matplotlib:

# coding=utf-8
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
from math import log
import operator

decisionNode=dict(boxstyle="sawtooth",fc="0.8")
leafNode=dict(boxstyle="round4",fc="0.8")
arrow_args=dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )

# def createPlot():
#     fig = plt.figure(1, facecolor='white')
#     fig.clf()
#     createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
#     plotNode('decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
#     plotNode('leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
#     plt.show()

def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.xOff=-0.5/plotTree.totalW
    plotTree.yOff=1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

def getNumLeafs(myTree):
    numLeafs=0
    firstStr=next(iter(myTree))
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else:
            numLeafs+=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth=0
    firstStr=next(iter(myTree))
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else:
            thisDepth=1
        if thisDepth>maxDepth:
            maxDepth=thisDepth
    return maxDepth

def plotMidText(cntrPt,parentPt,txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]                               
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)

def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    depth=getTreeDepth(myTree)
    firstStr=next(iter(myTree))
    cntrPt=(plotTree.xOff+(1.0+float(numLeafs))/2.0/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    plotTree.yOff=plotTree.yOff-1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff=plotTree.xOff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff=plotTree.yOff+1.0/plotTree.totalD
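
A quick usage sketch, assuming the tree built from the toy dataSet earlier:

mytree = createTree(dataSet, ['no surfacing', 'flippers'])
createPlot(mytree)   # opens a matplotlib window showing the annotated tree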

How to classify with the decision tree

# recursively traverse the tree from the root down to a leaf
def classify(inputTree,featlabels,testVec):
    firstStr=next(iter(inputTree))
    secondDict=inputTree[firstStr]
    featIndex=featlabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featlabels,testVec)
            else:
                classLabel=secondDict[key]
    return classLabel

Notes

featlabels has the form: ['no surfacing', 'flippers']
testVec has the form: [0, 1]
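
A minimal usage sketch, assuming the toy dataSet from above (a copy of the labels is passed to createTree so that featLabels stays intact for classify):

featLabels = ['no surfacing', 'flippers']
mytree = createTree(dataSet, featLabels[:])   # build on a copy, since createTree modifies the list
print(classify(mytree, featLabels, [1, 1]))   # expected output: 'yes'
print(classify(mytree, featLabels, [1, 0]))   # expected output: 'no'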

Classifying a real dataset stored in a txt file

def lensesClassify():
    # tp below is assumed to be the plotting module shown above (e.g. imported as `import treePlotter as tp`)
    fr=open('lenses.txt','r')
    lenses=[inst.strip().split('\t') for inst in fr.readlines()]
    fr.close()
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    lensesTree=createTree(lenses,lensesLabels)
    tp.createPlot(lensesTree)
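
Each line of lenses.txt is expected to hold four tab-separated feature values (age, prescript, astigmatic, tearRate) followed by the class label in the last column, matching the dataSet format used throughout this article.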
