Machine learning in practice: decision trees (Python implementation)

The source code is on GitHub: https://github.com/fansking/Machine/blob/master/Machine/trees.py
A thing has many attributes, but most individual attributes cannot by themselves determine what the thing is. For example, some cats have fur and some cats are hairless, so having fur or not cannot decide whether something is a cat. Given so many attributes, how do we choose the one or several attributes that are actually decisive? This is where the concepts of entropy and information gain come in.
When an attribute lines up closely with a category, it separates the data well. For example, add the attribute "can it live in the water?": for most fish the answer is yes, while for most things that are not fish the answer is no, so this attribute does a good job of separating fish from everything else and brings a large information gain. How do we find such an attribute?
We compute (the expected information of the whole data set) minus (the expected information remaining after splitting on an attribute). The larger this difference, the greater that attribute's influence on the accuracy of the classification.
Shannon entropy is calculated as follows, where p(x_i) is the frequency with which the i-th class appears in the data set:

H = -Σ p(x_i) · log2 p(x_i)   (summed over all classes i)
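As a quick hand check of the formula, take a data set with two 'yes' samples and three 'no' samples, like the one used below:

from math import log
# entropy of a data set containing 2 'yes' and 3 'no' labels
H = -0.4 * log(0.4, 2) - 0.6 * log(0.6, 2)
print(H)   # about 0.9710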
A decision tree is then constructed by arranging these decisive attributes from top to bottom, in descending order of information gain.
Below, the functions are described one by one:

from math import log    # calcShannonEnt below needs the base-2 logarithm

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #count the number of unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

The first function builds the data set we will work with: in each row, the first two columns hold the values of the two features named in labels, and the last column is the row's class. The second function computes the Shannon entropy of a given data set.
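As a minimal sanity check (assuming both functions are defined in the same session), the sample data gives the same value as the hand calculation above, since it has two 'yes' rows and three 'no' rows:

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))   # about 0.9710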

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
    
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)    #subDataSet holds the rows with this value, minus the current column; its length gives the weight (probability) of this branch
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

The first of these functions takes a data set, keeps only the rows whose value in the given column equals value, removes that column, and returns the resulting subset.
The second function does what was described at the beginning: it iterates over every feature column, computes the entropy expected after splitting on that column, and looks for the column whose removal changes the entropy the most. It returns the index of that feature, i.e. the one with the largest information gain.
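A minimal sketch of how these two functions behave on the sample data (assuming the earlier functions are defined in the same session):

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))        # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))        # [[1, 'no'], [1, 'no']]
print(chooseBestFeatureToSplit(myDat))  # 0: splitting on 'no surfacing' gives the largest information gain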

import operator    # for itemgetter in the sort below

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) #items() instead of iteritems() so this also runs under Python 3
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): #if every remaining sample has the same class, this is already a leaf node and there is no need to recurse further
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #no features are left, so this must be a leaf node; use majority voting and return whichever class label occurs most often
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #Python passes lists by reference, so plain assignment would let the recursion modify our list; copy it for the sub-call
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)#recurse to build the next level of the tree
    return myTree           

The first function is simple majority voting and needs little explanation.
The second function recursively builds the tree as a nested dictionary, at each level adding the feature that has the greatest influence on the remaining data.
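For instance, a small hypothetical call to the voting function:

print(majorityCnt(['no', 'yes', 'no', 'no', 'yes']))   # prints 'no', the most common label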
The results obtained are as follows:
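A minimal run on the sample data (a sketch assuming all of the functions above are defined together, for example in the trees.py from the linked repository) should produce the nested dictionary shown in the last comment. Note that createTree deletes entries from the labels list it is given, so pass it a copy if the list is still needed afterwards:

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])   # pass a copy, because createTree modifies the label list
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}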

Origin blog.csdn.net/weixin_40631132/article/details/89007115