Machine learning - Decision tree - C4.5 tree
ID3 has several well-known shortcomings, so in 1993 Quinlan improved it into the C4.5 algorithm. C4.5 solves many of the problems that ID3 runs into and has become one of the top ten machine learning algorithms.
C4.5 does not change the overall logic of ID3: the basic program structure stays the same, but the criterion used to split a node is improved. C4.5 selects features by the information gain ratio (GainRatio) instead of the information gain (Gain), which overcomes the tendency of information gain to favor features with a large number of distinct values.
Information gain ratio:
GainRatio(S,A) = Gain(S,A) / SplitInfo(S,A)
where Gain(S, A) is the information gain used by ID3, and the split information SplitInfo(S, A) measures the breadth and uniformity of splitting the sample set S by feature A:
SplitInfo(S,A) = - Σ (i = 1..c) (|Si| / |S|) * log2(|Si| / |S|)
where S1 to Sc are the c subsets of samples formed by the different values of feature A.
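As a quick illustration of the formula, the following minimal sketch computes the gain ratio of a single feature on a toy dataset. The helper names and the sample values are invented for illustration only; the full tree implementation follows in the Code section.

import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = float(len(labels))
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def gain_ratio(rows, feature_index):
    # rows: list of samples, the last element of each row is the class label
    # assumes the feature has at least two distinct values (otherwise SplitInfo is 0)
    class_labels = [row[-1] for row in rows]
    n = float(len(rows))
    base = entropy(class_labels)
    cond, split_info = 0.0, 0.0
    for v in set(row[feature_index] for row in rows):
        subset = [row for row in rows if row[feature_index] == v]
        p = len(subset) / n
        cond += p * entropy([row[-1] for row in subset])      # conditional entropy
        split_info -= p * math.log(p, 2)                      # SplitInfo(S, A)
    return (base - cond) / split_info                         # GainRatio = Gain / SplitInfo

# toy example: feature 0 takes two values, the class label is the last column
rows = [['0', 'yes'], ['0', 'yes'], ['1', 'no'], ['1', 'yes']]
print(gain_ratio(rows, 0))    # about 0.31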
Code
# C4.5 decision tree: use the information gain ratio to determine the optimal feature
from numpy import *
import math
import copy
import pickle


class C45DTree(object):
    def __init__(self):                 # constructor
        self.tree = {}                  # generated tree
        self.dataSet = []               # dataset
        self.labels = []                # label set

    # data import function
    def loadDataSet(self, path, labels):
        fp = open(path, "r")            # read the file content
        content = fp.read()
        fp.close()
        rowList = content.splitlines()  # split the content into rows
        recordList = [row.split("\t") for row in rowList if row.strip()]
        self.dataSet = recordList
        self.labels = labels

    # build the decision tree
    def train(self):
        labels = copy.deepcopy(self.labels)
        self.tree = self.buildTree(self.dataSet, labels)

    # main routine that creates the decision tree
    def buildTree(self, dataSet, labels):
        cateList = [data[-1] for data in dataSet]   # extract the decision-label column from the dataset
        # termination condition 1: if only one decision label is left, stop splitting and return that label
        if cateList.count(cateList[0]) == len(cateList):
            return cateList[0]
        # termination condition 2: if the dataset only has the decision label left, return the majority label
        if len(dataSet[0]) == 1:
            return self.maxCate(cateList)
        # algorithm core: return the optimal feature axis of the dataset
        bestFeat, featValueList = self.getBestFeat(dataSet)
        bestFeatLabel = labels[bestFeat]
        tree = {bestFeatLabel: {}}
        del(labels[bestFeat])                       # remove the optimal feature from the label list
        for value in featValueList:                 # recursive tree growth
            subLabels = labels[:]                   # label set after deletion, used for the sub-branches
            # partition the dataset by the optimal feature and the current value
            splitDataset = self.splitDataSet(dataSet, bestFeat, value)
            subTree = self.buildTree(splitDataset, subLabels)   # build the subtree
            tree[bestFeatLabel][value] = subTree
        return tree

    # return the class label that appears most often
    def maxCate(self, cateList):
        items = dict([(cateList.count(i), i) for i in cateList])
        return items[max(items.keys())]

    # compute the Shannon entropy of a dataset
    def computeEntropy(self, dataSet):
        dataLen = float(len(dataSet))
        cateList = [data[-1] for data in dataSet]   # get the category labels from the dataset
        # dictionary of category -> number of occurrences
        items = dict([(i, cateList.count(i)) for i in cateList])
        infoEntropy = 0.0                           # initialize the Shannon entropy
        for key in items:                           # Shannon entropy
            prob = float(items[key]) / dataLen
            infoEntropy -= prob * math.log(prob, 2)
        return infoEntropy

    # partition the dataset: keep the rows matching value, delete the feature-axis column, return the rest
    # dataSet: dataset  axis: feature axis  value: value of the feature axis
    def splitDataSet(self, dataSet, axis, value):
        rtnList = []
        for featVec in dataSet:
            if featVec[axis] == value:
                rFeatVec = featVec[:axis]                   # list operation: take elements 0 ~ (axis-1)
                rFeatVec.extend(featVec[axis + 1:])         # list operation: append the elements after the feature axis (column)
                rtnList.append(rFeatVec)
        return rtnList

    # compute the split information (SplitInfo)
    def computeSplitInfo(self, featureVList):
        numEntries = len(featureVList)
        featureValueSetList = list(set(featureVList))
        valueCounts = [featureVList.count(featVec) for featVec in featureValueSetList]
        # Shannon-entropy-style calculation over the distribution of feature values
        pList = [float(item) / numEntries for item in valueCounts]
        lList = [item * math.log(item, 2) for item in pList]
        splitInfo = -sum(lList)
        return splitInfo, featureValueSetList

    # use the information gain ratio to choose the optimal split node
    def getBestFeat(self, dataSet):
        Num_Feats = len(dataSet[0][:-1])
        totality = len(dataSet)
        BaseEntropy = self.computeEntropy(dataSet)
        ConditionEntropy = []                       # initialize the conditional entropies
        splitInfo = []                              # split information, for the gain ratio calculation
        allFeatVList = []
        for f in range(Num_Feats):
            featList = [example[f] for example in dataSet]
            [splitI, featureValueList] = self.computeSplitInfo(featList)
            allFeatVList.append(featureValueList)
            splitInfo.append(splitI)
            resultGain = 0.0
            for value in featureValueList:
                subSet = self.splitDataSet(dataSet, f, value)
                appearNum = float(len(subSet))
                subEntropy = self.computeEntropy(subSet)
                resultGain += (appearNum / totality) * subEntropy
            ConditionEntropy.append(resultGain)     # total conditional entropy for this feature
        infoGainArray = BaseEntropy * ones(Num_Feats) - array(ConditionEntropy)
        infoGainRatio = infoGainArray / array(splitInfo)    # C4.5 information gain ratio
        bestFeatureIndex = argsort(-infoGainRatio)[0]
        return bestFeatureIndex, allFeatVList[bestFeatureIndex]

    # classification
    def predict(self, inputTree, featLabels, testVec):
        root = list(inputTree.keys())[0]            # root node
        secondDict = inputTree[root]                # value: a subtree structure or a classification label
        featIndex = featLabels.index(root)          # position of the root node's label in the label set
        key = testVec[featIndex]                    # value of that feature in the test vector
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict):
            classLabel = self.predict(valueOfFeat, featLabels, testVec)   # recursion
        else:
            classLabel = valueOfFeat
        return classLabel

    # persistence: store the tree
    def storeTree(self, inputTree, filename):
        fw = open(filename, 'wb')
        pickle.dump(inputTree, fw)
        fw.close()

    # read the tree back from the file
    def grabTree(self, filename):
        fr = open(filename, 'rb')
        return pickle.load(fr)


# training
dtree = C45DTree()
dtree.loadDataSet("/Users/FengZhen/Desktop/accumulate/machine learning/tree/tree training set.txt",
                  ["age", "revenue", "student", "credit"])
dtree.train()
print(dtree.tree)

# persistence
dtree.storeTree(dtree.tree, "/Users/FengZhen/Desktop/accumulate/machine learning/tree/treeC45.tree")
mytree = dtree.grabTree("/Users/FengZhen/Desktop/accumulate/machine learning/tree/treeC45.tree")
print(mytree)

# test
labels = ["age", "revenue", "student", "credit"]
vector = ['0', '1', '0', '0']
print(dtree.predict(mytree, labels, vector))
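loadDataSet expects a plain text file with one sample per row, tab-separated feature values, and the class label in the last column. If you do not have such a training file on disk, the short sketch below feeds an in-memory dataset to the same class instead; the sample values and class labels here are invented purely for illustration.

# minimal sketch: bypass loadDataSet and assign an in-memory dataset directly
# (values are invented; the last element of each row is the class label)
demo = C45DTree()
demo.dataSet = [
    ['0', '0', '0', '0', 'N'],
    ['0', '0', '0', '1', 'N'],
    ['1', '0', '0', '0', 'Y'],
    ['2', '1', '0', '0', 'Y'],
    ['2', '2', '1', '0', 'Y'],
    ['2', '2', '1', '1', 'N'],
]
demo.labels = ["age", "revenue", "student", "credit"]
demo.train()
print(demo.tree)
# classify a new sample with the same label order used for training
print(demo.predict(demo.tree, ["age", "revenue", "student", "credit"], ['1', '0', '0', '0']))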