Machine learning - Decision tree - C4.5 tree
ID3 has several well-known shortcomings, so in 1993 Quinlan improved it into the C4.5 algorithm. C4.5 solves many of the problems that ID3 runs into and has become one of the top ten machine learning algorithms.
C4.5 does not change the overall logic of ID3: the basic program structure stays the same, but the criterion used to split a node is improved. C4.5 selects features by the information gain ratio (GainRatio) instead of the information gain (Gain), which overcomes the tendency of information gain to favor features with a large number of distinct values.
Information gain ratio:
GainRatio(S,A) = Gain(S,A) / SplitInfo(S,A)
where Gain(S, A) is the information gain used by ID3, and the split information SplitInfo(S, A) measures the breadth and uniformity of splitting the sample set S by feature A:
SplitInfo(S,A) = - Σ (i = 1..c) (|Si| / |S|) * log2(|Si| / |S|)
where S1 to Sc are the c subsets of samples formed by the different values of feature A.
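As a quick illustration of the formula, the following minimal sketch computes the gain ratio of a single feature on a toy dataset. The helper names and the sample values are invented for illustration only; the full tree implementation follows in the Code section.

import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = float(len(labels))
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def gain_ratio(rows, feature_index):
    # rows: list of samples, the last element of each row is the class label
    # assumes the feature has at least two distinct values (otherwise SplitInfo is 0)
    class_labels = [row[-1] for row in rows]
    n = float(len(rows))
    base = entropy(class_labels)
    cond, split_info = 0.0, 0.0
    for v in set(row[feature_index] for row in rows):
        subset = [row for row in rows if row[feature_index] == v]
        p = len(subset) / n
        cond += p * entropy([row[-1] for row in subset])      # conditional entropy
        split_info -= p * math.log(p, 2)                      # SplitInfo(S, A)
    return (base - cond) / split_info                         # GainRatio = Gain / SplitInfo

# toy example: feature 0 takes two values, the class label is the last column
rows = [['0', 'yes'], ['0', 'yes'], ['1', 'no'], ['1', 'yes']]
print(gain_ratio(rows, 0))    # about 0.31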
Code
# C4.5 decision tree: use the information gain ratio to determine the optimal feature
from numpy import *
import math
import copy
import pickle


class C45DTree(object):
    def __init__(self):                 # constructor
        self.tree = {}                  # generated tree
        self.dataSet = []               # dataset
        self.labels = []                # label set

    # data import function
    def loadDataSet(self, path, labels):
        fp = open(path, "r")            # read the file content
        content = fp.read()
        fp.close()
        rowList = content.splitlines()  # split the content into rows
        recordList = [row.split("\t") for row in rowList if row.strip()]
        self.dataSet = recordList
        self.labels = labels

    # build the decision tree
    def train(self):
        labels = copy.deepcopy(self.labels)
        self.tree = self.buildTree(self.dataSet, labels)

    # main routine that creates the decision tree
    def buildTree(self, dataSet, labels):
        cateList = [data[-1] for data in dataSet]   # extract the decision-label column from the dataset
        # termination condition 1: if only one decision label is left, stop splitting and return that label
        if cateList.count(cateList[0]) == len(cateList):
            return cateList[0]
        # termination condition 2: if the dataset only has the decision label left, return the majority label
        if len(dataSet[0]) == 1:
            return self.maxCate(cateList)
        # algorithm core: return the optimal feature axis of the dataset
        bestFeat, featValueList = self.getBestFeat(dataSet)
        bestFeatLabel = labels[bestFeat]
        tree = {bestFeatLabel: {}}
        del(labels[bestFeat])                       # remove the optimal feature from the label list
        for value in featValueList:                 # recursive tree growth
            subLabels = labels[:]                   # label set after deletion, used for the sub-branches
            # partition the dataset by the optimal feature and the current value
            splitDataset = self.splitDataSet(dataSet, bestFeat, value)
            subTree = self.buildTree(splitDataset, subLabels)   # build the subtree
            tree[bestFeatLabel][value] = subTree
        return tree

    # return the class label that appears most often
    def maxCate(self, cateList):
        items = dict([(cateList.count(i), i) for i in cateList])
        return items[max(items.keys())]

    # compute the Shannon entropy of a dataset
    def computeEntropy(self, dataSet):
        dataLen = float(len(dataSet))
        cateList = [data[-1] for data in dataSet]   # get the category labels from the dataset
        # dictionary of category -> number of occurrences
        items = dict([(i, cateList.count(i)) for i in cateList])
        infoEntropy = 0.0                           # initialize the Shannon entropy
        for key in items:                           # Shannon entropy
            prob = float(items[key]) / dataLen
            infoEntropy -= prob * math.log(prob, 2)
        return infoEntropy

    # partition the dataset: keep the rows matching value, delete the feature-axis column, return the rest
    # dataSet: dataset  axis: feature axis  value: value of the feature axis
    def splitDataSet(self, dataSet, axis, value):
        rtnList = []
        for featVec in dataSet:
            if featVec[axis] == value:
                rFeatVec = featVec[:axis]                   # list operation: take elements 0 ~ (axis-1)
                rFeatVec.extend(featVec[axis + 1:])         # list operation: append the elements after the feature axis (column)
                rtnList.append(rFeatVec)
        return rtnList

    # compute the split information (SplitInfo)
    def computeSplitInfo(self, featureVList):
        numEntries = len(featureVList)
        featureValueSetList = list(set(featureVList))
        valueCounts = [featureVList.count(featVec) for featVec in featureValueSetList]
        # Shannon-entropy-style calculation over the distribution of feature values
        pList = [float(item) / numEntries for item in valueCounts]
        lList = [item * math.log(item, 2) for item in pList]
        splitInfo = -sum(lList)
        return splitInfo, featureValueSetList

    # use the information gain ratio to choose the optimal split node
    def getBestFeat(self, dataSet):
        Num_Feats = len(dataSet[0][:-1])
        totality = len(dataSet)
        BaseEntropy = self.computeEntropy(dataSet)
        ConditionEntropy = []                       # initialize the conditional entropies
        splitInfo = []                              # split information, for the gain ratio calculation
        allFeatVList = []
        for f in range(Num_Feats):
            featList = [example[f] for example in dataSet]
            [splitI, featureValueList] = self.computeSplitInfo(featList)
            allFeatVList.append(featureValueList)
            splitInfo.append(splitI)
            resultGain = 0.0
            for value in featureValueList:
                subSet = self.splitDataSet(dataSet, f, value)
                appearNum = float(len(subSet))
                subEntropy = self.computeEntropy(subSet)
                resultGain += (appearNum / totality) * subEntropy
            ConditionEntropy.append(resultGain)     # total conditional entropy for this feature
        infoGainArray = BaseEntropy * ones(Num_Feats) - array(ConditionEntropy)
        infoGainRatio = infoGainArray / array(splitInfo)    # C4.5 information gain ratio
        bestFeatureIndex = argsort(-infoGainRatio)[0]
        return bestFeatureIndex, allFeatVList[bestFeatureIndex]

    # classification
    def predict(self, inputTree, featLabels, testVec):
        root = list(inputTree.keys())[0]            # root node
        secondDict = inputTree[root]                # value: a subtree structure or a classification label
        featIndex = featLabels.index(root)          # position of the root node's label in the label set
        key = testVec[featIndex]                    # value of that feature in the test vector
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict):
            classLabel = self.predict(valueOfFeat, featLabels, testVec)   # recursion
        else:
            classLabel = valueOfFeat
        return classLabel

    # persistence: store the tree
    def storeTree(self, inputTree, filename):
        fw = open(filename, 'wb')
        pickle.dump(inputTree, fw)
        fw.close()

    # read the tree back from the file
    def grabTree(self, filename):
        fr = open(filename, 'rb')
        return pickle.load(fr)


# training
dtree = C45DTree()
dtree.loadDataSet("/Users/FengZhen/Desktop/accumulate/machine learning/tree/tree training set.txt",
                  ["age", "revenue", "student", "credit"])
dtree.train()
print(dtree.tree)

# persistence
dtree.storeTree(dtree.tree, "/Users/FengZhen/Desktop/accumulate/machine learning/tree/treeC45.tree")
mytree = dtree.grabTree("/Users/FengZhen/Desktop/accumulate/machine learning/tree/treeC45.tree")
print(mytree)

# test
labels = ["age", "revenue", "student", "credit"]
vector = ['0', '1', '0', '0']
print(dtree.predict(mytree, labels, vector))
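loadDataSet expects a plain text file with one sample per row, tab-separated feature values, and the class label in the last column. If you do not have such a training file on disk, the short sketch below feeds an in-memory dataset to the same class instead; the sample values and class labels here are invented purely for illustration.

# minimal sketch: bypass loadDataSet and assign an in-memory dataset directly
# (values are invented; the last element of each row is the class label)
demo = C45DTree()
demo.dataSet = [
    ['0', '0', '0', '0', 'N'],
    ['0', '0', '0', '1', 'N'],
    ['1', '0', '0', '0', 'Y'],
    ['2', '1', '0', '0', 'Y'],
    ['2', '2', '1', '0', 'Y'],
    ['2', '2', '1', '1', 'N'],
]
demo.labels = ["age", "revenue", "student", "credit"]
demo.train()
print(demo.tree)
# classify a new sample with the same label order used for training
print(demo.predict(demo.tree, ["age", "revenue", "student", "credit"], ['1', '0', '0', '0']))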