The full source code is on GitHub: https://github.com/fansking/Machine/blob/master/Machine/trees.py
An object has many attributes, but no single attribute by itself usually determines what the object is. For example, some cats have fur and some are hairless, so fur alone cannot decide whether something is a cat. Given many attributes, how do we choose the one (or several) that are most decisive? This is where entropy and information gain come in.
When an attribute agrees closely with the classification, it separates the classes well. For example, when classifying fish, consider adding the attribute "can it live underwater?": most fish can, and most non-fish cannot, so this attribute separates fish from non-fish cleanly and yields a large information gain. So how do we find such an attribute?
Information gain is (the entropy of the whole data set) minus (the expected entropy after splitting on the attribute). The larger this value, the greater that attribute's influence on classification accuracy.
Shannon entropy is calculated as H = -Σᵢ p(xᵢ) log₂ p(xᵢ), where p(xᵢ) is the frequency with which class xᵢ appears.
The decision tree is then constructed top-down, splitting on features in descending order of their information gain.
The functions are described below:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    #change to discrete values
    return dataSet, labels
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  #count occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  #log base 2
    return shannonEnt
The first function builds the sample data set: the first two columns of each row are feature values, corresponding to the two names in labels, and the last column is the class label. The second function computes the Shannon entropy of a data set.
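As a quick sanity check, the entropy of the sample data's class column (['yes', 'yes', 'no', 'no', 'no']) can be computed directly from the formula; this standalone sketch should match what calcShannonEnt returns:

```python
from collections import Counter
from math import log

classColumn = ['yes', 'yes', 'no', 'no', 'no']  # last column of the sample data
counts = Counter(classColumn)                   # Counter({'no': 3, 'yes': 2})
n = len(classColumn)
# H = -sum over classes of p * log2(p)
entropy = -sum((c / n) * log(c / n, 2) for c in counts.values())
print(round(entropy, 4))  # 0.971
```

With two classes split 2/5 and 3/5, the entropy is close to its maximum of 1 bit, which is why there is room for a split to reduce it.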
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           #iterate over all the features
        featList = [example[i] for example in dataSet]  #all values of this feature
        uniqueVals = set(featList)         #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            # subDataSet holds the rows matching this value, with column i removed;
            # its length gives the weighting coefficient for this branch
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    #info gain, i.e. reduction in entropy
        if (infoGain > bestInfoGain):          #compare this to the best gain so far
            bestInfoGain = infoGain            #if better than current best, set to best
            bestFeature = i
    return bestFeature                     #returns the index of the best feature
The first function takes a data set, keeps only the rows whose value at the given column index matches value, removes that column, and returns the resulting subset.
The second function does what was described at the start: it iterates over all feature columns, computes the information gain from splitting on each one, and returns the index of the feature with the largest gain, i.e. the feature whose removal changes the entropy the most.
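To make the weighting concrete, here is a standalone sketch (the helper name `entropy` is illustrative, not from the source) that reproduces the information-gain computation for the sample data by hand:

```python
from collections import Counter
from math import log

def entropy(rows):
    # Shannon entropy of the class labels in the last column
    n = len(rows)
    return -sum((c / n) * log(c / n, 2)
                for c in Counter(r[-1] for r in rows).values())

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]
base = entropy(dataSet)

gains = {}
for i in (0, 1):  # each feature column
    newEntropy = 0.0
    for value in set(row[i] for row in dataSet):
        # same effect as splitDataSet(dataSet, i, value):
        # keep the matching rows and drop column i
        subset = [row[:i] + row[i + 1:] for row in dataSet if row[i] == value]
        newEntropy += len(subset) / len(dataSet) * entropy(subset)
    gains[i] = base - newEntropy

print(gains)  # feature 0 gains ~0.420, feature 1 only ~0.171
```

Feature 0 ('no surfacing') has the larger gain, so chooseBestFeatureToSplit returns 0 for this data.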
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    #iteritems() is Python 2 only; items() works on Python 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
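For reference, the same majority vote can be written more compactly with collections.Counter (a sketch with equivalent behavior for the common case; ties may be broken differently):

```python
from collections import Counter

def majorityCnt(classList):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(classList).most_common(1)[0][0]

print(majorityCnt(['no', 'no', 'yes']))  # 'no'
```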
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        # all remaining examples share one class: this node is a leaf,
        # so stop recursing and return that class
        return classList[0]
    if len(dataSet[0]) == 1:
        # no features left: this must be a leaf, so take a majority
        # vote and return whichever label occurs most often
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Python passes lists by reference, so assignment alone would let the
        # recursion mutate our labels; copy them for each subtree instead
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)  #recurse to build the next level
    return myTree
The first function is a simple majority vote and needs little explanation.
The second recursively builds the tree as a nested dictionary, at each level splitting on the feature with the highest information gain.
Running createTree on the sample data yields the nested dictionary {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}: the tree splits first on 'no surfacing', then on 'flippers'.
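Putting the pieces together, a condensed standalone reimplementation of the same logic (a sketch, not the author's exact code) builds the full tree for the sample data:

```python
from collections import Counter
from math import log

def entropy(rows):
    # Shannon entropy of the class labels in the last column
    n = len(rows)
    return -sum((c / n) * log(c / n, 2)
                for c in Counter(r[-1] for r in rows).values())

def split(rows, axis, value):
    # keep rows matching value at column axis, dropping that column
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]

def best_feature(rows):
    base = entropy(rows)
    best_gain, best = 0.0, -1
    for i in range(len(rows[0]) - 1):
        newEnt = sum(len(sub) / len(rows) * entropy(sub)
                     for sub in (split(rows, i, v)
                                 for v in set(r[i] for r in rows)))
        if base - newEnt > best_gain:
            best_gain, best = base - newEnt, i
    return best

def build_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                             # pure node: leaf
    if len(rows[0]) == 1:
        return Counter(classes).most_common(1)[0][0]  # majority vote
    i = best_feature(rows)
    sub_labels = labels[:i] + labels[i + 1:]
    return {labels[i]: {v: build_tree(split(rows, i, v), sub_labels)
                        for v in set(r[i] for r in rows)}}

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]
print(build_tree(dataSet, ['no surfacing', 'flippers']))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

Note that this version copies the label list with slicing instead of deleting in place, so the caller's labels are never mutated.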