Statement
This article is based on the code in the book "Machine Learning in Action", combined with the book's explanations plus my own understanding and elaboration.
Machine Learning in Action series posts
- Machine Learning in Action: k-nearest neighbors to improve the matching on a dating site
- Machine Learning in Action: building and plotting decision trees, with an example predicting contact lens types
- Machine Learning in Action: the naive Bayes algorithm, with an example using a Bayesian classifier to filter spam
- Machine Learning in Action: logistic regression, with an example predicting the mortality of horses with colic
Building a decision tree
Definition of a decision tree
In plain terms, a decision tree separates the data layer by layer using feature values; like a tree, its leaves are the class labels.
First we need the criterion used to build the tree (you can look up the concepts of information and information entropy on Baidu). The goal of the decision tree is to make the entropy as small as possible after the data is split on a chosen feature. Information entropy is also called Shannon entropy; for a dataset whose class labels occur with probabilities p(xᵢ), the formula is:

H = -∑ᵢ p(xᵢ) · log₂ p(xᵢ)
Computing information entropy
Count the occurrences of each class label in the dataset, convert the counts to probabilities, and plug them into the formula above.
from math import log

def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
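As a quick sanity check, the entropy can also be computed by hand. This sketch works out the value for a label set with 2 'yes' and 3 'no' samples (the same distribution as the fish dataset used later), which is what calcShannonEnt should return for it:

```python
from math import log

# Label distribution: 2 'yes' and 3 'no' out of 5 samples
p_yes, p_no = 2 / 5, 3 / 5
entropy = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
print(round(entropy, 4))  # 0.971
```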
Next, we need a function that splits the dataset on a given feature: it returns the subset of samples whose value for that feature equals a given value, with that feature column removed.
def splitDataSet(dataSet, axis, value):
    """Return the samples whose feature `axis` equals `value`, with that column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])  # drop the column we split on
            retDataSet.append(reducedFeatVec)
    return retDataSet
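A small usage sketch (the function is repeated so the snippet runs standalone; the data is the fish example introduced later). Splitting on feature 0 with value 1 keeps the three matching rows and drops that column:

```python
def splitDataSet(dataSet, axis, value):
    """Return the samples whose feature `axis` equals `value`, with that column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(splitDataSet(dataSet, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
```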
We find the best split by exhaustive search: split the dataset once on each feature, compute the weighted entropy of the resulting subsets, and pick the feature that yields the smallest post-split entropy, i.e. the largest information gain.
def chooseBestFeatureToSplit(dataSet):
    """Pick the feature whose split yields the largest information gain."""
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
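To see why this works on the fish dataset used later, the two information gains can be worked out by hand. This sketch recomputes the weighted entropies directly rather than calling the functions above (the subset counts are read off the five-sample dataset):

```python
from math import log

def H(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * log(p, 2) for p in probs if p > 0)

base = H([2/5, 3/5])  # entropy before any split (2 yes, 3 no)

# Splitting on feature 0: value 1 -> {2 yes, 1 no}, value 0 -> {2 no}
gain0 = base - (3/5) * H([2/3, 1/3]) - (2/5) * H([1.0])
# Splitting on feature 1: value 1 -> {2 yes, 2 no}, value 0 -> {1 no}
gain1 = base - (4/5) * H([2/4, 2/4]) - (1/5) * H([1.0])

print(round(gain0, 2), round(gain1, 3))  # 0.42 0.171
```

Feature 0 gives the larger gain, which is why chooseBestFeatureToSplit returns 0 for this dataset.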
Now consider the case where we have used up all the features but the samples on some branch still do not all belong to the same class. Then we fall back on the simplest possible rule, majority vote, to decide the label for that branch.
import operator

def majorityCnt(classList):
    """Return the most common class label in classList (majority vote)."""
    classCount = {}
    for vote in classList:  # note: iterate over classList, not the (empty) classCount
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
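A quick standalone check of the majority vote (the loop must iterate over classList; iterating over the empty classCount dict would make the function fail):

```python
import operator

def majorityCnt(classList):
    """Return the most common class label in classList (majority vote)."""
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['yes', 'no', 'no']))  # no
```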
Building the decision tree
With the preparation done, we can now formally build the decision tree.
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # Stop if all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # If all features are used up, decide the class by majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    # Recursively build a subtree for each value of the best feature
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
The recursion may take a moment to digest. Suppose the tree selects feature t, and t takes m distinct values; the tree then branches m ways, one branch per value, and each branch is itself a decision tree built on the remaining data. This is how the whole tree is constructed recursively.
Let's verify it on the book's toy example: five animals described by two features ('no surfacing': can it survive without coming to the surface, and 'flippers': does it have flippers), each labeled yes/no for whether it is a fish. The data is prepared by the following code:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
Then we build a decision tree on the data set
myData ,labels = createDataSet()
myTree = createTree(myData,labels)
print(myTree)
The result is: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
Judged by eye, this classification is entirely reasonable.
Drawing the decision tree
The drawing is just the tree rendered with annotated boxes and arrows in Matplotlib. The code in the original book is Python 2; I have converted it to Python 3 here, as follows:
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    """Count the leaf nodes; this determines the width of the plot."""
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # a dict is a subtree, otherwise a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    """Compute the tree depth; this determines the height of the plot."""
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    """Label the branch between parent and child with the feature value."""
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)  # determines the x width of this subtree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]  # the key is the feature this node splits on
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))  # recurse into the subtree
        else:  # it's a leaf: plot the leaf node
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
The drawing code does not involve much algorithmic machinery, so I won't explain it in detail; to use it, just call createPlot on a tree, and look through the code if you are interested. The decision tree we computed in the last section is drawn as follows:
It looks a bit rough; in practice there are many Python GUI and plotting libraries that draw nicer trees, so I won't say more here.
Example: Predicting the type of contact lens
Problem Description
The contact lens dataset is a well-known dataset containing observations of many patients' eye conditions together with the type of contact lens recommended by doctors. The lens types are hard, soft, and "no lenses" (not suited to wearing contact lenses). The data comes from the UCI repository; to make it easier to display, the book makes some simple changes to it. The data is shown below:
From the four feature columns in front, we determine which of the three classes (hard, soft, no lenses) applies.
Decision tree classification
# Load the data file
fr = open("lenses.txt")
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
# Feature labels
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
# Build the decision tree
lensesTree = createTree(lenses, lensesLabels)
This yields the following tree:
{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'presbyopic': 'no lenses', 'pre': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'pre': 'soft', 'young': 'soft'}}}}}}
This is hard to read as text, so we draw it:
Now, when we obtain a new piece of data, we can filter it down through the tree's nodes and read off its class at a leaf.
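That tree-walking step can be written as a small recursive function. Here is a sketch (not the book's verbatim code): at each decision node, follow the branch matching the sample's value for that feature until a leaf is reached. The tree and labels below are the fish example from earlier:

```python
def classify(inputTree, featLabels, testVec):
    """Walk the tree: at each decision node, follow the branch matching
    the sample's value for that feature, until reaching a leaf label."""
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # map feature name back to a column index
    valueOfFeat = secondDict[testVec[featIndex]]
    if isinstance(valueOfFeat, dict):  # still a subtree: keep descending
        return classify(valueOfFeat, featLabels, testVec)
    return valueOfFeat  # reached a leaf: this is the class label

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(myTree, ['no surfacing', 'flippers'], [1, 0]))  # no
print(classify(myTree, ['no surfacing', 'flippers'], [1, 1]))  # yes
```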