The three most popular decision tree algorithms today are ID3, C4.5, and CART. C4.5 is an improvement on ID3 that replaces ID3's information gain with the information gain ratio. ID3 cannot handle continuous data directly; the data must be discretized beforehand. C4.5 can handle continuous (non-discrete) data as well as incomplete data. CART uses the Gini index for feature selection and performs pruning during tree construction. When building a decision tree, it is best not to split further on nodes that hold only a few samples, as doing so easily leads to overfitting.
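For a concrete feel of the two alternative splitting criteria mentioned above, here is a minimal sketch; the helper names gini and split_info are my own and are not part of the ID3 implementation later in this post:

from collections import Counter
from math import log

def gini(labels):
    """CART's Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_info(subset_sizes, n):
    """Intrinsic value of a split; C4.5 divides the information gain by this to get the gain ratio."""
    return -sum(c / n * log(c / n, 2) for c in subset_sizes)

# Example: 5 samples labeled 2 x 'yes' / 3 x 'no', split into subsets of size 3 and 2
print(gini(['yes', 'yes', 'no', 'no', 'no']))  # 0.48
print(split_info([3, 2], 5))                   # ~0.971; gain ratio = information gain / 0.971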
The basic idea of the decision tree (ID3) algorithm can be summarized as: compute the information gain of each attribute in the attribute set, choose the attribute with the largest gain as the current node of the tree, and repeat downward to build the whole tree.
Suppose a random variable X = {x1, x2, …, xn}, where each value is taken with probability P = {p(x1), p(x2), …, p(xn)}. The entropy is computed with the following formula:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
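For example, for the sample dataset used later in this post (two 'yes' labels and three 'no' labels), H = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971.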
Once the entropy is computed, the dataset is partitioned by each attribute and the information gain of each attribute is computed to decide which attribute becomes the node of the decision tree. The information gain is computed as:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
where S is the full sample set, A is an attribute, Values(A) is the set of values A can take, v is a single value of A, S_v is the subset of samples in S whose attribute A has value v, and |S_v| is the number of samples in S_v.
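As a worked example on the sample dataset defined below: splitting on 'no surfacing' yields the subsets {yes, yes, no} (entropy ≈ 0.918) and {no, no} (entropy 0), so Gain ≈ 0.971 - (3/5)·0.918 ≈ 0.420, while splitting on 'flippers' yields {yes, yes, no, no} (entropy 1.0) and {no} (entropy 0), so Gain ≈ 0.971 - (4/5)·1.0 ≈ 0.171. The algorithm therefore picks 'no surfacing' as the root.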
Below is a Python implementation of the ID3 algorithm, which builds the decision tree and tests it.
#!/usr/bin/env python
# coding=utf-8
from math import log
import operator
def creatDataSet():
    """
    Build the sample dataset.
    """
    dataset = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataset, labels
def clacShannonEnt(dataset):
    """
    Compute the Shannon entropy of the given dataset.
    """
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    ShannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        ShannonEnt -= prob * log(prob, 2)
    return ShannonEnt
def splitDataset(dataset, axis, value):
    """
    Split the dataset on a given feature.
    dataset is the dataset, axis is the index of the feature to split on,
    value is the feature value to select; returns retDataSet, the subset of
    samples whose feature at axis equals value, with that feature column removed.
    """
    retDataSet = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataset):
    """
    Choose the best feature to split the dataset on.
    """
    numFeatures = len(dataset[0]) - 1
    # entropy of the original dataset
    baseEntropy = clacShannonEnt(dataset)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataset]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataset = splitDataset(dataset, i, value)
            prob = len(subDataset) / float(len(dataset))
            newEntropy += prob * clacShannonEnt(subDataset)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
def majorityCnt(classList):
    """
    Return the class label that occurs most often in classList.
    """
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def creatTree(dataset, labels):
    """
    Build the decision tree recursively.
    """
    classList = [example[-1] for example in dataset]
    # next two lines: stop splitting when all classes are identical
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # next two lines: when all features are exhausted, return the majority class
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    # collect every value the chosen feature takes in the dataset
    featValues = [example[bestFeat] for example in dataset]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so recursive calls do not mutate our label list
        myTree[bestFeatLabel][value] = creatTree(splitDataset(dataset, bestFeat, value), subLabels)
    return myTree
def classify(inputTree, featLabels, testVec):
    """
    Classification test function: inputTree is the tree built above, featLabels
    are the feature labels of the test data, testVec is the test sample;
    returns the predicted class label.
    """
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                # internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # leaf node: this is the predicted label
                classLabel = secondDict[key]
    return classLabel
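A quick sanity check of the functions above (my own test snippet; the expected outputs follow from the sample dataset):

dataset, labels = creatDataSet()
myTree = creatTree(dataset, labels[:])   # pass a copy, since creatTree deletes labels it has used
print(myTree)                            # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(myTree, labels, [1, 0]))  # -> 'no'
print(classify(myTree, labels, [1, 1]))  # -> 'yes'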