Machine Learning in Action: The Decision Tree Algorithm

Copyright notice: this is an original post by the author and may not be reposted without permission. https://blog.csdn.net/sinat_20177327/article/details/79708357

The most popular decision tree algorithms today are ID3, C4.5, and CART. C4.5 is an improvement on ID3: it replaces ID3's information gain with the information gain ratio. ID3 cannot handle continuous-valued data directly; such data must first be discretized. C4.5 can handle non-discrete attributes and can also cope with incomplete data. CART uses the Gini index for feature selection and prunes the tree during construction. When building a decision tree, nodes that hold only a few samples should not be split to maximum purity, otherwise the tree easily overfits.
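
For comparison (this is not part of the original implementation below), a CART-style tree that splits on the Gini index can be sketched with scikit-learn, assuming it is installed; the toy data is the same fish dataset used later in the post, and the parameter choices are illustrative only.

from sklearn.tree import DecisionTreeClassifier

# Toy data: [no surfacing, flippers] -> is it a fish? (same as the dataset below)
X = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]
y = ['yes', 'yes', 'no', 'no', 'no']

# criterion='gini' gives CART-style splitting on the Gini index;
# raising min_samples_leaf above 1 is one simple way to stop nodes with only a
# few samples from being split further, which helps limit overfitting.
clf = DecisionTreeClassifier(criterion='gini', min_samples_leaf=1)
clf.fit(X, y)
print(clf.predict([[1, 0]]))  # expected output: ['no']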

The basic idea of the decision tree algorithm (ID3) can be summarized as follows: compute the information gain of each attribute, choose the attribute with the largest gain as the current node of the tree, and repeat the process on each branch to build the whole tree.

Suppose a random variable X takes the values {x1, x2, …, xn} with probabilities P = {p(x1), p(x2), …, p(xn)}. Its entropy is computed by the following formula:

H(X) = -∑ p(xi) · log2 p(xi),  summed over i = 1, …, n
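
For example, for the small sample dataset used in the code below (five samples whose class labels are yes, yes, no, no, no), the entropy is -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971.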

Once the entropy has been computed, the dataset is partitioned by each attribute in turn, and the information gain of each attribute is computed to decide which attribute becomes the next node of the tree. Information gain is computed as:

Gain(S, A) = Entropy(S) - ∑ (|Sv| / |S|) · Entropy(Sv),  summed over v ∈ value(A)

where S is the full set of samples, A is an attribute, value(A) is the set of values A can take, v is one value of A, Sv is the subset of S whose attribute A equals v, and |Sv| is the number of samples in that subset.
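
For example, on the sample dataset used in the code below, splitting on the first attribute ('no surfacing') gives Gain ≈ 0.971 - (3/5)·0.918 - (2/5)·0 ≈ 0.420, while splitting on 'flippers' gives Gain ≈ 0.971 - (4/5)·1.0 - (1/5)·0 ≈ 0.171, so 'no surfacing' is the better choice for the root node.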

Below is a Python implementation of the ID3 algorithm that builds a decision tree and tests it on new samples.

#!/usr/bin/env python
#coding=utf-8

from math import log
import operator

def creatDataSet():
	"""
	Create the sample dataset and the list of feature labels.
	"""
	dataset = [[1, 1, 'yes'],
			   [1, 1, 'yes'],
			   [1, 0, 'no'],
			   [0, 1, 'no'],
			   [0, 1, 'no']]
	labels = ['no surfacing', 'flippers']
	return dataset, labels

def clacShannonEnt(dataset):
	"""
	Compute the Shannon entropy of the given dataset.
	"""
	numEntries = len(dataset)
	labelCounts = {}

	for featVec in dataset:
		currentLabel = featVec[-1]
		if currentLabel not in labelCounts.keys():
			labelCounts[currentLabel] = 0
		labelCounts[currentLabel] += 1

	ShannonEnt = 0.0

	for key in labelCounts:
		prob = float(labelCounts[key]) / numEntries
		ShannonEnt -= prob * log(prob, 2)
	return ShannonEnt

def splitDataset(dataset, axis, value):
	"""
	Split the dataset on a given feature.
	dataset is the dataset, axis is the index of the feature to split on, and value is
	the feature value to select; retDataSet contains the samples whose feature at axis
	equals value, with that feature column removed.
	"""
	retDataSet = []
	for featVec in dataset:
		if featVec[axis] == value:
			reducedFeatVec = featVec[: axis]
			reducedFeatVec.extend(featVec[axis + 1 : ])
			retDataSet.append(reducedFeatVec)
	return retDataSet
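
# Example: for the dataset returned by creatDataSet() above, splitDataset(dataset, 0, 1)
# returns [[1, 'yes'], [1, 'yes'], [0, 'no']] -- the three samples whose feature 0
# equals 1, with that feature column removed.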

def chooseBestFeatureToSplit(dataset):
	"""
	Choose the feature whose split yields the largest information gain.
	"""
	numFeatures = len(dataset[0]) - 1
	#entropy of the original dataset
	baseEntropy = clacShannonEnt(dataset)
	#print baseEntropy
	bestInfoGain = 0.0
	bestFeature = -1
	for i in range(numFeatures):
		featList = [example[i] for example in dataset]
		uniqueVals = set(featList)
		newEntropy = 0.0
		for value in uniqueVals:
			subDataset = splitDataset(dataset, i, value)
			prob = len(subDataset) / float(len(dataset))
			newEntropy += prob * clacShannonEnt(subDataset)

		infoGain = baseEntropy - newEntropy
		#print infoGain
		#print i
		if (infoGain > bestInfoGain):
			bestInfoGain = infoGain
			bestFeature = i
			#print i
	return bestFeature

def majorityCnt(classList):
	"""
	Return the class label that occurs most often in classList.
	"""
	classCount = {}
	for vote in classList:
		if vote not in classCount:
			classCount[vote] = 0
		classCount[vote] += 1
	sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
	return sortedClassCount[0][0]

def creatTree(dataset, labels):
	"""
	Recursively build the decision tree.
	"""
	classList = [example[-1] for example in dataset]
	#next two lines: stop splitting when all samples in the node have the same class
	if classList.count(classList[0]) == len(classList):
		return classList[0]
	#next two lines: when all features have been used, return the majority class
	if len(dataset[0]) == 1:
		return majorityCnt(classList)
	bestFeat = chooseBestFeatureToSplit(dataset)
	bestFeatLabel = labels[bestFeat]
	myTree = {bestFeatLabel: {}}
	#remove the chosen feature from the label list, then collect all values it takes
	del(labels[bestFeat])
	featValues = [example[bestFeat] for example in dataset]
	uniqueVals = set(featValues)
	for value in uniqueVals:
		subLables = labels[:]
		myTree[bestFeatLabel][value] = creatTree(splitDataset(dataset, bestFeat, value), subLables)
	return myTree

def classify(inputTree, featLabels, testVec):
	"""
	Classification function: inputTree is the decision tree built above, featLabels is the
	list of feature labels, and testVec is the feature vector to classify; the predicted
	class label is returned.
	"""
	firstStr = list(inputTree.keys())[0]
	secondDict = inputTree[firstStr]
	featIndex = featLabels.index(firstStr)
	for key in secondDict.keys():
		if testVec[featIndex] == key:
			if type(secondDict[key]).__name__ == 'dict':
				classLabel = classify(secondDict[key], featLabels, testVec)
			else:
				classLabel = secondDict[key]
	return classLabel
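
To tie the pieces together, here is a minimal usage sketch (not part of the original post); the tree and predictions in the comments are what the sample data above should produce.

if __name__ == '__main__':
	myDat, labels = creatDataSet()
	featLabels = labels[:]   # keep a copy, since creatTree() deletes entries from labels
	myTree = creatTree(myDat, labels)
	print(myTree)
	# expected: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
	print(classify(myTree, featLabels, [1, 0]))   # expected: no
	print(classify(myTree, featLabels, [1, 1]))   # expected: yes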
