Machine Learning in Action: Bayesian Classifiers

1. Bayes' Theorem

    Let X be a data sample described by measurements on a set of n attributes, and let H be a hypothesis, for example that the sample X belongs to a particular class C. Given the attribute description of X, we want the probability that the sample belongs to class C. Bayes' theorem states:

    P(H|X) = P(X|H) P(H) / P(X)

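As a quick numeric check of the formula (the probabilities here are illustrative, not taken from any dataset in this article):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# Illustrative numbers: a hypothesis H with prior 0.3, and evidence X
# that is twice as likely under H as it is overall.
p_h = 0.3          # prior P(H)
p_x_given_h = 0.8  # likelihood P(X|H)
p_x = 0.4          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 4))  # 0.6 -- the posterior P(H|X)
```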
2. Naive Bayes Classification

    Let D be a dataset in which each sample is an n-dimensional vector X = {x1, x2, ..., xn} over attributes A1, A2, ..., An, and suppose there are m classes C1, C2, ..., Cm. The naive Bayes classifier predicts that X belongs to class Ci if and only if

    P(Ci|X) > P(Cj|X)  for 1 ≤ j ≤ m, j ≠ i

By Bayes' theorem,

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is the same constant for every class, it suffices to maximize P(X|Ci) P(Ci).

Naive Bayes assumes the attributes are conditionally independent given the class, so

    P(X|Ci) = P(x1|Ci) P(x2|Ci) ... P(xn|Ci)

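Under the independence assumption, the class-conditional likelihood is just a product of per-attribute probabilities. A minimal sketch, using made-up conditional probabilities for a single class Ci:

```python
# Assumed per-attribute conditional probabilities P(xk|Ci) for one class Ci
# (illustrative values, not derived from any dataset).
per_attribute = [0.5, 0.2, 0.9]  # P(x1|Ci), P(x2|Ci), P(x3|Ci)

likelihood = 1.0
for p in per_attribute:
    likelihood *= p            # P(X|Ci) = product of the P(xk|Ci)

prior = 0.4                    # P(Ci)
score = likelihood * prior     # the quantity maximized over classes
print(round(likelihood, 4), round(score, 4))  # 0.09 0.036
```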
3. Implementing a Bayesian Classifier

We use a Bayesian classifier to predict survival on the Titanic.

import csv
import math

def loadCsv(filename):
    # Read a CSV file of numeric values into a list of float vectors.
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def separateByClass(dataset):
    # Group samples by their class label (assumed to be the last column).
    separated = {}
    for vector in dataset:
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    # Sample standard deviation (n - 1 in the denominator).
    avg = mean(numbers)
    variance = sum(pow(x - avg, 2) for x in numbers) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    # Per-attribute (mean, stdev) pairs; the class-label column is dropped.
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

def calculateClassProbabilities(summaries, inputVector):
    # Multiply the Gaussian density of each attribute value;
    # this requires every per-class attribute stdev to be nonzero.
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            avg, sd = classSummaries[i]
            x = inputVector[i]
            exponent = math.exp(-(math.pow(x - avg, 2) / (2 * math.pow(sd, 2))))
            probabilities[classValue] *= (1 / (math.sqrt(2 * math.pi) * sd)) * exponent
    return probabilities

def classify0(summaries, inputVector):
    # Return the class with the largest P(X|Ci)P(Ci) score.
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def test():
    trainingSet = loadCsv('titanictrain_3.csv')
    testSet = loadCsv('test_3.csv')
    summaries = summarizeByClass(trainingSet)
    correct = 0
    for i in range(len(testSet)):
        result = classify0(summaries, testSet[i])
        print("Predicted: %d, Actual: %d" % (result, testSet[i][-1]))
        if testSet[i][-1] == result:
            correct += 1
    # Compute accuracy once, after all test samples have been classified.
    accuracy = correct / float(len(testSet)) * 100.0
    print('Accuracy: %.0f%%' % accuracy)

test()

Accuracy: 85%
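One practical caveat: multiplying many small Gaussian densities can underflow to zero for long feature vectors. A common remedy (a sketch of a standard trick, not part of the script above) is to sum log-densities instead of multiplying densities:

```python
import math

def log_gaussian(x, avg, sd):
    # Log of the Gaussian density; assumes sd > 0.
    return -math.log(math.sqrt(2 * math.pi) * sd) - (x - avg) ** 2 / (2 * sd ** 2)

def log_class_scores(summaries, inputVector):
    # Same shape as calculateClassProbabilities, but sums logs,
    # so long feature vectors cannot underflow the product.
    scores = {}
    for classValue, classSummaries in summaries.items():
        scores[classValue] = 0.0
        for i, (avg, sd) in enumerate(classSummaries):
            scores[classValue] += log_gaussian(inputVector[i], avg, sd)
    return scores

# Toy check: two classes with different means; the sample 0.1 should
# score higher under class 0 (mean 0) than under class 1 (mean 5).
summaries = {0: [(0.0, 1.0)], 1: [(5.0, 1.0)]}
scores = log_class_scores(summaries, [0.1])
print(max(scores, key=scores.get))  # 0
```

The argmax over log scores picks the same class as the argmax over the raw products, since log is monotonic.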

Note: use the Laplacian correction (Laplace smoothing) to avoid zero probability estimates. Suppose an attribute has 3 possible values whose counts in dataset D are 0, 10, and 990. After the Laplacian correction, the probabilities become 1/1003, 11/1003, and 991/1003. The corrected probabilities stay close to the uncorrected estimates while avoiding zero values.
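The correction above can be sketched as follows, using the same counts 0, 10, 990 from the note:

```python
from fractions import Fraction

def laplace_smooth(counts):
    # Add 1 to every count, then renormalize; no probability is ever zero.
    total = sum(counts) + len(counts)
    return [Fraction(c + 1, total) for c in counts]

probs = laplace_smooth([0, 10, 990])
print(probs)  # [Fraction(1, 1003), Fraction(11, 1003), Fraction(991, 1003)]
```

Exact fractions are used here only to match the note's arithmetic; in practice the same add-one counts would be computed with floats.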


Reposted from blog.csdn.net/evitachan/article/details/80784244