Machine Learning in Action, Chapter 4: Classifying with Probability Theory: Naive Bayes (Naive Bayesian Classification)

Chapter 4: Classifying with Probability Theory: Naive Bayes

What this chapter covers

For classification, using probabilities is sometimes more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate unknown probabilities from known values.

Naive Bayes: the conditional-independence assumption between features reduces the amount of data required. Independence means that the probability of one word appearing has no relation to the other words in the document; this is what makes the method "naive". Even though the conditional-independence assumption does not really hold, naive Bayes is still an effective classifier. The other assumption naive Bayes makes is that every feature is equally important.

Practical issues when implementing naive Bayes: (1) numerical underflow, solved by taking the logarithm of the probabilities; (2) the bag-of-words model improves on the set-of-words model for document classification; (3) removing stop words.
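
For the third point, a minimal stop-word sketch (the tiny stop-word list here is only an illustration; in practice a published stop-word list would be used):

stopWords = set(['the', 'a', 'is', 'to'])                    # hypothetical tiny stop-word list
tokens = ['this', 'is', 'the', 'best', 'book']
filtered = [tok for tok in tokens if tok not in stopWords]   # ['this', 'best', 'book']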

Bayes' rule: how to swap the condition and the outcome in a conditional probability, i.e. given p(x|c), how to find p(c|x):

p(c|x) = p(x|c) × p(c) / p(x)
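
As a quick numeric check of the rule (all numbers here are made up purely for illustration):

p_x_given_c = 0.8                         # p(x|c), probability of observing x given class c
p_c = 0.3                                 # p(c), prior probability of class c
p_x = 0.5                                 # p(x), overall probability of x
p_c_given_x = p_x_given_c * p_c / p_x     # 0.48, by Bayes' rule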

If each feature requires N samples, then 1000 features would require N^1000 samples. If the features are mutually independent, the number of samples needed drops from N^1000 to 1000 × N. Independence here means statistical independence: the probability that a feature (or word) appears is unrelated to its being adjacent to other words.

If x in Bayes' rule is replaced with the vector w, the probability of class ci given w is:

p(ci|w) = p(w|ci) × p(ci) / p(w)

Computing p(w|ci) requires the naive Bayes assumption. Expanding w into its individual features, the probability becomes p(w0,w1,w2,...,wN|ci). If all words are assumed to be mutually independent (the conditional-independence assumption), it can be computed as p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci). Because most of these factors can be very small, the program may underflow or produce incorrect answers. One solution is to take the natural logarithm of the product: ln(a×b) = ln(a) + ln(b).
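
A minimal sketch of the underflow problem and the logarithm fix (the per-word probabilities here are hypothetical):

from math import log
probs = [0.01] * 1000                        # 1000 hypothetical per-word probabilities
product = 1.0
for p in probs :
    product *= p
print(product)                               # 0.0 -- the product underflows to zero
logProduct = sum(log(p) for p in probs)
print(logProduct)                            # about -4605.2; comparing classes still works on this scale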

Hold-out cross validation: randomly selecting one part of the data as the training set and keeping the remainder as the test set is called hold-out cross validation.
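
A minimal sketch of that split, mirroring the logic used in spamTest() below (the data-set size of 50 comes from the spam example):

import random
trainingSet = list(range(50)); testSet = []       # indices of all 50 documents
for i in range(10) :                              # hold out 10 documents at random
    randIndex = int(random.uniform(0, len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del(trainingSet[randIndex])                   # what remains is the training set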

Universal Feed Parser: the most commonly used RSS library in Python. Get it from https://github.com/kurtmckee/feedparser, unzip it, change into the directory, and run python setup.py install.
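
A minimal usage sketch, with one of the feeds referenced later in this post (the feed may no longer be reachable):

import feedparser
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
print(len(ny['entries']))              # number of entries in the feed
print(ny['entries'][0]['summary'])     # summary text of the first entry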

Other concepts: token vector, convenience function, set-of-words model, bag-of-words model, stop word list.
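
To make the set-of-words vs. bag-of-words distinction concrete, a small sketch with a hypothetical three-word vocabulary:

vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']
setVec = [1 if w in doc else 0 for w in vocab]    # set-of-words: 1 if the word appears at all -> [1, 1, 0]
bagVec = [doc.count(w) for w in vocab]            # bag-of-words: number of occurrences -> [1, 2, 0]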

Functions used

Function    Purpose
set1 | set(arr)    union of the sets set1 and set(arr)
[0]*n    creates a list of length n with every element 0, similar to zeros(n)
range(5)    [0,1,2,3,4]
range(2,5)    [2,3,4]
range(m,n,a)    creates an integer sequence from m up to (but not including) n with step a
log(a)    ln(a), the natural logarithm
arr1.extend(arr2)    appends each element of arr2 to arr1
arr1.append(arr2)    appends arr2 to arr1 as a single element
re.compile('\\W*')    compiles a regular expression matching runs of non-word characters (anything other than letters, digits and underscore); used here to split strings
open(filename).read()    opens the file and reads its contents as a string
random.uniform(x,y)    returns a random float between x and y
del(list[i])    deletes the i-th element of list
str.index(str1)    index of the first occurrence of str1 in str; raises an exception if str1 is absent, whereas find returns -1 in that case
arr.index(item)    index of the first occurrence of item in arr
feedparser.parse(rssstr)    fetches and parses the RSS feed at the URL rssstr
arr.count(value)    number of elements in arr equal to value
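
A few of these calls in action, as a quick illustration:

arr1 = [1, 2]; arr2 = [3, 4]
arr1.extend(arr2)                                        # arr1 is now [1, 2, 3, 4]
arr1.append(arr2)                                        # arr1 is now [1, 2, 3, 4, [3, 4]]
vocabSet = set(['dog', 'my']) | set(['my', 'stupid'])    # union of the two sets
print('hello'.find('z'))                                 # -1: find returns -1 when the substring is absent
print([1, 2, 2, 3].count(2))                             # 2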

Program code

# coding=utf-8

# Word-list to vector conversion functions
def loadDataSet() :
    postingList = [\
        ['my','dog','has','flea','problems','help','please'],\
        ['maybe','not','take','him','to','dog','park','stupid'],\
        ['my','dalmation','is','so','cute','I','love','him'],\
        ['stop','posting','stupid','worthless','garbage'],\
        ['mr','licks','ate','my','steak','how','to','stop','him'],\
        ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1]            # 1 means abusive, 0 means normal
    return postingList, classVec

def createVocabList(dataSet) :
    # Create an empty set, then add the set of new words from each document to it
    vocabSet = set([])
    for document in dataSet : 
        # Union of the two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet) : 
    # Create a vector of zeros the same length as the vocabulary
    returnVec = [0]*len(vocabList)
    for word in inputSet : 
        if word in vocabList : 
            returnVec[vocabList.index(word)] = 1
        else :
            print "the word : %s is not in my Vocabulary!" % word
    return returnVec


from numpy import *
# trainMatrix: document matrix (one word vector per document)
# trainCategory: vector of the class labels of each document
def trainNB0(trainMatrix, trainCategory) :
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # Initialize the probabilities: the numerator and denominator variables used to compute p(wi|c1) and p(wi|c0); w is one training record vector
    # When classifying a document with the Bayes classifier, several probabilities are multiplied to get the probability that the document belongs to a class, i.e. p(w0|1)*p(w1|1)*p(w2|1)
    # If any of those probabilities is 0, the whole product is 0. To lessen that impact, initialize every word count to 1 and the denominators to 2
    p0Num = ones(numWords)          # p0Num = zeros(numWords)
    p1Num = ones(numWords)          # p1Num = zeros(numWords)
    p0Denom = 2.0                   # p0Denom = 0.0
    p1Denom = 2.0                   # p1Denom = 0.0
    for i in range(numTrainDocs) :
        if trainCategory[i] == 1 :
            # element-wise vector addition
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else :
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Divide each element
    # Underflow comes from multiplying many very small numbers. When computing p(w0|ci)*p(w1|ci)*p(w2|ci)...p(wN|ci), most factors are tiny,
    # so the program underflows or produces a wrong answer.
    # One fix is to take the natural logarithm of the product: ln(a*b) = ln(a)+ln(b), which avoids underflow and floating-point round-off errors.
    p1Vect = log(p1Num/p1Denom)
    p0Vect = log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
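
# --- The three helpers below are called by spamTest() and localWords() further down
# --- but are missing from this listing; they are sketched here following the book's
# --- approach, with names matching the calls below.

# Naive Bayes classifier: compare the class log-probabilities and return the larger one
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1) :
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0 :
        return 1
    else :
        return 0

# Bag-of-words model: count how many times each word occurs instead of recording only 0/1
def bagOfWords2VecMN(vocabList, inputSet) :
    returnVec = [0]*len(vocabList)
    for word in inputSet :
        if word in vocabList :
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Tokenizer: split on non-word characters, drop short tokens, and lowercase everything
def textParse(bigString) :
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]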

# Automated test of the naive Bayes spam classifier
def spamTest() :
    docList=[]; classList=[]; fullText=[]
    # Import the text files in the spam and ham folders and parse them into word lists
    for i in range(1, 26) :
        # read and parse a spam email
        wordList = textParse(open('c:\python27\ml\email\spam\%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        # read and parse a normal (ham) email
        wordList = textParse(open('c:\python27\ml\email\ham\%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = range(50); testSet=[]
    # There are 50 emails in this example; 10 of them are chosen at random as the test set
    for i in range(10) :
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        # Remove the randomly chosen email from the training set and hold it out for the test set; this random split is hold-out cross validation
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses=[]
    # Iterate over all documents in the training set and build a word vector for each email from the vocabulary with setOfWords2Vec()
    for docIndex in trainingSet : 
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    # Classify every email in the test set
    for docIndex in testSet :
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        # a test-set email was misclassified
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex] :
            errorCount += 1
            print 'classification error ', docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)

# RSS feed classifier and most-frequent-word removal
# calcMostFreq iterates over every word in the vocabulary and counts how many times it appears in the text,
# then sorts the dictionary by frequency in descending order and returns the 30 most frequent words
def calcMostFreq(vocabList, fullText) :
    import operator
    freqDict = {}
    for token in vocabList :
        # number of times token appears in fullText
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

# feed1 and feed0 are two RSS feeds
def localWords(feed1, feed0) :
    import feedparser
    docList=[]; classList=[]; fullText=[]
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen) :    # visit one RSS entry at a time
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)
    for pairW in top30Words :
        if pairW[0] in vocabList : vocabList.remove(pairW[0])
    # trainingSet is a list of the integers 0 to 2*minLen-1
    trainingSet = range(2*minLen); testSet=[]
    for i in range(20) :
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        # delete the element at index randIndex
        del(trainingSet[randIndex])
    trainMat=[]; trainClasses=[]
    # trainingSet now has the 20 randomly chosen indices removed
    for docIndex in trainingSet :
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet : 
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex] :
            errorCount += 1
    print 'the error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V

# Function to display the most characteristic words
def getTopWords(ny, sf) :
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)) :
        if p0V[i] > -4.5 : topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -4.5 : topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair:pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF :
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair:pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY :
        print item[0]

Run the following at the Python prompt:

>>> import ml.bayes as bayes
>>> listOPosts,listClasses=bayes.loadDataSet()
>>> myVocabList=bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop
', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has
', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 't
ake', 'mr', 'steak', 'my']
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0
, 0, 0, 0, 0, 1]
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1
, 0, 0, 0, 0, 0]

>>> reload(bayes)
<module 'ml.bayes' from 'C:\Python27\ml\bayes.py'>
>>> listOPosts, listClasses=bayes.loadDataSet()         # load the data from the preloaded values
>>> myVocabList = bayes.createVocabList(listOPosts)     # list containing all the words
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop
', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has
', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 't
ake', 'mr', 'steak', 'my']
>>> trainMat = []
>>> for postinDoc in listOPosts :               # this for loop fills trainMat with word vectors
...     trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
...
>>> trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
 0, 1, 0, 1, 0, 0, 0, 0, 0]]
>>> p0V,p1V,pAb=bayes.trainNB0(trainMat, listClasses)
>>> p0V
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1V
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])
>>> pAb
0.5

# Tokenizing text
>>> mySent='This book is the best book on Python or M.L. I have ever laid eyes u
pon.'
>>> mySent.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I',
 'have', 'ever', 'laid', 'eyes', 'upon.']
>>> import re
>>> regEx = re.compile('\\W*')                  # matches runs of non-word characters, used to split the string
>>> listOfTokens = regEx.split(mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I
', 'have', 'ever', 'laid', 'eyes', 'upon', '']
>>> [tok for tok in listOfTokens if len(tok)>0]         # drop the empty strings
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I
', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok)>0]     # convert every token to lowercase
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i
', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> emailText = open('c:\python27\ml\email\ham\6.txt').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('r') or filename: 'c:\\python27\\ml\\email\\ha
m\x06.txt'
>>> emailText = open('c:\python27\ml\email\ham\\6.txt').read()
>>> listOfTokens=regEx.split(emailText)

# Cross-validation with the naive Bayes spam classifier
>>> reload(bayes)
<module 'ml.bayes' from 'C:\Python27\ml\bayes.py'>
>>> bayes.spamTest()
classification error  ['home', 'based', 'business', 'opportunity', 'knocking', '
your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'ear
n', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed'
, 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder',
'experts']
the error rate is:  0.1
>>> bayes.spamTest()
the error rate is:  0.0
>>> bayes.spamTest()
classification error  ['oem', 'adobe', 'microsoft', 'softwares', 'fast', 'order'
, 'and', 'download', 'microsoft', 'office', 'professional', 'plus', '2007', '201
0', '129', 'microsoft', 'windows', 'ultimate', '119', 'adobe', 'photoshop', 'cs5
', 'extended', 'adobe', 'acrobat', 'pro', 'extended', 'windows', 'professional',
 'thousand', 'more', 'titles']
classification error  ['home', 'based', 'business', 'opportunity', 'knocking', '
your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'ear
n', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed'
, 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder',
'experts']
the error rate is:  0.2

# Using naive Bayes to reveal regional attitudes from personal ads
>>> reload(bayes)
<module 'ml.bayes' from 'C:\Python27\ml\bayes.pyc'>
>>> ny=feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.3
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.3
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.2
>>> vocabList,pSF,pNY=bayes.localWords(ny,sf)
the error rate is:  0.45
# pSF and pNY are negative because ln(p) is used in place of p
>>> pSF
array([-5.66296048, -4.05352257, -5.66296048, -5.66296048, -5.66296048,
       -4.9698133 , -5.66296048, -4.9698133 , -5.66296048, -4.56434819,
    ....... 
       -4.9698133 , -5.66296048, -4.9698133 , -5.66296048, -4.9698133 ,
       -4.27666612, -4.9698133 , -4.9698133 ])
>>> pNY
array([-5.70711026, -5.70711026, -5.01396308, -5.01396308, -5.70711026,
       -5.70711026, -5.01396308, -5.70711026, -5.01396308, -5.01396308,
    ......
       -5.01396308, -5.70711026, -5.70711026, -5.70711026, -5.01396308,
       -5.70711026, -5.70711026, -4.60849798])

>>> bayes.getTopWords(ny, sf)
the error rate is:  0.35
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
more
tonight
four
cool
here
nine
man
meet
real
time
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
work
any
available
demands
some
manhattan
guy
lunch
care
take
find
enough
working
hello


Reposted from blog.csdn.net/namelessml/article/details/52416894