Filtering Spam with Naive Bayes


The split() text-splitting function

mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
ret=mySent.split()
print(ret)

Output:

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

A regular expression can deal with the other punctuation attached to words:

\d matches any decimal digit; it is equivalent to the class [0-9].
\D matches any non-digit character; it is equivalent to the class [^0-9].
\s matches any whitespace character; it is equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; it is equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; it is equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; it is equivalent to the class [^a-zA-Z0-9_].
import re
regEx=re.compile('\\W')    #uppercase W: matches non-word characters
listOfTokens=regEx.split(mySent)
print(listOfTokens)

Output:

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', '', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
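
The empty strings appear because re.split with a single \W breaks the string at every individual non-word character, so two adjacent delimiters (the dots and space in 'M.L. ', or the final period) leave an empty token between them. Splitting on runs of non-word characters with \W+ avoids most of this; only a delimiter at the very start or end of the string still produces one empty token:

listOfTokens=re.split(r'\W+', mySent)
print(listOfTokens)

Output:

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']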

To eliminate the empty strings, filter out the zero-length tokens:

ret=[tok for tok in listOfTokens if len(tok)>0]
print(ret)

This is equivalent to the following statements:

ret=[]
for tok in listOfTokens:
    if len(tok)>0:
        ret.append(tok)
print(ret)

In the list comprehension, the expression before the for keyword is what gets appended. To normalize case, return everything in lowercase:

ret=[tok.lower() for tok in listOfTokens if len(tok)>0]
print(ret)

Output:

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

The attached sample data contains many email texts. Take any one of them as an example:

Hello,

Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.

For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you’re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.

you have received this mandatory email service announcement to update you about important changes to Google Groups.

All of its words can be split out in the same way.

emailText = open('email/ham/6.txt').read()
listOfTokens=regEx.split(emailText)
print(listOfTokens)

Define a parsing function that returns a list of the words longer than two characters:

def textParse(bigString):    #input: one big string; output: a list of words
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
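
As a quick check, running textParse on the earlier sample sentence drops the empty strings along with the short fragments such as 'is', 'or', 'M', 'L' and 'I':

print(textParse(mySent))

Output:

['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']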

Finally, here is the file parser together with the complete spam-classification test:

from numpy import *

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    #dataSet is a list of word lists (a 2-D structure), so every element of every
    #document is pushed into the set (order may be scrambled),
    #then the result is returned as a list
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)
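
A small sketch of what createVocabList returns (the two documents below are made up for illustration; because a set is unordered, the order of the returned vocabulary can differ between runs):

docs=[['my','dog','has','flea'],['my','dog','is','cute']]
print(createVocabList(docs))    #e.g. ['cute', 'flea', 'is', 'has', 'dog', 'my'], order may vary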

#count word occurrences; used to build the document vectors
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
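
Continuing the made-up example, the returned vector counts how many times each vocabulary word occurs in the input document (a bag-of-words model) rather than only recording presence or absence (a set-of-words model); words missing from the vocabulary, such as 'ate' here, are silently skipped:

vocab=createVocabList(docs)
print(bagOfWords2VecMN(vocab, ['my','dog','ate','my','bone']))    #the position for 'my' holds 2, 'dog' holds 1, the rest 0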


def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     #number of training documents (6 in the book's toy example)
    numWords = len(trainMatrix[0])                      #vocabulary size after deduplication (32 in the toy example)
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #P(abusive) = docs labeled abusive / total docs = 3/6.0 = 0.5
    #variable initialization
    p0Num = zeros(numWords); p1Num = zeros(numWords)    #count vectors initialized to [0,0,0,0...]
    p0Denom = 0; p1Denom = 0                            #totals start at 0

    #Classification multiplies many conditional probabilities together to get the
    #probability that a document belongs to a class, i.e. p(w0|ci)*p(w1|ci)*...*p(wN|ci);
    #if any single factor is 0, the whole product is 0.
    #To reduce this effect we apply Laplace smoothing: add a (usually 1) to each numerator
    #and k*a to each denominator (k is the number of classes), so here every word count
    #is initialized to 1 and each denominator to 2*1=2.
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0

    #For each document:
    #if it was hand-labeled abusive, every word appearing in it is counted toward the
    #abusive counts p1Num, and the abusive word total p1Denom is updated accordingly;
    #if it is not abusive, the same statistics go into p0Num and p0Denom.
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    #conditional probability of each word given the abusive class =
    #its count in abusive documents p1Num / total words in abusive documents p1Denom
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    #Most of these factors are tiny, so their product underflows to 0 or loses precision
    #to floating-point rounding. We therefore take the natural log of the product,
    #i.e. work with log-likelihoods, which avoids underflow and rounding errors.
    #Since log is monotonic, nothing is lost: the ranking of the classes is preserved.
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    #In log space the product of probabilities becomes a sum:
    #p1 = (count of word A)*(log-prob of A in abusive docs) + (count of word B)*(log-prob of B in abusive docs) + ... + log(prob a doc is abusive)
    #p0 = (count of word A)*(log-prob of A in normal docs)  + (count of word B)*(log-prob of B in normal docs)  + ... + log(prob a doc is normal)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
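
To see trainNB0 and classifyNB working end to end, here is a minimal made-up run (the six-word vocabulary and four labeled documents below are invented for illustration; they are not the book's data):

toyVocab = ['stupid', 'dog', 'my', 'cute', 'garbage', 'love']
toyMat = array([[1, 1, 0, 0, 1, 0],    #"stupid dog garbage" -> abusive
                [1, 0, 0, 0, 1, 0],    #"stupid garbage"     -> abusive
                [0, 1, 1, 1, 0, 0],    #"my cute dog"        -> normal
                [0, 1, 1, 0, 0, 1]])   #"love my dog"        -> normal
toyLabels = array([1, 1, 0, 0])
p0V, p1V, pAb = trainNB0(toyMat, toyLabels)
testVec = array(bagOfWords2VecMN(toyVocab, ['stupid', 'garbage']))
print(classifyNB(testVec, p0V, p1V, pAb))    #prints 1: classified as abusive

Here p1 = log(3/7) + log(3/7) + log(0.5) ≈ -2.39 beats p0 = log(1/8) + log(1/8) + log(0.5) ≈ -4.85, so class 1 wins.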


def textParse(bigString):    #input: one big string; output: a list of words
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        #read the 25 spam emails and the 25 ham emails in turn
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        #ham/23.txt contains a registered-trademark (R) symbol, so decoding errors must be ignored when reading
        wordList = textParse(open('email/ham/%d.txt' % i,encoding='utf-8',errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    #deduplicated vocabulary
    vocabList = createVocabList(docList)                #create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        #numpy re-exports random; random.uniform(0, n) generates a random float in [0, n)
        randIndex = int(random.uniform(0,len(trainingSet)))
        #pick 10 distinct emails at random from range(50)
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  

    trainMat=[]; trainClasses = []
    #the remaining 40 emails are used for training
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    #estimate the conditional probabilities for each class
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))

    #the 10 selected emails are used for testing
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        #if the Bayes classifier's prediction disagrees with the actual label
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    #compute the error rate over the test set
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText


spamTest()
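
Because the 10 test emails are drawn at random, the printed error rate varies from run to run. A common refinement is to average over many random train/test splits (repeated hold-out validation). A sketch, assuming spamTest is modified to return float(errorCount)/len(testSet) instead of only printing it:

numRuns = 10
totalError = 0.0
for _ in range(numRuns):
    totalError += spamTest()
print('average error rate over %d runs: %f' % (numRuns, totalError/numRuns))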

Note that the 23rd ham email contains a character that is not valid UTF-8:


SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.

SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.

SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:

This character has to be skipped when the file is read.
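
Two standard ways to read such a file without crashing, both using arguments of Python's built-in open() (which one is appropriate depends on the file's actual encoding):

#drop any bytes that are not valid UTF-8 (this is what spamTest above does)
text = open('email/ham/23.txt', encoding='utf-8', errors='ignore').read()
#or decode as latin-1, which accepts every possible byte, so the (R) symbol survives
text = open('email/ham/23.txt', encoding='latin-1').read()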

Sample data download
