Naive Bayes in Practice: Using Douban Reviews of Saw (电锯惊魂) to Judge Whether a Newly Written Review Is Positive or Negative

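Read ten thousand books, travel ten thousand miles: theory needs practice. So after studying the naive Bayes classifier, I wrote a small example that predicts whether a review is positive or negative.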

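The example makes its predictions from the Douban reviews of Saw 8 (电锯惊魂8). Without logging in, only 220 comments can be fetched, and since the goal was to test naive Bayes I did not put much effort into the crawler: I crawled just 220 positive and 220 negative reviews. If you want better accuracy, crawl more data, and include reviews of other movies as well.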

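The crawled comments are segmented with the jieba package, and the few keywords with the highest weights are extracted from each one. All keywords, from both positive and negative reviews, then go into two maps (one for positive reviews, one for negative), where the key is the keyword and the value is the number of times it occurs in that class. Every count starts at 1 so that no probability is ever 0, and the product of probabilities is replaced by a sum of logarithms, because a product like 0.00000000000.....1 is too small to be represented.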
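
The two tricks in the paragraph above can be seen in isolation in a minimal Python 3 sketch. The keywords and counts are made up for illustration and are not taken from the crawled data.

```python
import math

# Hypothetical per-class keyword counts, not taken from the crawled data.
good_counts = {"clever": 3, "classic": 2}
bad_counts = {"boring": 4}

# Start every keyword at 1 in both classes, then add the observed counts,
# so a keyword that is missing from one class never gets probability zero.
vocabulary = set(good_counts) | set(bad_counts)
good_smoothed = {w: good_counts.get(w, 0) + 1 for w in vocabulary}
bad_smoothed = {w: bad_counts.get(w, 0) + 1 for w in vocabulary}

# Multiplying many small probabilities underflows to 0.0 in floating point...
probabilities = [1e-5] * 100
product = 1.0
for p in probabilities:
    product *= p

# ...while the sum of their logarithms stays an ordinary negative number.
log_sum = sum(math.log(p) for p in probabilities)

print(good_smoothed["boring"], bad_smoothed["boring"])
print(product, log_sum)
```

Since the logarithm is monotonic, comparing the two log sums picks the same class as comparing the two products would.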

Without further ado, here is the code (both scripts are written for Python 2):

First, crawl the data.

Crawler.py:

# *_*coding:utf-8 *_*
import  requests
import sys
import re
import jieba
import jieba.analyse
import os
reload(sys)
sys.setdefaultencoding("utf-8")

# Positive reviews
#url="https://movie.douban.com/subject/25788426/comments?start=0&limit=20&sort=new_score&status=P&percent_type=h"
# Negative reviews
url="https://movie.douban.com/subject/25788426/comments?start=0&limit=20&sort=new_score&status=P&percent_type=l"
# Common words to filter out
oftenWord=['是','不是','电影','竖','锯','电锯','']
words=''
def getPageContent(url):
    page=requests.get(url)
    page.encoding=requests.utils.get_encodings_from_content(page.text)[0]
    nextPage=re.findall(r'href="*.?start(.*?)"*.?data-page=',page.text,re.S)
    if not nextPage:
        return '',''
    nextPage=nextPage[-1]
    items=re.findall(r'<span class="short">(.*?)</span>',page.text,re.S)
    return nextPage,items

def saveKeys(name):
    filePath = 'F:\\vidoTake\\'
    if not os.path.exists(filePath):
        os.makedirs(filePath)
    file = open(filePath+name.decode("utf-8")+".txt", "w")
    file.write(words)
    file.close()


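def textKey(text):
    global words
    # 1. Segment the text; jieba.cut defaults to accurate mode (cut_all=False)
    cutTexts=jieba.cut(text)
    # 2. Drop the common words
    text=""
    for cutText in cutTexts:
        if cutText not in oftenWord:
            text=text+" "+cutText
    # 3. Extract keywords
    # text: the text to extract keywords from
    # topK: how many of the keywords with the highest TF/IDF weight to return (default 20)
    # withWeight: whether to also return each keyword's weight (default False)
    # allowPOS: keep only words with the given part-of-speech tags; empty (the default) means no filtering
    keys=jieba.analyse.extract_tags(text,topK=10,withWeight=True,allowPOS=())
    # Append this review's keywords as one space-separated line
    for word,qz in keys:
        words=words+" "+word
    words=words+"\n"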


def CheakText(url):
    page = [0]
    for i in page:
        nextPage, items = getPageContent(url)
        if nextPage == '':
            break

        url = "https://movie.douban.com/subject/25788426/comments?start" + nextPage
        for item in items:
            textKey(item)
        page.append(0)
    # The file name must match percent_type in url: "差评" (negative) for l, "好评" (positive) for h
    saveKeys("差评")


CheakText(url)
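
The paging loop in CheakText can also be written as a plain while loop. The sketch below is Python 3 and only illustrates the control flow: crawl_all and fake_fetch are hypothetical names, and fake_fetch stands in for getPageContent so that the example runs without network access. Unlike the script above, which stops as soon as no next-page link is found, this version also keeps the comments on that final page.

```python
def crawl_all(first_url, fetch):
    """Follow next-page links until a page reports no successor.

    fetch(url) must return (next_url, items); next_url is '' on the last page.
    """
    collected = []
    url = first_url
    while url:
        next_url, items = fetch(url)
        collected.extend(items)  # keep the items even on the last page
        url = next_url
    return collected


# Hypothetical stand-in for getPageContent: three fake pages held in memory.
FAKE_PAGES = {
    "page0": ("page1", ["comment a", "comment b"]),
    "page1": ("page2", ["comment c"]),
    "page2": ("", ["comment d"]),
}


def fake_fetch(url):
    return FAKE_PAGES[url]


comments = crawl_all("page0", fake_fetch)
print(comments)
```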

Classifying reviews:

PuSuBayes.py：

# *_*coding:utf-8 *_*
import sys
import math
reload(sys)
sys.setdefaultencoding('utf8')
import jieba

# Read a crawled file and build its feature vector
# Format: {feature: number of occurrences}, counted from a default of 1
def getSet(url):
    file=open(url,"r")
    contents=file.readlines()
    setContent={}
    for content in contents:
        items=content.replace("\n", "").split(" ")
        for item in items:
            if item=='':
                continue
            setContent[item]=setContent.get(item,1)+1
    return setContent,len(contents)

# Count every feature in every class, and count the reviews in each class
def countData():
    goodContent,goodLen = getSet("F:\\vidoTake\\" + "好评".decode("utf-8") + ".txt")
    badContent,badLen = getSet("F:\\vidoTake\\" + "差评".decode("utf-8") + ".txt")
    keys = set(goodContent.keys()) | set(badContent.keys())
    # A feature missing from one class gets the default count 1 there
    for key in keys:
        if key not in goodContent:
            goodContent[key] = 1
        if key not in badContent:
            badContent[key] = 1

    return goodContent,badContent,goodLen,badLen

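# Compute the Bayes score of each class.
# Segment the input text, then keep the words that occur in the feature vectors above (X, Y, Z).
# Compute P(class i | feature 1, feature 2, feature 3)
# = P(feature 1, feature 2, feature 3 | class i) * P(class i) / P(feature 1, feature 2, feature 3)
# = P(feature 1 | class i) * P(feature 2 | class i) * P(feature 3 | class i) * P(class i) / P(feature 1, feature 2, feature 3)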

def bayes(word):
    goodContent, badContent, goodLen, badLen=countData()
    curWords=jieba.cut(word)
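    # Compute the scores. The denominator is the same for both classes, so it is skipped.
    # A product of many small probabilities is too small to compute, so take logs and turn the multiplication into addition.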
    goodBayes =math.log(goodLen*1.0/(goodLen+badLen))
    badBayes = math.log(badLen*1.0/(goodLen+badLen))
    for curWord in curWords:
        if curWord in goodContent.keys():
            goodBayes=math.log(goodContent.get(curWord.encode("utf-8"))*1.0/goodLen)+goodBayes
            badBayes=math.log(badContent.get(curWord.encode("utf-8"))*1.0/badLen)+badBayes
    return goodBayes,badBayes


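# Test sentence, roughly: "Is this it? What rubbish. I had been looking forward to it for so long."
goodBayes,badBayes=bayes("就这东西，什么玩意，期待了很久")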
print goodBayes
print badBayes
if goodBayes>badBayes:
    print '好评'  # positive review
else:
    print '差评'  # negative review
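
For readers on Python 3, the scoring done by bayes can be sketched without files or jieba. Everything below is an assumption made for illustration: the token lists stand in for the keyword lines saved by Crawler.py, the input is passed in already segmented, and the names build_counts and score are hypothetical. The arithmetic follows the script above: counts start at 1, and each matching word adds log(count / number of reviews in the class).

```python
import math


def build_counts(reviews):
    """Map each keyword to (occurrences + 1), as getSet does."""
    counts = {}
    for tokens in reviews:
        for token in tokens:
            counts[token] = counts.get(token, 1) + 1
    return counts


def score(tokens, good_reviews, bad_reviews):
    good, bad = build_counts(good_reviews), build_counts(bad_reviews)
    # A keyword seen in only one class gets the default count 1 in the other.
    for key in set(good) | set(bad):
        good.setdefault(key, 1)
        bad.setdefault(key, 1)
    n_good, n_bad = len(good_reviews), len(bad_reviews)
    # Log priors: the share of reviews that belongs to each class.
    good_score = math.log(n_good / (n_good + n_bad))
    bad_score = math.log(n_bad / (n_good + n_bad))
    for token in tokens:
        if token in good:  # words never seen in training are ignored
            good_score += math.log(good[token] / n_good)
            bad_score += math.log(bad[token] / n_bad)
    return good_score, bad_score


# Hypothetical training data: each inner list is one review's keywords.
good_reviews = [["tense", "clever"], ["clever", "fun"]]
bad_reviews = [["boring", "predictable"], ["boring", "waste"]]

good_score, bad_score = score(["boring", "waste", "unseen"], good_reviews, bad_reviews)
label = "positive" if good_score > bad_score else "negative"
print(good_score, bad_score, label)
```

Because each count is divided by the number of reviews rather than by a smoothed total, these ratios are scores rather than normalized probabilities (a ratio can exceed 1). The script only compares the two scores with each other.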

-----------------------------------------------------------------------------------------------------------------------------------------------------------

In manual testing I wrote several positive and several negative reviews, and every one of them was classified correctly.
