Note the library imports; the full list appears in the complete code below.
1. Scraping the short comments: the comments span more than one page, so pagination is needed; here I use a while loop and str(k) to build each page's URL (a small sketch follows the getHtml function below).
- First, open the page and inspect its source code.
- Fetch the HTML of the part to be scraped with requests:
def getHtml(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        print("Failed!!!")  # returns None on failure
- Parse the HTML with BeautifulSoup and write the results to xxx.txt via open(). Pay attention to the write mode (wb+/ab+) and to Chinese encoding issues; see the Runoob Python tutorial for details, and the short sketch after the getData function below:
f = open("E:/movieComment.txt",'wb+')
def getData(html):
soup = BeautifulSoup(html,"html.parser")
comment_list = soup.find('div',attrs={'class':'mod-bd'})
for comment in comment_list.find_all('div',attrs={'class':'comment-item'}):
comment_content = comment.find('span',attrs={'class':'short'}).get_text()
f.write(comment_content.encode('UTF-8'))
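On the wb+/ab+ point above: a binary-mode file takes bytes, so the comment string must be encoded to UTF-8 first; opening the file in text mode with an explicit encoding does the same job. A minimal sketch contrasting the two (the sample string is made up):

# 1) Binary append: encode the string yourself
with open("E:/movieComment.txt", 'ab+') as fb:
    fb.write("这是一条示例短评\n".encode('utf-8'))

# 2) Text-mode append: open() handles the encoding
with open("E:/movieComment.txt", 'a', encoding='utf-8') as ft:
    ft.write("这是一条示例短评\n")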
2. The scraped comments are saved in movieComment.txt; next, remove the stopwords.
- Removing stopwords requires a stopword list; the HIT (Harbin Institute of Technology) stopword list is well known and easy to find with a web search.
- Import the jieba library, segment the text with jieba, and filter out the stopwords:
def seg_sentence():
    # Build the stopword list
    filepath = 'E:/stopwords.txt'
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]  # adjust the encoding to match your stopword file
    # Segment the scraped text
    final = ''
    fn1 = open("E:/movieComment.txt", 'r', encoding='utf-8').read()  # load the scraped comments
    sentence_seged = jieba.cut(fn1, cut_all=False)  # jieba segmentation: precise mode
    fn2 = open("E:/new.txt", "w", encoding='utf-8')
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                final += word
                final += " "
    fn2.write(final)  # write out the text with stopwords removed
    fn2.close()
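A quick look at what jieba's precise mode produces and how the stopword filter works, using a made-up sentence and a toy stopword set:

import jieba

sample = "这部电影的剧情非常精彩"
stop = {"的", "非常"}  # toy stopword set
words = [w for w in jieba.cut(sample, cut_all=False) if w not in stop]
print(" ".join(words))  # roughly: 这部 电影 剧情 精彩

Loading the real stopword file into a set instead of a list also makes the word-not-in-stopwords test constant-time, which helps when filtering thousands of tokens.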
- This generates a new file, new.txt, whose content looks as follows:
3. Finally, generate and display the word cloud image.
- Installing wordcloud is the fiddly part: on Windows a plain pip install xxxxxx may not work. Download the wordcloud wheel matching your Windows/Python environment from https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud, then install it from the cmd prompt with pip install, taking care to use the correct path.
def wordcloud():
    # Load the mask image
    image = Image.open("E:/wc.jpg", 'r')
    img = np.array(image)
    # Build the word cloud
    cut_text = open('E:/new.txt', 'r', encoding='utf-8').read()  # load the text with stopwords removed
    wordcloud = WordCloud(
        mask=img,                 # when a mask is given, height/width are ignored
        height=2000,              # image height
        width=4000,               # image width
        background_color='white',
        max_words=1000,           # maximum number of words
        max_font_size=400,
        font_path=r"C:\Windows\Fonts\msyh.ttc",  # switch to another font here if characters show up as boxes
    ).generate(cut_text)
    # Display the image
    plot.imshow(wordcloud, interpolation='bilinear')
    plot.axis('off')              # hide the axes
    # plot.savefig('E:/wc1.jpg')  # to save the figure, call this before plot.show()
    plot.show()                   # show the figure
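A note on saving: plot.savefig() is usually called before plot.show(), otherwise the saved figure may come out blank. Alternatively, the WordCloud object can write the image itself; a two-line sketch (the output path is just an example):

wc = WordCloud(background_color='white').generate("sample text for the demo")
wc.to_file("E:/wc1.png")  # writes the rendered cloud directly, no matplotlib needed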
4. Complete code:
import requests
from bs4 import BeautifulSoup
import time
import jieba
from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plot


def getHtml(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        print("Failed!!!")  # returns None on failure


f = open("E:/movieComment.txt", 'wb+')  # binary write mode, so text must be encoded before writing


def getData(html):
    soup = BeautifulSoup(html, "html.parser")
    comment_list = soup.find('div', attrs={'class': 'mod-bd'})
    for comment in comment_list.find_all('div', attrs={'class': 'comment-item'}):
        comment_content = comment.find('span', attrs={'class': 'short'}).get_text()
        f.write(comment_content.encode('UTF-8'))


def seg_sentence():
    # Build the stopword list
    filepath = 'E:/stopwords.txt'
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]  # adjust the encoding to match your stopword file
    # Segment the scraped text
    final = ''
    fn1 = open("E:/movieComment.txt", 'r', encoding='utf-8').read()  # load the scraped comments
    sentence_seged = jieba.cut(fn1, cut_all=False)  # jieba segmentation: precise mode
    fn2 = open("E:/new.txt", "w", encoding='utf-8')
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                final += word
                final += " "
    fn2.write(final)  # write out the text with stopwords removed
    fn2.close()


def wordcloud():
    # Load the mask image
    image = Image.open("E:/wc.jpg", 'r')
    img = np.array(image)
    # Build the word cloud
    cut_text = open('E:/new.txt', 'r', encoding='utf-8').read()  # load the text with stopwords removed
    wordcloud = WordCloud(
        mask=img,                 # when a mask is given, height/width are ignored
        height=2000,              # image height
        width=4000,               # image width
        background_color='white',
        max_words=1000,           # maximum number of words
        max_font_size=400,
        font_path=r"C:\Windows\Fonts\msyh.ttc",  # switch to another font here if characters show up as boxes
    ).generate(cut_text)
    # Display the image
    plot.imshow(wordcloud, interpolation='bilinear')
    plot.axis('off')              # hide the axes
    # plot.savefig('E:/wc1.jpg')  # to save the figure, call this before plot.show()
    plot.show()                   # show the figure


def main():
    # Pagination: max(start) = 200
    k = 0  # start = k
    i = 0
    while k < 200:
        url = 'https://movie.douban.com/subject/26752088/comments?start=' + str(k) + '&limit=20&sort=new_score&status=P'
        k += 20
        i += 1
        print("Crawling page " + str(i))
        # time.sleep(2)  # optional delay between requests
        html = getHtml(url)
        getData(html)
    seg_sentence()
    wordcloud()


if __name__ == "__main__":
    main()
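If Douban starts rejecting requests, re-enable the time.sleep(2) line in main(), or use a randomized delay between pages, e.g.:

import random, time
time.sleep(random.uniform(1, 3))  # polite, randomized pause between page requests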
5. Results