Scraping Douban Book Comments with Python

Preparation

1. Open the Douban Books channel: https://book.douban.com

2. Find a book you are interested in, open its page, and view its comments.

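3. Analyze the URL pattern of the comment pages. The part they all share is: https://book.douban.com/subject/book_id/comments?

   Here book_id is the book's numeric ID as it appears in the browser address bar.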
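To make the URL pattern concrete, here is a minimal sketch that builds the address of one comment page from a book ID and a page index. The helper name `build_comment_url` is hypothetical; the query parameters (`start`, `limit`, `sort`, `status`) are the ones that appear in the crawler code in this article, and Douban may change them at any time.

```python
from urllib.parse import urlencode

BASE = "https://book.douban.com/subject/{book_id}/comments?"


def build_comment_url(book_id, page, page_size=20):
    """Return the URL of one page of comments (page is 0-based)."""
    params = {
        "start": page * page_size,  # offset of the first comment on the page
        "limit": page_size,         # number of comments per page
        "sort": "new_score",
        "status": "P",
    }
    return BASE.format(book_id=book_id) + urlencode(params)


print(build_comment_url(1799652, 0))
# https://book.douban.com/subject/1799652/comments?start=0&limit=20&sort=new_score&status=P
```

Building the query string with `urlencode` keeps the parameters in one place, which makes it easier to change the page size or sort order later.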

Implementing the crawler

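import random
import time

import requests
from bs4 import BeautifulSoup


# Fetch the HTML of a page; return an empty string if the request fails
def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.text
    except requests.RequestException:
        return ''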


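# Extract the comments from a page
def getComment(html):
    soup = BeautifulSoup(html, 'html.parser')
    comments_list = []  # list of comment strings, one per line
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        # Collapse each comment onto a single line and terminate it with a newline
        comments_list.append(node.get_text().strip().replace("\n", "") + '\n')
    return comments_list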


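# Fetch the comments and save them to a file
def saveCommentText(fpath):
    pre_url = "https://book.douban.com/subject/1799652/comments?"
    # Crawl depth: number of comment pages to fetch
    depth = 16

    with open(fpath, 'w', encoding='utf-8') as f:
        for i in range(depth):
            print('Fetching page {} of comments...'.format(i + 1))
            url = pre_url + 'start=' + str(20 * i) + '&limit=20&sort=new_score&status=P'
            # Fetch, write, and sleep inside the loop so that every page is saved
            html = getHtml(url)
            f.writelines(getComment(html))
            # Sleep for a random interval to lower the risk of the IP being blocked;
            # this may not be strictly necessary
            time.sleep(1 + random.randint(1, 20) / 20)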
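The sleep interval used in the crawler, 1 + randint(1, 20) / 20, gives a pause between 1.05 and 2.0 seconds in steps of 0.05. The sketch below shows the same idea as a small reusable helper based on `random.uniform`, which draws from a continuous range instead. The name `polite_delay` and its default bounds are assumptions for illustration, not part of the original code.

```python
import random


def polite_delay(low=1.0, high=2.0):
    """Return a random pause length in seconds, between low and high."""
    return random.uniform(low, high)


# All values the original expression 1 + randint(1, 20) / 20 can take
original = [1 + k / 20 for k in range(1, 21)]
print(min(original), max(original))  # 1.05 2.0

delay = polite_delay()
print('sleeping {:.2f}s'.format(delay))  # pass this value to time.sleep()
```

Keeping the delay in one function makes it simple to lengthen the pause if the site starts rejecting requests.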

Generating the word cloud

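Generating the word cloud relies on the wordcloud package.

You also need to supply a background image and the path to a font file that contains Chinese glyphs; without such a font, Chinese text cannot be displayed.

Finally, the text has to be segmented into words beforehand.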
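A missing font or background image typically only shows up as an error once the cloud is being drawn, so it can help to check for the files up front. The following is a minimal sketch under that assumption; the helper name `missing_resources` is hypothetical, and the two file names are the ones used in the drawing code in this article.

```python
from pathlib import Path


def missing_resources(*paths):
    """Return those of the given paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# File names as used in this article; adjust them to your own files
missing = missing_resources("FZYTK.TTF", "comment.jpeg")
if missing:
    print("Missing resource files:", ", ".join(missing))
```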

Word segmentation

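import codecs

import jieba


# Segment the comment text into words
def cutWords(fpath):
    text = ''
    with open(fpath, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip('\n')
            # Join the tokens with spaces so that they can be split apart again later
            text += ' '.join(jieba.cut(line))
            text += ' '
    with codecs.open('cut_word.txt', 'w', encoding='utf-8') as f:
        f.write(text)

    print("\nSegmentation finished; the file was saved successfully!")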
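Because the segmented text is simply words separated by spaces, it is easy to inspect which words dominate before drawing the cloud, for example to decide what belongs in the stop-word list. Here is a minimal sketch using only the standard library. The function name `top_words` and the sample string are made up for illustration; in practice the input would be the contents of cut_word.txt.

```python
from collections import Counter


def top_words(text, n=10, stopwords=()):
    """Count space-separated words, skipping stop words and single characters."""
    words = [w for w in text.split()
             if len(w) > 1 and w not in stopwords]
    return Counter(words).most_common(n)


# Made-up sample in the same space-separated form that cutWords writes
sample = '这 本书 真是 好看 ， 故事 好看 ， 人物 也 好看'
print(top_words(sample, n=1, stopwords={'真是'}))  # [('好看', 3)]
```

Dropping single-character tokens is a design choice of this sketch: it removes punctuation and most particles, at the cost of also discarding meaningful one-character words.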

Creating the word cloud image

import codecs

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator


# Draw the word cloud
def drawWordcloud():
    with codecs.open('cut_word.txt', encoding='utf-8') as f:
        comment_text = f.read()

    color_mask = plt.imread("comment.jpeg")  # read the background image
    # Words to leave out of the cloud
    Stopwords = [u'就是', u'作者', u'你们', u'这么', u'不过', u'但是', u'什么', u'没有',
                 u'这个', u'那个', u'大家', u'比较', u'看到', u'真是',
                 u'除了', u'时候', u'已经', u'可以', u'，', u'。']
    cloud = WordCloud(font_path="FZYTK.TTF",  # Chinese font; without it Chinese text cannot be displayed
                      background_color='white',
                      max_words=200,
                      max_font_size=200,
                      min_font_size=4,
                      mask=color_mask,
                      stopwords=Stopwords)
    word_cloud = cloud.generate(comment_text)  # generate the word cloud
    image_colors = ImageColorGenerator(color_mask)

    # First figure: the word cloud with its default colors
    plt.imshow(cloud)
    plt.axis("off")

    # Second figure: the word cloud recolored with the colors of the background image
    plt.figure()
    plt.imshow(cloud.recolor(color_func=image_colors))
    plt.axis("off")

    # Third figure: the background image itself
    plt.figure()
    plt.imshow(color_mask, cmap=plt.cm.gray)
    plt.axis("off")
    plt.show()
    # Save the image
    word_cloud.to_file("comment_cloud.jpg")
    print('Word cloud image saved successfully')

Result: [image of the generated word cloud]
