Python crawler: word-cloud analysis of the hit movie "Later Us"

1 Library usage notes
1.1 requests library
requests is an HTTP library written in Python. Built on urllib and released under the Apache2 License, it is more convenient than urllib, saves a great deal of boilerplate, and fully meets the needs of HTTP work.
1.2 urllib library
The request module of the urllib library can conveniently fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.
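As a minimal sketch of the urllib usage the crawler relies on (the subject id below is a placeholder for illustration, not a real movie id from this tutorial):

```python
from urllib import parse

# Build the kind of paginated comments URL the crawler requests.
# Douban shows 20 comments per page, addressed by a "start" offset.
movie_id = '26683290'  # placeholder subject id
page_num = 2
start = (page_num - 1) * 20
query = parse.urlencode({'start': start, 'limit': 20})
requrl = 'https://movie.douban.com/subject/' + movie_id + '/comments?' + query
print(requrl)  # https://movie.douban.com/subject/26683290/comments?start=20&limit=20

# request.urlopen(requrl) would then send the GET request, and
# resp.read().decode('utf-8') would yield the HTML text (network call omitted here).
```

The actual crawl in the full code below builds the same URL by string concatenation; urlencode is just the more robust way to assemble the query string.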
1.3 jieba library

"Jieba" Chinese word segmentation: it aims to be the best Python Chinese word-segmentation component.

1.4 BeautifulSoup library
Beautiful Soup is an HTML/XML parser written in Python that tolerates malformed markup and builds a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying that tree.
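A small self-contained sketch of the BeautifulSoup calls the crawler uses, run against an inline HTML fragment that mimics the structure of Douban's now-playing list (the id and title here are made up):

```python
from bs4 import BeautifulSoup

# Fragment imitating the div#nowplaying / li.list-item markup the crawler parses.
html = '''
<div id="nowplaying">
  <li class="list-item" data-subject="12345">
    <img alt="Later Us" src="poster.jpg"/>
  </li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
nowplaying = soup.find_all('div', id='nowplaying')
items = nowplaying[0].find_all('li', class_='list-item')
movies = [{'id': li['data-subject'], 'name': li.find('img')['alt']} for li in items]
print(movies)  # [{'id': '12345', 'name': 'Later Us'}]
```

This is exactly the navigation pattern `getNowPlayingMovie_list` in the full code performs on the live page.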
1.5 pandas library

pandas is a powerful Python data-analysis library; here it is used to tabulate the segmented words and filter out stop words.
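A minimal sketch of the pandas stop-word filtering step used later in the tutorial (the word list and stop-word list here are made up for illustration):

```python
import pandas as pd

# Segmented words in one column; stop words in another frame.
words_df = pd.DataFrame({'segment': ['电影', '的', '很', '好看', '的']})
stopwords = pd.DataFrame({'stopword': ['的', '很']})

# Keep only rows whose word is NOT in the stop-word list.
filtered = words_df[~words_df.segment.isin(stopwords.stopword)]
print(filtered.segment.tolist())  # ['电影', '好看']
```

The full code does the same thing, except the stop words come from a `stopwords.txt` file.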
1.6 re library
A regular expression (the re library) is a concise notation for describing a set of strings; its advantage is brevity. We use it here for string cleaning.
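The same cleaning idea appears later in the full code: keep only runs of Chinese characters, which drops punctuation, digits, and Latin letters in one pass. A small illustration with a made-up comment string:

```python
import re

text = '后来的我们, 很好看!!! 10/10'
# \u4e00-\u9fa5 covers the common CJK unified ideographs.
pattern = re.compile(r'[\u4e00-\u9fa5]+')
cleaned = ''.join(pattern.findall(text))
print(cleaned)  # 后来的我们很好看
```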
1.7 wordcloud library
The WordCloud class in the wordcloud package generates word-cloud images in Python; we use it to produce the final word cloud for the current hit movie.
2 Requirements statement
Crawl the Douban now-playing movie data for the city of Ankang from https://movie.douban.com/cinema/nowplaying/ankang/. The task has three main steps:

- Crawl the web-page data
- Clean the data
- Display the result as a word cloud

Python 3.6 is used. Chinese word segmentation and a word cloud are applied to the top-ranked movie on the Douban now-playing list, and the corresponding word cloud is displayed.

3 Algorithms for grabbing and processing data

1) Install the requests module

2) Install the beautifulsoup4 module it works with

3) Inspect the structure of the website to be crawled

4) Preliminary code: crawl the information for the movies currently in theaters

5) Grab the hot-comment information for the top hot movie and display it

6) Clean the data: remove the malformed entries left over from the previous step, then strip punctuation, leaving the cleaned comment text for "Later Us"

7) Install the pandas module, and install the wordcloud library and the remaining packages the same way

def main():
    # fetch the first 10 pages of comments for the first movie
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

A for loop walks through the first ten pages of comments for the top-ranked movie.
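The loop above maps page numbers 1 through 10 onto Douban's 20-comments-per-page `start` offsets, which is what `getCommentsById` computes internally:

```python
# Offsets requested for pages 1..10, 20 comments per page.
offsets = [(page - 1) * 20 for page in range(1, 11)]
print(offsets)  # [0, 20, 40, 60, 80, 100, 120, 140, 160, 180]
```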

Full code:

# coding:utf-8
__author__ = 'LiuYang'

import warnings

warnings.filterwarnings("ignore")
import jieba  # Chinese word-segmentation package
import numpy  # numerical computing package
import codecs  # codecs.open can specify a file's encoding and decodes to unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs


import matplotlib

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # 词云包


# parse the now-playing page
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/ankang/')  # movies now playing in the Ankang area on Douban
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# crawl one page of comments for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_list = soup.find_all('div', class_='comment')
    for item in comment_div_list:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList


def main():
    # fetch the first 10 pages of comments for the first movie
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # join the per-page comment lists into one string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # use a regex to strip punctuation: keep only Chinese characters
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # Chinese word segmentation with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # remove stop words
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'],
                            encoding='utf-8')  # quoting=3: no quote handling
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # count word frequencies (named aggregation; the dict form of agg was removed from newer pandas)
    words_stat = words_df.groupby('segment')['segment'].agg(count='size')
    words_stat = words_stat.reset_index().sort_values(by='count', ascending=False)

    # render with a word cloud
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

    # fit_words expects a dict mapping word -> frequency
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.savefig("ciyun_jieguo.jpg")

# entry point
if __name__ == '__main__':
    main()

The run completes successfully.

Opening the code's directory shows the generated word-cloud image:

[Figure: word-cloud result]

4 Analysis of the results
We selected the movie listings for cinemas in the Ankang area. First, the movies currently in theaters were analyzed to find the most popular one. Second, the comments on that top-ranked movie, "Later Us", were cleaned: malformed entries, punctuation, and Chinese filler words were removed, and the highest-frequency words extracted. Finally, the word cloud for the movie was generated.
The resulting word-cloud chart shows that we successfully crawled the Douban movie information for the Ankang area and the movies currently showing, and from it obtained the feature tags of the hit movie "Later Us". At a glance it reflects the movie's reception, the viewers' feelings, the main characters, the director, and so on.
