Crawling Douban movie reviews of "The Wandering Earth" with Python, plus data analysis and visualization

"Wandering Earth" was released nationwide on New Year's Day. In the Douban score, the opening score on the first day stood above 8 points, continuing the high reputation that was previously shown. On Weibo, a 31-day guest search of Wu Jing and a hot search of 60 million investment followed. Zhihu's answer to "how to evaluate Liu Cixin's novel adaptation of the movie" Wandering Earth "" has attracted many people's attention, including the film director Guo Fan's highest praise answer.

This article crawls part of the short reviews of "The Wandering Earth" from the Douban website and performs data analysis and visualization on them. Below is the whole crawling-and-analysis process, so let's get started!

1. Webpage Analysis

Douban has restricted crawling since October 2017. Only 200 short comments can be fetched without logging in, and at most 500 when logged in. Requests are limited to roughly 40 per minute during the day and 60 per minute at night; exceeding the limit gets the IP blocked. When the author had crawled about 400 records, the IP was blocked and the account was forced offline; it was later restored via SMS verification. Crawling repeatedly at large scale is therefore not recommended.

Fields to collect

  • Comment user

  • Comment text

  • Rating

  • Comment date

  • User's city

It is worth noting that the movie's ID in the address bar is 26266893 (for other movies you only need to change this ID). Each page holds 20 short comments, so I crawled 20 pages. The comment page itself does not show the user's city; you have to open each user's profile page to get that information.
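For reference, the short-comment pages of this movie follow the URL pattern below, where the start parameter advances by 20 per page (the full request is built in the get_content function later):

https://movie.douban.com/subject/26266893/comments?start=0&limit=20&sort=new_score&status=P

Later pages only change the start value (20, 40, 60, ...).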

 

2. Data acquisition and storage

1 Get cookies

Using the Chrome browser, press F12 to open the developer tools, refresh the page (F5) so that the requests show up, and copy the cookie and header values from the network panel.

2 Load headers and cookies, and use the requests library to get information

import requests

def get_content(id, page):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    cookies = {'cookie': 'bid=GOOb4vXwNcc; douban-fav-remind=1; ps=y; ue="[email protected]"; push_noty_num=0; push_doumail_num=0; ap=1; ll="108288"; dbcl2="181095881:BSb6IVAXxCI"; ck=Fd1S; ct=y'}
    # each page holds 20 short comments, so the offset advances by 20 per page
    url = "https://movie.douban.com/subject/" + str(id) + "/comments?start=" + str(page * 20) + "&limit=20&sort=new_score&status=P"
    res = requests.get(url, headers=headers, cookies=cookies)
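
The excerpt stops after the request. Before the field-by-field parsing in the next step, the response text presumably gets turned into an lxml element tree; a minimal sketch of that step:

from lxml import etree
import re   # used by the date-vs-score check in the parsing loop below

x = etree.HTML(res.text)   # the XPath queries in the next step run against this tree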

 

3 Parse the required data

XPath is used for parsing here. It turns out that some users wrote a comment but gave no rating, so the XPath positions of the score and the date shift. A check is therefore needed: if a date is parsed out where the score should be, the comment has no rating.

for i in range(1, 21):   # 20 comment users per page
    name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
    # big gotcha: if a user commented without rating, the "score" parsed below is actually the date,
    # and the date position span[3] is empty
    score = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[2]/@title'.format(i))
    date = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[3]/@title'.format(i))
    m = r'\d{4}-\d{2}-\d{2}'
    try:
        match = re.compile(m).match(score[0])
    except IndexError:
        break
    if match is not None:
        # the "score" is actually a date, so this comment carries no rating
        date = score
        score = ["null"]
    else:
        pass
    content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/span/text()'.format(i))
    id = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/@href'.format(i))
    try:
        city = get_city(id[0], i)  # fetch the commenter's city from their user page
    except IndexError:
        city = " "
    name_list.append(str(name[0]))
    score_list.append(str(score[0]).strip('[]\''))  # some users commented without giving a rating
    date_list.append(str(date[0]).strip('[\'').split(' ')[0])
    content_list.append(str(content[0]).strip())
    city_list.append(city)
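
The get_city helper called above is not included in this excerpt. A minimal sketch of what it might look like, assuming the commenter's profile page exposes the location in a user-info block (the exact XPath is an assumption and may need adjusting to Douban's actual markup):

import requests
from lxml import etree

def get_city(user_url, i):
    # i is the comment index passed by the caller; kept only to match the call site
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    res = requests.get(user_url, headers=headers)
    x = etree.HTML(res.text)
    # hypothetical selector: the location link inside the profile's user-info block
    city = x.xpath('//div[@class="user-info"]/a/text()')
    return city[0].strip() if city else " "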

 

4 Get movie name

Only the 8-digit subject ID of the movie can be read from the URL, so the page has to be parsed to get the movie name corresponding to that ID. This feature was added in a later improvement, so to avoid changing too much of the existing code (a bit of laziness), the name is passed around in a global variable movie_name; note the global declaration it requires.

pattern = re.compile('<div id="wrapper">.*?<div id="content">.*?<h1>(.*?) 短评</h1>', re.S)
global movie_name
movie_name = re.findall(pattern, res.text)[0]  # findall returns a list; take the first match

 

5 Data storage

Since there is not much data, choose CSV storage.

import time
import random
import pandas as pd
from tqdm import tqdm

def main(ID, pages):
    global movie_name
    for i in tqdm(range(0, pages)):  # Douban only exposes up to 500 comments
        get_content(ID, i)  # first argument: the Douban movie ID; second: number of comment pages to crawl
        time.sleep(round(random.uniform(3, 5), 2))  # random delay between requests
    infos = {'name': name_list, 'city': city_list, 'content': content_list, 'score': score_list, 'date': date_list}
    data = pd.DataFrame(infos, columns=['name', 'city', 'content', 'score', 'date'])
    data.to_csv(movie_name + ".csv")  # saved as <movie name>.csv
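
Putting it together with the subject ID and page count from the webpage-analysis step (the global result lists are assumed to have been created beforehand):

if __name__ == '__main__':
    main(26266893, 20)   # "The Wandering Earth": subject 26266893, 20 pages of short comments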

 

 

3. Data analysis and visualization

1 Data preprocessing

Filter the Chinese characters out of the city information:

def translate(str):
    line = str.strip()
    p2 = re.compile(r'[^\u4e00-\u9fa5]')   # Chinese characters fall in the Unicode range \u4e00-\u9fa5
    zh = " ".join(p2.split(line)).strip()
    zh = ",".join(zh.split())
    str = re.sub(r"[A-Za-z0-9!!,%\[\],。]", "", zh)
    return str
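
A quick illustration on a made-up mixed string, just to show what the filter keeps:

print(translate("Beijing 北京"))   # -> 北京: only the Chinese characters survive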

The city names also have to be matched against the list of cities that pyecharts supports, otherwise plotting them fails (a sketch of that matching step follows the snippet below). The snippet below is the sentiment-counting part: it reads the CSV, scores every comment with SnowNLP and counts how often each score occurs.

import pandas as pd
from snownlp import SnowNLP

def count_sentiment(csv_file):
    d = pd.read_csv(csv_file, engine='python', encoding='utf-8')
    motion_list = []
    for i in d['content']:
        try:
            s = round(SnowNLP(i).sentiments, 2)   # sentiment score in [0, 1], rounded to 2 decimals
            motion_list.append(s)
        except TypeError:
            continue
    # count how many comments fall on each sentiment score
    result = {}
    for i in set(motion_list):
        result[i] = motion_list.count(i)
    return result
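
No code for the city matching mentioned above survives in this excerpt. A minimal sketch, under the assumption that the old pyecharts 0.5.x Geo raises a ValueError for city names it has no coordinates for (the function name match_city is hypothetical); the attr/val lists it returns are the ones the city charts in the city-analysis step expect:

from collections import Counter
from pyecharts import Geo

def match_city(city_list):
    counts = Counter(c for c in city_list if c.strip())
    attr, val = [], []
    probe = Geo("probe")
    for city, num in counts.items():
        try:
            probe.add("", [city], [num])   # unsupported city names are assumed to raise ValueError here
            attr.append(city)
            val.append(num)
        except ValueError:
            continue   # skip cities pyecharts cannot place on the map
    return attr, val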

 

2 Sentiment analysis based on snownlp

SnowNLP can handle Chinese word segmentation (a character-based generative model), part-of-speech tagging (TnT, a 3-gram hidden Markov model), sentiment analysis (the principle is described on the official site; accuracy is higher for shopping reviews because the built-in corpus is mostly shopping related, but you can train a corpus for your own domain, swap it in, and get quite good accuracy), text classification (Naive Bayes), Pinyin conversion, traditional-to-simplified conversion, keyword extraction (TextRank), summary extraction (TextRank), sentence splitting, and text similarity (BM25) [summarized from CSDN]. Before reading further, it is recommended to skim the official page, which covers the most basic commands. Official page: https://pypi.org/project/snownlp/

Since SnowNLP works entirely in unicode, make sure the input is unicode encoded. Because everything is converted to one unified encoding, there is no need to strip the English mixed into the Chinese text first. The count_sentiment function above simply calls SnowNLP's built-in corpus to score the text; since that corpus focuses on shopping reviews, training your own corpus is a way to improve the sentiment-analysis accuracy. A tiny usage example follows.
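A minimal SnowNLP call on a made-up comment, just to show the sentiments interface used in count_sentiment:

from snownlp import SnowNLP

s = SnowNLP(u"特效很棒,剧情也不错")   # sample comment text, for illustration only
print(s.sentiments)                   # a float in [0, 1]; closer to 1 means more positive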

from pyecharts import Line

attr, val = [], []
info = count_sentiment(csv_file)
info = sorted(info.items(), key=lambda x: x[0], reverse=False)  # sort the dict items by sentiment score
for each in info[:-1]:
    attr.append(each[0])
    val.append(each[1])
line = Line(csv_file + ":影评情感分析")
line.add("", attr, val, is_smooth=True, is_more_utils=True)
line.render(csv_file + "_情感分析曲线图.html")

 

3 City analysis

With pyecharts' Page object, several charts can be combined into a single page; each chart only needs to be added to the page. A minimal setup is sketched below, followed by the four charts.
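
A minimal setup sketch, assuming the old pyecharts 0.5.x API that the snippets in this article use:

from pyecharts import Page, Geo, Bar, Pie

page = Page()   # the charts created below are collected here and rendered into one HTML file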

    geo1 = Geo("", "评论城市分布", title_pos="center", width=1200, height=600,
               background_color='#404a59', title_color="#fff")
    geo1.add("", attr, val, visual_range=[0, 300], visual_text_color="#fff", is_geo_effect_show=False,
             is_piecewise=True, visual_split_number=10, symbol_size=15, is_visualmap=True, is_more_utils=True)
    # geo1.render(csv_file + "_城市dotmap.html")
    page.add_chart(geo1)
    geo2 = Geo("", "评论来源热力图", title_pos="center", width=1200, height=600, background_color='#404a59', title_color="#fff")
    geo2.add("", attr, val, type="heatmap", is_visualmap=True, visual_range=[0, 50], visual_text_color='#fff', is_more_utils=True)
    # geo2.render(csv_file + "_城市heatmap.html")  # output named after the CSV file
    page.add_chart(geo2)
    bar = Bar("", "评论来源排行", title_pos="center", width=1200, height=600)
    bar.add("", attr, val, is_visualmap=True, visual_range=[0, 100], visual_text_color='#fff', mark_point=["average"], mark_line=["average"],
            is_more_utils=True, is_label_show=True, is_datazoom_show=True, xaxis_rotate=45)
    bar.render(csv_file + "_城市评论bar.html")  # output named after the CSV file
    page.add_chart(bar)
    pie = Pie("", "评论来源饼图", title_pos="right", width=1200, height=600)
    pie.add("", attr, val, radius=[20, 50], label_text_color=None, is_label_show=True, legend_orient='vertical', is_more_utils=True, legend_pos='left')
    pie.render(csv_file + "_城市评论Pie.html")  # output named after the CSV file
    page.add_chart(pie)
    page.render(csv_file + "_城市评论分析汇总.html")

 

 

4 Film sentiment analysis

Sentiment scores below 0.5 are treated as negative and scores above 0.5 as positive. The distribution shows that positive comments clearly dominate; on Douban only a small share of the comments are negative.

5 Analysis of the trend of movie ratings

  • Read the CSV file and keep it as a DataFrame (df)

  • Iterate over the rows of the df and save them into a list

  • Count how many identical ratings fall on each date

  • Convert the counts back into a df and set the column names

  • Sort by date

  • Iterate over the new df: every date should carry all 5 rating levels, so the missing combinations have to be filled in with zeros (a sketch of the earlier steps and the fill-in loop follow below)
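
A hedged sketch of the first five steps, assuming the same csv_file and the 'date'/'score' columns written out earlier; the names info_new and mark are the ones the fill-in loop below expects:

import pandas as pd
from collections import Counter

info = pd.read_csv(csv_file, engine='python', encoding='utf-8')
votes = Counter(zip(info['date'], info['score']))        # count identical ratings on each date
info_new = pd.DataFrame([[s, d, v] for (d, s), v in votes.items()],
                        columns=['score', 'date', 'votes'])
info_new = info_new.sort_values(by='date')               # sort by date
mark = 0                                                  # running index for the fill-in DataFrame below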

creat_df = pd.DataFrame(columns=['score', 'date', 'votes'])  # empty DataFrame collecting the missing (date, score) rows
for i in list(info_new['date']):
    for level in ["力荐", "推荐", "还行", "较差", "很差"]:
        location = info_new[(info_new.date == i) & (info_new.score == level)].index.tolist()
        if location == []:   # this rating level got no votes on this date, record it with 0 votes
            creat_df.loc[mark] = [level, i, 0]
            mark += 1
info_new = info_new.append(creat_df.drop_duplicates(), ignore_index=True)

 

Because of the small amount of data that can be crawled and time constraints, some of the trends are not very pronounced, but a few observations can still be made. The peak of the reviews falls within a week of the film's release, and especially within the first 3 days, which matches common sense. There may be some bias, though, because the crawler fetches the comments in Douban's own sort order; sorting by time would probably be closer to the real situation.

In addition, some comments turn out to have been posted before the film's official release, presumably from small-scale preview screenings. The average rating of these early viewers is close to the final rating after the wide release. From this detail we might guess that those who got to watch the film in advance were senior movie fans or film-industry practitioners, and their reviews are a fairly good reference.

6 Film Review Word Cloud

To build the word cloud, first read the CSV file into a DataFrame, remove the non-Chinese text from the comments, pick a photo of Hu Ge as the background mask, and set up a stop-word list.

from wordcloud import WordCloud

wc = WordCloud(width=1024, height=768, background_color='white',
               mask=backgroud_Image, font_path=r"C:\simhei.ttf",
               stopwords=stopwords, max_font_size=400, random_state=50)
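
The excerpt only builds the WordCloud object; a hedged continuation that feeds it the segmented comment text and saves the image (all_comments stands for the cleaned Chinese comment string, and the output filename is illustrative):

import jieba

text = " ".join(jieba.cut(all_comments))   # segment the Chinese comments into space-separated words
wc.generate(text)                          # build the word cloud from the word frequencies
wc.to_file("影评词云.png")                  # save the image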

From the word cloud, the high-frequency positive words show recognition of the film, "special effects" reflects how important effects shots are for a science-fiction film, and "science fiction movie" shows the audience's strong interest in the genre.

The above is the whole crawling and data-analysis process for the Douban short comments on "The Wandering Earth".

Original link to WeChat official account

Reply "Wandering Earth" in the backend of the WeChat official account "financial learner who learns programming" to get the source code.

 


Origin: blog.csdn.net/weixin_39270299/article/details/87200198