Analyzing the Film "The Wild Goose Lake" (南方车站的聚会) with Python


"The Wild Goose Lake" (南方车站的聚会) was directed by Diao Yinan and stars Hu Ge, Gwei Lun-mei, Liao Fan, and Wan Qian. It premiered at the Cannes Film Festival on May 18, 2019, and was officially released in China on December 6, 2019. The story, inspired by real news events, follows Zhou Zenong (played by Hu Ge), the leader of a theft gang, who goes on the run with a large bounty on his head and struggles to find redemption.

More than a week after its release, the film has taken in close to 200 million yuan at the box office, which for an art film is an above-average result. Next, let's open Douban and look at its rating, as shown below:

The figure shows that about 130,000 people have rated the film so far, giving it a score of 7.5. Most ratings are 4 or 3 stars, so word of mouth is not polarized (口碑两极分化) as some people online claim. If it were polarized, 5-star and 1-star ratings would make up the majority.
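To make the polarization argument concrete, one could measure what share of all ratings sits at the two extremes. The sketch below is only an illustration: the helper name `extreme_share` and the sample percentages are made up for this example and are not taken from Douban.

```python
def extreme_share(star_counts):
    """Return the fraction of ratings that are 1-star or 5-star.

    star_counts maps a star level (1-5) to a count or a percentage.
    """
    total = sum(star_counts.values())
    if total == 0:
        return 0.0
    extremes = star_counts.get(1, 0) + star_counts.get(5, 0)
    return extremes / total


# Hypothetical distribution dominated by 3 and 4 stars (illustrative numbers only).
sample = {5: 15, 4: 45, 3: 30, 2: 7, 1: 3}
print(round(extreme_share(sample), 2))
```

A value well below 0.5 means the extremes are a minority, which is the situation described above.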

Scrolling down the page to the short comments section, we see the following:

There are more than 50,000 short comments. Douban currently limits how many of them can be viewed: up to 200 without logging in, and up to 500 when logged in. Our task is to crawl those 500 comments with Python and analyze the data.

First, we need the URL of the comment list. Click 全部 52846 条 ("All 52,846") in the image above to open the comments page, as shown below:

There is a problem, though: the URL carries no paging parameters such as a row offset, which we need in order to turn pages. Clicking the 后页 ("Next page") button solves this, as shown:

Now the paging information appears in the URL. Since the start parameter is the one that varies, we turn the URL into the template https://movie.douban.com/subject/27668250/comments?start=%d&limit=20&sort=new_score&status=P and use it as the starting URL for the crawl.
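As a small illustration of how the start parameter drives paging, the sketch below expands the URL into the 25 page URLs needed for 500 comments. The helper name `page_urls` and its default arguments are assumptions made for this example.

```python
COMMENT_URL = ('https://movie.douban.com/subject/27668250/comments'
               '?start={start}&limit={limit}&sort=new_score&status=P')


def page_urls(total=500, page_size=20):
    """Build one URL per page; start advances by page_size each time."""
    return [COMMENT_URL.format(start=start, limit=page_size)
            for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[1])
```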

Next, let's look at how to log in. First open the login page, as shown below:

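Enter arbitrary values in the phone/email and password fields (do not enter a real username and password), press F12 to open the developer tools, and then click the 登录豆瓣 ("Log in to Douban") button. The result is shown below: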

Click the basic entry shown in the figure above. The result is shown below:

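Here we can see the Request URL (the URL used for logging in) and the Form Data section, both of which we need in order to log in. We also need a User-Agent, which can be found under the Request Headers section shown in the figure above: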

With everything we need in hand, we can move on to the implementation. The Douban login and comment crawling are implemented as follows:

import csv
import random
import time

import requests
from lxml import etree

# Create the CSV file
csvfile = open('南方车站的聚会.csv', 'w', encoding='utf-8', newline='')
writer = csv.writer(csvfile)
# Header row: time, star rating, comment text
writer.writerow(['时间', '星级', '评论内容'])

def spider():
    # Login URL (the Request URL seen in the developer tools)
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    comment_url = 'https://movie.douban.com/subject/27668250/comments?start=%d&limit=20&sort=new_score&status=P'
    # Form Data fields seen in the developer tools
    data = {
        'ck': '',
        'name': 'your_username',
        'password': 'your_password',
        'remember': 'false',
        'ticket': ''
    }
    session = requests.session()
    # Log in first so that up to 500 comments can be viewed
    session.post(url=url, headers=headers, data=data)
    # 500 comments in total, 20 per page
    for i in range(0, 500, 20):
        # Fetch the HTML (i is the start offset, not the page number)
        response = session.get(comment_url % i, headers=headers)
        print('offset:', i, 'status code:', response.status_code)
        # Pause for 0-1 seconds
        time.sleep(random.random())
        # Parse the HTML
        selector = etree.HTML(response.text)
        # All comments on the current page
        comments = selector.xpath('//div[@class="comment"]')
        for comment in comments:
            # Star rating: the character at index 7 of the class attribute
            star = comment.xpath('.//h3/span[2]/span[2]/@class')[0][7]
            # Comment time
            t = comment.xpath('.//h3/span[2]/span[3]/text()')
            # Comment text
            content = comment.xpath('.//p/span/text()')[0].strip()
            # Skip entries whose time is empty
            if len(t) != 0:
                t = t[0].strip()
                writer.writerow([t, star, content])

spider()
csvfile.close()
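The crawler above reads the star rating as the character at index 7 of the class attribute. A slightly more defensive way to do the same thing is sketched below. It assumes the attribute looks like `allstar40 rating`, which is inferred from the indexing in the crawler rather than from any Douban documentation, and the helper name `parse_star` is hypothetical.

```python
import re

# Assumed class format: "allstar40 rating" means 4 stars.
STAR_CLASS = re.compile(r'allstar(\d)0')


def parse_star(class_attr):
    """Return the star rating as an int, or None if the class carries no rating."""
    match = STAR_CLASS.search(class_attr)
    return int(match.group(1)) if match else None


print(parse_star('allstar40 rating'))
print(parse_star('comment-time '))
```

Returning None for an unrated comment makes the missing rating explicit instead of relying on the empty-time check to filter it out.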

Next, we use a word cloud to give an intuitive view of the comments as a whole. The implementation is as follows:

import csv

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Segment the comments with jieba
def jieba_():
    comments = ''
    with open('南方车站的聚会.csv', 'r', encoding='utf-8') as f:
        csv_list = csv.reader(f)
        for i, line in enumerate(csv_list):
            # Skip the header row
            if i != 0:
                comment = line[2]
                comments += comment
    print("comment-->", comments)
    # jieba word segmentation
    words = jieba.cut(comments)
    new_words = []
    # Words to exclude
    remove_words = ['以及', '在于', '一些', '一场', '只有',
                    '不过', '东西', '场景', '所有', '这么',
                    '但是', '全片', '之前', '一部', '一个',
                    '作为', '虽然', '一切', '怎么', '表现',
                    '人物', '没有', '不是', '一种', '个人',
                    '如果', '之后', '出来', '开始', '就是',
                    '电影', '还是', '武汉', '镜头']
    for word in words:
        if word not in remove_words:
            new_words.append(word)
    global word_cloud
    # Join the words with commas
    word_cloud = ','.join(new_words)

# Generate the word cloud
def world_cloud():
    # Mask image
    cloud_mask = np.array(Image.open('bg.jpg'))
    wc = WordCloud(
        # Background color
        background_color='white',
        # Shape of the cloud, taken from the mask image
        mask=cloud_mask,
        # Maximum number of words to show
        max_words=600,
        # Font that can render Chinese characters
        font_path='./fonts/simhei.ttf',
        # Font size limits
        min_font_size=20,
        max_font_size=100,
        margin=5
    )
    global word_cloud
    x = wc.generate(word_cloud)
    # Build the word cloud image
    image = x.to_image()
    # Show the word cloud image
    image.show()
    # Save the word cloud image
    wc.to_file('wc.png')

jieba_()
world_cloud()

Word cloud of all comments
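A word cloud is essentially a picture of word frequencies after stop-word filtering, so it can help to inspect the counts directly. The following is a minimal sketch using only the standard library. The token list is made up, the helper name `top_words` is hypothetical, and dropping single-character tokens is an extra assumption that the word cloud code does not make.

```python
from collections import Counter


def top_words(words, stop_words, n=3):
    """Count the words that survive filtering and return the n most common."""
    filtered = [w for w in words if len(w) > 1 and w not in stop_words]
    return Counter(filtered).most_common(n)


# Made-up tokens standing in for the output of jieba.cut().
tokens = ['电影', '胡歌', '镜头', '胡歌', '氛围', '的', '氛围', '胡歌']
stop = {'电影', '镜头'}
print(top_words(tokens, stop))
```

Using a set for the stop words keeps each membership test cheap, which matters once the comment text grows to hundreds of reviews.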

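Since some people say the film's word of mouth is polarized, let's look at the word clouds for 1-star and 5-star comments. The main change is as follows: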

for i,line in enumerate(csv_list):
    if i != 0:
        star = line[1]
        comment = line[2]
        # Use '1' for 1-star comments and '5' for 5-star comments
        if star == '1':
            comments += comment

Word cloud of 1-star comments

Word cloud of 5-star comments
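Rather than editing the star value in the loop by hand for each run, the filter can be wrapped in a small function. The sketch below assumes the CSV column order used by the crawler (time, star, comment text). The helper name `comments_by_star` and the sample rows are made up for illustration.

```python
import csv
import io


def comments_by_star(rows, star):
    """Join the comment text of every row whose star column equals `star`."""
    return ''.join(row[2] for row in rows if len(row) >= 3 and row[1] == star)


# Made-up sample data in the same column order as the crawler's CSV.
sample_csv = io.StringIO(
    '时间,星级,评论内容\n'
    '2019-12-06,5,很喜欢\n'
    '2019-12-07,1,看不懂\n'
    '2019-12-08,5,氛围很好\n'
)
rows = list(csv.reader(sample_csv))[1:]  # skip the header row
one_star = comments_by_star(rows, '1')
five_star = comments_by_star(rows, '5')
print(one_star)
print(five_star)
```

Each returned string can then be segmented and passed to the word cloud code in the same way as the full comment text.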


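So far we have only used the comment text; the time and star rating have gone unused. Finally, we can use these two fields to see how the star rating fluctuates over time, computing the monthly average from the premiere (May 2019) to the present (December 2019). The implementation is as follows: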

import csv
from datetime import datetime

import pyecharts.options as opts
from pyecharts.charts import Line

def score():
    ts = []
    ss = set()
    with open('南方车站的聚会.csv', 'r', encoding='utf-8') as f:
        csv_list = csv.reader(f)
        for i, line in enumerate(csv_list):
            # Skip the header row
            if i != 0:
                # Keep only the year and month, e.g. '2019-12'
                t = line[0][0:7]
                s = line[1]
                ts.append(t + ':' + s)
                ss.add(t)
    # Sort the months chronologically
    new_ss = sorted(ss, key=lambda m: datetime.strptime(m, '%Y-%m'))
    print('new_ss', new_ss)
    new_times = []
    new_stars = []
    for i in new_ss:
        # x: sum of stars in month i, z: number of ratings in month i
        x = 0
        z = 0
        for j in ts:
            t = j.split(':')[0]
            s = int(j.split(':')[1])
            if i == t:
                x += s
                z += 1
        new_times.append(i)
        # Average star rating for the month, rounded to one decimal place
        new_stars.append(round(x / z, 1))
    c = (
        Line()
        .add_xaxis(new_times)
        .add_yaxis('南方车站的聚会', new_stars)
        .set_global_opts(title_opts=opts.TitleOpts(title='豆瓣星级波动图'))
    ).render()

score()

The star rating fluctuation is shown below:

From the fluctuation in star ratings we can also roughly anticipate how the film's score will move.
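The monthly average itself can be computed in a single pass with a dictionary, without the nested loop. This is only a sketch of that alternative: the helper name `monthly_average` and the sample rows are made up, and it assumes each time value starts with `YYYY-MM`, as the slicing in the code above implies.

```python
from collections import defaultdict


def monthly_average(rows):
    """Return (months, averages) sorted by month; each row is (time, star, ...)."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of stars, number of ratings]
    for row in rows:
        month = row[0][:7]
        totals[month][0] += int(row[1])
        totals[month][1] += 1
    months = sorted(totals)  # 'YYYY-MM' strings sort chronologically
    averages = [round(totals[m][0] / totals[m][1], 1) for m in months]
    return months, averages


# Made-up sample rows for illustration.
sample_rows = [
    ('2019-12-06', '4', ''),
    ('2019-05-19', '5', ''),
    ('2019-12-08', '3', ''),
    ('2019-05-20', '4', ''),
]
months, averages = monthly_average(sample_rows)
print(months, averages)
```

The two returned lists can be passed to `add_xaxis` and `add_yaxis` in place of the lists built by the nested loop.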

Reference:
https://baike.baidu.com/item/%E5%8D%97%E6%96%B9%E8%BD%A6%E7%AB%99%E7%9A%84%E8%81%9A%E4%BC%9A/22547693?fr=aladdin

Origin www.cnblogs.com/ityard/p/12075904.html