A Python movie crawler: storing the data in Excel and performing visual data analysis

1. Crawling web data

1. Analyze web pages


(1) Web page data type

First, check what type of data the page returns, for example plain text, HTML, or JSON. (Here the ranking data turns out to be JSON.)
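
As a quick check, you can inspect the response's Content-Type header in the browser's developer tools, or with a short snippet like the following (a sketch of my own; the endpoint and parameters are taken from later in this article):

import requests

# Quick check of what the ranking endpoint returns; a JSON content type
# confirms we can parse the body directly with response.json().
resp = requests.get(
    "https://movie.douban.com/j/chart/top_list",
    params={"type": "11", "interval_id": "100:90", "action": "", "start": "0", "limit": "1"},
    headers={"User-Agent": "Mozilla/5.0"},
)
print(resp.headers.get("Content-Type"))  # expected to contain "application/json"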


(2) Differences in URLs of different types of movie rankings

Then we analyze the differences between the URLs in each ranking:

https://movie.douban.com/typerank?type_name=Drama&type=11&interval_id=100:90&action=
https://movie.douban.com/typerank?type_name=Comedy&type=24&interval_id=100:90&action=
https://movie.douban.com/typerank?type_name=Love&type=13&interval_id=100:90&action=

Comparing these three URLs, it is easy to see that the mapping is dictionary-like: each category name (type_name) has its own corresponding number (type).

So I recorded the mapping in a dictionary:

type_name = {
    "剧情": "11", "喜剧": "24", "动作": "5", "爱情": "13", "科幻": "17",
    "动画": "25", "悬疑": "10", "惊悚": "19", "恐怖": "20", "纪录片": "1",
    "短片": "23", "情色": "6", "同性": "26", "音乐": "14", "歌舞": "7",
    "家庭": "28", "儿童": "8", "传记": "2", "历史": "4", "战争": "22",
    "犯罪": "3", "西部": "27", "奇幻": "16", "冒险": "15", "灾难": "12",
    "武侠": "29", "古装": "30", "运动": "18", "黑色电影": "31",
}

Movie classification correspondence table

The following table is the mapping I checked and compiled. Feel free to skip it, as it is fairly long.

type_name               type
剧情 (Drama)            11
喜剧 (Comedy)           24
动作 (Action)           5
爱情 (Romance)          13
科幻 (Sci-Fi)           17
动画 (Animation)        25
悬疑 (Mystery)          10
惊悚 (Thriller)         19
恐怖 (Horror)           20
纪录片 (Documentary)    1
短片 (Short Film)       23
情色 (Erotica)          6
同性 (LGBT)             26
音乐 (Music)            14
歌舞 (Musical)          7
家庭 (Family)           28
儿童 (Children)         8
传记 (Biography)        2
历史 (History)          4
战争 (War)              22
犯罪 (Crime)            3
西部 (Western)          27
奇幻 (Fantasy)          16
冒险 (Adventure)        15
灾难 (Disaster)         12
武侠 (Martial Arts)     29
古装 (Costume Drama)    30
运动 (Sports)           18
黑色电影 (Film Noir)    31



2. Write a crawler

With the above preparation done, we can start writing the crawler.

(1) First find the URL of the web page

Since I have already worked this out, I will give it directly:

url = "https://movie.douban.com/j/chart/top_list"

(2) Write the corresponding data used in the request

First, the request parameters (Params):

# A dictionary of parameters that are used in the request.
Params = {
    'type': f"{type_name[word1]}",
    'interval_id': "100:90",
    'action': None,
    'start': "0",
    'limit': f"{Number}",
}

Here type_name is the category dictionary defined above; word1 is a key in that dictionary; and Number is how many movies to crawl from each genre's ranking (it can be larger than the total number of movies in the ranking; I set it to 1000).

Then there are headers (required):

# A header that is used to identify the browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}

(3) Send the request

Send the request with requests.get, then convert the response body to a JSON object:

# Sending a request to the url with the parameters and headers.
response = requests.get(url=url, params=Params, headers=headers)

# Converting the response to a json object.
result = response.json()
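
In practice the request can fail or be throttled; a small defensive addition (mine, not part of the original code) checks the HTTP status before parsing:

# Hedged sketch: raise_for_status() raises requests.HTTPError on 4xx/5xx,
# so we never try to parse an error page as JSON.
response = requests.get(url=url, params=Params, headers=headers, timeout=10)
response.raise_for_status()
result = response.json()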



3. Save the data to Excel in xlsx format

(1) Use lists to store temporary data

Because each field (what each piece of data represents) is different, a separate list is used to store each one. Note that the first element of every list is the column label itself; this extra label row is why the sheets are later re-read with header=1:

# Creating a list for each column; the first element is the column name.
title = ["film_title",]              # film title
release_date = ["release_date",]     # release date
actors = ["actors",]                 # actors
regions = ["country_of_production"]  # country/region of production
rating = ["rating",]                 # score
vote_count = ["vote_count"]          # number of raters
rank = ["rank",]                     # rank within the single-genre chart
types = ["types"]                    # movie genres
url = ["url",]                       # link to the movie's page

(2) Use dictionaries to further store data

Use a dictionary to pair each column name with its list, then build a DataFrame from it:

# Creating a dictionary with the keys being the column names and the values being the lists of values.
output_excel = {
    "film_title": title, "release_date": release_date, "actors": actors,
    "regions": regions, "rating": rating, "vote_count": vote_count,
    "rank": rank, "types": types, "url": url,
}
output = pd.DataFrame(output_excel)

Crawler function

At this point the crawler is roughly complete, so let's wrap it all in a function for easy reuse.

import requests
import pandas as pd

def DouBan_Movie_Sperider(word1, Number):
    type_name = {
        "剧情": "11", "喜剧": "24", "动作": "5", "爱情": "13", "科幻": "17",
        "动画": "25", "悬疑": "10", "惊悚": "19", "恐怖": "20", "纪录片": "1",
        "短片": "23", "情色": "6", "同性": "26", "音乐": "14", "歌舞": "7",
        "家庭": "28", "儿童": "8", "传记": "2", "历史": "4", "战争": "22",
        "犯罪": "3", "西部": "27", "奇幻": "16", "冒险": "15", "灾难": "12",
        "武侠": "29", "古装": "30", "运动": "18", "黑色电影": "31",
    }

    url = "https://movie.douban.com/j/chart/top_list"
    Params = {
        'type': f"{type_name[word1]}",
        'interval_id': "100:90",
        'action': None,
        'start': "0",
        'limit': f"{Number}",
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }
    response = requests.get(url=url, params=Params, headers=headers)
    result = response.json()

    title = ["film_title",]              # film title
    release_date = ["release_date",]     # release date
    actors = ["actors",]                 # actors
    regions = ["country_of_production"]  # country/region of production
    rating = ["rating",]                 # score
    vote_count = ["vote_count"]          # number of raters
    rank = ["rank",]                     # rank within the single-genre chart
    types = ["types"]                    # movie genres
    url = ["url",]                       # link to the movie's page

    for i in result:
        title.append(i["title"])
        release_date.append(i["release_date"])
        actors.append(i["actors"])
        regions.append(i["regions"])
        rating.append(i["rating"][0])
        vote_count.append(i["vote_count"])
        rank.append(i["rank"])
        types.append(i["types"])
        url.append(i["url"])

    output_excel = {
        "film_title": title, "release_date": release_date, "actors": actors,
        "regions": regions, "rating": rating, "vote_count": vote_count,
        "rank": rank, "types": types, "url": url,
    }
    output = pd.DataFrame(output_excel)
    return output
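
A quick usage example (the genre argument must be one of the Chinese keys in type_name):

# Fetch up to 10 movies from the 剧情 (Drama) ranking.
df = DouBan_Movie_Sperider("剧情", 10)
print(df.head())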

(3) Save data into Excel

Next we write the data to Excel. I use exec statements (because I found them convenient at the time) to create one DataFrame per genre and write each to its own sheet. (word_list, Number, and the pd.ExcelWriter named writer are defined in the complete code below.)

# Creating a new dataframe for each word in the word_list and saving it to a new sheet in the excel file.
for m, i in enumerate(word_list, start=1):
    exec(f'output{m} = DouBan_Movie_Sperider(i,Number)')
    exec(f'output{m}.to_excel(writer,i)')
writer.save()
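
exec works here, but dynamically named variables (output1, output2, ...) are hard to debug. As a sketch of an alternative (my suggestion, not the original approach), a plain dictionary does the same job without exec:

# Alternative sketch: keep each genre's DataFrame in a dict keyed by genre
# name, then write it to its own sheet.
outputs = {}
for genre in word_list:
    outputs[genre] = DouBan_Movie_Sperider(genre, Number)
    outputs[genre].to_excel(writer, sheet_name=genre)
writer.save()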

Because the data is stored per genre, it ends up split across sheets, so we also need to merge the different sheets into a single sheet to make the subsequent data analysis easier.

(4) Merge multiple sheets into one sheet

This function takes two parameters: the original file name and the name of the merged output file. It reads each worksheet, appends it to a DataFrame, and then saves that DataFrame.

def merge_sheets(file, save_file):
    file = pd.ExcelFile(file)
    sheet_names = file.sheet_names
    print(sheet_names)
    sheet_concat = pd.DataFrame()
    for sheet in sheet_names:
        # header=1 skips the extra label row that each column list carried as its first element
        df = pd.read_excel(file, sheet_name=sheet, header=1, index_col=0)
        sheet_concat = sheet_concat.append(df)
    sheet_concat.to_excel(save_file)

file = 'Movie_douban.xlsx'
save_file = 'Movie_douban_classification_rankings.xlsx'
merge_sheets(file, save_file)

Here file is the name of the original file and save_file is the name of the merged file. With that done, we can move on to data analysis.
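
One compatibility note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the loop inside merge_sheets should build a list of frames and concatenate them once, along these lines:

# Sketch for pandas >= 2.0: replace the append loop with a single pd.concat.
frames = [
    pd.read_excel(file, sheet_name=sheet, header=1, index_col=0)
    for sheet in sheet_names
]
sheet_concat = pd.concat(frames)
sheet_concat.to_excel(save_file)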

4. Complete code

# Importing the requests library.
import requests

# Importing the pandas library and renaming it to pd.
import pandas as pd


def DouBan_Movie_Sperider(word1, Number):
    """
    Crawl the Douban ranking of one movie genre and return it as a DataFrame.

    :param word1: the genre of movie you want to search for (a key of type_name)
    :param Number: the number of movies you want to crawl
    :return: a DataFrame
    """
    type_name = {
        "剧情": "11", "喜剧": "24", "动作": "5", "爱情": "13", "科幻": "17",
        "动画": "25", "悬疑": "10", "惊悚": "19", "恐怖": "20", "纪录片": "1",
        "短片": "23", "情色": "6", "同性": "26", "音乐": "14", "歌舞": "7",
        "家庭": "28", "儿童": "8", "传记": "2", "历史": "4", "战争": "22",
        "犯罪": "3", "西部": "27", "奇幻": "16", "冒险": "15", "灾难": "12",
        "武侠": "29", "古装": "30", "运动": "18", "黑色电影": "31",
    }
    url = "https://movie.douban.com/j/chart/top_list"

    # A dictionary of parameters that are used in the request.
    Params = {
        'type': f"{type_name[word1]}",
        'interval_id': "100:90",
        'action': None,
        'start': "0",
        'limit': f"{Number}",
    }

    # A header that is used to identify the browser.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }

    # Sending a request to the url with the parameters and headers.
    response = requests.get(url=url, params=Params, headers=headers)

    # Converting the response to a json object.
    result = response.json()

    # Creating a list for each column; the first element is the column name.
    title = ["film_title",]              # film title
    release_date = ["release_date",]     # release date
    actors = ["actors",]                 # actors
    regions = ["country_of_production"]  # country/region of production
    rating = ["rating",]                 # score
    vote_count = ["vote_count"]          # number of raters
    rank = ["rank",]                     # rank within the single-genre chart
    types = ["types"]                    # movie genres
    url = ["url",]                       # link to the movie's page

    # Iterating through the result and appending the values of the keys to the lists.
    for i in result:
        title.append(i["title"])
        release_date.append(i["release_date"])
        actors.append(i["actors"])
        regions.append(i["regions"])
        rating.append(i["rating"][0])
        vote_count.append(i["vote_count"])
        rank.append(i["rank"])
        types.append(i["types"])
        url.append(i["url"])

    # Creating a dictionary with the keys being the column names and the values being the lists of values.
    output_excel = {
        "film_title": title, "release_date": release_date, "actors": actors,
        "regions": regions, "rating": rating, "vote_count": vote_count,
        "rank": rank, "types": types, "url": url,
    }
    output = pd.DataFrame(output_excel)
    return output


def merge_sheets(file, save_file):
    """
    Read every sheet of an Excel file, append them into one DataFrame, and save it.

    :param file: the file path of the excel file
    :param save_file: the file name of the merged file
    """
    file = pd.ExcelFile(file)
    sheet_names = file.sheet_names
    print(sheet_names)
    sheet_concat = pd.DataFrame()
    for sheet in sheet_names:
        df = pd.read_excel(file, sheet_name=sheet, header=1, index_col=0)
        print(df.shape)
        sheet_concat = sheet_concat.append(df)
    sheet_concat.to_excel(save_file)
    print('Merged data shape:', sheet_concat.shape)


word_list = ["剧情", "喜剧", "动作", "爱情", "科幻", "动画", "悬疑", "惊悚", "恐怖",
             "纪录片", "短片", "情色", "同性", "音乐", "歌舞", "家庭", "儿童", "传记",
             "历史", "战争", "犯罪", "西部", "奇幻", "冒险", "灾难", "武侠", "古装",
             "运动", "黑色电影"]

writer = pd.ExcelWriter("Movie_douban.xlsx")
Number = 1000

# Creating a new dataframe for each word in the word_list and saving it to a new sheet in the excel file.
for m, i in enumerate(word_list, start=1):
    exec(f'output{m} = DouBan_Movie_Sperider(i,Number)')
    exec(f'output{m}.to_excel(writer,i)')
writer.save()

file = 'Movie_douban.xlsx'
save_file = 'Movie_douban_classification_rankings.xlsx'
merge_sheets(file, save_file)

2. Visual data analysis

For visual data analysis a Jupyter Notebook environment is recommended, and that is what I use here. The lines starting with % are IPython magics that are not needed (and will not run) in some other IDEs, so watch out for them.

1. Preparation

(1) Import the library and set drawing parameters

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Set the plot font to Microsoft YaHei so Chinese labels display correctly
matplotlib.rcParams['font.family'] = 'Microsoft YaHei'
# Add a sans-serif fallback (SimHei) in case the font above is unavailable
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
# Output figures as SVG so the default charts stay sharp
%config InlineBackend.figure_format = 'svg'
# Fix minus signs not rendering correctly on the axes
matplotlib.rcParams['axes.unicode_minus'] = False
# Set the plotting style
matplotlib.style.use('tableau-colorblind10')

(2) Import data

Import the data and take a look at it:

# Load the merged Excel file produced by the crawler
df = pd.read_excel("Movie_douban_classification_rankings.xlsx", sheet_name="Sheet1")
# Inspect the contents of df
df

[Figure: the raw data table]
The read_excel call in the first line assumes the file is in the working directory. If the file lives elsewhere, pass its full path instead, for example:

D:\Python\Movie_douban_classification_rankings.xlsx


Here is what each column in the table means:

film_title = film title
release_date = release date
actors = actors
country_of_production = country/region of production
rating = score
vote_count = number of raters
rank = rank within the single-genre chart
types = movie genres
url = link to the movie's page

2. Data preprocessing

(1) Missing value processing

First, find missing values:

# Data preprocessing: look for missing values
df.isnull()

[Output: df.isnull() boolean table]

Because the table is large, we use another approach to check for missing values: it directly displays any rows that contain them.

# Since the dataset is large, use any(axis=1) to flag rows that contain missing values
nan_any_rows = df.isnull().any(axis=1)
df[nan_any_rows]

[Output: rows with missing values (empty)]
The empty result indicates that there are no missing values in the table.
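
A more compact check (my addition, not in the original) counts missing values per column:

# Count missing values per column; all zeros confirms the table is complete.
df.isnull().sum()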

(2) Duplicate value processing

First check if there are duplicate values:

# Call the duplicated method to detect repeated film titles
df['film_title'].duplicated()

[Output: duplicated() boolean series]
The results show that many film titles are repeated, which means some movies belong to more than one genre. The next step is therefore to drop the duplicates so that every title appears only once.
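
A one-line count (my addition) summarizes this more readably than scanning the boolean series:

# Number of rows whose film_title already appeared earlier in the table.
df['film_title'].duplicated().sum()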

Next, remove duplicate values:

# First drop the "Unnamed: 0" column produced by the merge; it would get in the
# way of removing duplicate rows. Of the two ways to delete a column we use the
# drop method, so that the original data stays intact.
new_df = df.drop(columns=["Unnamed: 0"])
# Then remove the duplicates with the drop_duplicates method
new_df = new_df.drop_duplicates(subset='film_title')
# Inspect the de-duplicated data
new_df

[Output: the de-duplicated data]
The row count of 3218 shown in the output confirms that the duplicate values have been removed.


3. Visual analysis

(1) Analysis of movie-producing countries

Because the crawler stored this field as a Python list, the list syntax (brackets and quotes) was written into the table as text, so we must first strip those extra characters from the data.

# (1) First clean up the data by stripping the list syntax
new_df_country1 = new_df['country_of_production'].str.replace('[', '', regex=False)
new_df_country2 = new_df_country1.str.replace(']', '', regex=False)
new_df_country3 = new_df_country2.str.replace("'", '', regex=False)
new_df_country3

Next come the statistics. Here I only kept the top ten producing countries (any more and the chart gets too crowded and the labels become unreadable).

# (2) Count movies per country
import re

movies_country = []
for country1 in new_df_country3:
    # The original split(', ' or '/') only ever split on ', ', because
    # `', ' or '/'` evaluates to ', '; re.split handles both separators.
    country2 = re.split(r', |/', country1)
    movies_country.extend(iter(country2))
new_df_country_number1 = pd.DataFrame({'country_of_production': movies_country})
new_df_country_number2 = new_df_country_number1['country_of_production'].value_counts()
new_df_country_number2.head(10)

[Output: top ten producing countries]

Then comes the drawing. It is very simple, so I won't explain it much: set the X-axis and Y-axis labels and the main title, and that's it.

# (3) Bar chart of the top ten producing countries
plt.title('Country/Numbers')  # main title
plt.xlabel('Country')         # X-axis label
plt.ylabel('Numbers')         # Y-axis label
new_df_country_number2.head(10).plot(kind="bar")  # draw

[Figure: bar chart of the top ten producing countries]
If you are not using Jupyter Notebook, add a plt.show() line at the end to display the chart. PS: don't forget the parentheses!

(2) Analysis of movie types

As above, this field was also stored as a list, so it carries the same list syntax, which has to be stripped first:

# (1) First clean up the data by stripping the list syntax
new_df_types1 = new_df['types'].str.replace('[', '', regex=False)
new_df_types2 = new_df_types1.str.replace(']', '', regex=False)
new_df_types3 = new_df_types2.str.replace("'", '', regex=False)
new_df_types3

Then count how many movies there are in each genre:

# (2) Count movies per genre (this could later be merged with the code above into one helper)
movies_types = []
for types1 in new_df_types3:
    types2 = types1.split(', ')
    movies_types.extend(iter(types2))
new_df_types_number1 = pd.DataFrame({'types_of_production': movies_types})
new_df_types_number2 = new_df_types_number1['types_of_production'].value_counts()
new_df_types_number2

[Output: movie counts per genre]

Then draw the chart the same way, setting the axis labels and main title. Remember to add plt.show() if you are not using Jupyter Notebook, or you won't see the image!

# (3) Bar chart of the ten most common movie genres
new_df_types_number2.head(10).plot(kind='bar')  # draw
plt.xlabel('Types')         # X-axis label
plt.ylabel('Numbers')       # Y-axis label
plt.title('Types/Numbers')  # main title

[Figure: bar chart of the ten most common genres]

(3) Analysis of movie quantity

To count the number of movies per year, we first need to convert the release dates, which were stored as text:

# (1) Convert the string dates to years
new_df_datetime = pd.to_datetime(new_df['release_date']).apply(lambda x:x.strftime('%Y'))
new_df_datetime

[Output: the extracted release years]
Only %Y is used because annual statistics need nothing but the year; there is no point keeping extra detail. If you need the month or day as well, use a format like %Y-%m-%d.
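
For example, a monthly breakdown (not used in this article, just an illustration) would only change the format string:

# Hypothetical variation: keep year and month to count releases per month.
new_df_month = pd.to_datetime(new_df['release_date']).apply(lambda x: x.strftime('%Y-%m'))
new_df_month.value_counts().sort_index()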

Then count the number of movies by year:

# (2) Count the number of movies per year
new_df_datetime_numbers_years = new_df_datetime.value_counts().sort_index()
new_df_datetime_numbers_years

[Output: movie counts per year]

Now draw the chart. This time we use a line chart, since we are looking at a trend over time:

# (3) Plot the yearly counts
fig = plt.figure(figsize=(7, 5), dpi=100)
plt.xlabel('Years')
plt.ylabel('Numbers')
plt.title('Years/Numbers')
new_df_datetime_numbers_years.plot(kind='line')

[Figure: line chart of the number of movies per year]


3. Summary of analysis results

(1) Countries or regions with high film production:

United States, United Kingdom, Japan, France, Mainland China, Hong Kong, Germany, Italy, Canada, South Korea.

(2) The era of rapid development of the film industry:

1991-2013

(3) Rating ranking:

Most ratings fall between 8 and 9 points; only a few are below 7.5 or above 9.6.
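
The rating distribution itself is not plotted in the steps above; here is a minimal sketch (my own addition, reusing new_df from the preprocessing step) of how it could be checked:

# Histogram of the rating column to verify the distribution claim.
# to_numeric guards against the column having been read back as text.
pd.to_numeric(new_df['rating'], errors='coerce').plot(kind='hist', bins=20)
plt.xlabel('Rating')
plt.ylabel('Numbers')
plt.title('Rating distribution')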

Thank you for your patience in reading this far. If you have any questions or suggestions about the above, feel free to comment or send me a private message to discuss.
