[Big Data Foundation] Movie Data Analysis Based on TMDB Dataset

https://dblab.xmu.edu.cn/blog/2400/

Experimental content

Environment setup

pip3 install bottle


Data preprocessing

The dataset used in this project is the tmdb-movie-metadata movie dataset from the well-known data site Kaggle, which contains data on about 5,000 movies. This experiment uses the movie table tmdb_5000_movies.csv from that dataset. The data contains the following fields:

| Field | Description | Example |
| --- | --- | --- |
| budget | budget | 10000000 |
| genres | genres | "[{""id"": 18, ""name"": ""Drama""}]" |
| homepage | homepage | "" |
| id | id | 268238 |
| keywords | keywords | "[{""id"": 14636, ""name"": ""india""}]" |
| original_language | original language | en |
| original_title | original title | The Second Best Exotic Marigold Hotel |
| overview | overview | As the Best Exotic Marigold Hotel ... |
| popularity | popularity | 17.592299 |
| production_companies | production companies | "[{""name"": ""Fox Searchlight Pictures"", ""id"": 43}, ...]" |
| production_countries | production countries | "[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}, ...]" |
| release_date | release date | 2015-02-26 |
| revenue | revenue | 85978266 |
| runtime | runtime (minutes) | 122 |
| spoken_languages | spoken languages | "[{""iso_639_1"": ""en"", ""name"": ""English""}]" |
| status | status | Released |
| tagline | tagline | "" |
| title | title | The Second Best Exotic Marigold Hotel |
| vote_average | average score | 6.3 |
| vote_count | number of votes | 272 |

Because some fields in the data contain json, reading the file directly into a DataFrame causes field-splitting errors. To create a DataFrame, the file must first be read into an RDD, which is then converted into a DataFrame; along the way, each line is parsed and converted with Python 3's csv module.
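For example, a naive split on commas breaks a line whose json field itself contains commas, while the csv module handles the quoting correctly. The line below is a hypothetical minimal row made up for this sketch, not taken from the dataset:

```python
import csv

# A hypothetical minimal row: budget, genres (quoted JSON with commas), id
line = '10000000,"[{""id"": 18, ""name"": ""Drama""}]",268238'

naive = line.split(',')            # splits inside the quoted JSON field
fields = next(csv.reader([line]))  # respects CSV quoting

print(len(naive))    # 4 pieces, although the row has only 3 fields
print(fields[1])     # the intact JSON string: [{"id": 18, "name": "Drama"}]
```

This is exactly why the RDD conversion below parses each line with `csv.reader` instead of `str.split`.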

To make the RDD converted from the csv file easier to process, the header line of the csv file must be removed first. After that, store the processed tmdb_5000_movies.csv on HDFS for further processing. Upload the file to HDFS with the following commands:

# Start Hadoop
cd /usr/local/hadoop
./sbin/start-dfs.sh
# Create a /data directory in HDFS
./bin/hdfs dfs -mkdir /data
# Upload the file to HDFS (no destination given, so it goes to the default
# /user/hadoop directory; this differs from the original post -- follow this blog)
./bin/hdfs dfs -put ~/tmdb_5000_movies.csv
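The header removal mentioned above can be sketched in a few lines of Python, assuming the header occupies exactly the first line (as it does in tmdb_5000_movies.csv); the file names in the usage comment are illustrative:

```python
def strip_header(src, dst):
    """Copy src to dst, dropping the first (header) line."""
    with open(src) as fin, open(dst, 'w') as fout:
        next(fin)             # skip the header line
        fout.writelines(fin)  # write the remaining lines unchanged

# e.g. strip_header('tmdb_5000_movies_raw.csv', 'tmdb_5000_movies.csv')
```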

At this point the path of the file on HDFS is /user/hadoop/tmdb_5000_movies.csv. In the program, the file is then read with the following statement:

sc.textFile('tmdb_5000_movies.csv')

Convert data to DataFrame using Spark

The full path of the file is:

hdfs://localhost:8020/user/hadoop/tmdb_5000_movies.csv

To create a DataFrame, first load the data on HDFS into an RDD, then convert that RDD into a DataFrame. The following code segment completes the conversion from file to RDD and then to DataFrame:

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StringType, StructField, StructType
import json  # used later in the pipeline
import csv

# 1. Create the SparkSession and SparkContext objects
sc = SparkContext('local', 'spark_project')
sc.setLogLevel('WARN')  # reduce unnecessary log output
spark = SparkSession.builder.getOrCreate()

# 2. Create the schema for converting the RDD to a DataFrame
schemaString = "budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count"
fields = [StructField(field, StringType(), True)
          for field in schemaString.split(",")]
schema = StructType(fields)

# 3. Parse each comma-separated line with the csv module and turn it into
#    a Row object, yielding an RDD that can be converted to a DataFrame
moviesRdd = sc.textFile('tmdb_5000_movies.csv').map(
    lambda line: Row(*next(csv.reader([line]))))

# 4. Create the DataFrame with createDataFrame
mdf = spark.createDataFrame(moviesRdd, schema)

The above code does four things:
First, it creates the SparkSession and SparkContext objects.
Then it builds the table header (schema) used to convert the RDD into a DataFrame. The schema is a StructType object created from an array of StructFields. Each StructField represents one field of the structured data, and constructing a StructField takes 3 parameters:

  1. the field name
  2. the field type
  3. whether the field may be null

The structural relationship of these objects is:

StructType([StructField(name, type, nullable), ..., StructField(name, type, nullable)])

Next, create the RDD that will be converted into a DataFrame. This reads the data file on HDFS and turns each line into a Row object.
Each line is first parsed with the csv module, which yields an iterator over the line's fields:

csv.reader([line])  # [line] is a list containing one line of data

Then the next function reads the fields out of the iterator into a list:

next(csv.reader([line]))

Finally, * unpacks the list into the constructor arguments of Row to create the Row object:

Row(*next(csv.reader([line])))

So far, each row in moviesRdd is a Row object.
Finally, use the prepared table header (schema) and RDD to create a DataFrame through the SparkSession interface createDataFrame:

mdf = spark.createDataFrame(moviesRdd, schema)

This completes the creation of the DataFrame.

Data analysis with Spark

Next, use the Spark DataFrame mdf for data analysis. First the main fields are analyzed individually (the overview part), then the relationships between different fields (the relationship part).
To make visualization easy, the results of each analysis are exported as json files, which the web pages read and render. The export uses the save function below:

def save(path, data):
  with open(path, 'w') as f:
    f.write(data)

This function writes data to path.
The generation process of each analysis is described below.

1. Overview

This part analyzes the data as a whole.

1. Genre Distribution in TMDb Movies

As the data dictionary above shows, a movie's genres field is json data. To count the number of movies in each genre, first parse the json, extract the genre array of each movie, and then run word-frequency statistics over the genre names to obtain the genre distribution.
First implement a function countByJson(field) that parses a json-formatted field, extracts the name values, and counts their frequencies:

def countByJson(field):
    return mdf.select(field) \
        .filter(mdf[field] != '') \
        .rdd \
        .flatMap(lambda g: [(v, 1) for v in
                            map(lambda x: x['name'], json.loads(g[field]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: x + y)

The function returns an RDD of (name, count) pairs.
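The per-row logic of countByJson can be illustrated without Spark; the sample rows below are made up for the sketch:

```python
import json
from collections import Counter

# Two hypothetical values of a json field such as "genres"
rows = [
    '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]',
    '[{"id": 18, "name": "Drama"}]',
]

counts = Counter()
for g in rows:
    for name in (x['name'] for x in json.loads(g)):  # the flatMap step
        counts[name] += 1                            # the reduceByKey step

print(dict(counts))  # {'Drama': 2, 'Comedy': 1}
```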

Based on this function, countByGenres is used to generate statistics of the number of movies in different genres:

def countByGenres():
    res = countByJson("genres").collect()
    return list(map(lambda v: {"genre": v[0], "count": v[1]}, res))

This function calls countByJson to get the frequency statistics result, converts it into json data format and returns it, which is convenient for visualization. The format of the data returned by the final function is as follows:

[{
    "genre": ...,
    "count": ...
}, {
    "genre": ...,
    "count": ...
}, ...]

Then, use the following code to export the data to genres.json for visualization later

save('genres.json', json.dumps(countByGenres()))  # make sure the json module is imported

2. Top 100 common keywords

This item finds the 100 most frequent movie keywords. Since the keywords field is also json data, call countByJson for the frequency statistics, then sort the results in descending order and take the first 100:

def countByKeywords():
    res = countByJson("keywords").sortBy(lambda x: -x[1]).take(100)
    return list(map(lambda v: {"x": v[0], "value": v[1]}, res))

Finally, the function returns json data in the following format:

[{
    "x": ...,
    "value": ...
}, {
    "x": ...,
    "value": ...
}, ...]

Next, use the following code to export the data to keywords.json for later visualization

save('keywords.json', json.dumps(countByKeywords()))

3. The 10 most common budget numbers in TMDb

This item explores the common budget of movies, so it is necessary to perform frequency statistics on movie budgets. The code is as follows:

def countByBudget(order='count', ascending=False):
    return mdf.filter(mdf["budget"] != 0) \
        .groupBy("budget").count() \
        .orderBy(order, ascending=ascending) \
        .toJSON().map(lambda j: json.loads(j)).take(10)

First filter out movies whose budget is 0, then group and count by budget, sort by count, and export the result as json strings. To keep the output uniform, each json string is converted back into a Python object, and finally the top 10 items are taken as the result.
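The same filter-group-count-sort pipeline can be sketched in plain Python with collections.Counter; the budget values below are made up:

```python
from collections import Counter

budgets = [20000000, 0, 30000000, 20000000, 0, 30000000, 20000000]

nonzero = [b for b in budgets if b != 0]  # filter(budget != 0)
top = Counter(nonzero).most_common(2)     # groupBy + count + orderBy + take
print(top)  # [(20000000, 3), (30000000, 2)]
```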
Finally, the function returns json data in the following format:

[{
    "budget": ...,
    "count": ...
}, {
    "budget": ...,
    "count": ...
}, ...]

Then, use the following code to export the data to budget.json for visualization later

save('budget.json', json.dumps(countByBudget()))

4. The most common movie durations in TMDb (only durations shared by more than 100 movies)

This item counts the most common movie durations in TMDb. First filter out movies with a duration of 0, then group and count by the runtime field, and drop durations that occur fewer than 100 times (this keeps the visualization readable and avoids redundant information). The code is as follows:

def distrbutionOfRuntime(order='count', ascending=False):
    return mdf.filter(mdf["runtime"] != 0) \
        .groupBy("runtime").count() \
        .filter('count>=100') \
        .toJSON().map(lambda j: json.loads(j)).collect()

Finally, the function returns json data in the following format:

[{
    "runtime": ...,
    "count": ...
}, {
    "runtime": ...,
    "count": ...
}, ...]

Then, use the following code to export data to runtime.json for visualization later

save('runtime.json', json.dumps(distrbutionOfRuntime()))

5. Top 10 companies producing the most movies

This item counts the 10 companies with the largest movie output. It also uses countByJson to run frequency statistics on the JSON data, then sorts in descending order and takes the top 10 items.

def countByCompanies():
    res = countByJson("production_companies").sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"company": v[0], "count": v[1]}, res))

Finally, the function returns JSON data in the following format:

[{
    "company": ...,
    "count": ...
}, {
    "company": ...,
    "count": ...
}, ...]

Then, use the following code to export the data to company_count.json for visualization later

save('company_count.json', json.dumps(countByCompanies()))

6. TMDb's Top 10 Film Languages

This item counts the languages that appear most often in TMDb. As in the previous items, this field is also JSON data, so first count the word frequency of each entry, then filter out entries whose language is empty, and finally take the top ten.

def countByLanguage():
    res = countByJson("spoken_languages").filter(
        lambda v: v[0] != '').sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"language": v[0], "count": v[1]}, res))

Finally, the function returns json data in the following format:

[{
    "language": ...,
    "count": ...
}, {
    "language": ...,
    "count": ...
}, ...]

Next, use the following code to export data to language.json for easy visualization

save('language.json', json.dumps(countByLanguage()))

2. Relationship

This part considers the relationship between data.

1. Relationship between budget and evaluation

This part considers the relationship between budget and evaluation, so for each movie, the following data needs to be exported:

[movie title, budget, rating]

Simple field selection and filtering on the DataFrame suffices:

def budgetVote():
    return mdf.select("title", "budget", "vote_average") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf["vote_count"] > 100) \
        .collect()

Note that movies with a budget of 0 must be filtered out, and only movies with more than 100 votes are kept, to keep the comparison fair.
The resulting data is stored in budget_vote.json:

save('budget_vote.json', json.dumps(budgetVote()))

2. The relationship between release time and evaluation

This part considers the relationship between release time and evaluation, so for each movie, the following data needs to be exported:

[movie title, release date, rating]

Simple field selection and filtering on the DataFrame suffices:

def dateVote():
    return mdf.select(mdf["release_date"], "vote_average", "title") \
        .filter(mdf["release_date"] != "") \
        .filter(mdf["vote_count"] > 100) \
        .collect()

Here we still need to filter out the data whose release time is empty, and keep the data with more than 100 votes.
The resulting data is stored in date_vote.json:

save('date_vote.json', json.dumps(dateVote()))

3. The relationship between popularity and evaluation

This part considers the relationship between popularity and evaluation, so for each movie, the following data needs to be exported:

[movie title, popularity, rating]

Simple field selection and filtering on the DataFrame suffices:

def popVote():
    return mdf.select("title", "popularity", "vote_average") \
        .filter(mdf["popularity"] != 0) \
        .filter(mdf["vote_count"] > 100) \
        .collect()

At the same time, filter out the data whose popularity is 0, and keep the data with more than 100 votes.
The resulting data is stored in pop_vote.json:

save('pop_vote.json', json.dumps(popVote()))

4. The relationship between the average score and the number of movies produced by the company

This part computes the number of movies produced by each company and the average rating of those movies. First filter the data, removing movies whose production company field is empty or that have 100 votes or fewer; then map each record into entries of the following form:

[Company Name, (Rating, 1)]

Then sum the ratings and counts over all records, and finally divide each total rating by its count to obtain each company's average rating and number of movies.

def moviesVote():
    return mdf.filter(mdf["production_companies"] != '') \
        .filter(mdf["vote_count"] > 100) \
        .rdd \
        .flatMap(lambda g: [(v, [float(g['vote_average']), 1]) for v in
                            map(lambda x: x['name'],
                                json.loads(g["production_companies"]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda v: [v[0], v[1][0] / v[1][1], v[1][1]]) \
        .collect()
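The accumulate-then-divide step used here can be checked without Spark; a minimal sketch with made-up rows:

```python
import json

# Hypothetical (production_companies json, vote_average) pairs
rows = [
    ('[{"name": "A"}, {"name": "B"}]', 7.0),
    ('[{"name": "A"}]', 8.0),
]

acc = {}                                  # company -> [score_sum, movie_count]
for companies, vote in rows:
    for name in (c['name'] for c in json.loads(companies)):
        s, n = acc.get(name, [0.0, 0])
        acc[name] = [s + vote, n + 1]     # the reduceByKey step

result = [[k, v[0] / v[1], v[1]] for k, v in acc.items()]  # average + count
print(result)  # [['A', 7.5, 2], ['B', 7.0, 1]]
```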

Store the resulting data in movies_vote.json:

save('movies_vote.json', json.dumps(moviesVote()))

5. The relationship between movie budget and revenue

This part considers the revenue of the movie, so for each movie, the following data needs to be exported:

[movie title, budget, revenue]

Simple field selection and filtering on the DataFrame suffices:

def budgetRevenue():
    return mdf.select("title", "budget", "revenue") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf['revenue'] != 0) \
        .collect()

Filter out movies with a budget or revenue of 0.
The resulting data is stored in budget_revenue.json:

save('budget_revenue.json', json.dumps(budgetRevenue()))

3. Integrated calls

Finally, the above process is integrated for easy calling, so add the main function in analyst.py:

if __name__ == "__main__":
    m = {
        "countByGenres": {
            "method": countByGenres,
            "path": "genres.json"
        },
        "countByKeywords": {
            "method": countByKeywords,
            "path": "keywords.json"
        },
        "countByCompanies": {
            "method": countByCompanies,
            "path": "company_count.json"
        },
        "countByBudget": {
            "method": countByBudget,
            "path": "budget.json"
        },
        "countByLanguage": {
            "method": countByLanguage,
            "path": "language.json"
        },
        "distrbutionOfRuntime": {
            "method": distrbutionOfRuntime,
            "path": "runtime.json"
        },
        "budgetVote": {
            "method": budgetVote,
            "path": "budget_vote.json"
        },
        "dateVote": {
            "method": dateVote,
            "path": "date_vote.json"
        },
        "popVote": {
            "method": popVote,
            "path": "pop_vote.json"
        },
        "moviesVote": {
            "method": moviesVote,
            "path": "movies_vote.json"
        },
        "budgetRevenue": {
            "method": budgetRevenue,
            "path": "budget_revenue.json"
        }
    }
    base = "static/"  # base directory for the generated files
    if not os.path.exists(base):  # create the directory if it does not exist
        os.mkdir(base)

    for k in m:  # run all of the methods above
        p = m[k]
        f = p["method"]
        save(base + m[k]["path"], json.dumps(f()))
        print("done -> " + k + " , save to -> " + base + m[k]["path"])

The above code collects all the functions in the variable m, then loops over them, calling each method and exporting its json file.


4. Complete code

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StringType, StructField, StructType
import json
import csv
import os

sc = SparkContext('local', 'spark_project')
sc.setLogLevel('WARN')
spark = SparkSession.builder.getOrCreate()

schemaString = "budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count"
fields = [StructField(field, StringType(), True)
          for field in schemaString.split(",")]
schema = StructType(fields)

moviesRdd = sc.textFile('hdfs://localhost:8020/user/hadoop/tmdb_5000_movies.csv').map(
    lambda line: Row(*next(csv.reader([line]))))
mdf = spark.createDataFrame(moviesRdd, schema)


def countByJson(field):
    return mdf.select(field) \
        .filter(mdf[field] != '') \
        .rdd \
        .flatMap(lambda g: [(v, 1) for v in
                            map(lambda x: x['name'], json.loads(g[field]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: x + y)


# genre statistics
def countByGenres():
    res = countByJson("genres").collect()
    return list(map(lambda v: {"genre": v[0], "count": v[1]}, res))


# keyword word cloud
def countByKeywords():
    res = countByJson("keywords").sortBy(lambda x: -x[1]).take(100)
    return list(map(lambda v: {"x": v[0], "value": v[1]}, res))


# number of movies produced per company
def countByCompanies():
    res = countByJson("production_companies").sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"company": v[0], "count": v[1]}, res))


# budget statistics
def countByBudget(order='count', ascending=False):
    return mdf.filter(mdf["budget"] != 0) \
        .groupBy("budget").count() \
        .orderBy(order, ascending=ascending) \
        .toJSON().map(lambda j: json.loads(j)).take(10)


# language statistics
def countByLanguage():
    res = countByJson("spoken_languages").filter(
        lambda v: v[0] != '').sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"language": v[0], "count": v[1]}, res))


# runtime distribution (runtimes shared by at least 100 movies)
def distrbutionOfRuntime(order='count', ascending=False):
    return mdf.filter(mdf["runtime"] != 0) \
        .groupBy("runtime").count() \
        .filter('count>=100') \
        .toJSON().map(lambda j: json.loads(j)).collect()


# budget vs. rating
def budgetVote():
    return mdf.select("title", "budget", "vote_average") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf["vote_count"] > 100) \
        .collect()


# release date vs. rating
def dateVote():
    return mdf.select(mdf["release_date"], "vote_average", "title") \
        .filter(mdf["release_date"] != "") \
        .filter(mdf["vote_count"] > 100) \
        .collect()


# popularity vs. rating
def popVote():
    return mdf.select("title", "popularity", "vote_average") \
        .filter(mdf["popularity"] != 0) \
        .filter(mdf["vote_count"] > 100) \
        .collect()


# number of movies produced vs. rating
def moviesVote():
    return mdf.filter(mdf["production_companies"] != '') \
        .filter(mdf["vote_count"] > 100) \
        .rdd \
        .flatMap(lambda g: [(v, [float(g['vote_average']), 1]) for v in
                            map(lambda x: x['name'],
                                json.loads(g["production_companies"]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda v: [v[0], v[1][0] / v[1][1], v[1][1]]) \
        .collect()


# budget vs. revenue
def budgetRevenue():
    return mdf.select("title", "budget", "revenue") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf['revenue'] != 0) \
        .collect()


def save(path, data):
    with open(path, 'w') as f:
        f.write(data)


if __name__ == "__main__":
    m = {
        "countByGenres": {
            "method": countByGenres,
            "path": "genres.json"
        },
        "countByKeywords": {
            "method": countByKeywords,
            "path": "keywords.json"
        },
        "countByCompanies": {
            "method": countByCompanies,
            "path": "company_count.json"
        },
        "countByBudget": {
            "method": countByBudget,
            "path": "budget.json"
        },
        "countByLanguage": {
            "method": countByLanguage,
            "path": "language.json"
        },
        "distrbutionOfRuntime": {
            "method": distrbutionOfRuntime,
            "path": "runtime.json"
        },
        "budgetVote": {
            "method": budgetVote,
            "path": "budget_vote.json"
        },
        "dateVote": {
            "method": dateVote,
            "path": "date_vote.json"
        },
        "popVote": {
            "method": popVote,
            "path": "pop_vote.json"
        },
        "moviesVote": {
            "method": moviesVote,
            "path": "movies_vote.json"
        },
        "budgetRevenue": {
            "method": budgetRevenue,
            "path": "budget_revenue.json"
        }
    }
    base = "static/"
    if not os.path.exists(base):
        os.mkdir(base)

    for k in m:
        p = m[k]
        f = p["method"]
        save(base + m[k]["path"], json.dumps(f()))
        print("done -> " + k + " , save to -> " + base + m[k]["path"])
    # save("test.jj", json.dumps(countByGenres()))

5. Data Analysis Results

Running analyst.py exports the processing results as the json files described above.

Data visualization

Data visualization is implemented with Ali's open-source visualization library G2. G2 is a data-driven graphical grammar based on visual encoding, with high usability and extensibility: users can build a wide variety of interactive statistical charts with a single statement, without attending to tedious implementation details. The genre distribution of movies in TMDb is used below to illustrate the visualization process.
First, use the Python web framework bottle to serve the visualization pages, which makes reading the json data straightforward. The following web.py implements simple static file serving:

import bottle
from bottle import route, run, static_file
import json

@route('/static/<filename>')
def server_static(filename):
    return static_file(filename, root="/home/hadoop/jupyternotebook/static")

@route("/<name:re:.*\.html>")
def server_page(name):
    return static_file(name, root=".")

@route("/")
def index():
    return static_file("index.html", root=".")


run(host="0.0.0.0", port=9996)

bottle routes incoming requests as follows:

  1. Files under the static folder in the service's startup directory are returned directly by file name.
  2. html files in the startup directory are likewise returned as pages.
  3. Accessing port 9996 of the machine directly returns the home page.

Finally, the web service is bound to port 9996 of the machine. With this layout, web pages (html files) are placed directly in the directory where the service starts, and the results of the Spark analysis are stored in the static directory.
Next, implement the home page file index.html.
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,height=device-height">
  <title>TMDb Movie Data Analysis</title>
  <style>
    /* 这里省略 */
  </style>
</head>

<body>
  <div class="container">
    <h1 style="font-size: 40px;"># TMDb Movie Data Analysis <br> <small style="font-size: 55%;color: rgba(0,0,0,0.65);">>
        Big Data Processing Technology on Spark</small> </h1>
    <hr>
    <h1 style="font-size: 30px;color: #404040;">I. Overviews</h1>
    <div class="chart-group">
      <h2>- Distribution of Genres in TMDb <br> <small style="font-size: 72%;">> This figure
          compares the genre
          distribution in TMDb, and you can see that most of the movies in TMDb is Drama.</small> </h2>
      <iframe src="genres.html" class="frame" frameborder="0"></iframe>
    </div>
  </div>

  <script>/*Fixing iframe window.innerHeight 0 issue in Safari*/document.body.clientHeight;</script>
</body>

</html>

Each chart is brought into the main page via an iframe: for each chart, the home page contains a title and an iframe for the page the chart lives on. The genre distribution result is implemented in genres.html, shown below.

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,height=device-height">
  <title>TOP 5000 Movie Data Analysis</title>
  <style>
    ::-webkit-scrollbar {
      display: none;
    }

    html,
    body {
      font-family: 'Ubuntu Condensed';
      height: 100%;
      margin: 0;
      color: rgba(0, 0, 0, 0.85);
    }
  </style>
</head>

<body>
  <div id="mountNode"></div>
  <script>/*Fixing iframe window.innerHeight 0 issue in Safari*/document.body.clientHeight;</script>
  <script src="static/g2.min.js"></script>
  <script src="static/data-set.min.js"></script>
  <script src="static/jquery-3.2.1.min.js"></script>
  <script>
    function generateChart(id, type, xkey, xlabel, ykey, ylabel) {
      var chart = new G2.Chart({ // initialize the chart
        container: id,
        forceFit: true,
        height: 500,
        padding: [40, 80, 80, 80],
      });
      chart.scale(ykey, { // configure the y scale
        alias: ylabel,
        min: 0,
        // max: 3000,
        tickCount: 4
      });

      chart.axis(xkey, { // configure the x axis
        label: {
          textStyle: {
            fill: '#aaaaaa'
          }
        },
        tickLine: {
          alignWithLabel: false,
          length: 0
        }
      });

      chart.axis(ykey, { // configure the y axis
        label: {
          textStyle: {
            fill: '#aaaaaa'
          }
        },
        title: {
          offset: 50
        }
      });
      chart.legend({ // configure the legend
        position: 'top-center'
      });
      // set labels, colors, etc.
      chart.interval().position(`${xkey}*${ykey}`).label(ykey).color('#ffb877').opacity(1).adjust([{
        type,
        marginRatio: 1 / 32
      }]);
      chart.render();

      return chart;
    }
  </script>
  <script>
    // call the function above to create the chart
    let chart = generateChart('mountNode', 'dodge', 'genre', 'genres', 'count', '# movies');

    window.onload = () => {
      // after the page loads, read the json file with jQuery
      $.getJSON("/static/genres.json", d => {
        chart.changeData(d) // update the chart via its data-update API
      })
    }
  </script>
</body>

</html>

The code process is explained in the comments. Before using this page, put the corresponding js libraries (g2.min.js, data-set.min.js, jquery-3.2.1.min.js) into the static folder.
Then run web.py to start the service and open the page in a browser.

Experimental results

Visualization results

Overview

1. Genre Distribution in TMDb Movies

It can be seen from the figure that Drama movies account for a large proportion in TMDb, followed by Science Fiction, Action and Thriller.

2. Top 100 common keywords

The most common keyword in TMDb is Woman Director, followed by independent film and so on.

3. The 10 most common budget numbers in TMDb

There are 144 movies with a budget of $20,000,000, which is the most common budget value.

4. The most common movie durations in TMDb (only durations shared by more than 100 movies)

Most movies are 90 or 100 minutes long.

5. Top 10 companies producing the most movies

The companies with the largest movie output are Warner Bros., Universal Pictures, and so on.

6. TMDb's Top 10 Film Languages

The language in most films is English.

Relationships

The relationship between budget and evaluation


The relationship between release time and evaluation


The relationship between popularity and ratings


The relationship between the average score and the number of movies produced by the company

As can be seen from the figure, the more movies a company produces, the closer its movie average score is to the overall average.

The relationship between film budget and revenue


Web view


Origin: blog.csdn.net/Algernon98/article/details/130191767