[Spark Course Project] A comprehensive review of rapid big data analysis with Spark


1. Comprehensive training: basic question (Case 4: Movie Data Analysis Based on the TMDB Dataset)

1.1 Environment setup

This project uses PyCharm, with the pyspark library installed.
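
  • If needed, the library can be installed with pip:
pip install pyspark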

1.2 Data preprocessing

  • The dataset used in this project is the tmdb-movie-metadata movie dataset from the well-known data science site Kaggle. It contains data on approximately 5,000 movies.
  • This experiment uses the movie data table tmdb_5000_movies.csv from the dataset. The data contains the following fields (field name, description, example):
    • budget (budget): 10000000
    • genres (genres): [{"id": 18, "name": "Drama"}]
    • homepage (homepage): ""
    • id (id): 268238
    • keywords (keywords): [{"id": 14636, "name": "india"}]
    • original_language (original language): en
    • original_title (original title): The Second Best Exotic Marigold Hotel
    • overview (overview): As the Best Exotic Marigold Hotel…
    • popularity (popularity): 17.592299
    • production_companies (production companies): [{"name": "Fox Searchlight Pictures", "id": 43}, …]
    • production_countries (production countries): [{"iso_3166_1": "GB", "name": "United Kingdom"}, …]
    • release_date (release date): 2015-02-26
    • revenue (revenue): 85978266
    • runtime (runtime, minutes): 122
    • spoken_languages (spoken languages): [{"iso_639_1": "en", "name": "English"}]
    • status (status): Released
    • tagline (tagline): ""
    • title (title): The Second Best Exotic Marigold Hotel
    • vote_average (average score): 6.3
    • vote_count (number of votes): 272
  • Since some fields in the data contain JSON, parsing errors occur if the file is read into a DataFrame directly. Therefore, to create a DataFrame, the file is first read into an RDD, which is then converted into a DataFrame; during this process, the csv module of Python 3 is used to parse each line.
  • To make the RDD converted from the csv file easier to process, the header row of the csv file needs to be removed first (a minimal sketch of this step is shown after this list). After that, store the processed file tmdb_5000_movies.csv on HDFS for further processing. Use the following command to upload the file to HDFS:
hdfs dfs -put tmdb_5000_movies.csv
  • At this time, the path of the file on HDFS is /user/hadoop/tmdb_5000_movies.csv. Then in the program, use the following statement to read the file:
sc.textFile('tmdb_5000_movies.csv')
  • The tmdb_5000_movies.csv file can also be downloaded directly from the Baidu Netdisk address given in the next section (extraction code: cui7).
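  • A minimal sketch of the header-removal step (the raw input file name is an assumption):
# strip the csv header row before uploading to HDFS; the input file name is an assumption
with open('tmdb_5000_movies_raw.csv', 'r', encoding='utf-8') as src, \
     open('tmdb_5000_movies.csv', 'w', encoding='utf-8') as dst:
    next(src)         # skip the header row
    for line in src:  # copy the remaining data rows unchanged
        dst.write(line)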

1.3 Use Spark to convert data into DataFrame

  • Since the file read in is a csv file containing structured data, the data can be turned into a DataFrame for easy analysis.

Starting from this section, all subsequent code is stored in analyze.py. The code is explained step by step below, and the complete analyze.py is given after the explanation.
(Note: all code files can be downloaded from Baidu Netdisk: https://pan.baidu.com/s/1lt7PHF17-gHieOU0B0zJ3A, extraction code: cui7)

  • In order to create a DataFrame, you first need to load the data on HDFS into an RDD, and then convert the RDD into a DataFrame. The following code snippet completes the conversion from file to RDD and then to DataFrame:
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StringType, StructField, StructType
import json  # used later for parsing the JSON fields
import csv
import os    # used later in the main function to create the output directory

# 1. Create the SparkContext and SparkSession objects
sc = SparkContext('local', 'spark_project')
sc.setLogLevel('WARN')  # reduce unnecessary log output
spark = SparkSession.builder.getOrCreate()

# 2. Create the schema for converting the RDD into a DataFrame
schemaString = "budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count"
fields = [StructField(field, StringType(), True)
          for field in schemaString.split(",")]
schema = StructType(fields)

# 3. Parse each comma-separated line with the csv module and convert it
#    into a Row object, yielding an RDD that can become a DataFrame
moviesRdd = sc.textFile('tmdb_5000_movies.csv').map(
    lambda line: Row(*next(csv.reader([line]))))

# 4. Create the DataFrame with createDataFrame
mdf = spark.createDataFrame(moviesRdd, schema)

The above code does 4 things:

  • First, create SparkSession and SparkContext objects. Then, create a header (schema) for converting RDD into DataFrame.
    • schema is a StructType object created using an array of StructFields
    • Each StructField represents a field in structured data
    • Constructing StructField requires 3 parameters: field name, field type, and whether the field can be empty. The following is the structural relationship of these objects:
StructType([StructField(name, type, null), ..., StructField(name, type, null)])
  • Next, start creating the RDD for conversion to DataFrame.
  • This step first reads the data file from HDFS; then, to convert the RDD into a DataFrame, each row of the data needs to be turned into a Row object.
  • Each line is first parsed with the csv module, which yields an iterator over the fields (a short demonstration of why the csv module is needed follows this list):
csv.reader([line])  # here, [line] is a list containing a single line of data
  • Then use the next function to read the data in the iterator into an array:
next(csv.reader([line]))
  • Finally, use * to convert the array into the constructor parameter of the Row object and create the Row object:
Row(*next(csv.reader([line])))
  • At this point, each row in moviesRdd is a Row object. Finally, use the prepared header (schema) and RDD to create the DataFrame through the SparkSession interface createDataFrame:
mdf = spark.createDataFrame(moviesRdd, schema)
  • This completes the creation of DataFrame.
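  • To see why the csv module is needed instead of a plain split, consider a line whose quoted JSON field contains commas; a minimal, self-contained demonstration:
import csv

line = '1000,"[{""id"": 18, ""name"": ""Drama""}]",Title'
print(line.split(','))           # the naive split breaks the quoted JSON field apart
print(next(csv.reader([line])))  # ['1000', '[{"id": 18, "name": "Drama"}]', 'Title']
  • After creation, the DataFrame can be sanity-checked with mdf.printSchema() and mdf.count() (roughly 5,000 rows for this dataset).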


1.4 Using Spark for data analysis

  • The following uses the DataFrame mdf obtained through Spark processing for data analysis. First, the main fields in the data are analyzed individually (overview section), and then the relationships between different fields are analyzed (relationship section).
  • In order to facilitate data visualization, for each different analysis, the analysis results are exported to json files, which are read and visualized by the web page. Export directly using the save function below:
def save(path, data):
  with open(path, 'w') as f:
    f.write(data)
  • This function writes data to path.
  • The generation process of each analysis is introduced below.

1.4.1 Overview

(1) Genre distribution in TMDb movies

  • As the data dictionary above shows, the genre field of a movie is JSON-formatted data. Therefore, to count the number of movies per genre, the JSON data must first be parsed and the genre array of each movie extracted; word frequency statistics over the genres then give the genre distribution of the movies.
  • First, implement a function countByJson(field) that parses the JSON-formatted field, extracts each name, and performs word frequency statistics:
def countByJson(field):
    # parse the JSON column, emit a (name, 1) pair per entry, and sum the counts
    return mdf.select(field) \
        .filter(mdf[field] != '') \
        .rdd \
        .flatMap(lambda g: [(v, 1) for v in map(lambda x: x['name'], json.loads(g[field]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: x + y)
  • The function returns an RDD; the pipeline is: select the field, filter out empty values, flatMap each JSON array into (name, 1) pairs, and reduceByKey to sum the counts per name.
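  • As a concrete illustration of the flatMap step, the mapping applied to a single row can be reproduced in plain Python (the sample value is made up):
import json

sample = '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]'
pairs = [(v, 1) for v in map(lambda x: x['name'], json.loads(sample))]
print(pairs)  # [('Drama', 1), ('Comedy', 1)]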
  • Based on this function, countByGenres is implemented to generate statistical results of the number of movies in different genres:
def countByGenres():
    res = countByJson("genres").collect()
    return list(map(lambda v: {"genre": v[0], "count": v[1]}, res))
  • This function calls countByJson to obtain the frequency statistics results, converts them into json data format and returns them for easy visualization. The final function return data format is as follows:
[{
    "genre": ...,
    "count": ...
}, {
    "genre": ...,
    "count": ...
}, ...]
  • Next, use the following code to export the data to genres.json for later visualization.
save('genres.json', json.dumps(countByGenres()))  # requires the json module imported at the top

(2) Top 100 common keywords

  • This analysis finds the 100 keywords that appear most frequently in movies. Since the keyword field is also JSON-formatted, countByJson is called for the frequency statistics; the results are then sorted in descending order and the first 100 items are taken:
def countByKeywords():
    res = countByJson("keywords").sortBy(lambda x: -x[1]).take(100)
    return list(map(lambda v: {"x": v[0], "value": v[1]}, res))
  • Finally, the function returns json data in the following format:
[{
    "x": ...,
    "value": ...
}, {
    "x": ...,
    "value": ...
}, ...]
  • Next, use the following code to export the data to keywords.json for later visualization.
save('keywords.json', json.dumps(countByKeywords()))

(3) The 10 most common budget amounts in TMDb

  • This item explores the most common movie budget amounts, so frequency statistics are needed on the budget field. The code is as follows:
def countByBudget(order='count', ascending=False):
    return mdf.filter(mdf["budget"] != 0) \
        .groupBy("budget").count() \
        .orderBy(order, ascending=ascending) \
        .toJSON().map(lambda j: json.loads(j)).take(10)
  • First, the budget field is filtered to remove movies with a budget of 0; the rows are then grouped and counted by budget and sorted by count, and the result is exported as JSON strings. For uniform output, each JSON string is converted to a Python object, and finally the top 10 items are taken as the result. (An equivalent Spark SQL formulation is sketched at the end of this subsection.)
  • Finally, the function returns json data in the following format:
[{
    "budget": ...,
    "count": ...
}, {
    "budget": ...,
    "count": ...
}, ...]
  • Next, use the following code to export the data to budget.json for later visualization.
save('budget.json', json.dumps(countByBudget()))
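  • Equivalently, after registering mdf as a temporary view, the same aggregation could be written in Spark SQL; a minimal sketch (the view name movies is an assumption):
# a sketch of the same aggregation in Spark SQL; the view name "movies" is an assumption
mdf.createOrReplaceTempView("movies")
top_budgets = spark.sql("""
    SELECT budget, COUNT(*) AS count
    FROM movies
    WHERE budget != 0
    GROUP BY budget
    ORDER BY count DESC
    LIMIT 10
""").toJSON().map(lambda j: json.loads(j)).collect()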

(4) The most common movie runtimes in TMDb (only runtimes shared by at least 100 movies are shown)

  • This item counts the most common movie runtimes in TMDb. First, movies with a runtime of 0 are filtered out; the rows are then grouped and counted by the runtime field, and runtimes with a frequency below 100 are dropped (this keeps the visualization readable by avoiding redundant information) to obtain the final result.
def distributionOfRuntime(order='count', ascending=False):
    return mdf.filter(mdf["runtime"] != 0) \
        .groupBy("runtime").count() \
        .filter('count >= 100') \
        .toJSON().map(lambda j: json.loads(j)).collect()
  • Finally, the function returns json data in the following format:
[{
    "runtime": ...,
    "count": ...
}, {
    "runtime": ...,
    "count": ...
}, ...]
  • Then, use the following code to export data to runtime.json for visualization later
save('runtime.json', json.dumps(distributionOfRuntime()))

(5) The top 10 companies that produce the most movies

  • This item counts the 10 companies with the largest movie output. countByJson is again used for frequency statistics on the JSON data; the results are then sorted in descending order and the top 10 items are taken.
def countByCompanies():
    res = countByJson("production_companies").sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"company": v[0], "count": v[1]}, res))
  • Finally, the function returns JSON data in the following format:
[{
    "company": ...,
    "count": ...
}, {
    "company": ...,
    "count": ...
}, ...]
  • Then, use the following code to export the data to company_count.json for visualization later
save('company_count.json', json.dumps(countByCompanies()))

(6) Top 10 movie languages in TMDb

  • This item counts the languages that appear most often in TMDb. As before, this field is JSON data, so word frequency statistics are performed first, entries with an empty language are filtered out, and the top ten are taken after sorting.
def countByLanguage():
    res = countByJson("spoken_languages") \
        .filter(lambda v: v[0] != '') \
        .sortBy(lambda x: -x[1]).take(10)
    return list(map(lambda v: {"language": v[0], "count": v[1]}, res))
  • Finally, the function returns json data in the following format:
[{
    "language": ...,
    "count": ...
}, {
    "language": ...,
    "count": ...
}, ...]
  • Next, use the following code to export data to language.json for easy visualization
save('language.json', json.dumps(countByLanguage()))

1.4.2 Relationship

This part considers the relationships between different fields in the data.

(1) The relationship between budget and rating

  • This part considers the relationship between budget and rating, so for each movie the following data needs to be exported:
[movie title, budget, rating]
  • Simple field selection and filtering on the DataFrame is enough:
def budgetVote():
    return mdf.select("title", "budget", "vote_average") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf["vote_count"] > 100).collect()
  • Note that movies with a budget of 0 are filtered out; at the same time, only movies with more than 100 votes are kept, to make the ratings reliable.
  • The resulting data is stored in budget_vote.json:
save('budget_vote.json', json.dumps(budgetVote()))
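  • Note that budgetVote() returns a list of Row objects; since Row is a subclass of tuple, json.dumps serializes each record as a plain JSON array, matching the [movie title, budget, rating] format above. A quick check (the sample values are made up):
import json
from pyspark.sql import Row

r = Row(title="Avatar", budget="237000000", vote_average="7.2")  # made-up record
print(json.dumps([r]))  # -> [["Avatar", "237000000", "7.2"]]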

(2) The relationship between release date and rating

  • This part considers the relationship between release date and rating, so for each movie the following data needs to be exported:
[movie title, release date, rating]
  • Simple field selection and filtering on the DataFrame is enough:
def dateVote():
    return mdf.select(mdf["release_date"], "vote_average", "title") \
        .filter(mdf["release_date"] != "") \
        .filter(mdf["vote_count"] > 100).collect()
  • Here, movies with an empty release date are again filtered out, and only movies with more than 100 votes are kept.
  • The resulting data is stored in date_vote.json:
save('date_vote.json', json.dumps(dateVote()))

(3) The relationship between popularity and rating

  • This part considers the relationship between popularity and rating, so for each movie the following data needs to be exported:
[movie title, popularity, rating]
  • Simple field selection and filtering on the DataFrame is enough:
def popVote():
    return mdf.select("title", "popularity", "vote_average") \
        .filter(mdf["popularity"] != 0) \
        .filter(mdf["vote_count"] > 100).collect()
  • Likewise, movies with a popularity of 0 are filtered out, and only movies with more than 100 votes are kept.
  • The resulting data is stored in pop_vote.json:
save('pop_vote.json', json.dumps(popVote()))

(4) The relationship between the average score and the number of movies produced by the company

  • This section calculates the number of movies produced by each company and the average score of those movies. First, the data is filtered to remove movies whose production company field is empty and movies with fewer than 100 votes. Then, for each record, a pair of the following form is produced:
[company name, (rating, 1)]
  • The ratings and counts of all records for the same company are then summed, and finally the total rating is divided by the count to obtain each company's average rating and number of movies. The whole process is implemented below (a small worked example of this accumulate-then-divide pattern follows the code).
def moviesVote():
    return mdf.filter(mdf["production_companies"] != '') \
        .filter(mdf["vote_count"] > 100) \
        .rdd \
        .flatMap(lambda g: [(v, [float(g['vote_average']), 1])
                            for v in map(lambda x: x['name'],
                                         json.loads(g["production_companies"]))]) \
        .repartition(1) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda v: [v[0], v[1][0] / v[1][1], v[1][1]]) \
        .collect()
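  • A minimal pure-Python sketch of the same accumulate-then-divide pattern, using made-up records:
# made-up (company, [rating, 1]) pairs, as produced by the flatMap step
pairs = [("Pixar", [8.0, 1]), ("Pixar", [7.0, 1]), ("Fox", [6.0, 1])]

acc = {}
for company, (score, cnt) in pairs:    # the reduceByKey step:
    s, c = acc.get(company, (0.0, 0))  # accumulate rating sums and counts
    acc[company] = (s + score, c + cnt)

result = [[k, s / c, c] for k, (s, c) in acc.items()]  # the final map step
print(result)  # [['Pixar', 7.5, 2], ['Fox', 6.0, 1]]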
  • The resulting data is stored in movies_vote.json:
save('movies_vote.json', json.dumps(moviesVote()))

(5) The relationship between film budget and revenue

  • This part considers the revenue of the movie, so for each movie, the following data needs to be exported:
[movie title, budget, revenue]
  • Simple field selection and filtering on the DataFrame is enough:
def budgetRevenue():
    return mdf.select("title", "budget", "revenue") \
        .filter(mdf["budget"] != 0) \
        .filter(mdf['revenue'] != 0).collect()
  • Movies whose budget or revenue is 0 are filtered out.
  • The resulting data is stored in budget_revenue.json:
save('budget_revenue.json', json.dumps(budgetRevenue()))

1.4.3 Summary

  • Finally, the above process is integrated for easy calling, so add the main function in analyze.py:
if __name__ == "__main__":
    m = {
        "countByGenres": {
            "method": countByGenres,
            "path": "genres.json"
        },
        "countByKeywords": {
            "method": countByKeywords,
            "path": "keywords.json"
        },
        "countByCompanies": {
            "method": countByCompanies,
            "path": "company_count.json"
        },
        "countByBudget": {
            "method": countByBudget,
            "path": "budget.json"
        },
        "countByLanguage": {
            "method": countByLanguage,
            "path": "language.json"
        },
        "distributionOfRuntime": {
            "method": distributionOfRuntime,
            "path": "runtime.json"
        },
        "budgetVote": {
            "method": budgetVote,
            "path": "budget_vote.json"
        },
        "dateVote": {
            "method": dateVote,
            "path": "date_vote.json"
        },
        "popVote": {
            "method": popVote,
            "path": "pop_vote.json"
        },
        "moviesVote": {
            "method": moviesVote,
            "path": "movies_vote.json"
        },
        "budgetRevenue": {
            "method": budgetRevenue,
            "path": "budget_revenue.json"
        }
    }
    base = "static/"  # base directory for the generated files
    if not os.path.exists(base):  # create the directory if it does not exist
        os.mkdir(base)

    for k in m:  # run all of the methods above
        p = m[k]
        f = p["method"]
        save(base + p["path"], json.dumps(f()))
        print("done -> " + k + " , save to -> " + base + p["path"])
  • The above code registers all of the analysis functions in the variable m, then calls each of them in a loop and exports the corresponding json file.


1.5 Data visualization methods

  • Data visualization is implemented with Alibaba's open-source visualization tool G2. G2 is a grammar-of-graphics library based on visual encoding; it is data-driven and highly usable and extensible, so a variety of interactive statistical charts can be built with a single statement, without attention to tedious implementation details. The following takes the genre distribution of movies in TMDb as an example to illustrate the visualization process.
  • First, the Python web framework bottle is used to serve the visualization pages, making it easy to read the json data. The following code, web.py, implements simple static file serving:
from bottle import route, run, static_file

@route('/static/<filename>')
def server_static(filename):
    return static_file(filename, root="./static")

@route(r"/<name:re:.*\.html>")
def server_page(name):
    return static_file(name, root=".")

@route("/")
def index():
    return static_file("index.html", root=".")

run(host="0.0.0.0", port=9999)
  • bottle routes the received requests as follows:

      1. For files under the static folder in the service's startup directory, the file matching the file name is returned directly;
      2. For html files in the startup directory, the corresponding page is returned;
      3. Accessing port 9999 of the machine directly returns the home page.
    • Finally, the web service is bound to local port 9999. Under this implementation, the web pages (html files) are placed directly in the directory where the service is started, and the results of the Spark analysis are saved in the static directory.
  • Next, implement the homepage file index.html.

<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,height=device-height">
  <title>TMDb Movie Data Analysis</title>
  <style>
    /* 这里省略 */
  </style>
</head>

<body>
  <div class="container">
    <h1 style="font-size: 40px;"># TMDb Movie Data Analysis <br> <small style="font-size: 55%;color: rgba(0,0,0,0.65);">>
        Big Data Processing Technology on Spark</small> </h1>
    <hr>
    <h1 style="font-size: 30px;color: #404040;">I. Overviews</h1>
    <div class="chart-group">
      <h2>- Distribution of Genres in TMDb <br> <small style="font-size: 72%;">> This figure
          compares the genre
          distribution in TMDb, and you can see that most of the movies in TMDb are Drama.</small> </h2>
      <iframe src="genres.html" class="frame" frameborder="0"></iframe>
    </div>
  </div>

  <script>/*Fixing iframe window.innerHeight 0 issue in Safari*/document.body.clientHeight;</script>
</body>

</html>
  • Each chart is brought into the home page via an iframe. For each chart, the home page contains a title and an iframe of the page the chart is on.
  • The genre distribution analysis results in TMDb are implemented in genres.html, and this file is implemented below.
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,height=device-height">
  <title>TOP 5000 Movie Data Analysis</title>
  <style>
    ::-webkit-scrollbar {
      display: none;
    }

    html,
    body {
      font-family: 'Ubuntu Condensed';
      height: 100%;
      margin: 0;
      color: rgba(0, 0, 0, 0.85);
    }
  </style>
</head>

<body>
  <div id="mountNode"></div>
  <script>/*Fixing iframe window.innerHeight 0 issue in Safari*/document.body.clientHeight;</script>
  <script src="static/g2.min.js"></script>
  <script src="static/data-set.min.js"></script>
  <script src="static/jquery-3.2.1.min.js"></script>
  <script>
    function generateChart(id, type, xkey, xlabel, ykey, ylabel) {
      var chart = new G2.Chart({  // initialize the chart
        container: id,
        forceFit: true,
        height: 500,
        padding: [40, 80, 80, 80],
      });
      chart.scale(ykey, {  // configure the y scale
        alias: ylabel,
        min: 0,
        // max: 3000,
        tickCount: 4
      });

      chart.axis(xkey, {  // configure the x axis
        label: {
          textStyle: {
            fill: '#aaaaaa'
          }
        },
        tickLine: {
          alignWithLabel: false,
          length: 0
        }
      });

      chart.axis(ykey, {  // configure the y axis
        label: {
          textStyle: {
            fill: '#aaaaaa'
          }
        },
        title: {
          offset: 50
        }
      });
      chart.legend({  // configure the legend
        position: 'top-center'
      });
      // configure labels, colors, etc.
      chart.interval().position(`${xkey}*${ykey}`).label(ykey).color('#ffb877').opacity(1).adjust([{
        type,
        marginRatio: 1 / 32
      }]);
      chart.render();

      return chart;
    }
  </script>
  <script>
    // call the function above to create the chart
    let chart = generateChart('mountNode', 'dodge', 'genre', 'genres', 'count', '# movies');

    window.onload = () => {
      // after the page loads, read the json file with jQuery's helper
      $.getJSON("/static/genres.json", d => {
        chart.changeData(d)  // update the chart with G2's data-update API
      })
    }
  </script>
</body>

</html>
  • The explanation of the code is given in its comments. Before using this page, the corresponding js libraries (g2.js, data-set.js, jquery) need to be placed in the static folder.
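  • Note: the json files under static/ have to be generated before the pages can display anything, so presumably the analysis script is run first:
spark-submit analyze.py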
  • After the code is completed, in the root directory where the code is located, execute:
spark-submit web.py
  • The command line then shows that the bottle service has started and is listening on port 9999.

  • After startup is complete, open a browser and visit http://127.0.0.1:9999 to see the visualization results.

1.6 Data charts

  • Genre distribution in TMDb movies
  • Top 100 common keywords
  • The 10 most common budget amounts in TMDb
  • The most common movie runtimes in TMDb (only runtimes shared by at least 100 movies are shown)
  • The 10 companies that produce the most movies
  • Top 10 movie languages in TMDb
  • The relationship between budget and rating
  • The relationship between release date and rating
  • The relationship between popularity and rating
  • The relationship between a company's average score and its number of movies
  • The relationship between movie budget and revenue

Supplementary items


2. Optional question (Topic 4: Analysis of factors affecting fiscal revenue and a forecasting model)

[Data Mining Case] Analysis and prediction model of factors affecting fiscal revenue
