[IMDB] Import the IMDB dataset into PostgreSQL and obtain the Join Order Benchmark (JOB) queries

brief description

  • The IMDB database is a large and widely used movie database. It contains information about movies, TV shows, actors, production companies, screenwriters, and directors, and is a useful source of reference data for movie review, classification, prediction, and other machine learning tasks.
  • The Join Order Benchmark (JOB) is a database benchmark designed to evaluate a query optimizer's ability, especially its ability to choose good join orders between relational tables. Its queries join many tables, which challenges the optimizer to make good join-order decisions for complex queries; a simplified example of such a query is sketched after this list. Paper link: http://www.vldb.org/pvldb/vol9/p204-leis.pdf
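To make the join-order challenge concrete, here is a minimal JOB-style query (a simplified sketch in the spirit of the benchmark, not one of the official queries verbatim), run against the imdbload database that is built in the steps below:

psql -d imdbload <<'SQL'
-- Three-way join in the JOB style: the join order the optimizer picks
-- strongly affects the size of the intermediate results.
SELECT MIN(t.title) AS movie
FROM title t,
     movie_companies mc,
     company_name cn
WHERE t.id = mc.movie_id
  AND mc.company_id = cn.id
  AND cn.country_code = '[us]'
  AND t.production_year > 2000;
SQL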

obtaining the join order benchmark (JOB) queries

The IMDB dataset is imported into PostgreSQL and the join order benchmark (JOB) queries are obtained as follows:

join order benchmark (JOB) GitHub repository (contains an installation tutorial)

Open the GitHub repository; the query statements (the .sql files) can be downloaded directly, for example by cloning the repository as shown below.

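A minimal sketch, assuming the queries live in the commonly used gregrahn/join-order-benchmark repository, where the 113 queries are stored as individual .sql files such as 1a.sql:

git clone https://github.com/gregrahn/join-order-benchmark.git
ls join-order-benchmark/*.sql    # the JOB query files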

Note that the repository also gives instructions for downloading the IMDB dataset, but the website link in its second step is no longer valid, so the data has to be obtained and imported another way:


Import the IMDB data into PostgreSQL (PG)

Reference: import and use of the TPC-H, TPC-DS, and IMDB datasets

(1) Download the CSV files

Download imdb.tgz and place it in a directory of your choice; remember this path, as it will be needed later. Here it is placed in /var/lib/pgsql/benchmark.

Next, unzip imdb.tgz:

tar -zxvf imdb.tgz
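The whole step can also be scripted. This is only a sketch: the URL below is a placeholder, since the original download link is no longer valid, and must be replaced with whichever source you actually obtain imdb.tgz from:

#!/bin/bash
# Placeholder URL: substitute the mirror you actually use for imdb.tgz
URL="https://example.com/imdb.tgz"
DEST=/var/lib/pgsql/benchmark

mkdir -p "$DEST"
wget -O "$DEST/imdb.tgz" "$URL"
tar -zxvf "$DEST/imdb.tgz" -C "$DEST"
ls "$DEST"/*.csv | head    # the extracted CSV files, one per table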


The following commands are run inside psql.

(2) Enter PostgreSQL with psql and create the database:

CREATE DATABASE imdbload;

Connect to the imdbload database:

\c imdbload
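If you prefer to do this from the shell instead of inside psql (assuming you are running as a user allowed to create databases, e.g. postgres), an equivalent is:

createdb imdbload                        # same as CREATE DATABASE imdbload;
psql -d imdbload -c "SELECT version();"  # quick check that the connection works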

(3) Execute the SQL script to create the tables. Adjust the path below to the directory where imdb.tgz was extracted:

\i /var/lib/pgsql/benchmark/schematext.sql
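Equivalently, assuming the same path, the script can be run from the shell without entering psql first:

psql -d imdbload -f /var/lib/pgsql/benchmark/schematext.sql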

(4) Import the data. Again, adjust the paths below to the directory where the CSV files were extracted:

\copy aka_name from '/var/lib/pgsql/benchmark/aka_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy aka_title from '/var/lib/pgsql/benchmark/aka_title.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy cast_info from '/var/lib/pgsql/benchmark/cast_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy char_name from '/var/lib/pgsql/benchmark/char_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy comp_cast_type from '/var/lib/pgsql/benchmark/comp_cast_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_name from '/var/lib/pgsql/benchmark/company_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_type from '/var/lib/pgsql/benchmark/company_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy complete_cast from '/var/lib/pgsql/benchmark/complete_cast.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy info_type from '/var/lib/pgsql/benchmark/info_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy keyword from '/var/lib/pgsql/benchmark/keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy kind_type from '/var/lib/pgsql/benchmark/kind_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy link_type from '/var/lib/pgsql/benchmark/link_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_companies from '/var/lib/pgsql/benchmark/movie_companies.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info from '/var/lib/pgsql/benchmark/movie_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info_idx from '/var/lib/pgsql/benchmark/movie_info_idx.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_keyword from '/var/lib/pgsql/benchmark/movie_keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_link from '/var/lib/pgsql/benchmark/movie_link.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy name from '/var/lib/pgsql/benchmark/name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy person_info from '/var/lib/pgsql/benchmark/person_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy role_type from '/var/lib/pgsql/benchmark/role_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy title from '/var/lib/pgsql/benchmark/title.csv' with delimiter as ',' csv quote '"' escape as '\';
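Since all 21 commands follow the same pattern, they can also be generated with a small shell loop. This is a sketch that assumes every *.csv file in the benchmark directory is named after its target table, which is the case for the files listed above:

#!/bin/bash
# Run the same \copy command for every CSV file in the benchmark directory
DIR=/var/lib/pgsql/benchmark
for csv in "$DIR"/*.csv; do
    table=$(basename "$csv" .csv)
    echo "importing $table ..."
    psql -d imdbload -c "\copy $table from '$csv' with delimiter as ',' csv quote '\"' escape as '\\'"
done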

(5) Check the data (optional)

After importing, we still do not know whether the import succeeded, so we can write a shell script to check. If that seems like too much trouble, you can skip the script and just spot-check one or two tables.

Bash command to list all tables in imdbload:

echo "\dt" | psql -t -A -d imdbload

If the result displayed is:

public|aka_name|table|postgres
public|aka_title|table|postgres
public|cast_info|table|postgres
public|char_name|table|postgres
public|comp_cast_type|table|postgres
public|company_name|table|postgres
public|company_type|table|postgres
public|complete_cast|table|postgres
public|info_type|table|postgres
public|keyword|table|postgres
public|kind_type|table|postgres
public|link_type|table|postgres
public|movie_companies|table|postgres
public|movie_info|table|postgres
public|movie_info_idx|table|postgres
public|movie_keyword|table|postgres
public|movie_link|table|postgres
public|name|table|postgres
public|person_info|table|postgres
public|role_type|table|postgres
public|title|table|postgres

then the script needs to split each line on | to extract the table name:

#!/bin/bash

# Get the names of all tables
TABLES=$(echo "\dt" | psql -t -A -d imdbload)

# Loop over each table and print its row count
for table in $TABLES; do
    table=$(echo "${table}" | cut -d '|' -f 2)
    count=$(echo "SELECT COUNT(*) FROM $table" | psql -t -A -d imdbload)
    echo "$table: $count"
done

If you only want to check a single table such as movie_info (for example by setting TABLES=movie_info), remove the eighth line of the script, table=$(echo "${table}" | cut -d '|' -f 2), as sketched below.
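Under that assumption (checking only movie_info), the simplified script is just:

#!/bin/bash
# Count the rows of a single table
count=$(echo "SELECT COUNT(*) FROM movie_info" | psql -t -A -d imdbload)
echo "movie_info: $count"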

test:

Create test_imdb.sh with vim, write in the complete script, and save and exit with :wq. Then run sh test_imdb.sh; the output should list each table name followed by its row count.


If every table reports a non-zero count, the data import was successful!
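With the data loaded, the downloaded JOB queries can be run against the database. A minimal sketch, assuming the repository was cloned as above and its queries are named like 1a.sql:

psql -d imdbload -f join-order-benchmark/1a.sql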
