Brief description
- The IMDB database is a large, widely used database of movies, TV shows, and the people and companies behind them: actors, production companies, screenwriters, and directors. The IMDB dataset is commonly used as reference data for movie reviews, classification, prediction, and other machine learning tasks.
- The Join Order Benchmark (JOB) is a database benchmark designed to evaluate query optimizers, in particular how well they choose the join order between relational tables. Its queries join many relational tables, which challenges the optimizer to make good join order decisions when processing complex queries. Paper link: http://www.vldb.org/pvldb/vol9/p204-leis.pdf
Join Order Benchmark (JOB) query acquisition
To use the benchmark, import the IMDB dataset into PostgreSQL and obtain the Join Order Benchmark (JOB) queries:
Join Order Benchmark (JOB) on GitHub (contains an installation tutorial)
On GitHub, download the query statements directly:
Note that the repository's instructions include a download link for the IMDB dataset, but the website link in the second step is invalid, so import the data another way:
Import data from IMDB to PG
Import and use of data sets TPC-H, TPC-DS, and IMDB
(1) Download CSV and other files
Download imdb.tgz and put it in a directory of your choice. Remember this path, because the later steps use it. The author places it in /var/lib/pgsql/benchmark
Next, unzip imdb.tgz:
tar -zxvf imdb.tgz
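Before moving on, it can help to confirm that the archive actually unpacked the files the later steps read. The sketch below assumes that schematext.sql and the CSV files end up directly in the directory you pass in. The function name missing_files is hypothetical, and the table list is simply the set of tables loaded by the \copy commands in this tutorial.

```shell
#!/bin/bash
# Hypothetical helper: list expected files that are missing from a directory.
# The table names are the 21 tables loaded later in this tutorial.
JOB_TABLES="aka_name aka_title cast_info char_name comp_cast_type company_name
company_type complete_cast info_type keyword kind_type link_type
movie_companies movie_info movie_info_idx movie_keyword movie_link name
person_info role_type title"

missing_files() {
    dir="$1"
    # The schema script is needed to create the tables
    [ -f "$dir/schematext.sql" ] || echo "schematext.sql"
    # Each table is loaded from a CSV file of the same name
    for t in $JOB_TABLES; do
        [ -f "$dir/$t.csv" ] || echo "$t.csv"
    done
}

# Example (the path is the one used in this tutorial; adjust it to yours):
# missing_files /var/lib/pgsql/benchmark
```

If the function prints nothing, every expected file is present. Otherwise it prints one missing file name per line.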
The following commands must be run inside psql. Start psql and create a database:
CREATE DATABASE imdbload;
Connect to the imdbload database:
\c imdbload
(2) Execute the SQL script to create the tables. Change the path below to the directory where you extracted imdb.tgz:
\i /var/lib/pgsql/benchmark/schematext.sql;
(3) Import the data. Again, change the paths below to the directory where you extracted imdb.tgz:
\copy aka_name from '/var/lib/pgsql/benchmark/aka_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy aka_title from '/var/lib/pgsql/benchmark/aka_title.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy cast_info from '/var/lib/pgsql/benchmark/cast_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy char_name from '/var/lib/pgsql/benchmark/char_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy comp_cast_type from '/var/lib/pgsql/benchmark/comp_cast_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_name from '/var/lib/pgsql/benchmark/company_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_type from '/var/lib/pgsql/benchmark/company_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy complete_cast from '/var/lib/pgsql/benchmark/complete_cast.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy info_type from '/var/lib/pgsql/benchmark/info_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy keyword from '/var/lib/pgsql/benchmark/keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy kind_type from '/var/lib/pgsql/benchmark/kind_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy link_type from '/var/lib/pgsql/benchmark/link_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_companies from '/var/lib/pgsql/benchmark/movie_companies.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info from '/var/lib/pgsql/benchmark/movie_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info_idx from '/var/lib/pgsql/benchmark/movie_info_idx.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_keyword from '/var/lib/pgsql/benchmark/movie_keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_link from '/var/lib/pgsql/benchmark/movie_link.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy name from '/var/lib/pgsql/benchmark/name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy person_info from '/var/lib/pgsql/benchmark/person_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy role_type from '/var/lib/pgsql/benchmark/role_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy title from '/var/lib/pgsql/benchmark/title.csv' with delimiter as ',' csv quote '"' escape as '\';
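Typing 21 nearly identical commands is error-prone, so the list can also be generated from the shell. The following sketch prints the same \copy commands for a given directory. The function name gen_copy_commands and the output file name load_imdb.sql are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one \copy command per table for a data directory.
JOB_TABLES="aka_name aka_title cast_info char_name comp_cast_type company_name
company_type complete_cast info_type keyword kind_type link_type
movie_companies movie_info movie_info_idx movie_keyword movie_link name
person_info role_type title"

gen_copy_commands() {
    dir="$1"
    for t in $JOB_TABLES; do
        # Inside double quotes, \\ yields one backslash and \" yields a quote,
        # so the output has the same form as the hand-written commands.
        printf '%s\n' "\\copy $t from '$dir/$t.csv' with delimiter as ',' csv quote '\"' escape as '\\';"
    done
}

# Example: write the commands to a file, then run the file with psql
# gen_copy_commands /var/lib/pgsql/benchmark > load_imdb.sql
# psql -d imdbload -f load_imdb.sql
```

Generating the commands keeps the path in one place, so moving the data directory means changing a single argument instead of 21 lines.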
(4) Check the data (optional)
After the import there is no direct confirmation that it succeeded, so we can write a shell script to check. If that seems like too much trouble, skip the script and just check one or two tables by hand.
Bash command to display all tables in imdbload:
echo "\dt" | psql -t -A -d imdbload
If the result displayed is:
public|aka_name|table|postgres
public|aka_title|table|postgres
public|cast_info|table|postgres
public|char_name|table|postgres
public|comp_cast_type|table|postgres
public|company_name|table|postgres
public|company_type|table|postgres
public|complete_cast|table|postgres
public|info_type|table|postgres
public|keyword|table|postgres
public|kind_type|table|postgres
public|link_type|table|postgres
public|movie_companies|table|postgres
public|movie_info|table|postgres
public|movie_info_idx|table|postgres
public|movie_keyword|table|postgres
public|movie_link|table|postgres
public|name|table|postgres
public|person_info|table|postgres
public|role_type|table|postgres
public|title|table|postgres
then the script needs to split each line on | to extract the table name:
#!/bin/bash
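# Get the \dt listing of all tables (one schema|name|type|owner line per table)
TABLES=$(echo "\dt" | psql -t -A -d imdbload)
# Loop over each table and print its row count
for table in $TABLES; do
    # Keep only the second |-separated field, which is the table name
    table=$(echo "${table}" | cut -d '|' -f 2)
    count=$(echo "SELECT COUNT(*) FROM $table" | psql -t -A -d imdbload)
    echo "$table: $count"
done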
If the listing instead shows only the table names (for example, just movie_info), then remove the line table=$(echo "${table}" | cut -d '|' -f 2) from the script.
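Since the shape of the \dt output decides whether the cut step is needed, one option is to make the extraction tolerate both forms. The sketch below is one way to do that; table_name_from_line is a hypothetical helper name, not something provided by psql or the benchmark.

```shell
#!/bin/bash
# Hypothetical helper: extract the table name from one line of listing output.
table_name_from_line() {
    case "$1" in
        # schema|name|type|owner form: the name is the second field
        *'|'*) printf '%s\n' "$1" | cut -d '|' -f 2 ;;
        # bare table name: use the line unchanged
        *) printf '%s\n' "$1" ;;
    esac
}

# Example:
# table_name_from_line 'public|movie_info|table|postgres'
# table_name_from_line 'movie_info'
```

Both example calls print movie_info, so the same script works whichever form the listing takes.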
Test: run vim test_imdb.sh, write the complete script into it, then save and exit with :wq. Then run sh test_imdb.sh. If the output shows that each table contains data, then the data import was successful!
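With 21 tables, scanning the counts by eye is easy to get wrong. As a rough sketch, the output of the script can be piped through a small filter that reports any table whose count is missing or zero. The name find_empty_tables is hypothetical, and the input is assumed to be in the "table: count" format that the script above prints.

```shell
#!/bin/bash
# Hypothetical filter: read "table: count" lines on stdin and print the
# names of tables whose count is empty, non-numeric, or zero.
find_empty_tables() {
    while IFS=: read -r name count; do
        # Skip blank lines
        [ -n "$name" ] || continue
        # Remove the spaces around the count
        count=$(printf '%s' "$count" | tr -d ' ')
        case "$count" in
            ''|*[!0-9]*|0) printf '%s\n' "$name" ;;
        esac
    done
}

# Example (the function must be defined in the current shell first):
# sh test_imdb.sh | find_empty_tables
```

No output means every table reported a non-zero row count. Any table name that is printed is worth re-importing or inspecting by hand.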