On TPC-DS & TPC-H and query performance in the data warehouse

Abstract: How to use standard data models such as TPC-DS/TPC-H to obtain DWS query performance data when using GaussDB (DWS).

This article is shared from the HUAWEI CLOUD community article "GaussDB (DWS): Things about TPC-DS & TPC-H and Query Performance of DWS", by One Sword and Eight Wilderness.

1 Overview

The goal of this article is to describe in detail how to use standard data models such as TPC-DS/TPC-H to obtain DWS query performance data when using GaussDB (DWS). It consists of four parts: an overview of the overall process; preparation of the DWS cluster and ECS Elastic Cloud Server environment; TPC-DS/TPC-H data generation, table creation, and data import; and query execution and result collection.

Due to the limitations of the editor's display, for better readability please download the attached original document to view it and obtain the relevant scripts.

Many of the operational details involved cannot be described one by one; the main purpose here is to lay out the overall logic. The main tools involved (OBS, GDS, and JDBC COPY) will each be described separately later. If you run into a problem you cannot resolve, please leave a comment.

2 Overview of the overall process

3 DWS cluster and ECS elastic cloud server environment preparation

3.1 Create an ECS Elastic Cloud Server

3.2 Create DWS data warehouse

4 TPC-DS/TPC-H Data Generation

4.1 Preparing the data generation tool

  1. Connect to the ECS Elastic Cloud Server remotely.
  2. Execute yum install git to install git.
  3. Execute yum install gcc to install gcc.
  4. Execute mkdir -p /data1/script/tpcds-kit/tpcds1000X ; mkdir -p /data1/script/tpch-kit/tpch100X to create storage directories for the TPC-DS and TPC-H data.
  5. Obtain the latest version of the TPC-DS data generation tool dsdgen from the official TPC website.

Upload it to /data1/script/tpcds-kit on the ECS via FTP or the OBS service (see Appendix 1 for details on how to use OBS).

The TPC-H data generation tool can be downloaded directly with git clone:

cd /data1/script/tpch-kit;

git clone https://github.com/gregrahn/tpch-kit.git

  6. Unzip the tpch package, enter the dbgen directory, and run make to compile the data generation tool dbgen.
  7. Unzip the tpcds package, enter the tools directory, and run make to compile the data generation tool dsdgen.
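The preparation steps above can be consolidated into one script. A minimal sketch, assuming a /tmp base path for illustration (the article itself uses /data1/script) and checking rather than installing the build prerequisites:

```shell
#!/bin/sh
# Sketch of the tool-preparation steps. BASE is illustrative;
# the article itself uses /data1/script.
BASE=/tmp/tpc-demo
mkdir -p "$BASE/tpcds-kit/tpcds1000X" "$BASE/tpch-kit/tpch100X"

# Verify the build prerequisites from steps 2-3 are present.
for tool in git gcc make; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: ok"
    else
        echo "$tool: missing - install it first (e.g. yum install $tool)"
    fi
done
```

After this, clone the tpch-kit into the tpch-kit directory and run make in its dbgen directory (and in the TPC-DS kit's tools directory) as in steps 5 to 7.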

4.2 Generate data files

  • Generate TPCH data file

After entering the dbgen directory, execute ./dbgen -s 100 > ./dbgen_100.log 2>&1 &, which generates the 100X TPC-H data in the background.

You can use du -sh dbgen/*.tbl to track the progress of data file generation; the 100X TPC-H data files total about 107 GB.

You can also use ps ux | grep dbgen to check whether the data-generation process has exited.
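The du check above can be wrapped in a small helper that reports the combined size of the .tbl files generated so far. A sketch; the directory argument is wherever dbgen writes its output:

```shell
#!/bin/sh
# Report the total size in bytes of the *.tbl files dbgen has
# produced so far in the given directory.
tbl_total_bytes() {
    dir="$1"
    total=0
    for f in "$dir"/*.tbl; do
        [ -e "$f" ] || continue        # no .tbl files yet
        sz=$(wc -c < "$f")
        total=$((total + sz))
    done
    echo "$total"
}
```

For example, tbl_total_bytes ./dbgen prints 0 until the first file appears, then grows toward the ~107 GB total.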

  • Generate TPCDS data file

Because a single target data file for the 1000X TPC-DS data is large, we adopt a sharded (parallel) generation strategy.

After entering the tools directory, execute

for c in {1..10};do (./dsdgen -sc 1000 -parallel 10 -child ${c} -dir /data1/script/tpcds-kit/tpcds1000X > /dev/null 2>&1 &);done

where:

-sc specifies the scale of the generated data

-parallel specifies the total number of shards

-child specifies which shard this invocation generates

-dir specifies the directory where the generated data files are stored

You can use du -sh tpcds1000X/*.dat to track the progress of data file generation; the 1000X TPC-DS data files total about 920 GB.

You can also use ps ux|grep dsdgen to check whether the process of generating the data file has exited.
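The sharding loop above follows a general launch-and-wait pattern. A sketch with dsdgen replaced by an arbitrary command, since the real binary may not be on the path; the shard number is appended as the last argument (dsdgen takes it via -child):

```shell
#!/bin/sh
# Launch one background job per shard, then wait for all of them.
# "$@" stands in for the dsdgen command line; the shard number is
# appended as the final argument.
run_shards() {
    shards="$1"; shift
    c=1
    while [ "$c" -le "$shards" ]; do
        "$@" "$c" &
        c=$((c + 1))
    done
    wait    # block until every shard has exited
}
```

Invoked as run_shards 10 ./dsdgen -sc 1000 -parallel 10 -dir /data1/script/tpcds-kit/tpcds1000X -child, it reproduces the loop in the text.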

5 Create tables and import data

5.1 Import by GDS

5.1.1 Download the gsql client of the corresponding version from the Connection Management page of the Data Warehouse Service and upload it to the ECS via FTP or OBS (see Appendix 1 for OBS usage).

5.1.2 Deploy GDS on ECS, see HUAWEI CLOUD official documentation https://support.huaweicloud.com/tg-dws/dws_07_0759.html

5.1.3 Use the gsql tool on the ECS to connect to the cluster. The IP address and port number required for the connection can be obtained from the Connection Management page of the Data Warehouse Service.

5.1.4 Use gsql to connect to the cluster on the ECS, and create the TPC-H/TPC-DS internal tables and GDS foreign tables. See the following SQL file for the table-creation statements.

5.1.5 Use gsql to connect to the cluster on the ECS and import data into the cluster through the GDS foreign tables, using the pattern INSERT INTO [target table] SELECT * FROM [foreign table].
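The INSERT ... SELECT step can be scripted per table. A sketch, where the _ext suffix for the GDS foreign table is an assumed naming convention (adjust to match your DDL) and the actual gsql invocation is left commented out:

```shell
#!/bin/sh
# Emit the GDS load statement for one table. The "_ext" foreign-table
# suffix is an assumed naming convention.
gds_load_sql() {
    printf 'INSERT INTO %s SELECT * FROM %s_ext;\n' "$1" "$1"
}

# Example: print the load statement for every TPC-H table.
# Uncomment the gsql line to run the load for real.
for t in nation region part supplier partsupp customer orders lineitem; do
    gds_load_sql "$t"
    # gsql -d tpch_test -p 6000 -c "$(gds_load_sql "$t")"
done
```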

5.2 Import by JDBC copy

5.2.1 Download the JDBC driver of the corresponding version from the Connection Management page of the Data Warehouse Service and upload it to the ECS via FTP or OBS (see Appendix 1 for OBS usage).

5.2.2 Upload the JDBC driver and the COPY Java program to the ECS. The source code of dws_copy.java is provided here.

5.2.3 On the ECS, compile the Java file with javac, then package the compiled COPY source code and the JDBC driver into a jar, Copy.jar. The detailed process of compiling and generating the jar package is as follows:

5.2.4 On the ECS, run java -jar Copy.jar to copy data into the cluster through JDBC.
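The compile-and-package step can be sketched as below. The driver jar name (gsjdbc4.jar) and the dws_copy main-class name are assumptions, and the compile runs only when a JDK and the source file are actually present:

```shell
#!/bin/sh
# Sketch of compiling dws_copy.java and packaging Copy.jar.
# gsjdbc4.jar and the dws_copy main class are assumed names.
JDBC_JAR=gsjdbc4.jar

# The manifest makes the jar runnable with "java -jar" and puts
# the JDBC driver on the class path.
printf 'Main-Class: dws_copy\nClass-Path: %s\n' "$JDBC_JAR" > manifest.txt

if command -v javac >/dev/null 2>&1 && [ -f dws_copy.java ]; then
    javac -cp "$JDBC_JAR" dws_copy.java
    jar cfm Copy.jar manifest.txt dws_copy.class
fi
```

java -jar Copy.jar then runs the loader as described in 5.2.4.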

The executable source code and the shell-wrapped JDBC data-load execution script are provided in the following compressed package.

6 Execute query and result collection

6.1 Automate query execution and result collection by writing shell scripts.

The script package is as follows; it contains two files, query.conf and run_query.sh.

query.conf is the cluster information configuration file, which contains the following four variables:

db_name=tpcds_test        # database name

db_port=6000              # database port number

db_user=tpcds_user        # database user

user_passwd=Gauss_234     # database user password

After editing query.conf with the information for your cluster, execute sh run_query.sh to start query execution and result collection.

Precautions:

  1. To use the gsql client, source gsql_env after each connection and make sure gsql is executable before running the query script;
  2. Each query is run 6 times: once to collect the execution plan, twice to warm up, and three times as the formal measurement; the final result is the average of the last three runs;
  3. A directory named query_log_yymmdd_hhmmss is generated as soon as the query script is executed, where

The explain_log subdirectory stores the query plans,

The pre_warm subdirectory stores the warm-up execution results,

The real_test subdirectory stores formal query execution results,

The query_result.csv file summarizes the execution results of all queries in CSV format. An example of the CSV results is as follows.
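The 6-run scheme in note 2 can be sketched as a timing helper. Since run_query.sh's internals are not shown in this excerpt, the gsql call is replaced by an arbitrary command, and timing uses GNU date's nanosecond format (a Linux assumption):

```shell
#!/bin/sh
# Time one query per the scheme above: 1 plan-collection run,
# 2 warm-up runs, 3 measured runs; print the mean of the measured
# runs in milliseconds. "$@" stands in for something like:
#   gsql -d tpcds_test -f q01.sql
time_query() {
    "$@" >/dev/null 2>&1               # run 1: collect the plan
    "$@" >/dev/null 2>&1               # runs 2-3: warm-up
    "$@" >/dev/null 2>&1
    total=0
    i=1
    while [ "$i" -le 3 ]; do           # runs 4-6: measured
        s=$(date +%s%N)
        "$@" >/dev/null 2>&1
        e=$(date +%s%N)
        total=$((total + (e - s) / 1000000))
        i=$((i + 1))
    done
    echo $((total / 3))
}
```

Looping this over every query file and appending each mean to a CSV line gives results in the shape of query_result.csv.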

7 Appendix

7.1 HUAWEI CLOUD OBS Official User Guide

https://support.huaweicloud.com/browsertg-obs/obs_03_1000.html

 
