Introduction to the database import and export tool BatchTool

Performance comparison

The software versions and system resources used in the performance experiments are shown in the following table:

Test table

The test table is the lineitem table from the TPC-H specification, with a total of 59.98 million rows; exported as a single CSV file, it takes up 7.4 GB.

CREATE TABLE `lineitem` (
  `l_orderkey` bigint(20) NOT NULL,
  `l_partkey` int(11) NOT NULL,
  `l_suppkey` int(11) NOT NULL,
  `l_linenumber` bigint(20) NOT NULL,
  `l_quantity` decimal(15,2) NOT NULL,
  `l_extendedprice` decimal(15,2) NOT NULL,
  `l_discount` decimal(15,2) NOT NULL,
  `l_tax` decimal(15,2) NOT NULL,
  `l_returnflag` varchar(1) NOT NULL,
  `l_linestatus` varchar(1) NOT NULL,
  `l_shipdate` date NOT NULL,
  `l_commitdate` date NOT NULL,
  `l_receiptdate` date NOT NULL,
  `l_shipinstruct` varchar(25) NOT NULL,
  `l_shipmode` varchar(10) NOT NULL,
  `l_comment` varchar(44) NOT NULL,
  PRIMARY KEY (`l_orderkey`,`l_linenumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Data export

Test Results

Note: mysqldump can export to CSV files, but this relies on MySQL's server-side select ... into outfile capability. The cloud database used in this article does not allow that feature, so we only test the efficiency of mysqldump's native export to SQL files.
When mysqldump exports data to a SQL file, it automatically combines multiple rows into batch insert statements. The size of each statement is controlled by the net-buffer-length parameter, which defaults to 1 MB; to improve the performance of subsequent imports, you can increase this value appropriately.
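
For example, a hedged sketch of such an export (the host, credentials and chosen buffer size below are placeholders, not values from the tests above):

# Export the lineitem table with larger batch insert statements (4 MB instead of the default 1 MB)
mysqldump -h <host> -P 3306 -u <user> -p --net-buffer-length=4194304 tpch lineitem > lineitem.sql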

When BatchTool exports PolarDB-X distributed tables, its performance is significantly better than that of mysqldump. This is because BatchTool is adapted to PolarDB-X partitioned tables: after obtaining the metadata of the logical table, it opens multiple connections at once and exports the underlying physical tables in parallel, making full use of the network bandwidth.

Data import

Test Results

Note: Each round of import testing creates a new empty table for import.

  • source importing a SQL file:

Importing a SQL file with source is executed entirely serially, but since mysqldump has already combined rows into batch insert statements when exporting to the SQL file, the import efficiency is not too low.

  • load data importing a CSV file:

In MySQL, although load data also executes in a single thread, it is still much more efficient than importing SQL files with source, because load data only needs to transfer the text file over the network and does not have to go through SQL parsing and optimization on the MySQL side. To further improve performance, you need to manually split the file and open multiple database connections to import in parallel.
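
A minimal sketch of that manual approach, assuming the CSV has already been split into part files (the file names, host and credentials below are placeholders):

# Import the split files in parallel, each over its own connection
for f in lineitem.part0.csv lineitem.part1.csv lineitem.part2.csv; do
  mysql --local-infile=1 -h <host> -P 3306 -u <user> -p<password> tpch \
    -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE lineitem FIELDS TERMINATED BY ',';" &
done
wait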

In PolarDB-X, however, load data is relatively slower, because the text stream has to be routed on the compute node and then assembled into batch insert statements that are sent to the storage nodes for execution; this cannot take advantage of MySQL's highly optimized native load data implementation, so the performance is comparatively low.

  • BatchTool importing a CSV file:

System monitoring shows that BatchTool's network transfer bandwidth when importing the CSV reaches 39 MB/s, more than three times that of load data. This is because BatchTool is based on a producer-consumer model: it reads a single file concurrently and concurrently sends batch insert statements to the database, making full use of the hardware resources on the stress-testing machine and increasing import throughput.

Practical scenarios

In addition to regular parallel import and export of data, BatchTool also supports many scenario-specific features for data migration. The following introduces how to use BatchTool and the parameters of its various built-in modes, organized by practical scenario.

Database Connectivity

BatchTool supports import and export for any database compatible with the MySQL protocol. The parameters for connecting to a database are -h (database host), -P (port number), -u (user name), -p (password) and -D (target database). On top of this, it also supports connecting to a load-balancing address, for example: -lb true -h "host1:3306,host2:3306" -uroot.
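
For example, a complete invocation might look like the sketch below (the jar file name, host and credentials are placeholders; check the actual release artifact name):

# Export the customer table of the tpch database, using comma as the field separator
java -jar batch-tool.jar -h 127.0.0.1 -P 3306 -u root -p "yourPassword" -D tpch -o export -t customer -s ,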

Note: the database connection parameters are omitted in the examples below; only the parameters related to each feature are shown.

Whole database migration

BatchTool supports importing or exporting an entire database in one go, including all table structures and table data. If a database contains many tables (for example, thousands of them), migrating the table structures by exporting them with mysqldump and then replaying them with source is very inefficient, because the whole process is single-threaded.

BatchTool can execute the DDL create-table statements in parallel while reading the table-structure SQL file, which improves efficiency.

When the command-line parameter -t (table name) is specified, a single table is imported or exported; if this parameter is omitted, the whole database is imported or exported.

The command-line parameter for table structures is -DDL (migration mode). There are three migration modes:

  • NONE: do not migrate table structures (default)
  • ONLY: migrate only the table structures, not the data
  • WITH: migrate both table structures and data

For example, to export only the table structures of all tables in the tpch database: -D tpch -o export -DDL only.
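
A hedged end-to-end sketch of migrating a whole database, structures and data together (the jar file name and connection options are placeholders; it also assumes the exported files stay in the working directory, otherwise add -dir):

# Export all table structures and data of the tpch database from the source instance
java -jar batch-tool.jar <source connection options> -D tpch -o export -DDL with -s ,
# Import the structures and data into the target instance
java -jar batch-tool.jar <target connection options> -D tpch -o import -DDL with -s ,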

Export file split

BatchTool supports specifying the number of exported files or the maximum number of lines per file. For standalone MySQL, BatchTool exports each table into a single file by default; for the distributed database PolarDB-X, BatchTool exports each physical table under a logical table into its own file by default, i.e. the number of files equals the number of partitions. On top of this, two parameters affect the number of exported files:

  • -F (number of files): fixes the number of exported files; the data is divided evenly according to the total data volume of the table
  • -L (maximum number of lines): specifies the maximum number of lines per file; when a file reaches this limit, a new file is opened and writing continues

For example, to export each table in the tpch database into a single CSV file: -D tpch -o export -s , -F 1.
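
A hedged sketch showing both options as full commands (the jar file name and connection options are placeholders):

# Export each table into exactly one file
java -jar batch-tool.jar <connection options> -D tpch -o export -s , -F 1
# Export, starting a new file every 10,000,000 lines
java -jar batch-tool.jar <connection options> -D tpch -o export -s , -L 10000000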

Import and export specified columns

BatchTool supports importing or exporting only certain columns of a specified table. The corresponding command-line parameter is -col "(semicolon-separated column names)". For example, to export the c_name, c_address and c_phone columns of the customer table, with comma as the field separator and the field names written in the first line of the file: -o export -t customer -col "c_name;c_address;c_phone" -s , -header true.

File encryption

When exporting files, BatchTool can stream the output as encrypted ciphertext, avoiding a separate manual encryption step after exporting plaintext data; it can also read encrypted files directly for data import (the correct key must be provided), avoiding a separate decryption step. Two encryption algorithms are currently supported:

  • AES-CBC
  • SM4-ECB

The corresponding command-line parameters are -enc (encryption algorithm) and -key (key). For example, to encrypt table data with the AES algorithm and export it into a single file, with the key "admin123456": -o export -s , -t sbtest1 -enc AES -key admin123456 -F 1.

File compression

When exporting files, BatchTool can stream the output into compressed files to reduce space usage; it can also read compressed files directly for import, avoiding a separate decompression step. The corresponding command-line parameter is -comp (compression algorithm), for example:

1. Export the customer table as GZIP-compressed files, with comma (,) as the field separator: -o export -t customer -s , -comp GZIP
2. Import all GZIP-compressed files in the data-test directory into the customer table, with comma (,) as the field separator: -o import -t customer -s , -comp GZIP -dir data-test

File format

BatchTool supports the import and export of the following file formats:

  • CSV (character-delimited text file)
  • XLS, XLSX (Excel table, binary file)
  • ET (WPS form, binary file)

The corresponding command-line parameter is -format (file format). For example, to export the customer table as an XLSX spreadsheet: -o export -t customer -format XLSX.

Data desensitization

The exported table often contains sensitive data, such as names, ID numbers, mobile phone numbers, email addresses and other personal information. In such cases the sensitive data needs to be blurred by certain algorithms so that it can no longer be identified or restored, thereby protecting data security and preventing data leakage.

This process is called data desensitization (data masking). BatchTool integrates a simple data desensitization feature: with a small amount of configuration, the specified field values can be desensitized while the table data is being exported and then written to the output file. BatchTool supports the following four desensitization algorithms:

  • Masking: replaces the true value with special characters (such as *) for string data; commonly used for fields such as mobile phone numbers and ID numbers.
  • Encryption: (symmetric) encryption is a special, reversible desensitization method that encrypts sensitive fields with a specified algorithm and key. Low-privilege users without the key see only meaningless ciphertext, while in special scenarios the original data can still be recovered by decrypting with the key.
  • Hash digest: computes a digest value by hashing, often used for string data. For example, replacing the user name "PolarDB-X" with "d7f19613a15dcf8a088c73e2c7e9b856" protects user privacy; a salt can also be specified to make the hash harder to crack.
  • Rounding: keeps the data safe while preserving its approximate range. For example, a date field can be rounded from the original value "2023-11-11 15:23:41" to "2023-11-11 15:00:00" before output.

The corresponding command-line parameter is -mask (desensitization algorithm configuration). Taking the customer table of the TPC-H data set as an example, the exported data shows only the first three and last four digits of the phone number c_phone (a yaml configuration file can be used here instead of command-line parameters):

operation: export
# Use | as the field separator; special characters must be quoted
sep: "|"
table: customer
# Sort by the primary key c_custkey
orderby: asc
orderCol: c_custkey
# Output the field names in the first line
header: true
# Masking algorithm: show only the first three and last four characters
mask: >-
   {
     "c_phone": {
       "type": "hiding",
       "show_region": "0-2",
       "show_end": 4
     }
   }

Raw data

c_custkey|c_name|c_address|c_nationkey|c_phone
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364
...

Data after desensitization

c_custkey|c_name|c_address|c_nationkey|c_phone
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-********2988
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-********3665
3|Customer#000000003|MG9kdTD2WBHm|1|11-********3364
...

TPC-H import

TPC-H is an industry-standard benchmark for a database's analytical query capabilities. The traditional way to import a TPC-H data set is to first generate CSV-format text data on disk with the tpch-kit tool set, and then write it into the database with an import method such as load data. This approach not only requires reserving the corresponding disk space on the stress-testing machine (for example, a TPC-H 1 TB test needs at least 1 TB of reserved disk space), but also requires writing scripts to parallelize both the data generation and the data import stages, so it is generally inefficient.

BatchTool has a built-in component for generating TPC-H data sets, so TPC-H data can be streamed directly into the database without generating text files in advance, which greatly improves efficiency.

The corresponding command-line parameters are -o import -benchmark tpch -scale (data set size). For example, to import a 100 GB TPC-H data set, the traditional method takes 10 minutes to generate the text files plus 42 minutes to import them with load data, 52 minutes in total, whereas importing online with BatchTool takes only 28 minutes and consumes no extra disk space, improving the efficiency of benchmark preparation.
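
A hedged sketch of such an online import (the jar file name, connection options and target database name are placeholders):

# Stream-generate and import a scale-factor-100 (about 100 GB) TPC-H data set
java -jar batch-tool.jar -h <host> -P 3306 -u <user> -p <password> -D tpch_100g -o import -benchmark tpch -scale 100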

Summary

In general, the database import and export tool BatchTool has the following characteristics:

  • Lightweight, cross-platform
  • High performance (optimized execution model, adapted to distributed databases)
  • Rich functionality covering a variety of scenarios

In addition, BatchTool is open source on GitHub; everyone is welcome to try it out!

Original link

This article is original content from Alibaba Cloud and may not be reproduced without permission.
