Getting Started with ETLCloud, a New-Generation Data Integration Tool: Rapid Migration from MySQL to ClickHouse

Background

For big data, the data itself comes first; without data there is nothing else to discuss (even today's wildly popular GPT owes much to improved capabilities for collecting, storing, computing, and managing massive data). The primary task in big data project development is collecting massive amounts of data, which requires strong data collection capabilities.

In practice, data generally comes from two sources: log files and databases, each with many collection technologies. Tools such as Flume, Logstash, and Filebeat are typically used to collect log file data, while tools such as Sqoop, Canal, and DataX are used to collect data from databases.

However, the data collection and integration tools above are aimed mainly at developers and demand strong technical skills. Using them, developers generally face command lines, configuration files, and APIs; efficiency is low, and one careless misconfiguration can fail a migration or interrupt a data service. Now a multi-source heterogeneous data integration tool has arrived: ETLCloud, a new-generation (intelligent) all-in-one data integration platform. It is compatible with mainstream databases, data warehouses, data lakes, and even message middleware products, is fully adapted for localized deployment, and provides a visual, automated processing flow: users can create data processing tasks with just a few clicks and easily synchronize, clean, and transfer data across multiple heterogeneous data sources.

In production we had already used the DataX tool to synchronize several associated MySQL tables into a ClickHouse OLAP database, ultimately achieving efficient multi-table join queries. Here, taking the migration of a poetry database from MySQL to ClickHouse as an example, we get a quick feel for how the ETLCloud Community Edition delivers zero-code, visual, and efficient data migration.

Dataset Description

The structure of the `poetry` table in the MySQL database is as follows; row count: 311,828.

CREATE TABLE `poetry` (
	`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
	`title` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`yunlv_rule` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`author_id` INT(10) UNSIGNED NOT NULL,
	`content` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
	`dynasty` VARCHAR(10) NOT NULL COMMENT '诗所属朝代(S-宋代, T-唐代)' COLLATE 'utf8mb4_unicode_ci',
	`author` VARCHAR(150) NOT NULL COLLATE 'utf8mb4_unicode_ci',
	PRIMARY KEY (`id`) USING BTREE
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
AUTO_INCREMENT=311829;
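To confirm the row count quoted above before migrating, a quick check can be run on the source MySQL instance (a simple sanity query, assuming access to the database):

```sql
-- Verify the source row count stated above (expected: 311828)
SELECT COUNT(*) FROM poetry;
```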

Poetry.jpg

Basic environment

The database services are deployed across multiple clouds, involving 3 cloud hosts in total. The operating systems and configurations are as follows:

  1. Host running MySQL (Aliyun); operating system: Ubuntu 16.04
root@hostname:~# uname -a
Linux hostname 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

root@iZuf69c5h89bkzv0aqfm8lZ:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"

Basic configuration: 2C8G
Database version: 5.7.22-0ubuntu0.16.04.1

  2. Host running ClickHouse (HUAWEI CLOUD); operating system: CentOS 6
[root@ecs-xx-0003 ~]# uname -a
Linux ecs-xx-0003 2.6.32-754.15.3.el6.x86_64 #1 SMP Tue Jun 18 16:25:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@ecs-xx-0003 ~]# cat /proc/version 
Linux version 2.6.32-754.15.3.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-23) (GCC) ) #1 SMP Tue Jun 18 16:25:32 UTC 2019

Basic configuration: 4C8G
Database version: 19.9.5.36

[root@ecs-xx-0003 clickhouse-server]# clickhouse-server --version
ClickHouse server version 19.9.5.36.
  3. ETLCloud host (Tencent Cloud); operating system: CentOS 7; basic configuration: 2C2G

Note: The ETLCloud Community Edition is used here; it is lightweight and can be brought up quickly via a Docker deployment. For this entry-level walkthrough the ETLCloud host configuration is modest; upgrading the host configuration is recommended for actual production.

Migration practice

Now for the migration itself. The whole process is zero-code, visual, and drag-and-drop: the quick copy of the poetry data from MySQL to ClickHouse can be completed with just a few clicks of the mouse.

Data source configuration

Before configuring our data source, let's take a look at the list of data sources currently supported by the ETLCloud Community Edition.
DataSource.jpg

  1. Configure the source: MySQL

Starting from the MySQL data source template provided on the left, I filled in the IP:port and the username/password information.
ConfigMySQL.jpg
The test connection is successful~
TestMySQL.jpg

  2. Configure the sink: ClickHouse

Starting from the ClickHouse data source template provided on the left, I filled in the IP:port and the username/password information.
ConfigClickHouse.jpg
The test connection is successful~
TestClickHouse.jpg

Create applications and processes

Create an application and fill in the basic application configuration information (note that the application ID is unique and immutable).
CreateApp.jpg
Then, create a data flow and fill in the information.
CreateFlow.jpg
After creating the process, you can click the "Process Design" button to enter the process visualization configuration page.
FlowOK.jpg

Visual configuration process

Before configuring the flow, a brief tour of this page: the component area is on the left, the toolbar sits at the top center, and most of the middle is the flow drawing canvas. Double-click a component on the canvas, and a drawer-style panel pops up with that component's detailed configuration items.

  1. Database table input: MySQL

In the input component on the left, select "Library Table Input", drag it to the central process drawing area, and double-click to enter the configuration stage.

Step 1: Select the MySQL data source we configured, and you can load the existing tables in MySQL.
Source1.jpg
Step 2: Generate SQL statements based on the selected table (in fact, multiple tables can be queried here to form a large wide table).
Source2.jpg
Step 3: You can read the definition of each field from the table, and support adding and deleting fields.
Source3.jpg
Step 4: The data is previewed automatically based on the SQL statement; this check helps ensure that the subsequent steps execute normally.
Source4.jpg
Finally, back in step 3, conversion rules can be applied to individual fields to perform data preprocessing and cleaning; ETLCloud has many commonly used rules built in.
Source5.jpg
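A wide-table query of the kind step 2 alludes to might look like the following. Note that the `author` table and its `name` column are hypothetical, named here only for illustration; the actual associated tables in the source library may differ:

```sql
-- Hypothetical wide-table query: join poetry with an assumed author table
-- (table and column names are illustrative, not taken from the source schema)
SELECT p.id,
       p.title,
       p.dynasty,
       p.content,
       a.name AS author_name
FROM poetry p
JOIN author a ON a.id = p.author_id;
```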

  2. Library table output: ClickHouse

In the output component on the left, select "Library Table Output", drag it to the central process drawing area, and double-click to enter the configuration stage.

Step 1: Select the ClickHouse data source we configured. Since ClickHouse has no corresponding table at this point, manually enter the table name here; in step 3 we can then enable automatic table creation (which automatically maps the MySQL table structure to ClickHouse, nice~). At first I did not know the table could be created automatically, so I initially synchronized the table structure with ETLCloud's separate library-table synchronization component.
Sink1.jpg
Step 2: You can read the definition of each field from the table, and support adding, deleting fields, and binding rules.
Sink2.jpg
Step 3: Configure the output options: whether to clear the table data, whether to automatically build the table, etc.
Sink3.jpg
Finally, connect the start, library table input, library table output, and end components in turn with flow lines, and the visual configuration of the data migration is complete. Done~
Flow.jpg
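For reference, the table that the auto-build option produces in ClickHouse would be roughly equivalent to DDL like the following. This is a hand-written sketch of a typical MySQL-to-ClickHouse type mapping, not the exact output of ETLCloud:

```sql
-- Sketch of a ClickHouse equivalent of the MySQL poetry table
-- (MergeTree requires a sort key; id doubles as the primary key here)
CREATE TABLE poetry
(
    id         UInt32,
    title      String,
    yunlv_rule String,
    author_id  UInt32,
    content    String,
    dynasty    String,
    author     String
)
ENGINE = MergeTree()
ORDER BY id;
```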

Problem record

  • An error is reported when the ClickHouse data source uses the connection pool method

Problem description: DB::Exception: Table helloworld.dual doesn't exist. (version 19.9.5.36)
Problem analysis: ClickHouse has no `dual` virtual table; its equivalent is `system.one`.
Solution: validationQuery=SELECT 1 FROM system.one
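The difference is easy to see directly in a ClickHouse client: the MySQL-style validation query fails, while the `system.one` form works:

```sql
-- MySQL-style validation query: fails on ClickHouse, since there is no dual table
-- SELECT 1 FROM dual;

-- ClickHouse equivalent: system.one is its built-in one-row system table
SELECT 1 FROM system.one;
```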

  • The primary key is not configured on the ClickHouse output side, causing the data extraction flow to fail

Problem description: with no primary key configured on the ClickHouse output side, the extraction flow aborts.
KeyError.jpg
Solution: in step 2 of the ClickHouse library table output configuration, specify the table's primary key.
KeyFix.jpg

Summary

The above is an entry-level ETLCloud data migration practice: the data migration from MySQL to ClickHouse was completed quickly in a zero-code, visual, drag-and-drop manner. Beyond efficiently reaching the migration goal, several other points stood out during the process: foolproof operation, visualization, detailed logs, dynamic process monitoring, and rich documentation. This powerful tool for heterogeneous data integration is worth a try.

The task monitoring dashboard provides global control.
Monitor.jpg
Dynamically monitor the progress of data synchronization, which is clear at a glance.
Progress.jpg
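After the flow completes, a simple consistency check is to run the same query on both ends and compare the results; matching values suggest a complete copy (the expected count is 311828):

```sql
-- Run on both MySQL (source) and ClickHouse (target); results should match
SELECT COUNT(*) AS row_count, MAX(id) AS max_id FROM poetry;
```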

If you have any questions or any bugs are found, please feel free to contact me.

Your comments and suggestions are welcome!

Origin blog.csdn.net/u013810234/article/details/131150450