MySQL + ETLCloud CDC + Doris: real-time data warehouse synchronization in practice

Business needs and pain points

Many large enterprises need to analyze sales and marketing data in real time, such as sales orders, inventory, member information, and equipment status. This data can be synchronized to Doris in real time for statistical analysis. As an analytical database, Doris is especially well suited to storing and analyzing massive data sets. We only need to synchronize the MySQL table data to Doris in real time to gain real-time analysis capabilities.

Introduction to Apache Doris

Apache Doris is a modern MPP (massively parallel processing) analytical database. It returns query results with sub-second response times, effectively supporting real-time data analysis. Its distributed architecture is very simple and easy to operate and maintain, and it can support very large data sets of more than 10 PB.

Apache Doris can meet a variety of data analysis needs, such as fixed historical reports, real-time data analysis, interactive data analysis, and exploratory data analysis. It can make data analysis work easier and more efficient!

MySQL CDC real-time synchronization tool selection

At present, the mature CDC tools that are free to use and support MySQL + Doris include Flink CDC and ETLCloud CDC, and these are the two we consider here. The principle of CDC synchronization is the same on both platforms: read the database log, then clean, convert, or compute on the change data before storing it in the target warehouse.
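To make the principle concrete, here is a minimal sketch in Python of the final step: replaying a stream of change events onto a target table. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are hypothetical simplifications for illustration, not the formats used by Flink CDC or ETLCloud CDC.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(target: Dict[Any, Row], events: Iterable[Dict[str, Any]]) -> Dict[Any, Row]:
    """Replay insert/update/delete events, keyed by primary key, onto an in-memory target."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # "Cleaning" step: strip surrounding whitespace from string fields before storing.
            target[key] = {
                name: value.strip() if isinstance(value, str) else value
                for name, value in event["row"].items()
            }
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical change events as they might be decoded from a database log.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": " China "}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "France"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "China"}},
    {"op": "delete", "key": 2},
]
warehouse = apply_changes({}, events)
print(warehouse)
```

After the four events are replayed, only row 1 remains in the target, with its cleaned name.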

Flink CDC is harder to install and use. It has no visual interface for CDC configuration and monitoring, so installation is troublesome for users unfamiliar with it, and real-time data processing requires writing code. Users without some technical background cannot manage it, and it is demanding even for engineers.

ETLCloud CDC is relatively easy to install and use; installation takes about half an hour. Once installed, it provides a fully web-based configuration interface, which is very user-friendly. Here we choose ETLCloud CDC to build the real-time data warehouse.

How to improve the performance of writing to Doris?

Doris is compatible with the MySQL protocol, but writing to Doris directly over JDBC is very slow and impractical for this workload, so the Stream Load method provided by Doris must be used to load data at an acceptable speed.
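As a rough illustration of what a Stream Load call looks like, the sketch below builds the URL, headers, and JSON body for one batch without sending it. The host name, database `demo`, and credentials are placeholders, and the helper `build_stream_load` is hypothetical; this is not how ETLCloud's Doris component is implemented. Port 8030 is assumed to be the FE HTTP port (its default).

```python
import base64
import json
import uuid
from typing import Any, Dict, List, Tuple


def build_stream_load(
    fe_host: str,
    db: str,
    table: str,
    user: str,
    password: str,
    rows: List[Dict[str, Any]],
    http_port: int = 8030,
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for one Stream Load HTTP PUT carrying a JSON batch."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "json",
        "strip_outer_array": "true",  # the body is a JSON array, one element per row
        "label": f"cdc_{table}_{uuid.uuid4().hex}",  # unique label so a retried batch is not loaded twice
    }
    body = json.dumps(rows).encode("utf-8")
    return url, headers, body


batch = [{"id": 1, "name": "China"}, {"id": 2, "name": "France"}]
url, headers, body = build_stream_load("doris-fe.example.com", "demo", "country", "root", "", batch)
print(url)
```

The point of the design is that many rows travel in a single HTTP request instead of one INSERT per row. When actually sending the request, note that the FE answers with a redirect to a BE node, so the HTTP client has to follow the redirect and resend the credentials (this is what `curl --location-trusted` does).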

ETLCloud CDC provides a high-performance output component built specifically for Doris, which also supports automatic table creation and batch loading.

How to directly convert the data into a wide table before writing it into Doris?

Usually, when we use CDC to monitor the log of a sales or order table in real time, the changes form a data stream. Each CDC transmission may carry one record or many, and every record in the stream is an order. In terms of business value, however, a single table often lacks key dimension fields; for example, customer and product data must be merged in to calculate gross profit.

To fill in these missing fields, the previous approach was to store the data in the database first, and then use SQL statements or ETL processes to transform it again into the wide table we need. This does meet the business requirement, but it loses timeliness: data that was originally streamed in real time is no longer real-time when it reaches the business, because a periodically scheduled transformation step sits in between.

With the ETL function of ETLCloud, real-time data can easily be converted into wide-table data and stored in Doris.

(Single-table real-time stream merges other dimension data and directly outputs wide-table data to Doris)
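The enrichment described above can be sketched in a few lines of Python: each order event is joined with customer and product dimensions, and gross profit is computed before the row is written out. All table contents, field names, and the function `to_wide_row` are made up for illustration; in ETLCloud this step is configured visually rather than coded.

```python
from typing import Any, Dict, Optional

# Hypothetical dimension data keyed by primary key (in practice read from a database or cache).
customers = {101: {"customer_name": "Acme Ltd", "region": "East"}}
products = {"P-1": {"product_name": "Widget", "unit_cost": 6.0}}


def to_wide_row(
    order: Dict[str, Any],
    customers: Dict[Any, Dict[str, Any]],
    products: Dict[Any, Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge one order event with its dimensions and add revenue and gross profit."""
    customer = customers.get(order["customer_id"], {})
    product = products.get(order["product_id"], {})
    revenue = order["quantity"] * order["unit_price"]
    unit_cost: Optional[float] = product.get("unit_cost")
    # Leave gross profit empty when the product dimension is missing, rather than guessing.
    gross_profit = None if unit_cost is None else revenue - order["quantity"] * unit_cost
    return {
        **order,
        "customer_name": customer.get("customer_name"),
        "region": customer.get("region"),
        "product_name": product.get("product_name"),
        "revenue": revenue,
        "gross_profit": gross_profit,
    }


order_event = {"order_id": 1, "customer_id": 101, "product_id": "P-1", "quantity": 3, "unit_price": 10.0}
wide = to_wide_row(order_event, customers, products)
print(wide)
```

Because the join happens per event, the wide row is ready as soon as the order arrives, with no scheduled transformation job in between.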

ETLCloud CDC synchronization principle

ETLCloud CDC is more powerful than other CDC tools because it links CDC with ETL processes: real-time CDC data flows into an ETL process, which then processes and outputs it.

Configure monitoring of MySQL tables in ETLCloud CDC

MySQL must first have the binlog function enabled. For how to enable it, see:

https://www.etlcloud.cn/restcloud/view/page/helpDocument.html?id=64701fca9b4f0515317fc8e2
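As a quick sanity check, the sketch below compares a set of MySQL server variables (for example, copied from the output of `SHOW VARIABLES`) against the settings that log-based CDC tools commonly expect. The expected values are an assumption based on common CDC practice; the help page linked above remains the reference for what ETLCloud itself requires. The helper `binlog_problems` is hypothetical.

```python
from typing import Dict, List

# Assumed requirements for log-based CDC; confirm against the ETLCloud help page.
REQUIRED = {"log_bin": "ON", "binlog_format": "ROW", "binlog_row_image": "FULL"}


def binlog_problems(variables: Dict[str, str]) -> List[str]:
    """Return a human-readable problem for every setting that differs from REQUIRED."""
    problems = []
    for name, expected in REQUIRED.items():
        actual = variables.get(name, "<unset>")
        if actual.upper() != expected:
            problems.append(f"{name} is {actual}, expected {expected}")
    return problems


# Example values as they might be read from the MySQL server.
current = {"log_bin": "ON", "binlog_format": "MIXED", "binlog_row_image": "FULL"}
print(binlog_problems(current))
```

An empty list means the assumed settings are all in place.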

Once binlog is enabled, we can open the real-time data integration page of ETLCloud and configure the listener

Add a MySQL CDC listener

Select the MySQL table to monitor; here we choose the country table

For the target, we choose to pass the data to an ETL process, which will write it into Doris

In offline integration, we create a new ETL process for Doris output

The offline ETL process is very simple; we just need to drag in a Doris output component

Select the Doris data source and the Doris database table; the data source has been created in ETLCloud in advance

Import all fields in the Doris table

In this way, CDC + ETL completes the creation of the MySQL => Doris real-time synchronization task

Start the MySQL CDC listener

Enter the real-time data integration function of ETLCloud and click to start the CDC listener

If the startup is successful, the listener is displayed in green. If there is an error, you can check the Tomcat log to see what caused it

The green status indicates that the listener has started successfully

First clear the data in the existing table in Doris

In Doris, first clear the country table, so that we can observe data changes in MySQL being synchronized to Doris in real time

Start real-time data synchronization

The data in the country table in MySQL is as follows

We can modify some of the data at will and see it synchronized to Doris immediately

We modified 3 rows in MySQL, and we can see that the 3 rows were synchronized to Doris immediately

At the same time, we can also check whether the offline ETL process was called by CDC

It can be seen that for the 3 rows we modified, the ETL process was called 3 times, and the data was written into Doris through this ETL process

Automatic table creation in Doris

ETLCloud can also create tables in Doris automatically. If we want to synchronize all 1000 tables of a MySQL database to Doris at once, we can use the batch synchronization function to create the 1000 tables in Doris automatically and synchronize all of their data in one go. This is also a function that Flink CDC does not have.

In this way, all business data can be pulled into the Doris data warehouse at one time

(Doris automatic table creation)
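To give an idea of what automatic table creation involves, the sketch below turns a list of MySQL column definitions into a Doris `CREATE TABLE` statement. The type mapping is deliberately tiny and partly assumed (for instance, widening `varchar` by 3 bytes per character), and the helpers `doris_type` and `doris_ddl` are hypothetical; ETLCloud's actual mapping rules are not documented here.

```python
from typing import List, Tuple

# Simplified, assumed type mapping; a real tool has to cover many more MySQL types.
TYPE_MAP = {"int": "INT", "bigint": "BIGINT", "datetime": "DATETIME", "text": "STRING", "double": "DOUBLE"}


def doris_type(mysql_type: str) -> str:
    """Map one MySQL column type to a Doris column type."""
    t = mysql_type.lower().strip()
    if t.startswith("varchar("):
        # MySQL lengths count characters while Doris counts bytes; assume up to
        # 3 bytes per character and stay within the Doris VARCHAR limit of 65533.
        n = int(t[len("varchar("):-1])
        return f"VARCHAR({min(n * 3, 65533)})"
    if t.startswith("decimal("):
        return t.upper()
    return TYPE_MAP[t]


def doris_ddl(table: str, columns: List[Tuple[str, str]], key: str, buckets: int = 10) -> str:
    """Build a Unique Key model table; the key column must come first in `columns`."""
    cols = ",\n".join(f"  `{name}` {doris_type(typ)}" for name, typ in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n{cols}\n)\n"
        f"UNIQUE KEY(`{key}`)\n"
        f"DISTRIBUTED BY HASH(`{key}`) BUCKETS {buckets}\n"
        # replication_num = 1 suits a single-node test cluster only.
        f'PROPERTIES ("replication_num" = "1");'
    )


ddl = doris_ddl("country", [("id", "int"), ("name", "varchar(50)"), ("updated_at", "datetime")], "id")
print(ddl)
```

The Unique Key model is chosen here because CDC updates then overwrite the existing row with the same key instead of piling up as duplicates.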

Introduction to ETLCloud

ETLCloud is a zero-code ETL tool that can quickly connect hundreds of data sources and application systems and complete data synchronization and transmission without coding. Enterprise IT personnel can quickly complete all kinds of data extraction and synchronization, and work with BI tools to deliver statistical analysis of the data.

(ETLCloud visual process synchronization interface)

The ETLCloud community version is permanently free to download and use: http://www.etlcloud.cn/restcloud/view/page/index.html?id=0600024

Origin blog.csdn.net/iamonlyme/article/details/131462270