MySQL + ETLCloud CDC + StarRocks: real-time data warehouse synchronization in practice

1. Business needs and pain points

Large enterprises need to analyze, in real time, the sales and marketing data held in their various business systems, such as inventory information, reconciliation signals, member information, advertising information, and production progress information. This data can be synchronized to StarRocks in real time for analysis and statistics. As an analytical database, StarRocks is particularly well suited to storing and analyzing massive data volumes. We only need to synchronize MySQL table data to StarRocks in real time to gain real-time data analysis capabilities.

2. Introduction to StarRocks

StarRocks is a very fast, full-scenario, enterprise-level MPP database product. It supports online horizontal scale-out and scale-in and financial-grade high availability, is compatible with the MySQL 5.7 protocol and the MySQL ecosystem, and provides important features such as a fully vectorized engine and federated queries across multiple data sources. StarRocks aims to provide a unified solution for all OLAP scenarios and suits applications with high requirements for performance, real-time behavior, concurrency, and flexibility.

3. CDC real-time synchronization tool selection

Mature CDC tools that are currently free to use and support MySQL + StarRocks include Flink CDC and ETLCloud CDC.

Here we mainly consider Flink CDC and ETLCloud CDC. The synchronization principle of CDC is essentially the same across platforms: each tool reads the database log and then, after cleaning, conversion, or calculation, stores the data in the target warehouse.

Flink CDC is relatively difficult to install and use. It has no visual interface for CDC configuration and monitoring, so installation is troublesome for users unfamiliar with it. Processing real-time data requires writing code, which users without a technical background may not be able to handle and which is a high barrier even for data engineers.

ETLCloud CDC is relatively easy to install and use. It provides one-click installation and also supports installation on a Windows PC. Once installed, it offers a fully web-based configuration interface, which is very user-friendly. We therefore choose ETLCloud CDC here to build the real-time data warehouse.

4. How to improve the performance of writing to StarRocks?

StarRocks is compatible with the MySQL protocol, but writing directly to StarRocks over JDBC is very slow and basically unusable. The Stream Load method provided by StarRocks should therefore be used to load data faster.

ETLCloud CDC provides a high-performance output component specifically for StarRocks, which also supports automatic table creation and batch loading.
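To make the Stream Load idea more concrete, here is a minimal sketch of how such a request could be assembled by hand. It only builds the HTTP PUT request and does not send it. The function name, the host, database, table, and credentials are placeholders for illustration, port 8030 is assumed to be the FE HTTP port, and this is not how the ETLCloud output component is implemented internally.

```python
import base64
import json
import urllib.request
import uuid


def build_stream_load_request(fe_host, db, table, rows, user, password, http_port=8030):
    """Build (but do not send) an HTTP PUT request for StarRocks Stream Load.

    `rows` is a list of dicts that is sent as one JSON array, so a whole
    batch is loaded in a single request instead of row by row.
    """
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    body = json.dumps(rows).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        # A unique label per batch lets the server reject accidental replays.
        "label": f"cdc_{uuid.uuid4().hex}",
        "format": "json",
        # The body is a JSON array; each element becomes one row.
        "strip_outer_array": "true",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Hypothetical host, database, and table names.
request = build_stream_load_request(
    "fe.example.internal", "demo", "country",
    [{"id": 1, "name": "China"}, {"id": 2, "name": "France"}],
    user="root", password="",
)
print(request.get_method(), request.full_url)
```

Sending is deliberately left out: the FE usually redirects the request to a BE node, so the HTTP client has to follow that redirect and resend the credentials (with curl this is typically done with `--location-trusted`).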

5. How to convert data directly into a wide table before writing it to StarRocks?

Normally, when we use CDC to monitor the log of a sales or order table in real time, the changes form a data stream. Each batch arriving from CDC may contain one or more records, and every record is a single row of the orders table. In terms of business value, however, a single table may lack key dimensional fields; for example, customer and product data must be merged in to calculate gross profit.

To fill in these missing fields, the previous approach was to load the data into the database first and then use SQL statements or ETL processes to transform it again into the wide table data we need. Although this meets the business requirement, the timeliness of the data is lost: data that was originally a real-time stream is no longer real-time by the time it reaches the business, because a periodically scheduled transformation step sits in between.

Through the ETL function of ETLCloud, real-time data can be easily converted into wide table data and stored in StarRocks.

 (Single table real-time streaming merges other dimension data and directly outputs wide table data to StarRocks)
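As a rough illustration of what "merging dimension data into the stream" means, the sketch below enriches each incoming order record with customer and product fields and computes gross profit before the row is written out. All field names (`customer_id`, `unit_cost`, and so on) and the in-memory dimension lookups are assumptions for the example; in ETLCloud this is configured visually in the ETL process rather than coded.

```python
def to_wide_rows(order_events, customers, products):
    """Join streamed order rows with dimension data and add gross profit.

    `customers` and `products` are dicts keyed by id, standing in for
    dimension tables that would normally be looked up or cached.
    """
    wide_rows = []
    for order in order_events:
        customer = customers.get(order["customer_id"], {})
        product = products.get(order["product_id"], {})
        unit_cost = product.get("unit_cost")
        if unit_cost is None:
            gross_profit = None  # unknown product: leave the measure empty
        else:
            gross_profit = order["quantity"] * (order["unit_price"] - unit_cost)
        wide_rows.append({
            **order,
            "customer_name": customer.get("name"),
            "product_name": product.get("name"),
            "gross_profit": gross_profit,
        })
    return wide_rows


# Made-up sample data.
customers = {10: {"name": "Acme"}}
products = {7: {"name": "Widget", "unit_cost": 30}}
orders = [{"order_id": 1, "customer_id": 10, "product_id": 7,
           "quantity": 2, "unit_price": 50}]
print(to_wide_rows(orders, customers, products))
```

Because the enrichment happens per batch as the stream arrives, no separate scheduled job is needed afterwards to build the wide table.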

6. ETLCloud CDC synchronization principle

ETLCloud CDC is more powerful than other CDC tools because it links the CDC and ETL processes: real-time CDC data flows into an ETL process, which then processes the data and outputs it.
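The flow described above can be sketched in a few lines. This is a simplified model and not ETLCloud's actual code: the event shape (`op`, `key`, `row`) is invented for the example, the target is a plain dict keyed by primary key, and `transform` stands in for whatever the ETL process does to each row.

```python
def apply_changes(target, events, transform=lambda row: row):
    """Replay CDC change events onto a target keyed by primary key.

    Each event is a dict with "op" ("insert", "update" or "delete"),
    "key" (the primary key value) and "row" (the new row, None for deletes).
    """
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            target.pop(key, None)
        else:
            # Inserts and updates are both treated as upserts.
            target[key] = transform(event["row"])
    return target


events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "china"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "cn"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "france"}},
    {"op": "delete", "key": 2, "row": None},
]
# The transform step here simply upper-cases the name column.
print(apply_changes({}, events, lambda r: {**r, "name": r["name"].upper()}))
```

Treating inserts and updates alike keeps the replay idempotent, which is convenient if a batch of events is delivered more than once.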

 Configuring monitoring of MySQL tables in ETLCloud CDC

MySQL must first have the binary log (binlog) function enabled. For how to enable it, see:

RestCloud data integration platform
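Before adding a listener it can help to confirm the binlog settings. The sketch below checks a mapping of MySQL server variables, such as the rows returned by `SHOW VARIABLES LIKE 'log_bin'` and `SHOW VARIABLES LIKE 'binlog_format'`. The function name is made up for this example, and requiring `ROW` format is an assumption based on how log-based CDC tools usually work; check the ETLCloud documentation for its exact requirements.

```python
def binlog_problems(variables):
    """Return a list of binlog settings that would likely block CDC.

    `variables` maps MySQL variable names to their values, for example
    collected from `SHOW VARIABLES` on the source server.
    """
    problems = []
    if str(variables.get("log_bin", "")).upper() not in ("ON", "1"):
        problems.append("log_bin is not enabled")
    # Assumption: row-based logging is needed so each change carries full row data.
    if str(variables.get("binlog_format", "")).upper() != "ROW":
        problems.append("binlog_format is not ROW")
    return problems


print(binlog_problems({"log_bin": "ON", "binlog_format": "ROW"}))
print(binlog_problems({"log_bin": "OFF", "binlog_format": "MIXED"}))
```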

Let's go to ETLCloud's real-time data integration page and configure it.

 Add a MySQL CDC listener

 Select the MySQL table to monitor; here we choose the country table

 For the target, we choose to pass the data to an ETL process, which will write it into StarRocks.

 In offline integration, we create a new ETL process for StarRocks output.

 The offline ETL process is very simple. You only need to pull in a StarRocks output component.

 Select the StarRocks data source and StarRocks database table. The data source has been built in advance in ETLCloud.

 Import all fields from StarRocks table

In this way, CDC + ETL completes the creation of the real-time synchronization task from MySQL to StarRocks.

Start the MySQL CDC listener

Enter the real-time data integration function of ETLCloud and click to start the CDC listener.

 If the startup is successful, the listener is displayed in green. If there is an error, you can check the Tomcat log to see what caused it.

 Indicates that the listener has been started successfully

Start synchronizing data in real time

The data in our country table in MySQL is as follows:

 We can modify a few rows at will and see the changes synchronized to StarRocks immediately.

 We modified 3 rows in MySQL, and we can see that the same 3 rows were updated immediately in StarRocks.

At the same time, we can also observe whether the ETL offline process is called by the CDC.

 

You can see that after we modified 3 rows, this ETL process was called once and the data was written to StarRocks through it.

Automatic table creation in StarRocks

ETLCloud can also create tables in StarRocks automatically. If we want to synchronize all of, say, 1,000 MySQL tables to StarRocks at one time, we can use the batch synchronization function, which automatically creates the 1,000 tables in StarRocks and synchronizes all of their data in one pass. This is also a function that Flink CDC does not have.

This way, all business data can be pulled into the StarRocks data warehouse at once.
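To give a feel for what automatic table creation involves, the sketch below generates a StarRocks `CREATE TABLE` statement from MySQL column definitions. It is an illustration only and not ETLCloud's implementation: the type mapping is a small, simplified assumption, the bucket count is arbitrary, and a real tool would also have to handle details such as MySQL `VARCHAR` lengths being counted in characters rather than bytes.

```python
# Simplified, assumed mapping from MySQL base types to StarRocks types.
TYPE_MAP = {
    "int": "INT", "bigint": "BIGINT", "varchar": "VARCHAR",
    "decimal": "DECIMAL", "datetime": "DATETIME", "text": "STRING",
}


def starrocks_ddl(table, columns, primary_key):
    """Build a Primary Key table DDL from (name, mysql_type) pairs."""
    # Key columns are placed first in the column list.
    ordered = sorted(columns, key=lambda col: col[0] != primary_key)
    defs = []
    for name, mysql_type in ordered:
        base, _, args = mysql_type.lower().partition("(")
        sr_type = TYPE_MAP.get(base.split()[0], "STRING")
        if args and sr_type in ("VARCHAR", "DECIMAL"):
            sr_type += "(" + args.split(")")[0] + ")"
        not_null = " NOT NULL" if name == primary_key else ""
        defs.append(f"  `{name}` {sr_type}{not_null}")
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n"
        + ",\n".join(defs)
        + "\n)\n"
        + f"PRIMARY KEY (`{primary_key}`)\n"
        + f"DISTRIBUTED BY HASH(`{primary_key}`) BUCKETS 8"  # bucket count is arbitrary
    )


print(starrocks_ddl("country", [("name", "varchar(64)"), ("id", "int(11)")], "id"))
```

Running the same function over every table listed in the source schema is, in spirit, what a batch synchronization of many tables has to do before any data is loaded.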

 StarRocks automatic table creation capability


Origin blog.csdn.net/kezi/article/details/131812823