Technical Analysis | Doris Connector Combined with Flink CDC for Exactly-Once Ingestion of Sharded MySQL Databases and Tables

1 Overview

In actual business systems, in order to solve the various problems caused by a single table holding a large amount of data, we usually split databases and tables (sharding) to improve system throughput.

However, this complicates later data analysis. In such cases we usually synchronize the sharded databases and tables of the business database into the data warehouse and merge them into a single database and table, which is convenient for later data analysis.

In this document, we demonstrate how to ingest sharded MySQL databases and tables into the Apache Doris data warehouse in real time and efficiently, based on Flink CDC combined with the Apache Doris Flink Connector and Doris Stream Load two-phase commit.

1.1 What is CDC

CDC is short for Change Data Capture.

The core idea is to monitor and capture database changes (including inserts (INSERT), updates (UPDATE), and deletes (DELETE) of data or table schemas), record these changes completely in the order they occur, and write them to message middleware for other services to subscribe to and consume.

The application scenarios of CDC technology are very broad, including:

Data distribution: distributing one data source to multiple downstreams, often used for business decoupling and microservices.

Data integration: integrating scattered, heterogeneous data sources into the data warehouse, eliminating data silos and facilitating subsequent analysis.

Data migration: commonly used for database backup, disaster recovery, etc.

1.2 Why choose Flink CDC

Flink CDC's Change Data Capture technology, based on database logs, provides integrated full plus incremental reading. With the help of Flink's excellent pipeline capabilities and rich upstream and downstream ecosystem, it can capture changes from various databases and sync those changes to downstream storage in real time.

Currently, Flink CDC's upstream already supports MySQL, MariaDB, PostgreSQL, Oracle, MongoDB, OceanBase, TiDB, SQL Server, and other databases.

Flink CDC's downstream is even richer: it supports writing to message queues such as Kafka and Pulsar, as well as to various data warehouses and data lakes such as Hudi, Iceberg, and Doris.

At the same time, the Changelog mechanism natively supported by Flink SQL makes processing CDC data very simple. Users can clean, widen, and aggregate full and incremental database data through SQL, which greatly lowers the barrier to entry. In addition, the Flink DataStream API lets users write code to implement custom logic, giving them the freedom to deeply customize their business.
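
For example, a minimal DataStream job using the MySQL CDC source (adapted from the Flink CDC documentation; the hostname, credentials, and database/table names below are placeholders) consumes full plus incremental changes as Debezium-style JSON and prints them:

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlCdcPrintJob {
    public static void main(String[] args) throws Exception {
        // Snapshot + binlog source; host, credentials, and names are placeholders.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("emp_1", "emp_2")                       // databases to capture
                .tableList("emp_1.employees_1", "emp_2.employees_1")  // tables to capture
                .username("root")
                .password("password")
                .deserializer(new JsonDebeziumDeserializationSchema()) // records as Debezium JSON
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoints drive exactly-once progress tracking
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
           .print(); // replace print() with any custom DataStream logic
        env.execute("mysql-cdc-datastream-sketch");
    }
}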

The core of Flink CDC technology is to support real-time, consistent synchronization and processing of a table's full data and incremental data, so that users can easily obtain a real-time consistent snapshot of each table. For example, a table contains full historical business data while incremental business data is continuously written and updated. Flink CDC captures the incremental update records in real time and provides a snapshot that is consistent with the database at all times: if a record is updated, it updates the existing data; if a record is inserted, it is appended to the existing data. Throughout the process, Flink CDC provides a consistency guarantee, that is, no duplication and no loss.

Flink CDC has the following advantages:

  • Flink's operators and SQL modules are more mature and easy to use
  • Flink jobs can easily expand processing power by adjusting the parallelism of operators
  • Flink supports advanced state backends (State Backends), allowing access to massive state data
  • Flink provides more ecological support such as Source and Sink
  • Flink has a larger user base and an active support community, making it easier to solve problems

Moreover, the Flink Table / SQL module regards database tables and change record streams (such as the data stream of CDC) as two sides of the same thing. Therefore, the Upsert message structure provided internally (+I for a newly inserted value, -U for the value before a record update, +U for the value after a record update, -D for a delete) corresponds one-to-one with the change records produced by Debezium and similar tools.
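
These four message kinds correspond to Flink's RowKind enum. A small Java sketch of the mapping (the row values are made-up employee records):

import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class ChangelogKinds {
    public static void main(String[] args) {
        // An UPDATE captured from the database becomes a -U/+U pair in the changelog.
        Row before = Row.ofKind(RowKind.UPDATE_BEFORE, 10001, "Georgi");
        Row after = Row.ofKind(RowKind.UPDATE_AFTER, 10001, "Georgy");
        System.out.println(before + " -> " + after); // -U[10001, Georgi] -> +U[10001, Georgy]

        // The four kinds and their short forms: +I, -U, +U, -D.
        for (RowKind kind : RowKind.values()) {
            System.out.println(kind.shortString() + " = " + kind.name());
        }
    }
}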

1.3 What is Apache Doris

Apache Doris is a modern MPP analytical database product. It delivers query results with sub-second response times, effectively supporting real-time data analysis. Apache Doris's distributed architecture is very simple, easy to operate and maintain, and can support datasets larger than 10 PB.

Apache Doris can meet a variety of data analysis needs, such as fixed historical reports, real-time data analysis, interactive data analysis, and exploratory data analysis, making your data analysis work easier and more efficient!

1.4 Two-phase commit

1.4.1 What is two-phase commit (2PC)

In a distributed system, in order for each node to perceive the transaction execution status of the other nodes, a central node is introduced to uniformly handle the execution logic of all nodes. This central node is called the coordinator, and the other business nodes it schedules are called participants.

2PC divides a distributed transaction into two phases: commit request (voting) and commit (execution). The coordinator decides whether to actually execute the transaction based on the participants' responses. The specific process is as follows.

Commit request (voting) phase

  1. The coordinator sends a prepare request containing the transaction content to all participants, asks whether the transaction can be prepared for commit, and waits for the participants' responses.
  2. Participants execute the operations contained in the transaction and write undo logs (for rollback) and redo logs (for replay), but do not actually commit.
  3. Each participant returns the execution result of the transaction operations to the coordinator: yes if execution succeeded, otherwise no.

Commit (execution) phase

There are two outcomes, success and failure; a simplified code sketch of the whole flow follows this list.

  • If all participants return yes, the transaction can be committed:
  1. The coordinator sends commit requests to all participants.
  2. After receiving the commit request, each participant actually commits the transaction, releases the occupied transaction resources, and returns ack to the coordinator.
  3. The coordinator receives the ack messages from all participants, and the transaction completes successfully.
  • If any participant returns no or times out, the transaction is interrupted and needs to be rolled back:
  1. The coordinator sends rollback requests to all participants.
  2. After receiving the rollback request, each participant rolls back to the state before the transaction according to the undo log, releases the occupied transaction resources, and returns ack to the coordinator.
  3. The coordinator receives the ack messages from all participants, and the transaction rollback is complete.
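
Below is the minimal, single-threaded Java sketch of this flow promised above. Participant is a hypothetical interface; real systems add timeouts, retries, and durable logging of every decision on both sides.

import java.util.List;

interface Participant {
    boolean prepare(String txnId); // phase 1: execute + write undo/redo logs, vote yes/no
    void commit(String txnId);     // phase 2: make the prepared work durable and visible
    void rollback(String txnId);   // phase 2: undo the prepared work via the undo log
}

class Coordinator {
    boolean run(String txnId, List<Participant> participants) {
        // Phase 1: every participant must vote yes; any no (or timeout) aborts.
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txnId)) {
                allYes = false;
                break;
            }
        }
        // Phase 2: commit everywhere on a unanimous yes, otherwise roll back everywhere.
        for (Participant p : participants) {
            if (allYes) {
                p.commit(txnId);
            } else {
                p.rollback(txnId);
            }
        }
        return allYes;
    }
}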

1.4.2 Flink 2PC

As a stream processing engine, Flink naturally provides exactly-once guarantees. End-to-end exactly-once semantics is the result of the synergy of input, processing logic, and output. Flink relies on the checkpoint mechanism and the lightweight distributed snapshot algorithm ABS (Asynchronous Barrier Snapshotting) to guarantee exactly-once. To achieve exactly-once output logic, one of the following two restrictions must apply: idempotent writes or transactional writes.

The process of the pre-commit phase

Whenever a checkpoint needs to be made, the JobManager injects a barrier into the data stream as the boundary of the checkpoint. The barrier is passed downstream along the operator chain, and every operator it reaches triggers writing a state snapshot to the state backend. When the barrier reaches the Kafka sink, the message data is flushed via the KafkaProducer.flush() method, but it is not actually committed yet. The commit phase still needs to be triggered when the checkpoint completes.

The process of the commit phase

A write succeeds only when the entire checkpoint completes successfully. This matches the 2PC process described above: the JobManager acts as the coordinator and each operator acts as a participant (though only the sink participants perform the commit). If a checkpoint fails, the notifyCheckpointComplete() method is not executed, and if retries are also unsuccessful, the abort() method is eventually called to roll back the transaction.
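
In code, Flink exposes this pattern through the TwoPhaseCommitSinkFunction template (the exactly-once Kafka sink is built on it). A minimal sketch, where MyTxn and the method bodies are placeholders rather than a real external system:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class TwoPhaseSinkSketch
        extends TwoPhaseCommitSinkFunction<String, TwoPhaseSinkSketch.MyTxn, Void> {

    public static class MyTxn {
        public String externalTxnId; // handle to the external system's transaction
    }

    public TwoPhaseSinkSketch() {
        super(new KryoSerializer<>(MyTxn.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTxn beginTransaction() {
        return new MyTxn(); // open a new transaction in the external system
    }

    @Override
    protected void invoke(MyTxn txn, String value, Context context) {
        // Write the record inside the open transaction; it is not visible yet.
    }

    @Override
    protected void preCommit(MyTxn txn) {
        // Phase 1: called when the checkpoint barrier reaches the sink,
        // flush all buffered data (like KafkaProducer.flush() above).
    }

    @Override
    protected void commit(MyTxn txn) {
        // Phase 2: called from notifyCheckpointComplete() after the whole
        // checkpoint succeeds; make the pre-committed data visible.
    }

    @Override
    protected void abort(MyTxn txn) {
        // Roll back the transaction if the checkpoint fails and retries give up.
    }
}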

1.5 Doris Stream Load 2PC

1.5.1 Stream load

Stream Load is a synchronous import method provided by Apache Doris. Users send an HTTP request to import local files or data streams into Doris. Stream Load executes the import synchronously and returns the result, so users can judge directly from the response body whether the import succeeded.

Stream Load is mainly suitable for importing local files, or importing data streams through programs.

Usage: users can operate it through an HTTP client or with the curl command:

curl --location-trusted -u user:passwd [-H ""...] -T data.file -H "label:label" -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load

To prevent users from importing the same data repeatedly, the import task label is used here. It is strongly recommended that users use the same label for the same batch of data. This way, repeated requests for the same batch of data will only be accepted once, guaranteeing At-Most-Once.

1.5.2 Stream load 2PC

The earliest Stream Load in Apache Doris had no two-phase commit: data was imported directly through the Stream Load HTTP interface, with only success or failure as outcomes.

  1. This is not a problem under normal circumstances, but in a distributed environment, data on the two ends can become inconsistent because some import task fails. In particular, with the earlier Doris Flink Connector, the user had to handle import failures themselves, for example by saving failed data to a designated place (such as Kafka) and then processing it manually.
  2. If the Flink job suddenly hangs for some other reason, some data will have succeeded and some failed, and because the failed data has no checkpoint, restarting the job cannot re-consume it, resulting in inconsistent data on the two ends.

In order to solve the above problems and ensure data consistency on both ends, we implemented Doris Stream Load 2PC. The principle is as follows:

  1. Committing happens in two phases.
  2. In the first phase, the data-writing task is submitted. After the data is written successfully, the data is invisible and the internal transaction status is PRECOMMITTED.
  3. After all the data is written successfully, the user triggers the commit operation, changing the transaction status to VISIBLE; at this point the data can be queried.
  4. If the user wants to discard this batch of data, they can trigger an abort operation on the transaction using the transaction ID, and the batch will be automatically deleted.

1.5.3 How to use Stream load 2PC

  1. Set disable_stream_load_2pc=false in be.conf (restart BE to take effect).
  2. Declare two_phase_commit=true in the HTTP HEADER.

To initiate a pre-commit:

curl  --location-trusted -u user:passwd -H "two_phase_commit:true" -T test.txt http://fe_host:http_port/api/{db}/{table}/_stream_load

Trigger the transaction commit operation:

curl -X PUT --location-trusted -u user:passwd  -H "txn_id:18036" -H "txn_operation:commit"  http://fe_host:http_port/api/{db}/_stream_load_2pc

Trigger the transaction abort operation:

curl -X PUT --location-trusted -u user:passwd  -H "txn_id:18037" -H "txn_operation:abort"  http://fe_host:http_port/api/{db}/_stream_load_2pc

1.6 Doris Flink Connector 2PC

We previously provided the Doris Flink Connector, which supports reading Doris table data, as well as Upsert and delete (on the Unique Key model). However, there was a problem: a job failure or other exception could leave the data on the two ends inconsistent.

In order to solve these problems, we modified and upgraded the Doris Connector based on Flink 2PC and Doris Stream Load 2PC to guarantee exactly-once on both ends.

  1. We maintain read and write buffers in memory. At startup, writing is opened and data is submitted asynchronously: it is continuously written to the BE via HTTP chunked transfer, and writing stops only at checkpoint time. This avoids the overhead of frequent HTTP submissions by the user. After the checkpoint completes, the next phase of writing starts.
  2. During a checkpoint, multiple tasks may be writing data for the same table at the same time. We associate all of them with one global label for that checkpoint, and during the checkpoint the transactions writing data under this label are committed together, making the data visible.
  3. If the commit fails, Flink replays the data from the checkpoint when the job restarts.
  4. This ensures data consistency on the two ends of Doris. A sketch of wiring this sink through the connector's DataStream API follows this list.
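
Based on the connector documentation, wiring this 2PC-capable sink through the DataStream API looks roughly like the sketch below; exact package and method names may differ slightly between connector versions, and the records here are assumed to be CSV strings matching the target table:

import java.util.Properties;

import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.SimpleStringSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DorisSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // 2PC commits happen at checkpoint boundaries

        DorisOptions dorisOptions = DorisOptions.builder()
                .setFenodes("172.19.0.12:8030")
                .setTableIdentifier("demo.all_employees_info")
                .setUsername("root")
                .setPassword("")
                .build();

        Properties streamLoadProps = new Properties();
        streamLoadProps.setProperty("two_phase_commit", "true"); // same as sink.properties.two_phase_commit

        DorisExecutionOptions executionOptions = DorisExecutionOptions.builder()
                .setLabelPrefix("doris_demo_emp_001") // global label prefix, must be unique per job
                .setStreamLoadProp(streamLoadProps)
                .build();

        DorisSink<String> sink = DorisSink.<String>builder()
                .setDorisReadOptions(DorisReadOptions.builder().build())
                .setDorisExecutionOptions(executionOptions)
                .setSerializer(new SimpleStringSerializer()) // records are plain strings
                .setDorisOptions(dorisOptions)
                .build();

        env.fromElements("10001,1953-09-02,Georgi,Facello,M,1986-06-26,emp_1,employees_1")
           .sinkTo(sink);
        env.execute("doris-sink-2pc-sketch");
    }
}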

2. System Architecture

Let's take a complete example to see how to integrate Flink CDC with the latest version of the Doris Flink Connector (supporting two-phase commit) to achieve real-time collection and ingestion of sharded MySQL databases and tables.


  1. Here we use Flink CDC to collect the data from the sharded MySQL databases and tables
  2. Then use the Doris Flink Connector to complete the data ingestion
  3. Finally, use Doris's high-concurrency, high-performance OLAP analysis and computing capabilities to provide data services externally

3. MySQL installation configuration

3.1 Install MySQL

Quickly install and configure MySQL using Docker; refer to the following link for details:

https://segmentfault.com/a/1190000021523570

3.2 Enable MySQL binlog

Enter the Docker container, modify the /etc/my.cnf file, and add the following under [mysqld]:

log_bin=mysql_bin
binlog-format=Row
server-id=1

Then restart MySQL:

systemctl restart mysqld

3.3 Prepare data

For this demonstration we prepare two databases, emp_1 and emp_2, each containing two tables, employees_1 and employees_2, and provide the initialization data:

CREATE DATABASE emp_1;
 USE emp_1;
CREATE TABLE employees_1 (
    emp_no      INT             NOT NULL,
    birth_date  DATE            NOT NULL,
    first_name  VARCHAR(14)     NOT NULL,
    last_name   VARCHAR(16)     NOT NULL,
    gender      ENUM ('M','F')  NOT NULL,    
    hire_date   DATE            NOT NULL,
    PRIMARY KEY (emp_no)
);
​
INSERT INTO `employees_1` VALUES (10001,'1953-09-02','Georgi','Facello','M','1986-06-26'),
(10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21'),
(10003,'1959-12-03','Parto','Bamford','M','1986-08-28'),
(10004,'1954-05-01','Chirstian','Koblick','M','1986-12-01'),
(10005,'1955-01-21','Kyoichi','Maliniak','M','1989-09-12'),
(10006,'1953-04-20','Anneke','Preusig','F','1989-06-02'),
(10007,'1957-05-23','Tzvetan','Zielinski','F','1989-02-10'),
(10008,'1958-02-19','Saniya','Kalloufi','M','1994-09-15'),
(10009,'1952-04-19','Sumant','Peac','F','1985-02-18'),
(10010,'1963-06-01','Duangkaew','Piveteau','F','1989-08-24'),
(10011,'1953-11-07','Mary','Sluis','F','1990-01-22'),
(10012,'1960-10-04','Patricio','Bridgland','M','1992-12-18'),
(10013,'1963-06-07','Eberhardt','Terkki','M','1985-10-20'),
(10014,'1956-02-12','Berni','Genin','M','1987-03-11'),
(10015,'1959-08-19','Guoxiang','Nooteboom','M','1987-07-02'),
(10016,'1961-05-02','Kazuhito','Cappelletti','M','1995-01-27'),
(10017,'1958-07-06','Cristinel','Bouloucos','F','1993-08-03'),
(10018,'1954-06-19','Kazuhide','Peha','F','1987-04-03'),
(10019,'1953-01-23','Lillian','Haddadi','M','1999-04-30'),
(10020,'1952-12-24','Mayuko','Warwick','M','1991-01-26'),
(10021,'1960-02-20','Ramzi','Erde','M','1988-02-10'),
(10022,'1952-07-08','Shahaf','Famili','M','1995-08-22'),
(10023,'1953-09-29','Bojan','Montemayor','F','1989-12-17'),
(10024,'1958-09-05','Suzette','Pettey','F','1997-05-19'),
(10025,'1958-10-31','Prasadram','Heyers','M','1987-08-17'),
(10026,'1953-04-03','Yongqiao','Berztiss','M','1995-03-20'),
(10027,'1962-07-10','Divier','Reistad','F','1989-07-07'),
(10028,'1963-11-26','Domenick','Tempesti','M','1991-10-22'),
(10029,'1956-12-13','Otmar','Herbst','M','1985-11-20'),
(10030,'1958-07-14','Elvis','Demeyer','M','1994-02-17'),
(10031,'1959-01-27','Karsten','Joslin','M','1991-09-01'),
(10032,'1960-08-09','Jeong','Reistad','F','1990-06-20'),
(10033,'1956-11-14','Arif','Merlo','M','1987-03-18'),
(10034,'1962-12-29','Bader','Swan','M','1988-09-21'),
(10035,'1953-02-08','Alain','Chappelet','M','1988-09-05'),
(10036,'1959-08-10','Adamantios','Portugali','M','1992-01-03');
​
CREATE TABLE employees_2 (
    emp_no      INT             NOT NULL,
    birth_date  DATE            NOT NULL,
    first_name  VARCHAR(14)     NOT NULL,
    last_name   VARCHAR(16)     NOT NULL,
    gender      ENUM ('M','F')  NOT NULL,    
    hire_date   DATE            NOT NULL,
    PRIMARY KEY (emp_no)
);
​
INSERT INTO `employees_2` VALUES (10037,'1963-07-22','Pradeep','Makrucki','M','1990-12-05'),
(10038,'1960-07-20','Huan','Lortz','M','1989-09-20'),
(10039,'1959-10-01','Alejandro','Brender','M','1988-01-19'),
(10040,'1959-09-13','Weiyi','Meriste','F','1993-02-14'),
(10041,'1959-08-27','Uri','Lenart','F','1989-11-12'),
(10042,'1956-02-26','Magy','Stamatiou','F','1993-03-21'),
(10043,'1960-09-19','Yishay','Tzvieli','M','1990-10-20'),
(10044,'1961-09-21','Mingsen','Casley','F','1994-05-21'),
(10045,'1957-08-14','Moss','Shanbhogue','M','1989-09-02'),
(10046,'1960-07-23','Lucien','Rosenbaum','M','1992-06-20'),
(10047,'1952-06-29','Zvonko','Nyanchama','M','1989-03-31'),
(10048,'1963-07-11','Florian','Syrotiuk','M','1985-02-24'),
(10049,'1961-04-24','Basil','Tramer','F','1992-05-04'),
(10050,'1958-05-21','Yinghua','Dredge','M','1990-12-25'),
(10051,'1953-07-28','Hidefumi','Caine','M','1992-10-15'),
(10052,'1961-02-26','Heping','Nitsch','M','1988-05-21'),
(10053,'1954-09-13','Sanjiv','Zschoche','F','1986-02-04'),
(10054,'1957-04-04','Mayumi','Schueller','M','1995-03-13');
​
​
CREATE DATABASE emp_2;
​
USE emp_2;
​
CREATE TABLE employees_1 (
    emp_no      INT             NOT NULL,
    birth_date  DATE            NOT NULL,
    first_name  VARCHAR(14)     NOT NULL,
    last_name   VARCHAR(16)     NOT NULL,
    gender      ENUM ('M','F')  NOT NULL,    
    hire_date   DATE            NOT NULL,
    PRIMARY KEY (emp_no)
);
​
​
INSERT INTO `employees_1` VALUES  (10055,'1956-06-06','Georgy','Dredge','M','1992-04-27'),
(10056,'1961-09-01','Brendon','Bernini','F','1990-02-01'),
(10057,'1954-05-30','Ebbe','Callaway','F','1992-01-15'),
(10058,'1954-10-01','Berhard','McFarlin','M','1987-04-13'),
(10059,'1953-09-19','Alejandro','McAlpine','F','1991-06-26'),
(10060,'1961-10-15','Breannda','Billingsley','M','1987-11-02'),
(10061,'1962-10-19','Tse','Herber','M','1985-09-17'),
(10062,'1961-11-02','Anoosh','Peyn','M','1991-08-30'),
(10063,'1952-08-06','Gino','Leonhardt','F','1989-04-08'),
(10064,'1959-04-07','Udi','Jansch','M','1985-11-20'),
(10065,'1963-04-14','Satosi','Awdeh','M','1988-05-18'),
(10066,'1952-11-13','Kwee','Schusler','M','1986-02-26'),
(10067,'1953-01-07','Claudi','Stavenow','M','1987-03-04'),
(10068,'1962-11-26','Charlene','Brattka','M','1987-08-07'),
(10069,'1960-09-06','Margareta','Bierman','F','1989-11-05'),
(10070,'1955-08-20','Reuven','Garigliano','M','1985-10-14'),
(10071,'1958-01-21','Hisao','Lipner','M','1987-10-01'),
(10072,'1952-05-15','Hironoby','Sidou','F','1988-07-21'),
(10073,'1954-02-23','Shir','McClurg','M','1991-12-01'),
(10074,'1955-08-28','Mokhtar','Bernatsky','F','1990-08-13'),
(10075,'1960-03-09','Gao','Dolinsky','F','1987-03-19'),
(10076,'1952-06-13','Erez','Ritzmann','F','1985-07-09'),
(10077,'1964-04-18','Mona','Azuma','M','1990-03-02'),
(10078,'1959-12-25','Danel','Mondadori','F','1987-05-26'),
(10079,'1961-10-05','Kshitij','Gils','F','1986-03-27'),
(10080,'1957-12-03','Premal','Baek','M','1985-11-19'),
(10081,'1960-12-17','Zhongwei','Rosen','M','1986-10-30'),
(10082,'1963-09-09','Parviz','Lortz','M','1990-01-03'),
(10083,'1959-07-23','Vishv','Zockler','M','1987-03-31'),
(10084,'1960-05-25','Tuval','Kalloufi','M','1995-12-15');
​
​
CREATE TABLE employees_2(
    emp_no      INT             NOT NULL,
    birth_date  DATE            NOT NULL,
    first_name  VARCHAR(14)     NOT NULL,
    last_name   VARCHAR(16)     NOT NULL,
    gender      ENUM ('M','F')  NOT NULL,    
    hire_date   DATE            NOT NULL,
    PRIMARY KEY (emp_no)
);
​
INSERT INTO `employees_2` VALUES (10085,'1962-11-07','Kenroku','Malabarba','M','1994-04-09'),
(10086,'1962-11-19','Somnath','Foote','M','1990-02-16'),
(10087,'1959-07-23','Xinglin','Eugenio','F','1986-09-08'),
(10088,'1954-02-25','Jungsoon','Syrzycki','F','1988-09-02'),
(10089,'1963-03-21','Sudharsan','Flasterstein','F','1986-08-12'),
(10090,'1961-05-30','Kendra','Hofting','M','1986-03-14'),
(10091,'1955-10-04','Amabile','Gomatam','M','1992-11-18'),
(10092,'1964-10-18','Valdiodio','Niizuma','F','1989-09-22'),
(10093,'1964-06-11','Sailaja','Desikan','M','1996-11-05'),
(10094,'1957-05-25','Arumugam','Ossenbruggen','F','1987-04-18'),
(10095,'1965-01-03','Hilari','Morton','M','1986-07-15'),
(10096,'1954-09-16','Jayson','Mandell','M','1990-01-14'),
(10097,'1952-02-27','Remzi','Waschkowski','M','1990-09-15'),
(10098,'1961-09-23','Sreekrishna','Servieres','F','1985-05-13'),
(10099,'1956-05-25','Valter','Sullins','F','1988-10-18'),
(10100,'1953-04-21','Hironobu','Haraldson','F','1987-09-21'),
(10101,'1952-04-15','Perla','Heyers','F','1992-12-28'),
(10102,'1959-11-04','Paraskevi','Luby','F','1994-01-26'),
(10103,'1953-11-26','Akemi','Birch','M','1986-12-02'),
(10104,'1961-11-19','Xinyu','Warwick','M','1987-04-16'),
(10105,'1962-02-05','Hironoby','Piveteau','M','1999-03-23'),
(10106,'1952-08-29','Eben','Aingworth','M','1990-12-19'),
(10107,'1956-06-13','Dung','Baca','F','1994-03-22'),
(10108,'1952-04-07','Lunjin','Giveon','M','1986-10-02'),
(10109,'1958-11-25','Mariusz','Prampolini','F','1993-06-16'),
(10110,'1957-03-07','Xuejia','Ullian','F','1986-08-22'),
(10111,'1963-08-29','Hugo','Rosis','F','1988-06-19'),
(10112,'1963-08-13','Yuichiro','Swick','F','1985-10-08'),
(10113,'1963-11-13','Jaewon','Syrzycki','M','1989-12-24'),
(10114,'1957-02-16','Munir','Demeyer','F','1992-07-17'),
(10115,'1964-12-25','Chikara','Rissland','M','1986-01-23'),
(10116,'1955-08-26','Dayanand','Czap','F','1985-05-28');

4. Doris installation configuration

Here we take the stand-alone version as an example

First download the Doris 1.1 release version:

https://doris.apache.org/downloads/downloads.html

Unzip to the specified directory

tar zxvf apache-doris-1.1.0-bin.tar.gz -C doris-1.1

The unzipped directory structure is as follows:

.
├── apache_hdfs_broker
│   ├── bin
│   ├── conf
│   └── lib
├── be
│   ├── bin
│   ├── conf
│   ├── lib
│   ├── log
│   ├── minidump
│   ├── storage
│   └── www
├── derby.log
├── fe
│   ├── bin
│   ├── conf
│   ├── doris-meta
│   ├── lib
│   ├── log
│   ├── plugins
│   ├── spark-dpp
│   ├── temp_dir
│   └── webroot
└── udf
    ├── include
    └── lib

Configure fe and be

cd doris-1.1
# Configure fe.conf and be.conf; these two files are in the conf directories of fe and be respectively.
# Uncomment priority_networks and change it to your own IP address.
# Note that the IP address here is configured in CIDR notation.
# For example, my local IP is 172.19.0.12, so my configuration is:
priority_networks = 172.19.0.0/24

######
# Add the following configuration at the end of be.conf:
disable_stream_load_2pc=false

  1. Note that by default you only need to modify fe.conf and be.conf as above.
  2. The FE metadata directory is under the fe/doris-meta directory by default.
  3. BE data is stored in the be/storage directory.

Start FE

sh fe/bin/start_fe.sh --daemon

Start BE

sh be/bin/start_be.sh --daemon

Connect to the FE with the MySQL command line. The default users of a newly installed Doris cluster are root and admin, and the passwords are empty.

mysql -uroot -P9030 -h127.0.0.1
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 41
Server version: 5.7.37 Doris version trunk-440ad03
​
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
​
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
​
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
​
mysql> show frontends;
+--------------------------------+-------------+-------------+----------+-----------+---------+----------+----------+------------+------+-------+-------------------+---------------------+----------+--------+---------------+------------------+
| Name                           | IP          | EditLogPort | HttpPort | QueryPort | RpcPort | Role     | IsMaster | ClusterId  | Join | Alive | ReplayedJournalId | LastHeartbeat       | IsHelper | ErrMsg | Version       | CurrentConnected |
+--------------------------------+-------------+-------------+----------+-----------+---------+----------+----------+------------+------+-------+-------------------+---------------------+----------+--------+---------------+------------------+
| 172.19.0.12_9010_1654681464955 | 172.19.0.12 | 9010        | 8030     | 9030      | 9020    | FOLLOWER | true     | 1690644599 | true | true  | 381106            | 2022-06-22 18:13:34 | true     |        | trunk-440ad03 | Yes              |
+--------------------------------+-------------+-------------+----------+-----------+---------+----------+----------+------------+------+-------+-------------------+---------------------+----------+--------+---------------+------------------+
1 row in set (0.01 sec)
​

Add BE nodes to the cluster

mysql> alter system add backend "172.19.0.12:9050";

Use your own IP address here.

View BE

mysql> show backends;
+-----------+-----------------+-------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------------------------+--------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| BackendId | Cluster         | IP          | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime       | LastHeartbeat       | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | Tag                      | ErrMsg | Version       | Status                                                                                                                        |
+-----------+-----------------+-------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------------------------+--------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
| 10002     | default_cluster | 172.19.0.12 | 9050          | 9060   | 8040     | 8060     | 2022-06-22 12:51:58 | 2022-06-22 18:15:34 | true  | false                | false                 | 4369      | 328.686 MB       | 144.083 GB    | 196.735 GB    | 26.76 % | 26.76 %        | {"location" : "default"} |        | trunk-440ad03 | {"lastSuccessReportTabletsTime":"2022-06-22 18:15:05","lastStreamLoadTime":-1,"isQueryDisabled":false,"isLoadDisabled":false} |
+-----------+-----------------+-------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------------------------+--------+---------------+-------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Doris stand-alone installation is complete

5. Flink installation configuration

5.1 Download and install Flink 1.14.4

wget https://dlcdn.apache.org/flink/flink-1.14.4/flink-1.14.4-bin-scala_2.12.tgz
tar zxvf flink-1.14.4-bin-scala_2.12.tgz

Then copy the following dependencies into the lib directory under the Flink installation directory. The specific dependency jar files are as follows:

wget https://jiafeng-1308700295.cos.ap-hongkong.myqcloud.com/flink-doris-connector-1.14_2.12-1.0.0-SNAPSHOT.jar
wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.2.1/flink-sql-connector-mysql-cdc-2.2.1.jar

Start Flink

bin/start-cluster.sh

After startup, you can open the Flink Web UI to confirm that the cluster is running.

6. Start syncing data to Doris

6.1 Create Doris database and table

create database demo;
use demo;
CREATE TABLE all_employees_info (
    emp_no       int NOT NULL,
    birth_date   date,
    first_name   varchar(20),
    last_name    varchar(20),
    gender       char(2),
    hire_date    date,
    database_name varchar(50),
    table_name    varchar(200)
)
UNIQUE KEY(`emp_no`, `birth_date`)
DISTRIBUTED BY HASH(`birth_date`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

6.2 Enter Flink SQL Client

 bin/sql-client.sh embedded 


Turn on checkpointing, performing a checkpoint every 10 seconds.

Checkpointing is not enabled by default; we need to enable it so that transactions can be committed.

At startup, the source scans the entire table and splits it into multiple chunks according to the primary key, then uses the incremental snapshot algorithm to read the data of each chunk one by one. The job periodically executes checkpoints that record the finished chunks. When a failover occurs, only the unfinished chunks need to be read again. After all chunks are read, incremental change records are read from the previously obtained binlog position. The Flink job continues to periodically execute checkpoints, recording the binlog position, and if the job fails over, processing resumes from the recorded binlog position, thus achieving exactly-once semantics.
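
The chunk-splitting idea can be illustrated with a simplified, self-contained Java sketch (this is not the connector's actual implementation; the key range and chunk size below are made up, loosely based on the demo tables):

import java.util.ArrayList;
import java.util.List;

public class ChunkSplitSketch {
    public static void main(String[] args) {
        // Hypothetical primary-key range and chunk size.
        long minPk = 10001, maxPk = 10116, chunkSize = 32;
        List<long[]> chunks = new ArrayList<>();
        for (long start = minPk; start <= maxPk; start += chunkSize) {
            chunks.add(new long[]{start, Math.min(start + chunkSize - 1, maxPk)});
        }
        // Checkpoints record which chunks are finished; on failover only the
        // unfinished chunks are re-read, then reading switches to the binlog position.
        chunks.forEach(c -> System.out.println("chunk [" + c[0] + ", " + c[1] + "]"));
    }
}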

SET execution.checkpointing.interval = 10s;

Note: this is a demo; for production environments a checkpoint interval of 60 seconds is recommended.

6.3 Create MySQL CDC table

Execute the following SQL in the Flink SQL Client:

CREATE TABLE employees_source (
    database_name STRING METADATA VIRTUAL,
    table_name STRING METADATA VIRTUAL,
    emp_no int NOT NULL,
    birth_date date,
    first_name STRING,
    last_name STRING,
    gender STRING,
    hire_date date,
    PRIMARY KEY (`emp_no`) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'root',
    'password' = 'MyNewPass4!',
    'database-name' = 'emp_[0-9]+',
    'table-name' = 'employees_[0-9]+'
  );

  1. 'database-name' = 'emp_[0-9]+': a regular expression is used here to match multiple databases at the same time
  2. 'table-name' = 'employees_[0-9]+': a regular expression is used here to match multiple tables at the same time

Query the CDC table; if data is returned, everything is normal:

select * from employees_source limit 10;


6.4 Create Doris Sink Table

CREATE TABLE cdc_doris_sink (
    emp_no       int ,
    birth_date   STRING,
    first_name   STRING,
    last_name    STRING,
    gender       STRING,
    hire_date    STRING,
    database_name STRING,
    table_name    STRING
) 
WITH (
  'connector' = 'doris',
  'fenodes' = '172.19.0.12:8030',
  'table.identifier' = 'demo.all_employees_info',
  'username' = 'root',
  'password' = '',
  'sink.properties.two_phase_commit'='true',
  'sink.label-prefix'='doris_demo_emp_001'
);

Parameter Description:

  1. connector: specifies that the connector is doris
  2. fenodes: the IP address and HTTP port of the Doris FE node
  3. table.identifier: the database and table name in Doris
  4. username: the Doris username
  5. password: the Doris user's password
  6. sink.properties.two_phase_commit: specifies two-phase commit, so that two_phase_commit:true is added to the HTTP header during stream load; otherwise the load will fail
  7. sink.label-prefix: a parameter that must be set when using two-phase commit, to guarantee data consistency on both ends; otherwise the load will fail
  8. For other parameters, refer to the official documentation: https://doris.apache.org/zh-CN/docs/ecosystem/flink-doris-connector.html

At this point, querying the Doris sink table returns no data yet:

select * from cdc_doris_sink;


6.5 Insert data into Doris table

Execute the following SQL:

insert into cdc_doris_sink (emp_no,birth_date,first_name,last_name,gender,hire_date,database_name,table_name) 
select emp_no,cast(birth_date as string) as birth_date ,first_name,last_name,gender,cast(hire_date as string) as hire_date ,database_name,table_name from employees_source;

Then we can see the task running information on the Flink Web UI.

Looking at the TaskManager log, we can see that two-phase commit is used here: data is continuously transmitted to the BE via HTTP chunked transfer, and after each checkpoint completes, the next batch of writing begins.

2022-06-22 19:04:01,350 INFO  io.debezium.relational.history.DatabaseHistoryMetrics        [] - Started database history recovery
2022-06-22 19:04:01,350 INFO  io.debezium.relational.history.DatabaseHistoryMetrics        [] - Finished database history recovery of 0 change(s) in 0 ms
2022-06-22 19:04:01,351 INFO  io.debezium.util.Threads                                     [] - Requested thread factory for connector MySqlConnector, id = mysql_binlog_source named = binlog-client
2022-06-22 19:04:01,352 INFO  io.debezium.connector.mysql.MySqlStreamingChangeEventSource  [] - Skip 0 events on streaming start
2022-06-22 19:04:01,352 INFO  io.debezium.connector.mysql.MySqlStreamingChangeEventSource  [] - Skip 0 rows on streaming start
2022-06-22 19:04:01,352 INFO  io.debezium.util.Threads                                     [] - Creating thread debezium-mysqlconnector-mysql_binlog_source-binlog-client
2022-06-22 19:04:01,374 INFO  io.debezium.util.Threads                                     [] - Creating thread debezium-mysqlconnector-mysql_binlog_source-binlog-client
2022-06-22 19:04:01,381 INFO  io.debezium.connector.mysql.MySqlStreamingChangeEventSource  [] - Connected to MySQL binlog at localhost:3306, starting at MySqlOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.mysql.Source:STRUCT}, sourceInfo=SourceInfo [currentGtid=null, currentBinlogFilename=mysql_bin.000005, currentBinlogPosition=211725, currentRowNumber=0, serverId=0, sourceTime=null, threadId=-1, currentQuery=null, tableIds=[], databaseName=null], partition={server=mysql_binlog_source}, snapshotCompleted=false, transactionContext=TransactionContext [currentTransactionId=null, perTableEventCount={}, totalEventCount=0], restartGtidSet=null, currentGtidSet=null, restartBinlogFilename=mysql_bin.000005, restartBinlogPosition=211725, restartRowsToSkip=0, restartEventsToSkip=0, currentEventLengthInBytes=0, inTransaction=false, transactionId=null]
2022-06-22 19:04:01,381 INFO  io.debezium.util.Threads                                     [] - Creating thread debezium-mysqlconnector-mysql_binlog_source-binlog-client
2022-06-22 19:04:01,381 INFO  io.debezium.connector.mysql.MySqlStreamingChangeEventSource  [] - Waiting for keepalive thread to start
2022-06-22 19:04:01,497 INFO  io.debezium.connector.mysql.MySqlStreamingChangeEventSource  [] - Keepalive thread is running
2022-06-22 19:04:08,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:08,321 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6963,
    "Label": "doris_demo_001_0_1",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 634,
    "NumberLoadedRows": 634,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 35721,
    "LoadTimeMs": 9046,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9041,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:08,321 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:08,321 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_2
2022-06-22 19:04:08,321 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:08,325 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 1
2022-06-22 19:04:08,329 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6963] commit successfully."
}
2022-06-22 19:04:18,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:18,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6964,
    "Label": "doris_demo_001_0_2",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9988,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9983,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:18,310 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:18,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_3
2022-06-22 19:04:18,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:18,312 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 2
2022-06-22 19:04:18,317 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6964] commit successfully."
}
2022-06-22 19:04:28,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:28,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6965,
    "Label": "doris_demo_001_0_3",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9998,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9993,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:28,308 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:28,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_4
2022-06-22 19:04:28,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:28,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 3
2022-06-22 19:04:28,316 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6965] commit successfully."
}
2022-06-22 19:04:38,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6966,
    "Label": "doris_demo_001_0_4",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9999,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9994,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:38,308 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_5
2022-06-22 19:04:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:38,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 4
2022-06-22 19:04:38,317 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6966] commit successfully."
}
2022-06-22 19:04:48,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:48,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6967,
    "Label": "doris_demo_001_0_5",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 10000,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9996,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:48,310 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:48,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_6
2022-06-22 19:04:48,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:48,312 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 5
2022-06-22 19:04:48,317 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6967] commit successfully."
}
2022-06-22 19:04:58,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:04:58,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6968,
    "Label": "doris_demo_001_0_6",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9998,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9993,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:04:58,308 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:04:58,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_7
2022-06-22 19:04:58,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:04:58,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 6
2022-06-22 19:04:58,316 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6968] commit successfully."
}
2022-06-22 19:05:08,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:05:08,309 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6969,
    "Label": "doris_demo_001_0_7",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9999,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9995,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:05:08,309 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:05:08,309 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_8
2022-06-22 19:05:08,309 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:05:08,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 7
2022-06-22 19:05:08,316 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6969] commit successfully."
}
2022-06-22 19:05:18,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:05:18,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6970,
    "Label": "doris_demo_001_0_8",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9999,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9993,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:05:18,308 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:05:18,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_9
2022-06-22 19:05:18,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:05:18,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 8
2022-06-22 19:05:18,317 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6970] commit successfully."
}
2022-06-22 19:05:28,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:05:28,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6971,
    "Label": "doris_demo_001_0_9",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 10000,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9996,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:05:28,310 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:05:28,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_10
2022-06-22 19:05:28,310 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:05:28,315 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 9
2022-06-22 19:05:28,320 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6971] commit successfully."
}
2022-06-22 19:05:38,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:05:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6972,
    "Label": "doris_demo_001_0_10",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 9998,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9992,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:05:38,308 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:05:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_11
2022-06-22 19:05:38,308 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:05:38,311 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 10
2022-06-22 19:05:38,316 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6972] commit successfully."
}
2022-06-22 19:05:48,303 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load stopped.
2022-06-22 19:05:48,315 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - load Result {
    "TxnId": 6973,
    "Label": "doris_demo_001_0_11",
    "TwoPhaseCommit": "true",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 520,
    "NumberLoadedRows": 520,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 29293,
    "LoadTimeMs": 10005,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 10001,
    "CommitAndPublishTimeMs": 0
}
​
2022-06-22 19:05:48,315 INFO  org.apache.doris.flink.sink.writer.RecordBuffer              [] - start buffer data, read queue size 0, write queue size 3
2022-06-22 19:05:48,315 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - stream load started for doris_demo_001_0_12
2022-06-22 19:05:48,315 INFO  org.apache.doris.flink.sink.writer.DorisStreamLoad           [] - start execute load
2022-06-22 19:05:48,322 INFO  org.apache.flink.streaming.runtime.operators.sink.AbstractStreamingCommitterHandler [] - Committing the state for checkpoint 11
2022-06-22 19:05:48,327 INFO  org.apache.doris.flink.sink.committer.DorisCommitter         [] - load result {
    "status": "Success",
    "msg": "transaction [6973] commit successfully."
}

6.6 Querying Doris data

Here I have inserted 636 rows of data:

mysql> select count(1) from  all_employees_info ;
+----------+
| count(1) |
+----------+
|      634 |
+----------+
1 row in set (0.01 sec)
​
mysql> select * from  all_employees_info limit 20;
+--------+------------+------------+-------------+--------+------------+---------------+-------------+
| emp_no | birth_date | first_name | last_name   | gender | hire_date  | database_name | table_name  |
+--------+------------+------------+-------------+--------+------------+---------------+-------------+
|  10001 | 1953-09-02 | Georgi     | Facello     | M      | 1986-06-26 | emp_1         | employees_1 |
|  10002 | 1964-06-02 | Bezalel    | Simmel      | F      | 1985-11-21 | emp_1         | employees_1 |
|  10003 | 1959-12-03 | Parto      | Bamford     | M      | 1986-08-28 | emp_1         | employees_1 |
|  10004 | 1954-05-01 | Chirstian  | Koblick     | M      | 1986-12-01 | emp_1         | employees_1 |
|  10005 | 1955-01-21 | Kyoichi    | Maliniak    | M      | 1989-09-12 | emp_1         | employees_1 |
|  10006 | 1953-04-20 | Anneke     | Preusig     | F      | 1989-06-02 | emp_1         | employees_1 |
|  10007 | 1957-05-23 | Tzvetan    | Zielinski   | F      | 1989-02-10 | emp_1         | employees_1 |
|  10008 | 1958-02-19 | Saniya     | Kalloufi    | M      | 1994-09-15 | emp_1         | employees_1 |
|  10009 | 1952-04-19 | Sumant     | Peac        | F      | 1985-02-18 | emp_1         | employees_1 |
|  10010 | 1963-06-01 | Duangkaew  | Piveteau    | F      | 1989-08-24 | emp_1         | employees_1 |
|  10011 | 1953-11-07 | Mary       | Sluis       | F      | 1990-01-22 | emp_1         | employees_1 |
|  10012 | 1960-10-04 | Patricio   | Bridgland   | M      | 1992-12-18 | emp_1         | employees_1 |
|  10013 | 1963-06-07 | Eberhardt  | Terkki      | M      | 1985-10-20 | emp_1         | employees_1 |
|  10014 | 1956-02-12 | Berni      | Genin       | M      | 1987-03-11 | emp_1         | employees_1 |
|  10015 | 1959-08-19 | Guoxiang   | Nooteboom   | M      | 1987-07-02 | emp_1         | employees_1 |
|  10016 | 1961-05-02 | Kazuhito   | Cappelletti | M      | 1995-01-27 | emp_1         | employees_1 |
|  10017 | 1958-07-06 | Cristinel  | Bouloucos   | F      | 1993-08-03 | emp_1         | employees_1 |
|  10018 | 1954-06-19 | Kazuhide   | Peha        | F      | 1987-04-03 | emp_1         | employees_1 |
|  10019 | 1953-01-23 | Lillian    | Haddadi     | M      | 1999-04-30 | emp_1         | employees_1 |
|  10020 | 1952-12-24 | Mayuko     | Warwick     | M      | 1991-01-26 | emp_1         | employees_1 |
+--------+------------+------------+-------------+--------+------------+---------------+-------------+
20 rows in set (0.00 sec)

6.7 Test deletion

mysql> use emp_2;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
​
Database changed
mysql> show tables;
+-----------------+
| Tables_in_emp_2 |
+-----------------+
| employees_1     |
| employees_2     |
+-----------------+
2 rows in set (0.00 sec)
​
mysql> delete from employees_2 where emp_no in (12013,12014,12015);
Query OK, 3 rows affected (0.01 sec)

Verify Doris Data Deletion

mysql> select count(1) from  all_employees_info ;
+----------+
| count(1) |
+----------+
|      631 |
+----------+
1 row in set (0.01 sec)

7. Summary

This article introduced how to synchronize sharded MySQL databases and tables in real time with Flink CDC, explained the mechanisms and integration principles of Flink 2PC and Doris Stream Load 2PC, and showed how to use the latest version of the Apache Doris Flink Connector, which integrates the two.

We hope it brings you some help.

8. Related Links:

SelectDB official website:

https://selectdb.com

Apache Doris official website:

http://doris.apache.org

Apache Doris Github:

https://github.com/apache/doris

Apache Doris developer mailing group:

[email protected]
