Building Streaming ETL for MySQL and Postgres based on Flink CDC

Official tutorial: https://ververica.github.io/flink-cdc-connectors/release-2.3/content/%E5%BF%AB%E9%80%9F%E4%B8%8A%E6%89%8B/mysql-postgres-tutorial-zh.html. The official mysql-postgres tutorial has a few pitfalls, so after working through it myself I took these notes.

Server environment:

VM: CentOS 7.9

Docker version: 24.0.5, build ced0996

docker-compose version: 2.19

JDK: 1.8

Virtual machine IP: 192.168.122.131

Memory: 16 GB (at least 16 GB is required)

CPU: 4 cores

Disk: >= 60 GB

1. Docker compose installation

DOCKER_CONFIG=${DOCKER_CONFIG:-/usr/local/lib/docker/cli-plugins}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.19.1/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose

Apply executable permissions to the file:

chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose

Test whether the installation was successful

docker compose version   # the old v1 command was: docker-compose --version

Reference: https://blog.csdn.net/qq_40099908/article/details/131611496

2. Hands-on walkthrough

This tutorial will show how to quickly build streaming ETL for MySQL and Postgres based on Flink CDC. The demonstrations in this tutorial will be carried out in the Flink SQL CLI, involving only SQL, without a single line of Java/Scala code, and no need to install an IDE.

Assume that we are running an e-commerce business. The data of goods and orders are stored in MySQL, and the logistics information corresponding to the orders is stored in Postgres. For the order table, in order to facilitate analysis, we hope to associate it with its corresponding product and logistics information to form a wide table, and write it to ElasticSearch in real time.

The following content will introduce how to use Flink Mysql/Postgres CDC to achieve this requirement. The overall architecture of the system is shown in the figure below:

1. Prepare the components required for the tutorial

The components required for this tutorial are prepared with docker-compose, as follows.

Create a docker-compose.yml file with the following content:

version: '2.1'
services:
  postgres:
    image: debezium/example-postgres:1.1
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  elasticsearch:
    image: elastic/elasticsearch:7.6.0
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ports:
      - "9200:9200"
      - "9300:9300"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  kibana:
    image: elastic/kibana:7.6.0
    ports:
      - "5601:5601"

The containers included in this Docker Compose are:

  • MySQL: the products (product) table and the orders (order) table are stored in this database. These two tables will be joined with the shipments table in the Postgres database to produce an order table with richer information, enriched_orders.

  • Postgres: the shipments (logistics) table is stored in this database.

  • Elasticsearch: the final enriched_orders table is written to Elasticsearch.

  • Kibana: used to visualize the data in Elasticsearch.

In the directory containing docker-compose.yml, run the following command to start all the components required for this tutorial:

docker compose up -d

This command starts all containers defined in the Docker Compose configuration in detached mode. You can run docker ps to check whether the containers started normally, or visit http://192.168.122.131:5601 to check whether Kibana is running.
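As an extra sanity check (these commands are my addition, not part of the original tutorial), you can list the compose services and hit Elasticsearch's health endpoint from the command line:

# List the services started from docker-compose.yml and their status
docker compose ps
# Elasticsearch should answer on the mapped port; for a single node, "green" or "yellow" is fine
curl http://192.168.122.131:9200/_cat/health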

2. Download Flink and the required dependencies

Download Flink 1.16.0 and extract it to the flink-1.16.0 directory.

Download the dependency packages listed below and place them in the flink-1.16.0/lib/ directory:

Note: download links are only available for released versions; SNAPSHOT versions need to be built locally.
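The original list of download links is not reproduced here. Based on the connectors used later in this tutorial (mysql-cdc, postgres-cdc, elasticsearch-7), three SQL connector jars are needed; the versions and Maven Central paths below are my assumptions for Flink 1.16 / Flink CDC 2.3 and should be verified against the official tutorial page:

cd flink-1.16.0/lib/
# Assumed versions -- check the official Flink CDC 2.3 tutorial for the authoritative links
wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.3.0/flink-sql-connector-mysql-cdc-2.3.0.jar
wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-postgres-cdc/2.3.0/flink-sql-connector-postgres-cdc-2.3.0.jar
wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-elasticsearch7/1.16.0/flink-sql-connector-elasticsearch7-1.16.0.jar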

Prepare data

Prepare data in MySQL database

Enter the MySQL container

docker compose exec mysql mysql -uroot -p123456

Create the database and the tables products and orders, and insert data:

-- MySQL
CREATE DATABASE mydb;
USE mydb;
CREATE TABLE products (
  id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description VARCHAR(512)
);
ALTER TABLE products AUTO_INCREMENT = 101;

INSERT INTO products
VALUES (default,"scooter","Small 2-wheel scooter"),
       (default,"car battery","12V car battery"),
       (default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3"),
       (default,"hammer","12oz carpenter's hammer"),
       (default,"hammer","14oz carpenter's hammer"),
       (default,"hammer","16oz carpenter's hammer"),
       (default,"rocks","box of assorted rocks"),
       (default,"jacket","water resistent black wind breaker"),
       (default,"spare tire","24 inch spare tire");

CREATE TABLE orders (
  order_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  order_date DATETIME NOT NULL,
  customer_name VARCHAR(255) NOT NULL,
  price DECIMAL(10, 5) NOT NULL,
  product_id INTEGER NOT NULL,
  order_status BOOLEAN NOT NULL -- Whether order has been placed
) AUTO_INCREMENT = 10001;

INSERT INTO orders
VALUES (default, '2020-07-30 10:08:22', 'Jark', 50.50, 102, false),
       (default, '2020-07-30 10:11:09', 'Sally', 15.00, 105, false),
       (default, '2020-07-30 12:00:30', 'Edward', 25.25, 106, false);
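A quick way to confirm the seed data before moving on (my own check, not in the original tutorial):

-- MySQL: confirm the seed data
SELECT COUNT(*) FROM products;  -- expect 9
SELECT COUNT(*) FROM orders;    -- expect 3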

Note: you may hit a MySQL time-zone error (the server's time zone is not what the client or the CDC connector expects).

Adjust the time zone inside the MySQL container:

set time_zone='+8:00';
SET GLOBAL time_zone = '+8:00';
flush privileges;
SELECT @@global.time_zone;
show variables like '%time_zone%';
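Alternatively, the mysql-cdc connector lets you declare the server time zone directly on the source table instead of changing the global setting. This variant is not from the original tutorial; it is a sketch based on the connector's server-time-zone option, shown against the products table that is created later:

-- Flink SQL (hypothetical variant of the products DDL created later in this tutorial)
CREATE TABLE products (
    id INT,
    name STRING,
    description STRING,
    PRIMARY KEY (id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'mydb',
    'table-name' = 'products',
    -- assumed to match the +8:00 zone configured above
    'server-time-zone' = 'Asia/Shanghai'
  );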

Prepare data in Postgres database

Enter the Postgres container

docker compose exec postgres psql -h localhost -U postgres

Create the shipments table and insert data:

-- PG
CREATE TABLE shipments (
  shipment_id SERIAL NOT NULL PRIMARY KEY,
  order_id SERIAL NOT NULL,
  origin VARCHAR(255) NOT NULL,
  destination VARCHAR(255) NOT NULL,
  is_arrived BOOLEAN NOT NULL
);
ALTER SEQUENCE public.shipments_shipment_id_seq RESTART WITH 1001;
ALTER TABLE public.shipments REPLICA IDENTITY FULL;
INSERT INTO shipments
VALUES (default,10001,'Beijing','Shanghai',false),
       (default,10002,'Hangzhou','Shanghai',false),
       (default,10003,'Shanghai','Hangzhou',false);
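The postgres-cdc connector relies on logical decoding; the debezium/example-postgres image ships with it enabled, but you can verify this from the same psql session (a check I added, not part of the original tutorial):

-- PG: logical decoding must be enabled for the postgres-cdc connector
SHOW wal_level;   -- should return 'logical'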

Start the Flink cluster and Flink SQL CLI

Use the following command to jump to the Flink directory

cd flink-1.16.0

Use the following command to start the Flink cluster

./bin/start-cluster.sh
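You can also confirm from the shell that both JVMs came up (my own check; the process names below are those used by a Flink 1.16 standalone cluster):

jps
# Expect a StandaloneSessionClusterEntrypoint (JobManager) and a TaskManagerRunner (TaskManager)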

If startup succeeds, you can access the Flink Web UI at http://192.168.122.131:8081/, as shown below:

Note: if the Web UI cannot be reached from a machine other than the VM, adjust the flink-1.16.0/conf/flink-conf.yaml file.

Change the rest.address value to 0.0.0.0.

Open port 8081 in the firewall (restart the firewall afterwards for the change to take effect):

firewall-cmd --zone=public --add-port=8081/tcp --permanent

Restart the firewall: systemctl restart firewalld

Also: the taskmanager.numberOfTaskSlots parameter defaults to a small value; set it to something larger, such as 50, so that the jobs in this tutorial have enough slots. The relevant settings are sketched below.
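Putting the points above together, the relevant flink-conf.yaml entries look roughly like this (rest.bind-address is an extra setting I add as an assumption; it controls the address the REST endpoint actually binds to and is worth setting if the UI is still unreachable):

# flink-1.16.0/conf/flink-conf.yaml (excerpt)
rest.address: 0.0.0.0
# assumption: also bind the REST endpoint on all interfaces
rest.bind-address: 0.0.0.0
# give the single TaskManager enough slots for the tutorial's jobs
taskmanager.numberOfTaskSlots: 50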

Start Flink SQL CLI using the following command

./bin/sql-client.sh

After successful startup, you can see the following page:

Create tables using Flink DDL in Flink SQL CLI

First, enable checkpointing, taking a checkpoint every 3 seconds.

-- Flink SQL                   
Flink SQL> SET execution.checkpointing.interval = 3s;

Then, use the Flink SQL CLI to create tables corresponding to products, orders, and shipments, to synchronize the data of these underlying database tables:

-- Flink SQL
Flink SQL> CREATE TABLE products (
    id INT,
    name STRING,
    description STRING,
    PRIMARY KEY (id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'mydb',
    'table-name' = 'products'
  );

Flink SQL> CREATE TABLE orders (
   order_id INT,
   order_date TIMESTAMP(0),
   customer_name STRING,
   price DECIMAL(10, 5),
   product_id INT,
   order_status BOOLEAN,
   PRIMARY KEY (order_id) NOT ENFORCED
 ) WITH (
   'connector' = 'mysql-cdc',
   'hostname' = 'localhost',
   'port' = '3306',
   'username' = 'root',
   'password' = '123456',
   'database-name' = 'mydb',
   'table-name' = 'orders'
 );

Flink SQL> CREATE TABLE shipments (
   shipment_id INT,
   order_id INT,
   origin STRING,
   destination STRING,
   is_arrived BOOLEAN,
   PRIMARY KEY (shipment_id) NOT ENFORCED
 ) WITH (
   'connector' = 'postgres-cdc',
   'hostname' = 'localhost',
   'port' = '5432',
   'username' = 'postgres',
   'password' = 'postgres',
   'database-name' = 'postgres',
   'schema-name' = 'public',
   'table-name' = 'shipments'
 );
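Before building the wide table, you can sanity-check one of the CDC sources directly from the SQL CLI; this submits a small job and should display the nine product rows (a verification step I added):

-- Flink SQL
Flink SQL> SELECT * FROM products;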

Finally, create an enriched_orders table, used to write the joined order data into Elasticsearch:

-- Flink SQL
Flink SQL> CREATE TABLE enriched_orders (
   order_id INT,
   order_date TIMESTAMP(0),
   customer_name STRING,
   price DECIMAL(10, 5),
   product_id INT,
   order_status BOOLEAN,
   product_name STRING,
   product_description STRING,
   shipment_id INT,
   origin STRING,
   destination STRING,
   is_arrived BOOLEAN,
   PRIMARY KEY (order_id) NOT ENFORCED
 ) WITH (
     'connector' = 'elasticsearch-7',
     'hosts' = 'http://localhost:9200',
     'index' = 'enriched_orders'
 );

Associate order data and write it to Elasticsearch

Use Flink SQL to join the orders table with the products table and the shipments logistics table, and write the enriched order information into Elasticsearch:

-- Flink SQL
Flink SQL> INSERT INTO enriched_orders
 SELECT o.*, p.name, p.description, s.shipment_id, s.origin, s.destination, s.is_arrived
 FROM orders AS o
 LEFT JOIN products AS p ON o.product_id = p.id
 LEFT JOIN shipments AS s ON o.order_id = s.order_id;

Now, you can see the order data including product and logistics information in Kibana.

First visit http://192.168.122.131:5601/app/kibana#/management/kibana/index_pattern to create index pattern enriched_orders.

Then you can see the written data at http://192.168.122.131:5601/app/kibana#/discover.
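If you prefer the command line, the same data can be queried directly from Elasticsearch (my own check, not part of the original tutorial):

# Query the enriched_orders index directly
curl "http://192.168.122.131:9200/enriched_orders/_search?pretty"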

Next, modify the data in the tables in the MySQL and Postgres databases, and the order data displayed in Kibana will also be updated in real time:

Insert a record into the MySQL orders table:

--MySQL
INSERT INTO orders
VALUES (default, '2020-07-30 15:22:00', 'Jark', 29.71, 104, false);

Insert a record into the Postgres shipments table:

--PG
INSERT INTO shipments
VALUES (default,10004,'Shanghai','Beijing',false);

Update the order status in the MySQL orders table:

--MySQL
UPDATE orders SET order_status = true WHERE order_id = 10004;

Update the shipment status in the Postgres shipments table:

--PG
UPDATE shipments SET is_arrived = true WHERE shipment_id = 1004;

Delete a record from the MySQL orders table:

--MySQL
DELETE FROM orders WHERE order_id = 10004;

Refresh Kibana after each step, and you will see the order data displayed in Kibana update in real time, as shown below:

Environment cleanup

After finishing the tutorial, run the following command in the directory containing docker-compose.yml to stop all containers:

docker compose down

Run the following command in the flink-1.16.0 directory to stop the Flink cluster:

./bin/stop-cluster.sh

Troubleshooting

If the data looks wrong, check the error messages for the running job in the Flink Web UI:

http://192.168.122.131:8081/#/job/running


Source: https://blog.csdn.net/puzi0315/article/details/132689670