Hands-on Big Data: Using Flink CDC to Synchronize MySQL Data to Elasticsearch

Preface

In a previous post we introduced Flink, the distributed stream-processing framework for big data, and walked through setting up its basic environment. Assuming your environment is up and running, today we will practice using Flink CDC to synchronize MySQL data to Elasticsearch.

Background knowledge

Introduction to CDC

CDC stands for Change Data Capture. In the broadest sense, any technique that can capture data changes qualifies as CDC. As the term is commonly used today, however, it refers mainly to capturing changes made to a database.

Types of CDC

CDC can be implemented in many ways. The mainstream mechanisms in the industry fall into two categories:
Query-based CDC:
◆ Query jobs are scheduled offline and run in batches: to synchronize a table to another system, the latest data is fetched by querying the table each time.
◆ Data consistency cannot be guaranteed; the data may have changed several times while the query runs.
◆ Real-time delivery cannot be guaranteed; offline scheduling has an inherent delay.
Log-based CDC (see the sketch below):
◆ Logs are consumed in real time as a stream. For example, MySQL's binlog records every change made to the database, so the binlog file can serve as the data source of the stream.
◆ Data consistency is guaranteed, because the binlog contains the complete history of changes.
◆ Real-time delivery is guaranteed, because log files such as the binlog can be consumed in a streaming fashion.
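
For MySQL in particular, log-based CDC requires the binlog to be enabled in ROW format. A quick sanity check on the source database might look like this (a sketch; these are standard MySQL system variables):

-- Verify binlog prerequisites for log-based CDC
SHOW VARIABLES LIKE 'log_bin';        -- expect ON
SHOW VARIABLES LIKE 'binlog_format';  -- expect ROW
SHOW VARIABLES LIKE 'server_id';      -- must be non-zero and unique per server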

Comparison of common CDC plans


Connecting Spring Boot to Flink CDC

Flink officially provides Java, Scala, and Python APIs for building Flink applications, so we can simply pull in the Flink dependencies with Maven and implement the functionality directly.

Environmental preparation

1. Spring Boot 2.4.3
2. Flink 1.13.6
3. Scala 2.11
4. Maven 3.6.3
5. Java 8
6. MySQL 8
7. Elasticsearch 7
The Spring Boot, Flink, and Scala versions must be compatible with each other; if in doubt, follow the versions in this post exactly.
Note:
If you are just experimenting on your local machine, the Maven dependencies embed the execution environment, so no separate Flink installation is needed. To deploy to a Flink cluster, you must set one up separately. Also, the Scala version only matters for choosing dependency artifacts; no Scala environment is required.

Project construction

1. Introduce Flink CDC Maven dependency

pom.xml

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.4.3</version>
    <relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>flink-demo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>flink-demo</name>
<description>Demo project for Spring Boot</description>
<properties>
    <java.version>8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <flink.version>1.13.6</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.23</version>
    </dependency>
    <!-- Flink CDC connector for MySQL -->
    <dependency>
        <groupId>com.ververica</groupId>
        <artifactId>flink-connector-mysql-cdc</artifactId>
        <version>2.1.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-shaded-guava</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!--
    Flink connector for Elasticsearch 7 (used as the sink)
    https://mvnrepository.com/artifact/org.apache.flink/flink-connector-elasticsearch7_2.11
    -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch7_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-json -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-api-java-bridge_2.11 -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner_2.11 -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients_2.11 -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java_2.11 -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

2. Create the test database table users

users table structure

CREATE TABLE `users` (
  `id` bigint NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `name` varchar(50) NOT NULL COMMENT 'Name',
  `birthday` timestamp NULL DEFAULT NULL COMMENT 'Birthday',
  `ts` timestamp NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Created at',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Users';
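
Because the mysql-cdc connector reads the binlog, the connecting user needs replication privileges on top of SELECT. A hedged sketch, assuming a dedicated user named cdc_user (the examples in this post connect as root, which already has these rights):

-- Privileges typically required by the mysql-cdc connector
CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'cdc_password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
FLUSH PRIVILEGES;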

3. ES index operations

ES management commands
The ES index is created automatically by the sink connector; the commands below are for managing it manually.

#Set ES shards and replicas
curl -X PUT "10.10.22.174:9200/users" -u elastic:VaHcSC3mOFfovLWTqW6E   -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "number_of_shards" : 3,
        "number_of_replicas" : 2
    }
}'

#Query all documents in the index
curl -X GET "http://10.10.22.174:9200/users/_search"  -u elastic:VaHcSC3mOFfovLWTqW6E -H 'Content-Type: application/json' 

#Delete the index
curl -X DELETE "10.10.22.174:9200/users" -u elastic:VaHcSC3mOFfovLWTqW6E
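
Once the connector has written data, the mapping it generated can be checked with the standard _mapping API (a sketch, assuming the same host and credentials as above):

#Inspect the index mapping
curl -X GET "10.10.22.174:9200/users/_mapping" -u elastic:VaHcSC3mOFfovLWTqW6E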

Run locally

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class FlinkDemoApplicationTests {

    /**
     * flinkCDC
     * mysql to es
     * @author senfel
     * @date 2023/8/22 14:37 
     * @return void
     */
    @Test
    void flinkCDC() throws Exception{
        EnvironmentSettings fsSettings = EnvironmentSettings.newInstance()
                //.useBlinkPlanner()
                .inStreamingMode()
                .build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env,fsSettings);
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        // Source table
        String sourceDDL =
                "CREATE TABLE users (\n" +
                        "  id BIGINT PRIMARY KEY NOT ENFORCED ,\n" +
                        "  name STRING,\n" +
                        "  birthday TIMESTAMP(3),\n" +
                        "  ts TIMESTAMP(3)\n" +
                        ") WITH (\n" +
                        "      'connector' = 'mysql-cdc',\n" +
                        "      'hostname' = '10.10.10.202',\n" +
                        "      'port' = '6456',\n" +
                        "      'username' = 'root',\n" +
                        "      'password' = 'MyNewPass2021',\n" +
                        "      'server-time-zone' = 'Asia/Shanghai',\n" +
                        "      'database-name' = 'cdc',\n" +
                        "      'table-name' = 'users'\n" +
                        "      )";
        // Sink table
        String sinkDDL =
                "CREATE TABLE users_sink_es\n" +
                        "(\n" +
                        "    id BIGINT PRIMARY KEY NOT ENFORCED,\n" +
                        "    name STRING,\n" +
                        "    birthday TIMESTAMP(3),\n" +
                        "    ts TIMESTAMP(3)\n" +
                        ") \n" +
                        "WITH (\n" +
                        "  'connector' = 'elasticsearch-7',\n" +
                        "  'hosts' = 'http://10.10.22.174:9200',\n" +
                        "  'index' = 'users',\n" +
                        "  'username' = 'elastic',\n" +
                        "  'password' = 'VaHcSC3mOFfovLWTqW6E'\n" +
                        ")";
        // Simple transformation: copy all rows from source to sink
        String transformSQL = "INSERT INTO users_sink_es SELECT * FROM users";

        tableEnv.executeSql(sourceDDL);
        tableEnv.executeSql(sinkDDL);
        TableResult result = tableEnv.executeSql(transformSQL);
        result.print();
        env.execute("mysql-to-es");
    }
}

Querying the ES users index shows no data yet:

[root@bluejingyu-1 ~]# curl -X GET "http://10.10.22.174:9200/users/_search" -u elastic:VaHcSC3mOFfovLWTqW6E -H 'Content-Type: application/json'
{"took":0,"timed_out":false,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}

Insert a couple of rows into the MySQL users table:

5 senfel 2023-08-30 15:02:28 2023-08-30 15:02:36
6 sebfel2 2023-08-30 15:02:43 2023-08-30 15:02:47
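
For reference, statements along these lines produce the two rows above (a sketch; ts is filled in by its CURRENT_TIMESTAMP default and the ids are assigned by AUTO_INCREMENT):

INSERT INTO users (name, birthday) VALUES ('senfel', '2023-08-30 15:02:28');
INSERT INTO users (name, birthday) VALUES ('sebfel2', '2023-08-30 15:02:43');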

Query the ES users index again:

[root@bluejingyu-1 ~]# curl -X GET "http://10.10.22.174:9200/users/_search" -u elastic:VaHcSC3mOFfovLWTqW6E -H 'Content-Type: application/json'
{"took":67,"timed_out":false,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":2,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"users","_type":"_doc","_id":"5","_score":1.0,"_source":{"id":5,"name":"senfel","birthday":"2023-08-30 15:02:28","ts":"2023-08-30 15:02:36"}},{"_index":"users","_type":"_doc","_id":"6","_score":1.0,"_source":{"id":6,"name":"sebfel2","birthday":"2023-08-30 15:02:43","ts":"2023-08-30 15:02:47"}}]}}

The results above confirm that the local run works as expected.

Cluster operation


1. Create a cluster to run code logic

package com.example.flinkdemo;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

/**
 * FlinkMysqlToEs
 * @author senfel
 * @version 1.0
 * @date 2023/8/22 14:56
 */
public class FlinkMysqlToEs {

    public static void main(String[] args) throws Exception {
        EnvironmentSettings fsSettings = EnvironmentSettings.newInstance()
                //.useBlinkPlanner()
                .inStreamingMode()
                .build();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env,fsSettings);
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        // Source table
        String sourceDDL =
                "CREATE TABLE users (\n" +
                        "  id BIGINT PRIMARY KEY NOT ENFORCED ,\n" +
                        "  name STRING,\n" +
                        "  birthday TIMESTAMP(3),\n" +
                        "  ts TIMESTAMP(3)\n" +
                        ") WITH (\n" +
                        "      'connector' = 'mysql-cdc',\n" +
                        "      'hostname' = '10.10.10.202',\n" +
                        "      'port' = '6456',\n" +
                        "      'username' = 'root',\n" +
                        "      'password' = 'MyNewPass2021',\n" +
                        "      'server-time-zone' = 'Asia/Shanghai',\n" +
                        "      'database-name' = 'cdc',\n" +
                        "      'table-name' = 'users'\n" +
                        "      )";
        // Sink table
        String sinkDDL =
                "CREATE TABLE users_sink_es\n" +
                        "(\n" +
                        "    id BIGINT PRIMARY KEY NOT ENFORCED,\n" +
                        "    name STRING,\n" +
                        "    birthday TIMESTAMP(3),\n" +
                        "    ts TIMESTAMP(3)\n" +
                        ") \n" +
                        "WITH (\n" +
                        "  'connector' = 'elasticsearch-7',\n" +
                        "  'hosts' = 'http://10.10.22.174:9200',\n" +
                        "  'index' = 'users',\n" +
                        "  'username' = 'elastic',\n" +
                        "  'password' = 'VaHcSC3mOFfovLWTqW6E'\n" +
                        ")";
        // Simple transformation: copy all rows from source to sink
        String transformSQL = "INSERT INTO users_sink_es SELECT * FROM users";

        tableEnv.executeSql(sourceDDL);
        tableEnv.executeSql(sinkDDL);
        TableResult result = tableEnv.executeSql(transformSQL);
        result.print();
        env.execute("mysql-to-es");
    }
}

2. Running on a cluster requires packaging the Flink program. Unlike an ordinary jar, an uber jar built with the Maven Shade plugin is required here.

<build>
    <finalName>flink-demo</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <createDependencyReducedPom>false</createDependencyReducedPom>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>module-info.class</exclude>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                <resource>META-INF/spring.handlers</resource>
                                <resource>reference.conf</resource>
                            </transformer>
                            <transformer
                                    implementation="org.springframework.boot.maven.PropertiesMergingResourceTransformer">
                                <resource>META-INF/spring.factories</resource>
                            </transformer>
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                <resource>META-INF/spring.schemas</resource>
                            </transformer>
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.example.flinkdemo.FlinkMysqlToEs</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Package the project and submit the jar to the cluster

1. Package the project:
mvn package -Dmaven.test.skip=true

2. Upload the jar to the server and run it from inside the cluster:
/opt/flink/bin# ./flink run ../flink-demo.jar
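
flink run resolves the entry class from the shaded jar's manifest; it can also be named explicitly, and -d submits in detached mode (a sketch of common options):

# Submit detached, naming the entry class explicitly
./flink run -d -c com.example.flinkdemo.FlinkMysqlToEs ../flink-demo.jar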

3. Test by modifying the MySQL database

Delete the user with id = 6, leaving only the user with id = 5 (the statement is sketched below):

5 senfel000 2023-08-30 15:02:28 2023-08-30 15:02:36
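
The change is a plain delete; the mysql-cdc connector turns the resulting binlog event into a retraction, which the Elasticsearch sink applies as a document deletion keyed by the primary key:

DELETE FROM users WHERE id = 6;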

4. Query the ES users index

[root@bluejingyu-1 ~]# curl -X GET "http://10.10.22.174:9200/users/_search" -u elastic:VaHcSC3mOFfovLWTqW6E -H 'Content-Type: application/json'
{"took":931,"timed_out":false,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"users","_type":"_doc","_id":"5","_score":1.0,"_source":{"id":5,"name":"senfel","birthday":"2023-08-30 15:02:28","ts":"2023-08-30 15:02:36"}}]}}

As shown above, only the row with id = 5 remains in ES;
this confirms that manual deployment to the cluster environment works.

Remotely deploying the jar to the Flink cluster

1. Add a controller endpoint to trigger the job

/**
 * remote runTask
 * @author senfel
 * @date 2023/8/30 16:57 
 * @return org.apache.flink.api.common.JobID
 */
@GetMapping("/runTask")
public JobID runTask() {
    try {
        // Cluster connection info
        Configuration configuration = new Configuration();
        configuration.setString(JobManagerOptions.ADDRESS, "10.10.22.91");
        configuration.setInteger(JobManagerOptions.PORT, 6123);
        configuration.setInteger(RestOptions.PORT, 8081);
        RestClusterClient<StandaloneClusterId>  client = new RestClusterClient<>(configuration, StandaloneClusterId.getInstance());
        //Path where the jar is stored; a jar on HDFS can also be referenced directly
        File jarFile = new File("input/flink-demo.jar");
        SavepointRestoreSettings savepointRestoreSettings = SavepointRestoreSettings.none();
        //Build the program to be submitted
        PackagedProgram program = PackagedProgram
                .newBuilder()
                .setConfiguration(configuration)
                .setEntryPointClassName("com.example.flinkdemo.FlinkMysqlToEs")
                .setJarFile(jarFile)
                .setSavepointRestoreSettings(savepointRestoreSettings).build();
        //Create the job graph
        JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, configuration, 1, false);
        //Submit the job
        CompletableFuture<JobID> result = client.submitJob(jobGraph);
        return result.get();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
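
Since submitJob returns the JobID, the same RestClusterClient can be used to poll the job afterwards. A minimal sketch of a hypothetical companion endpoint (/jobStatus is not part of the original project; it only reuses the cluster configuration above):

/**
 * Query the status of a previously submitted job.
 * Hypothetical helper endpoint for illustration.
 */
@GetMapping("/jobStatus")
public String jobStatus(@RequestParam String jobId) {
    try {
        Configuration configuration = new Configuration();
        configuration.setString(JobManagerOptions.ADDRESS, "10.10.22.91");
        configuration.setInteger(JobManagerOptions.PORT, 6123);
        configuration.setInteger(RestOptions.PORT, 8081);
        RestClusterClient<StandaloneClusterId> client =
                new RestClusterClient<>(configuration, StandaloneClusterId.getInstance());
        // getJobStatus returns a CompletableFuture<JobStatus>; block for the result here
        return client.getJobStatus(JobID.fromHexString(jobId)).get().toString();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}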

2. Start the Spring Boot project

3. Send a request with Postman
4. Check the Flink cluster web console

The job now shows up in the console, so the remote deployment is complete.

5. Modify the MySQL database again for testing

5 senfel000 2023-08-30 15:02:28 2023-08-30 15:02:36
7 eeeee 2023-08-30 17:12:00 2023-08-30 17:12:04
8 33333 2023-08-30 17:12:08 2023-08-30 17:12:11

6. Query the ES users index

[root@bluejingyu-1 ~]# curl -X GET "http://10.10.22.174:9200/users/_search" -u elastic:VaHcSC3mOFfovLWTqW6E -H 'Content-Type: application/json'
{"took":766,"timed_out":false,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":3,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"users","_type":"_doc","_id":"5","_score":1.0,"_source":{"id":5,"name":"senfel000","birthday":"2023-08-30 15:02:28","ts":"2023-08-30 15:02:36"}},{"_index":"users","_type":"_doc","_id":"7","_score":1.0,"_source":{"id":7,"name":"eeeee","birthday":"2023-08-30 17:12:00","ts":"2023-08-30 17:12:04"}},{"_index":"users","_type":"_doc","_id":"8","_score":1.0,"_source":{"id":8,"name":"33333","birthday":"2023-08-30 17:12:08","ts":"2023-08-30 17:12:11"}}]}}

As above, the two new rows now appear in ES;
this confirms that remote submission of the Flink task works.

Closing thoughts

Using Flink CDC to synchronize MySQL data to Elasticsearch is fairly simple to set up and test. For a basic learning and testing setup, a standalone cluster only supports running a single job at a time; if you need multiple jobs or a production deployment, consider running Flink on YARN in session or per-job mode.

Source: blog.csdn.net/weixin_39970883/article/details/132707967