[Real-time data warehouse] Introduction to CDC and DWD-layer processing of business data (main tasks, receiving Kafka data, dynamic splitting)

1 Introduction to CDC

1 What is CDC

CDC stands for Change Data Capture. The core idea is to monitor and capture changes in a database (inserts, updates, and deletes of rows or tables, etc.), record these changes in the order in which they occur, and write them to message middleware for other services to subscribe to and consume.

2 Types of CDC

CDC approaches fall into two main categories, query-based and binlog-based. The main differences between the two are:

                                  Query-based CDC            Binlog-based CDC
Open-source products              Sqoop, Kafka JDBC Source   Canal, Maxwell, Debezium
Execution mode                    Batch                      Streaming
Captures all data changes         No                         Yes
Latency                           High                       Low
Adds load to the source database  Yes                        No

3 Flink-CDC

The Flink community has developed the flink-cdc-connectors component, a source connector that can read both full snapshots and incremental changes directly from databases such as MySQL and PostgreSQL. It is also open source; refer to the project URL.

2 Prepare business data - DWD layer

Changes in business data can be captured with Maxwell, but Maxwell writes all of the data into a single topic. That data contains both business (fact) data and dimension data, which is obviously not convenient for later processing. Hence this job: it reads data from the business-data ODS layer in Kafka, processes it, saves the dimension data to HBase, and writes the fact data back to Kafka as the DWD layer of the business data.

1 Main tasks

(1) Receive Kafka data and filter null data

Perform ETL on the data captured by Maxwell, keep the useful parts, and filter out the useless parts.

(2) Implement dynamic splitting

Since Maxwell writes all data into one topic, which is not convenient for later processing, the data needs to be split out and handled per table. However, each table has different characteristics: some tables are dimension tables, some are fact tables, and some act as both fact and dimension tables under certain circumstances.

In real-time computing, dimension data is generally written to a storage system that supports convenient primary-key lookups, such as HBase, Redis, or MySQL. Fact data is generally kept in the stream for further processing and is eventually joined into a wide table. But how does a Flink real-time job know which tables are dimension tables and which are fact tables, and which fields should be collected from each table?

This information can be kept in one place and configured centrally. Such configuration is not well suited to a static configuration file, because every time the business side adds a table as requirements change, the file would have to be modified and the computation program restarted. What is needed here is a dynamic configuration scheme that stores the configuration durably, so that whenever the configuration changes, the real-time computation senses it automatically.

This can be done in two ways:

  • One is to store the configuration in Zookeeper and sense changes through a Watch.
  • The other is to store it in a MySQL database, synchronize it periodically, and read it with FlinkCDC.

The second option is chosen here, mainly because MySQL makes it convenient to initialize and maintain the configuration data with SQL.

This leads to the following design:

[Figure: dynamic splitting driven by a MySQL configuration table]

Configuration table field description:

  • sourceTable: the source table name.
  • sinkType: the output type (hbase or kafka).
  • sinkTable: the table (or topic) to write to.
  • sinkPk: the primary key field.
  • sinkColumns: the fields to keep.
  • sinkExtend: extension of the create-table statement, such as the engine, primary-key auto-increment, encoding, etc.
  • operateType: the operation type; delete operations are not recorded.

(3) Save the split streams to the corresponding tables and topics

Business (fact) data is written to Kafka topics.

Dimension data is saved to HBase tables.

2 Receive Kafka data and filter out null data

Overall workflow:

[Figure: overall workflow]

(1) Code

public class BaseDBApp {

    public static void main(String[] args) throws Exception {

        //TODO 1 Basic environment setup
        // Stream processing environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism
        env.setParallelism(1);

        //TODO 2 Checkpoint settings
        // Enable checkpointing
        env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
        // Set the checkpoint timeout
        env.getCheckpointConfig().setCheckpointTimeout(60000L);
        // Set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 3000L));
        // Keep the checkpoint when the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Set the state backend -- memory, filesystem or RocksDB
        env.setStateBackend(new FsStateBackend("hdfs://hadoop101:8020/ck/gmall"));
        // Specify the user for HDFS access
        System.setProperty("HADOOP_USER_NAME","hzy");

        //TODO 3 Read data from Kafka
        // Declare the topic to consume and the consumer group
        String topic = "ods_base_db_m";
        String groupId = "base_db_app_group";
        // Get the consumer
        FlinkKafkaConsumer<String> kafkaSource = MyKafkaUtil.getKafkaSource(topic, groupId);
        // Read the data into a stream
        DataStreamSource<String> kafkaDS = env.addSource(kafkaSource);

        //TODO 4 Convert the data type String -> JSONObject
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(JSON::parseObject);

        //TODO 5 Simple ETL
        SingleOutputStreamOperator<JSONObject> filterDS = jsonObjDS.filter(
                new FilterFunction<JSONObject>() {
                    @Override
                    public boolean filter(JSONObject jsonobj) throws Exception {
                        boolean flag =
                                jsonobj.getString("table") != null &&
                                        jsonobj.getString("table").length() > 0 &&
                                        jsonobj.getJSONObject("data") != null &&
                                        jsonobj.getString("data").length() > 3;
                        return flag;
                    }
                }
        );
        filterDS.print("<<<");

        //TODO 6 Dynamic splitting

        //TODO 7 Write the dimension side-output stream to HBase

        //TODO 8 Write the main stream back to the Kafka DWD layer

        env.execute();
    }
}
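
To make the filter condition concrete, the snippet below uses two made-up sample records (not real Maxwell output) to show why the length check on the data field (> 3) in TODO 5 above drops records whose data is effectively empty:

// made-up sample records, only to illustrate the filter condition above (fastjson, as in the code)
String kept    = "{\"database\":\"gmall2022\",\"table\":\"order_info\",\"type\":\"insert\",\"data\":{\"id\":1}}";
String dropped = "{\"database\":\"gmall2022\",\"table\":\"order_info\",\"type\":\"insert\",\"data\":{}}";

JSONObject keptObj    = JSON.parseObject(kept);
JSONObject droppedObj = JSON.parseObject(dropped);

// "table" is present and "data" is longer than "{}", so the record passes the filter
System.out.println(keptObj.getString("data").length() > 3);    // true
// "data" is just "{}" (length 2), so the record is filtered out
System.out.println(droppedObj.getString("data").length() > 3); // false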

(2) Test

The overall flow of business data is as follows:

[Figure: overall flow of business data]

Start Zookeeper
Start Kafka
Start Maxwell
Start nm and wait for HDFS safe mode to turn off
Start the main program
Simulate generating business data and check the output of the main program

3 Perform dynamic splitting according to the MySQL configuration table

FlinkCDC dynamically monitors changes to the configuration table, reads them into the program as a stream, and passes them downstream as a broadcast stream; the main stream obtains its configuration information from this broadcast stream, as sketched below.
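
The concrete implementation is built up step by step in what follows, but as a rough illustration of this broadcast pattern, the sketch below shows how the filtered business stream (filterDS from the code above) could be connected to the configuration stream. It assumes a stream tpDS of TableProcess objects (the entity class defined further down) read with FlinkCDC; the descriptor, tag and field names here are illustrative, not the final implementation.

// Minimal sketch of the broadcast pattern, assuming filterDS (business data) and
// tpDS (a TableProcess configuration stream read via FlinkCDC) already exist
MapStateDescriptor<String, TableProcess> mapStateDescriptor =
        new MapStateDescriptor<>("table_process_state", String.class, TableProcess.class);

// broadcast the configuration stream to all parallel instances
BroadcastStream<TableProcess> broadcastDS = tpDS.broadcast(mapStateDescriptor);

// side-output tag for dimension data (to be written to HBase)
OutputTag<JSONObject> dimTag = new OutputTag<JSONObject>("dimTag") {};

SingleOutputStreamOperator<JSONObject> realDS = filterDS
        .connect(broadcastDS)
        .process(new BroadcastProcessFunction<JSONObject, TableProcess, JSONObject>() {
            @Override
            public void processBroadcastElement(TableProcess tp, Context ctx, Collector<JSONObject> out) throws Exception {
                // store/refresh each configuration row in broadcast state, keyed by "sourceTable:operateType"
                ctx.getBroadcastState(mapStateDescriptor)
                        .put(tp.getSourceTable() + ":" + tp.getOperateType(), tp);
            }

            @Override
            public void processElement(JSONObject jsonObj, ReadOnlyContext ctx, Collector<JSONObject> out) throws Exception {
                // look up the configuration for this table and operation type (Maxwell's "type" field)
                String key = jsonObj.getString("table") + ":" + jsonObj.getString("type");
                TableProcess tp = ctx.getBroadcastState(mapStateDescriptor).get(key);
                if (tp != null) {
                    // attach the target table/topic name for the sink (illustrative)
                    jsonObj.put("sink_table", tp.getSinkTable());
                    if (TableProcess.SINK_TYPE_HBASE.equals(tp.getSinkType())) {
                        ctx.output(dimTag, jsonObj);   // dimension data -> side output -> HBase
                    } else if (TableProcess.SINK_TYPE_KAFKA.equals(tp.getSinkType())) {
                        out.collect(jsonObj);          // fact data -> main stream -> Kafka (DWD)
                    }
                }
            }
        });

// dimension stream extracted from the side output
DataStream<JSONObject> dimDS = realDS.getSideOutput(dimTag);

Fact records stay in the main stream realDS and are later written back to Kafka, while dimension records are taken from the side output dimDS and written to HBase.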

(1) Preparation

a Introduce pom.xml dependencies

<!-- lombok dependency -->
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.12</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
<dependency>
    <groupId>com.alibaba.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>1.2.0</version>
</dependency>

b Create a database in MySQL

Create the gmall2022_realtime database; note that it is separate from the gmall2022 business database.

c Create the configuration table table_process in the gmall2022_realtime database

CREATE TABLE `table_process` (
  `source_table` varchar(200) NOT NULL COMMENT '来源表',
  `operate_type` varchar(200) NOT NULL COMMENT '操作类型 insert,update,delete',
   `sink_type` varchar(200) DEFAULT NULL COMMENT '输出类型 hbase kafka',
  `sink_table` varchar(200) DEFAULT NULL COMMENT '输出表(主题)',
  `sink_columns` varchar(2000) DEFAULT NULL COMMENT '输出字段',
  `sink_pk` varchar(200) DEFAULT NULL COMMENT '主键字段',
  `sink_extend` varchar(200) DEFAULT NULL COMMENT '建表扩展',
  PRIMARY KEY (`source_table`,`operate_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

d Create configuration table entity class

@Data
public class TableProcess {

    // Sink type constants for dynamic splitting, lowercase to stay consistent with the scripts
    public static final String SINK_TYPE_HBASE = "hbase";
    public static final String SINK_TYPE_KAFKA = "kafka";
    public static final String SINK_TYPE_CK = "clickhouse";
    // source table
    String sourceTable;
    // operation type: insert, update, delete
    String operateType;
    // output type: hbase or kafka
    String sinkType;
    // output table (or topic)
    String sinkTable;
    // output columns
    String sinkColumns;
    // primary key field
    String sinkPk;
    // create-table extension
    String sinkExtend;
}
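
For illustration only (the table and column names below are hypothetical, not taken from the real configuration data), a single configuration entry represented as a TableProcess object could look like this:

// hypothetical example of one configuration entry (values are for illustration only;
// the setters are generated by Lombok's @Data)
TableProcess tp = new TableProcess();
tp.setSourceTable("base_trademark");            // source table in the business database
tp.setOperateType("insert");                    // operation type this rule applies to
tp.setSinkType(TableProcess.SINK_TYPE_HBASE);   // dimension data -> HBase
tp.setSinkTable("dim_base_trademark");          // target table (or topic for kafka sinks)
tp.setSinkColumns("id,tm_name");                // columns to keep
tp.setSinkPk("id");                             // primary key used when creating the table
tp.setSinkExtend(null);                         // optional extension of the create-table statement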

e Enable binlog for the configuration database in MySQL and restart MySQL

sudo vim /etc/my.cnf
# add
binlog-do-db=gmall2022_realtime
# restart
sudo systemctl restart mysqld

(2) Using FlinkCDC – the DataStream API

Create a new Maven project, gmall2022-cdc.

a Import dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.12.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.12.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>1.12.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.48</version>
    </dependency>

    <dependency>
        <groupId>com.alibaba.ververica</groupId>
        <artifactId>flink-connector-mysql-cdc</artifactId>
        <version>1.2.0</version>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.75</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

b Code

/**
 * Read MySQL table data dynamically with FlinkCDC -- DataStream API
 */
public class FlinkCDC01_DS {

    public static void main(String[] args) throws Exception {

        //TODO 1 Prepare the stream processing environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //TODO 2 Enable checkpointing. Flink-CDC stores the binlog reading position as state in the checkpoint,
        // so to resume from where it left off, the program must be started from a checkpoint or savepoint
        // Enable checkpointing every 5 seconds with exactly-once semantics
        env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
        // Set the checkpoint timeout to 1 minute
        env.getCheckpointConfig().setCheckpointTimeout(60000);
        // Specify the automatic restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, 2000L));
        // Keep the last checkpoint when the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Set the state backend
        env.setStateBackend(new FsStateBackend("hdfs://hadoop101:8020/ck/flinkCDC"));
        // Set the user name for HDFS access
        System.setProperty("HADOOP_USER_NAME", "hzy");

        //TODO 3 Create the Flink-MySQL-CDC source
        Properties props = new Properties();
        props.setProperty("scan.startup.mode","initial");
        SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
                .hostname("hadoop101")
                .port(3306)
                .username("root")
                .password("123456")
                // multiple databases can be configured
                .databaseList("gmall2022_realtime")
                // optional: if not specified, all tables of the databases configured above are read
                // note: tables must be given in the "db.table" form
                .tableList("gmall2022_realtime.t_user")
                .debeziumProperties(props)
                .deserializer(new StringDebeziumDeserializationSchema())
                .build();

        //TODO 4 Read data from MySQL with the CDC source
        DataStreamSource<String> mysqlDS = env.addSource(sourceFunction);

        //TODO 5 Print the output
        mysqlDS.print();

        //TODO 6 Execute the job
        env.execute();
    }
}
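
StringDebeziumDeserializationSchema simply calls toString() on each SourceRecord, which is awkward to parse downstream. As a possible refinement (a simplified sketch, not necessarily the deserializer used later in this series), a custom DebeziumDeserializationSchema can turn each change record into a JSON string:

/**
 * Simplified sketch: convert Debezium SourceRecords into JSON strings.
 * SourceRecord/Struct/Field come from the Kafka Connect API and Envelope from Debezium,
 * which are assumed to be available as transitive dependencies of flink-connector-mysql-cdc.
 */
public class MyDeserializationSchema implements DebeziumDeserializationSchema<String> {

    @Override
    public void deserialize(SourceRecord sourceRecord, Collector<String> out) throws Exception {
        if (sourceRecord.value() == null) {
            // skip tombstone records that carry no payload
            return;
        }
        Struct value = (Struct) sourceRecord.value();
        Struct source = value.getStruct("source");

        JSONObject result = new JSONObject();
        result.put("database", source.getString("db"));
        result.put("table", source.getString("table"));

        // operation type, e.g. create / read / update / delete
        Envelope.Operation op = Envelope.operationFor(sourceRecord);
        result.put("type", op == null ? "unknown" : op.toString().toLowerCase());

        // state of the row after the change ("after" is null for deletes)
        Struct after = value.getStruct("after");
        JSONObject data = new JSONObject();
        if (after != null) {
            for (Field field : after.schema().fields()) {
                data.put(field.name(), after.get(field));
            }
        }
        result.put("data", data);

        out.collect(result.toJSONString());
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}

Passing new MyDeserializationSchema() to .deserializer(...) in the builder above would then yield JSON strings that downstream operators can parse directly.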

c Test

Add a table (here t_user) to gmall2022_realtime, run the program, and insert some data; output like the following appears:

[Figure: console output of the captured change records]

d Breakpoint-resume (savepoint) test

# Package the jar with dependencies and upload it to Linux
# Start the HDFS cluster
start-dfs.sh
# Start the Flink cluster
bin/start-cluster.sh
# Start the program
bin/flink run -m hadoop101:8081 -c com.hzy.gmall.cdc.FlinkCDC01_DS ./gmall2022-cdc-1.0-SNAPSHOT-jar-with-dependencies.jar
# Watch the TaskManager log; the table data is read from the beginning
# Create a savepoint for the current Flink job
bin/flink savepoint JobId hdfs://hadoop101:8020/flink/save
# Cancel the job in the Web UI
# Add, modify or delete data in the MySQL table gmall2022_realtime.t_user
# Restart the program from the savepoint
bin/flink run -s hdfs://hadoop101:8020/flink/save/JobId -c com.hzy.gmall.cdc.FlinkCDC01_DS ./gmall2022-cdc-1.0-SNAPSHOT-jar-with-dependencies.jar
# Watch the TaskManager log; the table data is read from the saved position

(3) Using FlinkCDC – Flink SQL

Read data from MySQL with FlinkCDC through SQL.

a Import dependencies

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.12</artifactId>
    <version>1.12.0</version>
</dependency>

b Basic project configuration

Set the project language level and the compilation level in the IDE.

c Code

public class FlinkCDC02_SQL {

    public static void main(String[] args) throws Exception {

        //TODO 1. Prepare the environment
        //1.1 Stream processing environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        //1.2 Table environment
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        //TODO 2. Create a dynamic table
        tableEnv.executeSql("CREATE TABLE user_info (" +
                "  id INT," +
                "  name STRING," +
                "  age INT" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'hadoop101'," +
                "  'port' = '3306'," +
                "  'username' = 'root'," +
                "  'password' = '123456'," +
                "  'database-name' = 'gmall2022_realtime'," +
                "  'table-name' = 't_user'" +
                ")");

        tableEnv.executeSql("select * from user_info").print();

        //TODO 3. Execute the job
        env.execute();
    }
}
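
If the query result needs further processing with the DataStream API, the dynamic table can also be converted into a retract stream. A small optional sketch using Flink's standard table-to-stream conversion:

// optional: convert the dynamic table into a retract stream for DataStream processing
Table table = tableEnv.sqlQuery("select * from user_info");
DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(table, Row.class);
retractStream.print();
// env.execute() then launches the job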

Source: blog.csdn.net/weixin_43923463/article/details/128226677