The perfect partner for real-time data integration: an integrated CDC and Kafka solution

Table of contents

1. Real-time data synchronization

2. Reliable data transmission

3. Flexible data processing

4. Decoupled data systems

5. Introduction to mainstream free CDC tools

6. Flink CDC installation and usage steps

7. ETLCloud CDC installation and usage steps

8. Closing thoughts


1. Real-time data synchronization

As enterprise data volumes continue to grow, efficiently capturing, synchronizing, and processing data has become key to business development. In this digital era, the integration of CDC technology and Kafka provides enterprises with a seamless data management solution, opening a new door for data flow and real-time processing.

The integration of CDC technology and Kafka can achieve fast and reliable real-time data synchronization. CDC technology captures data changes in database transaction logs and converts them into reliable data streams. These data streams are transmitted through Kafka's high-throughput message queue, ensuring real-time and consistent data. Whether it is synchronization from source database to target database, or data transfer across different data storage systems, CDC technology integrated with Kafka provides an efficient and seamless solution.

2. Reliable data transmission

As a distributed and scalable message queue system, Kafka provides a highly reliable data transmission mechanism. With Kafka’s persistent storage and data replication mechanism, data will not be lost or corrupted. Kafka integration ensures data integrity and reliability even under high concurrency conditions. This provides enterprises with a strong data transmission foundation and ensures the safe transmission of data in all aspects.
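As a rough illustration of the producer-side settings that complement Kafka's broker-side replication, here is a minimal Java sketch of a producer configured for durable delivery. The broker address, topic name, and message payload are placeholders for illustration only, not values from this article's setup.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ReliableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "ip:9092");   // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                    // wait for all in-sync replicas to acknowledge
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);       // avoid duplicates when retrying
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);     // keep retrying transient send failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test", "key", "cdc-change-event"));  // placeholder topic and payload
            producer.flush();
        }
    }
}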

3. Flexible data processing

The integration of CDC technology and Kafka not only provides real-time data synchronization, but also provides enterprises with flexible data processing capabilities. Kafka's message queue and stream processing features allow enterprises to perform real-time data processing and analysis while transmitting data. With Kafka's consumer applications, enterprises can transform, filter, and aggregate data streams to achieve real-time data cleaning, processing, and analysis. This real-time data processing capability provides businesses with instant insights, helping them make fast and accurate decisions.
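To make this concrete, the following minimal Java consumer sketch reads change events from Kafka and keeps only inserts, the kind of real-time filtering described above. It assumes the events are JSON messages with an "op" field, like the format produced by the custom deserializer later in this article; the broker address, topic, and consumer group are placeholders.

import com.alibaba.fastjson.JSONObject;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CdcFilterConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ip:9092");   // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cdc-demo");           // placeholder consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));       // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    JSONObject event = JSONObject.parseObject(record.value());
                    // Keep only insert events ("op" = "c"); updates and deletes are dropped
                    if ("c".equals(event.getString("op"))) {
                        System.out.println("New row: " + event.getJSONObject("after"));
                    }
                }
            }
        }
    }
}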

4. Decoupled data systems

Integrating CDC technology with Kafka can also help enterprises decouple data systems. By using CDC technology and Kafka as the middle layer, different data sources and target systems can operate independently, eliminating tightly coupled dependencies between each other. This decoupling brings great flexibility, making it easier for enterprises to add, remove or upgrade data sources and target systems without having to restructure the entire data process.

The integration of CDC technology and Kafka brings a new experience in data management to enterprises. It provides efficient, reliable data synchronization and real-time processing to help enterprises achieve data-driven success. Whether it is data synchronization, real-time processing or data system decoupling, the integration of CDC technology and Kafka provides enterprises with a powerful and flexible solution.

5. Introduction to mainstream free CDC tools

This section introduces two mainstream tools that can integrate CDC technology with Kafka quickly and for free: Flink CDC and ETLCloud CDC.

Environment preparation before testing: JDK 8 or above, a MySQL database (with binlog enabled), and Kafka.
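If binlog is not yet enabled, a minimal my.cnf sketch looks roughly like the following (MySQL 5.7-style settings; the values are examples, and on MySQL 8.0 binary logging is already on by default). Restart MySQL after changing it and confirm with SHOW VARIABLES LIKE 'log_bin';.

[mysqld]
server-id        = 1            # example id; must be unique in the replication topology
log-bin          = mysql-bin    # enable binary logging
binlog_format    = ROW          # CDC tools need row-level change events
expire_logs_days = 7            # optional: how many days of binlog files to keep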

6. Flink CDC installation and usage steps

Download the installation package

Go to the official Flink website and download the 1.13.3 release package flink-1.13.3-bin-scala_2.11.tgz. (Flink 1.13.3 supports the Flink CDC 2.x versions.)

Unzip

Create the installation directory /home/flink on the server, place the Flink installation package in this directory, and decompress it into the current directory: tar -zxvf flink-1.13.3-bin-scala_2.11.tgz

Start up

Enter the decompressed flink/lib directory and upload the MySQL driver and SQL connector driver packages.

Enter the flink/bin directory and execute the startup command: ./start-cluster.sh

Open the Flink web UI at http://ip:8081 to check that the cluster is running.

Test

Next, let's create a new Maven project to test CDC change monitoring and synchronization.

POM dependencies

<!--    Flink CDC  -->
<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.12</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.49</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.12</artifactId>
    <version>1.12.0</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.12.0</version>
</dependency>

Create the Flink_CDC2Kafka class

import com.ververica.cdc.connectors.mysql.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.DebeziumSourceFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class Flink_CDC2Kafka {
    public static void main(String[] args) throws Exception {
        //1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        //1.1 Set the checkpoint & state backend
        // Omitted

        //2. Build the SourceFunction through Flink CDC and read the data
        DebeziumSourceFunction<String> sourceFunction = MySqlSource.<String>builder()
                .hostname("ip")                              //Database IP
                .port(3306)                                  //Database port
                .username("admin")                           //Database username
                .password("pass")                            //Database password
                .databaseList("test")                        //Several databases can be listed here for multi-database synchronization
                .tableList("test.admin")                     //Several tables can be listed here for multi-table synchronization
                .deserializer(new CustomerDeserialization()) //Custom deserialization format, defined below
                //.deserializer(new StringDebeziumDeserializationSchema()) //The default deserialization format
                .startupOptions(StartupOptions.latest())
                .build();

        DataStreamSource<String> streamSource = env.addSource(sourceFunction);

        //3. Print the data and write it to Kafka
        streamSource.print();
        String sinkTopic = "test";
        streamSource.addSink(getKafkaProducer("ip:9092", sinkTopic));

        //4. Start the job
        env.execute("FlinkCDC");
    }

    //Kafka producer
    public static FlinkKafkaProducer<String> getKafkaProducer(String brokers, String topic) {
        return new FlinkKafkaProducer<String>(brokers, topic,
                new SimpleStringSchema());
    }
}

Custom deserialization class

import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import io.debezium.data.Envelope;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;
import java.util.ArrayList;
import java.util.List;
public class CustomerDeserialization implements DebeziumDeserializationSchema<String> {
    @Override
    public void deserialize(SourceRecord sourceRecord, Collector<String> collector) throws Exception {
        //1. Create a JSON object to store the final data
        JSONObject result = new JSONObject();

        //2. Get the database name & table name and put them into "source"
        String topic = sourceRecord.topic();
        String[] fields = topic.split("\\.");
        String database = fields[1];
        String tableName = fields[2];
        JSONObject source = new JSONObject();
        source.put("database", database);
        source.put("table", tableName);

        Struct value = (Struct) sourceRecord.value();

        //3. Get the "before" data
        Struct before = value.getStruct("before");
        JSONObject beforeJson = new JSONObject();
        if (before != null) {
            Schema beforeSchema = before.schema();
            List<Field> beforeFields = beforeSchema.fields();
            for (Field field : beforeFields) {
                Object beforeValue = before.get(field);
                beforeJson.put(field.name(), beforeValue);
            }
        }

        //4. Get the "after" data
        Struct after = value.getStruct("after");
        JSONObject afterJson = new JSONObject();
        if (after != null) {
            Schema afterSchema = after.schema();
            List<Field> afterFields = afterSchema.fields();
            for (Field field : afterFields) {
                Object afterValue = after.get(field);
                afterJson.put(field.name(), afterValue);
            }
        }

        //5. Get the operation type (CREATE/UPDATE/DELETE) and map it to Debezium-style op letters
        Envelope.Operation operation = Envelope.operationFor(sourceRecord);
        String type = operation.toString().toLowerCase();
        if ("insert".equals(type)) {
            type = "c";
        }
        if ("update".equals(type)) {
            type = "u";
        }
        if ("delete".equals(type)) {
            type = "d";
        }
        if ("create".equals(type)) {
            type = "c";
        }

        //6. Write the fields into the JSON object
        result.put("source", source);
        result.put("before", beforeJson);
        result.put("after", afterJson);
        result.put("op", type);

        //7. Output the data
        collector.collect(result.toJSONString());
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

Turn on CDC monitoring

Add a new personnel record in MySQL

The console captures the incremental data

The incremental data is also successfully pushed to Kafka
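For reference, with the CustomerDeserialization class above, an insert into test.admin is written to Kafka roughly in the following shape (the id and name columns are hypothetical examples):

{"source":{"database":"test","table":"admin"},"before":{},"after":{"id":1,"name":"new user"},"op":"c"}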

At this point, the process of using Flink CDC to monitor the database and push incremental data to Kafka is complete. As you can see, the whole process requires some coding ability, which is quite painful for business users.

Next, we will show how the ETLCloud product can quickly implement the same scenario through visual configuration.

7. ETLCloud CDC installation and usage steps

Download the installation package

ETLCloud provides a one-click quick deployment package; you only need to run the startup script to complete installation and deployment of the product. The deployment package can be downloaded from the official ETLCloud website.

Install

Download the Linux one-click deployment package from the official website, place it in a directory, unzip it, and enter that directory.

Make the script file executable

chmod +x restcloud_install.sh

Execute script

./restcloud_install.sh

Wait for Tomcat to start. When this interface appears, it means that RestCloud has started successfully.

Data source configuration

Add the MySQL data source information

Add the Kafka data source information

Test the data sources

 Listener configuration

Add a new database listener

 Listener configuration

Receiver configuration (select Kafka as the data transmission type)

 Advanced configuration (default parameters)

 Start listening

 Monitoring successful

Test

Open the Navicat visualization tool, then add and modify a personnel record

The changed data is dynamically captured in the real-time transmission view

 View new data in Kafka

 View modified data in Kafka

8. Closing thoughts

Above, we implemented real-time data synchronization to Kafka with two different CDC tools. Comparing Flink CDC and ETLCloud CDC, we can see that ETLCloud CDC provides a visual configuration approach that makes the process simpler and faster and requires no coding ability, while Flink CDC requires coding, which carries a certain learning cost for business users.

No matter which tool you choose, you can integrate CDC technology with Kafka to capture incremental data changes in the database in real time, providing a convenient and efficient data synchronization and transmission method.
