6.2. Writing Flink data to Kafka

Table of contents

1. Add POM dependency

2. API usage instructions

3. Serializer

3.1 Using predefined serializers

3.2 Using a custom serializer

4. Fault tolerance guarantee level

4.1 At least once configuration

4.2 Exactly once configuration

5. A complete introductory example


1. Add POM dependency

Apache Flink provides a universal Kafka connector. To use it, add the dependency that matches the Flink version running in your production environment.

<!-- Kafka connector dependency -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>1.17.1</version>
</dependency>

2. API usage instructions

KafkaSink can write data streams to one or more Kafka topics.

Official documentation: see the Kafka connector page in the Apache Flink DataStream Connectors documentation.

DataStream<String> stream = ...;
        
KafkaSink<String> sink = KafkaSink.<String>builder()  // the generic type is the type of the input elements
        // Required: configure the Kafka broker addresses and ports
        .setBootstrapServers(brokers)
        // Required: configure the record serializer (topic name and value serialization schema)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
            .setTopic("topic-name")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build()
        )
        // Required: configure the delivery guarantee (exactly once, at least once, or none)
        .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build();
        
stream.sinkTo(sink);

3. Serializer

The serializer converts Flink records into Kafka ProducerRecord objects.

3.1 Using predefined serializers

Purpose: the DataStream element is written as the value of the Kafka message; the key is left at its default (null) and the timestamp is left at its default.

// Build a KafkaSink instance
KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
        // Required: configure the Kafka broker addresses and ports
        .setBootstrapServers("worker01:9092")
        // Required: configure the record serializer (topic name and value serialization schema)
        .setRecordSerializer(
                KafkaRecordSerializationSchema.<String>builder()
                        .setTopic("20230912")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build()
        )
        .build();

3.2 Using a custom serializer

Purpose: a custom serializer lets you set the key, value, partition, and timestamp of the Kafka message.

/**
 * To control the key written to Kafka, define a custom serializer:
 * 		1. Implement the KafkaRecordSerializationSchema interface and override the serialize method
 * 		2. Build the key as a byte array
 * 		3. Build the value as a byte array
 * 		4. Return a ProducerRecord carrying the key and value
 */
// Build a KafkaSink instance (with a custom KafkaRecordSerializationSchema)
KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
        // Required: configure the Kafka broker addresses and ports
        .setBootstrapServers("worker01:9092")
        // Required: configure the record serializer (topic name and value serialization schema)
        .setRecordSerializer(
                new KafkaRecordSerializationSchema<String>() {

                    @Nullable
                    @Override
                    public ProducerRecord<byte[], byte[]> serialize(String element, KafkaSinkContext context, Long timestamp) {
                        // use the first comma-separated field as the message key
                        String[] datas = element.split(",");
                        byte[] key = datas[0].getBytes(StandardCharsets.UTF_8);
                        // use the whole element as the message value
                        byte[] value = element.getBytes(StandardCharsets.UTF_8);
                        Long currTimestamp = System.currentTimeMillis();  // explicit record timestamp
                        Integer partition = 0;                            // write every record to partition 0
                        return new ProducerRecord<>("20230913", partition, currTimestamp, key, value);
                    }
                }
        )
        .build();
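
For simple key requirements there is a lighter-weight option: the serializer builder also exposes a key serialization schema setter, so the key can be derived without implementing the whole interface. A minimal sketch, assuming (as in the custom serializer above) that the key is the first comma-separated field of the element and the topic is "20230913":

import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import java.nio.charset.StandardCharsets;

// Builder-based serializer with both key and value schemas (assumed topic name "20230913")
KafkaRecordSerializationSchema<String> keyedSchema =
        KafkaRecordSerializationSchema.<String>builder()
                .setTopic("20230913")
                // key = first comma-separated field of the element
                .setKeySerializationSchema(
                        (SerializationSchema<String>) element ->
                                element.split(",")[0].getBytes(StandardCharsets.UTF_8))
                .setValueSerializationSchema(new SimpleStringSchema())
                .build();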

4. Fault tolerance guarantee level

KafkaSink supports three different delivery guarantees (DeliveryGuarantee):

  • DeliveryGuarantee.NONE: no guarantee is provided
    • Messages may be lost because of Kafka broker issues or duplicated because of Flink failures.
  • DeliveryGuarantee.AT_LEAST_ONCE: at least once
    • During a checkpoint, the sink waits until all data in the Kafka buffer has been acknowledged by the Kafka producer.
    • Messages will not be lost due to events on the Kafka broker side, but they may be duplicated when Flink restarts, because Flink reprocesses old data.
  • DeliveryGuarantee.EXACTLY_ONCE: exactly once
    • In this mode, the Kafka sink writes all data in transactions that are committed at checkpoints.
    • Therefore, if consumers read only committed data (see the Kafka consumer setting isolation.level, and the sketch after this list), no duplicates appear when Flink restarts.
    • However, data becomes visible only after the checkpoint completes, so adjust the checkpoint interval as needed.
    • Make sure the transactional ID prefix (transactionIdPrefix) is unique per application so that transactions of different jobs do not interfere with each other. It is also strongly recommended to set Kafka's transaction timeout to be much larger than the maximum checkpoint interval plus the maximum restart time; otherwise Kafka will expire uncommitted transactions and data will be lost.
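
A minimal consumer-side sketch of the isolation.level point above, using the KafkaSource from the same connector. The broker address, topic name, and consumer group id are assumed placeholders taken from the earlier examples:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Read only records that the exactly-once sink has transactionally committed
KafkaSource<String> committedOnlySource = KafkaSource.<String>builder()
        .setBootstrapServers("worker01:9092")                 // assumed broker address
        .setTopics("20230912")                                // assumed topic name
        .setGroupId("downstream-consumer")                    // assumed consumer group id
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // isolation.level=read_committed hides uncommitted or aborted transactional data
        .setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")
        .build();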

4.1 At least once configuration

DataStream<String> stream = ...;

// Build a KafkaSink instance
KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
        // Required: configure the Kafka broker addresses and ports
        .setBootstrapServers("worker01:9092")
        // Required: configure the record serializer (topic name and value serialization schema)
        .setRecordSerializer(
                KafkaRecordSerializationSchema.<String>builder()
                        .setTopic("20230912")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build()
        )
        // Required: set the delivery guarantee to at-least-once
        .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build();

stream.sinkTo(kafkaSink);

4.2 Exactly once configuration

// For exactly-once, checkpointing must be enabled
env.enableCheckpointing(2000, CheckpointingMode.EXACTLY_ONCE);

DataStream<String> stream = ...;
        
KafkaSink<String> sink = KafkaSink.<String>builder()  // the generic type is the type of the input elements
        // Required: configure the Kafka broker addresses and ports
        .setBootstrapServers(brokers)
        // Required: configure the record serializer (topic name and value serialization schema)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
            .setTopic("topic-name")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build()
        )
        // Required: set the delivery guarantee to exactly-once
        .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        // For exactly-once, a transactional ID prefix must be set
        .setTransactionalIdPrefix("flink-")
        // For exactly-once, the transaction timeout must be set: larger than the checkpoint interval, at most Kafka's 15-minute maximum
        .setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "6000")
        .build();
        
stream.sinkTo(sink);

5. A complete introductory example

Requirement: Flink reads a socket data source in real time and writes the data to Kafka, guaranteeing no loss and no duplication.

Development language: Java 1.8

Flink version: 1.17.0

package com.baidu.datastream.sink;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.clients.producer.ProducerConfig;

// Write Flink data to Kafka
public class SinkKafka {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        // For exactly-once, checkpointing must be enabled
        env.enableCheckpointing(2000, CheckpointingMode.EXACTLY_ONCE);

        // 2. Define the data source
        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);

        // 3. Build a KafkaSink instance
        KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
                // Required: configure the Kafka broker addresses and ports
                .setBootstrapServers("worker01:9092")
                // Required: configure the record serializer (topic name and value serialization schema)
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.<String>builder()
                                .setTopic("20230912")
                                .setValueSerializationSchema(new SimpleStringSchema())
                                .build()
                )
                // Required: set the delivery guarantee to exactly-once
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                // For exactly-once, a transactional ID prefix must be set
                .setTransactionalIdPrefix("flink-")
                // For exactly-once, the transaction timeout must be set: larger than the checkpoint interval, at most Kafka's 15-minute maximum
                .setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "6000")
                .build();

        streamSource.sinkTo(kafkaSink);

        // 4. Trigger program execution
        env.execute();
    }
}
