4.3. How to read Kafka data in Flink tasks

Table of contents

1. Add pom dependencies

2. API instructions

3. A complete getting-started example

4. How to parse Kafka messages

4.1. Get only the value part of the Kafka message

4.2. Get the complete Kafka message (key, value, metadata)

4.3. Customizing the Kafka message parser

5. How to set the starting consumption offset

5.1. earliest()

5.2. latest()

5.3. timestamp()

6. What to do when Kafka partitions are expanded: dynamic partition discovery

7. Extract event time & add watermarks when loading a KafkaSource

7.1. Use the built-in monotonically increasing watermark generator + the Kafka timestamp as the event time

7.2. Use the built-in monotonically increasing watermark generator + the ID field in the Kafka message as the event time


1. Add pom dependencies

We can use Flink's official connector, flink-connector-kafka, to connect to Kafka.

This connector provides the KafkaSource API (the successor to the legacy FlinkKafkaConsumer), which is used below to read Kafka data.

To use this Kafka connector, add the following jar dependency:

<!-- Kafka connector dependency -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>1.17.0</version>
</dependency>

2. API instructions

Official website link: Apache Kafka Connector

Syntax description: 

// 1. Initialize a KafkaSource instance
KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers(brokers)                           // Required: broker connection info (specify several nodes for high availability)
    .setTopics("input-topic")                               // Required: topic to consume
    .setGroupId("my-group")                                 // Required: consumer group id (created automatically if it does not exist)
    .setValueOnlyDeserializer(new SimpleStringSchema())     // Required: deserializer (parses Kafka message data into Flink data types)
    .setStartingOffsets(OffsetsInitializer.earliest())      // Optional: starting offset when the job starts (defaults to OffsetsInitializer.earliest())
    .build(); 

// 2. Build a DataStreamSource from the KafkaSource via fromSource
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

3. A complete getting-started example

Development language: Java 1.8

Flink version: Flink 1.17.0

public class ReadKafka {
    public static void main(String[] args) throws Exception {
        newAPI();
    }

    public static void newAPI() throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read Kafka data
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("worker01:9092")               // Required: broker connection info (specify several nodes for high availability)
                .setTopics("20230810")                              // Required: topic to consume
                .setGroupId("FlinkConsumer")                        // Required: consumer group id (created automatically if it does not exist)
                .setValueOnlyDeserializer(new SimpleStringSchema()) // Required: deserializer (parses Kafka message data)
                .setStartingOffsets(OffsetsInitializer.earliest())  // Optional: starting offset (defaults to OffsetsInitializer.earliest())
                .build();

        env.fromSource(source,
                WatermarkStrategy.noWatermarks(),
                "Kafka Source")
                .print();

        // 3. Trigger job execution
        env.execute();
    }
}

4. How to parse Kafka messages

A deserializer must be provided in the code to parse Kafka messages.

The role of the deserializer:

                Convert Kafka ConsumerRecords into the data types (Java/Scala objects) handled by Flink.

The deserializer is specified via setDeserializer(KafkaRecordDeserializationSchema.of(deserializer)).

Two commonly used Kafka message parsers are introduced below:

        KafkaRecordDeserializationSchema.of(new JSONKeyValueDeserializationSchema(true)) :

                 1. Returns the complete Kafka message and deserializes the JSON strings into an ObjectNode object.

                 2. You can choose whether to return the metadata of the Kafka message: true returns it, false does not.

        KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class) :

                1. Returns only the value part of the Kafka message.

4.1. Get only the value part of the Kafka message
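
Code example:

A minimal sketch (the broker, topic, and group id below are reused from the other examples in this article as assumptions). It uses KafkaRecordDeserializationSchema.valueOnly, which drops the key and metadata and deserializes only the value bytes:

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("worker01:9092")
        .setTopics("9527")
        .setGroupId("FlinkConsumer")
        // valueOnly: deserialize only the value bytes with Kafka's StringDeserializer; key and metadata are discarded
        .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class))
        .setStartingOffsets(OffsetsInitializer.earliest())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source").print();

This is functionally similar to the setValueOnlyDeserializer(new SimpleStringSchema()) call used in the entry case above.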

4.2. Get the complete Kafka message (key, value, metadata)

Kafka message format:

                key = {"nation":"Shu"}

                value = {"ID": integer}
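
For reference, a minimal producer sketch that writes test messages in this key/value format (this sketch is an assumption for illustration, not part of the original setup; it needs the kafka-clients dependency, and the broker/topic names match the consumer below):

// Hypothetical test producer: writes 10 messages with a JSON key and a JSON value
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "worker01:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    for (int i = 0; i < 10; i++) {
        producer.send(new ProducerRecord<>("9527", "{\"nation\":\"Shu\"}", "{\"ID\": " + i + "}"));
    }
}

The consumer side then reads and parses the complete message: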

    public static void ParseMessageJSONKeyValue() throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read Kafka data
        KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
                .setBootstrapServers("worker01:9092")               // Required: broker connection info (specify several nodes for high availability)
                .setTopics("9527")                                  // Required: topic to consume
                .setGroupId("FlinkConsumer")                        // Required: consumer group id (created automatically if it does not exist)
                // Required: deserializer (parses the Kafka message into an ObjectNode JSON object)
                .setDeserializer(KafkaRecordDeserializationSchema.of(
                        // includeMetadata = (true: return Kafka metadata, false: do not return it)
                        new JSONKeyValueDeserializationSchema(true)
                ))
                .setStartingOffsets(OffsetsInitializer.latest())  // Optional: starting offset (defaults to OffsetsInitializer.earliest())
                .build();

        env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
                .print();

        // 3. Trigger job execution
        env.execute();

    }

Run result:

Common errors: 

Caused by: java.io.IOException: Failed to deserialize consumer record ConsumerRecord(topic = 9527, partition = 0, leaderEpoch = 0, offset = 1064, CreateTime = 1691668775938, serialized key size = 4, serialized value size = 9, headers = RecordHeaders(headers = [], isReadOnly = false), key = [B@5e9eaab8, value = [B@67390400).
	at org.apache.flink.connector.kafka.source.reader.deserializer.KafkaDeserializationSchemaWrapper.deserialize(KafkaDeserializationSchemaWrapper.java:57)
	at org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:53)
	... 14 more
Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'xxxx': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (byte[])"xxxx"; line: 1, column: 5]

Reason for the error:

        This error generally occurs when Flink reads Kafka with JSONKeyValueDeserializationSchema and the key or value of the Kafka message is not valid JSON, so parsing fails.

For example, the following message will cause a parse error: key=1000, value=hello

So how can we solve it?

        1. If you have permission to modify the Kafka message format, change the key & value of the Kafka messages to JSON.

        2. If you do not have permission to modify the Kafka message format (for example in a production environment where it is hard to change), re-implement the JSONKeyValueDeserializationSchema class so that it parses Kafka messages in the required format (you can refer to its source code); a small sketch of this idea follows.
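
A minimal sketch of option 2 (a hypothetical helper method for such a re-implemented schema, assuming the same shaded Jackson classes used elsewhere in this article): fall back to a plain-text node when the bytes are not valid JSON instead of throwing:

// Hypothetical helper for a custom deserializer: parse bytes as JSON, or fall back to a text node
private JsonNode parseOrText(ObjectMapper mapper, byte[] bytes) {
    try {
        return mapper.readTree(bytes);
    } catch (IOException e) {
        return TextNode.valueOf(new String(bytes, StandardCharsets.UTF_8));
    }
}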

4.3. Customizing the Kafka message parser

        Kafka messages come in all kinds of formats in production. When Flink's predefined parsers cannot meet the business requirements, you can support them by customizing the Kafka message parser.

For example, when JSONKeyValueDeserializationSchema is used to obtain Kafka metadata, only the offset, topic, and partition fields are returned. If the timestamp written by the Kafka producer is also needed, this can be done with a custom Kafka message parser.

Code example:

// Custom Kafka message parser: add a timestamp field to the metadata
// Imports for Flink 1.17 (the Jackson classes are Flink's shaded Jackson)
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonNode;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.flink.util.jackson.JacksonMapperFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import static org.apache.flink.api.java.typeutils.TypeExtractor.getForClass;

public class MyJSONKeyValueDeserializationSchema implements KafkaDeserializationSchema<ObjectNode> {

    private static final long serialVersionUID = 1509391548173891955L;

    private final boolean includeMetadata;
    private ObjectMapper mapper;

    public MyJSONKeyValueDeserializationSchema(boolean includeMetadata) {
        this.includeMetadata = includeMetadata;
    }

    @Override
    public void open(DeserializationSchema.InitializationContext context) throws Exception {
        mapper = JacksonMapperFactory.createObjectMapper();
    }

    @Override
    public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        ObjectNode node = mapper.createObjectNode();
        if (record.key() != null) {
            node.set("key", mapper.readValue(record.key(), JsonNode.class));
        }
        if (record.value() != null) {
            node.set("value", mapper.readValue(record.value(), JsonNode.class));
        }
        if (includeMetadata) {
            node.putObject("metadata")
                    .put("offset", record.offset())
                    .put("topic", record.topic())
                    .put("partition", record.partition())
                    // add the timestamp field
                    .put("timestamp", record.timestamp());
        }
        return node;
    }

    @Override
    public boolean isEndOfStream(ObjectNode nextElement) {
        return false;
    }

    @Override
    public TypeInformation<ObjectNode> getProducedType() {
        return getForClass(ObjectNode.class);
    }
}
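
To use the custom schema, pass it to the builder the same way as the built-in one; this is exactly what the examples in sections 5.3 and 7 below do:

.setDeserializer(KafkaRecordDeserializationSchema.of(new MyJSONKeyValueDeserializationSchema(true)))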

Run result:


5. How to set the starting consumption offset

Description of the starting consumption offset:

        The starting consumption offset is the position from which Kafka messages should be read when the Flink job starts.

        Three commonly used settings are described below:

                OffsetsInitializer.earliest() :

                        Consume from the earliest position.

                        "Earliest" here is bounded by the Kafka message retention period (7 days by default; each company's production environment differs slightly).

                        This is the default setting: when OffsetsInitializer.xxx is not specified, earliest() is used.

                OffsetsInitializer.latest() :

                        Start consuming from the latest position.

                        "Latest" here means only messages produced after the Flink job starts are consumed.

                OffsetsInitializer.timestamp(timestamp) :

                        Start consuming data whose timestamp is greater than or equal to the specified timestamp (in milliseconds).

The following case illustrates the effect of the three settings, using a Kafka topic into which 10 test records have been produced.

5.1. earliest()

Code example:

KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
        .setBootstrapServers("worker01:9092")
        .setTopics("23230811")
        .setGroupId("FlinkConsumer")
        // Parse the Kafka message into a JSON object and return the metadata
        .setDeserializer(KafkaRecordDeserializationSchema.of(
                new JSONKeyValueDeserializationSchema(true)
        ))
        // Starting offset: consume from the earliest position (this is the default)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .build();

Run result:

5.2. latest()

Code example:

KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
        .setBootstrapServers("worker01:9092")
        .setTopics("23230811")
        .setGroupId("FlinkConsumer")
        // Parse the Kafka message into a JSON object and return the metadata
        .setDeserializer(KafkaRecordDeserializationSchema.of(
                new JSONKeyValueDeserializationSchema(true)
        ))
        // Starting offset: consume from the latest position
        .setStartingOffsets(OffsetsInitializer.latest())
        .build();

Run result:

5.3. timestamp()

Code example:

KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
        .setBootstrapServers("worker01:9092")
        .setTopics("23230811")
        .setGroupId("FlinkConsumer")
        // Parse the Kafka message into a JSON object and return the metadata
        .setDeserializer(KafkaRecordDeserializationSchema.of(
                new MyJSONKeyValueDeserializationSchema(true)
        ))
        // Starting offset: consume from the specified timestamp onwards
        .setStartingOffsets(OffsetsInitializer.timestamp(1691722791273L))
        .build();

Run result:


6. What to do when Kafka partitions are expanded: dynamic partition discovery

        In Flink 1.13, if a Kafka topic's partitions were expanded, the data in the newly added partitions could only be consumed by restarting the Flink job. The author once encountered a case where an upstream business department expanded its Kafka partitions without notifying downstream consumers, which caused abnormal real-time metrics and even data loss.

        In Flink 1.17, by enabling dynamic partition discovery, the data in newly added partitions can be consumed without restarting the Flink job.

Enable partition discovery (disabled by default):

KafkaSource.builder()
    .setProperty("partition.discovery.interval.ms", "10000"); // check for new partitions every 10 seconds

Code example:

KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
        .setBootstrapServers("worker01:9092")
        .setTopics("9527")
        .setGroupId("FlinkConsumer")
        // Parse the Kafka message into a JSON object and return the metadata
        .setDeserializer(KafkaRecordDeserializationSchema.of(
                new JSONKeyValueDeserializationSchema(true)
        ))
        // Starting offset: consume from the latest position
        .setStartingOffsets(OffsetsInitializer.latest())
        // Enable dynamic partition discovery (disabled by default)
        .setProperty("partition.discovery.interval.ms", "10000") // check for new partitions every 10 seconds
        .build();

7. Extract event time & add watermarks when loading a KafkaSource

When calling fromSource(source, WatermarkStrategy, sourceName), you can extract the event time and specify the watermark generation strategy.

Note: when no event time extractor is specified, the Kafka Source uses the timestamp in the Kafka message as the event time.

7.1. Use the built-in monotonically increasing watermark generator + the Kafka timestamp as the event time

Code example:

    // Extract the event time & insert watermarks while reading Kafka messages
    public static void KafkaSourceExtractEventtimeAndWatermark() throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read Kafka data
        KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
                .setBootstrapServers("worker01:9092")
                .setTopics("9527")
                .setGroupId("FlinkConsumer")
                // Parse the Kafka message into a JSON object and return the metadata
                .setDeserializer(KafkaRecordDeserializationSchema.of(
                        new MyJSONKeyValueDeserializationSchema(true)
                ))
                // Starting offset: consume from the latest position
                .setStartingOffsets(OffsetsInitializer.latest())
                .build();

        env.fromSource(source,
                        // Use the built-in monotonically increasing watermark generator (by default the Kafka timestamp is used as the event time)
                        WatermarkStrategy.forMonotonousTimestamps(),
                        "Kafka Source")
                // Inspect the extracted event time and watermark via a ProcessFunction
                .process(
                        new ProcessFunction<ObjectNode, String>() {
                            @Override
                            public void processElement(ObjectNode kafkaJson, ProcessFunction<ObjectNode, String>.Context ctx, Collector<String> out) throws Exception {
                                // Current processing time
                                long currentProcessingTime = ctx.timerService().currentProcessingTime();
                                // Current watermark
                                long currentWatermark = ctx.timerService().currentWatermark();
                                StringBuffer record = new StringBuffer();
                                record.append("========================================\n");
                                record.append(kafkaJson + "\n");
                                record.append("currentProcessingTime:" + currentProcessingTime + "\n");
                                record.append("currentWatermark:" + currentWatermark + "\n");
                                record.append("kafka-ID:" + Long.parseLong(kafkaJson.get("value").get("ID").toString()) + "\n");
                                record.append("kafka-timestamp:" + Long.parseLong(kafkaJson.get("metadata").get("timestamp").toString()) + "\n");
                                out.collect(record.toString());
                            }
                        }
                ).print();

        // 3. Trigger job execution
        env.execute();
    }

Run result:

7.2. Use the built-in monotonically increasing watermark generator + the ID field in the Kafka message as the event time

Code example:

    // Extract the event time & insert watermarks while reading Kafka messages
    public static void KafkaSourceExtractEventtimeAndWatermark() throws Exception {
        // 1. Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read Kafka data
        KafkaSource<ObjectNode> source = KafkaSource.<ObjectNode>builder()
                .setBootstrapServers("worker01:9092")
                .setTopics("9527")
                .setGroupId("FlinkConsumer")
                // Parse the Kafka message into a JSON object and return the metadata
                .setDeserializer(KafkaRecordDeserializationSchema.of(
                        new MyJSONKeyValueDeserializationSchema(true)
                ))
                // Starting offset: consume from the latest position
                .setStartingOffsets(OffsetsInitializer.latest())
                .build();

        env.fromSource(source,
                        // Use the built-in monotonically increasing watermark generator (with the ID field in the Kafka message as the event time)
                        WatermarkStrategy.<ObjectNode>forMonotonousTimestamps()
                                // Extract the ID field of the Kafka message as the event time
                                .withTimestampAssigner(
                                        (json, timestamp) -> Long.parseLong(json.get("value").get("ID").toString())
                                ),

                        "Kafka Source")
                // Inspect the extracted event time and watermark via a ProcessFunction
                .process(
                        new ProcessFunction<ObjectNode, String>() {
                            @Override
                            public void processElement(ObjectNode kafkaJson, ProcessFunction<ObjectNode, String>.Context ctx, Collector<String> out) throws Exception {
                                // Current processing time
                                long currentProcessingTime = ctx.timerService().currentProcessingTime();
                                // Current watermark
                                long currentWatermark = ctx.timerService().currentWatermark();
                                StringBuffer record = new StringBuffer();
                                record.append("========================================\n");
                                record.append(kafkaJson + "\n");
                                record.append("currentProcessingTime:" + currentProcessingTime + "\n");
                                record.append("currentWatermark:" + currentWatermark + "\n");
                                record.append("kafka-ID:" + Long.parseLong(kafkaJson.get("value").get("ID").toString()) + "\n");
                                record.append("kafka-timestamp:" + Long.parseLong(kafkaJson.get("metadata").get("timestamp").toString()) + "\n");
                                out.collect(record.toString());
                            }
                        }
                ).print();

        // 3. Trigger job execution
        env.execute();
    }

Run result:

Origin blog.csdn.net/weixin_42845827/article/details/132194623