Flink's Kafka connector: Kafka source and Kafka sink

Table of Contents

Import connector dependencies

Import implicit conversion

Flink's Kafka consumer

Deserializer

Custom deserializer

Set the consumption starting point

Specify start offsets per partition

Kafka consumption fault tolerance (checkpoint mechanism)

Partition awareness

Offset commit configuration

Extract timestamps and generate watermarks

Flink's Kafka producer

Serializer

Kafka producer partitioner

Kafka producer fault tolerance mechanism

Use Kafka timestamps and Flink event time

Data loss

Import connector dependencies

Flink itself does not ship with a Kafka connector; you need to import the connector dependency before you can use it. Since Flink 1.7, the universal flink-connector-kafka automatically tracks the latest Kafka version, but if you use an older Kafka version such as 0.11, 0.10, 0.9 or 0.8, you should use the corresponding version-specific Kafka connector.

Maven dependency | Flink version | Consumer / producer class names | Kafka version | Remarks
flink-connector-kafka-0.8_2.11 | 1.0.0 | FlinkKafkaConsumer08 / FlinkKafkaProducer08 | 0.8.x | Uses Kafka's SimpleConsumer API internally. Offsets are committed to ZK by Flink.
flink-connector-kafka-0.9_2.11 | 1.0.0 | FlinkKafkaConsumer09 / FlinkKafkaProducer09 | 0.9.x | Uses Kafka's new consumer API.
flink-connector-kafka-0.10_2.11 | 1.2.0 | FlinkKafkaConsumer010 / FlinkKafkaProducer010 | 0.10.x | This connector supports Kafka messages with timestamps for both producing and consuming.
flink-connector-kafka-0.11_2.11 | 1.4.0 | FlinkKafkaConsumer011 / FlinkKafkaProducer011 | 0.11.x | Since 0.11.x, Kafka does not support Scala 2.10. The connector supports Kafka transactional messaging to provide producers with exactly-once semantics.
flink-connector-kafka_2.11 | 1.7.0 | FlinkKafkaConsumer / FlinkKafkaProducer | >= 1.0.0 | This universal Kafka connector adapts to the latest version of Kafka; the client version used by Flink may change between releases. Starting from Flink 1.9, it uses the Kafka 2.2.0 client. The Kafka client is backward compatible with brokers 0.10.0 or higher, but for Kafka 0.11.x and 0.10.x we recommend the dedicated flink-connector-kafka-0.11_2.11 and flink-connector-kafka-0.10_2.11 connectors respectively.

Dependency:

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>1.10.0</version>
        </dependency>

Import implicit conversion

Scala code needs to import Flink's implicit conversions, otherwise compilation errors will be reported. This import is omitted from the code examples below.

import org.apache.flink.api.scala._

Kafka consumers of Flink

Flink's Kafka consumer class is named FlinkKafkaConsumer08 (08 is the Kafka version; for example, the consumer class for Kafka 0.9.0.x is FlinkKafkaConsumer09, and for the universal connector it is simply FlinkKafkaConsumer).

To create a Flink Kafka consumer, you need to pass three parameters:

  1. Topic: a topic name, or a list containing multiple topic names.
  2. Deserializer: used to deserialize Kafka messages.
  3. Properties object: contains various configurations, such as bootstrap.servers and the consumer group; Kafka 0.8 and earlier also needs a ZooKeeper address to store offsets.

Such as:

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")
stream = env
    .addSource(new FlinkKafkaConsumer08[String]("topic", new SimpleStringSchema(), properties))
    .print()

Deserializer

When Flink consumes from Kafka, it needs to know how to convert the binary data in Kafka into Java/Scala objects. Flink ships with several deserializers; besides the SimpleStringSchema used in the example above, there are:

TypeInformationSerializationSchema (or TypeInformationKeyValueSerializationSchema): creates a schema based on Flink's TypeInformation. If the data only flows between Flink jobs, this is a good choice and performs better than the other deserialization methods.

JsonDeserializationSchema (or JSONKeyValueDeserializationSchema): converts JSON data in Kafka into an ObjectNode, whose fields can be accessed with objectNode.get("field").as(Int/String/...)(). If you use the key/value variant in parentheses, the resulting ObjectNode contains not only all the fields of the JSON key and value, but also Kafka metadata such as the topic, partition, and offset.
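
As a quick illustration (not from the original post), here is a minimal sketch of consuming JSON records with JSONKeyValueDeserializationSchema using the universal connector from the dependency above; the topic name, field name, and broker address are placeholders.

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "test")

// true = also include Kafka metadata (topic, partition, offset) in the ObjectNode
val jsonConsumer = new FlinkKafkaConsumer(
  "json-topic", new JSONKeyValueDeserializationSchema(true), props)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.addSource(jsonConsumer)
  .map(node => node.get("value").get("someField").asText()) // read a field of the JSON value
  .print()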

AvroDeserializationSchema: reads data serialized in the Avro format using a static schema. The schema can be inferred from Avro-generated classes (AvroDeserializationSchema.forSpecific(...)), or specified manually using GenericRecords (AvroDeserializationSchema.forGeneric(...)). To use Avro deserialization, you need to import the corresponding dependency, such as flink-avro or flink-avro-confluent-registry.

Custom deserializer

Flink provides a DeserializationSchema interface; by overriding its T deserialize(byte[] message) method you can implement a custom deserializer. deserialize is invoked for every Kafka message and returns data of your custom type.

Take KeyedDeserializationSchema as an example; override the deserialize method to return a triple containing the topic, key, and value:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

public class KafkaDeserializationTopicSchema implements KeyedDeserializationSchema<Tuple3<String, String, String>> {

    public KafkaDeserializationTopicSchema() {
    }

    @Override
    public Tuple3<String, String, String> deserialize(byte[] keyByte, byte[] message, String topic, int partition, long offset) throws IOException {
        // decode the raw key and value bytes as UTF-8 strings
        String key = null;
        String value = null;
        if (keyByte != null) {
            key = new String(keyByte, StandardCharsets.UTF_8);
        }
        if (message != null) {
            value = new String(message, StandardCharsets.UTF_8);
        }
        return new Tuple3<>(topic, key, value);
    }

    @Override
    public boolean isEndOfStream(Tuple3<String, String, String> nextElement) {
        // return true to stop the source; here the stream never ends
        return false;
    }

    @Override
    public TypeInformation<Tuple3<String, String, String>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple3<String, String, String>>() {});
    }
}

Set the consumption starting point

Flink lets you set the position from which the Kafka consumer starts consuming. For example:

val env = StreamExecutionEnvironment.getExecutionEnvironment()

val myConsumer = new FlinkKafkaConsumer08[String](...)
myConsumer.setStartFromEarliest()      // start from the earliest offset
myConsumer.setStartFromLatest()        // start from the latest offset
myConsumer.setStartFromTimestamp(...)  // start from the specified timestamp
myConsumer.setStartFromGroupOffsets()  // start from the offsets committed by the consumer group (the default)

val stream = env.addSource(myConsumer)

setStartFromGroupOffsets: start consuming from the offsets the consumer group has committed to Kafka (Kafka 0.8 stores them in ZooKeeper; later versions store them in a dedicated internal topic). If no committed offset is found, consumption starts according to the auto.offset.reset setting.

setStartFromEarliest() / setStartFromLatest(): start consuming from the earliest/latest offset. In these modes, committed offsets are ignored and not used as the starting position.

setStartFromTimestamp(long): start consuming from the specified timestamp. For each partition, records with a timestamp greater than or equal to the specified timestamp are consumed; if a partition's latest record has a timestamp earlier than the specified one, that partition is read from the latest offset. In this mode, committed offsets are ignored and not used as the starting position.

Specify start offsets per partition

You can also manually specify, for each partition, the offset from which consumption should start. In the example below, "myTopic" is the topic being consumed, 0 1 2 are its partition numbers, and 23 31 43 are the offsets of the next records to consume. For any partition not specified, the consumer falls back to the setStartFromGroupOffsets behavior, i.e. it starts from the consumer group's committed offset.

val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L)

myConsumer.setStartFromSpecificOffsets(specificStartOffsets)

Note: When the job is automatically recovered from a failure, or manually restored from a checkpoint/savepoint, consumption continues from the offsets stored in the saved state, and these start-position settings are not applied.

Kafka consumption fault tolerance (checkpoint mechanism)

Once checkpointing is enabled, Flink periodically saves the Kafka offsets together with the operator state (including intermediate results) while consuming from Kafka, and keeps them consistent. If the job fails, Flink restores the computation state from the latest checkpoint and resumes consumption from the saved offsets.

Therefore, the checkpoint interval determines how much data has to be re-processed, at most, when the task fails.

Checkpointing is enabled by calling enableCheckpointing on the StreamExecutionEnvironment when building the job; the parameter is the checkpoint interval in milliseconds. The code is as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(5000) // checkpoint every 5000 msecs

Partition awareness

Flink's Kafka consumer supports dynamically discovering changes in Kafka partitions. When new partitions are created in Kafka, Flink can discover them and consume them with exactly-once guarantees. Partitions discovered after the initial partition metadata retrieval (i.e. after the job starts running) are consumed from the earliest offset.

By default, partition discovery is disabled. You can enable it by setting flink.partition-discovery.interval-millis in the consumer Properties; the value is a non-negative number giving the interval, in milliseconds, at which partition changes are checked. A minimal sketch follows.
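
A minimal sketch of enabling partition discovery, assuming the universal connector; the broker address, group id, and 30-second interval are placeholders.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "test")
// check for new partitions every 30 seconds
props.setProperty("flink.partition-discovery.interval-millis", "30000")

val discoveringConsumer = new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), props)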

Flink can also subscribe to topics by matching topic names against a regular expression, for example:

val env = StreamExecutionEnvironment.getExecutionEnvironment()

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")

val myConsumer = new FlinkKafkaConsumer08[String](
  java.util.regex.Pattern.compile("test-topic-[0-9]"),
  new SimpleStringSchema,
  properties)

val stream = env.addSource(myConsumer)

In the example above, Flink will consume all topics whose names start with "test-topic-" followed by a single digit.

Offset commit configuration

Flink commits the offsets it has consumed back to Kafka's internal offsets topic (Kafka 0.8 stores them in ZooKeeper), but Flink does not rely on these committed offsets for fault tolerance; they are only used so that Kafka tooling can monitor consumption progress.

Depending on whether checkpointing is enabled, offsets are committed in different ways:

Checkpointing enabled: Flink first saves the offsets and state to the checkpoint, and then commits the offsets to Kafka, which keeps the offsets in Kafka consistent with those in the checkpoint. You can use setCommitOffsetsOnCheckpoints(boolean) to control whether offsets are committed to Kafka on checkpoint completion; the default is true. When checkpointing is enabled, this setting overrides the automatic periodic committing configured in the Properties.

Checkpointing disabled: Flink relies on the Kafka client's automatic offset committing. You can use enable.auto.commit (auto.commit.enable for Kafka 0.8) and auto.commit.interval.ms in the Properties to control whether offsets are committed automatically and at what interval. A combined sketch for both cases follows.
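
A minimal sketch showing both knobs side by side, assuming the universal connector; which one actually takes effect depends on whether checkpointing is enabled, as described above.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "test")
// only used when checkpointing is disabled: let the Kafka client auto-commit offsets
props.setProperty("enable.auto.commit", "true")
props.setProperty("auto.commit.interval.ms", "5000")

val consumer = new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), props)
// only used when checkpointing is enabled: commit offsets to Kafka on completed checkpoints
consumer.setCommitOffsetsOnCheckpoints(true)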

Extract timestamps and generate watermarks

Kafka records may carry event timestamps. Timestamps and watermarks are described in detail in another article, so they are not repeated here. If you do not need event time, you can skip this section.

You set the watermark generator and register the timestamp extractor by calling assignTimestampsAndWatermarks on the Kafka consumer object and passing in a custom watermark generator, for example:

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")

val myConsumer = new FlinkKafkaConsumer08[String]("topic", new SimpleStringSchema(), properties)
myConsumer.assignTimestampsAndWatermarks(new CustomWatermarkEmitter())
stream = env
    .addSource(myConsumer)
    .print()

For watermarks and how to implement a custom watermark generator (assigner), see: https://blog.csdn.net/x950913/article/details/106246807
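
For reference, here is a minimal sketch of what a CustomWatermarkEmitter for the String stream above might look like; the CSV layout (timestamp as the first field) and the 3-second out-of-orderness bound are assumptions, not part of the original article.

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class CustomWatermarkEmitter extends AssignerWithPeriodicWatermarks[String] {
  private val maxOutOfOrderness = 3000L          // tolerate 3 seconds of out-of-order data
  private var currentMaxTimestamp = Long.MinValue

  override def extractTimestamp(element: String, previousElementTimestamp: Long): Long = {
    val ts = element.split(",")(0).toLong        // assumed: the event timestamp is the first CSV field
    currentMaxTimestamp = math.max(currentMaxTimestamp, ts)
    ts
  }

  // emitted periodically; lags behind the largest seen timestamp by the out-of-orderness bound
  override def getCurrentWatermark: Watermark =
    new Watermark(
      if (currentMaxTimestamp == Long.MinValue) Long.MinValue
      else currentMaxTimestamp - maxOutOfOrderness)
}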

 

Flink's Kafka producer

Flink's Kafka producer class is named FlinkKafkaProducer011 (for Kafka 0.10.0.x it is FlinkKafkaProducer010; for Kafka 1.0.0 and higher with the universal connector it is FlinkKafkaProducer). The producer object can write data to one or more topics.

Code example:

val stream: DataStream[String] = ...

val myProducer = new FlinkKafkaProducer011[String](
        "localhost:9092",         // broker list
        "my-topic",               // target topic
        new SimpleStringSchema)   // serialization schema

// versions 0.10+ allow attaching the records' event timestamp when writing them to Kafka;
// this method is not available for earlier Kafka versions
myProducer.setWriteTimestampToKafka(true)

stream.addSink(myProducer)

 

Serializer

Serialization mirrors the Kafka consumer's deserialization: refer to the deserializer section above. A minimal sketch of a custom SerializationSchema follows.
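
A minimal sketch of a custom SerializationSchema, assuming records of type (String, Int) encoded as CSV; the class name and format are illustrative.

import java.nio.charset.StandardCharsets

import org.apache.flink.api.common.serialization.SerializationSchema

class TupleToCsvSchema extends SerializationSchema[(String, Int)] {
  // turn each record into the bytes that will be written as the Kafka message value
  override def serialize(element: (String, Int)): Array[Byte] =
    s"${element._1},${element._2}".getBytes(StandardCharsets.UTF_8)
}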

Kafka producer partitioner

If no partitioner is set, Flink uses its own FlinkFixedPartitioner. With this default, each parallel subtask writes all of its records to a single Kafka partition (the subtask index modulo the number of partitions).

A custom partitioner can be implemented by extending the FlinkKafkaPartitioner class; all connector versions support custom partitioners (see the sketch after the note below).

Note: partitioners must be serializable, because they are shipped to every Flink node. Also, the partitioner itself is not included in checkpoints, so do not keep state inside it; any such state would be lost after a task failure.
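
A minimal sketch of a custom partitioner that routes records by the hash of their key, assuming String records; the class name and key handling are illustrative. An instance can then be passed to the producer constructor in place of the default partitioner.

import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner

class KeyHashPartitioner extends FlinkKafkaPartitioner[String] {
  override def partition(record: String,
                         key: Array[Byte],
                         value: Array[Byte],
                         targetTopic: String,
                         partitions: Array[Int]): Int = {
    // keyless records go to the first partition, keyed records are spread by key hash
    if (key == null) partitions(0)
    else partitions(Math.floorMod(java.util.Arrays.hashCode(key), partitions.length))
  }
}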

You can also skip the partitioner entirely and let the (keyed) serialization schema determine the target partition for each record. In that case you must explicitly pass null as the partitioner (null must be specified because, as noted above, omitting the partitioner means the default FlinkFixedPartitioner is used).

Kafka producer fault tolerance mechanism

Kafka 0.8 version

Kafka 0.8 supports neither exactly-once nor at-least-once guarantees for the producer.

Kafka 0.9 and 0.10 version

With checkpointing enabled, the Kafka 0.9 and 0.10 producers support at-least-once delivery.

Besides enabling checkpointing, the producer should be configured with the setLogFailuresOnly(boolean) and setFlushOnCheckpoint(boolean) methods:

  • setLogFailuresOnly: defaults to false. When set to true, a failed write only logs the error instead of throwing an exception; in effect, the record is treated as if it had been successfully written to Kafka. To guarantee at-least-once delivery, leave this set to false.
  • setFlushOnCheckpoint: defaults to true. When enabled, Flink waits for Kafka to acknowledge all in-flight records before the offsets and state are written to the checkpoint. This ensures that all data preceding the checkpoint has been written to Kafka. To guarantee at-least-once delivery, this must be true.

In short, to guarantee at-least-once delivery with Kafka 0.9 and 0.10, enable checkpointing, keep setLogFailuresOnly at false and setFlushOnCheckpoint at true, as in the sketch below.
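
A minimal sketch of an at-least-once producer for Kafka 0.10 under the settings just described; the broker address and topic are placeholders, and `stream` is assumed to be an existing DataStream[String].

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010

// checkpointing must already be enabled on the environment (env.enableCheckpointing(...))
val atLeastOnceProducer = new FlinkKafkaProducer010[String](
  "localhost:9092", "my-topic", new SimpleStringSchema)
atLeastOnceProducer.setLogFailuresOnly(false)  // fail the job on write errors instead of only logging them
atLeastOnceProducer.setFlushOnCheckpoint(true) // flush pending records before a checkpoint completes

stream.addSink(atLeastOnceProducer)            // `stream` is an assumed existing DataStream[String]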

Note: the producer's retry count defaults to 0, so with setLogFailuresOnly set to false, any error immediately fails the write. The default of 0 avoids producing duplicate data, but increasing the retry count is recommended in production.

Kafka 0.11 and higher

When checkpointing is used, FlinkKafkaProducer011 (FlinkKafkaProducer for Kafka 1.0.0 and higher) can provide exactly-once guarantees.

With checkpointing enabled, you can choose between three delivery modes by passing a Semantic value to the FlinkKafkaProducer011 constructor (see the sketch after this list):

  • Semantic.NONE: Flink gives no guarantees; data may be lost or duplicated.
  • Semantic.AT_LEAST_ONCE: the default. Records are delivered at least once: nothing is lost, but duplicates are possible. This corresponds to setFlushOnCheckpoint(true) for Kafka 0.9 and 0.10.
  • Semantic.EXACTLY_ONCE: records are delivered exactly once, implemented with Kafka's transaction mechanism (together with Flink's two-phase commit; a separate article will cover the checkpoint and two-phase-commit mechanisms in detail). The official documentation notes that when transactions are used for writing to Kafka, the consumer's isolation.level should be adjusted: read_uncommitted is the default and should be changed to read_committed, as described in the notes below.
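
A minimal sketch of constructing an exactly-once producer, assuming Kafka 0.11 and an existing DataStream[String] named `stream`; the topic, broker address, and timeout value are placeholders.

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

val producerProps = new Properties()
producerProps.setProperty("bootstrap.servers", "localhost:9092")
// keep the producer transaction timeout below the broker's transaction.max.timeout.ms
producerProps.setProperty("transaction.timeout.ms", "900000")

val exactlyOnceProducer = new FlinkKafkaProducer011[String](
  "my-topic",
  new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),
  producerProps,
  FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)

stream.addSink(exactlyOnceProducer)  // `stream` is an assumed existing DataStream[String]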

Note: in Semantic.EXACTLY_ONCE mode, after a failure the job must be restored from a checkpoint before the pending transaction can be committed. If the recovery time (i.e. the time Flink is down) exceeds Kafka's transaction timeout, data is lost: the first phase of the commit has happened, Flink goes down, and by the time it recovers and performs the second phase the transaction has already timed out, so the data of that commit is lost. For this reason, Kafka's transaction timeout should be increased appropriately.

Kafka brokers cap transaction timeouts with transaction.max.timeout.ms, which defaults to 15 minutes; a transaction whose two-phase commit takes longer than that fails. FlinkKafkaProducer011 sets the producer's transaction.timeout.ms to 1 hour by default, so when using exactly-once semantics the broker's transaction.max.timeout.ms should be increased accordingly.

If a Kafka consumer uses read_committed mode, an open (uncommitted) transaction blocks all later records from being read, even those of transactions that committed afterwards. For example, suppose two transactions proceed on the following timeline:

  1. Send transaction A;
  2. Send transaction B;
  3. Submit transaction B;

Although transaction B is committed before transaction A, as long as transaction A remains open, consumers cannot read the data sent by transaction B unless they use read_uncommitted mode.

Note again: in Semantic.EXACTLY_ONCE mode, each FlinkKafkaProducer011 instance uses a fixed-size pool of Kafka producers, and each concurrent checkpoint uses one producer from that pool. If the number of concurrent checkpoints exceeds the pool size, Flink throws an exception and the application fails, so it is advisable to increase the pool size appropriately.

Final note: as mentioned above, while a transaction is pending and consumers are in read_committed mode, data written after that transaction cannot be read. Now consider this situation: the Flink program opens a transaction but does not commit it, and crashes before the first checkpoint is taken. After the restart, no checkpoint holds any information about that transaction, so Flink does not know the uncommitted transaction exists and simply continues writing new data; consumers then cannot consume that data, and the pending data may eventually be rolled back (the author's guess, to be verified). So this situation is not safe: make sure the Flink program does not fail before its first checkpoint completes.

The constant FlinkKafkaProducer011.SAFE_SCALE_DOWN_FACTOR appears to be related to this mechanism; its Javadoc is quoted below to take a closer look at later.

	/**
	 * This coefficient determines what is the safe scale down factor.
	 *
	 * <p>If the Flink application previously failed before first checkpoint completed or we are starting new batch
	 * of {@link FlinkKafkaProducer011} from scratch without clean shutdown of the previous one,
	 * {@link FlinkKafkaProducer011} doesn't know what was the set of previously used Kafka's transactionalId's. In
	 * that case, it will try to play safe and abort all of the possible transactionalIds from the range of:
	 * {@code [0, getNumberOfParallelSubtasks() * kafkaProducersPoolSize * SAFE_SCALE_DOWN_FACTOR) }
	 *
	 * <p>The range of available to use transactional ids is:
	 * {@code [0, getNumberOfParallelSubtasks() * kafkaProducersPoolSize) }
	 *
	 * <p>This means that if we decrease {@code getNumberOfParallelSubtasks()} by a factor larger than
	 * {@code SAFE_SCALE_DOWN_FACTOR} we can have a left some lingering transaction.
	 */
	public static final int SAFE_SCALE_DOWN_FACTOR = 5;

Use Kafka timestamps and Flink event time

Since Kafka 0.10, Kafka messages can carry a timestamp. This timestamp can be the event time, or the time at which the message arrived in Kafka. See https://blog.csdn.net/x950913/article/details/106246807 for an introduction to event time.

If Flink's time characteristic is set to event time (TimeCharacteristic.EventTime), FlinkKafkaConsumer010 emits records with the Kafka timestamp attached as the event timestamp. The time characteristic is set as follows:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

If you want the Kafka records to carry the timestamp of their arrival in Kafka, you do not need to assign a timestamp when producing; if none is set explicitly, the record gets a timestamp when it is written to Kafka.

Whichever timestamp is used, writing it to Kafka requires calling setWriteTimestampToKafka(true) on the configuration object returned by the Flink Kafka producer:

FlinkKafkaProducer010.FlinkKafkaProducer010Configuration config = FlinkKafkaProducer010.writeToKafkaWithTimestamps(streamWithTimestamps, topic, new SimpleStringSchema(), standardProps);
config.setWriteTimestampToKafka(true);

Data loss

Even when exactly-once production is configured, the default values of the following Kafka settings may still cause data loss, so they are worth reviewing (see the sketch after the list):

  • acks
  • log.flush.interval.messages
  • log.flush.interval.ms
  • log.flush.*
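
A minimal sketch of tightening the producer-side settings in the Properties passed to the Flink Kafka producer; the log.flush.* options are broker-side settings configured in the broker's server.properties, not here, and the values shown are illustrative.

import java.util.Properties

val producerProps = new Properties()
producerProps.setProperty("bootstrap.servers", "localhost:9092")
producerProps.setProperty("acks", "all")    // wait for all in-sync replicas to acknowledge each write
producerProps.setProperty("retries", "3")   // retry transient send failures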

 
