Flink + Kafka: end-to-end state consistency

What is state consistency?

  • In stateful stream processing, each operator task has its own internal state
  • For the stream processor itself, state consistency essentially means that the results are guaranteed to be correct
  • No data may be lost, and no data may be counted twice
  • After a failure the state can be restored, and recomputation after recovery must produce results that are entirely correct

Levels of state consistency inside Flink

  • AT-LEAST-ONCE (at least once)

In most scenarios we at least do not want to lose events. This guarantee means that all data gets processed, but some events may be processed more than once.

  • AT-MOST-ONCE (at most once)

When a task fails, the simplest approach is to do nothing: neither recover the lost state nor replay the lost data. The resulting semantics is that each piece of data is processed at most once.

  • EXACTLY-ONCE (exactly once)

Exactly-once is the strictest guarantee and the hardest to achieve. It means not only that no events are lost, but also that each event updates the internal state exactly once.

Flink guarantees exactly-once semantics through consistent checkpoints. A consistent checkpoint of a streaming application's state is a snapshot of the state of all tasks at a point in time, taken when every task has processed exactly the same input data. Consistent checkpoints of the application state are the core of Flink's failure recovery mechanism.
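
A minimal sketch of what this looks like in code (the 60-second interval is just an illustrative value; CheckpointingMode.EXACTLY_ONCE is also Flink's default mode):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// take a consistent snapshot of all task state every 60 s;
// EXACTLY_ONCE is the checkpointing mode behind the internal guarantee described above
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)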

End-to-end state consistency

Consistency must be guaranteed across the whole pipeline, source -> processing -> sink. How can that be achieved?

Data must neither be lost nor be consumed more than once.

1. Source: the read offsets must be resettable

For example, Kafka offsets can be committed under the application's control (manually or on checkpoints) instead of being auto-committed.

2. Processing

Consistency inside Flink is guaranteed by the checkpoint mechanism.

3. Sink: on recovery, data must not be written to the external system twice. There are two ways to achieve this:

Idempotent writes

An idempotent operation is one that can be repeated many times but changes the result only once; repeating it afterwards has no further effect, for example inserting the same (k -> v) pair into a HashMap several times.
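
A small, self-contained sketch of the idea (the mutable map stands in for an external key-value store; all names are made up for illustration): replaying the same keyed write leaves the store unchanged, so duplicates caused by recovery do no harm.

import scala.collection.mutable

// the map stands in for an external key-value store (e.g. a table keyed by id)
val store = mutable.Map.empty[String, Double]

// an idempotent write: the result depends only on the key and value,
// so writing the same record twice leaves the store in the same final state
def upsert(key: String, value: Double): Unit = store(key) = value

upsert("sensor_1", 35.8)
upsert("sensor_1", 35.8)   // replayed after a failure: no duplicate, same final state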

Transactional writes

ACID: atomicity, consistency, isolation, durability

A transaction is bound to a checkpoint: only when the checkpoint completes are all of the corresponding results actually written to the external sink system.

Implementation approaches:

1. Write-ahead log (WAL): the DataStream API provides GenericWriteAheadSink to implement such a transactional sink

2. Two-phase commit, 2PC (two-phase commit)

For each checkpoint, the sink task starts a transaction and adds all the data it receives to that transaction;

this data is written to the external sink system, but not committed yet; it is only pre-committed;

when the sink receives the notification that the checkpoint has completed, it formally commits the transaction and the results are actually written.

This gives true exactly-once semantics, but it requires the external sink system to support transactions. Flink provides the TwoPhaseCommitSinkFunction interface for this.
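
To make the pre-commit / commit split concrete, here is a small, self-contained sketch of the protocol itself (not Flink's actual TwoPhaseCommitSinkFunction API; all names and the String record type are illustrative):

import scala.collection.mutable

class TwoPhaseCommitSketch {
  private val openTxn = mutable.Buffer.empty[String]               // transaction for the current checkpoint interval
  private val preCommitted = mutable.Map.empty[Long, Seq[String]]  // checkpointId -> pre-committed, not yet visible data
  private val externalSystem = mutable.Buffer.empty[String]        // data that is actually visible to readers

  // every incoming record is added to the open transaction
  def invoke(record: String): Unit = openTxn += record

  // on the checkpoint barrier: pre-commit the open transaction and start a new one
  def preCommit(checkpointId: Long): Unit = {
    preCommitted(checkpointId) = openTxn.toList
    openTxn.clear()
  }

  // on the "checkpoint complete" notification: the pre-committed data becomes visible
  def commit(checkpointId: Long): Unit =
    preCommitted.remove(checkpointId).foreach(data => externalSystem ++= data)

  // on failure before completion: the pre-committed data is discarded, never exposed
  def abort(checkpointId: Long): Unit =
    preCommitted.remove(checkpointId)
}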

Sink \ Source            Non-resettable source    Resettable source
Any (any sink)           At-most-once             At-least-once
Idempotent writes        At-most-once             Exactly-once
Write-ahead log (WAL)    At-most-once             At-least-once
Two-phase commit (2PC)   At-most-once             Exactly-once


Achieving end-to-end state consistency with Flink + Kafka

In Kafka 0.8 and earlier versions, offsets are by default stored only in ZooKeeper.

With Kafka versions 0.9-0.10, only at-least-once semantics can be guaranteed, using the configuration below.

Besides enabling Flink's checkpoint feature, some related settings also have to be configured.
Because in 0.9 or 0.10 the default FlinkKafkaProducer can only guarantee at-least-once semantics, to actually satisfy at-least-once we also need to set the following (a sketch follows below):
setLogFailuresOnly(boolean)    default: false
setFlushOnCheckpoint(boolean)  default: true
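
A minimal sketch of these two setters, assuming the version-specific 0.10 connector class FlinkKafkaProducer010 (topic name and bootstrap address are placeholders):

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")

val producer010 = new FlinkKafkaProducer010[String]("sink-topic", new SimpleStringSchema(), props)
producer010.setLogFailuresOnly(false)   // fail the job on write errors instead of only logging them
producer010.setFlushOnCheckpoint(true)  // flush in-flight records before a checkpoint completes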

With Kafka 0.11 and later versions, Flink supports exactly-once semantics well:

1. Internal: the checkpoint mechanism saves the state, which can be restored after a failure, guaranteeing consistency inside Flink.

env.enableCheckpointing(60000)

2. Source: with FlinkKafkaConsumer as the source, the offsets are saved in checkpoints. If a task fails, the connector can reset the offsets on recovery and re-consume the data, guaranteeing consistency on the source side.

// create the kafkaConsumer; you can also specify where it starts consuming
// after enabling Flink's checkpoints, also enable committing the Kafka offsets on checkpoints via kafkaConsumer.setCommitOffsetsOnCheckpoints(true);
// once this is enabled, a previously configured auto.commit.enable = true becomes ineffective
kafkaConsumer.setCommitOffsetsOnCheckpoints(true)

3. Sink: with FlinkKafkaProducer as the sink, a two-phase-commit sink is used, which requires a TwoPhaseCommitSinkFunction implementation:

// FlinkKafkaProducer itself already extends TwoPhaseCommitSinkFunction, but the desired semantic has to be passed in as a constructor argument; the default is AT_LEAST_ONCE
public class FlinkKafkaProducer<IN>
	extends TwoPhaseCommitSinkFunction<IN, FlinkKafkaProducer.KafkaTransactionState, FlinkKafkaProducer.KafkaTransactionContext> {

It also requires some fault-tolerance configuration for the producer:

  • Besides enabling Flink's checkpointing, the appropriate semantic parameter can be passed to FlinkKafkaProducer011 (FlinkKafkaProducer for Kafka >= 1.0.0)

    to select one of three different operating modes:

  • Semantic.NONE: at-most-once semantics

  • Semantic.AT_LEAST_ONCE (Flink's default setting)

  • Semantic.EXACTLY_ONCE: uses Kafka transactions to provide exactly-once semantics. Whenever you use transactional writes to Kafka, consider the following (a tuning sketch follows after this list):

      * decrease the number of max concurrent checkpoints
      * make checkpoints more reliable (so that they complete faster)
      * increase the delay between checkpoints
      * increase the size of the FlinkKafkaInternalProducer pool

    Do not forget to set the required isolation.level (read_committed, or read_uncommitted, which is the default) for any application that reads the Kafka records.
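
A minimal sketch of the checkpoint-tuning knobs from the list above, assuming env is the job's StreamExecutionEnvironment; the concrete values are illustrative assumptions, not recommendations:

val checkpointConf = env.getCheckpointConfig
checkpointConf.setMaxConcurrentCheckpoints(1)            // keep concurrent checkpoints below the producer pool size
checkpointConf.setCheckpointTimeout(10 * 60 * 1000)      // give each checkpoint enough time to complete
checkpointConf.setMinPauseBetweenCheckpoints(30 * 1000)  // increase the delay between consecutive checkpoints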

Precautions

1. Semantic.EXACTLY_ONCE depends on the downstream system supporting transactional operations. Taking Kafka 0.11 as an example:

transaction.max.timeout.ms   (broker) the maximum allowed transaction timeout, 15 minutes by default; to use exactly-once semantics this value usually needs to be increased
isolation.level              (downstream consumer) must be set to read_committed for exactly-once semantics; the default is read_uncommitted (see the official documentation for details)
transaction.timeout.ms       (producer) defaults to 1 hour in the Flink Kafka producer

Note 1: in Semantic.EXACTLY_ONCE mode, each FlinkKafkaProducer011 instance uses a fixed-size pool of KafkaProducers; each checkpoint uses one producer from this pool. If the number of concurrent checkpoints exceeds the pool size, FlinkKafkaProducer011 throws an exception and the whole application fails. Configure the maximum pool size and the maximum number of concurrent checkpoints accordingly.

Note 2: Semantic.EXACTLY_ONCE takes all possible measures not to leave behind any lingering transactions, which would otherwise block consumers from reading further from the Kafka topic. However, if the Flink application fails before the first checkpoint, then after restarting there is no information in the system about the previous pool size. Therefore, scaling a Flink application down by more than FlinkKafkaProducer011.SAFE_SCALE_DOWN_FACTOR before the first checkpoint completes is unsafe.

//1. Cap the number of concurrent checkpoints so it cannot exceed the size of the producer pool
env.getCheckpointConfig.setMaxConcurrentCheckpoints(5)
//2. Configure the producer's fault-tolerance settings
// transactional (exactly-once) writes require acks=all
producerConfig.put(ProducerConfig.ACKS_CONFIG, "all")
// transaction timeout: Flink's default is 1 hour, but it must not exceed the broker's
// transaction.max.timeout.ms (15 minutes by default), so either lower it or raise the broker setting
producerConfig.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "900000")

//3. In the downstream Kafka consumer's config file or code, set ISOLATION_LEVEL_CONFIG to read_committed
//Note: this must be specified on the downstream consumer; setting it on the current job's consumer has no effect
kafkaonfigs.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")

Complete example code:

package com.shufang.flink.connectors

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.flink.streaming.connectors.kafka._
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object KafkaSource01 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // checkpoint timeout:
    //env.getCheckpointConfig.setCheckpointTimeout()
    // cap the number of concurrent checkpoints
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(5)
    env.getCheckpointConfig.setCheckpointInterval(1000) // checkpoint interval for reliability (note: overridden by enableCheckpointing below)


    /**
     * To guarantee data consistency, enable Flink's consistent-checkpoint mechanism for fault tolerance
     */
    env.enableCheckpointing(60000)

    /**
     * When reading from Kafka, remember to enable checkpoints so the offset state can be reset,
     * guaranteeing consistency from the source side: the offsets on the Kafka brokers stay
     * consistent with the offsets backed up in the checkpoints
     */

    val kafkaonfigs = new Properties()

    // the Kafka bootstrap servers
    kafkaonfigs.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    // the consumer group
    kafkaonfigs.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "flinkConsumer")
    // key deserializer type
    kafkaonfigs.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    // value deserializer type
    kafkaonfigs.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    // where to start consuming when there is no committed offset
    //    kafkaonfigs.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")


    /**
     * Create the kafkaConsumer; you can also specify where it starts consuming.
     * After enabling Flink's checkpoints, also enable committing the Kafka offsets on checkpoints
     * via kafkaConsumer.setCommitOffsetsOnCheckpoints(true);
     * once this is enabled, a previously configured auto.commit.enable = true becomes ineffective
     */
    val kafkaConsumer = new FlinkKafkaConsumer[String](
      "console-topic",
      new SimpleStringSchema(), // this schema maps the Kafka records to Flink's String type
      kafkaonfigs
    )

    // commit the Kafka offsets as part of the checkpointed state
    kafkaConsumer.setCommitOffsetsOnCheckpoints(true)

    //    kafkaConsumer.setStartFromEarliest()//
    //    kafkaConsumer.setStartFromTimestamp(1010003794)
    //    kafkaConsumer.setStartFromLatest()
    //    kafkaConsumer.setStartFromSpecificOffsets(Map[KafkaTopicPartition,Long]()

    // add the Kafka source
    val kafkaStream: DataStream[String] = env.addSource(kafkaConsumer)

    kafkaStream.print()

    val sinkStream: DataStream[String] = kafkaStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.seconds(5)) {
      override def extractTimestamp(element: String): Long = {
        element.split(",")(1).toLong
      }
    })


    /**
     * Write the stream's data into the Kafka topic 'sink-topic' through the FlinkKafkaProducer API
     */
    //    val brokerList = "localhost:9092"
    val topic = "sink-topic"
    val producerConfig = new Properties()
    producerConfig.put(ProducerConfig.ACKS_CONFIG, "all") // producer ack setting: transactional (exactly-once) writes require acks=all
    producerConfig.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "7200000") // transaction timeout (2 hours here; Flink's default is 1 hour); the broker's transaction.max.timeout.ms must be at least this large

    /**
     * Create the producer; different constructors are available
     */
    val producer: FlinkKafkaProducer[String] = new FlinkKafkaProducer[String](
      topic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),
      producerConfig,
      Semantic.EXACTLY_ONCE
    )

    //    FlinkKafkaProducer.SAFE_SCALE_DOWN_FACTOR
    /** *****************************************************************************************************************
     * * Besides enabling Flink's checkpoint feature, the related settings also have to be configured.
     * * With 0.9 or 0.10 the default FlinkKafkaProducer can only guarantee at-least-once semantics;
     * * to actually satisfy at-least-once we also need to set:
     * * setLogFailuresOnly(boolean)    default: false
     * * setFlushOnCheckpoint(boolean)  default: true
     * * From the official documentation:
     * * Besides enabling Flink's checkpointing, you should also configure the setter methods setLogFailuresOnly(boolean)
     * * and setFlushOnCheckpoint(boolean) appropriately.
     * ******************************************************************************************************************/

    producer.setLogFailuresOnly(false) // default is false


    /**
     * Besides enabling Flink's checkpoints, the appropriate semantic parameter can be passed to
     * FlinkKafkaProducer011 (FlinkKafkaProducer for Kafka >= 1.0.0) to select one of three operating modes:
     * Semantic.NONE: at-most-once semantics
     * Semantic.AT_LEAST_ONCE (Flink's default setting)
     * Semantic.EXACTLY_ONCE: uses Kafka transactions to provide exactly-once semantics; whenever you write to Kafka
     * transactionally, do not forget to set isolation.level (read_committed or read_uncommitted, the latter being
     * the default) for any application that reads those Kafka records
     */

    sinkStream.addSink(producer)

    env.execute("kafka source & sink")
  }
}
