How to manage offsets when Spark Streaming consumes Kafka (3)

The previous article introduced how to handle offsets when Spark Streaming integrates with Kafka. Because the drawbacks of Spark Streaming's built-in checkpoint are quite obvious, relying on that checkpoint for failure recovery is not recommended in projects that require strong data consistency.



Spark Streaming versions after 1.3 support the direct Kafka stream. This strategy is more robust: it abandons the original receiver-based approach, which used Kafka's high-level API to save offsets automatically, and instead uses the Simple API, a lower-level interface. With it we can either rely on checkpoints for disaster recovery, or use the low-level API to obtain and manage the offsets ourselves, so that whether the program is upgraded or restarted after a failure, exactly-once semantics can be achieved on the framework side.
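Concretely, when offsets are managed by hand, the direct stream is configured through a plain kafkaParams map instead of a receiver tied to a consumer group. Below is a minimal sketch for the 0.8-style direct API used in this article; the broker address and group id are placeholders, not values from the project:

  //kafkaParams sketch; the broker address and group id are assumptions
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> "localhost:9092",        //assumed local broker
    "group.id"             -> "streaming-offset-demo", //hypothetical consumer group
    //"largest" (the default) starts from the latest offset when no offset is supplied,
    //"smallest" would start from the earliest available offset instead
    "auto.offset.reset"    -> "largest"
  )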


In this article, I will introduce how to manage Kafka offsets manually, and give the concrete code along with an explanation:


Version:

Apache Spark Streaming 2.1

Apache Kafka 0.9.0.0
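The code below uses the 0.8-compatible direct stream API (KafkaUtils.createDirectStream with StringDecoder and MessageAndMetadata), so the build needs the matching connector. A minimal sbt sketch; the exact artifact versions are assumptions chosen to match the versions above:

  //build.sbt sketch; versions are assumptions
  scalaVersion := "2.11.8"   //Spark 2.1 is commonly built against Scala 2.11
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming"           % "2.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0"
  )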




Notes on manually managing offsets:

(1) When the project starts for the first time, there is no offset in zk yet, so KafkaUtils creates the InputDStream directly. By default consumption starts from the latest offset; this behavior can be controlled through the Kafka parameters (see the kafkaParams sketch above).


(2) If it is not the first start, an offset already exists in zk, so we read the offset from zk and pass it into KafkaUtils, and consumption resumes from where it stopped last time.


(3) Inside foreachRDD, after each batch of data has been processed, update the offset stored in zk.


Note that of the three steps above, steps 1 and 2 run only once, while step 3 is executed once for every batch (see the driver sketch below).
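The sketch below ties the three steps together. It assumes the createKafkaStream and saveOffsets helpers shown later in this post are in scope; the ZooKeeper address, topic name, zk path, batch interval and processing logic are all placeholders:

  import org.I0Itec.zkclient.ZkClient
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object OffsetDemoApp {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf().setAppName("streaming-offset-to-zk")
      val ssc = new StreamingContext(sparkConf, Seconds(10))            //assumed batch interval

      val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") //see the kafkaParams sketch above
      val zkClient = new ZkClient("localhost:2181")                     //assumed zk address
      val zkOffsetPath = "/streaming/offset/demo"                       //hypothetical zk path
      val topics = Set("demo-topic")                                    //hypothetical topic

      //Steps 1 and 2: build the stream once, from the latest offset or from the offset saved in zk
      val kafkaStream = createKafkaStream(ssc, kafkaParams, zkClient, zkOffsetPath, topics)

      kafkaStream.foreachRDD { rdd =>
        //... process the batch here ...
        //Step 3: after the batch has been processed, persist its offsets back to zk
        saveOffsets(zkClient, zkOffsetPath, rdd)
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }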


Let's look at the core code of the first and second steps:


  /****
    *
    * @param ssc  StreamingContext
    * @param kafkaParams configure the parameters of kafka
    * @param zkClient zk connected client
    * @param zkOffsetPath The path of the offset in zk
    * @param topics topic to be processed
    * @return InputDStream[(String, String)] returns the input stream
    */
  def createKafkaStream(ssc: StreamingContext,
                        kafkaParams: Map[String, String],
                        zkClient: ZkClient,
                        zkOffsetPath: String,
                        topics: Set[String]): InputDStream[(String, String)] = {
    //Currently only the offsets of one topic are handled; read the offset string from zk
    val zkOffsetData = KafkaOffsetManager.readOffsets(zkClient, zkOffsetPath, topics.last)

    val kafkaStream = zkOffsetData match {
      case None => //No offset was read from zk, which means this is the first startup
        log.info("First startup: no offset found in zk, starting consumption from the latest offset by default")
        //Create DirectStream with latest offset
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      case Some(lastStopOffset) =>
        log.info("Offset read from zk, resuming consumption from where it stopped last time...")
        val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
        //Create DirectStream with offset from last stop
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, lastStopOffset, messageHandler)
    }
    kafkaStream //Return the created kafkaStream
  }



The main distinction is between the first startup, when no offset has been saved yet, and every later startup, when the offset is read back from zk.
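The KafkaOffsetManager.readOffsets helper called above is not listed in this post (the full version is in the GitHub project linked at the end). A minimal sketch of what it might look like, assuming it parses the same "partition:offset,partition:offset,..." string that saveOffsets below writes, and returns the fromOffsets map expected by createDirectStream:

  import kafka.common.TopicAndPartition
  import org.I0Itec.zkclient.ZkClient

  object KafkaOffsetManager {
    //Read the saved offset string from zk and convert it into the fromOffsets map
    //expected by KafkaUtils.createDirectStream; None means this is the first startup
    def readOffsets(zkClient: ZkClient,
                    zkOffsetPath: String,
                    topic: String): Option[Map[TopicAndPartition, Long]] = {
      //returnNullIfPathNotExists = true, so a missing node simply yields null
      val offsetsStr = zkClient.readData[String](zkOffsetPath, true)
      if (offsetsStr == null || offsetsStr.isEmpty) {
        None
      } else {
        //Stored format: partition1:offset1,partition2:offset2,...
        val fromOffsets = offsetsStr.split(",").map { pair =>
          val Array(partition, offset) = pair.split(":")
          TopicAndPartition(topic, partition.toInt) -> offset.toLong
        }.toMap
        Some(fromOffsets)
      }
    }
  }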


Then look at the code for the third step:


  /****
    * Save the offset of each batch of rdd to zk
    * @param zkClient zk connected client
    * @param zkOffsetPath offset path
    * @param rdd rdd of each batch
    */
  def saveOffsets(zkClient: ZkClient, zkOffsetPath: String, rdd: RDD[_]): Unit = {
    //Convert the rdd to Array[OffsetRange]
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    //Convert each OffsetRange into the string format stored in zk: partition1:offset1,partition2:offset2,...
    val offsetsRangesStr = offsetsRanges.map(offsetRange => s"${offsetRange.partition}:${offsetRange.untilOffset}").mkString(",")
    log.debug("Saved offsets: "+offsetsRangesStr)
    //Save the final string result to zk
    ZkUtils.updatePersistentPath(zkClient, zkOffsetPath, offsetsRangesStr)
  }





The key point is to write the offsets of each batch back into zk after the batch has been processed. For example, a topic with three partitions might end up stored in zk as a string like 0:231,1:230,2:229.



The full example has been uploaded to GitHub; interested readers can refer to this link:

https://github.com/qindongliang/streaming-offset-to-zk



Subsequent articles will cover how to gracefully shut down the streaming program in order to upgrade the application, and how the program above automatically adapts when Kafka partitions are added.

If you have any questions, you can scan the QR code and follow the WeChat public account woshigcs, and leave a message in the background for consultation. Technical debt should not be accumulated, and neither should health debt. On the road of seeking the Tao, I walk with you.
