Problems with checkpoints in Spark Streaming, and their solutions

Offset management in Spark Streaming

The offset management flow has four steps:

(1) When the Direct DStream is initialized, specify a map containing an offset for each partition of each topic, so that the Direct DStream starts reading from those positions.
- These offsets are the positions saved in step (4).
(2) Read and process the batch of messages.
(3) Store the result data after processing.
- The store-results and commit-offsets steps are drawn with dashed circles simply to emphasize that users may need an additional series of operations here to satisfy more stringent delivery semantics, such as idempotent operations or storing the results and the offsets in one atomic operation.
(4) Finally, save the offsets in an external durable store such as HBase, Kafka, HDFS, or ZooKeeper.
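
The flow above can be sketched without Spark or Kafka at all. The following is a minimal in-memory model, not a real integration: the class OffsetFlowSketch, the offsetStore map, and the runBatch method are all hypothetical names chosen for illustration. It only shows the ordering that matters, which is that the end offset is saved after the results are stored, so a restart resumes where the last batch ended.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the four steps; no Spark or Kafka is involved.
class OffsetFlowSketch {
    // Stands in for the external durable store: partition -> next offset to read.
    final Map<Integer, Long> offsetStore = new HashMap<>();
    // Stands in for the place where processed results are stored.
    final List<String> results = new ArrayList<>();

    // Runs one batch for one partition and returns how many records it processed.
    int runBatch(int partition, List<String> partitionLog, int maxRecords) {
        // Step 1: start from the saved offset, or from 0 if none was saved yet.
        long from = offsetStore.getOrDefault(partition, 0L);
        long until = Math.min(partitionLog.size(), from + maxRecords);
        // Steps 2 and 3: read and process the messages, then store the results.
        for (long offset = from; offset < until; offset++) {
            results.add(partitionLog.get((int) offset).toUpperCase());
        }
        // Step 4: save the end offset only after the results have been stored.
        offsetStore.put(partition, until);
        return (int) (until - from);
    }

    public static void main(String[] args) {
        OffsetFlowSketch job = new OffsetFlowSketch();
        List<String> log = Arrays.asList("a", "b", "c");
        job.runBatch(0, log, 2);
        job.runBatch(0, log, 2); // resumes at offset 2 instead of re-reading "a" and "b"
        System.out.println(job.results + " next offset: " + job.offsetStore.get(0));
    }
}
```

If the process died between storing the results and saving the offset, the batch would be processed again on restart, which is why the stricter semantics mentioned in step (3) need idempotent or atomic writes.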

Problems with SparkStreaming using checkpoint

When Spark Streaming processes data from Kafka, the Kafka offsets have to be managed somehow.

The official solution is checkpointing:
- A checkpoint saves the metadata and the state of each RDD produced while the Spark Streaming job runs, including the Kafka offsets, to a persistent system, usually HDFS or S3. If the program or the cluster goes down, the next startup can recover from the checkpoint, which gives the production environment 7*24 high availability. If the checkpoint is stored in HDFS, it also brings the small-files problem.

But the biggest drawback of checkpointing shows up once you change the code or configuration of your streaming program, or ship new features. You stop the old Spark Streaming program, build and package the new one, and start it. One of two things then happens:

(1) Startup fails with a deserialization exception.
(2) Startup succeeds, but the code that runs is still the code of the previous program.

Why do these two situations occur?

This is because when the checkpoint is first persisted, the whole application, including the related code, is serialized into binary files, and every restart restores from them. After you package a new program, what gets loaded is still the old serialized data, which either fails to deserialize or keeps executing the old code. Some readers may ask: in that case, can't you just delete the old checkpoint and start again? The program will start, but once the old checkpoint is gone, the new program can only begin consuming from Kafka's smallest (earliest) or largest (latest) offset, and the default is the latest. Starting from the latest offset instead of the offset where the program last stopped loses data, while starting from the earliest offset processes data twice. Either way there is a problem.
https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#upgrading-application-code
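
To make the trade-off after deleting a checkpoint concrete, here is a small sketch of how the starting position ends up being chosen. It is an illustration only: StartOffsetSketch and resolveStart are hypothetical names, not a Kafka or Spark API, and the policy strings mirror the "earliest" and "latest" values of Kafka's auto.offset.reset setting.

```java
// Hypothetical helper; the names are illustrative, not a Kafka or Spark API.
class StartOffsetSketch {
    // savedOffset is null when no offset survived, for example after deleting the checkpoint.
    static long resolveStart(Long savedOffset, String autoOffsetReset, long earliest, long latest) {
        if (savedOffset != null) {
            return savedOffset; // resume exactly where the previous run stopped
        }
        switch (autoOffsetReset) {
            case "earliest":
                return earliest; // replays messages that were already processed: duplicates
            case "latest":
                return latest;   // skips messages written while the job was stopped: loss
            default:
                throw new IllegalArgumentException("unsupported policy: " + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // Assume the previous run stopped at offset 120 and the partition now ends at 200.
        System.out.println(resolveStart(120L, "latest", 0L, 200L));
        System.out.println(resolveStart(null, "latest", 0L, 200L));
        System.out.println(resolveStart(null, "earliest", 0L, 200L));
    }
}
```

Only the first call, which has a saved offset, continues without loss or duplication. That is the motivation for storing offsets outside the checkpoint.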

In response to this problem, the Spark official website gives two solutions:

(1) Do not stop the old program; start the new program alongside it and let the two coexist for a period of time. Evaluation: data can still be lost or consumed more than once.

(2) When stopping, record the last offset, and have the new program read this offset and continue from there, so that no messages are lost. Evaluation: the official website gives no concrete instructions, only the idea: store the offsets yourself.

Your own data store

For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you’re careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics. It is also possible to use this tactic even for outputs that result from aggregations, which are typically hard to make idempotent.

Java:
// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
Map<TopicPartition, Long> fromOffsets = new HashMap<>();
for (resultSet : selectOffsetsFromYourDatabase) {
  fromOffsets.put(new TopicPartition(resultSet.string("topic"), resultSet.int("partition")), resultSet.long("offset"));
}

JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
  streamingContext,
  LocationStrategies.PreferConsistent(),
  ConsumerStrategies.<String, String>Assign(fromOffsets.keySet(), kafkaParams, fromOffsets)
);

stream.foreachRDD(rdd -> {
  OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
  
  Object results = yourCalculation(rdd);

  // begin your transaction

  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly

  // end your transaction
});

The key idea in this pseudocode: the data store supports transactions, so the results and the offsets are updated within one transaction, and the job confirms that the offsets were updated correctly.

 // begin your transaction

  // update results
  // update offsets where the end of existing offsets matches the beginning of this batch of offsets
  // assert that offsets were updated correctly

  // end your transaction
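
The "update offsets where the end of existing offsets matches the beginning of this batch" check can be sketched with a tiny in-memory sink. This is an assumption-laden model, not a database client: TransactionalSinkSketch and applyBatch are hypothetical names, a synchronized method stands in for a real transaction, and only a single partition is tracked.

```java
// Hypothetical in-memory stand-in for a transactional data store (single partition).
class TransactionalSinkSketch {
    private long total = 0;           // the aggregated result
    private long committedOffset = 0; // end offset of the last batch that was applied

    // Applies results and offset together, or nothing at all.
    // Returns false when the batch does not start where the previous one ended.
    synchronized boolean applyBatch(long fromOffset, long untilOffset, long batchSum) {
        if (fromOffset != committedOffset) {
            return false; // repeated or skipped offset range: "roll back", change nothing
        }
        total += batchSum;             // update results
        committedOffset = untilOffset; // update offsets in the same "transaction"
        return true;
    }

    synchronized long total() { return total; }

    synchronized long committedOffset() { return committedOffset; }

    public static void main(String[] args) {
        TransactionalSinkSketch sink = new TransactionalSinkSketch();
        System.out.println(sink.applyBatch(0, 10, 5));  // first delivery is applied
        System.out.println(sink.applyBatch(0, 10, 5));  // replay of the same range is rejected
        System.out.println(sink.total());
    }
}
```

Because a replayed batch is rejected instead of being added again, even a non-idempotent aggregation such as a running sum is not corrupted by redelivery.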

Several ways for SparkStreaming to manage offsets in kafka

Managing Kafka offsets in Spark Streaming means storing them somewhere in some data format. The common options are:

1. Store in kafka

Apache Spark 2.1.x with spark-streaming-kafka-0-10 uses the new Kafka consumer API, which can commit offsets asynchronously. Once you have made sure that your processed output has been stored properly, you can commit the offsets to Kafka with the commitAsync API. The new consumer API uses the consumer group id as the unique identifier under which offsets are committed.

Commit offsets to Kafka:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

Note: commitAsync() is part of the kafka-0-10 integration for Spark Streaming. The Spark documentation warns that it is still an experimental API and may change.

2. Store in zookeeper

Older Kafka consumers store their offsets in ZooKeeper. With Spark Streaming, the application has to read the offsets from ZooKeeper at startup and pass them in as the starting positions. The reference code is as follows:

step1: Initialize the ZooKeeper connection used to read offsets from ZooKeeper

val zkClientAndConnection = ZkUtils.createZkClientAndConnection(zkUrl, sessionTimeout, connectionTimeout)
val zkUtils = new ZkUtils(zkClientAndConnection._1, zkClientAndConnection._2, false)

Method for retrieving the last offsets stored in ZooKeeper for a consumer group and a list of topics:

def readOffsets(topics: Seq[String], groupId: String): Map[TopicPartition, Long] = {
  val topicPartOffsetMap = collection.mutable.HashMap.empty[TopicPartition, Long]
  val partitionMap = zkUtils.getPartitionsForTopics(topics)

  // /consumers/<groupId>/offsets/<topic>/
  partitionMap.foreach(topicPartitions => {
    val zkGroupTopicDirs = new ZKGroupTopicDirs(groupId, topicPartitions._1)
    topicPartitions._2.foreach(partition => {
      val offsetPath = zkGroupTopicDirs.consumerOffsetDir + "/" + partition
      try {
        val offsetStatTuple = zkUtils.readData(offsetPath)
        if (offsetStatTuple != null) {
          LOGGER.info("retrieving offset details - topic: {}, partition: {}, offset: {}, node path: {}", Seq[AnyRef](topicPartitions._1, partition.toString, offsetStatTuple._1, offsetPath): _*)
          topicPartOffsetMap.put(new TopicPartition(topicPartitions._1, Integer.valueOf(partition)),
            offsetStatTuple._1.toLong)
        }
      } catch {
        case e: Exception =>
          LOGGER.warn("retrieving offset details - no previous node exists:" + " {}, topic: {}, partition: {}, node path: {}", Seq[AnyRef](e.getMessage, topicPartitions._1, partition.toString, offsetPath): _*)
          topicPartOffsetMap.put(new TopicPartition(topicPartitions._1, Integer.valueOf(partition)), 0L)
      }
    })
  })

  topicPartOffsetMap.toMap
}

step2: Use the obtained offsets to initialize the Kafka Direct DStream

val fromOffsets = readOffsets(topics, groupId)
val inputDStream = KafkaUtils.createDirectStream(ssc, PreferConsistent, ConsumerStrategies.Subscribe[String,String](topics, kafkaParams, fromOffsets))

Method for persisting recoverable offsets to ZooKeeper.

Note: Kafka offsets are stored in ZooKeeper under the path /consumers/[groupId]/offsets/topic/[partitionId], and the stored value is the offset.

def persistOffsets(offsets: Seq[OffsetRange], groupId: String, storeEndOffset: Boolean): Unit = {
  offsets.foreach(or => {
    val zkGroupTopicDirs = new ZKGroupTopicDirs(groupId, or.topic)

    val acls = new ListBuffer[ACL]()
    val acl = new ACL
    acl.setId(ANYONE_ID_UNSAFE)
    acl.setPerms(PERMISSIONS_ALL)
    acls += acl

    val offsetPath = zkGroupTopicDirs.consumerOffsetDir + "/" + or.partition
    val offsetVal = if (storeEndOffset) or.untilOffset else or.fromOffset
    zkUtils.updatePersistentPath(offsetPath, offsetVal + "", JavaConversions.bufferAsJavaList(acls))

    LOGGER.debug("persisting offset details - topic: {}, partition: {}, offset: {}, node path: {}", Seq[AnyRef](or.topic, or.partition.toString, offsetVal.toString, offsetPath): _*)
  })
}
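
The node layout and the storeEndOffset choice used above can be isolated in a few lines. This is a sketch under assumptions: ZkOffsetPathSketch, offsetPath, and offsetToStore are hypothetical helpers that only build strings and pick a number, and nothing here talks to ZooKeeper.

```java
// Hypothetical helpers that mirror the path layout and offset choice; no ZooKeeper client is used.
class ZkOffsetPathSketch {
    // Builds /consumers/<groupId>/offsets/<topic>/<partition>.
    static String offsetPath(String groupId, String topic, int partition) {
        return "/consumers/" + groupId + "/offsets/" + topic + "/" + partition;
    }

    // An offset range covers fromOffset (inclusive) to untilOffset (exclusive),
    // so storing untilOffset records the first message that has NOT been read yet.
    static long offsetToStore(long fromOffset, long untilOffset, boolean storeEndOffset) {
        return storeEndOffset ? untilOffset : fromOffset;
    }

    public static void main(String[] args) {
        System.out.println(offsetPath("my-group", "events", 3));
        System.out.println(offsetToStore(100L, 150L, true));
    }
}
```

Storing the end offset after a batch succeeds means a restart continues with the next unread message, while storing the start offset makes a restart reprocess the batch.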

3. Store in hbase

DDL (rows expire after 30 days):

create 'stream_kafka_offsets', {NAME=>'offsets', TTL=>2592000}

RowKey Layout

row:              <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
column family:    offsets
qualifier:        <PARTITION_ID>
value:            <OFFSET_ID>
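
Why this row key makes "find the most recent batch" cheap can be shown with a sorted map. The sketch below is an in-memory model with hypothetical names (RowKeySketch, rowKey, latestRowKey); a TreeMap stands in for HBase's sorted rows, and floorKey stands in for a reversed scan that starts at the current time and stops at timestamp 0.

```java
import java.util.TreeMap;

// Hypothetical model: a TreeMap stands in for HBase's lexicographically sorted row keys.
class RowKeySketch {
    static String rowKey(String topic, String groupId, long batchTimeMs) {
        return topic + ":" + groupId + ":" + batchTimeMs;
    }

    // Mimics a reversed scan from <topic>:<group>:<now> down to <topic>:<group>:0
    // and returns the first row it would hit, or null if there is none.
    static String latestRowKey(TreeMap<String, ?> rows, String topic, String groupId, long nowMs) {
        String start = rowKey(topic, groupId, nowMs);
        String stop = rowKey(topic, groupId, 0L);
        String key = rows.floorKey(start); // greatest key <= start
        return (key != null && key.compareTo(stop) > 0) ? key : null;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put(rowKey("t", "g", 1500000000000L), "batch 1");
        rows.put(rowKey("t", "g", 1600000000000L), "batch 2");
        System.out.println(latestRowKey(rows, "t", "g", 1650000000000L));
    }
}
```

One caveat of this layout: the comparison is on text, so it orders rows by time only while all timestamps have the same number of digits, which holds for present-day 13-digit epoch milliseconds.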

For each batch of messages, saveOffsets() persists the last read offsets of the given Kafka topic to HBase.

/*
 Save offsets for each batch into HBase
*/
def saveOffsets(TOPIC_NAME: String, GROUP_ID: String, offsetRanges: Array[OffsetRange],
                hbaseTableName: String, batchTime: org.apache.spark.streaming.Time): Unit = {
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.addResource("src/main/resources/hbase-site.xml")
  val conn = ConnectionFactory.createConnection(hbaseConf)
  try {
    val table = conn.getTable(TableName.valueOf(hbaseTableName))
    // One row per batch: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
    val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(batchTime.milliseconds)
    val put = new Put(rowKey.getBytes)
    // One column per partition, holding the end offset of this batch
    for (offset <- offsetRanges) {
      put.addColumn(Bytes.toBytes("offsets"), Bytes.toBytes(offset.partition.toString),
        Bytes.toBytes(offset.untilOffset.toString))
    }
    table.put(put)
    table.close()
  } finally {
    conn.close() // close the connection even if the put fails
  }
}

Before starting the streaming job, first use getLastCommittedOffsets() to read from HBase the offsets saved when the job last stopped. The method handles the following scenarios when returning the Kafka topic partition offsets.

Scenario 1: The streaming job starts for the first time. The method gets the number of partitions of the given topic from ZooKeeper, sets the offset of each partition to 0, and returns.

Scenario 2: A long-running streaming job is stopped and new partitions are added to the topic. The method gets the number of partitions of the topic from ZooKeeper, keeps the offsets saved in HBase for all old partitions, and sets the offset to 0 for the new partitions.

Scenario 3: A long-running streaming job is stopped and the topic's partitions have not changed. In this case the offsets saved in HBase are used directly.

If new partitions are added to the topic while the Spark Streaming application is running, the application keeps reading only the old partitions and does not see the new ones, because the stream below is created with a fixed set of partitions (Assign). To read the data in the new partitions, you have to restart the Spark Streaming application.
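
The three scenarios reduce to one rule: use the saved offset where there is one, and 0 otherwise. The sketch below shows just that rule with plain collections. StartingOffsetsSketch and startingOffsets are hypothetical names, partitions are assumed to be numbered from 0, and the saved map stands in for whatever was read back from HBase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper covering the three scenarios; no HBase or ZooKeeper access.
class StartingOffsetsSketch {
    // currentPartitions: partition count reported for the topic right now.
    // saved: partition -> offset from the last run (empty on the first start).
    static Map<Integer, Long> startingOffsets(int currentPartitions, Map<Integer, Long> saved) {
        Map<Integer, Long> fromOffsets = new HashMap<>();
        for (int partition = 0; partition < currentPartitions; partition++) {
            // Old partitions keep their saved offset; new or never-seen partitions start at 0.
            fromOffsets.put(partition, saved.getOrDefault(partition, 0L));
        }
        return fromOffsets;
    }

    public static void main(String[] args) {
        Map<Integer, Long> saved = new HashMap<>();
        saved.put(0, 42L);
        saved.put(1, 17L);
        // Scenario 2: the topic grew from 2 to 3 partitions while the job was stopped.
        System.out.println(startingOffsets(3, saved));
    }
}
```

With an empty saved map this is scenario 1, with a matching partition count it is scenario 3, and with more current partitions than saved entries it is scenario 2.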

/* Returns the last committed offsets for all partitions of a given topic from HBase,
handling the three scenarios described above.
*/
    
def getLastCommittedOffsets(TOPIC_NAME:String,GROUP_ID:String,hbaseTableName:String,
zkQuorum:String,zkRootDir:String,sessionTimeout:Int,connectionTimeOut:Int):Map[TopicPartition,Long] ={
 
  val hbaseConf = HBaseConfiguration.create()
  val zkUrl = zkQuorum+"/"+zkRootDir
  val zkClientAndConnection = ZkUtils.createZkClientAndConnection(zkUrl,
                                                sessionTimeout,connectionTimeOut)
  val zkUtils = new ZkUtils(zkClientAndConnection._1, zkClientAndConnection._2,false)
  val zKNumberOfPartitionsForTopic = zkUtils.getPartitionsForTopics(Seq(TOPIC_NAME
                                                 )).get(TOPIC_NAME).toList.head.size
  zkClientAndConnection._1.close()
  zkClientAndConnection._2.close()
 
  //Connect to HBase to retrieve last committed offsets
  val conn = ConnectionFactory.createConnection(hbaseConf)
  val table = conn.getTable(TableName.valueOf(hbaseTableName))
  val startRow = TOPIC_NAME + ":" + GROUP_ID + ":" +
                                              String.valueOf(System.currentTimeMillis())
  val stopRow = TOPIC_NAME + ":" + GROUP_ID + ":" + 0
  val scan = new Scan()
  val scanner = table.getScanner(scan.setStartRow(startRow.getBytes).setStopRow(
                                                   stopRow.getBytes).setReversed(true))
  val result = scanner.next()
  var hbaseNumberOfPartitionsForTopic = 0 //Set the number of partitions discovered for a topic in HBase to 0
  if (result != null){
    //If the result from hbase scanner is not null, set number of partitions from hbase
    //to the number of cells
    hbaseNumberOfPartitionsForTopic = result.listCells().size()
  }

  val fromOffsets = collection.mutable.Map[TopicPartition,Long]()
 
  if(hbaseNumberOfPartitionsForTopic == 0){
    // initialize fromOffsets to beginning
    for (partition <- 0 to zKNumberOfPartitionsForTopic-1){
      fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)
    }
  } else if(zKNumberOfPartitionsForTopic > hbaseNumberOfPartitionsForTopic){
  // handle scenario where new partitions have been added to existing kafka topic
    for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1){
      val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("offsets"),
                                        Bytes.toBytes(partition.toString)))
      fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)
    }
    for (partition <- hbaseNumberOfPartitionsForTopic to zKNumberOfPartitionsForTopic-1){
      fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)
    }
  } else {
  //initialize fromOffsets from last run
    for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1 ){
      val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("offsets"),
                                        Bytes.toBytes(partition.toString)))
      fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)
    }
  }
  scanner.close()
  conn.close()
  fromOffsets.toMap
}

Once we have the offsets, we can create the Kafka Direct DStream:

val fromOffsets = getLastCommittedOffsets(topic, consumerGroupID, hbaseTableName, zkQuorum,
                                          zkKafkaRootDir, zkSessionTimeOut, zkConnectionTimeOut)
val inputDStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
                           Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets))

After the batch has been processed, call saveOffsets() to save its offsets.

/*
For each RDD in the DStream, apply a map transformation that processes each message,
then save the offsets of the batch.
*/
inputDStream.foreachRDD((rdd, batchTime) => {
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach(offset => println(offset.topic, offset.partition, offset.fromOffset,
                        offset.untilOffset))
  val newRDD = rdd.map(message => processMessage(message))
  newRDD.count() // action that forces the map to run before the offsets are saved
  saveOffsets(topic, consumerGroupID, offsetRanges, hbaseTableName, batchTime)
})

Reference code: https://github.com/gdtm86/spark-streaming-kafka-cdh511-testing

Summary

In summary, it is recommended to use ZooKeeper (zk) to maintain offsets.
