Exactly Once Data Processing with Amazon Kinesis and Spark Streaming

The Kinesis Client Library provides convenient abstractions for interacting with Amazon Kinesis. Consumer checkpoints are automatically tracked in DynamoDB (Kinesis checkpointing) and it’s easy to spawn workers to consume data from each shard (Kinesis term for a partition) in parallel. For those unfamiliar with checkpointing in streaming applications, it is the process of tracking which messages have been successfully read from the stream.

Spark Streaming implements a receiver using the Kinesis Client Library to read messages from Kinesis. Spark also provides a utility called checkpointing (Spark checkpointing; not to be confused with Kinesis checkpointing in DynamoDB) which helps make applications fault-tolerant. Using Spark checkpointing in combination with Kinesis checkpointing provides at-least-once semantics.

When we tried to implement the recommended solution using Spark checkpointing, it was very difficult to develop any code without breaking our checkpoints. When Spark saves a checkpoint, it serializes the classes that define the transformations and later uses that serialized state to restart a stopped stream. If you then change the structure of one of the transformation classes, the checkpoints become invalid and cannot be used for recovery. (There are ways to make code changes without breaking your application’s checkpoints; however, in my opinion they add unnecessary complexity and risk to the development process, as cited in this example.) This challenge, combined with a sub-optimal at-least-once guarantee, led us to abandon Spark checkpointing and pursue a simpler, albeit somewhat hacky, alternative.

Every message sent to Kinesis is given a partition key. The partition key determines the shard to which the message is assigned. Once a message is assigned to a shard, it is given a sequence number. Sequence numbers within a shard are unique and increase over time. (If the producer is leveraging message aggregation, it is possible for multiple consumed messages to have the same sequence number.)

When starting up a Spark Streaming Kinesis application, there are two positions from which you can start consuming the stream: InitialPositionInStream.LATEST, which consumes only messages received while your application is running, and InitialPositionInStream.TRIM_HORIZON, which consumes from the oldest message in the stream’s retention window (24 hours by default). The subtlety here is that if a kinesis-checkpoint table is present, the stream will start consuming from the existing offsets even when TRIM_HORIZON is specified.

Setting the state of DynamoDB before starting the streaming application will ensure the KCL workers start consuming from the last successfully saved offset. To do that we’ll need the shard-id and the sequence number of the last message successfully persisted from that shard. There are a couple of ways offset information could be maintained. We opted to persist offsets as part of the message itself on HDFS. This approach has a small amount of disk overhead and doesn’t require additional external dependencies. A similar approach using ZooKeeper is described here.

The record processor used by the KCL workers is a function (Record => T). We can cast Record to a UserRecord, which provides the partition key, sequence number, and sub-sequence number (necessary when a producer uses message aggregation).

import com.amazonaws.services.kinesis.clientlibrary.types.UserRecord
import com.amazonaws.services.kinesis.model.Record

case class Message(var shardId: Option[String], partitionKey: String, seqNum: String, subSeqNum: String, json: String)

private val messageHandler: Record => Message = { record =>
  val partitionKey = record.getPartitionKey
  val seqNum = record.getSequenceNumber
  // the sub-sequence number distinguishes messages that share a sequence number after aggregation
  val subSeqNum = record.asInstanceOf[UserRecord].getSubSequenceNumber.toString

  // copy the payload out of the ByteBuffer
  val data = record.getData
  val json = new Array[Byte](data.remaining)
  data.get(json)

  // note that we do not assign a shard-id, it must be calculated later
  Message(None, partitionKey, seqNum, subSeqNum, new String(json, "UTF-8"))
}

The new problem is that we have no way to know which shard on the stream a message came from. The KCL assigns shards to the workers under the covers and the shard-id is not exposed. Luckily, we can use the partition key to determine the shard-id.

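// Returns a new stream in which every message carries the id of the shard that owns
// its partition key. A map is used because mutating objects inside rdd.foreach does
// not change the data seen by later transformations.
def setShardId(stream: DStream[Message], shards: List[Shard]): DStream[Message] = {
  stream.map { msg =>
    // fail only when no shard at all matches, not on the first shard that does not match
    shards.find(shard => keyIsInShard(msg.partitionKey, shard)) match {
      case Some(shard) => msg.copy(shardId = Some(shard.getShardId))
      case None => throw new Exception(
        s"PartitionKey: ${msg.partitionKey} does not " +
        s"fall within shardList: ${shards.toString}")
    }
  }
}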

import java.math.BigInteger
import java.security.MessageDigest

def keyIsInShard(partitionKey: String, shard: Shard): Boolean = {
  // Kinesis maps a partition key to a 128-bit integer by taking its MD5 hash
  val keyAsBytes = partitionKey.getBytes("UTF-8")
  val md5 = MessageDigest.getInstance("MD5").digest(keyAsBytes)
  val hashedPartitionKey: BigInteger = new BigInteger(1, md5)

  val startingHashKey: BigInteger = new BigInteger(shard.getHashKeyRange.getStartingHashKey)
  val endingHashKey: BigInteger = new BigInteger(shard.getHashKeyRange.getEndingHashKey)

  // the range is inclusive at both ends; the comparison itself is the return value
  startingHashKey.compareTo(hashedPartitionKey) <= 0 &&
    endingHashKey.compareTo(hashedPartitionKey) >= 0
}
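The hash-range check can be tried without the AWS SDK. The following is a minimal sketch, not the code we run in production: HashRange is a hypothetical stand-in for the SDK's Shard, and the two-shard layout that splits the 128-bit key space in half is an assumption made purely for illustration.

```scala
import java.math.BigInteger
import java.security.MessageDigest

// Hypothetical stand-in for the AWS SDK's Shard: an id plus an inclusive hash key range.
final case class HashRange(shardId: String, start: BigInteger, end: BigInteger)

object ShardLookup {
  // MD5 yields 16 bytes; signum 1 makes BigInteger read them as an unsigned 128-bit value.
  def hashKey(partitionKey: String): BigInteger =
    new BigInteger(1, MessageDigest.getInstance("MD5").digest(partitionKey.getBytes("UTF-8")))

  // True when the hashed key lies inside the range, both ends inclusive.
  def contains(range: HashRange, partitionKey: String): Boolean = {
    val h = hashKey(partitionKey)
    range.start.compareTo(h) <= 0 && range.end.compareTo(h) >= 0
  }

  // Assumed layout: two shards that split the 128-bit key space in half.
  val maxHash: BigInteger = BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE)
  val mid: BigInteger = BigInteger.ONE.shiftLeft(127)
  val shards: List[HashRange] = List(
    HashRange("shardId-000000000000", BigInteger.ZERO, mid.subtract(BigInteger.ONE)),
    HashRange("shardId-000000000001", mid, maxHash)
  )

  // The first (and, for non-overlapping ranges, only) shard that owns the key.
  def shardFor(partitionKey: String): Option[String] =
    shards.find(contains(_, partitionKey)).map(_.shardId)
}
```

Because the ranges cover the whole key space without overlapping, every partition key resolves to exactly one shard id, which is the property the lookup relies on.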

Now that each record is explicitly linked with the shard to which it was assigned, we can persist this information on HDFS along with the message.

// persist the messages and their offset information to hdfs
def saveMessages(stream: DStream[Message], spark: SparkSession): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // obtain the session from the RDD's own context inside the closure
    val session = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import session.implicits._

    // coalesce can be tuned to yield an ideal file size;
    // time.milliseconds keeps the path free of the " ms" suffix that Time.toString adds
    rdd.coalesce(1).toDF().write.parquet(s"/data/${time.milliseconds}")
  }
}

In the event of a failure, we load the max saved sequence number per shard…

def getMaxSavedOffsets(spark: SparkSession, streamName: String): Map[String, String] = {
  val df = spark.read.parquet("/data/*")
  df.createOrReplaceTempView(streamName)

  // Columns are qualified with aliases because both sides of the join expose shardId and seqNum.
  // max(seqNum) compares strings, which matches numeric order only while all
  // sequence numbers have the same number of digits.
  val queryStr: String =
    s"""
       |select b.shardId, b.seqNum, max(cast(b.subSeqNum as bigint)) as subSeqNum from
       | (select shardId, max(seqNum) as seqNum from ${streamName} group by shardId) as a
       |inner join ${streamName} as b on b.shardId = a.shardId and b.seqNum = a.seqNum
       |group by b.shardId, b.seqNum
     """.stripMargin

  spark.sql(queryStr).collect.map { record =>
    val shardId = record.getString(0)
    val seqNum = record.getString(1)
    val subSeq = record.getLong(2).toString
    (shardId, s"${seqNum},${subSeq}")
  }.toMap
}
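The same "latest offset per shard" selection can be sketched on an in-memory collection, which makes the ordering explicit. This is an illustrative sketch only: Offset and maxPerShard are hypothetical names, and the sequence numbers are parsed as BigInt so that the comparison is numeric rather than lexicographic (as strings, "9" sorts after "10").

```scala
// Hypothetical in-memory form of the persisted offset columns.
final case class Offset(shardId: String, seqNum: String, subSeqNum: Long)

object Offsets {
  // Latest offset per shard, ordered by (seqNum, subSeqNum) with seqNum compared numerically.
  def maxPerShard(offsets: Seq[Offset]): Map[String, Offset] =
    offsets.groupBy(_.shardId).map { case (shardId, group) =>
      shardId -> group.maxBy(o => (BigInt(o.seqNum), o.subSeqNum))
    }
}
```

For example, given offsets ("s0", "9", 0), ("s0", "10", 0) and ("s0", "10", 3), the entry selected for s0 is the one with sequence number 10 and sub-sequence number 3.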

and then set the state of DynamoDB before restarting the stream so that the application will start consuming at the latest offset.

import scala.collection.JavaConverters._

val shardOffsets = Utils.getMaxSavedOffsets(spark, streamName)

def setDynamoState(tableName: String, shardOffsets: Map[String, String]): Unit = {

  // init dynamo and lease manager
  val dynamo = new AmazonDynamoDBClient(
    new DefaultAWSCredentialsProviderChain())
  val leaseManager = new KinesisClientLeaseManager(tableName, dynamo)

  // make sure the lease table exists, then delete any existing offsets
  leaseManager.createLeaseTableIfNotExists(10L, 10L)
  leaseManager.waitUntilLeaseTableExists(10L, 30L)
  leaseManager.deleteAll()

  // write current offsets to dynamo
  val writeRequests: java.util.List[WriteRequest] = shardOffsets.map { case (shardId, offset) =>
    // offsets are encoded as "seqNum,subSeqNum" by getMaxSavedOffsets
    val Array(seqNum, subSeqNum) = offset.split(",")

    new WriteRequest(
      new PutRequest(
        Map(
          "leaseKey" -> new AttributeValue().withS(shardId),
          "checkpoint" -> new AttributeValue().withS(seqNum),
          "checkpointSubSequenceNumber" -> new AttributeValue().withN(subSeqNum),
          "ownerSwitchesSinceCheckpoint" -> new AttributeValue().withN("0"),
          "leaseOwner" -> new AttributeValue().withS(s"${tableName}-init"),
          "leaseCounter" -> new AttributeValue().withN("0")
        ).asJava
      ))
  }.toList.asJava

  // note: batchWriteItem accepts at most 25 requests per call
  dynamo.batchWriteItem(new BatchWriteItemRequest().withRequestItems(Map(tableName -> writeRequests).asJava))
}
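One practical limit is worth keeping in mind when writing the offsets: DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call, so a stream with more than 25 shards would need its lease rows written in several calls. A minimal sketch of the chunking step follows; Batching and batches are hypothetical helper names and are not part of the AWS SDK.

```scala
object Batching {
  // BatchWriteItem accepts at most 25 put or delete requests per call.
  val MaxBatchSize = 25

  // Splits any sequence of pending writes into chunks that each fit in one call.
  def batches[A](requests: Seq[A]): List[Seq[A]] =
    requests.grouped(MaxBatchSize).toList
}
```

Each chunk would then be sent in its own batchWriteItem call. A more careful version would presumably also inspect the unprocessed items reported in each result and retry them.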

The major benefit of this approach is that data from Kinesis can be processed exactly once using Spark Streaming. This process could be easily modified to work with Kafka as well. If you’re interested in data solutions like this one, follow us @b23llc and visit us at http://www.b23.io/.

About the Author: David Kegley is a Data Engineer with B23 LLC working with our clients to provide scalable cloud data solutions. He has a B.S. in Computer Science from James Madison University and enjoys listening to live music and spending time outdoors in his free time.

ref:https://medium.com/@b23llc/exactly-once-data-processing-with-amazon-kinesis-and-spark-streaming-7e7f82303e4
