Is there a way to read from specific offset in a Kafka stream from a Spark streaming job?

abhishek :

I am trying to commit offsets from my Spark streaming job to Kafka using the following:

OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

// some time later, after outputs have completed
((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);

as I got from this question:

Spark DStream from Kafka always starts at beginning

And this works fine; offsets are being committed. However, the problem is that the commit is asynchronous, which means that even after two more offset commits have been sent down the line, Kafka may still hold the offset from two commits earlier. If the consumer crashes at that point and I bring it back up, it starts reading messages that have already been processed.

Now, from other sources, like the comments section here:

https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o

I understood that there is no way to commit offsets synchronously from a Spark streaming job (though there is one if you use Kafka Streams). People instead suggest keeping the offsets in the same database where you persist the end result of your computations on the stream.
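The idea behind that suggestion can be sketched without Spark or Kafka at all. Below is a minimal, plain-Java illustration, where a HashMap stands in for the real database and the class and method names are hypothetical: results and the last processed offset are written to the same store, so that in a real system they could be updated in one transaction, and a restart reads the stored offset back to resume exactly where processing stopped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetStoreSketch {
    // Stand-ins for a real database. Because results and offsets live in the
    // same store, a real implementation could update both in one transaction.
    static final Map<String, List<String>> resultsTable = new HashMap<>();
    static final Map<String, Long> offsetsTable = new HashMap<>();

    // Persist a batch's results together with the offset to resume from.
    static void processBatch(String topicPartition, List<String> records, long lastOffset) {
        resultsTable.computeIfAbsent(topicPartition, k -> new ArrayList<>()).addAll(records);
        offsetsTable.put(topicPartition, lastOffset + 1); // next offset to read
    }

    // On restart, look up the stored offset (0 if nothing was stored yet).
    static long startingOffset(String topicPartition) {
        return offsetsTable.getOrDefault(topicPartition, 0L);
    }

    public static void main(String[] args) {
        processBatch("topic_name-0", List.of("a", "b", "c"), 2L);
        System.out.println(startingOffset("topic_name-0")); // prints 3
    }
}
```

In a real job, `processBatch` would be replaced by a single database transaction in `foreachRDD`, and `startingOffset` by a query at startup whose result feeds the `Map<TopicPartition, Long>` described in the answer below.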

Now, my question is this: If I DO store the currently read offset in my database, how do I start reading the stream from exactly that offset the next time?

abhishek :

I researched and found the answer to my question, so I'm posting it here for anyone else who might face the same problem:

  • Make a Map with org.apache.kafka.common.TopicPartition as the key and Long as the value. The TopicPartition constructor takes two arguments: the topic name and the partition from which you will be reading. The value in the Map is the offset from which you want to start reading the stream.

    Map<TopicPartition, Long> startingOffset = new HashMap<>();
    startingOffset.put(new TopicPartition("topic_name", 0), 3332980L);

  • Read the stream contents into an appropriate JavaInputDStream, and provide the previously created Map as an argument to the ConsumerStrategies.Subscribe() method.

    final JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams, startingOffset));
