Kafka Offset Management

1. Definitions

Each Kafka partition is an ordered, immutable sequence of messages to which new messages are continually appended. Each message in a partition is assigned a sequential number, the offset, which uniquely identifies that message within the partition.

The offset records the position of the next message to be delivered to the consumer.

The semantics of streaming systems are often captured in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.):

  1. At most once: Each record will be either processed once or not processed at all.
  2. At least once: Each record will be processed one or more times. This is stronger than at-most-once as it ensures that no data will be lost. But there may be duplicates.
  3. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.

2. Kafka Offset Management with Spark Streaming

For offset storage I propose ZooKeeper: compared with HBase and similar stores it is more lightweight, and since ZooKeeper itself runs as a high-availability (HA) cluster, the offsets are kept safer.

Offset management generally involves two steps:

  • Save offsets
  • Get offsets
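
As a rough sketch of those two steps in code (the OffsetStore name and its signatures are purely illustrative, not from any library), assuming the Kafka 0.8 client used elsewhere in this article:

import kafka.common.TopicAndPartition

//Illustrative interface for the two steps; any backing store (ZooKeeper, HBase, a database) works.
trait OffsetStore {
  //Step 1: persist the offsets processed so far, once per batch
  def saveOffsets(groupId: String, offsets: Map[TopicAndPartition, Long]): Unit
  //Step 2: read the previously saved offsets when the streaming job (re)starts
  def getOffsets(groupId: String, topic: String): Map[TopicAndPartition, Long]
}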

3. Environment Setup

Start a Kafka console producer; the tests use the topic tp_kafka:

./kafka-console-producer.sh --broker-list hadoop000:9092 --topic tp_kafka

Start a Kafka console consumer:

./kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic tp_kafka

Produce data from IDEA:

package com.taipark.spark;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

import java.util.Properties;
import java.util.UUID;

public class KafkaApp {

    public static void main(String[] args) {
        String topic = "tp_kafka";

        Properties props = new Properties();
        props.put("serializer.class","kafka.serializer.StringEncoder");
        props.put("metadata.broker.list","hadoop000:9092");
        props.put("request.required.acks","1");
        props.put("partitioner.class","kafka.producer.DefaultPartitioner");
        Producer<String,String> producer = new Producer<>(new ProducerConfig(props));

        for(int index = 0;index <100; index++){
            KeyedMessage<String, String> message = new KeyedMessage<>(topic, index + "", "taipark" + UUID.randomUUID());
            producer.send(message);
        }
        producer.close();  // release the producer before exiting
        System.out.println("Data production finished");

    }
}
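
The producer above uses the old kafka.javaapi.producer API from Kafka 0.8. For newer brokers, a roughly equivalent sketch with the org.apache.kafka.clients.producer API (assuming a recent kafka-clients dependency; this is not part of the original project) could look like this:

import java.util.{Properties, UUID}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaNewApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "hadoop000:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "1")

    val producer = new KafkaProducer[String, String](props)
    for (index <- 0 until 100) {
      //same topic and payload shape as the Java example above
      producer.send(new ProducerRecord[String, String]("tp_kafka", index.toString, "taipark" + UUID.randomUUID()))
    }
    producer.close()  //flush pending records before exiting
    println("Data production finished")
  }
}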

4. Offset Management Approach 1: smallest

Spark Streaming connects to Kafka and counts the records:

package com.taipark.spark.offset

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")
    val ssc = new StreamingContext(sparkConf,Seconds(10))

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)

    messages.foreachRDD(rdd=>{
      if(!rdd.isEmpty()){
        println("Taipark" + rdd.count())
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

}

Produce 100 records in Kafka -> Spark Streaming receives them:

But if Spark Streaming is stopped and then restarted:

You will find that it starts counting from the beginning again, because the code sets auto.offset.reset to smallest (the value used before Kafka 0.10.1.X).

5. Offset Management Approach 2: checkpoint

Create an /offset directory in HDFS:

hadoop fs -mkdir /offset

Use checkpointing:

package com.taipark.spark.offset

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    val checkpointDirectory = "hdfs://hadoop000:8020/offset/"
    def functionToCreateContext():StreamingContext = {
      val ssc = new StreamingContext(sparkConf,Seconds(10))
      val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)
      //set up checkpointing
      ssc.checkpoint(checkpointDirectory)
      messages.checkpoint(Duration(10*1000))

      messages.foreachRDD(rdd=>{
        if(!rdd.isEmpty()){
          println("Taipark" + rdd.count())
        }
      })

      ssc
    }
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,functionToCreateContext _)

    ssc.start()
    ssc.awaitTermination()
  }

}

Note: to change the HDFS user in IDEA, add the following under VM options in the run configuration:

-DHADOOP_USER_NAME=hadoop

First start:

It consumes the 100 records produced earlier. Now stop the job, produce another 100 records, and start it again:

This time it reads only the 100 records produced between the stop and the restart, instead of re-reading everything from the earliest offset as smallest does.

However, checkpointing has a problem for offset management: once the business logic changes, the modified code takes no effect after a restart, because the application is recovered from the old checkpoint via getOrCreate().

6. Offset Management Approach 3: manual offset management

The idea:

  1. Create the StreamingContext.
  2. Fetch data from Kafka <== get the offsets.
  3. Process the business logic.
  4. Write the processing results to external storage ==> save the offsets.
  5. Start the program and wait for the threads to terminate.

package com.taipark.spark.offset

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Offset01App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset01App")
    val ssc = new StreamingContext(sparkConf,Seconds(10))


    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "hadoop000:9092",
      "auto.offset.reset" -> "smallest"
    )
    val topics = "tp_kafka".split(",").toSet
    //fetch previously saved offsets from external storage
    val fromOffsets = Map[TopicAndPartition,Long]()

    val messages = if(fromOffsets.size == 0){  //no saved offsets: consume from the beginning
      KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)
    }else{  //consume from the saved offsets
      val messageHandler = (mm:MessageAndMetadata[String,String]) => (mm.key,mm.message())
      KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder,(String,String)](ssc,kafkaParams,fromOffsets,messageHandler)
    }

    messages.foreachRDD(rdd=>{
      if(!rdd.isEmpty()){
        //business logic
        println("Taipark" + rdd.count())

        //save the offsets to external storage
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        offsetRanges.foreach(x =>{
          //this is the information that would be written to external storage
          println(s"${x.topic} ${x.partition} ${x.fromOffset} ${x.untilOffset}")
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

}
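
Where the code above only prints the offset ranges, a minimal sketch of actually saving and reading them in ZooKeeper (as proposed in section 2) might look like the following, assuming the Apache Curator client is on the classpath; the /offset znode layout is illustrative, not from the original article:

import kafka.common.TopicAndPartition
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

import scala.collection.JavaConverters._

object ZkOffsetStore {
  //znode layout (illustrative): /offset/<topic>/<partition> -> offset stored as a string
  private val client = CuratorFrameworkFactory.newClient(
    "hadoop000:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  //save offsets: called at the end of every batch
  def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = {
    offsets.foreach { case (tp, offset) =>
      val path = s"/offset/${tp.topic}/${tp.partition}"
      if (client.checkExists().forPath(path) == null) {
        client.create().creatingParentsIfNeeded().forPath(path, offset.toString.getBytes)
      } else {
        client.setData().forPath(path, offset.toString.getBytes)
      }
    }
  }

  //get offsets: called once on startup to build fromOffsets
  def getOffsets(topic: String): Map[TopicAndPartition, Long] = {
    val base = s"/offset/$topic"
    if (client.checkExists().forPath(base) == null) {
      Map[TopicAndPartition, Long]()
    } else {
      client.getChildren.forPath(base).asScala.map { partition =>
        TopicAndPartition(topic, partition.toInt) ->
          new String(client.getData.forPath(s"$base/$partition")).toLong
      }.toMap
    }
  }
}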
The order in which the results and the offsets are saved matters:

  • Saving the offsets first and the data afterwards may cause data loss (at-most-once).
  • Saving the data first and the offsets afterwards may lead to duplicates (at-least-once).

Solution 1: Idempotence

In programming, an operation is idempotent if performing it any number of times has the same effect as performing it once.
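
A minimal sketch of an idempotent sink, keying each write by (topic, partition, offset) so that reprocessing overwrites the same entry instead of producing a duplicate (the IdempotentSink name and in-memory map are purely illustrative):

import scala.collection.mutable

object IdempotentSink {
  //the mutable Map stands in for an external key-value store or a table with a unique key
  private val store = mutable.Map[(String, Int, Long), String]()

  //replaying the same record writes to the same key, so no duplicates appear
  def upsert(topic: String, partition: Int, offset: Long, value: String): Unit =
    store((topic, partition, offset)) = value
}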

Solution 2: Transactions

1. A database transaction may contain one or more database operations, which together form a logical whole.

2. The operations that form this logical whole either all succeed or are not performed at all.

3. Either all of the operations that make up the transaction take effect on the database, or none of them does; in other words, whether or not the transaction succeeds, the database always remains in a consistent state.

4. The above still holds in the presence of failures and of concurrent transactions in the database.

By writing the business results and the offsets in a single transaction, each record takes effect exactly once.
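
As a rough sketch of that idea, assuming a JDBC-accessible database with illustrative results and kafka_offsets tables (the table names, columns, and SQL are not from the original article):

import java.sql.DriverManager

object TransactionalSink {
  //write one batch's result and its offsets in a single transaction:
  //either both are committed, or neither takes effect
  def saveBatchAtomically(jdbcUrl: String, count: Long,
                          topic: String, partition: Int, untilOffset: Long): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      conn.setAutoCommit(false)

      val insertResult = conn.prepareStatement("INSERT INTO results(cnt) VALUES (?)")
      insertResult.setLong(1, count)
      insertResult.executeUpdate()

      val updateOffset = conn.prepareStatement(
        "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND partition_id = ?")
      updateOffset.setLong(1, untilOffset)
      updateOffset.setString(2, topic)
      updateOffset.setInt(3, partition)
      updateOffset.executeUpdate()

      conn.commit()       //both writes take effect together
    } catch {
      case e: Exception =>
        conn.rollback()   //neither write takes effect
        throw e
    } finally {
      conn.close()
    }
  }
}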

7. auto.offset.reset in Kafka 0.10.1.X and later versions:

earliest If a committed offset exists for a partition, consume from the committed offset; if there is no committed offset, consume from the beginning.
latest If there is no committed offset for a partition, consume only the data newly produced to that partition; if a committed offset exists, consume from the committed offset.
none If every partition of the topic has a committed offset, consume from the committed offsets; if any partition has no committed offset, throw an exception.
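
For these newer versions, here is a minimal sketch (not from the original article) using the spark-streaming-kafka-0-10 integration and the new consumer API; the group.id is made up, and committing offsets back to Kafka with commitAsync is just one of several possible offset stores:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Offset02App {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Offset02App")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "hadoop000:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "tp_kafka_group",                      //hypothetical consumer group
      "auto.offset.reset" -> "earliest",                   //earliest / latest / none, as described above
      "enable.auto.commit" -> (false: java.lang.Boolean)   //commit manually after processing
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("tp_kafka"), kafkaParams))

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        println("Taipark" + rdd.count())                                 //business logic
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)  //save offsets in Kafka itself
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}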

 

Origin blog.csdn.net/qq_36329973/article/details/104825902