Manually maintaining Kafka offsets in Flink

Introduction

Compared with Spark: how can Flink maintain Kafka offsets in Redis the way Spark does, so that data is consumed exactly once and is neither lost nor duplicated? Anyone who has used Spark knows that after reading from Kafka you get a DStream (more precisely an InputDStream[ConsumerRecord[String, String]]) whose records already carry topic, partition, offset, key, value, timestamp and so on. To maintain the offsets you only need to run a foreach over the DStream and pick a place to store them according to the scenario; on restart you simply read the offsets back from Redis.
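
A minimal sketch of that Spark pattern, for comparison (this code is not from the original post; the Redis key prefix and connection parameters are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import redis.clients.jedis.Jedis

def saveOffsetsToRedis(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  stream.foreachRDD { rdd =>
    // one OffsetRange per topic-partition in this micro-batch
    val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val jedis = new Jedis("redis_host", 6379) // placeholder connection parameters
    offsetRanges.foreach { range =>
      jedis.hset(s"my_spark_${range.topic}", range.partition.toString, range.untilOffset.toString)
    }
    jedis.close()
  }
}
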
People using Flink for the first time will notice that what env.addSource returns is just a DataStream[String] holding the message value, with no offset information attached. So what should we do?
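
For contrast, this is roughly what a first attempt in Flink looks like (a sketch; the topic name and group id are hypothetical, the broker address is the same placeholder used later in the post): SimpleStringSchema only hands back the message value, so topic, partition and offset never reach the job.

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "kafka.brokersxxxxxxx") // placeholder, as in the post
props.setProperty("group.id", "my_group")                      // hypothetical group id
// DataStream[String]: values only, no offset information
val plainStream = env.addSource(new FlinkKafkaConsumer010[String]("my_topic", new SimpleStringSchema(), props))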

Steps

  1. Customize the FlinkKafkaConsumer010 deserialization so that it produces a KafkaDStream carrying topic, partition and offset
  2. Store the offsets in Redis (the layout is sketched right after this list)
  3. Read the offsets back from Redis on startup
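
The Redis layout used for this (one hash per topic) can be sketched with plain Jedis calls; the key prefix my_flink_ matches the code below, while the topic name, partition numbers and offsets are made up:

import redis.clients.jedis.Jedis

// field = Kafka partition number, value = last processed offset for that partition
val jedis = new Jedis("redis_host", 6379)       // placeholder connection parameters
jedis.hset("my_flink_my_topic", "0", "42137")   // partition 0 -> offset 42137 (illustrative)
jedis.hset("my_flink_my_topic", "1", "42860")   // partition 1 -> offset 42860 (illustrative)
println(jedis.hgetAll("my_flink_my_topic"))     // {0=42137, 1=42860}
jedis.close()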

Code

import java.nio.charset.StandardCharsets
import java.util._

import com.oneniceapp.bin.KafkaDStream
import my.nexus.util.StringUtils // from a private repository
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaConsumerBase}
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema
import org.slf4j.LoggerFactory
import redis.clients.jedis.Jedis

import scala.collection.JavaConversions._

1. Create a KafkaDStream case class

case class KafkaDStream(topic: String, partition: Int, offset: Long, keyMessage: String, message: String)

2. Assemble the Kafka record information into a KafkaDStream

  /**
    * Build the Kafka source
    * @param topic
    * @param groupid
    * @return
    */
  def createKafkaSource(topic: java.util.List[String], groupid: String): FlinkKafkaConsumer010[KafkaDStream] = {

    // Kafka consumer configuration
    val dataStream = new FlinkKafkaConsumer010[KafkaDStream](topic, new KeyedDeserializationSchema[KafkaDStream]() {

      override def getProducedType: TypeInformation[KafkaDStream] = TypeInformation.of(new TypeHint[KafkaDStream]() {})

      // deserialize exposes topic, partition and offset alongside the raw key/value bytes,
      // which is what lets us carry the offset downstream inside KafkaDStream
      override def deserialize(messageKey: Array[Byte], message: Array[Byte], topic: String, partition: Int, offset: Long): KafkaDStream = {
        KafkaDStream(topic, partition, offset, new String(messageKey, StandardCharsets.UTF_8), new String(message, StandardCharsets.UTF_8))
      }

      override def isEndOfStream(s: KafkaDStream) = false
    }, getKafkaProperties(groupid))

    // commit offsets back to Kafka on checkpoints (recovery itself uses the offsets in Redis)
    dataStream.setCommitOffsetsOnCheckpoints(true)

    dataStream
  }
  
  /**
    * Kafka configuration
    * @param groupId
    * @return
    */
  private def getKafkaProperties(groupId: String): Properties = {
    val kafkaProps: Properties = new Properties()
    kafkaProps.setProperty("bootstrap.servers", "kafka.brokersxxxxxxx")
    kafkaProps.setProperty("group.id", groupId)
    kafkaProps.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    kafkaProps.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    kafkaProps
  }
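
The snippets below also reference redis_host, redis_port, GetPropKey and a Logger that are never defined in the post; a minimal stand-in, assuming they come from a small config holder and a standard SLF4J logger:

// Hypothetical placeholders for the author's configuration and logger
object GetPropKey {
  val redis_host: String = "127.0.0.1" // assumed default
  val redis_port: Int = 6379           // assumed default
}
import GetPropKey.{redis_host, redis_port}
val Logger = LoggerFactory.getLogger("kafka-offset-manager") // stand-in for the undeclared Logger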

  /**
    * Get the Kafka offsets from Redis
    * @param topic
    * @return
    */
  def getSpecificOffsets(topic: java.util.ArrayList[String]): java.util.Map[KafkaTopicPartition, java.lang.Long] = {

    val specificStartOffsets: java.util.Map[KafkaTopicPartition, java.lang.Long] = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()

    for (topic <- topic) {
      val jedis = new Jedis(redis_host, redis_port)
      val key = s"my_flink_$topic"
      // one hash per topic: field = partition, value = last processed offset
      val partitions = jedis.hgetAll(key).toList
      for (partition <- partitions) {
        if (!StringUtils.isEmpty(topic) && !StringUtils.isEmpty(partition._1) && !StringUtils.isEmpty(partition._2)) {
          Logger.warn("topic:{} partition:{} offset:{}", topic.trim, partition._1.trim, partition._2.trim)
          specificStartOffsets.put(new KafkaTopicPartition(topic.trim, partition._1.trim.toInt), partition._2.trim.toLong)
        }
      }
      jedis.close()
    }
    specificStartOffsets
  }

3. Read the Kafka data in the main program

val topics = new java.util.ArrayList[String]
topics.add(myTopic)
val consumer = createKafkaSource(topics, groupId)
// resume from the offsets stored in Redis
consumer.setStartFromSpecificOffsets(getSpecificOffsets(topics))
val dataStream = env.addSource(consumer)
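
One caveat the post does not mention: on the very first run the Redis hash does not exist yet, so getSpecificOffsets returns an empty map. Flink's setStartFromSpecificOffsets already falls back to the group-offsets behaviour for partitions missing from the map, but it can also be made explicit (sketch, reusing the variables above):

val storedOffsets = getSpecificOffsets(topics)
if (storedOffsets.isEmpty) {
  consumer.setStartFromGroupOffsets()                   // first run: nothing in Redis yet
} else {
  consumer.setStartFromSpecificOffsets(storedOffsets)   // later runs: resume from Redis
}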

4. Save the offset. This is usually done in the invoke method of a custom sink, to ensure the offset is stored only after the record has been processed.

def setOffset(topic: String, partition: Int, offset: Long): Unit = {
  val jedis = new Jedis(GetPropKey.redis_host, GetPropKey.redis_port)
  val gtKey = s"my_flink_$topic"
  jedis.hset(gtKey, partition.toString, offset.toString)
  jedis.close()
}
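
A minimal sketch of the custom sink described in step 4 (the class name and the business-logic placeholder are hypothetical): the record is processed first, and only then is its offset written to Redis, so after a restart at most the records that were not fully processed are re-read.

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class OffsetTrackingSink extends RichSinkFunction[KafkaDStream] {
  override def invoke(record: KafkaDStream): Unit = {
    // 1. process / write the record to the downstream system (placeholder)
    // doBusinessLogic(record)
    // 2. only afterwards persist the offset for this topic/partition
    setOffset(record.topic, record.partition, record.offset)
  }
}

dataStream.addSink(new OffsetTrackingSink)

In a real job you would typically open the Jedis connection once in open() and close it in close() instead of per record, but the sketch follows the setOffset helper as written.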

Other

While using Flink I noticed something interesting: just like in Spark, as long as you do not repartition the data and keep the original parallelism, each Kafka partition stays mapped to the same subtask, so you do not need to worry about records from the same partition arriving out of order.
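
A small sketch to make that concrete (assuming the usual Flink Scala imports and that dataStream is the DataStream[KafkaDStream] from step 3): chained operators keep the forward partitioning, so records from one Kafka partition stay on the same subtask and keep their order, while an explicit redistribution would interleave them.

// same subtask per Kafka partition, order preserved
val parsed = dataStream.map(record => record.message)

// an explicit redistribution would spread a partition's records across subtasks:
// val shuffled = dataStream.rebalance.map(record => record.message)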

Origin blog.csdn.net/jklcl/article/details/112793970