Maintaining Kafka offsets manually in Flink
Introduction
Compared with Spark, how can Flink keep Kafka offsets in Redis the way Spark jobs often do, so that each record is consumed only once and nothing is lost or repeated? Anyone who has used Spark knows that after Spark reads from Kafka, each element of the DStream (more precisely, InputDStream[ConsumerRecord[String, String]]) carries the topic, partition, offset, key, value, timestamp and so on. To maintain offsets you only need a foreach over the DStream and a storage location chosen to fit your scenario. On restart, you simply read the offsets back from Redis.
People using Flink for the first time will find that env.addSource typically returns a DataStream[String] (for example with SimpleStringSchema): only the value is left once the record has been read. So what can be done?
Steps
- Customize FlinkKafkaConsumer010 so that it emits KafkaDStream records
- Store the offset in Redis
- Read the offset back from Redis on startup
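Before the Flink code, the three steps can be sketched without Flink or Redis. This is only an illustration under stated assumptions: Record mirrors the KafkaDStream class introduced later, and OffsetStore is a hypothetical in-memory stand-in for the Redis hash. The sketch resumes one past the last processed offset, because Flink treats a specific start offset as the next record to read.

```scala
import scala.collection.mutable

// Mirrors the article's KafkaDStream: the value plus its Kafka coordinates.
final case class Record(topic: String, partition: Int, offset: Long, value: String)

// Hypothetical stand-in for the Redis hash "my_flink_<topic>":
// partition -> offset of the last processed record.
final class OffsetStore {
  private val hashes = mutable.Map.empty[String, mutable.Map[Int, Long]]

  def save(r: Record): Unit =
    hashes.getOrElseUpdate(r.topic, mutable.Map.empty[Int, Long]).update(r.partition, r.offset)

  // Offsets to resume from: one past the last processed record of each partition.
  def resumeFrom(topic: String): Map[Int, Long] =
    hashes.get(topic) match {
      case Some(h) => h.map { case (p, o) => p -> (o + 1) }.toMap
      case None    => Map.empty[Int, Long]
    }
}

object OffsetFlowDemo {
  def main(args: Array[String]): Unit = {
    val store = new OffsetStore
    val batch = Seq(
      Record("orders", 0, 41L, "a"),
      Record("orders", 1, 7L, "b"),
      Record("orders", 0, 42L, "c")
    )
    // Process first, then record the offset, so a crash never skips a record.
    batch.foreach { r =>
      println(s"processing ${r.value}")
      store.save(r)
    }
    println(store.resumeFrom("orders"))
  }
}
```

Saving after processing means a crash between the two steps replays the last record rather than losing it, so downstream writes should tolerate a repeat.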
Code
import java.nio.charset.StandardCharsets
import java.util._
import com.oneniceapp.bin.KafkaDStream
import my.nexus.util.StringUtils // from a private repository
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaConsumerBase}
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema
import org.slf4j.LoggerFactory
import redis.clients.jedis.Jedis
import scala.collection.JavaConversions._
1. Create the KafkaDStream case class
// One Kafka record together with the coordinates needed to track its offset
case class KafkaDStream(topic: String, partition: Int, offset: Long, keyMessage: String, message: String)
2. Build the Kafka source that emits KafkaDStream
/**
  * Builds the Kafka consumer.
  * @param topic   topics to subscribe to
  * @param groupid consumer group id
  * @return a consumer whose records carry topic, partition and offset
  */
def createKafkaSource(topic: java.util.List[String], groupid: String): FlinkKafkaConsumer010[KafkaDStream] = {
  // Kafka consumer with a schema that keeps the record metadata
  val dataStream = new FlinkKafkaConsumer010[KafkaDStream](topic, new KeyedDeserializationSchema[KafkaDStream]() {
    override def getProducedType: TypeInformation[KafkaDStream] =
      TypeInformation.of(new TypeHint[KafkaDStream]() {})

    override def deserialize(messageKey: Array[Byte], message: Array[Byte], topic: String, partition: Int, offset: Long): KafkaDStream = {
      // Kafka records may have no key (or no value), so guard against null before decoding
      val key = if (messageKey == null) null else new String(messageKey, StandardCharsets.UTF_8)
      val value = if (message == null) null else new String(message, StandardCharsets.UTF_8)
      KafkaDStream(topic, partition, offset, key, value)
    }

    override def isEndOfStream(s: KafkaDStream) = false
  }, getKafkaProperties(groupid))
  // Also commit offsets back to Kafka on checkpoints (only takes effect when checkpointing is enabled)
  dataStream.setCommitOffsetsOnCheckpoints(true)
  dataStream
}
/**
  * Kafka configuration.
  * @param groupId consumer group id
  * @return properties for the Kafka consumer
  */
private def getKafkaProperties(groupId: String): Properties = {
  val kafkaProps: Properties = new Properties()
  kafkaProps.setProperty("bootstrap.servers", "kafka.brokersxxxxxxx")
  kafkaProps.setProperty("group.id", groupId)
  kafkaProps.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaProps.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  kafkaProps
}
/**
  * Reads the Kafka offsets stored in Redis.
  * @param topics topics whose offsets should be restored
  * @return start offset for every partition found in Redis
  */
def getSpecificOffsets(topics: java.util.ArrayList[String]): java.util.Map[KafkaTopicPartition, java.lang.Long] = {
  val logger = LoggerFactory.getLogger("KafkaOffsets")
  val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
  // One connection for all topics, using the same GetPropKey settings as setOffset
  val jedis = new Jedis(GetPropKey.redis_host, GetPropKey.redis_port)
  try {
    for (topic <- topics) {
      val key = s"my_flink_$topic"
      // Hash layout: field = partition, value = last processed offset
      for ((partition, offset) <- jedis.hgetAll(key).toList) {
        if (!StringUtils.isEmpty(topic) && !StringUtils.isEmpty(partition) && !StringUtils.isEmpty(offset)) {
          logger.warn("topic: {}, partition: {}, offset: {}", topic.trim, partition.trim, offset.trim)
          // Flink starts reading at the given offset, so resume one past the last processed record
          specificStartOffsets.put(new KafkaTopicPartition(topic.trim, partition.trim.toInt), offset.trim.toLong + 1)
        }
      }
    }
  } finally {
    jedis.close() // release the connection even if a lookup fails
  }
  specificStartOffsets
}
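The validation inside getSpecificOffsets can be looked at in isolation. The following is a minimal sketch, assuming the Redis hash has already been read into a plain Scala Map (field = partition, value = offset); OffsetParsing and parseOffsets are hypothetical names, not part of Flink or Jedis. Entries that are blank, non-numeric or negative are dropped instead of failing the job at startup.

```scala
import scala.util.Try

object OffsetParsing {
  // Turns the field/value pairs of the Redis hash into partition -> stored offset,
  // skipping entries that are blank, non-numeric or negative.
  def parseOffsets(entries: Map[String, String]): Map[Int, Long] =
    entries.toSeq.flatMap { case (field, value) =>
      val parsed = for {
        partition <- Try(field.trim.toInt).toOption
        offset    <- Try(value.trim.toLong).toOption
        if partition >= 0 && offset >= 0
      } yield partition -> offset
      parsed.toList
    }.toMap

  def main(args: Array[String]): Unit = {
    // The malformed entry is ignored.
    println(parseOffsets(Map("0" -> "42", "bad" -> "x"))) // prints Map(0 -> 42)
  }
}
```

Silently skipping bad entries is a design choice: the affected partition then starts from wherever the consumer would otherwise start, so logging the skipped fields may be worthwhile in a real job.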
3. Read the Kafka data in the job
val topics = new java.util.ArrayList[String]
topics.add(myTopic)
val consumer = createKafkaSource(topics, groupId)
// Start from the offsets restored from Redis; partitions without an entry fall back to the group offsets
consumer.setStartFromSpecificOffsets(getSpecificOffsets(topics))
val dataStream = env.addSource(consumer)
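4. Save the offset. This is usually done in the invoke method of a custom sink, so that the offset is stored only after the record has been processed
def setOffset(topic: String, partition: Int, offset: Long): Unit = {
  val jedis = new Jedis(GetPropKey.redis_host, GetPropKey.redis_port)
  try {
    // Same key layout that getSpecificOffsets reads: field = partition, value = offset
    val gtKey = s"my_flink_$topic"
    jedis.hset(gtKey, partition.toString, offset.toString)
  } finally {
    jedis.close() // release the connection even if the write fails
  }
}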
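Calling setOffset for every record opens a Redis connection per record. One possible alternative, sketched here as an assumption rather than the author's approach, is to buffer the highest offset seen per partition and write it out periodically. OffsetBuffer is a hypothetical helper; in a real sink the write function would wrap a call such as jedis.hset.

```scala
import scala.collection.mutable

// Hypothetical helper: keeps the highest offset seen per (topic, partition)
// and hands each one to `write` on flush.
final class OffsetBuffer(write: (String, Int, Long) => Unit) {
  private val pending = mutable.Map.empty[(String, Int), Long]

  def record(topic: String, partition: Int, offset: Long): Unit = {
    val key = (topic, partition)
    // Only move forward, so a late or repeated record never rewinds the stored offset.
    if (pending.get(key).forall(_ < offset)) pending.update(key, offset)
  }

  def flush(): Unit = {
    pending.foreach { case ((topic, partition), offset) => write(topic, partition, offset) }
    pending.clear()
  }
}

object OffsetBufferDemo {
  def main(args: Array[String]): Unit = {
    val written = mutable.ListBuffer.empty[(String, Int, Long)]
    val buffer = new OffsetBuffer((t, p, o) => written += ((t, p, o)))
    buffer.record("orders", 0, 41L)
    buffer.record("orders", 0, 42L)
    buffer.record("orders", 1, 7L)
    buffer.flush()
    println(written.sortBy(_._2).toList) // prints List((orders,0,42), (orders,1,7))
  }
}
```

The trade-off is that a crash between flushes replays every record processed since the last flush, so this favors throughput over a tight replay window.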
Other notes
One interesting thing I found while using Flink: just as in Spark, as long as you do not repartition the data and keep the original parallelism, the assignment of Kafka partitions stays fixed, so you do not need to worry about records from the same partition arriving out of order.
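One way to check this claim in a job is to assert that offsets only increase within each partition. The sketch below is a plain Scala illustration, not Flink API: PartitionOrderCheck is a hypothetical helper that takes (partition, offset) pairs in the order a subtask received them.

```scala
object PartitionOrderCheck {
  // records: (partition, offset) pairs in arrival order.
  // Returns true when offsets strictly increase within every partition.
  def isOrderedPerPartition(records: Seq[(Int, Long)]): Boolean =
    records.groupBy(_._1).values.forall { samePartition =>
      val offsets = samePartition.map(_._2)
      offsets.zip(offsets.drop(1)).forall { case (prev, next) => prev < next }
    }

  def main(args: Array[String]): Unit = {
    // Partitions may interleave freely; only the order inside each partition matters.
    val interleaved = Seq((0, 10L), (1, 3L), (0, 11L), (1, 4L))
    val reordered = Seq((0, 11L), (0, 10L))
    println(isOrderedPerPartition(interleaved)) // prints true
    println(isOrderedPerPartition(reordered))   // prints false
  }
}
```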