Spark Streaming Pitfalls: Duplicate Consumption from Kafka

1. Problem description
Every time the Spark Streaming demo program that connects to Kafka is restarted, it starts consuming again from the first message in the Kafka topic.

Changing enable.auto.commit and the related parameters has no effect.

2. Root cause
The demo program creates its Kafka input stream with "KafkaUtils.createDirectStream". Internally this API uses the low-level Kafka consumer API, which does not commit offsets automatically (to ZooKeeper).

Official documentation for "KafkaUtils.createDirectStream":

http://spark.apache.org/docs/2.2.0/streaming-kafka-0-8-integration.html
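This is also why the enable.auto.commit setting has no effect: only the receiver-based API tracks offsets in ZooKeeper. A minimal sketch of the two call styles, assuming the spark-streaming-kafka-0-8 artifact and the ssc StreamingContext created later in this post (topic, group, and broker names are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: consumer offsets for "demo-group" are tracked in ZooKeeper
val receiverStream = KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", Map("demo-topic" -> 1))

// Direct stream (what the demo uses): nothing is committed anywhere automatically,
// so the application must persist offsets itself
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("bootstrap.servers" -> "broker1:9092"), Set("demo-topic"))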

3. Workarounds
Option 1) Use the ZooKeeper client API to commit the offsets to ZooKeeper yourself; on startup, read the offsets back from ZooKeeper and pass them to "KafkaUtils.createDirectStream" (see the sketch after these two options).
Pros: can be integrated with ZooKeeper-based monitoring systems, so consumption progress is visible there.

Cons: frequent offset reads and writes may degrade ZooKeeper performance and, in turn, the stability of the Kafka cluster.

Option 2) Maintain the offsets in your own code and store them in MongoDB or Redis.
Pros: no impact on the ZooKeeper cluster; consumption monitoring can be built on top of MongoDB or Redis.

Cons: cannot be integrated with ZooKeeper-based monitoring systems.
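A minimal sketch of option 1, assuming Apache Curator as the ZooKeeper client; the znode layout, the connect string, and the ZkOffsetStore helper are illustrative, not from the original code:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical helper that stores one offset per partition in ZooKeeper
object ZkOffsetStore {
  private val client = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  private def path(group: String, topic: String, partition: Int) =
    s"/consumers/$group/offsets/$topic/$partition"

  def save(group: String, topic: String, partition: Int, offset: Long): Unit = {
    val p = path(group, topic, partition)
    val bytes = offset.toString.getBytes("UTF-8")
    if (client.checkExists().forPath(p) == null)
      client.create().creatingParentsIfNeeded().forPath(p, bytes)  // first commit for this partition
    else
      client.setData().forPath(p, bytes)                           // overwrite the previous offset
  }

  def load(group: String, topic: String, partition: Int): Option[Long] = {
    val p = path(group, topic, partition)
    if (client.checkExists().forPath(p) == null) None
    else Some(new String(client.getData().forPath(p), "UTF-8").toLong)
  }
}

The path layout mirrors the one used by Kafka's old high-level consumer, which is what lets ZooKeeper-based monitoring tools pick the offsets up. The rest of this post follows option 2 instead.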

4. Code example
The code below follows option 2: the offsets are saved to Redis, and when the service restarts it reads them back from Redis, so no messages are consumed twice.

1) A Scala utility class for accessing Redis

package xxx.demo.scala_test

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

class RedisUtil extends Serializable {

  // The pool cannot be serialized, so it is marked @transient and built on demand via makePool()
  @transient private var pool: JedisPool = null
  @transient val logger = Logger(LoggerFactory.getLogger("cn.com.flaginfo.demo.scala_test.RedisUtil"))

  def makePool(redisHost: String, redisPort: Int,
               password: String, database: Int): Unit = {
    if (pool == null) {
      val poolConfig = new GenericObjectPoolConfig()
      pool = new JedisPool(poolConfig, redisHost, redisPort, Protocol.DEFAULT_TIMEOUT, password, database)

      // Destroy the pool when the JVM shuts down
      val hook = new Thread {
        override def run = {
          pool.destroy()
          logger.debug("JedisPool destroyed by ShutdownHook")
        }
      }
      sys.addShutdownHook(hook.run)
    }
  }

  def jedisPool: JedisPool = {
    assert(pool != null)
    pool
  }

  // Redis key under which the offsets of one (groupId, topic) pair are stored as a hash
  def generateKafkaOffsetGroupIdTopicKey(groupId: String, topic: String): String = {
    groupId + "/" + topic
  }
}
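Each (groupId, topic) pair gets one Redis hash. A hypothetical illustration of the key produced by generateKafkaOffsetGroupIdTopicKey (the group and topic names are made up):

val key = new RedisUtil().generateKafkaOffsetGroupIdTopicKey("demo-group", "demo-topic")
// key == "demo-group/demo-topic"; step 6 below stores one hash field per partition under this key,
// e.g. field "0" -> value "1205" (the untilOffset of partition 0)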

2) Initialize the Redis utility

import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.config._
import com.mongodb.spark._
import com.mongodb._
import xxx.demo.model._
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig, Protocol}
import scala.collection.JavaConversions.{mapAsScalaMap}
import org.slf4j.LoggerFactory
import com.typesafe.scalalogging.slf4j.Logger

...

  var redisUtil = new RedisUtil()
  redisUtil.makePool(redisHost, redisPort, redisPassword, redisDatabase)
  var jedisPool = redisUtil.jedisPool

Note that when Redis does not require authentication, redisPassword must be set to null; an empty string causes an error.
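For example, a call without authentication might look like this (host, port, and database number are placeholders):

  redisUtil.makePool("127.0.0.1", 6379, null, 0)  // pass null, not "", when Redis has no password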

3) Read the previous offsets from Redis

var kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)
var allOffset: java.util.Map[String, String] = jedisPool.getResource().hgetAll(kafkaOffsetKey)
val fromOffsets = scala.collection.mutable.Map[TopicAndPartition, Long]()

if (allOffset != null && !allOffset.isEmpty()) {
  // Convert the Java Map returned by Jedis into a Scala Map
  var allOffsetScala: scala.collection.mutable.Map[String, String] = mapAsScalaMap[String, String](allOffset)
  for (offset <- allOffsetScala) {
    // Feed the stored offsets into the Kafka parameters. offset._1: partition, offset._2: offset
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, offset._1.toInt) -> offset._2.toLong)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
} else {
  // First run: start every partition of the topic at offset 0
  for (i <- 0 to (newsAnalysisTopicPartitionCount - 1)) {
    fromOffsets += (TopicAndPartition(newsAnalysisTopic, i) -> 0)
  }
  logger.debug("fromOffsets : " + fromOffsets.toString())
}

// Convert the mutable map into an immutable one
var immutableFromOffsets = Map[TopicAndPartition, Long](
  fromOffsets.map(kv => (kv._1, kv._2)).toList: _*
)
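As a side note, the mutable-to-immutable copy can be written more compactly; an equivalent one-liner:

val immutableFromOffsets: Map[TopicAndPartition, Long] = fromOffsets.toMap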

4) Define the message handler: it pulls the required fields out of the message metadata

val messageHandler: (MessageAndMetadata[String, String]) => (String, String, Long, Int) = (mmd: MessageAndMetadata[String, String]) =>
    (mmd.topic, mmd.message, mmd.offset, mmd.partition)

5) Create the Kafka input stream

val kafkaParam = Map[String, String](
  "bootstrap.servers" -> kafkaServer,
  "group.id" -> groupId,
  "client.id" -> clientId,
  "auto.offset.reset" -> "smallest",
  "enable.auto.commit" -> "false"
)

var kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String, Long, Int)](ssc, kafkaParam, immutableFromOffsets, messageHandler)

Here ssc is the StreamingContext instance.
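ssc itself is not shown in the original snippets; a minimal sketch of how it might be created (the application name and batch interval are placeholders):

val sparkConf = new SparkConf().setAppName("kafka-offset-demo")
val ssc = new StreamingContext(sparkConf, Seconds(5))  // 5-second micro-batches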

6) In the business logic, write the updated offsets back to Redis

var kafkaOffsetKey = redisUtil.generateKafkaOffsetGroupIdTopicKey(groupId, kafkaTopicName)

// _._1: topic name, _._2: message body, _._3: offset, _._4: partition
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // prevents the offsetRanges.foreach loop from running for an empty batch
    var offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // Process the data
    rdd.foreach { row =>
      logger.info("message : " + row + offsetRanges)
    }

    // Open a Redis transaction
    var jedis = jedisPool.getResource()
    var jedisPipeline = jedis.pipelined()
    jedisPipeline.multi()

    // Update the offsets: one hash field per partition
    offsetRanges.foreach { offsetRange =>
      logger.debug("partition : " + offsetRange.partition + " fromOffset: " + offsetRange.fromOffset + " untilOffset: " + offsetRange.untilOffset)
      jedisPipeline.hset(kafkaOffsetKey, offsetRange.partition.toString(), offsetRange.untilOffset.toString())
    }

    jedisPipeline.exec()  // commit the transaction
    jedisPipeline.sync()  // flush the pipeline
    jedis.close()
  }
}
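Finally, the streaming job still has to be started after the foreachRDD hook is registered; a minimal sketch:

ssc.start()
ssc.awaitTermination()

Because the offsets are written to Redis only after the batch has been processed, a failure between processing and the hset calls means that batch is replayed on the next restart; the scheme therefore gives at-least-once rather than exactly-once delivery.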
 
