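Preface
Kafka is a message queue that shows up in a lot of projects, especially in big data. Spring integrates with Kafka, so all we need to do is add the spring-kafka dependency.
Problem description
One day we released a new backend version. The change set was large, but everyone was confident because several rounds of testing had turned up nothing. Unfortunately it still went wrong. After going live we found that the Kafka consumer thread polled exactly once and then mysteriously stopped. A restart made it poll again, but again only once. (In theory a consumer keeps polling the broker for messages.) Strangest of all, no exception stack trace was printed anywhere. Everyone was stumped, nothing we tried helped, and in the end we had to roll back.
The project talks to the Kafka brokers through spring-kafka. If you are not using it, you are probably looking at a different problem.
Analysis
We were on spring-kafka 1.3.5, which depends on kafka-clients 0.11.0.2.
The business code uses the @KafkaListener annotation to start the consumer thread: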
@KafkaListener(topics = {"my.topic"}, groupId = "mygroup")
public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
    // do something
    ack.acknowledge();
}
With this annotation in place, spring-kafka starts the consumer threads for us. Here is the class that does it:
ConcurrentMessageListenerContainer.java
@Override
protected void doStart() {
    if (!isRunning()) {
        ContainerProperties containerProperties = getContainerProperties();
        TopicPartitionInitialOffset[] topicPartitions = containerProperties.getTopicPartitions();
        if (topicPartitions != null
                && this.concurrency > topicPartitions.length) {
            this.logger.warn("When specific partitions are provided, the concurrency must be less than or "
                    + "equal to the number of partitions; reduced from " + this.concurrency + " to "
                    + topicPartitions.length);
            this.concurrency = topicPartitions.length;
        }
        setRunning(true);
        for (int i = 0; i < this.concurrency; i++) {
            KafkaMessageListenerContainer<K, V> container;
            if (topicPartitions == null) {
                container = new KafkaMessageListenerContainer<>(this.consumerFactory, containerProperties);
            }
            else {
                container = new KafkaMessageListenerContainer<>(this.consumerFactory, containerProperties,
                        partitionSubset(containerProperties, i));
            }
            String beanName = getBeanName();
            container.setBeanName((beanName != null ? beanName : "consumer") + "-" + i);
            if (getApplicationEventPublisher() != null) {
                container.setApplicationEventPublisher(getApplicationEventPublisher());
            }
            container.setClientIdSuffix("-" + i);
            container.start();
            this.containers.add(container);
        }
    }
}
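It creates and starts one consumer container, and therefore one consumer thread, per unit of the configured concurrency (concurrency defaults to 1).
Next, KafkaMessageListenerContainer.java: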
@Override
protected void doStart() {
    if (isRunning()) {
        return;
    }
    // some code omitted
    this.listenerConsumer = new ListenerConsumer(this.listener, this.acknowledgingMessageListener);
    setRunning(true);
    this.listenerConsumerFuture = containerProperties
            .getConsumerTaskExecutor()
            .submitListenable(this.listenerConsumer);
}
This is doStart again. In the end it uses
    this.listenerConsumerFuture = containerProperties
            .getConsumerTaskExecutor()
            .submitListenable(this.listenerConsumer);
to obtain an executor and submit the task to it (ListenerConsumer is a Runnable).
The executor defaults to SimpleAsyncTaskExecutor. Here is the method that gets called:
@Override
public ListenableFuture<?> submitListenable(Runnable task) {
    ListenableFutureTask<Object> future = new ListenableFutureTask<Object>(task, null);
    execute(future, TIMEOUT_INDEFINITE);
    return future;
}
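We see ListenableFutureTask here. Look familiar? Right, it is a subclass of the JDK's FutureTask. So we can guess that FutureTask.run() is what eventually gets called, and that in turn lands in the run method of ListenerConsumer (an inner class of KafkaMessageListenerContainer), shown below: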
@Override
public void run() {
    this.consumerThread = Thread.currentThread();
    if (this.theListener instanceof ConsumerSeekAware) {
        ((ConsumerSeekAware) this.theListener).registerSeekCallback(this);
    }
    if (this.transactionManager != null) {
        ProducerFactoryUtils.setConsumerGroupId(this.consumerGroupId);
    }
    this.count = 0;
    this.last = System.currentTimeMillis();
    if (isRunning() && this.definedPartitions != null) {
        initPartitionsIfNeeded();
    }
    long lastReceive = System.currentTimeMillis();
    long lastAlertAt = lastReceive;
    while (isRunning()) {
        try {
            if (!this.autoCommit && !this.isRecordAck) {
                processCommits();
            }
            processSeeks();
            ConsumerRecords<K, V> records = this.consumer.poll(this.containerProperties.getPollTimeout());
            this.lastPoll = System.currentTimeMillis();
            if (records != null && this.logger.isDebugEnabled()) {
                this.logger.debug("Received: " + records.count() + " records");
            }
            if (records != null && records.count() > 0) {
                if (this.containerProperties.getIdleEventInterval() != null) {
                    lastReceive = System.currentTimeMillis();
                }
                invokeListener(records);
            }
            else {
                if (this.containerProperties.getIdleEventInterval() != null) {
                    long now = System.currentTimeMillis();
                    if (now > lastReceive + this.containerProperties.getIdleEventInterval()
                            && now > lastAlertAt + this.containerProperties.getIdleEventInterval()) {
                        publishIdleContainerEvent(now - lastReceive);
                        lastAlertAt = now;
                        if (this.theListener instanceof ConsumerSeekAware) {
                            seekPartitions(getAssignedPartitions(), true);
                        }
                    }
                }
            }
        }
        catch (WakeupException e) {
            // Ignore, we're stopping
        }
        catch (NoOffsetForPartitionException nofpe) {
            this.fatalError = true;
            ListenerConsumer.this.logger.error("No offset and no reset policy", nofpe);
            break;
        }
        catch (Exception e) {
            if (this.containerProperties.getGenericErrorHandler() != null) {
                this.containerProperties.getGenericErrorHandler().handle(e, null);
            }
            else {
                this.logger.error("Container exception", e);
            }
        }
    }
    // remainder of run() omitted
}
ConsumerRecords<K, V> records = this.consumer.poll(this.containerProperties.getPollTimeout());
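This is the line that pulls messages from the broker. It sits inside a while loop so that the thread keeps polling instead of exiting. When messages come back, invokeListener(records) fires, which eventually reaches the @KafkaListener method from the beginning of this post and runs our business code. So by rights this should keep polling and consuming forever. Why did it stop?
Let's keep digging and step into invokeListener: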
private void invokeListener(final ConsumerRecords<K, V> records) {
    if (this.isBatchListener) {
        invokeBatchListener(records);
    }
    else {
        invokeRecordListener(records);
    }
}

private void invokeRecordListener(final ConsumerRecords<K, V> records) {
    if (this.transactionTemplate != null) {
        innvokeRecordListenerInTx(records);
    }
    else {
        doInvokeWithRecords(records);
    }
}
Next, doInvokeWithRecords:
private void doInvokeWithRecords(final ConsumerRecords<K, V> records) throws Error {
    Iterator<ConsumerRecord<K, V>> iterator = records.iterator();
    while (iterator.hasNext()) {
        final ConsumerRecord<K, V> record = iterator.next();
        if (this.logger.isTraceEnabled()) {
            this.logger.trace("Processing " + record);
        }
        doInvokeRecordListener(record, null);
    }
}
The code above iterates over the records that were just polled and hands them to the listener one at a time. Moving on:
private RuntimeException doInvokeRecordListener(final ConsumerRecord<K, V> record,
        @SuppressWarnings("rawtypes") Producer producer) throws Error {
    try {
        if (this.acknowledgingMessageListener != null) {
            this.acknowledgingMessageListener.onMessage(record,
                    this.isAnyManualAck
                            ? new ConsumerAcknowledgment(record)
                            : null);
        }
        else {
            this.listener.onMessage(record);
        }
        ackCurrent(record, producer);
    }
    // catch block omitted here; we come back to it later
this.acknowledgingMessageListener.onMessage eventually ends up in the following code:
InvocableHandlerMethod.java
public Object invoke(Message<?> message, Object... providedArgs) throws Exception {
    Object[] args = getMethodArgumentValues(message, providedArgs);
    if (logger.isTraceEnabled()) {
        logger.trace("Invoking '" + ClassUtils.getQualifiedMethodName(getMethod(), getBeanType()) +
                "' with arguments " + Arrays.toString(args));
    }
    Object returnValue = doInvoke(args);
    if (logger.isTraceEnabled()) {
        logger.trace("Method [" + ClassUtils.getQualifiedMethodName(getMethod(), getBeanType()) +
                "] returned [" + returnValue + "]");
    }
    return returnValue;
}
Next, the doInvoke method:
protected Object doInvoke(Object... args) throws Exception {
    ReflectionUtils.makeAccessible(getBridgedMethod());
    try {
        return getBridgedMethod().invoke(getBean(), args);
    }
    catch (IllegalArgumentException ex) {
        assertTargetBean(getBridgedMethod(), getBean(), args);
        String text = (ex.getMessage() != null ? ex.getMessage() : "Illegal argument");
        throw new IllegalStateException(getInvocationErrorMessage(text, args), ex);
    }
    catch (InvocationTargetException ex) {
        // Unwrap for HandlerExceptionResolvers ...
        Throwable targetException = ex.getTargetException();
        if (targetException instanceof RuntimeException) {
            throw (RuntimeException) targetException;
        }
        else if (targetException instanceof Error) {
            throw (Error) targetException;
        }
        else if (targetException instanceof Exception) {
            throw (Exception) targetException;
        }
        else {
            String text = getInvocationErrorMessage("Failed to invoke handler method", args);
            throw new IllegalStateException(text, targetException);
        }
    }
}
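The unwrapping that doInvoke performs can be reproduced with plain JDK reflection. The sketch below is only an illustration, not Spring code: ReflectiveUnwrapDemo and its listen method are hypothetical stand-ins for a listener bean, and the Error is simulated. It shows that Method.invoke wraps whatever the target throws in an InvocationTargetException, and that after unwrapping an Error is still an Error, not a RuntimeException.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class ReflectiveUnwrapDemo {

    // Hypothetical stand-in for a @KafkaListener method whose business code fails with an Error.
    public static void listen(String payload) {
        throw new Error("simulated business error for " + payload);
    }

    // Invokes listen() reflectively and returns the unwrapped throwable, or null if nothing was thrown.
    public static Throwable invokeAndUnwrap(String payload) {
        try {
            Method method = ReflectiveUnwrapDemo.class.getMethod("listen", String.class);
            method.invoke(null, payload);
            return null;
        } catch (InvocationTargetException ex) {
            // Method.invoke wraps the target's throwable; getTargetException() returns the original.
            return ex.getTargetException();
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("reflection setup failed", ex);
        }
    }

    public static void main(String[] args) {
        Throwable t = invokeAndUnwrap("record-1");
        System.out.println("unwrapped type: " + t.getClass().getName());
        System.out.println("is Error: " + (t instanceof Error));
        System.out.println("is RuntimeException: " + (t instanceof RuntimeException));
    }
}
```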
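doInvoke contains a lot of exception handling. Whatever our business method throws arrives wrapped in an InvocationTargetException, gets unwrapped, and is rethrown as its original type, and that includes Error, which is unchecked. Now let's go back to the earlier doInvokeRecordListener method
in the KafkaMessageListenerContainer class: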
private RuntimeException doInvokeRecordListener(final ConsumerRecord<K, V> record,
        @SuppressWarnings("rawtypes") Producer producer) throws Error {
    try {
        if (this.acknowledgingMessageListener != null) {
            this.acknowledgingMessageListener.onMessage(record,
                    this.isAnyManualAck
                            ? new ConsumerAcknowledgment(record)
                            : null);
        }
        else {
            this.listener.onMessage(record);
        }
        ackCurrent(record, producer);
    }
    catch (RuntimeException e) {
        if (this.containerProperties.isAckOnError() && !this.autoCommit && producer == null) {
            ackCurrent(record, producer);
        }
        if (this.errorHandler == null) {
            throw e;
        }
        try {
            this.errorHandler.handle(e, record);
            // some code omitted
        }
        catch (RuntimeException ee) {
            // some code omitted
        }
    }
    return null;
}
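If our business code throws, this.errorHandler.handle deals with it. The default errorHandler is LoggingErrorHandler, which is very simple: it just logs the stack trace.
But notice that only RuntimeException is caught here. Anything else, such as an Error, is not handled and keeps propagating up the call stack. So back to the run method from the start, the while loop that polls for messages. The throwable ends up here: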
catch (WakeupException e) {
    // Ignore, we're stopping
}
catch (NoOffsetForPartitionException nofpe) {
    this.fatalError = true;
    ListenerConsumer.this.logger.error("No offset and no reset policy", nofpe);
    break;
}
catch (Exception e) {
    if (this.containerProperties.getGenericErrorHandler() != null) {
        this.containerProperties.getGenericErrorHandler().handle(e, null);
    }
    else {
        this.logger.error("Container exception", e);
    }
}
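These catch blocks are there so that run does not exit and the thread stays alive to keep polling. But look closely: what if the thing thrown is an Error? catch (Exception e) does not match it, so the loop is abandoned and the thread exits. That was exactly our case: some of our business code threw an Error (we confirmed that it really did), which made the consumer thread exit, i.e. the run method ended. With the consumer thread gone there is nothing left to poll. Mystery solved.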
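The difference between the two cases is easy to reproduce without Kafka. The following is a minimal sketch under stated assumptions: PollLoopDemo and countPolls are made-up names, the loop only imitates the shape of the container's while loop, and the thrown Error is simulated. A listener that throws a RuntimeException lets the loop run to its limit, while one that throws an Error ends the thread after the first iteration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PollLoopDemo {

    // Runs a poll-like loop on its own thread and returns how many iterations happened before it ended.
    public static int countPolls(Runnable listener, int maxPolls) {
        AtomicInteger polls = new AtomicInteger();
        Thread consumer = new Thread(() -> {
            while (polls.get() < maxPolls) {
                try {
                    polls.incrementAndGet(); // stands in for consumer.poll(...)
                    listener.run();          // stands in for invokeListener(records)
                } catch (Exception e) {
                    // Same shape as the container: handle the exception and keep polling.
                }
            }
        });
        // Keep the demo quiet; without this the default handler would print the Error.
        consumer.setUncaughtExceptionHandler((thread, error) -> { });
        consumer.start();
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return polls.get();
    }

    public static void main(String[] args) {
        int withException = countPolls(() -> { throw new IllegalStateException("business exception"); }, 5);
        int withError = countPolls(() -> { throw new Error("simulated business error"); }, 5);
        System.out.println("polls when the listener throws RuntimeException: " + withException); // 5
        System.out.println("polls when the listener throws Error: " + withError);               // 1
    }
}
```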
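Only the heartbeat thread was left. Kafka also watches the interval between two polls to decide whether a consumer is still alive (the max.poll.interval.ms setting), and a consumer that goes longer than that without calling poll is kicked out of the group, so eventually the heartbeat stopped too. Everything had stopped...
Wait, is that the whole story?
One question remains. The thread let the Error propagate, and in theory the JVM prints the stack trace of an uncaught throwable. So why did we see nothing? To answer that we have to go back to our friend FutureTask. If anything is thrown inside its run method, whatever the type, FutureTask catches it and stores it via setException. In other words the exception is swallowed, as anyone familiar with JDK thread pools will know. OK, that wraps up the analysis.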
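The swallowing behaviour of FutureTask can be seen with the JDK alone. This is a small sketch rather than the spring-kafka code path: SwallowedErrorDemo and runAndCollect are hypothetical names, and the Error is simulated. The worker thread finishes without printing anything, and the Error only becomes visible once get() is called, wrapped in an ExecutionException.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class SwallowedErrorDemo {

    // Runs the task through a FutureTask on its own thread and returns what get() reports, or null.
    public static Throwable runAndCollect(Runnable task) {
        FutureTask<Object> future = new FutureTask<>(task, null);
        Thread worker = new Thread(future);
        worker.start();
        try {
            worker.join(); // the thread ends normally; FutureTask.run() caught the throwable
            future.get();  // only now is the stored throwable handed back
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
    }

    public static void main(String[] args) {
        Throwable cause = runAndCollect(() -> { throw new Error("simulated business error"); });
        System.out.println("nothing was printed by the worker thread");
        System.out.println("get() reported: " + cause);
    }
}
```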
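A final note: how I knew the business code threw an Error
If the exception was never reported, how did I find out about the Error? Restarting the application triggers doStop, which finally logs the stored exception. The idea is the same as with JDK thread pools, where the exception is swallowed first and only surfaces when you call Future.get().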
org.springframework.kafka.listener.KafkaMessageListenerContainer#doStop
@Override
protected void doStop(final Runnable callback) {
    if (isRunning()) {
        this.listenerConsumerFuture.addCallback(new ListenableFutureCallback<Object>() {
            @Override
            public void onFailure(Throwable e) {
                KafkaMessageListenerContainer.this.logger.error("Error while stopping the container: ", e);
                if (callback != null) {
                    callback.run();
                }
            }
            // omitted
        });
        setRunning(false);
        this.listenerConsumer.consumer.wakeup();
    }
}
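doStop reports the failure only because it registers a callback on the future, and by then the consumer has long been dead. The same idea can be applied the moment the task finishes. The sketch below uses only the JDK and is not a spring-kafka API: LoggingFutureTask is a hypothetical subclass that overrides FutureTask.done(), the hook FutureTask calls once the task has completed, failed, or been cancelled.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicReference;

public class LoggingFutureTaskDemo {

    // Hypothetical FutureTask subclass that reports a failure as soon as the task finishes.
    static class LoggingFutureTask extends FutureTask<Object> {
        final AtomicReference<Throwable> failure = new AtomicReference<>();

        LoggingFutureTask(Runnable task) {
            super(task, null);
        }

        @Override
        protected void done() {
            try {
                get(); // does not block: the task is already finished when done() runs
            } catch (ExecutionException e) {
                failure.set(e.getCause());
                System.err.println("Task failed: " + e.getCause());
            } catch (CancellationException e) {
                // cancelled, nothing to report
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Runs the task inline for the demo (normally an executor thread would) and returns the failure, if any.
    public static Throwable runAndReport(Runnable task) {
        LoggingFutureTask future = new LoggingFutureTask(task);
        future.run();
        return future.failure.get();
    }

    public static void main(String[] args) {
        Throwable failure = runAndReport(() -> { throw new Error("simulated business error"); });
        System.out.println("reported immediately: " + failure);
    }
}
```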
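Summary
The real culprit behind the stalled consumer was an Error thrown by our business code, which made the consumer thread exit. The Error was swallowed, which is why no stack trace showed up. So watch out for Error-type throwables in business code, and if one can occur, catch it inside the business code so that this cannot happen.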
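One way to follow that advice is to wrap the body of the listener so that nothing escapes to the container. The following is a rough sketch, not a spring-kafka facility: GuardedListenerDemo and guarded are made-up names, and a plain Consumer stands in for the @KafkaListener method. Inside a real listener the same try/catch would simply surround the business logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class GuardedListenerDemo {

    // Hypothetical helper: wraps business logic so that no throwable, not even an Error, escapes.
    public static <T> Consumer<T> guarded(Consumer<T> businessLogic) {
        return record -> {
            try {
                businessLogic.accept(record);
            } catch (Throwable t) {
                // Log loudly; this record failed, but the calling thread survives.
                System.err.println("Listener failed for " + record + ": " + t);
            }
        };
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3");
        Consumer<String> listener = guarded(record -> {
            if (record.equals("r2")) {
                throw new Error("simulated business error");
            }
            System.out.println("processed " + record);
        });
        // All three records are attempted even though r2 throws an Error.
        records.forEach(listener);
    }
}
```

Catching Throwable is a deliberate trade-off: it also hides things like OutOfMemoryError, so the log line should be one that actually gets noticed, and it may be reasonable to rethrow the throwables you consider fatal.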