Kafka auto offset commit failed: rebalance

 

 

 

Today we ran into some problems while consuming from Kafka with Python; this post records them.

Scene One

Special case: a standalone program written only to produce and consume data

Start time: 10:42

Topic: t_facedec

Partition: 1

Program startup: consumer started on host 168, then consumer started on host 158, with a producer on a Windows machine pushing data

Run length: 15 minutes

Results:

1. The consumer on 168 paused, while the consumer on 158 kept consuming.

2. At 10:46 the producer was restarted, pushed a few records and then stopped; the consumer on 158 stopped and resumed consuming accordingly.

3. At 10:49 the consumers on 168 and 158 were stopped and then restarted in that order; 168 consumed some data, after which 158 took over and kept consuming.

4. The consumer that was started later ended up consuming the data.

 

 

Scene Two

Special scene: the production program, which includes a face recognition step

Start time: 11:00

Topic: t_facedec

Partition: 1

 

1. At 11:46 the consumer on 168 was started; six minutes later its log showed no abnormal messages.

2. At 11:53 the consumer on 158 was started; its log showed nothing abnormal, and the consumer on 158 joined the group kongzhagen.

 

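3. The consumer on 168 logged a warning: heartbeat failed because the group is rebalancing.

4. The producer was started on the Windows machine. The consumer on 168 began consuming data; the consumer on 158 consumed nothing.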

 

 

Conclusion: the consumer that was started first consumes the data. After the consumer on 168 was shut down, the consumer on 158 began consuming.
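One likely reason only one of the two consumers receives data is that the topic has a single partition, and within a consumer group each partition is assigned to at most one member. The sketch below is a simplified, range-style assignment written for illustration only; it is not Kafka's assignor code, and the member names are hypothetical. Which real consumer ends up with the partition depends on the member ids the coordinator hands out, which this sketch does not model.

```python
def assign_partitions(partitions, members):
    """Split partitions into contiguous chunks across members in sorted order.

    Simplified illustration of a range-style assignment, not Kafka's code.
    """
    members = sorted(members)
    base, extra = divmod(len(partitions), len(members))
    assignment = {}
    start = 0
    for index, member in enumerate(members):
        # The first `extra` members each receive one additional partition.
        count = base + (1 if index < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment


if __name__ == "__main__":
    # One partition (as for t_facedec here) shared by two hypothetical members.
    print(assign_partitions([0], ["member-a", "member-b"]))
```

With one partition and two members, one member gets the partition and the other gets an empty list, so it sits idle until a rebalance hands the partition over.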

 

 

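5. Half an hour later

Breaking down the error screenshot:

After 9 minutes the idle connection limit was reached. The relevant consumer settings were 'connections_max_idle_ms': 540000, 'max_poll_records': 500, 'heartbeat_interval_ms': 3000, 'session_timeout_ms': 30000.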

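Follow-up:

At 14:32 data production resumed and the data started being consumed.

At 14:48 the rebalance timeout occurred again.

At 15:02 the rebalance timeout occurred again.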

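Cause analysis:

1. The timeout arises because the consumer sends a heartbeat every 3 seconds (heartbeat_interval_ms). If, for whatever reason, the coordinator receives no heartbeat from this consumer within 30 seconds (session_timeout_ms), it considers the consumer dead and reassigns the topic's partitions among the group members (a rebalance).

2. By default a consumer fetches at most 500 records per poll (max_poll_records). If processing those 500 records takes longer than the maximum poll interval, max_poll_interval_ms, the consumer also leaves the group, which triggers a rebalance.

3. If the consumer goes without consuming for longer than connections_max_idle_ms, 540000 ms (9 min), that also causes the consumer to leave the group.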
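The three limits above can be turned into rough time budgets. The following is a small arithmetic sketch using the values quoted in this post. The value of max_poll_interval_ms is not quoted here, so 300000 ms is an assumption (believed to be the kafka-python default) and should be checked against the installed version.

```python
# Values quoted in this post (kafka-python consumer settings).
HEARTBEAT_INTERVAL_MS = 3000
SESSION_TIMEOUT_MS = 30000
MAX_POLL_RECORDS = 500
CONNECTIONS_MAX_IDLE_MS = 540000

# Assumption: not quoted in the post; believed to be the library default.
ASSUMED_MAX_POLL_INTERVAL_MS = 300000


def heartbeats_per_session(session_timeout_ms, heartbeat_interval_ms):
    """How many heartbeat intervals fit into one session timeout."""
    return session_timeout_ms // heartbeat_interval_ms


def per_record_budget_ms(max_poll_interval_ms, max_poll_records):
    """Average processing time available per record when a poll returns a full batch."""
    return max_poll_interval_ms / max_poll_records


if __name__ == "__main__":
    print(heartbeats_per_session(SESSION_TIMEOUT_MS, HEARTBEAT_INTERVAL_MS))
    print(per_record_budget_ms(ASSUMED_MAX_POLL_INTERVAL_MS, MAX_POLL_RECORDS))
    print(per_record_budget_ms(ASSUMED_MAX_POLL_INTERVAL_MS, 5))
    print(CONNECTIONS_MAX_IDLE_MS / 60000)  # idle limit in minutes
```

Under that assumption, a full batch of 500 records leaves about 600 ms per record, which a face recognition step can easily exceed; a batch of 5 leaves about 60 seconds per record.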

 

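Solution:

1. Increase the heartbeat session timeout

session_timeout_ms = 300000 (raised from 30 seconds to 300 seconds)

2. Reduce the number of records fetched per poll

max_poll_records = 5 (lowered from 500 to 5)

3. Increase the idle connection time

connections_max_idle_ms = 5400000 (raised from 9 min to 90 min)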
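As a minimal sketch, the three changes could be passed to a kafka-python KafkaConsumer as shown below. The topic and group name are taken from this post; the helper name build_consumer and the broker address in the usage note are hypothetical. The broker must also accept a session timeout this large (its group.max.session.timeout.ms setting), which is not verified here.

```python
# Tuned values from the fix list; all other settings keep their defaults.
TUNED_CONSUMER_CONFIG = {
    "session_timeout_ms": 300000,        # raised from 30 s to 300 s
    "max_poll_records": 5,               # lowered from 500 to 5
    "connections_max_idle_ms": 5400000,  # raised from 9 min to 90 min
}


def build_consumer(bootstrap_servers, topic="t_facedec", group_id="kongzhagen"):
    """Create a KafkaConsumer with the tuned settings (requires kafka-python)."""
    # Imported lazily so the config dict can be inspected without the library.
    from kafka import KafkaConsumer

    return KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        **TUNED_CONSUMER_CONFIG,
    )
```

A call such as build_consumer("localhost:9092") would then return a consumer subscribed to t_facedec; the address is a placeholder for the real broker.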

 

Origin www.cnblogs.com/kongzhagen/p/11320015.html