Under What Circumstances Will Kafka Lose Data?

Kafka is a high-throughput data bus. Used properly, it carries the business with ease and feels even more powerful; used improperly, the system deteriorates and maintenance becomes painful. Here we analyze, from the angle of data reliability, under what circumstances this messaging component loses data.

One, Producer configured with acks = 0

With acks = 0, the producer sends the next message without waiting for any acknowledgment from the broker. In this purely asynchronous, fire-and-forget mode, data loss is unavoidable when something goes wrong.
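A minimal sketch with the Java client showing what this fire-and-forget mode looks like (the broker address and topic name are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.ACKS_CONFIG, "0"); // fire and forget: no acknowledgment from the broker

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    // send() returns immediately; if the broker never receives the record, nothing ever reports it
    producer.send(new ProducerRecord<>("demo-topic", "some value"));
    producer.close();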

Two, Producer configured with acks = 1

With acks = 1, as soon as the message reaches the leader replica of the partition, the leader returns an ack and the send is considered successful; there is no waiting for the other replicas to finish synchronizing. In this mode, if the leader crashes before the follower replicas have caught up with its data, data loss occurs.
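Continuing the sketch above, switching to acks = 1 and attaching a callback: a successful callback only means the leader has written the record; followers may still be behind, which is exactly the window in which a leader crash loses data.

    props.put(ProducerConfig.ACKS_CONFIG, "1"); // ack as soon as the partition leader has the record

    producer.send(new ProducerRecord<>("demo-topic", "some value"), (metadata, exception) -> {
        if (exception != null) {
            exception.printStackTrace(); // the leader rejected or never received the record
        } else {
            // "success" here only means the leader wrote it; followers may not have copied it yet
            System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
        }
    });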

Three, NOT_ENOUGH_REPLICAS

Even when the producer is configured with acks = all, data can still be lost. When the number of replicas in a partition's ISR falls below min.insync.replicas, messages sent by the producer do not receive an ack. The producer then retries, up to the configured number of retries (message.send.max.retries). If the ISR still has not reached the minimum number of in-sync replicas within those retries, the producer throws a NOT_ENOUGH_REPLICAS error (NotEnoughReplicasException); if that exception is not handled correctly, the message is very likely lost.
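A hedged sketch of handling this on the producer side, continuing the sketch above (saveForReplay is a placeholder for whatever durable fallback the business needs, such as a local file, a database table, or a dead-letter topic):

    import org.apache.kafka.common.errors.NotEnoughReplicasException;

    props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the ISR, subject to min.insync.replicas

    producer.send(new ProducerRecord<>("demo-topic", "some value"), (metadata, exception) -> {
        if (exception instanceof NotEnoughReplicasException) {
            // The ISR stayed below min.insync.replicas even after the producer's retries:
            // persist the record somewhere durable instead of silently dropping it.
            saveForReplay("demo-topic", "some value"); // placeholder method
        } else if (exception != null) {
            exception.printStackTrace();
        }
    });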

So under what circumstances does the ISR have fewer replicas than the minimum?

A follower replica gets stuck and does not send fetch requests to the leader for a period of time, for example because of frequent Full GCs.
A follower replica synchronizes too slowly and cannot catch up with the leader within the allotted time, for example because of excessive I/O load.
Four, NOT_LEADER_FOR_PARTITION

A broker's ZooKeeper session times out, which triggers a Controller re-election and moves partition leadership; the producer's metadata becomes stale, writes still sent to the old broker throw NOT_LEADER_FOR_PARTITION warnings, and data can be lost at this point.
With auto.leader.rebalance.enable = true, a preferred-leader election moves leadership to another replica; writes that still go to the original leader throw NOT_LEADER_FOR_PARTITION (a retry sketch follows below).
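A minimal producer-side mitigation, assuming the Java client: NotLeaderForPartitionException is a retriable error, so with retries configured the producer refreshes its metadata and resends to the new leader instead of dropping the record.

    // Retriable errors such as NOT_LEADER_FOR_PARTITION are retried automatically;
    // the producer refreshes its metadata between attempts and finds the new leader.
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1); // avoid reordering when a retry happens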
Five, disk failure

Kafka data is first written to the PageCache and flushed to disk periodically, which means not every message is actually on disk; if the machine loses power or fails, the data still sitting in the PageCache is lost.

The flush interval can be configured with log.flush.interval.messages and log.flush.interval.ms.
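A sketch of those broker-side knobs (the values are illustrative, not recommendations); in practice replication across brokers, rather than aggressive flushing, is the main protection against losing a single machine:

    # server.properties (illustrative values)
    # flush a partition's log to disk after this many messages have accumulated
    log.flush.interval.messages=10000
    # or after this many milliseconds, whichever comes first
    log.flush.interval.ms=1000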

Six, Producer message is too large

If a single batch of data exceeds the size limit, the data is lost and a kafka.common.MessageSizeTooLargeException is reported.
If the producer produces messages larger than the maximum message size the consumer is configured to fetch, the consumer fails on those oversized messages (a config sketch follows below).
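A hedged sketch of the size-related settings that have to line up across producer, broker, and consumer (the byte values are illustrative; producerProps and consumerProps are the same kind of Properties objects used in the sketches above):

    import org.apache.kafka.clients.consumer.ConsumerConfig;

    // Producer: largest request it will send; it must also fit the broker's message.max.bytes
    producerProps.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576);

    // Consumer: fetch limits must be at least as large as the biggest message on the topic
    consumerProps.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);
    consumerProps.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);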
Seven, no retry on send failure

When the network load is high or the disk is busy, writes fail, and without automatic retries the failed messages are never resent. Without rate limiting, the producer can also exceed the available network bandwidth. Always configure message retries for Kafka, and make the retry interval reasonably long: the default of 1 second does not suit a production environment (a network interruption can easily last longer than 1 second).
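A hedged sketch of the retry-related producer settings (the numbers are illustrative, not recommendations):

    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.RETRIES_CONFIG, 10);
    // Make the gap between retries long enough to ride out a short network interruption
    props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 5000);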

Eight, consumer crashes

With auto.commit.enable = true, if the consumer has fetched some data but has not finished processing it when the commit interval fires and the offsets are committed, and the consumer then crashes, the fetched-but-unprocessed data has already been committed, so it never gets another chance to be processed and is lost.
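A sketch of the usual at-least-once fix with the Java consumer: turn off auto commit and commit offsets only after the records have actually been processed (broker address, group id, topic, and process() are placeholders).

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.serialization.StringDeserializer;

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");            // placeholder group
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually, after processing

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singleton("demo-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // placeholder for the real business logic
        }
        consumer.commitSync(); // offsets advance only after every record above was handled
    }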

Nine, consumer does not handle exceptions correctly

The consumer commits offsets automatically, a record throws an exception during consumption, and that exceptional record is not handled properly, so the business data behind it is lost.
The consumer commits offsets manually in batches, one record in the batch throws an exception that is not handled properly, yet the offset at the end of the batch is still committed, so that record's data is lost (see the sketch below).
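Continuing the consumer sketch above, a hedged example of dealing with a bad record before the batch commit (handleFailure is a placeholder: a dead-letter topic, a retry table, an alert, whatever the business needs):

    for (ConsumerRecord<String, String> record : records) {
        try {
            process(record); // placeholder business logic
        } catch (Exception e) {
            // Do not just log and move on: capture the record somewhere durable first,
            // otherwise the commit below silently skips past it and the data is gone.
            handleFailure(record, e); // placeholder method
        }
    }
    consumer.commitSync(); // safe only because every record was either processed or captured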
