Classic Kafka interview questions

[Online] The rebalance problem

A change in the cluster structure triggers a rebalance of the consumer group. If there are many nodes involved (say, several hundred), the rebalance can take anywhere from several minutes to several hours, during which Kafka consumption is essentially unavailable. This has a large impact on Kafka's TPS.

Causes:

  • The number of group members changes

  • The number of subscribed topics changes

  • The number of partitions of a subscribed topic changes

A group member crashing and a group member voluntarily leaving are two different scenarios. A crashed member does not proactively notify the coordinator, so the coordinator may need a full session.timeout.ms cycle (the heartbeat timeout) to detect the crash, which inevitably causes consumer lag. In short, leaving the group triggers a rebalance actively, while crashing triggers one passively.

The heartbeat/rejoin exchange looks roughly like this (Member 1 and Member 2 talking to the Coordinator):

  • Member 1 → Coordinator: Heartbeat ("I am member c1 of g1, generation is 2, hello")
  • Coordinator → Member 1: Heartbeat ("hello, c1")
  • Member 2 → Coordinator: Heartbeat ("I am member c2 of g1, generation is 2, hello")
  • Coordinator → Member 2: Heartbeat ("hello, c2")
  • Member 1 → Coordinator: Heartbeat ("I am member c1 of g1, generation is 2, hello")
  • Coordinator → Member 1: Heartbeat ("c1, sorry, you need to rejoin g1")
  • Member 1 → Coordinator: JoinGroup ("requesting to join g1; here is my subscription info")
  • Coordinator → Member 1: JoinGroup ("approved; you are c1 and also the leader; generation is now 3; g1's members are c1; here is the subscription info")
  • Member 1 → Coordinator: SyncGroup ("here is my assignment plan")
  • Coordinator → Member 1: SyncGroup ("c1, you need to consume these partitions")

Solutions:

Increase the session timeout: session.timeout.ms=6s
Increase the heartbeat frequency: heartbeat.interval.ms=2s
Increase the poll interval: max.poll.interval.ms = maximum processing time t + 1 minute
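A minimal sketch of these settings on the Kafka Java consumer; the concrete values are illustrative and should be tuned to your actual maximum per-batch processing time t:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceTunedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "g1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Declare a member dead only after 6s of missed heartbeats
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "6000");
        // Heartbeat every 2s, so at least 3 heartbeats fit in one session window
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "2000");
        // Max gap between poll() calls: processing time t + 1 minute (t assumed 5 min here)
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "360000");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic-a"));
        }
    }
}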

The role of ZooKeeper

Currently, Kafka uses ZooKeeper for storing cluster metadata, member management, Controller election, and other management tasks. Once the KIP-500 proposal is complete, Kafka will no longer depend on ZooKeeper at all.

  • Storing metadata means that all data about topic partitions is kept in ZooKeeper, and every other component must stay aligned with it.
  • Member management refers to the registration, deregistration, and attribute changes of Broker nodes.
  • Controller election refers to electing the cluster Controller, which handles tasks including, but not limited to, topic deletion and parameter configuration changes.
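Concretely, this state lives under well-known znodes such as /brokers/ids (one ephemeral node per live Broker), /brokers/topics, and /controller. A minimal sketch that peeks at them with the plain ZooKeeper Java client (the connection string and timeout are placeholders):

import org.apache.zookeeper.ZooKeeper;

public class KafkaZkPeek {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble Kafka uses (address is an assumption)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});
        // Broker membership: one ephemeral znode per live broker
        System.out.println(zk.getChildren("/brokers/ids", false));
        // Current Controller: a single znode holding the controller's broker id
        System.out.println(new String(zk.getData("/controller", false, null)));
        zk.close();
    }
}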

In a nutshell: KIP-500 uses a community-developed Raft-based consensus algorithm to realize Controller self-election.

etcd, which likewise stores metadata and is based on the Raft algorithm, has become increasingly well recognized in recent years.

More and more systems use it to store critical data. For example, flash-sale ("seckill") systems often use it to store the information of each node in order to control how many services consume the MQ. Some business systems also synchronize their configuration data in real time to each of their nodes through etcd; for example, a flash-sale admin backend can use etcd to synchronize the flash-sale activity configuration to every node of the flash-sale API service in real time.
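A rough sketch of that pattern, assuming the jetcd client (the endpoint, key name, and value are all illustrative): the admin side puts the config, and each API node watches the key to receive changes in real time.

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import static java.nio.charset.StandardCharsets.UTF_8;

public class ConfigSync {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://127.0.0.1:2379").build();
        ByteSequence key = ByteSequence.from("/seckill/activity-config", UTF_8);

        // Admin side: publish the activity config
        client.getKVClient()
              .put(key, ByteSequence.from("{\"qps\":100}", UTF_8))
              .get();

        // API-node side: watch the key so every node sees changes in real time
        Watch.Listener listener = Watch.listener(resp ->
                resp.getEvents().forEach(e ->
                        System.out.println("config changed: " + e.getKeyValue().getValue().toString(UTF_8))));
        client.getWatchClient().watch(key, listener);
    }
}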

The role of replicas

In Kafka, only the Leader replica serves reads and writes and responds to Clients' requests. Follower replicas merely synchronize data from the Leader passively via PULL, standing ready to campaign for leadership whenever the Broker hosting the Leader replica goes down.

  • Starting with Kafka 2.4, Follower replicas can be allowed, through configuration parameters, to serve limited read requests (see the sketch below).
  • Previously, the main means of ensuring consistency was the high-watermark mechanism, but the high watermark cannot guarantee data consistency in scenarios where the Leader changes repeatedly. The community therefore introduced the Leader Epoch mechanism to fix the shortcomings of the high watermark.
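This follower-fetching capability is KIP-392. A minimal sketch of both sides of the configuration, with placeholder rack names (the broker settings live in server.properties, so only the consumer side appears as code):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class FollowerReadConfig {
    public static Properties consumerProps() {
        // Broker side (server.properties), shown here as comments:
        //   broker.rack=us-east-1a
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        Properties props = new Properties();
        // Consumer side: declare the client's rack so fetches may be served
        // by a follower replica in the same rack
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");
        return props;
    }
}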

Why not support read-write separation?

  • Since Kafka 2.4, Kafka has offered limited read-write separation via follower fetching, as sketched above.
  • The scenario does not fit. Read-write separation suits workloads where reads are heavy and writes are relatively infrequent, which is not the typical message-queue pattern.
  • Synchronization mechanism. Kafka implements Follower synchronization with PULL, so the replication delay is relatively large and Follower reads would easily return stale data.

How to prevent duplicate consumption

  • At the code level, commit the offset only after each message has actually been processed
  • Use a MySQL unique-key constraint, combined with Redis, to check whether an id has already been consumed; Redis's SET can be used directly for this (see the sketch below)
  • When the volume is large and false positives are acceptable, a Bloom filter can also be used
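A minimal sketch of the Redis-side check, assuming the Jedis client (the key prefix and one-day TTL are illustrative choices):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class DedupConsumer {
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if this message id has not been seen before. */
    public boolean firstTime(String messageId) {
        // SET key value NX EX: succeeds only if the key does not exist yet
        String result = jedis.set("consumed:" + messageId, "1",
                SetParams.setParams().nx().ex(86400));
        return "OK".equals(result);
    }

    public void handle(String messageId, String payload) {
        if (!firstTime(messageId)) {
            return; // duplicate delivery: skip
        }
        // ... business logic, then commit the offset ...
    }
}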

How to ensure that data will not be lost

  • On the producer side, loss can be avoided with the confirmation configuration acks=all
  • A Leader going down during Broker synchronization can be handled by configuring ISR replicas (min.insync.replicas) plus retries
  • On the consumer side, turn off automatic offset commits and commit the offset only once processing has completed (see the sketch below)
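A minimal sketch of those settings with the Kafka Java clients (bootstrap servers and the topic name are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class NoLossConfig {
    public static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry transient failures
        return p;
    }

    public static void consumeReliably() {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "g1");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no auto-commit
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    // ... process r first ...
                }
                consumer.commitSync(); // commit only after processing succeeded
            }
        }
    }
}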

How to guarantee sequential consumption

  • Single topic, single partition, single consumer, single-threaded consumption: order is guaranteed, but throughput is low; not recommended
  • If you only need per-key ordering, route messages with the same key to the same partition, then give each key its own in-memory queue on the consumer side, with one thread consuming each queue; this preserves ordering per key (such as a user id or activity id), as sketched below
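On the producer side, Kafka's default partitioner already routes the same key to the same partition (new ProducerRecord<>(topic, key, value)). Below is a minimal consumer-side sketch of the per-key memory queues; the queue count and hashing scheme are illustrative choices:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class KeyedDispatcher {
    private static final int WORKERS = 8;
    @SuppressWarnings("unchecked")
    private final BlockingQueue<String>[] queues = new BlockingQueue[WORKERS];

    public KeyedDispatcher() {
        for (int i = 0; i < WORKERS; i++) {
            queues[i] = new ArrayBlockingQueue<>(1024);
            final BlockingQueue<String> q = queues[i];
            new Thread(() -> {
                try {
                    while (true) {
                        handle(q.take()); // messages for one key are processed in order
                    }
                } catch (InterruptedException ignored) { }
            }).start();
        }
    }

    /** Same key always maps to the same queue, hence the same thread. */
    public void dispatch(String key, String message) throws InterruptedException {
        queues[Math.floorMod(key.hashCode(), WORKERS)].put(message);
    }

    private void handle(String message) { /* business logic */ }
}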

[Online] How to solve a consumption backlog

  • Fix the consumer so it is able to consume again, then scale it out by N instances
  • Write a distributor program that spreads the backlogged topic's messages evenly across N temporary topics with more partitions, since a topic's partition count caps consumer parallelism (see the sketch below)
  • Start N sets of consumers at the same time, each consuming a different temporary topic
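A minimal sketch of the distributor, assuming the backlogged topic is named backlogged and the temporary topics temp-0 … temp-(n-1) already exist with more partitions:

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Redistributor {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> producer, int n) {
        consumer.subscribe(List.of("backlogged"));
        int i = 0;
        while (true) {
            for (var r : consumer.poll(Duration.ofSeconds(1))) {
                // Round-robin across the temporary topics; no business logic here,
                // so this loop drains the backlog much faster than real consumers
                producer.send(new ProducerRecord<>("temp-" + (i++ % n), r.key(), r.value()));
            }
            producer.flush();      // make sure the forwards are durable
            consumer.commitSync(); // before committing the source offsets
        }
    }
}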

How to Avoid Message Backlogs

  • Increase consumption parallelism
  • Consume in batches
  • Reduce the number of component IO interactions
  • Consume by priority (see the snippet below)
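The snippet below uses a RocketMQ-style listener (ConsumeConcurrentlyStatus is RocketMQ's API, but the idea carries over to any consumer): when the backlog, measured as maxOffset - curOffset, exceeds a threshold, run dedicated backlog handling instead of the normal consumption path.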
if (maxOffset - curOffset > 100000) {
    // TODO: priority handling for the backlog scenario
    // Unprocessed messages can be dropped or simply logged here
    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
}
// TODO: the normal consumption path
return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;

Author: 我是hope啊 · Source: 稀土掘金, https://juejin.cn/post/7062892826241007646

How to Design a Message Queue

The design needs to support rapid horizontal scaling: brokers plus partitions, with partitions placed on different machines, and data migrated topic by topic when machines are added. In a distributed setting, consistency, availability, and partition tolerance all need to be considered:

  • Consistency: producer message acknowledgment, consumer idempotence, and data synchronization between Brokers
  • Availability: how to ensure data is neither lost nor duplicated, how to persist the data, and how to read and write while persisting
  • Partition tolerance: which election mechanism to use and how to synchronize multiple replicas
  • Massive data: how to resolve message backlogs and the performance degradation that comes with a huge number of topics

For performance, you can borrow ideas such as the time wheel, zero copy, IO multiplexing, sequential reads and writes, and compressed batch processing.
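As one concrete illustration, here is a minimal zero-copy sketch using Java NIO's FileChannel.transferTo, the mechanism Kafka relies on when sending log segments (the file name is a placeholder). The kernel moves the bytes directly to the target channel instead of copying them through a user-space buffer:

import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
             WritableByteChannel out = Channels.newChannel(System.out)) {
            long position = 0, remaining = file.size();
            while (remaining > 0) {
                // transferTo lets the kernel move bytes directly (sendfile-style),
                // avoiding the read-into-user-buffer-then-write round trip
                long sent = file.transferTo(position, remaining, out);
                position += sent;
                remaining -= sent;
            }
        }
    }
}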
