20 common Kafka interview questions and answers

From the JAVA interview book series; with it, getting through a JAVA interview is no longer a problem. For the rest of the series, please follow this link: https://blog.csdn.net/wanghaiping1993/article/details/125075785

Table of contents

1. Is Kafka's consumer pull-based or push-based, and what are the benefits of this mode?

2. How does Kafka track message consumption status?

3. What is the role of ZooKeeper in Kafka?

4. What are the two conditions for Kafka to judge that a node is still alive?

5. Talk about the three mechanisms of Kafka's ack

6. In a distributed (not stand-alone) Kafka deployment, how do you ensure ordered consumption of messages?

7. How does Kafka avoid consuming duplicate data? For example, for payment deductions, we must not deduct twice.

8. Tell me about the composition of a Kafka cluster

9. What is Kafka?

10. Partition data files (offset, MessageSize, data)

11. How does Kafka achieve efficient data reading? (sequential read/write, segmentation, sparse index, binary search)

12. When does a Rebalance occur on the Kafka consumer side?

13. What do ISR (In-Sync Replicas), OSR (Out-of-Sync Replicas), and AR (All Replicas) represent in Kafka?

14. What do HW, LEO, etc. represent in Kafka?

15. What design choices give Kafka such high performance?

16. Why doesn't Kafka support read-write separation?

17. How many partition leader election strategies are there?

18. In which scenarios would you choose Kafka?

19. Explain the principle of Kafka data consistency

20. What are the disadvantages of Kafka?



1. Is Kafka's consumer pull-based or push-based, and what are the benefits of this mode?

Kafka follows a traditional design common to most messaging systems: producers push messages to brokers, and consumers pull messages from brokers.

Advantages: in pull mode, consumers decide for themselves whether and how much data to pull from the broker in batches, according to their own consumption capacity. In push mode, the broker does not know each consumer's capacity, so it is difficult to control the push rate: pushing too fast may overwhelm consumers, while pushing too slowly wastes capacity.

Disadvantage: if the broker has no messages to consume, the consumer keeps polling in a loop until new messages arrive. To avoid this, Kafka provides parameters that let the consumer's fetch request block until new messages arrive (or block until enough messages have accumulated, so they can be returned in a batch).
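As a sketch of the blocking behavior described above: the two real consumer configuration keys involved are fetch.min.bytes and fetch.max.wait.ms. The class below (its name is ours, purely illustrative) only assembles them into a java.util.Properties object of the kind you would pass to a consumer:

```java
import java.util.Properties;

// Illustrative sketch (class name is ours): the two real consumer configs
// that control how long a fetch request blocks on the broker side.
public class ConsumerFetchConfig {

    public static Properties build() {
        Properties props = new Properties();
        // The broker holds the fetch request until at least this many
        // bytes of messages are available...
        props.put("fetch.min.bytes", "1024");
        // ...or until this many milliseconds have passed, whichever is first.
        props.put("fetch.max.wait.ms", "500");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

With these settings a fetch returns as soon as 1 KB is available, and after at most 500 ms even if less data has arrived, so the consumer neither busy-polls nor waits forever.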

2. How does Kafka track message consumption status?

A topic in Kafka is divided into several partitions, and each partition is consumed by only one consumer (within a group) at any given time. An offset marks the message position, and this positional offset is used to track consumption progress. Compared with message queues where the broker marks a message immediately after delivering it to the consumer (or after waiting for the consumer's acknowledgment), this approach avoids message loss or duplicate consumption if the program crashes after the message has been sent. It also means the broker does not need to maintain per-message state or take locks, which improves throughput.
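A minimal sketch of this idea (not Kafka's implementation): instead of per-message state on the broker, a single committed offset per partition is enough to resume consumption after a restart.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model (not Kafka's implementation): consumption progress is
// one committed offset per partition, not per-message broker-side state.
public class OffsetTracker {
    private final Map<Integer, Long> committed = new HashMap<>();

    // Record that everything up to this offset has been processed.
    public void commit(int partition, long offset) {
        committed.put(partition, offset);
    }

    // Where a restarted consumer should resume; 0 if nothing committed yet.
    public long position(int partition) {
        return committed.getOrDefault(partition, 0L);
    }

    // Demo: commit progress, then "restart" and resume from the same spot.
    public static long demo() {
        OffsetTracker tracker = new OffsetTracker();
        tracker.commit(0, 42L);
        return tracker.position(0);
    }

    public static void main(String[] args) {
        System.out.println("resume at offset " + demo());
    }
}
```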

3. What is the role of ZooKeeper in Kafka?

ZooKeeper is mainly used for communication between the different nodes in the cluster. In older Kafka versions it was also used to commit consumer offsets, so that if a node failed for any reason, consumption could resume from the previously committed offset (newer clients commit offsets to the internal __consumer_offsets topic instead). In addition, it performs other activities such as leader detection, distributed synchronization, configuration management, detecting when nodes join or leave, cluster membership, and real-time node status.

4. What are the two conditions for Kafka to judge that a node is still alive?

(1) The node must maintain its session with ZooKeeper; ZooKeeper checks each node's connection through its heartbeat mechanism.
(2) If the node is a follower, it must be able to replicate the leader's writes in a timely manner, and its lag must not be too large.

5. Talk about the three mechanisms of Kafka's ack

request.required.acks (acks in newer clients) has three values: 0, 1, and -1 (all), as follows:
0: the producer does not wait for the broker's ack. This has the lowest latency but the weakest durability guarantee: if the server goes down, data is lost.
1: the broker sends the ack after the leader replica has confirmed receipt of the message. If the leader then fails before the followers have finished replicating, the newly elected leader will not have the message, so data can still be lost.
-1 (all): the broker sends the ack only after all follower replicas in the ISR have received the data, so an acknowledged message is not lost as long as at least one replica survives.
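The three settings can be summarized as a decision rule. The sketch below is a simplified model (not Kafka's actual code path): given how many ISR replicas, leader included, have persisted the record, it decides whether the broker may acknowledge the producer.

```java
// Simplified model of the three acks settings, not Kafka's actual code path:
// given how many ISR replicas (leader included) have persisted the record,
// decide whether the broker may acknowledge the producer.
public class AckPolicy {

    public static boolean canAck(String acks, int isrPersisted, int isrSize) {
        switch (acks) {
            case "0":
                return true;                      // producer never waits
            case "1":
                return isrPersisted >= 1;         // leader write is enough
            case "-1":
            case "all":
                return isrPersisted >= isrSize;   // the whole ISR must have it
            default:
                throw new IllegalArgumentException("unknown acks: " + acks);
        }
    }

    public static void main(String[] args) {
        // With acks=all and a 3-replica ISR, 2 persisted copies are not enough.
        System.out.println(canAck("all", 2, 3));
    }
}
```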

6. In a distributed (not stand-alone) Kafka deployment, how do you ensure ordered consumption of messages?

When sending a message in Kafka, three parameters can be specified: (topic, partition, key); partition and key are optional.

Kafka's unit of distribution is the partition. Within a single partition, messages are organized as a write-ahead log, so FIFO order is guaranteed; order is not guaranteed across different partitions. You can therefore specify a partition and send the related messages to that same partition. On the consumer side, Kafka guarantees that a partition is consumed by only one consumer in a group, which makes sequential consumption of those messages possible.

In addition, you can specify a key (such as an order id): all messages with the same key are sent to the same partition, which also preserves message order per key.
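The key-to-partition mapping can be sketched as follows. Note this is a simplified model: Kafka's default partitioner actually applies a murmur2 hash to the serialized key, and String.hashCode() stands in for it here; the invariant is the same, namely that equal keys always land on the same partition.

```java
// Simplified model: Kafka's default partitioner applies murmur2 to the
// serialized key; String.hashCode() is a stand-in here. The invariant is
// the same: equal keys always map to the same partition.
public class KeyPartitioner {

    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // All messages for one order id go to one partition, so their
        // relative order is preserved for that partition's consumer.
        int p1 = partitionFor("order-42", 3);
        int p2 = partitionFor("order-42", 3);
        System.out.println(p1 == p2);
    }
}
```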

7. How does Kafka avoid consuming duplicate data? For example, for payment deductions, we must not deduct twice.

Another way to ask this question is: how does Kafka guarantee message idempotence? For message queues, the probability of duplicate deliveries is quite high, and we cannot rely entirely on the message queue; instead, idempotence checks for data consistency should be performed at the business layer.

For example, if the data you process needs to be written to a database (MySQL, Redis, etc.), first look it up by primary key: if the record already exists, skip the insert and perform an update or other handling instead. A unique key can also be enforced at the database level to ensure data is not inserted twice. In general, the producer should attach a globally unique id to each message it sends.
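A minimal business-layer sketch of the deduction example: each message carries a globally unique id, and a deduction is applied at most once per id. In production the "seen" set would be a database unique key or Redis, not in-memory state; the class and method names here are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Business-layer idempotence sketch: each message carries a globally
// unique id, and a deduction is applied at most once per id. In production
// the "seen" set would be a DB unique key or Redis, not in-memory state.
public class IdempotentDeductor {
    private final Set<String> processedIds = new HashSet<>();
    private long balance;

    public IdempotentDeductor(long initialBalance) {
        this.balance = initialBalance;
    }

    // Returns true only the first time a given messageId is applied.
    public boolean deduct(String messageId, long amount) {
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery: ignore it
        }
        balance -= amount;
        return true;
    }

    public long balance() {
        return balance;
    }

    // Demo: the same message delivered twice is deducted only once.
    public static long demoDoubleDelivery(long initial, String id, long amount) {
        IdempotentDeductor account = new IdempotentDeductor(initial);
        account.deduct(id, amount);
        account.deduct(id, amount);
        return account.balance();
    }

    public static void main(String[] args) {
        System.out.println(demoDoubleDelivery(100, "msg-1", 30));
    }
}
```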

8. Tell me about the composition of a Kafka cluster

The components of a Kafka cluster are as follows:

Broker

A Kafka cluster usually consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper to maintain cluster state. A single Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle terabytes of messages without performance impact. Kafka broker leader election is performed via ZooKeeper.

ZooKeeper

ZooKeeper is used to manage and coordinate Kafka brokers. The ZooKeeper service mainly notifies producers and consumers when a new broker joins the Kafka system or when a broker fails. Based on these notifications about the presence or failure of a broker, producers and consumers make decisions and begin coordinating their work with other brokers.

Producers

Producers push data to brokers. When a new broker starts, all producers discover it and automatically begin sending messages to it. With acks=0, a producer does not wait for acknowledgments from the broker and sends messages as fast as the broker can handle them (with acks=1 or acks=all it waits for confirmation, as described in question 5).

Consumers

Because Kafka brokers are stateless, consumers must track how many messages they have consumed using partition offsets. If a consumer acknowledges a particular message offset, it implies that it has consumed all previous messages. Consumers issue asynchronous pull requests to brokers so that a buffer of bytes is ready to be consumed. Consumers can rewind or skip to any point in a partition simply by supplying an offset value. In older versions, consumer offset values were tracked via ZooKeeper (newer clients store them in the internal __consumer_offsets topic).

9. What is Kafka?

Kafka is a high-throughput, distributed, publish/subscribe-based messaging system originally developed by LinkedIn. It is written in Scala and is currently an open-source Apache project.

broker: Kafka server, responsible for message storage and forwarding

topic: message category, Kafka classifies messages according to topic

partition: topic partition; a topic can contain multiple partitions, and the topic's messages are stored across those partitions

offset: the position of a message in the log; it can be understood as the message's offset within the partition, and it uniquely identifies a message within that partition

Producer: message producer

Consumer: message consumer

Consumer Group: Consumer grouping, each Consumer must belong to a group

Zookeeper: saves the cluster broker, topic, partition and other meta data; in addition, it is also responsible for broker fault discovery, partition leader election, load balancing and other functions

10. Partition data files (offset, MessageSize, data)

Each message in a partition contains three attributes: offset, MessageSize, and data. The offset represents the message's offset within the partition; it is not the physical storage location of the message in the partition's data file, but a logical value that uniquely identifies a message in the partition (you can think of the offset as the message's id within the partition). MessageSize indicates the size of the message body, and data is the actual content of the message.

11. How does Kafka achieve efficient data reading? (sequential read/write, segmentation, sparse index, binary search)

Kafka creates an index file for each segment of the data file. The index file has the same name as the data file, but with the extension .index. The index file does not contain an entry for every message in the data file; instead it uses sparse storage, creating an index entry every fixed number of bytes. This prevents the index file from taking up too much space, so it can be kept in memory.
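A lookup in such a sparse index can be sketched as a binary search for the greatest indexed offset not past the target; Kafka would then scan forward in the data file from that position. This is a simplified model, not Kafka's actual index format.

```java
// Sketch of a sparse offset index (simplified, not Kafka's real format):
// entries are (offset, filePosition) pairs sorted by offset, one entry per
// N bytes rather than one per message. A lookup binary-searches for the
// greatest indexed offset <= target; the caller scans forward from there.
public class SparseIndex {
    private final long[][] entries; // each row: {offset, filePosition}

    public SparseIndex(long[][] entries) {
        this.entries = entries;
    }

    public long lookupPosition(long targetOffset) {
        int lo = 0, hi = entries.length - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries[mid][0] <= targetOffset) {
                best = mid;      // candidate: indexed offset not past target
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return entries[best][1];
    }

    public static void main(String[] args) {
        SparseIndex index = new SparseIndex(
                new long[][]{{0, 0}, {100, 4096}, {200, 8192}});
        // Offset 150 is not indexed; start scanning from offset 100's position.
        System.out.println(index.lookupPosition(150));
    }
}
```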

12. When does the Rebalance operation on the Kafka consumer end occur?

  • A new consumer joins the consumer group (same group.id), which triggers a Rebalance
  • A consumer leaves the consumer group it currently belongs to, for example due to downtime
  • The number of partitions changes (that is, the partition count of a subscribed topic changes)
  • A consumer voluntarily unsubscribes

The process of Rebalance is as follows:

Step 1: All members send a request to the coordinator to join the group. Once all members have sent requests, the coordinator will select a consumer to act as the leader, and send the group member information and subscription information to the leader.

Step 2: The leader starts to allocate consumption plans, specifying which consumers are responsible for consuming which partitions of which topics. Once the allocation is completed, the leader will send the plan to the coordinator. After receiving the allocation plan, the coordinator will send the plan to each consumer, so that all members in the group will know which partitions they should consume.

So for a Rebalance, the Coordinator plays a vital role.
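The plan the leader computes in Step 2 can be sketched as a simplified round-robin. Real Kafka uses pluggable assignors (Range, RoundRobin, Sticky); this only shows the core idea of mapping every partition to exactly one consumer in the group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified round-robin sketch of the assignment plan from Step 2.
// Real Kafka uses pluggable assignors (Range, RoundRobin, Sticky); this
// just maps every partition to exactly one consumer in the group.
public class AssignmentSketch {

    public static Map<String, List<Integer>> assign(List<String> consumers,
                                                    int numPartitions) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (String c : consumers) {
            plan.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Deal partitions out like cards, one consumer at a time.
            plan.get(consumers.get(p % consumers.size())).add(p);
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("c1", "c2"), 3));
    }
}
```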

13. What do ISR (In-Sync Replicas), OSR (Out-of-Sync Replicas), and AR (All Replicas) represent in Kafka?

Answer: The replicas (including the leader) that stay sufficiently in sync with the leader replica form the ISR. Replicas that lag too far behind the leader form the OSR. All replicas of a partition are collectively referred to as the AR, so AR = ISR + OSR.

ISR: the set of followers whose lag behind the leader stays within the configured threshold (replica.lag.time.max.ms, 10 seconds by default in older versions)
OSR: followers whose lag behind the leader exceeds that threshold
AR: all replicas of the partition

14. What do HW, LEO, etc. represent in Kafka?

Answer: HW (High Watermark): consumers can only pull messages before this offset.

LEO (Log End Offset): the offset of the next message to be written to the current log file; its value equals the offset of the last message in the log plus 1.
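The relationship between the two can be sketched as a simplified invariant: the partition HW is the minimum LEO across the ISR (in real Kafka the leader maintains the HW and propagates it to followers; this is only the model).

```java
// Simplified invariant (in real Kafka the leader maintains the HW and
// propagates it to followers): the partition High Watermark is the
// minimum LEO across all replicas in the ISR.
public class Watermarks {

    public static long highWatermark(long[] isrLeos) {
        long hw = Long.MAX_VALUE;
        for (long leo : isrLeos) {
            hw = Math.min(hw, leo);
        }
        return hw;
    }

    public static void main(String[] args) {
        // Leader LEO 5, followers at 3 and 4: consumers see offsets below 3.
        System.out.println(highWatermark(new long[]{5, 3, 4}));
    }
}
```

The slowest ISR member caps what consumers may read, which is exactly the "barrel principle" described in question 19.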

15. What design choices give Kafka such high performance?

1. Kafka is a distributed message queue: partitions spread the load across brokers
2. Log files are segmented, and an index is created for each segment
3. (On a single node) sequential reads and writes are used; throughput can reach 600 MB/s
4. Zero copy is used: read and write operations are completed inside the OS kernel, without copying data into user space

16. Why doesn't Kafka support read-write separation?

1. This is actually a common problem in distributed scenarios. Under the CAP theorem, we must trade off between C (consistency) and A (availability). If read-write separation were supported, consistency would be compromised: data consistency between replicas is achieved through synchronization, and synchronization inevitably takes time, so reads from followers could return stale or inconsistent data.

2. The Leader/Follower model itself does not forbid follower replicas from serving reads, and many frameworks allow it. But Kafka initially chose to serve all reads and writes from the leader in order to avoid inconsistency problems.

3. However, since Kafka 2.4, Kafka provides limited read-write separation: follower replicas can serve read requests.

17. How many partition leader election strategies are there?

Partition leader elections are completely transparent to users; they are carried out independently by the Controller. What you need to answer is in which scenarios a partition leader election is required. Each scenario corresponds to an election strategy.

1. OfflinePartition leader election: a leader election is performed whenever a partition comes online. A partition coming online may mean a newly created partition, or a previously offline partition coming back online. This is the most common partition leader election scenario.

2. ReassignPartition leader election: this type of election may be triggered when you manually run the kafka-reassign-partitions command, or call the Admin client's alterPartitionReassignments method, to reassign partition replicas. Suppose the original AR is [1, 2, 3] with leader 1, and after replica reassignment the AR becomes [4, 5, 6]. Obviously the leader must change, and a ReassignPartition leader election takes place at this point.

3. PreferredReplicaPartition leader election: this strategy is activated when you manually run the kafka-preferred-replica-election command, or when a Preferred Leader election is triggered automatically. The so-called Preferred Leader is the first replica in the AR; for example, if the AR is [3, 2, 1], the Preferred Leader is replica 3.

4. ControlledShutdownPartition leader election: when a broker shuts down gracefully, all leader replicas on that broker go offline, so corresponding leader elections must be performed for the affected partitions.

The general idea of all four election strategies is the same: scan the AR in order and choose the first replica that is also in the ISR as the new leader.
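That shared idea fits in a few lines. The sketch below is illustrative, not Kafka's Controller code; -1 stands for "no eligible leader found":

```java
import java.util.List;
import java.util.Set;

// Sketch of the shared idea behind the four strategies (not Kafka's
// Controller code): walk the AR in order and pick the first replica that
// is also in the ISR; -1 means no eligible leader was found.
public class LeaderElection {

    public static int electLeader(List<Integer> ar, Set<Integer> isr) {
        for (int replica : ar) {
            if (isr.contains(replica)) {
                return replica;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // AR = [3, 2, 1] but replica 3 is out of sync: replica 2 wins.
        System.out.println(electLeader(List.of(3, 2, 1), Set.of(2, 1)));
    }
}
```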

18. Please briefly describe in which scenarios you would choose Kafka?

• Log collection: a company can use Kafka to collect the logs of its various services and expose them through Kafka as a unified interface to various consumers, such as Hadoop, HBase, Solr, etc.
• Messaging system: decoupling producers from consumers, buffering messages, and so on.
• User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These events are published by the servers to Kafka topics, which subscribers consume for real-time monitoring and analysis, or load into Hadoop or a data warehouse for offline analysis and mining.
• Operational metrics: Kafka is also often used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feeds for alerting and reporting.
• Stream processing: for example, as the data source for Spark Streaming and Flink.

19. Please talk about the principle of Kafka data consistency

Consistency means that consumers read the same data regardless of whether they are reading from the old leader or a newly elected one.


Assume the partition has 3 replicas: replica 0 is the leader, and replicas 1 and 2 are followers in the ISR list. Although replica 0 has already written Message4, the consumer can only read up to Message2, because Message2 is the latest message that every ISR member has replicated. Only messages below the High Watermark can be read by consumers, and the High Watermark is determined by the replica with the smallest LEO in the ISR list (replica 2 in the figure above). This is much like the barrel principle: the shortest stave determines how much water the barrel can hold.

The reason for this is that messages not yet replicated by enough replicas are considered "unsafe": if the leader crashes and another replica becomes the new leader, those messages are likely to be lost. If consumers were allowed to read them, consistency could be broken. Imagine a consumer reads and processes Message4 from the current leader (replica 0); the leader then fails, and replica 1 is elected as the new leader. When another consumer reads from the new leader, it finds that Message4 does not actually exist, which produces inconsistent data.

Of course, the High Watermark mechanism means that if message replication between brokers slows down for some reason, messages take longer to reach consumers (because we wait for the message to be fully replicated first). This delay is bounded by the replica.lag.time.max.ms parameter: a replica that lags beyond this time is removed from the ISR, so it no longer holds back the High Watermark.

20. What are the disadvantages of Kafka?

• Because messages are sent in batches, delivery is not real-time;
• It does not support the MQTT protocol;
• It does not support direct ingestion of IoT sensor data;
• It only guarantees message order within a single partition, and cannot provide global message order;
• Its built-in monitoring is not complete, and plug-ins need to be installed;
• It relies on ZooKeeper for metadata management.


Origin blog.csdn.net/wanghaiping1993/article/details/125346010