Summary of Kafka high-frequency interview questions

Table of Contents

 

1. What do ISR (In-Sync Replicas), OSR (Out-of-Sync Replicas) and AR (Assigned Replicas) in Kafka represent?

2. What do HW and LEO in Kafka stand for?

3. How does Kafka guarantee message ordering?

4. Do you understand the partitioner, serializer, and interceptor in Kafka? What is the processing order between them?

5. How many threads does the Kafka producer client use? What are they?

6. "If the number of consumers in a consumer group exceeds the number of partitions in the topic, some consumers will be unable to consume data." Is this statement correct?

7. When a consumer commits its consumed offset, does it commit the offset of the most recently consumed message, or offset + 1?

8. What situations cause repeated consumption?

9. What scenarios cause messages to be missed (lost)?

10. After you use kafka-topics.sh to create (or delete) a topic, what logic does Kafka execute behind the scenes?

11. Can the number of partitions of a topic be increased? If so, how? If not, why not?

12. Can the number of partitions of a topic be reduced? If so, how? If not, why not?

13. Does Kafka have internal topics? If so, which ones? What are they used for?

14. How does Kafka assign partitions to consumers?

15. Briefly describe the log directory structure of Kafka.

16. Given an offset, how does Kafka find the corresponding message?

17. What is the role of the Kafka Controller?

18. Where does Kafka need elections? What are the election strategies in each case?

19. What is a failed (out-of-sync) replica? What are the countermeasures?

20. What design choices give Kafka such high performance?

21. What are the uses of Kafka? What are its usage scenarios?

22. Talk about your understanding of Kafka's log retention.

23. Why choose Kafka?

24. KafkaConsumer is not thread-safe, so how can multi-threaded consumption be achieved?

25. Briefly describe the relationship between consumers and consumer groups.

26. How do you choose an appropriate number of partitions when creating a topic?

27. What is a preferred replica? What is its special role?

28. How does Kafka clean up expired data?

29. How is idempotence achieved in Kafka?



1. What do ISR (In-Sync Replicas), OSR (Out-of-Sync Replicas) and AR (Assigned Replicas) in Kafka represent?

ISR (In-Sync Replicas): the set of replicas that are currently caught up with the leader. The leader is the replica responsible for all reads and writes of a partition; the follower replicas replicate the leader's log. Only replicas in the ISR are eligible for leader election.

OSR (Out-of-Sync Replicas): replicas that have fallen too far behind the leader and have been removed from the ISR.

AR (Assigned Replicas): all replicas assigned to the partition; AR = ISR + OSR.

2. What do HW and LEO in Kafka stand for?

    LEO (Log End Offset): the offset of the next message to be written in each replica, i.e. the offset of its last message plus one.

    HW (High Watermark): the smallest LEO among all replicas of a partition. Consumers can only read messages up to the HW.
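As a minimal sketch of this relationship (not Kafka source code; we assume we only track each replica's LEO):

```python
# Sketch: the HW is the minimum LEO among the replicas;
# consumers may only read messages at offsets below the HW.

def high_watermark(replica_leos):
    """replica_leos: the Log End Offset of each replica."""
    return min(replica_leos)

# The leader has written up to offset 10, two followers lag behind:
print(high_watermark([10, 8, 9]))  # -> 8
```

So even though the leader already holds messages up to offset 10, consumers see only up to offset 8 until the followers catch up.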

3. How does Kafka guarantee message ordering?

Within each partition, every message has a monotonically increasing offset, so Kafka guarantees ordering only within a single partition, not across the partitions of a topic.

4. Do you understand the partitioner, serializer, and interceptor in Kafka? What is the processing order between them?

Kafka sends messages to the broker through the send() method of the producer (KafkaProducer), but before a message is actually sent it passes through a chain of components: the interceptors (Interceptor), the serializer (Serializer), and the partitioner (Partitioner). After the message is serialized, the target partition must be determined. If the partition field of the ProducerRecord is already set, the partitioner is skipped, because that field already names the partition to send to.

 Interceptor -> Serializer -> Partitioner
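A toy pipeline in that order (the three functions are simplified stand-ins, not the real client classes, and the key hash in the partitioner is a made-up toy hash):

```python
# 1. interceptor: may inspect or modify the record before sending
def intercept(record):
    record["value"] = record["value"].strip()
    return record

# 2. serializer: turn the value into bytes for the wire
def serialize(record):
    record["value"] = record["value"].encode("utf-8")
    return record

# 3. partitioner: skipped when an explicit partition is set
def partition(record, num_partitions):
    if record.get("partition") is not None:
        return record["partition"]
    # toy stand-in for hashing the key
    return sum(ord(c) for c in record["key"]) % num_partitions

rec = {"key": "user-1", "value": " hello ", "partition": 2}
rec = serialize(intercept(rec))
print(partition(rec, 4))  # -> 2, the explicit partition wins
```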

5. How many threads does the Kafka producer client use? What are they?

The producer client has two main threads: the main thread and the Sender thread. The main thread builds messages and, after passing them through the interceptors, serializer, and partitioner, buffers them in the message accumulator (RecordAccumulator). The Sender thread fetches messages from the RecordAccumulator and sends them to Kafka.
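An illustrative sketch of this two-thread model (the queue and names are our stand-ins, not the real Java client internals):

```python
import queue
import threading

accumulator = queue.Queue()   # stands in for the RecordAccumulator
sent = []                     # stands in for what the broker received

def sender_loop():
    # Sender thread: drain the accumulator and "send" each record
    while True:
        record = accumulator.get()
        if record is None:    # shutdown signal
            break
        sent.append(record)

sender = threading.Thread(target=sender_loop)
sender.start()

# Main thread: produce messages into the accumulator
for msg in ["a", "b", "c"]:
    accumulator.put(msg)
accumulator.put(None)
sender.join()
print(sent)  # -> ['a', 'b', 'c']
```

Decoupling production from sending this way is what lets the real client batch and compress records before they go over the wire.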

6. "If the number of consumers in a consumer group exceeds the number of partitions in the topic, some consumers will be unable to consume data." Is this statement correct?

Correct. Each partition can be assigned to at most one consumer within a group, so the surplus consumers sit idle.

7. When a consumer commits its consumed offset, does it commit the offset of the most recently consumed message, or offset + 1?

offset + 1, i.e. the offset of the next message to be consumed.

8. What situations cause repeated consumption?

Repeated consumption happens when data has been consumed but its offset has not been committed — for example, the consumer processes a batch and then crashes (or a rebalance occurs) before the commit, so on restart the same messages are fetched again.

9. What scenarios cause messages to be missed (lost)?

Committing the offset first and consuming afterwards can cause message loss: if the consumer crashes after the commit but before processing, those messages are skipped on restart.
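The two failure modes in questions 8 and 9 come down to the ordering of processing and committing. A toy replay (offsets 0..4, with a simulated crash while handling offset 2):

```python
# Toy simulation: the window between processing a record and
# committing its offset decides the failure mode after a crash.
def run(commit_before_processing, crash_at=2):
    committed = 0
    seen = []
    for offset in range(5):
        if commit_before_processing:
            committed = offset + 1     # commit first ...
            if offset == crash_at:
                break                  # ... crash before processing
            seen.append(offset)
        else:
            seen.append(offset)        # process first ...
            if offset == crash_at:
                break                  # ... crash before committing
            committed = offset + 1
    seen += list(range(committed, 5))  # restart from committed offset
    return seen

print(run(commit_before_processing=False))  # -> [0, 1, 2, 2, 3, 4]  duplicate
print(run(commit_before_processing=True))   # -> [0, 1, 3, 4]        offset 2 lost
```

Commit-after-processing gives at-least-once semantics (possible duplicates); commit-before gives at-most-once (possible loss). Exactly-once requires idempotent processing or transactions on top.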

10. After you use kafka-topics.sh to create (or delete) a topic, what logic does Kafka execute behind the scenes?

    1) A new topic node is created under the /brokers/topics node in ZooKeeper, e.g. /brokers/topics/first

    2) This triggers the watch registered by the Kafka Controller

    3) The Kafka Controller carries out the actual topic creation (partition and replica assignment) and updates the metadata cache

11. Can the number of partitions of a topic be increased? If so, how? If not, why not?

Yes, it can be increased, for example:

bin/kafka-topics.sh --zookeeper localhost:2181/kafka --alter --topic topic-config --partitions 3

(On newer Kafka versions, --bootstrap-server replaces --zookeeper.)

12. Can the number of partitions of a topic be reduced? If so, how? If not, why not?

No, it cannot be reduced: the data in the existing partitions would be hard to deal with — deleting it loses messages, and redistributing it to the remaining partitions would break message ordering and offsets.

13. Does Kafka have internal topics? If so, which ones? What are they used for?

Yes: __consumer_offsets, which stores consumers' committed offsets (and consumer group metadata).

14. How does Kafka assign partitions to consumers?

A topic has multiple partitions and a consumer group has multiple consumers, so partitions must be assigned to consumers. The built-in strategies include RoundRobin and Range.

  • RoundRobin is group-oriented: partitions are dealt out to the group's consumers one by one. If consumers in the same group subscribe to different topics, the round-robin distribution can become uneven.
  • Range is topic-oriented: for each topic, partitions are split into contiguous ranges. Its problem is that it may cause unbalanced load, since the first consumers receive the extra partitions of every topic.
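Simplified single-topic versions of the two strategies (the real assignors also handle multi-topic subscriptions and consumer metadata):

```python
# Range: split the partition list into contiguous ranges; the first
# `extra` consumers each receive one partition more than the rest.
def range_assign(partitions, consumers):
    n, m = len(partitions), len(consumers)
    per, extra = divmod(n, m)
    result, start = {}, 0
    for i, c in enumerate(sorted(consumers)):
        count = per + (1 if i < extra else 0)
        result[c] = partitions[start:start + count]
        start += count
    return result

# RoundRobin: deal partitions to consumers one at a time, like cards.
def roundrobin_assign(partitions, consumers):
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        result[consumers[i % len(consumers)]].append(p)
    return result

parts = [0, 1, 2, 3, 4]
print(range_assign(parts, ["c1", "c2"]))       # -> {'c1': [0, 1, 2], 'c2': [3, 4]}
print(roundrobin_assign(parts, ["c1", "c2"]))  # -> {'c1': [0, 2, 4], 'c2': [1, 3]}
```

With several topics, Range hands the "extra" partition of each topic to the same leading consumers, which is exactly where its load imbalance comes from.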

15. Briefly describe the log directory structure of Kafka.

One folder per partition, containing four kinds of files: .index, .log, .timeindex, and leader-epoch-checkpoint.

The .index, .log and .timeindex files come in sets, one set per log segment, named after the segment's base offset — the offset of the first message in that segment (i.e. the last offset of the previous segment plus one).

16. Given an offset, how does Kafka find the corresponding message?

  • Binary-search the segment files by their base-offset file names to find the segment whose base offset x is the largest one not exceeding the target offset
  • offset − x is the relative offset within that segment
  • Look up the sparse .index file to find the position of the nearest indexed entry at or before that relative offset
  • Scan the .log file forward from that position until the message is found
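A sketch of that lookup with toy data — base offsets stand in for segment file names, and a tiny hand-made sparse index maps relative offsets to byte positions:

```python
import bisect

segments = [0, 100, 200]              # base offsets = segment file names
# sparse index per segment: (relative offset, byte position in .log)
index = {100: [(0, 0), (40, 4096), (80, 8192)]}

def locate(target):
    i = bisect.bisect_right(segments, target) - 1
    base = segments[i]                # largest base offset <= target
    relative = target - base
    pos = 0
    for rel, filepos in index.get(base, []):
        if rel <= relative:
            pos = filepos             # nearest indexed entry before target
        else:
            break
    return base, relative, pos        # then scan the .log forward from pos

print(locate(170))  # -> (100, 70, 4096)
```

Because the index is sparse, the final forward scan is short: at most the gap between two index entries.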

17. What is the role of the Kafka Controller?

   It is responsible for managing brokers coming online and going offline, the assignment of partition replicas for all topics, and leader election.

18. Where does Kafka need elections? What are the election strategies in each case?

 Partition leader election (the controller picks a leader from the ISR) and controller election (the first broker to create the /controller node in ZooKeeper wins — first come, first served).

19. What is a failed (out-of-sync) replica? What are the countermeasures?

A replica that cannot keep up with the leader in time.

It is temporarily removed from the ISR and rejoins once it has caught up with the leader.

20. What design choices give Kafka such high performance?

Partitioning (parallelism), sequential disk writes, and zero-copy data transfer.

21. What are the uses of Kafka? What are its usage scenarios?

Asynchronous processing, system decoupling, peak shaving, buffering, and broadcast.

More concretely: messaging, website activity tracking, metrics monitoring, log aggregation, stream processing, event collection, commit logs, etc.

22. Talk about your understanding of Kafka's log retention.

Kafka's retention strategies are deletion and compaction.

  • Deletion: segments are deleted based on time or on size. Size refers to the total size of the partition's log; when it is exceeded, segments are deleted from oldest to newest. Time refers to the largest timestamp in the log segment, not the file's last-modified time.
  • Compaction: only the latest value for each key is kept. The already-compacted portion of the log is called clean; the not-yet-compacted portion is dirty. After compaction the offsets of the remaining messages are no longer continuous, whereas before compaction they are.
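A toy model of compaction, including tombstones (null values, which eventually remove a key entirely); note how surviving records keep their original, now non-contiguous offsets:

```python
# Keep only the latest value per key; a value of None is a
# tombstone that deletes the key. Offsets are preserved as-is.
def compact(log):
    """log: list of (offset, key, value) tuples, in offset order."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, value)           # later entries win
    kept = sorted(latest.values())
    return [(o, v) for o, v in kept if v is not None]

log = [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c"), (3, "k2", None)]
print(compact(log))  # -> [(2, 'c')]
```

This is why a compacted topic like __consumer_offsets stays bounded: old values for a key are garbage-collected, and only the latest state survives.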

23. Why choose Kafka?

High throughput — it is the de facto standard messaging system for big-data pipelines.

24. KafkaConsumer is not thread-safe, so how can multi-threaded consumption be achieved?

  • Each thread maintains its own KafkaConsumer
  • Maintain one (or a few) KafkaConsumer instances plus a pool of event-processing (worker) threads
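A sketch of the second pattern, with a plain list standing in for consumer.poll() and a thread pool as the workers:

```python
from concurrent.futures import ThreadPoolExecutor

def handle(record):
    # per-record business logic runs on a worker thread
    return record.upper()

records = ["a", "b", "c"]          # stands in for consumer.poll()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(handle, records))
print(results)  # -> ['A', 'B', 'C']
```

The catch with this pattern is offset management: the polling thread must only commit offsets whose records the workers have actually finished processing.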

25. Briefly describe the relationship between consumers and consumer groups.

Consumers belong to consumer groups, and consumed offsets are tracked per consumer group. Each consumer group independently consumes all of a topic's data; the consumers inside one group share the topic's partitions between them, and each partition is consumed by at most one consumer within the same group.

26. How do you choose an appropriate number of partitions when creating a topic?

  1. Create a topic with only 1 partition
  2. Measure the producer throughput and consumer throughput of this topic
  3. Call the measured values Tp and Tc, in MB/s
  4. If the total target throughput is Tt, then the number of partitions = Tt / min(Tp, Tc) — equivalently max(Tt/Tp, Tt/Tc), since the slower side is the bottleneck
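The sizing rule above as code; the division is by min(Tp, Tc) because the slower of producing and consuming limits each partition's effective throughput:

```python
import math

def num_partitions(tt, tp, tc):
    """tt: target throughput; tp/tc: measured single-partition
    producer/consumer throughput, all in the same unit (MB/s)."""
    return math.ceil(tt / min(tp, tc))

# Target 100 MB/s; one partition produces 20 MB/s, consumes 50 MB/s:
print(num_partitions(tt=100, tp=20, tc=50))  # -> 5
```

Rounding up (ceil) and leaving some headroom is the usual practice, since adding partitions later redistributes keys.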

27. What is a preferred replica? What is its special role?

The preferred replica is the first replica in a partition's assigned replica list, and it is the leader by default. When leadership changes, preferred-replica election tries to move leadership back to it, which keeps partition leaders evenly distributed across brokers.

28. How does Kafka clean up expired data?

There are only two log cleanup policies: delete and compact.

log.cleanup.policy=delete enables the delete policy

log.cleanup.policy=compact enables the compact policy

29. How is idempotence achieved in Kafka?

Producer idempotence means that when the same message is sent (retried), it is persisted only once on the broker — the data is neither lost nor duplicated. But this idempotence is conditional:

1) It only holds within a single producer session. If the producer unexpectedly crashes and restarts, the guarantee does not carry over, because the new session cannot recover the previous session's state, so "no loss, no duplication" across sessions cannot be achieved this way.

2) Idempotence does not span multiple topic-partitions; it is guaranteed only within a single partition. When multiple topic-partitions are involved, their states are not synchronized (that requires transactions).
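A toy model of the mechanism behind this: the broker tracks a sequence number per producer id and partition, and drops retried duplicates (the real broker additionally requires sequences to arrive strictly in order):

```python
class Partition:
    def __init__(self):
        self.log = []
        self.last_seq = {}             # producer id -> last sequence seen

    def append(self, pid, seq, msg):
        if self.last_seq.get(pid, -1) >= seq:
            return False               # duplicate retry: drop it
        self.last_seq[pid] = seq
        self.log.append(msg)
        return True

p = Partition()
p.append(pid=1, seq=0, msg="m0")
p.append(pid=1, seq=1, msg="m1")
p.append(pid=1, seq=1, msg="m1")       # network retry of seq 1
print(p.log)  # -> ['m0', 'm1']
```

A restarted producer receives a new producer id, which is exactly why the guarantee does not survive across sessions.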

Related introductory Kafka material: https://blog.csdn.net/Poolweet_/article/details/109246515


Origin blog.csdn.net/Poolweet_/article/details/109312019