Common Kafka big data interview questions


1. Differences between Kafka and traditional message queues

  • First, Kafka partitions the messages it receives: each topic is split into multiple partitions, so on the one hand message storage is not limited by the capacity of a single server, and on the other hand message processing can run in parallel across multiple servers
  • Second, to ensure high availability, each partition has a certain number of replicas, so if some servers become unavailable, the servers holding the replicas take over and the application keeps running
  • In addition, Kafka guarantees ordered consumption of messages within a partition
  • Kafka also has the concept of consumer groups: each partition can be consumed by only one consumer within a given group, but can be consumed by multiple groups (a topic-creation sketch follows this list)
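As a rough illustration of these concepts, the sketch below creates a topic with several partitions and replicas using Kafka's Java AdminClient; the broker address, topic name, and counts are placeholder assumptions, not values from the article.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread storage and processing across brokers;
            // replication factor 3 keeps the partition available if a broker fails
            NewTopic topic = new NewTopic("demo-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

Each consumer group then gets its own view of this topic: within one group a partition is read by only one consumer, while another group consumes the same partitions independently.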

Comparison with RabbitMQ:

  • 1. Architecture model
  • RabbitMQ follows the AMQP protocol; its broker consists of exchanges, bindings, and queues, where the exchange and binding route a message by its routing key. The producer communicates with the server over a connection channel, and the consumer gets messages from a queue (a long-lived connection: when the queue has messages they are pushed to the consumer, which reads from the input stream in a loop). RabbitMQ is broker-centric and has a message-acknowledgement mechanism
  • Kafka follows the general MQ structure of producer, broker, and consumer; consumers pull data from the broker in batches according to their consumption position, and there is no per-message acknowledgement mechanism
  • 2. Throughput
  • Kafka has high throughput: internally it uses message batching and a zero-copy mechanism, and data is stored and fetched with sequential batch operations on local disk, giving O(1) complexity and very efficient message processing
  • RabbitMQ's throughput is somewhat lower than Kafka's because their goals differ: RabbitMQ focuses on reliable delivery, supports transactions, and does not support batch operations; depending on the reliability requirements, storage can be in memory or on disk
  • 3. Availability
  • RabbitMQ supports mirrored queues: if the primary queue fails, a mirror queue takes over
  • Kafka's brokers support a leader/follower (active-standby) replica model
  • 4. Cluster load balancing
  • Kafka uses ZooKeeper to manage the brokers and consumers in the cluster and registers topics in ZooKeeper. Through ZooKeeper's coordination mechanism, the producer keeps the broker information for the relevant topic and can send to brokers either randomly or round-robin; the producer can also specify a partition based on message semantics so that the message is sent to a particular shard on a broker

2. Kafka application scenarios

  • Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of a consumer-scale website. Simply put, Kafka is like a mailbox: the producer is the person sending mail, the consumer is the person receiving it, and Kafka stores the messages while also providing some mechanisms for processing them. Typical use cases include:
  • Log collection: a company can use Kafka to collect the logs of its various services and expose them to different consumers through a unified interface service
  • Messaging system: decoupling producers from consumers, buffering messages, and so on
  • User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These activity events are published by the various servers to Kafka topics, which consumers then subscribe to for real-time monitoring and analysis; they can also be saved to a database
  • Operational metrics: Kafka is also often used to record operational monitoring data, including collecting metrics from various distributed applications and producing centralized feedback such as alarms and reports
  • Stream processing: for example with Spark Streaming or Storm

3. How does Kafka avoid message loss and message duplication under high concurrency?

1. Message loss solution

  • 1) Rate-limit sends to Kafka
  • 2) Enable the retry mechanism and set a longer retry interval
  • 3) Set acks=all on the producer, i.e. a send is considered successful only after all replicas in the partition's ISR have confirmed receipt of the message (see the producer sketch below)
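A minimal producer sketch for the settings above, assuming a placeholder broker and topic; the retry count and backoff interval are illustrative values, not recommendations from the article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NoLossProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");              // wait for every replica in the ISR to acknowledge
        props.put("retries", "10");            // enable the retry mechanism
        props.put("retry.backoff.ms", "1000"); // longer interval between retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // the send ultimately failed even after retries: log it, do not drop it silently
                    exception.printStackTrace();
                }
            });
            producer.flush();
        }
    }
}
```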

2. Message Duplication Solution

  • 1) Identify each message with a unique id
  • 2) Producer: acks=all gives at-least-once delivery
  • 3) Consumer: commit offsets manually, and only after the business logic has been processed successfully
  • 4) When landing data into a table, use a primary key or unique index to avoid duplicate rows
  • 5) In the business logic: store the unique key in Redis or MongoDB, first query whether it already exists and skip processing if it does; if it does not exist, insert it into Redis/MongoDB first and then run the business logic (a consumer sketch follows this list)
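The sketch below combines points 1), 3), and 5): offsets are committed manually only after processing, and a unique message id is checked before processing. An in-memory set stands in for the Redis/MongoDB lookup; the broker, topic, and group names are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DedupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("group.id", "demo-group");            // hypothetical consumer group
        props.put("enable.auto.commit", "false");       // commit only after processing succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Set<String> processedIds = new HashSet<>();     // stand-in for a Redis/MongoDB lookup
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String uniqueId = record.key();     // unique business id carried with the message
                    if (processedIds.add(uniqueId)) {
                        // not seen before: run the business logic
                        System.out.println("processing " + record.value());
                    }                                   // seen before: skip the duplicate
                }
                consumer.commitSync();                  // commit offsets only after the batch is handled
            }
        }
    }
}
```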

4. How does Kafka-to-Spark-Streaming ensure data integrity, and how is repeated consumption avoided?

Ensuring data is not lost (at-least-once)

  • The internal mechanisms of Spark RDDs can guarantee at-least-once semantics for the data
  • Receiver mode: turn on the WAL (write-ahead log) so that data received from Kafka is written to a log file, and all data can be recovered after a failure

Direct mode:

  • a. Rely on the checkpoint mechanism
  • b. To guarantee that data is not processed twice, i.e. exactly-once semantics:
  • Idempotent operations: repeated execution causes no problems, so no extra work is needed to avoid duplicates
  • Add transactional handling in the business code: for the data of each partition, generate a uniqueId; only if all the data of that partition has been consumed completely is it considered successful, otherwise it is rolled back. The next time this uniqueId is encountered, if it has already been processed successfully, skip it (a direct-stream sketch follows this list)
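A sketch of the direct approach in Java, based on the spark-streaming-kafka-0-10 integration: auto-commit is disabled and offsets are committed back to Kafka only after the batch has been processed, ideally by an idempotent or transactional sink as described above. Broker, topic, and group names are placeholder assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DirectStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("direct-stream-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "demo-group");             // hypothetical consumer group
        kafkaParams.put("enable.auto.commit", false);          // commit manually after processing

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("demo-topic"), kafkaParams));

        stream.foreachRDD(rdd -> {
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            // ... process the batch idempotently (e.g. upsert keyed by a uniqueId) ...
            // commit offsets back to Kafka only once the batch has succeeded
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```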

5. Differences between Kafka's high-level and low-level consumer APIs

  • Kafka provides two consumer APIs: the high-level Consumer API and the SimpleConsumer API. The high-level API provides a high-level abstraction for consuming data from Kafka, while the SimpleConsumer API requires the developer to handle more details

1.The high-level Consumer API

  • The high-level Consumer API provides consumer-group semantics: a message is consumed by only one consumer in the group, and the consumer does not track offsets while consuming; the last offset is stored by ZooKeeper
  • When using the high-level consumer API in a multi-threaded application, note:
  • 1) If the number of consumer threads is greater than the number of partitions, some threads will not receive any messages
  • 2) If the number of partitions is greater than the number of threads, some threads will receive messages from multiple partitions
  • 3) If a thread consumes multiple partitions, the order in which you receive messages across partitions cannot be guaranteed; messages within a single partition are ordered

2.The SimpleConsumer API

  • If you want finer control over partitions, you should use the SimpleConsumer API, for example to:
  • 1) Read a message multiple times
  • 2) Consume only part of the messages in a partition
  • 3) Use transactions to ensure a message is consumed exactly once
  • However, with this API the partition, offset, broker, leader, and so on are no longer transparent to you; you need to manage them yourself, which requires a lot of extra work:
  • Track offsets in the application to determine which message should be consumed next
  • Find out programmatically who the leader of each partition is
  • Handle leader changes (a sketch of the equivalent manual control with the modern Java client follows)
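With the modern Java client, a rough equivalent of this "manage everything yourself" style is to use assign() and seek() on KafkaConsumer, as sketched below with placeholder names and an arbitrary offset; it bypasses group coordination entirely.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ManualPartitionConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0); // you pick the partition yourself
            consumer.assign(Collections.singletonList(tp));          // no group coordination, no rebalance
            consumer.seek(tp, 42L);                                  // re-read from an arbitrary offset you track
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```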

6. How does Kafka guarantee that data is consumed once and only once?

  • Idempotent producer: guarantees that a message is written to a single partition only once, so there are no duplicate messages
  • Transactions: guarantee atomic writes across multiple partitions, i.e. messages written to multiple partitions either all succeed or are all rolled back (see the sketch below)
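A minimal transactional-producer sketch of the two mechanisms above; the transactional.id, topics, and broker address are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");         // idempotent producer: no duplicates within a partition
        props.put("transactional.id", "demo-tx-1");      // required to use transactions

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("topic-a", "k1", "v1"));
            producer.send(new ProducerRecord<>("topic-b", "k2", "v2"));
            producer.commitTransaction();                // atomic across both partitions/topics
        } catch (Exception e) {
            producer.abortTransaction();                 // roll everything back on failure
        } finally {
            producer.close();
        }
    }
}
```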

7. Kafka guarantees data consistency and reliability

1. Data consistency guarantee

  • Definition of consistency: if a message is visible to the client, then even if the leader dies, the data can still be read from the new leader
  • HW (HighWaterMark): the largest message offset that the client can read from the leader, i.e. the largest externally visible offset; HW = min(LEO of all replicas in the ISR)
  • A new message received by the leader cannot be consumed by clients immediately. The leader waits until the message has been replicated by all replicas in the ISR and the HW has been updated; only then can it be consumed. This guarantees that if the leader fails, the message can still be obtained from the newly elected leader
  • Read requests from other brokers (followers) are not restricted by the HW. Each follower also maintains its own HW, where Follower.HW = min(Leader.HW, Follower.offset)

2. Data reliability guarantee

  • When the producer sends data to the leader, the level of data reliability can be set through the acks parameter:
  • 0: the server sends no response to the producer regardless of whether the write succeeded; if an exception occurs, the server terminates the connection, which triggers the producer to refresh its metadata
  • 1: the response is sent after the leader has written successfully; in this case, if the leader then fails, data can be lost
  • -1 (all): wait for all replicas in the ISR to receive the message before sending the response to the producer; this is the strongest guarantee

8. The Spark real-time job is down and data in the specified Kafka topic is piling up. What should I do?

  • 1. spark.streaming.concurrentJobs=10: increase the number of concurrent jobs. From the source code you can see that this parameter actually sets the number of core threads of a thread pool; when not specified, the default is 1
  • 2. spark.streaming.kafka.maxRatePerPartition=2000: set the maximum number of records fetched per partition per second, to control the amount of data processed and keep processing even
  • 3. spark.streaming.kafka.maxRetries=50: increase the number of retries when fetching topic partition leaders and their offsets
  • 4. Configure retries at the application level: spark.yarn.maxAppAttempts=5, which must not exceed yarn.resourcemanager.am.max-attempts in the Hadoop cluster
  • 5. Set a validity interval for failed attempts: spark.yarn.am.attemptFailuresValidityInterval=1h (a configuration sketch follows this list)
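These settings would typically be passed via spark-submit --conf; as a sketch in code, with the values quoted above (they are the article's examples, not tuned recommendations):

```java
import org.apache.spark.SparkConf;

public class BacklogTuningConfSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("backlog-tuning-sketch")
                .set("spark.streaming.concurrentJobs", "10")                 // more concurrent jobs
                .set("spark.streaming.kafka.maxRatePerPartition", "2000")    // cap records per partition per second
                .set("spark.streaming.kafka.maxRetries", "50")               // retries when fetching leaders/offsets
                .set("spark.yarn.maxAppAttempts", "5")                       // application-level retries on YARN
                .set("spark.yarn.am.attemptFailuresValidityInterval", "1h"); // window in which failures are counted
        // the conf would then be used to build the StreamingContext
    }
}
```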

9. Kafka read and write process

1. Writing process

  • 1) Connect to the ZooKeeper cluster and obtain from it the partition information of the topic and the leader of each partition
  • 2) Connect to the broker hosting the corresponding leader
  • 3) Send the message to the leader of the partition
  • 4) The other followers copy the data from the leader
  • 5) Acks are returned in turn
  • 6) The write is committed only after all replicas in the ISR have finished writing, and then the write process ends
  • Because this describes the write path, the heartbeat communication between the replicas and ZooKeeper is not shown. The heartbeats ensure Kafka's high availability: once the leader hangs, or a follower's synchronization times out or becomes too slow, this is reported to ZooKeeper through the heartbeat information, and ZooKeeper elects a new leader or moves the follower from the ISR to the OSR

2. Reading process

  • 1) Connect to the ZooKeeper cluster and obtain from it the partition information of the topic and the leader of each partition
  • 2) Connect to the broker hosting the corresponding leader
  • 3) The consumer sends its saved offset to the leader
  • 4) The leader locates the segment (index file and log file) according to the offset and other information
  • 5) Using the contents of the index file, it locates the starting position of the offset in the log file, reads the data of the corresponding length, and returns it to the consumer

10. Why does Kafka only let the leader handle reads and writes?

  • In Kafka only the leader is responsible for reading and writing; the followers are only responsible for backup. Kafka dynamically maintains a set of in-sync replicas (ISR); with f+1 nodes in the ISR it can tolerate f node failures without losing messages while continuing to serve normally. ISR membership is dynamic: if a node is removed and later catches back up to the "in-sync" state, it can rejoin the ISR. So if the leader goes down, a new leader is simply chosen from the ISR
  • After Kafka introduced replication, the same partition may have multiple replicas, and a leader must be elected among them. The producer and consumer interact only with this leader, and the other replicas copy data from it as followers, because data consistency must be maintained among the replicas of a partition (after one goes down, the others must be able to keep serving without duplicating or losing data). If there were no leader and all replicas could read and write at the same time, the replicas would have to synchronize data with each other over n*n paths; guaranteeing consistency and ordering would be very hard, the complexity of the replication implementation would rise greatly, and so would the chance of anomalies. With a leader, only the leader handles reads and writes, and the followers simply fetch data from it in order (n paths), so the system is simpler and more efficient

11. To prevent disks from filling up, Kafka periodically deletes old messages. What are the deletion strategies?

  • Kafka has two retention strategies
  • One is based on message retention time: when a message has been stored in Kafka longer than the specified time, it can be deleted
  • The other is based on the amount of data stored in the topic: when the size of the log files occupied by the topic exceeds a threshold, Kafka starts deleting the oldest messages
  • Kafka starts a background thread that periodically checks whether there are messages that can be deleted
  • The retention policy configuration is very flexible: there is a global configuration, and it can also be overridden per topic (a per-topic override sketch follows this list)
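As a sketch of a per-topic override, the AdminClient can set time- and size-based retention on one topic; the broker address, topic name, and thresholds are placeholder assumptions.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionOverrideSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Collections.singletonMap(topic, Arrays.asList(
                    // time-based retention: delete messages older than 7 days
                    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),
                    // size-based retention: cap each partition's log at ~1 GiB
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```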

12. The principle of Kafka data high availability

1. Data storage format

  • Kafka's high reliability comes from its robust replication strategy. A topic can be divided into multiple partitions, and a partition physically consists of multiple segments
  • A segment consists of two parts: an index file and a data file. The index file stores metadata that records the position of messages (by offset) in the data file; messages have a fixed physical structure, which guarantees that the correct length is read
  • The benefit of segment files is that they make cleaning up expired data easy: expired segments only need to be deleted as a whole. Messages are written in an append-only manner, sequentially to disk, which greatly improves efficiency
  • Reading the message at a given offset therefore becomes: binary-search for the segment containing the offset, then use that segment's index file to find the physical position of the offset in the data file, and read the data (an illustrative lookup sketch follows)
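A purely illustrative sketch of that two-step lookup, not Kafka's actual implementation: a sorted map of segment base offsets stands in for the segment list, and each segment's sparse index maps relative offsets to byte positions in the data file.

```java
import java.util.Map;
import java.util.TreeMap;

public class OffsetLookupSketch {
    // base offset of each segment -> sparse index (relative offset -> byte position in the data file)
    private final TreeMap<Long, TreeMap<Long, Long>> segments = new TreeMap<>();

    /** Returns the byte position at (or just before) the target offset. */
    long locate(long targetOffset) {
        // step 1: find the segment whose base offset is the largest one <= targetOffset (log-time search)
        Map.Entry<Long, TreeMap<Long, Long>> segment = segments.floorEntry(targetOffset);
        if (segment == null) {
            throw new IllegalArgumentException("offset is below the log start offset");
        }
        // step 2: in that segment's sparse index, find the closest indexed entry <= the relative offset;
        // the real log then scans forward from this position to the exact message (scan omitted here)
        Map.Entry<Long, Long> indexEntry = segment.getValue().floorEntry(targetOffset - segment.getKey());
        return indexEntry == null ? 0L : indexEntry.getValue();
    }
}
```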

2. Replica replication and synchronization

  • Viewed from the outside, a partition is like an ever-growing array storing messages, and each partition has a file similar to a MySQL binlog that records data writes. Two terms matter here: HW (HighWatermark) is the offset position that consumers of the partition can currently see, and LEO (LogEndOffset) is the offset of the latest message in the partition; each replica maintains these separately. To improve message reliability, a partition has n replicas
  • Among the n replicas there is one leader and n-1 followers. Kafka write operations are performed only on the leader replica. There are usually two ways to replicate such writes:
  • 1) The leader returns success as soon as it has written its own log. With this approach, if a follower goes down before the data has been synchronized, the data is lost, but it is more efficient
  • 2) The leader waits until the followers have successfully written their logs and returned acks before returning success. With this approach, if the leader goes down, the newly elected leader has the same data as the failed one, so no data is lost, but waiting for the followers makes it slower. Usually a majority-vote election scheme is used: to tolerate f replica failures you need at least 2f+1 replicas with f+1 of them written successfully. Kafka does not use this mechanism; instead it implements the ISR (In-Sync Replicas) mechanism

13. Where are Kafka offsets stored, and why?

  • Since Kafka 0.9, consumer group and offset information is no longer stored in ZooKeeper but on the broker side. So if you assign a consumer group name (group.id) to a consumer, then once the consumer starts, the consumer group name and the offsets of the topics it consumes are recorded on the broker

1. Overview

  • Since Kafka 0.10.1.1, consumed offsets are by default stored in a Kafka topic named __consumer_offsets. In fact, storing consumed offsets in this topic has been supported since 0.8.2.2, but at that time the default was still to store them in the ZooKeeper cluster. Now the official default is to store consumed offsets in the Kafka topic, while the ZooKeeper-based storage is still available and can be selected through the offsets.storage property

2. Content

  • This official recommendation is reasonable. In earlier versions Kafka had a fairly large hidden risk: it used ZooKeeper to store and track the consumption progress of every consumer/group. Although some optimizations are made for us along the way, consumers still had to interact with ZooKeeper frequently, and frequently writing to ZooKeeper through the zkClient API is itself a relatively inefficient operation; it is also a headache for later horizontal scaling. If the ZooKeeper cluster changes during that period, the throughput of the Kafka cluster is affected as well. Because of this, the Kafka developers proposed migrating offsets into Kafka very early on, but before the change it still defaulted to ZooKeeper storage and had to be switched manually; users not very familiar with Kafka simply accepted the default. In newer versions of Kafka, consumed offsets are stored by default in a topic named __consumer_offsets inside the Kafka cluster

14. How to ensure the order of Kafka messages

  • Kafka only guarantees that the messages within one partition are ordered when consumed by a consumer. From the topic's perspective, when there are multiple partitions, messages are still not globally ordered (a keyed-producer sketch follows)
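When per-key ordering is enough, the usual workaround is to send related messages with the same key so they land in the same partition and keep their relative order; the topic, keys, and values below are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // same key "order-42" -> same partition -> these three events stay in order for that key;
            // messages with other keys may go to other partitions and are not ordered relative to these
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.send(new ProducerRecord<>("orders", "order-42", "paid"));
            producer.send(new ProducerRecord<>("orders", "order-42", "shipped"));
        }
    }
}
```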

15. Number of Kafka partitions

  • More partitions is not always better. Generally, the number of partitions should not exceed the number of machines in the cluster: the more partitions, the more memory is occupied (ISR state, etc.), and the more partitions concentrated on one node, the bigger the impact on the system when that node goes down
  • The number of partitions is generally set to 3-10

16. Kafka partition allocation strategy

  • There are two default partition allocation strategies in Kafka: Range and RoundRobin

1.Range

  • Range is the default strategy and works per topic. First the partitions of the same topic are sorted by their numbers and the consumer threads are sorted alphabetically; then the number of partitions is divided by the total number of consumer threads to determine how many partitions each thread consumes. If it does not divide evenly, the first consumer threads each consume one extra partition
  • For example: we have 10 partitions, two consumers (C1, C2), and three consumer threads in total; 10 / 3 = 3 with a remainder of 1
  • C1-0 consumes partitions 0, 1, 2, 3
  • C2-0 consumes partitions 4, 5, 6
  • C2-1 consumes partitions 7, 8, 9 (see the arithmetic sketch below)
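The arithmetic behind that example, as a small illustrative sketch (the real assignor also sorts partitions and consumer threads first):

```java
public class RangeAssignmentSketch {
    public static void main(String[] args) {
        int partitions = 10, threads = 3;
        int per = partitions / threads;   // 3 partitions per thread
        int extra = partitions % threads; // 1 thread gets one extra partition
        for (int t = 0; t < threads; t++) {
            int start = t * per + Math.min(t, extra);
            int count = per + (t < extra ? 1 : 0);
            System.out.printf("thread %d -> partitions %d..%d%n", t, start, start + count - 1);
        }
        // prints: thread 0 -> 0..3, thread 1 -> 4..6, thread 2 -> 7..9
    }
}
```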

2.RoundRobin

  • Prerequisites: num.streams (the number of consumer threads) must be equal for all consumers in the same consumer group, and every consumer must subscribe to the same topics
  • All topic partitions are composed into a TopicAndPartition list, the list is sorted by hashCode, and the partitions are then handed out to the consumer threads in a round-robin manner

17. Kafka data volume calculation

  • Total data volume per day: 100 GB, with 100 million log records produced per day; 100,000,000 / 24 / 60 / 60 ≈ 1,150 records per second
  • Average per second: about 1,150 records
  • Off-peak per second: about 400 records
  • Peak per second: 1,150 × (2 to 20) = 2,300 to 23,000 records
  • Size of each log record: 0.5-2 KB
  • Data volume per second: roughly 2.3 MB-20 MB

18. Kafka messages are backlogged and Kafka's consumption capacity is insufficient; how to handle it?

  • If Kafka's consumption capacity is insufficient, consider increasing the number of topic partitions and, at the same time, increasing the number of consumers in the consumer group so that the number of consumers equals the number of partitions (a sketch of growing the partition count follows this list)
  • If downstream processing is not keeping up, increase the number of records pulled per batch. If each batch pulls too little (data pulled / processing time < production rate), the amount processed stays below the amount produced, which also causes a backlog
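A sketch of the first remedy: growing the partition count of an existing topic with the AdminClient so more consumers in the group can work in parallel. The broker address, topic name, and target count are placeholders; note that partition counts can only be increased, never decreased.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class GrowPartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // raise the topic's total partition count to 12, then scale the consumer group to 12 instances
            admin.createPartitions(Collections.singletonMap("demo-topic", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```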

19. How Kafka achieves high throughput

1. Sequential read and write

  • Kafka messages are continuously appended to files. This lets Kafka make full use of the disk's sequential read/write performance: sequential I/O does not need the seek time of the disk head and requires only a little rotational latency, so it is much faster than random I/O

2. Zero copy

  • After Linux kernel 2.2, a system call mechanism called "zero copy" appeared: it skips the copy into the user-space buffer and establishes a direct mapping between disk space and memory, so data is no longer copied into the user-mode buffer. Zero copy does not mean no copying at all; it reduces the number of unnecessary copies, usually in the I/O read/write path. With zero-copy technology, the data of a disk file only needs to be copied into the page cache once, and is then sent from the page cache directly to the network

3. Partition

  • Kafka's topics are divided into multiple partitions, and each partition is divided into multiple segments, so the messages of a queue are actually stored across many segment files. Every file operation works on a small file, which is lightweight and also increases the capacity for parallel processing

4. Batch send

  • Kafka can send messages in batches: messages are first buffered in memory and then sent out together in a single request. For example, the producer can be configured to send once the buffered messages reach a certain count, or after a fixed amount of time, e.g. every 100 messages or every 5 seconds. This strategy greatly reduces the number of I/O operations on the brokers (a configuration sketch covering batching and compression follows the next section)

5. Data compression

  • Kafka also supports compressing message sets: the producer can compress them in GZIP or Snappy format, which reduces the amount of data transferred and relieves pressure on the network
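Batching (previous section) and compression are both plain producer configuration; a sketch with illustrative values, not tuning advice from the article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class BatchingCompressionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "16384");        // send a batch once it reaches ~16 KB
        props.put("linger.ms", "5");             // or after waiting up to 5 ms for more records
        props.put("compression.type", "snappy"); // compress each batch before it goes on the wire
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send records; the client batches and compresses them transparently ...
        producer.close();
    }
}
```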

6. Consumer load balancing

  • When a consumer joins or leaves a group, a partition rebalance is triggered. The ultimate goal of the rebalance is to improve the topic's concurrent consumption capacity


Origin blog.csdn.net/sun_0128/article/details/108069129