Big-tech interviewers love asking about Kafka, and I got grilled with eight Kafka questions in a row

During interviews I found that many interviewers especially like to ask Kafka-related questions. That is not hard to understand: Kafka is the undisputed king of message queues in the big data field, with single-machine throughput of around 100,000 messages per second and millisecond-level latency. Who could resist a natural-born distributed message queue like this?

In a recent interview, the interviewer saw that Kafka was listed on a project in my resume, went straight at Kafka, and asked almost nothing else. Let's take a look at the interviewer's eight consecutive Kafka questions:

(The following answers were compiled after the interview; I only managed to answer about one third of them during the actual interview.)

1. Why use Kafka?

  1. Buffering and peak shaving: when upstream data arrives in bursts, the downstream may not be able to keep up, or there may not be enough downstream machines to guarantee redundancy. Kafka acts as a buffer in the middle, temporarily holding the messages so that the downstream services can process them at their own pace.

  2. Decoupling and scalability: at the start of a project the exact requirements are often unclear. The message queue can serve as an interface layer that decouples the key business processes; as long as both sides follow the agreed contract, you can program against the data and gain the ability to scale later.

  3. Redundancy: a one-to-many model can be used, where one producer publishes a message that is consumed by multiple services subscribed to the topic, serving several unrelated businesses.

  4. Robustness: the message queue can buffer requests, so even if the consumer-side business is down for a short time, the main business keeps running normally.

  5. Asynchronous communication: in many cases users do not want or need to process a message immediately. The message queue provides an asynchronous processing mechanism that lets you put a message into the queue without processing it right away: put in as many messages as you like, and process them when needed.

2. How do you re-consume messages that Kafka has already consumed?

The offsets of consumed Kafka messages are stored in ZooKeeper. If you want to consume Kafka messages again, you can record offset checkpoints in Redis. When you want to replay the messages, read the checkpoint from Redis and reset the offset stored in ZooKeeper to it, which achieves repeated consumption of the messages (see the sketch below).
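For illustration, here is a minimal sketch of the same idea using the modern Kafka Java consumer API instead of ZooKeeper: the checkpoint offset that would normally be read back from Redis is hard-coded, and the broker address, topic name, and partition are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromCheckpointDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long checkpointOffset = 42L; // in practice this would be read from Redis or another store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("demo-topic", 0); // assumed topic/partition
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, checkpointOffset); // rewind to the saved checkpoint
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```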

3. Is Kafka's data stored on disk or in memory, and why is it so fast?

Kafka uses disk storage.

It is fast because:

  1. Sequential writes: a hard disk is a mechanical device, and every read or write involves seeking before the actual transfer; seeking is a "mechanical action" and is the time-consuming part. Disks therefore "hate" random I/O and prefer sequential I/O. To get the most out of the disk, Kafka writes data sequentially.
  2. Memory-mapped files: on a 64-bit operating system a process can generally map data files of around 20 GB. The idea is to use the operating system's pages to map a file directly onto physical memory; once the mapping is established, operations on that memory are synchronized to disk by the OS (see the sketch after this list).
  3. Efficient file storage design: Kafka splits a partition's large log into many smaller segment files. Small segments make it easy to periodically clean up or delete segments that have already been consumed, reducing disk usage. Index information lets Kafka quickly locate a message and determine the size of the response. Mapping all of the index metadata into memory (as memory-mapped files) avoids disk I/O on the segment index, and storing the index sparsely greatly reduces the space taken by index metadata.
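As an illustration of the memory-mapped-file idea (not Kafka's actual code), here is a minimal Java sketch that maps a log file into memory with FileChannel.map and appends to it; the file name and mapping size are just examples.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapAppendDemo {
    public static void main(String[] args) throws Exception {
        // Map a region of the log file directly into memory; writes land in the
        // page cache and the OS flushes them to disk in the background.
        try (RandomAccessFile file = new RandomAccessFile("00000000000000000000.log", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buffer.put("hello kafka".getBytes(StandardCharsets.UTF_8)); // sequential append
            buffer.force(); // optional explicit flush, analogous to fsync
        }
    }
}
```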

Note:

  1. One of the ways Kafka improves query efficiency is to segment the data files. For example, suppose there are 100 messages with offsets 0 to 99 and the data file is split into 5 segments: the first covers offsets 0-19, the second 20-39, and so on. Each segment lives in a separate data file named after the smallest offset it contains. When looking up a message with a given offset, a binary search over these base offsets locates the segment the message is in (see the sketch after this note).
  2. Building an index for each segment: segmentation narrows the search for an offset down to a smaller data file, but a sequential scan within that file would still be needed. To further improve lookup efficiency, Kafka creates an index file for each segment data file, with the same name as the data file and the extension .index.
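A minimal sketch of the binary search described above, assuming the example split of 100 messages into 5 segments; it only finds the base offset of the segment that holds the target message, which is what the segment file naming enables.

```java
import java.util.Arrays;

public class SegmentLookupDemo {
    // Base offsets of the segment files, e.g. 0, 20, 40, 60, 80 in the example above.
    static long findSegmentBaseOffset(long[] baseOffsets, long targetOffset) {
        int idx = Arrays.binarySearch(baseOffsets, targetOffset);
        // An exact hit means the target is the first message of that segment;
        // otherwise binarySearch returns (-(insertionPoint) - 1), and the target
        // lives in the segment just before the insertion point.
        return idx >= 0 ? baseOffsets[idx] : baseOffsets[-idx - 2];
    }

    public static void main(String[] args) {
        long[] segments = {0, 20, 40, 60, 80};
        System.out.println(findSegmentBaseOffset(segments, 35)); // prints 20
    }
}
```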

4. How does Kafka avoid losing data?

This breaks down into three parts: the producer side, the consumer side, and the broker side.

  1. No loss of producer data

Kafka's ack mechanism: every time the producer sends a message there is an acknowledgement mechanism to confirm that the message was received; the acknowledgement level can be 0, 1, or -1.

In synchronous mode:  
Setting ack to 0 is very risky and generally not recommended. Even with ack set to 1, data can still be lost if the leader goes down. So to strictly guarantee that no data is lost on the producer side, set ack to -1.

In asynchronous mode:  
The ack setting still matters. In addition, asynchronous mode sends data through a buffer, controlled by two thresholds: a time limit and a message count. If the buffer fills up before the data has been sent out, there is a configuration option that decides whether to clear the buffer immediately; setting it to -1 blocks permanently, meaning no more data is produced. Even with -1, data can still be lost in asynchronous mode through careless operations such as kill -9, but that is an exceptional case.

Note:  
ack=0: the producer does not wait for the broker to confirm synchronization and immediately sends the next (batch of) messages.  
ack=1 (default): the producer waits for the leader to successfully receive the data and confirm it before sending the next message.  
ack=-1: the producer sends the next piece of data only after receiving confirmation from the followers as well.
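For reference, here is a minimal sketch of a producer configured against data loss with the current Java client, where acks=all is equivalent to ack=-1; the broker address and topic name are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // same as ack=-1: wait for the in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, 3);  // retry transient send failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Handle the failure instead of silently dropping the record.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```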

  2. No loss of consumer data

Offset commits are what guarantee that data is not lost. Kafka records the offset of each consumed message, and the next time consumption resumes it continues from the last committed offset.

In early versions of Kafka the offset information was saved in ZooKeeper; later versions save it in an internal topic. Even if a consumer crashes while running, on restart it finds the committed offset, locates where it previously stopped, and continues consuming from there. Because offsets are not committed after every single message, this can lead to repeated consumption, but messages are not lost.
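A minimal sketch of this at-least-once pattern with the Java consumer: auto-commit is turned off and the offset is committed only after the records have been processed, so a crash causes replay rather than loss. The broker address, topic, and group id are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // business logic first...
                }
                consumer.commitSync(); // ...then commit, so a crash replays rather than loses messages
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```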

The only exception is when two consumer groups that were originally meant to do different jobs are given the same group id (for example via KafkaSpoutConfig.builder...setGroupId in a Storm Kafka spout). In that case the two groups end up sharing the same data: group A consumes the messages in partition1 and partition2 while group B consumes the messages in partition3, so each group sees only part of the messages and its data is incomplete. To guarantee that each group gets its own full copy of the message data, group ids must not be duplicated.

  3. No loss of broker data in the Kafka cluster

We generally set a replication factor (number of replicas) for each partition on the brokers. When the producer writes, the data first goes to the leader according to the distribution strategy (by specified partition, by key, or round-robin if neither is given), and the followers (replicas) then synchronize the data from the leader. With this backup in place, message data is not lost.
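A minimal sketch of creating a replicated topic programmatically with the Kafka AdminClient; the 3 partitions, replication factor of 3, and min.insync.replicas=2 are example values, and the broker address and topic name are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ReplicatedTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, 3 replicas; require at least 2 in-sync replicas
            // before a write with acks=all is acknowledged.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```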

5. Why choose Kafka for data collection?

The collection layer can mainly be built with technologies such as Flume and Kafka.

Flume: Flume follows a pipeline-flow model, provides many default implementations, and lets users deploy through configuration parameters and extend it through its API.

Kafka: Kafka is a durable distributed message queue and a very general-purpose system: you can have many producers and many consumers sharing multiple topics.

In contrast, Flume is a purpose-built tool for delivering data to HDFS and HBase. It has special optimizations for HDFS and integrates with Hadoop's security features.

Therefore, Cloudera recommends Kafka when the data will be consumed by multiple systems, and Flume when the data is destined for Hadoop.

6. Will restarting Kafka cause data loss?

  1. Kafka writes its data to disk, so in general data will not be lost.
  2. However, during a Kafka restart, if consumers are still consuming messages and the offsets have not been committed in time, the data may end up inaccurate (lost or consumed repeatedly).

7. What should you do if Kafka goes down?

  1. First consider whether the business is affected

When Kafka goes down, the first thing to consider is whether the service we provide is affected by the downed machine. If the cluster's disaster-recovery mechanism is properly in place, there is nothing to worry about on this front.

  2. Troubleshoot and recover the node

To recover the cluster node, the main step is to analyze the logs to find the cause of the crash, fix the problem, and bring the node back up.

8. Why does Kafka not support read-write separation?

In Kafka, both the producers writing messages and the consumers reading messages interact with the leader replica, which gives a leader-write, leader-read production and consumption model.
Kafka does not support writing to the leader while reading from followers, because that design has two obvious disadvantages:

  1. Data consistency problem: there is a delay window while data propagates from the master node to the slave node, and during that window the data on the two nodes is inconsistent. Suppose at some moment the value of entry A is X on both the master and the slave, and then A is changed to Y on the master; until the change is propagated to the slave, an application reading A from the slave sees the stale value X instead of the latest Y, which is a data inconsistency.

  2. Delay problem: for a component like Redis, writing data to the master and synchronizing it to a slave goes through the stages network → master memory → network → slave memory, which already takes time. In Kafka, master-slave synchronization is more expensive: it goes through network → leader memory → leader disk → network → follower memory → follower disk. For delay-sensitive applications, writing to the leader while reading from followers is therefore not a good fit.

Meanwhile, Kafka's leader-write, leader-read design has many advantages:

  1. It simplifies the implementation logic of the code and reduces the chance of errors;  
  2. Load is distributed at a fine granularity and evenly; compared with leader-write, follower-read, not only is load performance better, it is also controllable by the user;
  3. Reads are not affected by replication delay;
  4. When the replicas are stable, there is no data inconsistency.

Search the public account "Learning Big Data in Five Minutes" to delve into big data technology

