Kafka common interview questions

1. What is Kafka?

Kafka is a distributed publish-subscribe messaging system, originally developed at LinkedIn and later donated to the Apache project. It is a distributed, partitioned, replicated persistent log service, used mainly for processing streaming data.

2. Why use Kafka? Why use a message queue?

Buffering and peak shaving: upstream traffic may burst beyond what the downstream can carry, and the downstream may not have enough machines to guarantee redundancy. Kafka can act as an intermediate buffer: messages are temporarily stored in Kafka, and downstream services can process them slowly at their own pace.
Decoupling and extensibility: at the start of a project, concrete requirements cannot be pinned down. A message queue can serve as an interface layer that decouples important business processes; as long as both sides follow the agreed contract, you gain the ability to extend the system by programming against the data.
Redundancy: a one-to-many pattern can be adopted, where a producer publishes to a topic that multiple services subscribe to and consume, each for unrelated business purposes.
Robustness: the message queue can absorb a backlog of requests, so even if the consumer-side business dies for a short time, the main business flow is not affected.
Asynchronous communication: in many cases, users neither want nor need to process a message immediately. A message queue provides an asynchronous processing mechanism that lets users put a message on the queue without handling it right away: put as many messages on the queue as needed, then process them when required.

3. What do ISR and AR stand for in Kafka? What does ISR expansion and shrinkage refer to?

ISR: In-Sync Replicas, the set of replicas in sync with the leader
AR: Assigned Replicas, all replicas of a partition
The ISR is maintained by the leader. A follower that lags behind the leader beyond a threshold (lag is measured along two dimensions, replica.lag.time.max.ms and replica.lag.max.messages; the latest 0.10.x versions support only the replica.lag.time.max.ms dimension) is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list; newly added followers also start out in the OSR. AR = ISR + OSR.
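The time-based eviction rule above can be sketched as a minimal Python function. This is an illustration only, not Kafka's actual bookkeeping; the threshold value and the "last caught up" timestamps are assumptions:

```python
REPLICA_LAG_TIME_MAX_MS = 10_000  # assumed value of replica.lag.time.max.ms

def partition_isr(followers, now_ms):
    """Split followers into (ISR, OSR) by how long ago each one
    last caught up to the leader's log end offset."""
    isr, osr = [], []
    for name, last_caught_up_ms in followers.items():
        if now_ms - last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS:
            isr.append(name)   # still within the allowed lag window
        else:
            osr.append(name)   # lagged too long: evicted from the ISR
    return isr, osr
```

Here a follower that caught up 5 seconds ago stays in the ISR, while one that caught up 16 seconds ago moves to the OSR, so AR (all followers) = ISR + OSR as the text states.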

4. What does a Kafka broker do?

A broker is a message proxy. Producers write messages to the designated topic on the brokers, consumers pull messages of the designated topic from the brokers and run their business processing, and the broker saves the messages in between, playing the role of an intermediate relay station.

5. What role does ZooKeeper play in Kafka? Can Kafka run without ZooKeeper?

ZooKeeper is a distributed coordination component. Early versions of Kafka used ZK for metadata storage, consumer group management, and offset values. Considering single-point risks in ZK itself and in the overall architecture, newer versions have gradually weakened ZooKeeper's role; the new consumer uses Kafka's internal group coordination protocol and relies less on ZooKeeper.
However, brokers still depend on ZK: in Kafka, ZooKeeper is also used to elect the controller, detect whether brokers are alive, and so on.

6. How does a Kafka follower synchronize data with the leader?

Kafka's replication mechanism is neither full synchronous replication nor simple asynchronous replication. Full synchronous replication requires all alive followers to have copied a message before it is considered committed, which badly hurts throughput. With asynchronous replication, followers copy data from the leader asynchronously and a message is considered committed as soon as it is written to the leader's log; in this case, if the leader crashes, data is lost. Kafka instead uses the ISR approach as a good balance, ensuring data is not lost without sacrificing throughput. Followers can copy data from the leader in batches, and the leader makes full use of sequential disk reads and the sendfile (zero-copy) mechanism, which greatly improves replication performance; writes to disk are batched internally, greatly reducing the message gap between followers and the leader.

7. Under what circumstances is a broker kicked out of the ISR?

The leader maintains a list of the replicas that keep in sync with it, called the ISR (In-Sync Replicas). Each partition has its own ISR, maintained dynamically by the leader. If a follower lags too far behind the leader, or has not sent a data-replication request for a certain period of time, the leader removes it from the ISR.

8. Why is Kafka so fast?
 
Page cache: Kafka makes heavy use of the operating system's filesystem page cache.

Sequential writes: thanks to the read-ahead and write-behind techniques provided by modern operating systems, sequential disk writes are in most cases even faster than random writes to memory.

Zero copy: the zero-copy technique reduces the number of data copies.

Batching of messages: small requests are merged, and data then flows in a streaming fashion, pushing straight toward the network limit.

Pull mode: messages are fetched for consumption in pull mode, matching the processing capacity of the consumer side.
 

9. How can a Kafka producer be tuned for faster writes?
 
Increase the number of threads

Increase batch.size

Add more producer instances

Increase the number of partitions

When setting acks = -1 increases latency, raise num.replica.fetchers (the number of threads followers use to replicate data) to compensate

For transmission across data centers: increase the socket buffer settings and the OS TCP buffer settings
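Purely as an illustration, the knobs above can be collected into a tuning profile. The property names are the standard producer/broker settings mentioned in the list; the values are arbitrary examples, not recommendations:

```python
# Hypothetical tuning profile (values are illustrative assumptions).
producer_tuning = {
    "batch.size": 64 * 1024,            # larger batches -> fewer requests
    "linger.ms": 10,                    # wait briefly so batches fill up
    "compression.type": "lz4",          # shrink payloads on the wire
    "acks": "-1",                       # durability, at the cost of latency
    "buffer.memory": 64 * 1024 * 1024,  # room for in-flight batches
}

broker_tuning = {
    # extra follower fetch threads to offset the latency of acks=-1
    "num.replica.fetchers": 4,
}
```

In practice these values should be derived from measured throughput and latency, not copied.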
 

10. When a Kafka producer sends data, what do acks = 0, 1, and -1 mean? With acks set to -1, under what circumstances does the leader consider a message committed?

1 (default): after sending data to Kafka, the send counts as successful once the leader acknowledges receipt. In this case, if the leader goes down, data is lost.
0: the producer just sends the data and does not wait for any response. This gives the highest transmission efficiency but the lowest reliability.
-1: the producer waits until all followers in the ISR have confirmed receipt before considering the send complete; this gives the highest reliability. Only when all replicas in the ISR have sent an ACK to the leader does the leader commit, and only at that point does the producer consider the message committed.
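A minimal sketch of that commit decision follows. The replica names and the explicit ACK set are assumptions for illustration; the real leader tracks replica log end offsets rather than per-message ACK sets:

```python
def leader_considers_committed(acks_mode, isr, acked_replicas):
    """Sketch of the commit decision for one message, assuming the
    leader itself is a member of `isr` and appends locally first."""
    if acks_mode == 0:       # producer never waits for anything
        return True
    if acks_mode == 1:       # the leader's own append is enough
        return "leader" in acked_replicas
    if acks_mode == -1:      # every ISR member must have the message
        return set(isr) <= set(acked_replicas)
    raise ValueError("acks must be 0, 1 or -1")
```

With acks = -1 and ISR = {leader, f1}, the message is not committed until f1 has also acknowledged it.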

11. What does Kafka's unclean configuration mean, and what effect does it have on Spark Streaming consumption?

When unclean.leader.election.enable is true, brokers outside the ISR set may also participate in leader election, which can lose data; the offsets the Spark Streaming consumer obtains may then suddenly become smaller, causing the Spark Streaming job to hang. If unclean.leader.election.enable is set to true, data loss and data inconsistency may occur and Kafka's reliability is reduced; if it is set to false, Kafka's availability is reduced.

12. What happens if the leader crashes while the ISR is empty?

Kafka provides a broker-side configuration parameter, unclean.leader.election, which takes two values:
true (default): allow an out-of-sync replica to become the leader. Because its messages lag behind, inconsistencies may occur once it becomes the leader.
false: do not allow an out-of-sync replica to become the leader. In this case, if the ISR list is empty, Kafka keeps waiting for the old leader to recover, reducing availability.
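The resulting leader choice when the current leader crashes can be sketched as follows (broker names are hypothetical; the real election is carried out by the controller):

```python
def elect_leader(isr, osr, unclean_enabled):
    """Sketch of leader choice after the current leader crashes."""
    if isr:
        return isr[0]            # prefer any surviving in-sync replica
    if unclean_enabled and osr:
        return osr[0]            # trade possible data loss for availability
    return None                  # ISR empty, unclean disabled: wait for
                                 # the old leader to recover
```

The None branch is exactly the reduced-availability case the text describes.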

13. What is Kafka's message format?

A Kafka Message consists of a fixed-length header and a variable-length message body (the body).
The header consists of a one-byte magic (file format) and a four-byte CRC32 (used to verify that the message body is intact).
When the magic value is 1, there is one extra byte of data between the magic and the crc32: attributes (storing properties such as whether the message is compressed and the compression format); when the magic value is 0, the attributes field does not exist.
The message body consists of N bytes and contains the concrete key/value message.
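Purely to make the layout in that description concrete (this mirrors the article's wording, not any particular version of Kafka's actual wire format), the header can be packed like this:

```python
import struct

def build_header(magic, crc32, attributes=None):
    """Pack a header as the text describes: 1-byte magic, then (only
    when magic == 1) a 1-byte attributes field, then a 4-byte CRC32.
    Big-endian layout is an assumption for illustration."""
    if magic == 1:
        if attributes is None:
            raise ValueError("magic 1 requires an attributes byte")
        return struct.pack(">BBI", magic, attributes, crc32)
    return struct.pack(">BI", magic, crc32)
```

So a magic-0 header is 5 bytes and a magic-1 header is 6 bytes, with the extra byte carrying compression flags.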

14. What is the concept of a consumer group in Kafka?

A consumer group is a logical concept and is the means by which Kafka implements both of its messaging models, unicast and broadcast. Data in the same topic is broadcast to different groups; within one group, only one worker can get each piece of data. In other words, for the same topic, every group can receive all of the same data, but once the data enters a group, only one worker in that group can consume it. The workers in a group may be implemented as multiple threads or processes, and those processes may be spread across multiple machines. The number of workers generally should not exceed the number of partitions, and it is best to keep the two in an integer-multiple relationship, because Kafka is designed so that one partition can be consumed by only one worker (within the same group).
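The "one partition per worker within a group" constraint can be illustrated with a toy round-robin assignment. Kafka's real assignors (range, round-robin, sticky) live in the client and handle rebalances; this sketch ignores all of that:

```python
def assign_round_robin(partitions, workers):
    """Assign each partition to exactly one worker of one group,
    dealing partitions out round-robin."""
    assignment = {w: [] for w in workers}
    for i, p in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(p)
    return assignment
```

With 4 partitions and 2 workers, each worker gets 2 partitions, which is why an integer-multiple relationship between the two counts keeps the load even.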

15. Can Kafka messages be lost or consumed repeatedly?

To determine whether Kafka messages can be lost or duplicated, consider two aspects: message sending and message consumption.
1. Message sending
Kafka sends messages in two modes: synchronous (sync) and asynchronous (async). The default is sync mode, configurable via the producer.type property. Kafka confirms message production via the request.required.acks property:
0 --- no acknowledgment that the message was received successfully;
1 --- confirmed once the Leader receives it successfully;
-1 --- confirmed once the Leader receives it and the Followers have replicated it successfully;
Combining these, there are six message-production situations. The message-loss scenarios are analyzed below:
(1) acks = 0: the Kafka cluster does not acknowledge receipt of the message; when the network is abnormal or the buffer is full, the message may be lost;
(2) acks = 1, synchronous mode: if the Leader confirms successful receipt but then crashes before the replicas have synchronized, data may be lost;
2. Message consumption
Kafka provides two interfaces for consuming messages, the Low-level API and the High-level API:

Low-level API: consumers maintain offsets and the like themselves, which gives complete control over Kafka;
 
 
High-level API: encapsulates the management of partitions and offsets; simple to use;
 
If the high-level interface (High-level API) is used, one possible problem is this: when a consumer takes a message from the cluster and commits the new offset, but crashes before it has had time to process the message, the message that was never successfully consumed will "strangely" disappear the next time consumption resumes;
Solutions:
For lost messages: in synchronous mode, set the acknowledgment mechanism to -1, so that a message counts as successfully sent only after it has been written to both the Leader and the Followers; in asynchronous mode, to guard against the buffer filling up, configure no blocking timeout in the configuration file so that the producer simply stays blocked when the buffer is full, rather than dropping messages;
For duplicate messages: save a unique identifier for each message in an external medium, and check at consumption time whether it has already been processed.
For duplicate consumption and its solutions, see: https://www.javazhiyin.com/22910.html
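The deduplication approach above (a unique id checked against an external store) can be sketched as follows; the in-memory set is only a stand-in for an external medium such as a database or Redis:

```python
processed_ids = set()  # stand-in for an external store (e.g. Redis)

def consume_once(message_id, payload, handler):
    """Idempotent-consumer sketch: skip a message whose unique id has
    already been recorded; otherwise handle it, then record the id."""
    if message_id in processed_ids:
        return False             # duplicate delivery: skipped
    handler(payload)             # business processing
    processed_ids.add(message_id)
    return True
```

Note the handler runs before the id is recorded, so a crash between the two still re-delivers; a real system would make the two steps atomic or make the handler itself idempotent.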

16. Why doesn't Kafka support read-write separation?

In Kafka, both the producer's message writes and the consumer's message reads interact with the leader replica, implementing a master-writes-master-reads model of production and consumption.
Kafka does not support master-writes-slave-reads, because that scheme has two obvious drawbacks:

(1) Data consistency. When data is transferred from the master node to a slave node, there is inevitably a delay window, and during this window the data on the master and slave nodes may be inconsistent. Suppose at some moment the value of entry A is X on both the master node and the slave node, and the master then updates A's value to Y. Before the change notification reaches the slave node, an application reading A from the slave will not get the latest value Y, producing a data inconsistency problem.
 
 
(2) Latency. In a component like Redis, a write must go through the stages network → master node memory → network → slave node memory before synchronization completes, and the whole process takes a certain amount of time. In Kafka, master-slave synchronization is even more time-consuming than in Redis: it must go through network → master node memory → master node disk → network → slave node memory → slave node disk. For latency-sensitive applications, master-writes-slave-reads is therefore not suitable.
 

17. How is message ordering reflected in Kafka?

Messages within each Kafka partition are ordered as they are written; at consumption time, each partition can be consumed by only one consumer in each group, which guarantees that consumption is ordered within the partition.
The topic as a whole is not guaranteed to be ordered. To make an entire topic ordered, the partition count must be set to 1.
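Per-key ordering within a partition is usually achieved by keyed partitioning, sketched below. The real Java client hashes keys with murmur2; Python's built-in hash() is only a stand-in for illustration:

```python
def partition_for(key, num_partitions):
    """Keyed partitioning sketch: all messages with the same key land
    in the same partition, so their relative order is preserved.
    hash() is a stand-in for the client's real hash (murmur2)."""
    return hash(key) % num_partitions
```

Because every message for "user-1" maps to the same partition, that user's events are consumed in the order they were produced, even though the topic as a whole is unordered.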

18. When committing the consumed offset, does the consumer commit the offset of the latest message it has consumed, or that offset + 1?

offset + 1

19. How does Kafka implement a delay queue?

Kafka does not use the JDK's own Timer or DelayQueue to implement its delay functionality; instead it implements the delay function on top of a custom timing-wheel timer (SystemTimer). The average time complexity of insert and delete operations in the JDK's Timer and DelayQueue is O(nlog(n)), which cannot meet Kafka's performance requirements, whereas a timing wheel reduces both insert and delete to O(1). The timing wheel is not unique to Kafka; it has many application scenarios, and traces of it can be found in components such as Netty, Akka, Quartz, and ZooKeeper.
The underlying implementation uses an array, where each array element can hold a TimerTaskList object. A TimerTaskList is a circular doubly linked list whose entries, TimerTaskEntry, wrap the real timer tasks (TimerTask).
So how does Kafka actually advance time? Kafka's timer borrows the JDK's DelayQueue to help advance the timing wheel: every TimerTaskList in use is added to the DelayQueue. The TimingWheel in Kafka is dedicated to inserting and deleting TimerTaskEntry objects, while the DelayQueue is dedicated to the task of advancing time. Imagine the first expiring task list in the DelayQueue times out at 200 ms and the second at 840 ms: fetching the head of the DelayQueue takes only O(1). If instead time were advanced tick by tick at a fixed rate, then 199 of the 200 advances made before reaching the first expiring task list would be "empty advances", and reaching the second would require another 639 "empty advances", wasting the machine's performance resources for nothing. Here the DelayQueue assists with a small space-for-time trade-off, achieving "precise advancement". Kafka's timer thus "knows whom to employ": the TimingWheel does what it is best at, inserting and deleting tasks, while the DelayQueue does what it is best at, advancing time; the two complement each other.
Reference: https://blog.csdn.net/u013256816/article/details/80697456
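A minimal single-level timing wheel in Python, to illustrate the O(1) bucket insert; it deliberately omits the overflow wheels and the DelayQueue-driven advancement that Kafka's SystemTimer actually uses:

```python
class TimingWheel:
    """Toy single-level timing wheel: O(1) insert into a bucket,
    one-tick-at-a-time advancement."""

    def __init__(self, tick_ms, wheel_size):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.current_ms = 0
        self.buckets = [[] for _ in range(wheel_size)]

    def add(self, delay_ms, task):
        """Drop the task into the bucket for its expiry tick: O(1)."""
        ticks = delay_ms // self.tick_ms
        if ticks >= self.wheel_size:
            raise ValueError("delay exceeds wheel span; needs an overflow wheel")
        slot = (self.current_ms // self.tick_ms + ticks) % self.wheel_size
        self.buckets[slot].append(task)

    def advance(self):
        """Move one tick forward and return the tasks that expired."""
        self.current_ms += self.tick_ms
        slot = (self.current_ms // self.tick_ms) % self.wheel_size
        expired, self.buckets[slot] = self.buckets[slot], []
        return expired
```

Advancing tick by tick like this is exactly the "empty advance" problem described above, which is why Kafka drives advancement from a DelayQueue of non-empty buckets instead.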

20. How are Kafka transactions implemented?

Reference: https://blog.csdn.net/u013256816/article/details/89135417

21. Where are elections needed in Kafka? What election strategies do these places use?

https://blog.csdn.net/yanshu2012/article/details/54894629
----------------
Disclaimer: this article is an original article by CSDN blogger "Xu Week", licensed under the CC 4.0 BY-SA copyright agreement; for reprints, please attach the original source link and this statement.
Original link: https://blog.csdn.net/qq_28900249/article/details/90346599
