Kafka knowledge points: this one article is enough

1. Kafka concepts

Kafka is a distributed publish-subscribe messaging system: a distributed, partitioned, replicated, and persistent log service, used mainly to process streaming data. It consists of four components: topics, producers, consumers, and brokers (which manage the storage of messages in topics).

2. The role of message queues

(1) Buffering and peak clipping

When there is a burst of upstream data, the downstream may not be able to handle it, or may not have enough machines to guarantee redundancy. Kafka can act as a buffer in the middle: messages are temporarily stored in Kafka, and the downstream service processes them at its own pace.

(2) Decoupling and scalability

At the beginning of a project, the specific requirements cannot be determined. The message queue can serve as an interface layer that decouples important business processes: as long as both sides adhere to the agreed message contract and program against the data, the system gains the ability to extend.

(3) Redundancy

A one-to-many model can be used: a producer publishes a message once, and multiple services subscribed to the topic can each consume it, serving multiple unrelated businesses.

(4) Robustness

The message queue can accumulate requests, so even if a consumer service goes down for a short time, the normal operation of the main business is not affected.

(5) Asynchronous communication

In many cases the user neither wants nor needs to process a message immediately. The message queue provides an asynchronous processing mechanism: messages can be placed in the queue without being processed right away, queued in whatever quantity is needed, and processed when required.

3. Advantages of Kafka over traditional messaging technology

There are two traditional message delivery models:
1. Queuing: a pool of consumers reads messages from the server, and each message is delivered to exactly one of them.
2. Publish-subscribe: messages are broadcast to all consumers.
Compared with traditional messaging technology, Apache Kafka has the following advantages:

(1) Fast

A single Kafka broker can serve thousands of clients, handling hundreds of megabytes of reads and writes per second.

(2) Scalable

Data is partitioned and spread across a cluster of machines, allowing the system to handle data sets larger than any single machine can hold.

(3) Durability

Messages are persisted to disk and replicated within the cluster to prevent data loss.
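
As an illustration, here is a minimal sketch of creating a replicated topic with Kafka's Java AdminClient; the broker address, topic name, partition count, and replication factor are assumptions for the example:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each kept on 3 brokers: a fully replicated message
            // survives the loss of up to 2 brokers.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```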

(4) Design

Its distributed design provides fault-tolerance guarantees and durability.

4. Why is Kafka so fast?

Sequential reads and writes, zero copy, message compression, and batch sending.
Kafka uses zero copy to move data quickly and to avoid unnecessary switching between kernel and user space. It can send records in batches all the way from the producer, through the file system (the Kafka topic log), to the consumer, so these batches flow end-to-end. Batching enables more effective data compression and reduces I/O latency. Kafka also writes to disk sequentially, avoiding the waste of random disk addressing.
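
To make this concrete, here is a minimal producer sketch that turns on the batching and compression described above; the broker address, topic name, and tuning values are illustrative assumptions, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FastProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Batch up to 32 KB of records per partition before sending.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
        // Wait up to 10 ms for a batch to fill, trading a little latency for throughput.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Compress each batch; lz4 is a common low-overhead choice.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```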

5. Do Kafka messages use a pull model or a push model?

Kafka adopts the traditional pull model, so consumers can choose fetching strategies that match their own consumption capacity.
Disadvantage: if the broker has no messages available for consumption, the consumer keeps polling in a loop until new messages arrive. To avoid this, Kafka provides a parameter that lets the consumer block until new messages arrive or a timeout expires.
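
A minimal sketch of the pull model, assuming a local broker and a hypothetical topic my-topic; the timeout passed to poll() makes the consumer block briefly instead of spinning (on the broker side, fetch.max.wait.ms plays a similar role):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // poll() blocks for up to 1 second when no records are
                // available, instead of spinning in a tight loop.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```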

6. The role of the broker in Kafka

The broker is the message agent: producers write messages to a specified topic on the brokers, and consumers pull messages for that topic from the brokers before performing their business processing. The broker acts as a relay station that stores messages in between.

7. The role of ZooKeeper in Kafka

Early versions of Kafka used ZooKeeper to store metadata, consumer consumption state, group membership, and offsets.
The role of ZooKeeper is gradually weakening in newer versions: the new consumer uses Kafka's internal group coordination protocol, which reduces the dependency on ZooKeeper, but the broker still relies on it. Kafka also uses ZooKeeper to elect the controller and to detect whether brokers are alive.

8. How does Kafka determine whether a node is alive?

1. It must maintain its connection with ZooKeeper, which checks each node's connection through a heartbeat mechanism.
2. If the node is a follower, it must replicate the leader's writes in a timely manner, and its lag must not grow too large.

9. Consumer groups in Kafka

The consumer group is Kafka's means of supporting both unicast and broadcast message models. Data for a topic is broadcast to all groups, but within a single group only one consumer receives any given message.
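
A sketch of the idea, with assumed broker address, topic, and group names: consumers sharing a group.id split the topic's partitions between them, while a consumer in a different group independently receives everything:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupSemantics {
    // Builds a consumer subscribed to "my-topic" in the given group
    // (broker address and topic name are assumed).
    static KafkaConsumer<String, String> inGroup(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));
        return consumer;
    }

    public static void main(String[] args) {
        // Same group: once both start polling, the topic's partitions are
        // split between them, so each message reaches only one (unicast).
        KafkaConsumer<String, String> worker1 = inGroup("billing");
        KafkaConsumer<String, String> worker2 = inGroup("billing");
        // Different group: independently receives every message (broadcast).
        KafkaConsumer<String, String> auditor = inGroup("audit");
    }
}
```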

10. How does Kafka guarantee message ordering?

Messages within each Kafka partition are ordered as they are written. During consumption, each partition is consumed by only one consumer per group, which keeps consumption ordered as well. The topic as a whole is not guaranteed to be ordered; to make an entire topic ordered, set its partition count to 1.
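
As a sketch of a common middle ground (not from the original text): instead of shrinking the topic to one partition, messages that must stay ordered relative to each other can be sent with the same key, since the default partitioner maps equal keys to the same partition. The topic and key names here are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedByKey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records with key "order-42" hash to the same partition,
            // so they are stored, and later consumed, in send order.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.send(new ProducerRecord<>("orders", "order-42", "paid"));
            producer.send(new ProducerRecord<>("orders", "order-42", "shipped"));
        }
    }
}
```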

11. When a consumer commits its consumed offset, does it commit the offset of the latest message consumed, or offset+1?

offset+1
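
A minimal sketch of a manual commit showing the +1, assuming the consumer was created with enable.auto.commit=false and otherwise configured as in the earlier sketches:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommit {
    // Assumes the consumer was created with enable.auto.commit=false.
    static void pollAndCommit(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            // ... process the record ...
            // Commit the position of the NEXT message to read: offset + 1.
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1)));
        }
    }
}
```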

12. The ISR and AR concepts

ISR: In-Sync Replicas, the set of replicas in sync with the leader.
AR: Assigned Replicas, all replicas of a partition.
The ISR is maintained by the leader. Followers always lag somewhat when synchronizing data from the leader; any follower whose lag exceeds the threshold is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list, and newly added followers also start in the OSR. AR = ISR + OSR.
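
For illustration, the AR and ISR of a topic's partitions can be inspected with the Java AdminClient; the broker address and topic name are assumed:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class InspectIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic");
            // replicas() is the AR; isr() is the subset currently in sync.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s AR=%s ISR=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```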

13. How is churn in the ISR reduced, and when does a replica leave the ISR?

The ISR is the set of replicas that are fully synchronized with the leader, which means every committed message is contained in the ISR. Until a real failure occurs, the ISR should always contain all replicas. If a replica falls behind the leader, it is removed from the ISR.
If a replica stays out of the ISR for a long time, it indicates that the follower cannot fetch data as fast as the leader accumulates it.
If the preferred replica is not in the ISR, the controller cannot transfer leadership to it.

14. How do Kafka followers synchronize data from the leader?

Kafka uses the ISR mechanism to balance data safety against throughput in a well-measured way. Followers replicate data from the leader in batches, and the leader makes full use of sequential disk reads and the sendfile (zero copy) mechanism, which greatly improves replication performance. Writes to disk are batched internally, which greatly reduces the message gap between followers and the leader.

15. Under what circumstances will a replica be kicked out of the ISR?

The leader maintains a list of replicas that are roughly in sync with it, called the ISR (In-Sync Replicas). Each partition has its own ISR, dynamically maintained by the leader. If a follower lags too far behind the leader, or does not issue a data replication request for a certain period (controlled by the broker setting replica.lag.time.max.ms), the leader removes it from the ISR.

16. The acks mechanism when a Kafka producer sends data

acks: how many acknowledgments the producer must receive before a send counts as successful.
0 means the producer does not wait for any confirmation from the leader (highest throughput, worst data reliability).
1 means the leader must confirm the write to its local log, then acknowledges immediately.
-1 / all means every replica in the ISR must complete the write before the acknowledgment (lowest throughput, highest data reliability).
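
A minimal sketch with acks=all, plus a send callback to observe the acknowledgment; the broker address, topic, and retry count are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" (equivalent to -1): wait until every in-sync replica has the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 3); // retry transient failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // send failed after retries
                        } else {
                            System.out.println("acked at offset " + metadata.offset());
                        }
                    });
        }
    }
}
```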

17. Leader election process

The election mechanism within a ZooKeeper cluster itself uses the ZAB protocol (a Paxos-like algorithm), in which nodes vote for a leader by exchanging messages, but leader election in Kafka is not nearly as complicated.
Kafka's leader election works by creating an ephemeral /controller node on ZooKeeper and writing the current broker's information into it, e.g. {"version":1,"brokerid":1,"timestamp":"1512018424988"}. Because of ZooKeeper's strong consistency, only one client can successfully create the node; the broker that succeeds becomes the leader, on a first-come-first-served basis. This leader is the cluster's controller and is responsible for all cluster-management work, large and small. When the leader loses its connection to ZooKeeper, the ephemeral node is deleted; the other brokers watch this node, receive the deletion event, and re-initiate the leader election.
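
Purely for illustration (applications normally never touch this node), here is a sketch using the raw ZooKeeper Java client to read and watch /controller; the ensemble address and session timeout are assumed:

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ControllerWatch {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble address and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
        // The broker that managed to create this ephemeral node is the controller.
        byte[] data = zk.getData("/controller", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                // The controller's session ended: surviving brokers now race
                // to re-create /controller and elect a new controller.
                System.out.println("/controller deleted, re-election begins");
            }
        }, null);
        System.out.println(new String(data)); // e.g. {"version":1,"brokerid":1,...}
        zk.close();
    }
}
```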

18. Under what circumstances will Kafka rebalance?

1. A new consumer joins the group
2. An existing consumer goes down
3. The coordinator goes down and the cluster elects a new coordinator
4. New partitions are added to a subscribed topic
5. A consumer calls unsubscribe() to cancel its topic subscription
When a rebalance occurs, all consumer instances in the group are coordinated to participate together, and Kafka guarantees the fairest assignment it can. However, the rebalance process has a serious impact on the consumer group: while it runs, every consumer instance in the group stops working and waits for the rebalance to complete. A sketch of reacting to rebalances from the consumer side follows.
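
A sketch of how a consumer can react to a rebalance with a ConsumerRebalanceListener, committing offsets before its partitions are revoked to limit repeated consumption (consumer setup as in the earlier sketches):

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceAware {
    // Assumes 'consumer' was configured as in the earlier sketches.
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("my-topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Called before partitions are taken away: a good place to
                        // commit offsets so the next owner does not re-consume.
                        consumer.commitSync();
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        System.out.println("assigned: " + partitions);
                    }
                });
    }
}
```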

19. What is the impact of a rebalance?

Rebalancing itself is a protective mechanism of the Kafka cluster, used to evict consumers that cannot consume or consume too slowly. With a large data volume, and with post-consumption writes that require network I/O, slow third-party dependencies can easily cause consumers to time out. The main impacts of a rebalance on data are as follows:
1. Repeated consumption: data that was already consumed may be consumed again because the offset commit failed; when the partition is assigned to another consumer, it re-consumes those messages, duplicating data and increasing cluster pressure.
2. A rebalance spreads to every consumer in the ConsumerGroup: because one consumer left, the whole group rebalances and takes a relatively long time to reach a stable state again, so the impact is wide.
3. Frequent rebalances actually reduce consumption speed, since most of the time is spent on repeated consumption and rebalancing.
4. Data that is not consumed in time accumulates as lag, and once Kafka's retention period (TTL) expires, the data is discarded.

20. Optimizing the Kafka producer's write speed

1. Increase the number of threads
2. Increase batch.size
3. Add more producer instances
4. Increase the number of partitions
5. When acks=-1 and latency increases: raise num.replica.fetchers (the number of threads a follower uses to sync data from the leader)
6. For cross-data-center transmission: increase the socket buffer settings and the OS TCP buffer settings (see the configuration sketch at the end of the next section).

21. Other

1. Improving throughput for remote users: if the client is located in a different data center from the broker, the socket buffer size may need to be increased to amortize the long network latency.
2. Kafka configuration for improving throughput: batch.size defaults to 16384 bytes as the maximum size of a single batch; once a batch reaches this size, it is sent immediately. linger.ms defaults to 0 ms; once this time has elapsed, the batch is sent immediately. A batch is sent as soon as either of these two conditions is met; otherwise it waits. A combined tuning sketch follows.
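
A configuration-only sketch combining these knobs; the values are illustrative, and the OS-level TCP buffer limits must still be raised separately at the operating-system level:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class RemoteProducerTuning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.remote-dc.example:9092"); // assumed address
        // A batch is sent as soon as EITHER condition is met:
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // ...it reaches 16 KB (the default), or
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);    // ...100 ms have elapsed (default is 0).
        // Larger socket buffers help amortize a long network round trip
        // between data centers; -1 would use the OS default.
        props.put(ProducerConfig.SEND_BUFFER_CONFIG, 1024 * 1024);    // 1 MB, illustrative
        props.put(ProducerConfig.RECEIVE_BUFFER_CONFIG, 1024 * 1024); // 1 MB, illustrative
        // ...build a KafkaProducer from these props as in the earlier sketches.
    }
}
```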

Source: blog.csdn.net/wh672843916/article/details/115055174