Kafka knowledge points (Part 2)

Fun fact:
Apples and milk are said to be good for memory.
Eat an apple or drink a glass of milk, wait about 10 minutes to digest, then go over these knowledge points and your efficiency will double.
(Personally tested.)

What is Apache Kafka?

A publish-subscribe messaging system; it is a distributed, partitioned, and replicated log service.

What is the traditional messaging method?

There are two traditional messaging models:

  • Queue: a group of consumers reads messages from the server, and each message is delivered to only one of them.
  • Publish-subscribe: messages are broadcast to all consumers.

What are the advantages of Kafka over traditional messaging methods?

  • High performance: a single Kafka broker can serve thousands of clients and handle megabytes of reads and writes per second; Kafka's performance far exceeds traditional brokers such as ActiveMQ and RabbitMQ, and Kafka supports batch operations;
  • Scalability: a Kafka cluster can be expanded transparently by adding new servers to the cluster;
  • Fault tolerance: the data of each partition is replicated to several servers; when a broker fails, ZooKeeper notifies producers and consumers to use other brokers.

How does Kafka ensure the orderliness of messages?

Messages within each Kafka partition are ordered as they are written, and a single partition can only be consumed by one consumer within a group, which guarantees the order of messages. The order of messages across partitions is not guaranteed.

What are ISR, OSR, AR?

ISR: In-Sync Replicas, the replica synchronization queue.
OSR: Out-of-Sync Replicas.
AR: Assigned Replicas, all replicas.

The ISR is maintained by the leader. Followers synchronize data from the leader with some lag (see the article on Kafka's replica replication mechanism for details); if a follower's lag exceeds the corresponding threshold, it is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list. Newly added followers are also placed in the OSR first. AR = ISR + OSR.

What do LEO, HW, LSO, LW, etc. stand for?

LEO: short for LogEndOffset; it marks the offset of the next message to be written in the current log file.
HW: short for High Watermark. The term watermark is usually used in stream processing (e.g. Apache Flink, Apache Spark) to characterize the progress of elements or events along a time dimension. In Kafka, the watermark has nothing to do with time; strictly speaking, it is positional information, i.e. an offset. The HW of a partition is the smallest LEO among the replicas in its ISR, and consumers can consume messages only up to the HW.
LSO: short for LastStableOffset. When there is an unfinished transaction, the LSO equals the offset of the first message of that transaction (firstUnstableOffset); when all transactions are completed, its value is the same as the HW.
LW: Low Watermark, the smallest logStartOffset among the replicas in the AR set.

How many kinds of data transfer transactions are there?

The transaction semantics of data transmission usually come in three levels:
At most once: messages are never delivered more than once; a message is transmitted at most once, but may not be transmitted at all.
At least once: messages are never lost; a message is transmitted at least once, but may be transmitted more than once.
Exactly once: neither lost nor duplicated; every message is transmitted and received exactly once.

Can Kafka consumers consume messages from specified partitions?

When a Kafka consumer consumes messages, it sends a fetch request to the broker for a specific partition. The consumer specifies the offset in the log from which to read, and then consumes messages starting from that position. Since the consumer controls the offset, it can roll back and re-consume earlier messages, which is very useful.
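As a rough illustration, the sketch below uses the Java consumer API to assign a specific partition and seek to an explicit offset before polling; the broker address localhost:9092, the topic my-topic and the offset 42 are placeholder assumptions, not values from the original article.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PartitionSeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign one specific partition instead of subscribing to the whole topic
            TopicPartition tp = new TopicPartition("my-topic", 0);     // placeholder topic
            consumer.assign(Collections.singletonList(tp));

            // The consumer controls the offset: seek back to an earlier position to re-consume
            consumer.seek(tp, 42L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```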

Does Kafka message use Pull mode or Push mode?

The question Kafka initially had to consider was whether consumers should pull messages from brokers or brokers should push messages to consumers, i.e. pull vs. push. Here Kafka follows the traditional design shared by most messaging systems: producers push messages to brokers, and consumers pull messages from brokers.
Some messaging systems, such as Scribe and Apache Flume, use the push mode to push messages to downstream consumers. This has advantages and disadvantages: the broker determines the push rate, which is hard to handle for consumers with different consumption rates. Messaging systems aim to let consumers consume messages at the highest possible rate; unfortunately, in push mode, when the broker's push rate is much higher than the consumer's consumption rate, the consumer may crash. In the end, Kafka chose the traditional pull mode.
Another advantage of the pull mode is that the consumer can decide for itself whether to pull data from brokers in batches. The push mode must decide, without knowing the consumption capacity and strategy of downstream consumers, whether to push each message immediately or to buffer and push in batches. If a lower push rate is used to avoid crashing consumers, it may push only a few messages at a time and waste capacity. In pull mode, consumers can set these policies according to their own consumption capacity. A disadvantage of pull is that if the broker has no messages available, the consumer keeps polling in a loop until new messages arrive. To avoid this, Kafka provides parameters that let the consumer block until new messages arrive (or block until the number of messages reaches a certain amount so they can be sent in a batch).
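For illustration, a minimal consumer-side sketch of such blocking: fetch.min.bytes and fetch.max.wait.ms are standard consumer settings, while the broker address, group id, topic name, and the chosen values are placeholder assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LongPollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "demo-group");                 // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // The broker holds a fetch request until at least fetch.min.bytes of data is
        // available or fetch.max.wait.ms has elapsed, so an idle consumer blocks on
        // the broker side instead of busy-polling.
        props.put("fetch.min.bytes", "65536");
        props.put("fetch.max.wait.ms", "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value()));
            }
        }
    }
}
```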

What are the design features of Kafka's efficient file storage?

1) Kafka splits a topic's large partition file into multiple small segment files; with small segments it is easy to periodically clean up or delete already-consumed files, reducing disk usage.
2) Index information makes it possible to locate a message quickly and to determine the maximum size of a response.
3) Mapping all index metadata into memory avoids disk IO on the segment index files.
4) Sparse storage of index files greatly reduces the space occupied by index file metadata.

How to place partitions in different Brokers when Kafka creates topics?

1) The replication factor cannot be greater than the number of brokers;
2) The first replica of the first partition (partition 0) is placed on a broker chosen randomly from the brokerList;
3) The first replicas of the other partitions are placed by moving backward from that of partition 0 in turn. That is, with 5 brokers and 5 partitions, if the first partition is placed on the fourth broker, then the second partition is placed on the fifth broker, the third partition on the first broker, the fourth partition on the second broker, and so on;
4) The placement of the remaining replicas relative to the first replica is determined by nextReplicaShift, which is also randomly generated.

In which directory will Kafka create new partitions?

We know that before starting a Kafka cluster we need to configure the log.dirs parameter, whose value is the directory where Kafka data is stored. This parameter can be set to multiple directories separated by commas; usually these directories are spread across different disks to improve read and write performance. We can also configure the log.dir parameter, which has the same meaning; only one of the two needs to be set.
If only one directory is configured in log.dirs, then every partition assigned to this broker creates its folder in that directory to store data.

But if multiple directories are configured in log.dirs, in which directory will Kafka create the partition directory? The answer: Kafka creates the new partition directory in the directory that currently contains the fewest partition directories, and the directory name is topic name + partition ID. Note that it is the directory with the fewest partition folders in total, not the directory with the least disk usage! In other words, if you add a new disk to log.dirs, new partition directories will be created on the new disk first, until it no longer has the fewest partition directories.

Talk about Kafka's rebalancing

In Kafka, when a new consumer joins a group or the set of subscribed topics changes, the Rebalance mechanism is triggered (rebalance: within the same consumer group, ownership of partitions is transferred from one consumer to another). Rebalance, as the name implies, rebalances consumption across consumers. The process is as follows:
Step 1: all members send a request to the coordinator asking to join the group. Once all members have sent the request, the coordinator chooses one consumer to act as the leader and sends the group membership information and subscription information to the leader.
Step 2: the leader works out the assignment plan, specifying which consumer is responsible for which partitions of which topics. Once the assignment is complete, the leader sends the plan to the coordinator, and the coordinator forwards it to each consumer, so every member of the group knows which partitions it should consume.
So for Rebalance, Coordinator plays a vital role.
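Application code does not drive the rebalance itself, but it can observe it through the Java consumer's ConsumerRebalanceListener. A minimal sketch follows; the broker address, group id and topic name are placeholder assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class RebalanceObserver {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "demo-group");                 // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before partitions are taken away; commit offsets here if committing manually
                    System.out.println("Revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called once the coordinator has distributed the leader's assignment plan
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }
}
```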

How does Kafka achieve high throughput?

Kafka is a distributed messaging system that needs to handle massive volumes of messages. Kafka is designed to write all messages to slow, high-capacity hard disks in exchange for greater storage capacity; in practice, using hard disks does not cause much performance loss. Kafka mainly achieves its very high throughput in the following ways (a producer-side sketch of batching and compression follows the list):
1) Sequential read and write
2) Zero copy
3) File segmentation
4) Batch sending
5) Data compression
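On the producer side, batch sending and compression boil down to a few settings. A minimal sketch, assuming a placeholder broker and topic; batch.size, linger.ms and compression.type are standard producer configs, and the values shown are illustrative only.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Batch sending: accumulate up to 32 KB per partition, or wait up to 10 ms
        props.put("batch.size", "32768");
        props.put("linger.ms", "10");
        // Data compression: compress whole batches before they go over the wire
        props.put("compression.type", "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i));
            }
            producer.flush();
        }
    }
}
```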

The disadvantages of Kafka?

1) Because it is sent in batches, the data is not really real-time;
2) It does not support the MQTT protocol;
3) It does not support direct ingestion of IoT sensor data;
4) It only guarantees message ordering within a single partition and cannot provide global message ordering;
5) The monitoring is not perfect and plug-ins need to be installed;
6) Relying on zookeeper for metadata management.

What is the difference between Kafka's new and old consumers?

The old Kafka consumer APIs are mainly SimpleConsumer (the "simple", low-level consumer) and ZookeeperConsumerConnector (the high-level consumer). SimpleConsumer sounds simple, but using it is not: you can use it to read messages from specific partitions and offsets. The high-level consumer is somewhat similar to the current new consumer, with consumer groups and partition rebalancing, but it uses ZooKeeper to manage consumer groups and does not give you control over offsets and rebalancing.
The new consumer supports both behaviors at the same time, so there is no longer a reason to use the old consumer APIs.

Can the number of Kafka partitions be increased or decreased? why?

We can use the bin/kafka-topics.sh command to increase the number of partitions of a Kafka topic, but Kafka does not support decreasing the number of partitions. There are many reasons why reducing partitions is not supported; for example, where should the data of the removed partitions go? Delete it or keep it? If it is deleted, the not-yet-consumed messages in it are lost. If it is kept, how should those messages be placed in other partitions? Appending them to other partitions would break the per-partition ordering guarantee, and inserting them into other partitions while preserving order would be very complicated to implement.
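Besides the shell script, the same increase-only behavior is visible through the Java AdminClient. A minimal sketch, assuming a placeholder broker and topic; asking for fewer partitions than the topic already has simply fails.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Collections;
import java.util.Properties;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "my-topic" (placeholder) to 6 partitions; shrinking is not supported by Kafka
            admin.createPartitions(
                    Collections.singletonMap("my-topic", NewPartitions.increaseTo(6)))
                 .all()
                 .get();
        }
    }
}
```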

What are the characteristics of Kafka?

High throughput, low latency: Kafka can process hundreds of thousands of messages per second with latency as low as a few milliseconds. Each topic can be divided into multiple partitions, and a consumer group can consume the partitions in parallel.
Scalability: a Kafka cluster supports hot expansion.
Durability and reliability: messages are persisted to local disk, and data replication prevents data loss.
Fault tolerance: nodes in the cluster are allowed to fail (with a replication factor of n, up to n-1 node failures are tolerated).
High concurrency: thousands of clients can read and write at the same time.

Please briefly describe the scenarios in which you would choose Kafka.

Log collection: a company can use Kafka to collect logs from various services and expose them through Kafka as a unified interface to various consumers, such as Hadoop, HBase, Solr, etc.
Messaging system: decoupling producers and consumers, buffering messages, etc.
User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These activities are published by the servers to Kafka topics, and subscribers then consume these topics for real-time monitoring and analysis, or load them into Hadoop or a data warehouse for offline analysis and mining.
Operational metrics: Kafka is also often used to record operational monitoring data, including collecting metrics from various distributed applications and producing centralized feeds for operations such as alerting and reporting.
Stream processing: for example with Spark Streaming and Flink.

The purpose of Kafka partitioning?

For a Kafka cluster, the benefit of partitioning is load balancing across brokers; for consumers, partitions improve concurrency and efficiency.

What is ZooKeeper in Kafka? Can Kafka run independently without ZooKeeper?

ZooKeeper is an open-source, high-performance coordination service used by Kafka as a distributed application.
No, it is not possible to bypass ZooKeeper and contact a Kafka broker directly, and once ZooKeeper stops working, Kafka cannot serve client requests. ZooKeeper is mainly used for communication between the different nodes in the cluster. In Kafka it is also used to commit offsets, so if a node fails the consumer can resume from the previously committed offset. In addition it performs other activities such as leader detection, distributed synchronization, configuration management, detecting when a node joins or leaves the cluster, tracking the real-time status of nodes, and so on.

Explain how messages are delivered to consumers in Kafka?

Message transfer in Kafka is done using the sendfile API, which moves bytes from the file on disk to the socket directly in kernel space, saving the extra copy between kernel space and user space and the associated calls back into the kernel.

Explain how to improve the throughput of a remote consumer?

If the consumer is located in a data center different from the broker's, you may need to tune the socket buffer size to amortize the long network latency.
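A minimal sketch of such tuning on the consumer side: receive.buffer.bytes and send.buffer.bytes are standard client settings, while the broker address, group id and buffer sizes shown are placeholder assumptions.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class RemoteConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "remote-dc-broker:9092");   // placeholder remote broker
        props.put("group.id", "remote-group");                      // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Larger TCP buffers help amortize the high latency of a cross-datacenter link;
        // a value of -1 falls back to the operating system default.
        props.put("receive.buffer.bytes", String.valueOf(1024 * 1024));   // 1 MB
        props.put("send.buffer.bytes", String.valueOf(512 * 1024));       // 512 KB

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.close();
    }
}
```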

Explain how to reduce churn in the ISR. When does a broker leave the ISR?

The ISR is the set of message replicas that are completely synchronized with the leader, which means the ISR contains all committed messages. The ISR should always contain all replicas until a real failure occurs. A replica is removed from the ISR if it falls behind the leader.

Why does Kafka need to replicate?

Kafka's message replication ensures that any published message is not lost and can still be consumed in the event of machine failures, program errors, or the more common case of software upgrades.

What does it mean if a replica stays out of the ISR for a long time?

If a replica stays out of the ISR for a long period of time, it indicates that the follower cannot fetch data as quickly as the leader accumulates it.

Please explain what happens if the preferred copy is not in the ISR?

If the preferred replica is not in the ISR, the controller will not be able to transfer the leadership to the preferred replica.

Is it possible to get the message offset after producing?

In most queuing systems, the producing side cannot do this; its role is to fire and forget the message, and the broker does the rest, such as proper metadata handling with ids, offsets, etc.
As a consumer of messages, you can get the offset from the Kafka broker. If you look at the SimpleConsumer class, you will notice that it returns a MultiFetchResponse object that includes offsets as a list. In addition, when you iterate over Kafka messages, you get MessageAndOffset objects that include both the offset and the message content.

Please explain Kafka's delivery guarantee mechanism and how to implement it?

Kafka supports three message-delivery semantics:
① At most once: messages may be lost but are never delivered more than once.
② At least once: messages are never lost but may be delivered more than once.
③ Exactly once: every message is delivered once and only once. In many cases this is what the user wants.
On the consumer side: after reading a message from the broker, the consumer can choose to commit. This operation saves, in ZooKeeper, the offset that the consumer has read in that partition, and the next time it reads that partition it starts from the next offset. If it does not commit, the next read starts from the same position as after the previous commit.
The consumer can also be set to autocommit, i.e. as soon as it reads data it commits automatically. If we only consider the process of reading messages, Kafka provides Exactly once. In practice, however, the consumer does not stop after reading the data; it goes on to process it, and the order of data processing and committing largely determines the delivery semantics between the broker and the consumer.
Commit after reading the message, then process it: in this mode, if the consumer crashes after the commit but before it has processed the message, the unprocessed message cannot be read again after the restart, which corresponds to At most once.
Read the message, process it first, then commit the consumption state (save the offset): in this mode, if the consumer crashes after processing the message but before committing, the not-yet-committed message is processed again after the restart even though it has already been processed, which corresponds to At least once.
If Exactly once is required, the offset and the actual output must be coordinated. The classic approach is two-phase commit, but since many output systems do not support it, a more general way is to store the offset and the output in the same place. For example, if the consumer writes its data to HDFS, writing the latest offset together with the data to HDFS ensures that the data output and the offset update either both complete or both do not, which indirectly achieves Exactly once. (With the high-level API the offset is stored in ZooKeeper and cannot be stored in HDFS, whereas the low-level API maintains the offset itself and can store it in HDFS.)

In short, Kafka guarantees At least once by default; At most once can be achieved by setting the producer to send asynchronously without waiting for acknowledgement, while Exactly once requires cooperation with the target storage system, and the offsets Kafka exposes make this straightforward to implement.
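A minimal sketch of the process-then-commit pattern (At least once) using the current Java consumer, which commits offsets to Kafka rather than ZooKeeper; the broker address, group id and topic name are placeholder assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker
        props.put("group.id", "at-least-once-group");         // placeholder group
        props.put("enable.auto.commit", "false");              // commit manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);            // process first...
                }
                consumer.commitSync();          // ...then commit: At least once
                // Committing before processing instead would correspond to At most once.
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```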

How to ensure the order of Kafka's messages

Kafka does not impose strict requirements on message duplication, loss, errors, or ordering.
Kafka only guarantees that the messages within one partition are ordered when consumed by a consumer; at the topic level, once there are multiple partitions, messages are not globally ordered.

Kafka's data loss problem, and how to guard against it?

1) Producer-side data loss:
With acks=1 (only the write to the leader is confirmed), data is lost if the leader crashes right after acknowledging. With acks=0, or in asynchronous send mode, Kafka gives no guarantee and messages may be lost.
2) How the broker side avoids loss (a producer configuration sketch follows this answer):
acks=all: the write is confirmed only after all in-sync replicas have written successfully.
retries = a reasonable value.
min.insync.replicas=2: a message must be written to at least this many replicas to be considered successful.
unclean.leader.election.enable=false: turn off unclean leader election, i.e. replicas outside the ISR may not be elected leader, which avoids data loss.
3) How the consumer side avoids loss:
If the offset is committed before message processing is finished, data may be lost.
enable.auto.commit=false: turn off automatic offset commits and commit manually after the data has been processed.
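A minimal producer-side sketch of the settings listed above: acks and retries are client settings, while min.insync.replicas and unclean.leader.election.enable live in the broker/topic configuration; the broker address and topic are placeholder assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class NoLossProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");        // wait for all in-sync replicas to acknowledge
        props.put("retries", "3");       // a reasonable retry count
        // min.insync.replicas=2 and unclean.leader.election.enable=false are broker/topic
        // settings and are configured on the server side, not here.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();   // handle failed sends explicitly
                        }
                    });
            producer.flush();
        }
    }
}
```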

How does Kafka balance load?

Producers publish data to the topics of their choice. The producer can choose which partition within the topic each message goes to. This can be done in a round-robin way simply to balance load, or according to some semantic partitioning function (for example, based on a key in the message). More on the use of partitioning below.

How do Kafka consumers fetch data?

The consumer uses the pull mode to read data from the broker.
The push model has difficulty adapting to consumers with different consumption rates, because the sending rate is determined by the broker. The broker's goal is to deliver messages as fast as possible, but this easily leaves consumers unable to keep up, with denial of service and network congestion as typical symptoms. The pull mode lets a consumer fetch messages at a rate suited to its own consumption capacity.
For Kafka, the pull mode is more suitable. It simplifies the broker's design; the consumer controls the rate at which it consumes messages and chooses its own consumption mode, batch or one-by-one, and by choosing different commit methods it can realize different delivery semantics.
The downside of the pull mode is that if Kafka has no data, consumers may spin in a loop waiting for data to arrive. To avoid this, pull requests have parameters that allow the consumer to block in a "long poll" until data arrives.

Kafka's ISR replica synchronization queue

ISR (In-Sync Replicas), the replica synchronization queue. The ISR contains the leader and the followers in sync with it. If the leader process crashes, a new leader is elected from the ISR queue. Two parameters, replica.lag.max.messages (the maximum number of lagging messages) and replica.lag.time.max.ms (the maximum lag time), determine whether a replica may be part of the ISR. The replica.lag.max.messages parameter was removed in version 0.10 to prevent replicas from moving in and out of the ISR too frequently.

If either threshold is exceeded, the follower is removed from the ISR and placed in the OSR (Out-of-Sync Replicas) list. Newly added followers are also placed in the OSR first.

Kafka's 4 partitioning strategies

The first partitioning strategy: if a partition number is given, send the data directly to the specified partition.
The second partitioning strategy: if no partition number is given but a key is, partition by the hashCode of the key.
The third partitioning strategy: if neither a partition number nor a key is given, partition in a round-robin fashion.
The fourth partitioning strategy: a custom partitioner.
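A minimal sketch of how the four strategies map onto the Java ProducerRecord constructors and the partitioner.class setting; the broker address, topic and the MyPartitioner class name are placeholder assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitionStrategies {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Strategy 4: a custom partitioner would be plugged in like this
        // props.put("partitioner.class", MyPartitioner.class.getName());   // hypothetical class

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Strategy 1: explicit partition number (here partition 0)
            producer.send(new ProducerRecord<>("my-topic", 0, "key", "to partition 0"));
            // Strategy 2: key only; the default partitioner hashes the key to pick a partition
            producer.send(new ProducerRecord<>("my-topic", "key", "partitioned by key hash"));
            // Strategy 3: no partition and no key; records are spread across partitions
            // (round-robin in older clients, sticky batching in newer ones)
            producer.send(new ProducerRecord<>("my-topic", "spread across partitions"));
            producer.flush();
        }
    }
}
```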

Origin blog.csdn.net/weixin_42072754/article/details/109295219