Summary of Kafka introductory knowledge

Kafka overview

Kafka is a distributed message queue based on the publish/subscribe model, mainly used for real-time processing of big data.

Message queue

The benefits of using message queues:

  • Decoupling
    allows you to extend or modify the processing on both sides independently, as long as you ensure that they comply with the same interface constraints.

  • Recoverability
    When a part of the system fails, it does not affect the entire system. A message queue reduces the coupling between processes, so even if a message-processing process crashes, the messages added to the queue can still be processed after the system recovers.

  • Buffering
    Helps to control and optimize the speed of data flowing through the system (peak shaving and valley filling), resolving the mismatch between the rate at which messages are produced and the rate at which they are consumed.

  • Flexibility & peak processing capacity
    In the case of a surge in traffic, applications still need to continue to function, but such burst traffic is not common. It is undoubtedly a huge waste to invest resources at any time in order to be able to handle such peak visits. The use of message queues enables key components to withstand the sudden access pressure, and will not completely collapse due to sudden overloaded requests.

  • Asynchronous communication
    In many cases, users do not want or need to process messages immediately. A message queue provides an asynchronous processing mechanism: a user can put a message into the queue without processing it right away, enqueue as many messages as needed, and process them when required.

Two modes of message queues

 (1) Point-to-point mode (one-to-one; the consumer actively pulls data, and a message is removed once it has been received). The producer sends messages to a queue, and the consumer takes messages from the queue and consumes them. Once a message has been consumed, it is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed. A queue supports multiple consumers, but any given message can be consumed by only one consumer.
 (2) Publish/subscribe mode (one-to-many; messages are not removed after consumers consume them). The producer publishes messages to a topic, and multiple consumers subscribe to and consume those messages. Unlike the point-to-point mode, a message published to a topic is consumed by all subscribers.

The advantages of this (consumer-pull) approach are:

  • It supports multiple consumers
  • Each consumer determines its own consumption speed

The disadvantage is that the consumer must maintain a service that long-polls the message queue to check whether messages have arrived, which wastes resources.

What is Kafka?

Kafka is a distributed data streaming platform.
In stream processing, Kafka is generally used to buffer data, and Spark performs computations by consuming data from Kafka.

  • Apache Kafka is an open-source messaging system written in Scala, developed by the Apache Software Foundation.
  • Kafka was originally developed by LinkedIn and open-sourced in early 2011; it graduated from the Apache Incubator in October 2012. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data.
  • Kafka is a distributed message queue. Kafka categorizes stored messages by Topic. The message sender is called the Producer and the message receiver is called the Consumer. A Kafka cluster is composed of multiple Kafka instances, and each instance (server) is called a broker.
  • Both the Kafka cluster and the consumers rely on a ZooKeeper cluster to store metadata and ensure system availability.

Features of Kafka

  • Like message queues and commercial messaging systems, Kafka provides publish and subscribe for streams of data
  • Kafka stores streams of data in a durable, fault-tolerant way
  • Kafka performs well and can process streams of data as they arrive
  • Kafka runs as a cluster on one or more servers that can span multiple data centers
  • The Kafka cluster stores streams of records in categories called topics
  • Each record consists of a key, a value, and a timestamp

Key concepts

Broker : A Kafka server is a broker; a cluster is composed of multiple brokers.
Topic : A topic is a data subject. Kafka recommends storing data from different business systems in different topics. Topics in Kafka always follow a multi-subscriber model: a topic can have one or more consumers subscribing to its data, and a large topic can be distributed across multiple Kafka brokers. A topic can be compared to a database in a database server.

Partition : Each topic can have multiple partitions, and this partitioned design is what lets a topic keep growing: the partitions of a topic are distributed across multiple brokers. Partitioning also allows a topic to be consumed by multiple consumers in parallel. Partitions can be compared to tables in a database.
Kafka only guarantees that messages are delivered to consumers in order within a partition; it does not guarantee the order of a topic as a whole (across partitions).
Offset : Data is continuously appended, in order of arrival, to a structured commit log for the partition. The records stored in each partition are ordered, and that order is immutable. The position of each record in this order is uniquely identified by an id called the offset, so a partition can also be regarded as an ordered, immutable sequence of records.
Producer : the message producer, i.e. the client that sends messages to the Kafka broker. The producer is responsible for assigning records to a specific partition of a topic.
Consumer : the message consumer, i.e. the client that fetches messages from the Kafka broker. Each consumer must maintain the offset from which it reads data. Before version 0.9 the offset is saved in ZooKeeper by default; from version 0.9 on it is saved in Kafka's "__consumer_offsets" topic by default.
Consumer Group : A topic can have multiple consumer groups. Topic messages are conceptually (not physically) copied to every consumer group, but each partition delivers a given message to only one consumer within a group. Each consumer identifies itself with a consumer group name, and different consumer instances in the same group can be spread across multiple processes or machines. (The sketch below shows how these concepts surface in the client API.)
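As a minimal sketch, assuming a local broker and an illustrative topic name, the Java AdminClient can create a topic with a chosen number of partitions and replicas:

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Any broker in the cluster can serve as the bootstrap entry point (assumed address).
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions spread across brokers, each with 2 replicas (leader + follower).
                NewTopic topic = new NewTopic("order-events", 3, (short) 2);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }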

Kafka persistence

The Kafka cluster keeps all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to 2 days, a record can be consumed at any time within two days after it is published; after two days, the record is deleted and the disk space is freed.
Kafka's performance is effectively independent of the amount of data stored, so storing data for a long time is not a problem.
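A hedged sketch of setting the retention period per topic through the AdminClient (the topic name and the 2-day value are assumptions; retention can also be set cluster-wide with the log.retention.hours broker setting):

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RetentionDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "order-events");
                // 2 days in milliseconds; older records become eligible for deletion.
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", String.valueOf(2L * 24 * 60 * 60 * 1000)),
                        AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> changes =
                        Map.of(topic, List.of(setRetention));
                admin.incrementalAlterConfigs(changes).all().get();
            }
        }
    }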

Kafka replication mechanism

The log partitions are distributed across the servers in the Kafka cluster, and each server handles the data and requests for its share of the partitions. Each partition is replicated on a configurable number of servers for fault tolerance.
Each partition has one server acting as the "leader" and zero or more servers acting as followers. The leader handles all read and write requests for the partition, while the followers only passively replicate the leader's data. When the leader goes down, one of the followers automatically becomes the new leader. This mechanism both keeps multiple copies of the data and provides high availability.
For fault-tolerance and load-balancing reasons, the leader and followers of a partition are not placed on the same broker.

Kafka infrastructure


Kafka workflow and file storage mechanism

Messages in Kafka are classified by topic. Producers produce messages and consumers consume messages, and both are topic-oriented.
A topic is a logical concept, while a partition is a physical one. Each partition corresponds to a log file, and the log file stores the data produced by producers. Data produced by a producer is continuously appended to the end of the log file, and each record has its own offset. Each consumer in a consumer group records in real time the offset up to which it has consumed, so that after recovering from a failure it can continue consuming from where it left off.
Since the messages produced by producers are continuously appended to the end of the log file, Kafka adopts a sharding-and-indexing mechanism to prevent an oversized log file from making data lookup inefficient: each partition is divided into multiple segments.

Kafka producer

Kafka partitioning
Reasons for partitioning:

  • It makes it easy to scale in a cluster. Each partition can be adjusted to fit the machine it lives on, and a topic can be composed of multiple partitions, so the whole cluster can accommodate data of any size;
  • It improves concurrency, because reads and writes can be performed at the granularity of a partition.

Partitioning rules (illustrated in the sketch after this list):
  • If a partition is specified, use the specified value directly as the partition value;
  • If no partition is specified but a key is present, take the hash of the key modulo the topic's partition count to obtain the partition value;
  • If neither a partition nor a key is given, generate a random integer on the first call (each subsequent call increments it) and take it modulo the number of available partitions of the topic. This is commonly known as the round-robin algorithm.
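These three cases map one-to-one onto the ProducerRecord constructors of the Java client. A minimal sketch (the topic name and values are illustrative assumptions):

    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PartitioningDemo {
        public static void main(String[] args) {
            // Case 1: partition specified -- the record goes straight to partition 0.
            ProducerRecord<String, String> r1 =
                    new ProducerRecord<>("order-events", 0, "user-42", "created");

            // Case 2: no partition, but a key -- partition = hash(key) % partition count.
            ProducerRecord<String, String> r2 =
                    new ProducerRecord<>("order-events", "user-42", "paid");

            // Case 3: neither partition nor key -- round-robin over available partitions.
            ProducerRecord<String, String> r3 =
                    new ProducerRecord<>("order-events", "shipped");
        }
    }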

Data reliability guarantee

To ensure that data sent by the producer reliably reaches the specified topic, each partition of the topic must send an ack (acknowledgement) back to the producer after receiving the data. If the producer receives the ack, it sends the next round of data; otherwise it resends the current data.

  • ISR
    The leader maintains a dynamic in-sync replica set (ISR): the set of followers that are in sync with the leader. When the followers in the ISR have finished synchronizing the data, the leader sends an ack to the producer. If a follower fails to synchronize with the leader for too long, it is kicked out of the ISR; the time threshold is set by the replica.lag.time.max.ms parameter. After the leader fails, a new leader is elected from the ISR.

  • ack response mechanism

For some less important data, the reliability requirements are not very high and a small amount of loss can be tolerated, so there is no need to wait until all followers in the ISR have received the data successfully.
Kafka therefore provides three reliability levels; users can choose among the following configurations by trading reliability against latency.
The acks parameter (producer-side configuration is sketched after this list):

  • acks = 0: the producer does not wait for the broker's ack. This gives the lowest latency: the broker returns as soon as it receives the message, before it has been written to disk. If the broker fails, data may be lost.
  • acks = 1: the producer waits for the broker's ack, which is returned after the partition's leader has written the message to disk. If the leader fails before the followers finish synchronizing, data may be lost.
  • acks = -1 (all): the producer waits for the broker's ack, which is returned only after the partition's leader and followers have all written the message to disk. If the leader fails after the followers finish synchronizing but before the broker sends the ack, data will be duplicated. Note that if the ISR contains only the leader, acks = -1 degenerates to acks = 1, and data may still be lost.
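On the producer side the level is chosen with the acks setting. A minimal sketch, assuming a local broker and String messages:

    import java.util.Properties;

    public class AcksConfig {
        // Returns producer properties for the chosen reliability level:
        // "0"   -> lowest latency, may lose data
        // "1"   -> leader ack only, may lose data if the leader dies before sync
        // "all" (same as -1) -> leader and ISR followers ack, may duplicate data
        static Properties producerProps(String acks) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", acks);
            return props;
        }
    }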

Failure handling details

LEO: the largest offset in each replica.
HW (high watermark): the largest offset visible to consumers, i.e. the smallest LEO in the ISR.
(1) Follower failure
After a follower fails, it is temporarily kicked out of the ISR. When the follower recovers, it reads the last HW recorded on its local disk, truncates the part of its log file above the HW, and synchronizes from the HW onward with the leader. Once the follower's LEO is greater than or equal to the partition's HW, that is, once the follower has caught up with the leader, it can rejoin the ISR.
(2) Leader failure
After the leader fails, a new leader is elected from the ISR. Then, to ensure data consistency across replicas, the remaining followers first truncate the parts of their log files above the HW and then synchronize data from the new leader.

Exactly Once semantics

  • At most once: each message is delivered at most once
  • At least once: each message is delivered at least once
  • Exactly once: each message is delivered once and only once

If we use acks = -1, we can guarantee that data is not lost, but we cannot guarantee that data is not duplicated. If we need once-and-only-once delivery, then after consuming the data the consumer would have to deduplicate it in application code; and since a topic can be subscribed to by multiple consumer groups, every group would have to deduplicate on its own.

For some more important messages, we need to guarantee exactly-once semantics, that is, each message is delivered once and only once.

Since version 0.11, the Kafka producer has supported an idempotence mechanism. Combined with the at-least-once semantics of acks = -1, this achieves exactly-once semantics from the producer to the broker.
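In the Java producer this is a single configuration switch. A fragment, assuming the props object from the acks sketch above:

    // Enable the idempotent producer (Kafka 0.11+). The broker de-duplicates
    // retried batches using a producer id and per-batch sequence numbers;
    // idempotence requires acks = all (-1).
    props.put("enable.idempotence", "true");
    props.put("acks", "all");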

Kafka consumer

The consumer uses the pull mode to read data from the broker.
The push model has difficulty adapting to consumers with different consumption rates, because the sending rate is determined by the broker. The broker's goal is to deliver messages as fast as possible, but this easily leaves consumers unable to keep up, typically manifesting as denial of service and network congestion. The pull mode lets each consumer consume messages at a rate suited to its own capacity.
The disadvantage of the pull mode is that if Kafka has no data, the consumer may spin in a loop that keeps returning empty data. To address this, Kafka consumers pass a duration parameter, timeout, when pulling data: if no data is currently available for consumption, the consumer waits up to that period before returning.

Partition allocation strategy

Kafka has two built-in allocation strategies, RoundRobin and Range; selecting between them is sketched after the list of rebalance triggers below.

  • RoundRobin operates on the consumer group as a whole. The potential problem is that different consumers in the same group may subscribe to different topics; because partitions are assigned round-robin across everything the group subscribes to, such a configuration can produce invalid assignments.
  • Range operates topic by topic. The problem with this strategy is that it may cause uneven load.


Partition reassignment (rebalance) is triggered when:

  • a consumer is added to a Consumer Group
  • a consumer leaves its current Consumer Group, including shutting down or crashing
  • new partitions are added to a subscribed topic
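The strategy is selected with the consumer's partition.assignment.strategy setting. A minimal sketch (the group id and broker address are assumptions):

    import java.util.Properties;

    public class AssignorConfig {
        static Properties consumerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");              // assumed group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // RangeAssignor is the default; switch to round-robin assignment:
            props.put("partition.assignment.strategy",
                    "org.apache.kafka.clients.consumer.RoundRobinAssignor");
            return props;
        }
    }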

Offset maintenance (important)

  • You need to know where the offset is stored: before version 0.9 it is stored in ZooKeeper by default, and from 0.9 on it is stored in an internal Kafka topic by default.
  • The offset is tied not to an individual consumer but to the (consumer group, topic, partition) triple. Suppose it were tied to the consumer: if that consumer crashed, would its offset be lost? Clearly that would be unreasonable. Instead, when a consumer crashes, the consumer group's partitions are rebalanced; a partition may be assigned to a new consumer, which then continues consuming the partition from the committed offset. A sketch of manual offset commits follows.
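A minimal sketch of committing offsets manually (topic, group id, and broker address are assumptions). The committed offset is keyed by (group, topic, partition), so whichever consumer in the group later owns the partition resumes from it:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ManualCommitDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "demo-group");      // offsets are stored under this group
            props.put("enable.auto.commit", "false"); // we commit explicitly below
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("order-events"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records ...
                consumer.commitSync(); // persists offsets to the __consumer_offsets topic
            }
        }
    }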

Kafka reads and writes data efficiently

1) Sequential disk writes
Data produced by Kafka producers is written to log files, and the writing process appends to the end of the file, which is sequential writing. Figures on the official site indicate that the same disk can sustain sequential writes of about 600 MB/s but only about 100 KB/s for random writes. This comes down to the mechanics of the disk: sequential writing is fast because it saves a great deal of head-seek time.
2) Zero-copy technology
Kafka uses the operating system's zero-copy capability (the sendfile system call) to move data from the page cache directly to the network socket, avoiding extra copies between kernel space and user space.

The role of Zookeeper in Kafka

One broker in the Kafka cluster is elected as the Controller. The election is a race for a resource in ZooKeeper: whichever broker starts first and registers becomes the Controller. The Controller is responsible for managing brokers coming online and going offline, assigning partition replicas for all topics, and leader election.
The Controller's management work all depends on ZooKeeper.

Kafka APIs

Producer API

Message sending process
Kafka's Producer sends messages asynchronously. Message sending involves two threads, the main thread and the Sender thread, plus a thread-shared variable, the RecordAccumulator. The main thread sends messages into the RecordAccumulator, and the Sender thread continuously pulls messages from the RecordAccumulator and sends them to the Kafka broker. A runnable sketch follows the parameter list below.
Related parameters:

  • batch.size: the Sender sends data only after batch.size bytes of data have accumulated.
  • linger.ms: if the accumulated data has not reached batch.size, the Sender sends it anyway after waiting linger.ms.
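A minimal producer sketch (topic name, broker address, and parameter values are assumptions) showing the two batching parameters:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all");
            props.put("batch.size", 16384); // send once 16 KB of records have accumulated...
            props.put("linger.ms", 5);      // ...or after 5 ms, whichever comes first
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    // send() only hands the record to the RecordAccumulator;
                    // the Sender thread ships batches to the broker asynchronously.
                    producer.send(new ProducerRecord<>("order-events", "key-" + i, "value-" + i));
                }
            } // close() flushes any records still sitting in the accumulator
        }
    }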

Consumer API

The reliability of the data a consumer consumes is easy to guarantee, because the data is persisted in Kafka, so there is no need to worry about data loss.
Since a consumer may experience failures such as power outages and crashes while consuming, it needs to resume from where it stopped after it recovers, so the consumer must record in real time the offset up to which it has consumed.
Therefore, the maintenance of offset is a problem that Consumers must consider when consuming data.
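A minimal consumer sketch (topic, group id, and broker address are assumptions); the Duration passed to poll() is the pull-mode timeout described earlier:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "demo-group");
            props.put("enable.auto.commit", "true"); // offsets committed automatically
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("order-events"));
                while (true) {
                    // If no data is available, poll() waits up to 100 ms
                    // and then returns an empty batch.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }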
