Introduction to Kafka concepts and components

Kafka: a distributed message queuing system

1. A distributed message queue system: first in, first out, and it also serves as a distributed data cache

2. Message persistence: reads and writes take O(1) time thanks to read-ahead and write-behind. Data is laid out in sequence (ABCDE: read A, then read B; new writes are appended at the end), and sequential disk access can be faster than random memory access.

1. Reasons for Kafka's speed:

1. Caching (the operating system's page cache)

2. Sequential writes (data is written to disk in order)

3. Zero copy (1. data in the operating system cache is sent straight to the network card; 2. the network card transmits it to downstream consumers)

4. Batch sending and data compression

Kafka summary: a distributed, partitioned, multi-replica, multi-subscriber distributed log system coordinated through ZooKeeper

2. To ensure message reliability, at least the following parameters need to be configured (see section 13 below for details):

Topic level: replication-factor>=3, i.e. multiple replicas;

Producer level: acks=-1; in synchronous sending mode, also set producer.type=sync;

              acks=-1: the producer waits until all followers in the ISR have confirmed receipt of the data before it considers the send complete. This gives the highest reliability.

Broker level: disable unclean leader election, that is, unclean.leader.election.enable=false;
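
The checklist above can be expressed as a small validation helper. This is only a sketch: the function name and its arguments are hypothetical, and it checks plain values rather than talking to a real cluster. It also accepts acks=all, which Kafka treats as equivalent to acks=-1.

```python
def reliability_problems(replication_factor, acks, unclean_leader_election):
    """Return the settings that fall short of the reliability checklist."""
    problems = []
    if replication_factor < 3:
        problems.append("replication-factor should be >= 3")
    if str(acks) not in ("-1", "all"):
        problems.append("acks should be -1 (wait for all in-sync replicas)")
    if unclean_leader_election:
        problems.append("unclean.leader.election.enable should be false")
    return problems


if __name__ == "__main__":
    print(reliability_problems(replication_factor=1, acks=1,
                               unclean_leader_election=True))
```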

3. The role of Kafka: a message buffer. Flume can deliver millions of messages while Spark Streaming can only handle tens of thousands, so a Kafka buffer is needed in between.

1. Kafka characteristics:

- High throughput, low latency: Kafka can handle hundreds of thousands of messages per second, with latency as low as a few milliseconds

- Scalability: a Kafka cluster supports hot expansion (adding brokers without downtime)

- Durability, reliability: messages are persisted to local disk, and data is replicated to prevent data loss

- Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 nodes may fail)

- High concurrency: supports thousands of clients reading and writing simultaneously

Kafka application scenarios

The main application scenarios are: log collection, messaging, user activity tracking, operational metrics, stream processing, etc.

  • Message system
  • Log system
  • Stream processing

2. Component description

• Kafka is distributed internally; a Kafka cluster usually includes multiple Brokers

• Load balancing (coordinated through ZooKeeper): a Topic is divided into multiple Partitions, and each Broker stores one or more Partitions

• Multiple Producers and Consumers simultaneously produce and consume messages

1. Broker: each Kafka instance (server); it can be understood as one machine (node)

2. Producer: the message producer, a client or service that publishes (writes) messages to the Kafka cluster.

  • The producer publishes data to a topic of its choice and can decide which partition each message goes to (for example, by cycling through the partitions in turn, or by applying a partitioning function to the message key)

  • The producer sends each message directly to the broker hosting the target partition, without any routing layer.
  • Messages are sent in batches, once a certain number have accumulated or a certain time has passed.
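
The two partitioning strategies mentioned above (cycling through partitions, or mapping a key to a partition) can be sketched as follows. The `make_partitioner` helper is hypothetical, and `zlib.crc32` is used here only as a convenient stand-in hash; it is not the hash function Kafka's own partitioner uses.

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Return a function that picks a partition for each message."""
    counter = itertools.count()

    def choose(key=None):
        if key is None:
            # No key: hand out partitions in turn (round robin).
            return next(counter) % num_partitions
        # Keyed message: the same key always maps to the same partition.
        return zlib.crc32(key) % num_partitions

    return choose


if __name__ == "__main__":
    choose = make_partitioner(3)
    print([choose() for _ in range(4)])
    print(choose(b"user-1") == choose(b"user-1"))
```

Because a key always lands on the same partition, all messages for that key keep their relative order.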

3. Consumer: a client or service that consumes messages (reads data) from the Kafka cluster.

  • Kafka offers a more abstract consumption model: the consumer group (identified by a group id)

  • This model covers both the traditional queue and publish-subscribe models

– First, consumers label themselves with a consumer group name. Each message is delivered to one consumer instance in each subscribing consumer group.

– If all consumer instances belong to the same consumer group, this works like a traditional queue.

– If all consumer instances belong to different consumer groups, this works like traditional publish-subscribe.

– A consumer group is like a logical subscriber, and each subscriber is made up of many consumer instances (for scaling or fault tolerance).

  • Compared with traditional messaging systems, Kafka has a stronger ordering guarantee.
  • Because a topic is partitioned, Kafka can preserve ordering and balance load across multiple Consumer processes.
  • Consumers in the same group can consume messages from the same topic in parallel, but they will not consume the same message twice
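
The queue and publish-subscribe behaviours described above fall out of one rule: each message reaches exactly one consumer per group. The toy simulation below illustrates that rule only. The `deliver` function is hypothetical, and it spreads individual messages round robin inside a group, whereas real Kafka assigns whole partitions to consumers.

```python
from collections import defaultdict


def deliver(messages, consumers):
    """consumers: list of (consumer_name, group_name) pairs.

    Every message goes to exactly one consumer in each group.
    """
    groups = defaultdict(list)
    for name, group in consumers:
        groups[group].append(name)

    received = defaultdict(list)
    for i, message in enumerate(messages):
        for members in groups.values():
            received[members[i % len(members)]].append(message)
    return dict(received)


if __name__ == "__main__":
    # Same group: the messages are split between consumers (queue).
    print(deliver(["m0", "m1", "m2", "m3"], [("c1", "g"), ("c2", "g")]))
    # Different groups: every consumer sees every message (publish-subscribe).
    print(deliver(["m0", "m1"], [("c1", "g1"), ("c2", "g2")]))
```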

 

4. Topic (a virtual concept; a queue): the category to which each message published to the Kafka cluster belongs. Kafka is topic-oriented

Topic and offset

• A topic is a category or feed name to which messages are published. The Kafka cluster keeps a partitioned log for each topic, and each partition is an ordered, immutable sequence of messages.

• The commit log can be appended to continuously. Each message in a partition is assigned a sequential id called the offset, which uniquely identifies the message within that partition.

 

1. Messages are cleaned up regardless of whether they have been consumed (unconsumed messages stay persisted until they are cleaned up under one of the following two configurations)

       (1) Configure the retention period: 7 days (2) Configure the maximum amount of data

2. The offset in the log is persisted for each consumer. Usually the offset advances linearly as the consumer reads messages, but the position is actually controlled by the consumer, which can consume messages in any order. For example, it can reset to an older offset to reprocess messages.

3. Each partition represents a unit of parallelism.
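
A minimal in-memory sketch of the offset model described above: an append-only log hands out increasing offsets, and the consumer alone decides where to read next. The `PartitionLog` and `Consumer` classes are hypothetical illustrations, not Kafka client classes.

```python
class PartitionLog:
    """Append-only list of messages; the list index plays the role of the offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to this message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Keeps its own position; the log knows nothing about it."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)  # normally the offset advances linearly
        return batch

    def seek(self, offset):
        self.offset = offset  # rewind (or skip ahead) at will


if __name__ == "__main__":
    log = PartitionLog()
    for m in "abc":
        log.append(m)
    consumer = Consumer(log)
    print(consumer.poll())
    consumer.seek(1)  # go back and reprocess from offset 1
    print(consumer.poll())
```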

5. Partition: each topic contains one or more partitions. The partition is Kafka's unit of allocation

  • Partitions are the unit of storage and enable load balancing (different partitions can be placed on different machines) while preserving message order

  • Ordering guarantee: ordering is local to a partition (consumers read messages from front to back, and written messages are appended at the end)

  • A partition exists on disk as a folder

In most messaging systems, messages under the same topic are stored in one queue. Partitioning divides this queue into several smaller queues, each of which is a partition, as shown below:

It can be seen from the picture above. Without partitions, only one consumer of a topic can consume the message queue. With partitioning, if there are two partitions, up to two consumers can consume at the same time, so consumption is faster. If that is still not fast enough, use four partitions so that four consumers can consume in parallel.

The partition design greatly improves the throughput of Kafka!

This picture illustrates the following points:

1. A partition can be consumed by only one consumer within the same group (in the figure, only one arrow points to each partition)

2. A consumer can consume multiple partitions (the first consumer in the figure consumes Partitions 0 and 3)

3. Consumption is most efficient when the number of partitions equals the number of consumers, so that each consumer is fully responsible for one partition.

4. The number of consumers should not exceed the number of partitions. Because of point 1, any consumers beyond the number of partitions sit idle.

5. A consumer group can be thought of as a cluster of subscribers, in which each consumer is responsible for the partitions it consumes
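
The rules above can be checked with a small assignment sketch. The `assign_partitions` function is hypothetical and uses a simple round-robin scheme, which is an assumption; Kafka supports several assignment strategies.

```python
def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer of the group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Four partitions, three consumers: the first consumer gets partitions 0 and 3.
    print(assign_partitions(4, ["c1", "c2", "c3"]))
    # Two partitions, three consumers: the third consumer is idle.
    print(assign_partitions(2, ["c1", "c2", "c3"]))
```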


The relationship between Topic and Partition:

Each partition is ordered and immutable.

Kafka guarantees message order within a partition, but not across a whole topic.

(1) A Topic is a logical concept, a Partition is a physical concept, and one or more Partitions form a topic.

(2) The partitions of a topic are stored on brokers as folders (each folder holds different content). Partition numbers start at 0 and increase, and the messages within each partition are ordered

Note: Generally, there are as many topics as there are tables, but some tables of the same kind may be pre-aggregated and stored in one topic

   A partition has 2 parts: (1) the index log (position index information) (2) the message log (the actual data)

Search: binary search + sequential scan (How do you quickly find the position of a value in a sorted sequence of numbers?)
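
The "binary search + sequential scan" lookup can be sketched like this, assuming a sparse index that records the position of only some messages. The data layout (lists of tuples, with positions as list indices) is a simplification chosen for illustration and is not Kafka's on-disk format.

```python
import bisect


def find_message(index, log, target_offset):
    """index: sorted list of (offset, position) for a subset of messages.
    log: list of (offset, payload) in offset order; a position is a list index.
    """
    indexed_offsets = [offset for offset, _ in index]
    # Binary search: last index entry whose offset is <= the target.
    i = bisect.bisect_right(indexed_offsets, target_offset) - 1
    if i < 0:
        return None
    position = index[i][1]
    # Sequential scan forward from that position.
    for offset, payload in log[position:]:
        if offset == target_offset:
            return payload
        if offset > target_offset:
            break
    return None


if __name__ == "__main__":
    log = [(o, f"m{o}") for o in range(10)]
    index = [(0, 0), (4, 4), (8, 8)]
    print(find_message(index, log, 6))
```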


6. Segment: a partition is physically composed of multiple segments.

7. Offset: each partition consists of a series of ordered, immutable messages, which are appended to the partition one after another. Each message in a partition has a monotonically increasing sequence number called the offset, which is unique within the partition.

Offset (every topic has offsets): it locates the position from which data is read (to read, you need not only the offset but also the partition to read from)

Offset storage location: where the consumer's position (the number of messages it has consumed on each partition) is saved depends on the Kafka version.

    In older Kafka versions (through the 0.8 line), offsets were saved in ZooKeeper.

    In later versions (0.9 onward by default), offsets are stored in the Kafka cluster itself.

  •     LEO (log end offset): marks the end of each replica's log, i.e. the offset the next appended message will receive

  •     HW (high watermark): the smallest LEO among the in-sync replicas of a partition; consumers can only read messages below the HW
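
As a small worked example of the two terms, the high watermark can be computed as the minimum LEO over the in-sync replicas. The function name and the broker names below are hypothetical.

```python
def high_watermark(leo_by_replica, isr):
    """Smallest log end offset among the in-sync replicas."""
    return min(leo_by_replica[replica] for replica in isr)


if __name__ == "__main__":
    leo = {"broker1": 10, "broker2": 8, "broker3": 5}
    # broker3 has fallen out of the ISR, so it does not hold the HW back.
    print(high_watermark(leo, ["broker1", "broker2"]))
```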

Offset naming: Kafka's storage files are named after the offset of their first message, in the form offset.kafka. The advantage of naming files by offset is that lookups are easy. For example, to find offset 2049, just look in the file 2048.kafka. The first file is 00000000000.kafka.
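
Finding the right file is again a binary search, this time over the first offsets of the segment files. The sketch below assumes the zero-padded 11-digit `.kafka` naming used in the example above; `segment_for` is a hypothetical helper.

```python
import bisect


def segment_for(offset, base_offsets):
    """base_offsets: sorted first offsets of the segment files."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first segment")
    return f"{base_offsets[i]:011d}.kafka"


if __name__ == "__main__":
    print(segment_for(2049, [0, 1024, 2048, 4096]))
```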

Kafka's brokers keep no per-consumer state, which simplifies Kafka's implementation. Consumers must maintain their own consumption state

 

8. Replica: a copy of a partition, which ensures the partition's high availability.

Multiple replicas of each Kafka partition are the core of Kafka's reliability guarantee. Writing messages to multiple replicas lets Kafka preserve the durability and reliability of messages when a broker crashes.

A topic has multiple partitions, and each partition has its own replicas. Exactly one of them is the leader replica; the rest are follower replicas.

The relationship between Topic, Partition, and Replica is as follows:

  • The number of replicas can be set with the replication-factor parameter when creating the topic, or at the broker level with default.replication.factor; it is generally set to 3;

  • Of the three replicas, one is the leader and two are followers. The leader handles reads and writes of messages, and the followers regularly replicate the latest messages from the leader to stay consistent with it. When the leader goes down, a new leader is elected from the followers to take over reads and writes. The replica architecture introduces data redundancy, but it is what makes Kafka highly reliable.

• follower: a replica role; it replicates (fetches) data from the leader.

• leader: a replica role; producers and consumers interact only with the leader.

• Controller: one of the brokers in the Kafka cluster, responsible for leader election and various failovers.

9. ZooKeeper:

Kafka uses ZooKeeper to store the cluster's metadata and (in older versions) consumer offsets.

Kafka needs to be deployed together with ZooKeeper, which underpins the availability of the Kafka system. Some topic information is also stored in ZooKeeper.

(1) Kafka uses ZooKeeper to store the meta information of the cluster.

(2) If the broker hosting the controller goes down, its ephemeral node disappears. The other brokers in the cluster watch this ephemeral node; when it disappears, they race to create it again, which ensures that a new broker takes over the controller role.

Brokers still rely on ZooKeeper: Kafka also uses it to elect the controller, check whether brokers are alive, and so on.

zk maintains the same offset:

10. Consumer group

Consumers in the same group can consume messages from the same topic in parallel, but they will not consume the same message twice.

If the same topic needs to be consumed multiple times, set up multiple consumer groups. Each group consumes independently and does not affect the others.

  • In the high-level consumer API, each consumer belongs to a consumer group.

  • Each message (and each partition) can be consumed by only one Consumer within a consumer group, but can be consumed by multiple consumer groups

View command:

11. Message: Kafka's unit of data (compare: Flume has the event, HDFS the block, Kafka the message)

The message is Kafka's most basic unit of data. By default a message cannot exceed 1 MB; the limit can be changed through configuration

• Each producer can publish messages to a topic. If a consumer subscribes to the topic, newly published messages are delivered to that consumer.

• Message format:

– message length: 4 bytes (-1: empty)

– "magic" value: 1 byte (version number of the Kafka protocol, for compatibility)

– crc32: 4 bytes

– timestamp: 8 bytes

– payload: n bytes
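
The field list above can be turned into a small encoder and decoder with Python's `struct` module. This sketch follows the layout exactly as listed in these notes and is not the real Kafka wire format; big-endian byte order and a length field that counts every byte after itself are assumptions.

```python
import struct
import time
import zlib

# length (4 bytes), magic (1 byte), crc32 (4 bytes), timestamp (8 bytes)
HEADER = struct.Struct(">IBIq")


def encode(payload, magic=1, timestamp_ms=None):
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    crc = zlib.crc32(payload)
    # Assumption: the length counts everything after the length field itself.
    length = HEADER.size - 4 + len(payload)
    return HEADER.pack(length, magic, crc, timestamp_ms) + payload


def decode(data):
    length, magic, crc, timestamp_ms = HEADER.unpack_from(data)
    payload = data[HEADER.size:4 + length]
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: message is corrupted")
    return magic, timestamp_ms, payload


if __name__ == "__main__":
    frame = encode(b"hello", timestamp_ms=1234)
    print(len(frame), decode(frame))
```

The CRC lets the reader detect a corrupted payload before handing it to the application.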

Kafka persistence:

• Kafka's storage layout is simple: each Partition of a Topic corresponds to a logical log (a log is a set of segment files of roughly the same size)

• Every time a producer publishes a message to a partition, the broker appends the message to the last segment file. The segment file is flushed to disk once the number of published messages reaches a configured value or a certain time has elapsed. After the flush completes, the message is exposed to consumers.

• Unlike traditional messaging systems, messages stored in Kafka do not have an explicit message id.

• Messages are addressed by their logical offset in the log.

12. High transmission efficiency: zero-copy. A kernel call copies data from disk directly to the socket instead of passing it through the application.

zero-copy: to reduce byte copying, Kafka uses the sendfile system call provided by most operating systems
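
Python exposes the same idea through `socket.sendfile`, which uses the operating system's sendfile call where available and otherwise falls back to ordinary sends. The sketch below only illustrates the technique; the helper name is hypothetical and this is not how Kafka itself (a JVM application) is written.

```python
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a file over a connected socket without reading it into user space."""
    with open(path, "rb") as f:
        return sock.sendfile(f)  # returns the number of bytes sent


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"log segment bytes")
        tmp.flush()
        sender, receiver = socket.socketpair()
        print(send_file_zero_copy(tmp.name, sender))
        print(receiver.recv(1024))
        sender.close()
        receiver.close()
```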

Kafka's messages are stateless, which simplifies Kafka's implementation. Consumers must maintain their own consumption state; the broker does not track it at all

This design is subtle and innovative in its own right:

– Deleting a message from the broker becomes tricky because the broker does not know whether consumers have already used the message. Kafka solves this by applying a simple time-based SLA as the retention policy: once a message has been in the broker for a certain period, it is automatically deleted.

– Benefit: consumers can deliberately rewind to an old offset and consume data again. This violates the usual contract of a queue, but has proven to be an essential feature for many consumers.
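
A rough sketch of such a retention policy, covering the two kinds of limit these notes mention (a retention period such as 7 days, and a maximum amount of data). The segment dictionaries and the `apply_retention` name are hypothetical, and deleting whole segments oldest-first is an assumption about granularity.

```python
SEVEN_DAYS = 7 * 24 * 3600


def apply_retention(segments, now, max_age=SEVEN_DAYS, max_bytes=None):
    """segments: list of dicts with 'base_offset', 'last_write' and 'size',
    ordered oldest first. Consumption state plays no part in the decision."""
    # Time-based limit: drop segments that are older than max_age.
    kept = [s for s in segments if now - s["last_write"] <= max_age]
    # Size-based limit: drop the oldest segments until the total fits.
    if max_bytes is not None:
        while len(kept) > 1 and sum(s["size"] for s in kept) > max_bytes:
            kept.pop(0)
    return kept


if __name__ == "__main__":
    segments = [
        {"base_offset": 0, "last_write": 0, "size": 100},
        {"base_offset": 100, "last_write": SEVEN_DAYS, "size": 100},
    ]
    print(apply_retention(segments, now=SEVEN_DAYS + 100))
```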

13. Reliable delivery guarantees

Kafka uses an at-least-once delivery strategy by default. That is, the processing order on the consumer side is: get the message -> process the message -> save the position. If a client crashes, the new client that takes over may reprocess messages that the previous client had already processed but whose position had not yet been saved.

Three guarantee strategies:

– At most once: messages may be lost, but are never delivered twice (rarely used)

– At least once: messages are never lost, but may be delivered more than once (commonly used)

– Exactly once: every message is delivered once and only once
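
The difference between the first two strategies comes down to whether the position is saved before or after processing. The simulation below is a sketch with hypothetical names: `crash_at` marks the offset at which the consumer dies between its two steps.

```python
def consume(log, start, commit_first, crash_at=None):
    """Process log[start:] and return (processed, committed_position)."""
    processed = []
    committed = start
    for offset in range(start, len(log)):
        if commit_first:
            committed = offset + 1            # save the position first ...
            if offset == crash_at:
                return processed, committed   # ... crash before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])     # process first ...
            if offset == crash_at:
                return processed, committed   # ... crash before saving
            committed = offset + 1
    return processed, committed


if __name__ == "__main__":
    log = ["a", "b", "c"]
    # At least once: "b" is processed again after the restart.
    first, position = consume(log, 0, commit_first=False, crash_at=1)
    print(first, consume(log, position, commit_first=False)[0])
    # At most once: "b" is never processed.
    first, position = consume(log, 0, commit_first=True, crash_at=1)
    print(first, consume(log, position, commit_first=True)[0])
```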

Replica management

• Kafka replicates the log to multiple designated servers.

• The unit of replication is the partition. Under normal circumstances, each partition has one leader and zero or more followers.

• The leader handles all read and write requests for its partition. There can be more partitions than brokers, and leaders are spread across the brokers.

• A follower's log is identical to the leader's log; followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader.

14. ISR (in-sync replicas): the set of replicas of a partition (the leader plus the followers that are keeping in sync with it)

A topic in the Kafka cluster has multiple partitions. To achieve high availability, a log replication strategy is adopted:

---When machines go down and the leader is among them, only a follower in the ISR set has a chance to become the new leader, because membership in the ISR means its data is in sync with the leader's.

In short, only followers in the ISR set are eligible to become the leader.
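
The election rule can be sketched as follows. The function and broker names are hypothetical, and picking the first surviving candidate is an arbitrary choice made for the example. The `unclean_leader_election` flag mirrors the unclean.leader.election.enable setting: when it is off, a partition with no surviving ISR member simply has no leader.

```python
def elect_leader(replicas, isr, failed, unclean_leader_election=False):
    """Pick a new leader, preferring replicas that are in the ISR."""
    candidates = [r for r in isr if r not in failed]
    if candidates:
        return candidates[0]
    if unclean_leader_election:
        # Risky fallback: an out-of-sync replica may be missing messages.
        alive = [r for r in replicas if r not in failed]
        if alive:
            return alive[0]
    return None  # the partition stays unavailable


if __name__ == "__main__":
    replicas = ["broker1", "broker2", "broker3"]
    print(elect_leader(replicas, ["broker1", "broker2"], failed={"broker1"}))
    print(elect_leader(replicas, ["broker1"], failed={"broker1"}))
```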

How does the leader know whether a follower has successfully received the data? (heartbeat, ack)

15. How to judge whether a replica is alive:

(1) Heartbeat

(2) A follower must keep up with the leader's updates without falling too far behind to be considered alive; otherwise it is considered down and is removed from the ISR
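
Condition (2) can be sketched as a time-based check: a replica stays in the ISR only if it caught up with the leader recently enough. The names below are hypothetical. Kafka exposes a comparable threshold as the broker setting replica.lag.time.max.ms, but the exact rule Kafka applies is not reproduced here.

```python
def update_isr(isr, last_caught_up, now, max_lag_seconds):
    """Keep only the replicas that caught up within the allowed lag."""
    return [r for r in isr if now - last_caught_up[r] <= max_lag_seconds]


if __name__ == "__main__":
    last_caught_up = {"broker1": 100, "broker2": 99, "broker3": 60}
    # broker3 last caught up 40 seconds ago and is dropped from the ISR.
    print(update_isr(["broker1", "broker2", "broker3"], last_caught_up,
                     now=100, max_lag_seconds=10))
```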


Origin blog.csdn.net/qq_36816848/article/details/113637099