Introduction to the distributed messaging system Kafka and an analysis of its principles

About Kafka: Kafka is an open source messaging system originally developed at LinkedIn around 2010. It is a distributed message queue based on the publish/subscribe model and is mainly used to process active streaming data. Traditional log analysis systems provide a scalable solution for processing log information offline, but they usually introduce a large delay when used for real-time processing. Existing message (queue) systems handle real-time or near-real-time workloads well, but unprocessed data is usually not written to disk, which can be a problem for offline applications such as Hadoop. Kafka is designed to solve both problems, so it supports offline and online applications well.

1. Benefits of a message queue such as Kafka:
(1) Decoupling:
Each side of the queue can be extended or modified independently, as long as both sides comply with the same interface constraints. A simple analogy: when you take your phone in for repair, you and the repairman act as two systems, but you do not have to wait there while the phone is being repaired. You can go shopping at the supermarket or have a meal, and pick up the phone when you come back. In other words, the work done by the two systems does not have to happen at the same time.
(2) Recoverability:
When one part of the system fails, it does not bring down the entire system. The message queue reduces the coupling between processes, so even if a message-processing process crashes, the messages already added to the queue can still be processed after the system recovers.
(3) Peak shaving:
When traffic rises sharply, applications still need to keep working, but such bursts are uncommon. Provisioning resources permanently just to handle these peaks would be a huge waste. A message queue lets key components withstand sudden access pressure instead of crashing completely under a burst of overloaded requests.
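
The peak-shaving idea can be illustrated with a small simulation. This is only a sketch using an in-memory queue, not Kafka itself; the function name simulate_peak and the traffic numbers are made up for illustration. The producer side enqueues a burst without waiting, while the consumer side drains the queue at a fixed rate.

```python
from collections import deque


def simulate_peak(arrivals, capacity_per_tick):
    """Return the queue backlog after each tick.

    arrivals: number of requests arriving in each tick (hypothetical data).
    capacity_per_tick: how many requests the consumer can handle per tick.
    """
    queue = deque()
    backlog = []
    next_id = 0
    for count in arrivals:
        # Producer side: enqueue everything that arrives, without waiting.
        for _ in range(count):
            queue.append(next_id)
            next_id += 1
        # Consumer side: process at its own steady pace.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    # After the burst, keep draining until the queue is empty.
    while queue:
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    return backlog


# A burst of 10 requests in the second tick; the consumer handles 3 per tick.
backlog = simulate_peak(arrivals=[2, 10, 2, 0], capacity_per_tick=3)
print(backlog)
```

The backlog grows during the burst and shrinks afterwards, so the consumer never has to be sized for the peak.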

2. Two message queue models
(1) Point-to-point messaging system (one-to-one; consumers actively pull data, and a message is removed once it has been received)
In a point-to-point system, messages are kept in a queue. One or more consumers can consume messages from the queue, but any particular message can be consumed by at most one consumer. Once a consumer reads a message, it disappears from the queue. A typical example is an order processing system, where each order is handled by exactly one order processor, while multiple order processors can work at the same time. The following figure describes the structure.

[Figure: structure of a point-to-point messaging system]
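
As a rough sketch of the point-to-point model, the example below uses Python's standard queue module rather than any Kafka API. The names process_orders, the order IDs and the worker names are assumptions for illustration. Several workers pull from one queue, and each order ends up handled by exactly one of them.

```python
import queue
import threading


def process_orders(orders, worker_names):
    """Let several workers drain one queue; return {order: [workers that handled it]}."""
    pending = queue.Queue()
    for order in orders:
        pending.put(order)

    handled_by = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # Taking a message removes it from the queue,
                # so no other worker can receive the same order.
                order = pending.get_nowait()
            except queue.Empty:
                return
            with lock:
                handled_by.setdefault(order, []).append(name)

    threads = [threading.Thread(target=worker, args=(name,)) for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled_by


handled_by = process_orders(
    ["order-1", "order-2", "order-3", "order-4"],
    ["worker-a", "worker-b"],
)
print(handled_by)
```

Which worker gets which order depends on thread scheduling, but every order appears with exactly one worker.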

(2) Publish-subscribe messaging system (one-to-many; messages are not removed after a consumer has consumed them)
In a publish-subscribe system, messages are kept in a topic. Unlike in a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In a publish-subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels such as sports, movies and music. Anyone can subscribe to their own set of channels and receive the content of those channels whenever it becomes available.

[Figure: structure of a publish-subscribe messaging system]
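
The publish-subscribe model can be sketched in a few lines. This is a toy in-memory bus, not Kafka; the class TopicBus and the channel names are hypothetical. The point it shows is that every subscriber of a topic receives every message published to it.

```python
from collections import defaultdict


class TopicBus:
    """Toy publish-subscribe bus: each subscriber gets its own copy of every message."""

    def __init__(self):
        # topic name -> list of subscriber inboxes
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, inbox):
        self._subscribers[topic].append(inbox)

    def publish(self, topic, message):
        # Delivering to one subscriber does not remove the message for the others.
        for inbox in self._subscribers[topic]:
            inbox.append(message)


bus = TopicBus()
alice, bob = [], []
bus.subscribe("sports", alice)
bus.subscribe("sports", bob)
bus.subscribe("movies", bob)   # a subscriber may follow several topics

bus.publish("sports", "goal")
bus.publish("movies", "premiere")
print(alice, bob)
```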

3. Kafka's architecture

[Figure: Kafka architecture]

Some basic concepts from the figure above:

1) Producer: the message producer, i.e. the client that sends messages to a Kafka broker.
2) Consumer: the message consumer, i.e. the client that fetches messages from a Kafka broker.
3) Consumer Group (CG): a group made up of multiple consumers. Each consumer in a group is responsible for consuming data from different partitions, and within a group a partition can be consumed by only one consumer; consumer groups do not affect each other. Every consumer belongs to some consumer group, so the consumer group is the logical subscriber.
4) Broker: a Kafka server is a broker. A cluster consists of multiple brokers, and one broker can hold multiple topics.
5) Topic: can be thought of as a queue; producers and consumers both work against a topic.
6) Partition: for scalability, a very large topic can be spread across multiple brokers (i.e. servers). A topic can be divided into multiple partitions, and each partition is an ordered queue.
7) Replica: a copy of a partition. To ensure that partition data is not lost when a node in the cluster fails and that Kafka can keep working, Kafka provides a replication mechanism: each partition of a topic has several replicas, one leader and a number of followers.
8) Leader: the "master" among the replicas of a partition. Producers send data to the leader, and consumers consume data from the leader.
9) Follower: a "slave" among the replicas of a partition. It synchronizes data from the leader in real time and stays in sync with it. When the leader fails, a follower becomes the new leader.
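
To make the partition and consumer group concepts concrete, here is a minimal sketch under stated assumptions. The helper names choose_partition and assign_partitions are hypothetical, CRC32 stands in for the real hash (Kafka's default Java partitioner uses murmur2), and plain round-robin stands in for Kafka's configurable assignment strategies.

```python
import zlib


def choose_partition(key, partition_count):
    """Map a message key to a partition, so equal keys land in the same partition.

    CRC32 is only a stand-in here; it is not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % partition_count


def assign_partitions(partition_count, consumers):
    """Round-robin partitions over the members of one consumer group.

    Each partition gets exactly one owner within the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(partition_count):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(choose_partition("order-42", 3))
print(assign_partitions(4, ["c1", "c2"]))
# With more consumers than partitions, the extra consumer stays idle.
print(assign_partitions(2, ["c1", "c2", "c3"]))
```

Because a partition has a single owner inside a group, adding consumers beyond the number of partitions does not increase parallelism.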

Note: Early versions of Kafka used ZooKeeper (zk) to store metadata, consumer consumption state, group management information and offsets. Because of limitations in ZooKeeper itself and the risk of it becoming a single point of failure in the overall architecture, newer versions have gradually reduced ZooKeeper's role. The new consumer uses a group coordination protocol built into Kafka, which also reduces the dependency on ZooKeeper.

Analysis of Kafka's principles

1. Main design concepts
Kafka differs from most other messaging systems because of the following important design decisions:
(1) Kafka is designed with persistent messages as the normal use case.
(2) Kafka's main design constraint is throughput rather than features.
(3) State about which data has been consumed is kept as part of the consumer rather than on the server.
(4) Kafka is explicitly a distributed system. It assumes that producers, brokers and consumers are spread across multiple machines.
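
Design points (1) and (3) can be sketched together: the server side keeps an append-only log and never deletes a record just because someone read it, while each consumer remembers its own position. The classes PartitionLog and OffsetTrackingConsumer are hypothetical, and a Python list stands in for the on-disk log.

```python
class PartitionLog:
    """Append-only log; reading never removes records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class OffsetTrackingConsumer:
    """Consumer that keeps its own read position (offset)."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for record in ["a", "b", "c"]:
    log.append(record)

fast = OffsetTrackingConsumer(log)
slow = OffsetTrackingConsumer(log)
first_fast = fast.poll()               # reads everything
first_slow = slow.poll(max_records=1)  # reads only the first record
slow.offset = 0                        # rewinding is just changing the offset
replayed = slow.poll(max_records=2)
print(first_fast, first_slow, replayed)
```

Since the log holds no per-consumer state, a slow consumer or a replay does not affect other consumers.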

2. The role of ZooKeeper in Kafka

(1) The Kafka cluster, producers and consumers all rely on ZooKeeper to ensure system availability and to store some of the cluster's metadata.
(2) Kafka uses ZooKeeper as its distributed coordination framework, which ties together message production, message storage and message consumption.
(3) With the help of ZooKeeper, Kafka can establish the subscription relationship between producers and consumers and balance load between them while all components, including producers, consumers and brokers, remain stateless.
(4) Adding or removing Kafka servers triggers corresponding events on the ZooKeeper nodes. The Kafka system captures these events to perform a new round of load balancing, and clients capture them as well to start a new round of processing.
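
The event-driven rebalancing described in (4) can be sketched with an in-memory stand-in for the coordination service. This does not use the ZooKeeper API; MembershipRegistry, spread_partitions and the broker names are all hypothetical, and keeping callbacks registered permanently is a simplification.

```python
class MembershipRegistry:
    """In-memory stand-in for the coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._brokers = set()
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def _notify(self):
        snapshot = sorted(self._brokers)
        for callback in self._watchers:
            callback(snapshot)

    def add_broker(self, broker_id):
        self._brokers.add(broker_id)
        self._notify()

    def remove_broker(self, broker_id):
        self._brokers.discard(broker_id)
        self._notify()


def spread_partitions(partition_count, brokers):
    """Recompute which broker hosts each partition (simple round-robin)."""
    if not brokers:
        return {}
    return {p: brokers[p % len(brokers)] for p in range(partition_count)}


history = []
registry = MembershipRegistry()
# Every membership change triggers a new round of load balancing.
registry.watch(lambda brokers: history.append(spread_partitions(4, brokers)))

registry.add_broker("broker-1")
registry.add_broker("broker-2")
registry.remove_broker("broker-1")
for layout in history:
    print(layout)
```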

3. How Kafka runs with ZooKeeper

[Figure: flowchart of Kafka working with ZooKeeper]

From the flowchart above, we can draw the following conclusions.

(1) Server1 is the Kafka server, since both the producer and the consumer use it; the broker mainly serves as storage.
(2) Server2 is the ZooKeeper server, which records the IP address, port and other information of each node.
(3) What Server3, Server4 and Server5 have in common is that each is configured with zkClient; more specifically, the ZooKeeper address must be configured before they run. The connections between them are coordinated through ZooKeeper.
(4) Server1 and Server2 can be placed on the same machine or deployed separately. ZooKeeper can also be configured as a cluster to guard against the unexpected failure of a single ZooKeeper server.
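
The discovery role described in (2) and (3) can be sketched as follows. It is only an illustration with an in-memory directory rather than a real ZooKeeper client; NodeDirectory, Client, the node IDs and the addresses are all assumptions. The client is configured with nothing but the directory and resolves every other address through it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


class NodeDirectory:
    """In-memory stand-in for the service that records each node's IP and port."""

    def __init__(self):
        self._nodes = {}

    def register(self, node_id, endpoint):
        self._nodes[node_id] = endpoint

    def lookup(self, node_id):
        return self._nodes[node_id]


class Client:
    """Knows only the directory up front; everything else is looked up."""

    def __init__(self, directory):
        self._directory = directory

    def resolve(self, node_id):
        endpoint = self._directory.lookup(node_id)
        return f"{endpoint.host}:{endpoint.port}"


directory = NodeDirectory()
directory.register("server1", Endpoint("10.0.0.1", 9092))  # hypothetical address
client = Client(directory)
address = client.resolve("server1")
print(address)
```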

Origin blog.csdn.net/weixin_44080445/article/details/107287767