Detailed explanation of Kafka’s basic principles, execution processes and usage scenarios

1. Introduction
Apache Kafka is a distributed publish-subscribe messaging system, which is also how the official Kafka website defines it. It was originally developed at LinkedIn and later contributed to the Apache Software Foundation, where it became a top-level open source project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit-log service.

Comparison of several distributed system messaging systems:

[Figure: comparison of several distributed messaging systems]

2. Kafka basic architecture
Its architecture includes the following components:

1. Topic: a specific category of message stream. A message is a payload of bytes, and a topic is the category or feed name to which messages are published;

2. Producer: any object that can publish messages to a topic;

3. Broker: published messages are stored in a group of servers called brokers, which together form a Kafka cluster;

4. Consumer: subscribes to one or more topics and pulls data from the brokers in order to consume the published messages.
[Figure: Kafka basic architecture — producers, brokers hosting topics, consumers]

As the figure above shows, producers send data to brokers, each broker hosts multiple topics, and consumers fetch data from the brokers.
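The producer/broker/consumer relationship can be sketched as a toy in-memory model. This is purely an illustration of the roles, not the real Kafka client API; all class and method names here are invented for the example.

```python
from collections import defaultdict

class Broker:
    """Toy stand-in for a Kafka broker: it stores messages per topic."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only message log

    def append(self, topic, message):
        self.topics[topic].append(message)

    def read(self, topic, offset):
        return self.topics[topic][offset:]

class Producer:
    """Publishes messages to a topic on the broker."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        self.broker.append(topic, message)  # producer pushes to the broker

class Consumer:
    """Subscribes to a topic and fetches its messages from the broker."""
    def __init__(self, broker):
        self.broker = broker

    def poll(self, topic, offset=0):
        return self.broker.read(topic, offset)  # consumer pulls from the broker

broker = Broker()
producer = Producer(broker)
producer.send("page-views", "user1 viewed /home")
producer.send("page-views", "user2 viewed /cart")

consumer = Consumer(broker)
print(consumer.poll("page-views"))
```

Note that the broker is passive storage: the producer decides when to write, and the consumer decides when and from where to read.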

3. Basic Principles
We call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the intermediate storage array the broker (agent). With these three roles we can roughly describe the following scene:

[Figure: producer → broker → consumer]

The producer produces data and sends it to the broker for storage. When the consumer needs to consume data, it fetches the data from the broker and carries out its own series of processing operations on it.

At first glance this looks too simple. Wasn't Kafka supposed to be distributed? Does placing the producer, broker, and consumer on three different machines make it distributed? Look at the official diagram from Kafka:
[Figure: Kafka cluster coordinated by ZooKeeper (official architecture diagram)]

Multiple brokers work together, while producers and consumers are deployed wherever the various pieces of business logic need them. All three coordinate their requests and forwarding through ZooKeeper, which manages the cluster. Together they form a high-performance distributed message publish-subscribe system.

There is a detail worth noting in the figure. The path from producer to broker is push: as soon as data is available, it is pushed to the broker. The path from consumer to broker is pull: the consumer actively pulls the data, rather than the broker sending data to the consumer.
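The key consequence of the pull model is that the consumer, not the broker, tracks its own read position (offset) and controls its own pace. A minimal sketch of this idea, with invented names (real Kafka consumers commit offsets back to the cluster, which is omitted here):

```python
class Log:
    """Append-only log for one topic partition (illustration only)."""
    def __init__(self):
        self.records = []

    def append(self, record):                    # producer side: push
        self.records.append(record)

    def fetch(self, offset, max_records):        # consumer side: pull
        return self.records[offset:offset + max_records]

class PullConsumer:
    """The consumer, not the broker, decides when and how much to read."""
    def __init__(self, log):
        self.log = log
        self.offset = 0                          # consumer tracks its own position

    def poll(self, max_records=10):
        batch = self.log.fetch(self.offset, max_records)
        self.offset += len(batch)                # advance only after a successful pull
        return batch

log = Log()
for i in range(5):
    log.append(f"event-{i}")                     # producer pushes as data arrives

c = PullConsumer(log)
print(c.poll(max_records=3))                     # ['event-0', 'event-1', 'event-2']
print(c.poll(max_records=3))                     # ['event-3', 'event-4']
```

Because the consumer owns the offset, it can batch reads, slow down under load, or even rewind and re-read old data, which a broker-push design could not easily offer.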

4. The role of Zookeeper in kafka
ZooKeeper was mentioned above, so what role does ZooKeeper play in Kafka?

(1) The Kafka cluster, as well as its producers and consumers, all rely on ZooKeeper to guarantee system availability, and the cluster stores some of its metadata there.

(2) Kafka uses ZooKeeper as its distributed coordination framework, tying together the processes of message production, message storage, and message consumption.

(3) At the same time, with ZooKeeper's help, Kafka can establish subscription relationships between producers and consumers while itself remaining stateless, and can achieve load balancing between producers and consumers.

5. Execution Process
First, take a look at the following process:

[Figure: single-broker deployment with a ZooKeeper server, producer, and consumers]

Look at the figure above. We have reduced the number of brokers to just one. Now suppose we deploy the system as shown:

(1) Server-1 is the broker, i.e., Kafka's server; both producers and consumers must talk to it. The broker is mainly responsible for storage.

(2) Server-2 is the ZooKeeper server. It maintains a table that records the IP address, port, and other information of each node.

(3) What Server-3, 4, and 5 have in common is that they all have a zkClient configured; to be precise, the address of ZooKeeper must be configured before they run. The reason is simple: the connections between them are all dispatched through ZooKeeper.

(4) As for the relationship between Server-1 and Server-2: they can be placed on one machine or deployed separately, and ZooKeeper itself can also be configured as a cluster, so that the failure of any single node does not bring the system down.

Let’s briefly talk about the order in which the entire system runs:

(1) Start the ZooKeeper server;

(2) Start the Kafka server;

(3) When a producer produces data, it first locates the broker through ZooKeeper and then stores the data on that broker;

(4) When a consumer wants to consume data, it first finds the corresponding broker through ZooKeeper and then consumes from it.
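The running order above can be sketched with a plain dictionary standing in for ZooKeeper's registry. All names are hypothetical and this only mimics the lookup order, not ZooKeeper's actual API or Kafka's wire protocol:

```python
registry = {}   # stands in for ZooKeeper: znode path -> (host, port)

def start_zookeeper():
    registry.clear()                                    # step (1): coordination service is up

def start_kafka_broker(broker_id, host, port):
    registry[f"/brokers/{broker_id}"] = (host, port)    # step (2): broker registers itself

def producer_send(broker_id, store, topic, message):
    address = registry[f"/brokers/{broker_id}"]         # step (3): find the broker first...
    store.setdefault(topic, []).append(message)         # ...then push data to it
    return address

def consumer_poll(broker_id, store, topic):
    address = registry[f"/brokers/{broker_id}"]         # step (4): find the broker first...
    return store.get(topic, [])                         # ...then pull data from it

store = {}                                              # the broker's storage
start_zookeeper()
start_kafka_broker(0, "server-1", 9092)
producer_send(0, store, "orders", "order#1")
print(consumer_poll(0, store, "orders"))
```

The point of the sketch is the ordering: neither producer nor consumer talks to a broker directly by a hard-coded address; both first consult the registry, which is why ZooKeeper must be started before everything else.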

6. Features of Kafka
(1) High throughput, low latency: Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds. Each topic can be divided into multiple partitions, and a consumer group consumes the partitions in parallel;

(2) Scalability: a Kafka cluster supports hot expansion;

(3) Persistence and reliability: Messages are persisted to the local disk, and data backup is supported to prevent data loss;

(4) Fault tolerance: node failures in the cluster are tolerated (with a replication factor of n, up to n-1 node failures are allowed);

(5) High concurrency: supports thousands of clients reading and writing at the same time;

(6) Support for both real-time online processing and offline processing: messages can be processed in real time by a stream processing system such as Storm, or offline by a batch system such as Hadoop.
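Feature (1) above rests on two mechanisms: messages with the same key land in the same partition, and the partitions of a topic are divided among the members of a consumer group so they can consume in parallel. A simplified sketch of both (real Kafka uses a murmur2 hash, not CRC32, and its assignors are more sophisticated than this round-robin):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition (sketch; real Kafka uses murmur2)."""
    return zlib.crc32(key.encode()) % num_partitions

def assign(partitions, consumers):
    """Spread partitions over group members round-robin (simplified assignor)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))
print(assign(partitions, ["c1", "c2", "c3"]))
# -> {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Because each partition is owned by exactly one consumer in the group, the group as a whole reads the topic in parallel without any two members processing the same record, while messages sharing a key still arrive in order within their partition.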

7. Kafka usage scenarios
(1) Log collection: a company can use Kafka to collect the logs of its various services and expose them through Kafka as a unified interface to consumers such as Hadoop, HBase, and Solr;

(2) Message system: decoupling producers and consumers, caching messages, etc.;

(3) User activity tracking: Kafka is often used to record the various activities of web or app users, such as browsing pages, searching, and clicking. These activity records are published by the various servers to Kafka topics; subscribers then subscribe to these topics for real-time monitoring and analysis, or load them into Hadoop or a data warehouse for offline analysis and mining;

(4) Operational metrics: Kafka is also often used to record operational monitoring data, i.e., collecting data from various distributed applications and producing centralized feeds for operational tasks such as alerting and reporting;

(5) Stream processing: e.g., with Spark Streaming or Storm;

(6) Event sourcing.

Origin blog.csdn.net/linwei_hello/article/details/105119102