[Kafka column] Core concepts: broker, topic, partition, consumer group, partition replica

After working through the previous sections, you should understand what a message queue is and what it does. Kafka is a message queue known for high reliability and high throughput, which is why it is so widely used.

This section introduces some basic Kafka concepts. They will come up constantly in later sections, so it is important to get them straight. The material here is the core theory behind using the Kafka message queue, and beginners must study it until it makes sense. If it does not sink in on the first read, keep learning and come back to this article later. In short, go over it again and again: this is the core of Kafka's theory.

How do Kafka topics, partitions, and consumers relate to each other?

Kafka partition replica and high availability

1. Broker

Earlier we explained that producers deliver messages to the message queue and consumers pull data from it.
A very important concept in Kafka is the broker. Think about what a commodity broker (a distribution agent) does in everyday life: purchasing, holding inventory, and selling. A Kafka broker plays the same role: it receives messages, stores them, and serves them to consumers.

At the architecture level, a broker is one Kafka service instance. Kafka can start multiple service instances to form a cluster of several brokers. Usually, the more brokers a cluster has, the higher its overall throughput. This is easy to understand: in real life, the more agents a product has, the more of it can be sold. The same principle applies.

Because Kafka is usually deployed in a distributed manner, a physical server (one operating system) usually runs only one Kafka instance, so in this scenario a broker can be thought of as a server.

Is it possible to deploy multiple Kafka instances on one server? Yes: changing the listening ports avoids port conflicts. But it is not a good idea. Kafka is deployed in a distributed manner for the sake of high availability, meaning the cluster should remain available if one server goes down. If a single server hosts several Kafka instances, losing that server has a very large impact and may well take the whole Kafka cluster down with it.

2. Topics and Topic Partitions

A broker, acting as an agent, can handle purchasing, inventory, and sales for manufacturers, but there is a problem: the goods are not classified. Moutai liquor, dairy products, pork, and other products obviously have different sales cycles and sales frequencies. To manage purchasing, inventory, and sales effectively, the goods need to be classified. In the same way, some of the messages Kafka receives need to be processed quickly, while others are high-frequency data with low timeliness requirements. So message data is classified, and each category is called a topic. The author thinks "channel" would be a better translation of the word here, but "topic" is the rendering most people already use, so it has stuck.

The solid box in the middle of the figure below is a broker (a single Kafka instance) containing three topics. That is, the agent handles three kinds of goods: alcohol, dairy products, and snacks. A topic is a logical concept used to categorize messages.

In the figure above, each dotted outline represents a topic, and each pipe-shaped graphic represents a partition of that topic.

Once the agent classifies goods by topic, another problem appears. If the agent handled each topic's work in a single thread, it would struggle under highly concurrent load. So we introduce the concept of a partition for a topic. A partition is a real, physical queue data structure that stores the data and occupies resources such as system memory and disk storage.

  • A partition is a subdivision of a topic, so a topic contains one or more partitions. How many partitions a topic has depends on the throughput required for the goods handled under that topic.
  • Because a topic is only a logical concept, partitions can be thought of as "sub-agents", and one broker contains multiple partitions, just as a provincial agent can have several prefecture-level agents under it.
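To make the "logical topic, physical partitions" distinction concrete, here is a minimal in-memory sketch. It is not Kafka code: the `Topic` class, the use of a message key to pick a partition, and `zlib.crc32` as the hash are all assumptions made for illustration (real Kafka clients use their own partitioner).

```python
import zlib


class Topic:
    """Hypothetical model: the topic is just a name, the partitions hold the data."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is modeled as an append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # Stand-in hash (an assumption): crc32 is used only because it is
        # stable across runs, not because Kafka uses it.
        return zlib.crc32(key) % len(self.partitions)

    def append(self, key: bytes, value: str):
        """Store a message and return (partition index, position in that partition)."""
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1


alcohol = Topic("alcohol", num_partitions=3)
for key, value in [(b"moutai", "order-1"), (b"beer", "order-2"), (b"moutai", "order-3")]:
    partition, position = alcohol.append(key, value)
    print(f"{value} -> partition {partition}, position {position}")
```

In this sketch, messages with the same key always land in the same partition, while different keys can spread across partitions, which is what lets several partitions share the load of one topic.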

3. Partitions and Consumer Groups

Once the Kafka broker has topics and partitions, its capacity for handling message data improves: messages can be classified, and throughput goes up. Producers can send large volumes of message data to the Kafka message queue. But note one thing: unlike a real-life sub-agent, which serves many customers, a Kafka partition can be consumed by only one consumer thread within a given consumer group.

This gives us the following concepts:

  • Multiple consumer threads consuming data from the same topic can form a consumer group. As shown above: the adult consumer group and the child consumer group.
  • A consumer group can subscribe to multiple topics and consume data from all of them. As shown above: the adult consumer group can consume from two topics, alcohol and dairy products.

If a topic has 2 partitions and the consumer group subscribed to it has 5 consumer threads, then 3 of those consumer threads will receive no data from the topic and sit idle.
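The 2-partition, 5-consumer case can be sketched with a simple round-robin assignment. The function and the partition and consumer names are hypothetical, and Kafka's real assignment strategies are more involved; the sketch only preserves the rule that each partition has exactly one owner within a group.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner inside the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["alcohol-0", "alcohol-1"]
consumers = [f"consumer-{i}" for i in range(5)]

assignment = assign_partitions(partitions, consumers)
idle = [consumer for consumer, owned in assignment.items() if not owned]

for consumer, owned in assignment.items():
    print(consumer, owned)
print("idle consumers:", len(idle))  # prints "idle consumers: 3"
```

Adding consumers beyond the number of partitions therefore adds no consuming capacity; to scale a group further, the topic needs more partitions.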

4. Partition Replica and High Availability

With throughput addressed, the next question is how to keep the Kafka cluster highly available. As mentioned above, a Kafka cluster contains multiple broker instances, and each broker instance usually runs independently on its own server. This is one of the common ways to achieve high availability in a distributed architecture: run multiple instances of the service. When one instance goes down, the others can still serve requests.

The second way to keep a distributed cluster highly available is to keep multiple copies of the data: if one copy is lost, the others can still be used. Where does Kafka's data live? In partitions. So for Kafka, the high-availability way to avoid losing data is to keep multiple replicas of each partition.

Combining multiple broker instances with multiple partition replicas gives the picture above. To explain the picture:

  • One topic (alcohol) contains 4 partitions, and each partition has three replicas (one leader and two followers, drawn in the same color)
  • The three replicas of each partition are spread across the four broker instances
  • Producers send message data only to the leader replica of a partition
  • By default, consumers also pull message data only from the leader replica
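A layout like the one described (4 partitions, 3 replicas each, 4 brokers) can be produced by a simple round-robin placement. This is a sketch under assumptions: the broker names are hypothetical, the first replica in each list is treated as the leader, and Kafka's real placement logic is more elaborate than this.

```python
def place_replicas(num_partitions, replication_factor, brokers):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        # Start at a different broker for each partition so leaders are spread out.
        layout[p] = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
    return layout


brokers = ["broker-0", "broker-1", "broker-2", "broker-3"]
layout = place_replicas(num_partitions=4, replication_factor=3, brokers=brokers)

for partition, replicas in layout.items():
    leader, followers = replicas[0], replicas[1:]
    print(f"partition {partition}: leader={leader}, followers={followers}")
```

With this placement every broker leads exactly one partition and holds three replicas in total, so both the data and the read/write traffic are spread evenly.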

In addition, the leader-follower relationship among the three replicas of a partition in the figure above is not fixed. It is decided by election, based on the state of the brokers hosting the replicas. For example, take partition replicas A, B, and C, with A currently the leader. If the broker hosting replica A goes down, A loses its leader status, and a new leader is elected from replicas B and C.
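The A, B, C example can be sketched as a small function. This is a deliberate simplification with hypothetical names: it picks the first replica whose broker is still alive, whereas Kafka's actual election has additional eligibility rules that are not modeled here.

```python
def elect_leader(replicas, live_brokers, current_leader):
    """Keep the current leader if its broker is alive, else pick the first live replica."""
    if current_leader in live_brokers:
        return current_leader
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None  # no replica is reachable, so the partition is unavailable


replicas = ["A", "B", "C"]  # brokers hosting the three replicas of one partition

leader = elect_leader(replicas, live_brokers={"A", "B", "C"}, current_leader="A")
print(leader)  # prints "A"

# The broker hosting replica A goes down, so a new leader is chosen from B and C.
leader = elect_leader(replicas, live_brokers={"B", "C"}, current_leader=leader)
print(leader)  # prints "B"
```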

Take the cluster in the figure above as an example. Because the replicas of each partition are scattered across multiple broker instances, the message service as a whole stays available even if one or two instances fail: every partition still has at least one replica left to act as leader. Producers and consumers communicate only with the leader replica, while follower replicas serve purely as data backups.
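The "one or two failures" claim can be checked by brute force. The layout below is an assumed one (the figure is not reproduced here): 4 partitions with 3 replicas each, spread over 4 hypothetical brokers. A partition counts as available as long as at least one of its replicas sits on a surviving broker.

```python
from itertools import combinations

# Assumed layout: partition -> brokers hosting its three replicas.
layout = {
    0: ["broker-0", "broker-1", "broker-2"],
    1: ["broker-1", "broker-2", "broker-3"],
    2: ["broker-2", "broker-3", "broker-0"],
    3: ["broker-3", "broker-0", "broker-1"],
}
brokers = sorted({broker for replicas in layout.values() for broker in replicas})


def available_partitions(layout, failed):
    """Partitions that still have at least one replica on a live broker."""
    return [p for p, replicas in layout.items() if any(b not in failed for b in replicas)]


def survives_any(layout, brokers, failures):
    """True if every partition stays available for every possible set of failed brokers."""
    return all(
        len(available_partitions(layout, set(failed))) == len(layout)
        for failed in combinations(brokers, failures)
    )


for failures in (1, 2, 3):
    print(failures, "failed broker(s):", survives_any(layout, brokers, failures))
```

With three replicas per partition, any one or two broker failures leave every partition with a surviving replica. A third failure can remove all three replicas of some partition, which is why the guarantee stops at two.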

5. Data synchronization of partition replicas

As mentioned above, producers and consumers exchange data only with the leader replica. So where does the data in the follower replicas come from? A follower replica's data is synchronized from the leader replica.

There are a few terms here that are worth memorizing. We are all technical people, and they come in handy for shop talk and for showing off.

  • AR (Assigned Replicas): The set of all replicas of a partition, including the leader and the followers.
  • ISR (In-Sync Replicas): The set of replicas that are in sync with the leader replica, with the leader itself included. "In sync" allows for a small delay, because replication inevitably takes time.
  • OSR (Out-of-Sync Replicas): The set of follower replicas whose synchronization can no longer keep up with the leader replica, for example because of network problems. Together, ISR and OSR make up AR.

What kind of partition replica is classified as OSR?
Answer: Whether a follower replica is in sync is decided by how long its data has lagged behind the leader replica's data. If that time exceeds the value of the replica.lag.time.max.ms parameter (the default was 10 seconds in older releases and has been 30 seconds since Kafka 2.5), the follower replica is considered out of sync with the leader replica.
Out-of-sync replicas are removed from the ISR set and placed in the OSR set. If a follower replica later catches up with the leader replica's data, it automatically returns to the ISR set.
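The rule can be sketched as a time comparison. This is an illustrative model rather than broker code: the function name, the replica names, and the idea of tracking a "last caught up" timestamp per follower are assumptions, and the threshold is passed in as a parameter (10,000 ms here, matching the 10-second figure mentioned above).

```python
def split_isr_osr(leader, last_caught_up_ms, now_ms, max_lag_ms=10_000):
    """Split a partition's replicas into (ISR, OSR) by how long each follower has lagged.

    last_caught_up_ms maps follower name -> last time (ms) it was fully caught up.
    """
    isr, osr = [leader], []  # the leader always counts as in sync
    for replica in sorted(last_caught_up_ms):
        lag_ms = now_ms - last_caught_up_ms[replica]
        if lag_ms > max_lag_ms:
            osr.append(replica)
        else:
            isr.append(replica)
    return isr, osr


now_ms = 100_000
# Follower B caught up 2 s ago, follower C fell behind 15 s ago.
isr, osr = split_isr_osr("A", {"B": 98_000, "C": 85_000}, now_ms)
print(isr, osr)  # prints "['A', 'B'] ['C']"

# Later, C catches up again and automatically rejoins the ISR.
isr, osr = split_isr_osr("A", {"B": 99_000, "C": 99_500}, now_ms)
print(isr, osr)  # prints "['A', 'B', 'C'] []"
```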


Origin blog.csdn.net/hanxiaotongtong/article/details/124407025