Kafka knowledge collation series 2

Kafka partition strategy

Under what circumstances does partition reassignment occur?

  1. A consumer joins the Consumer Group
  2. A consumer leaves the group (either voluntarily or forced out, e.g. by a crash)
  3. Partitions are added to a subscribed topic

What are the partitioning methods

Kafka ships with two default partition allocation strategies: Range and RoundRobin.

Range Strategy (the default)

First, the partitions of a topic are sorted by partition number, and the consumers are sorted alphabetically.
Suppose there are 10 partitions

Category            Value
Partitions          0,1,2,3,4,5,6,7,8,9
Consumer threads    c1-0, c2-0, c3-0

So how many partitions will each thread be allocated?
Assume there are m partitions, n consumers, and each thread is assigned c partitions. The formula is:

// If there is a remainder, the first (m mod n) consumers in sorted order each get one extra partition
c = m / n

With 10 partitions and 3 consumers, the final assignment is:

Consumer    Assigned partitions
c1-0        0,1,2,3
c2-0        4,5,6
c3-0        7,8,9
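The range computation can be sketched in a few lines of Python. This is a simplified single-topic model, not Kafka's actual RangeAssignor; function and variable names are my own:

```python
def range_assign(num_partitions, consumers):
    """Range strategy sketch: consumers are sorted, then each gets a
    contiguous block of partitions; the first (num_partitions mod n)
    consumers each get one extra partition."""
    consumers = sorted(consumers)
    n = len(consumers)
    base, extra = divmod(num_partitions, n)
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

for consumer, parts in range_assign(10, ["c1-0", "c2-0", "c3-0"]).items():
    print(consumer, parts)
```

Note that the blocks are contiguous: with 10 partitions and 3 consumers, c1-0 takes four partitions and the other two take three each.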

Remarks: consumers can choose the allocation strategy via the partition.assignment.strategy parameter.
This strategy has a disadvantage: in this example, c1-0 is assigned one more partition than the others. That is for a single topic; if the group subscribes to 1000 such topics, c1-0 ends up with 1 * 1000 extra partitions to consume, which causes serious imbalance.

Hence the second partition strategy: RoundRobin.

RoundRobin Strategy

There are two prerequisites for using this partitioning strategy:

  1. All consumers in the same Consumer Group must have the same num.streams (number of consumer threads)
  2. Every consumer in the group must subscribe to the same topics

The implementation can be summarized in two steps:

  1. List all partitions of the subscribed topics, sort them by the hashCode of the topic-partition, and sort the consumers
  2. Deal the partitions out to the consumers one at a time, in round-robin fashion
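A minimal Python sketch of the polling assignment. It is a simplified model: real Kafka also sorts topic-partitions by hashCode, which this sketch omits, and the function name is my own:

```python
def round_robin_assign(num_partitions, consumers):
    """RoundRobin strategy sketch: partitions are dealt out one at a
    time to the sorted consumers, so the counts differ by at most one."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

for consumer, parts in round_robin_assign(10, ["c1-0", "c2-0", "c3-0"]).items():
    print(consumer, parts)
```

With 10 partitions and 3 consumers this yields c1-0: 0,3,6,9; c2-0: 1,4,7; c3-0: 2,5,8, and with many topics the extra partitions no longer all land on the first consumer.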

Kafka's Rebalance

Rebalance (rebalancing) means transferring ownership of a partition from one consumer to another.
Conditions that trigger rebalance:

  1. The number of consumers in the group changes
  2. The number of partitions of a subscribed topic changes
  3. A consumer actively unsubscribes, i.e. calls unsubscribe()

Who is responsible for performing the Rebalance operation, and who will manage it?

Coordinator

When the first consumer in a Consumer Group starts, it negotiates with the Kafka server to determine which broker (node) will serve as the group's Coordinator; once determined, all members of the group communicate with that Coordinator.

Rebalance steps:

Step 1 - JoinGroup:
After the Coordinator is determined, every consumer sends a JoinGroup request to it (the request is sent once at startup). The Coordinator selects one consumer as the group leader and sends the group membership and subscription information to that leader.

Step 2 - Synchronizing Group State:
The leader computes the assignment plan, and the Coordinator synchronizes it to all consumers in the group.

Kafka's log removal strategy

Kafka has two log removal strategies:

  1. By message retention time (configure log.retention.hours; the default is seven days)
  2. By the size of the data stored for the topic (configure log.retention.bytes)

Note: if either of the above conditions is met, Kafka performs log cleanup.
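The two settings sit side by side in the broker's server.properties; a sketch with illustrative values (log.retention.bytes is unlimited by default, so the figure below is only an example):

```properties
# server.properties: a log segment is eligible for deletion
# when EITHER limit is exceeded
log.retention.hours=168          # time-based retention, default 7 days
log.retention.bytes=1073741824   # size-based retention per partition (illustrative value)
```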

What if the amount of data is huge or the data needs to be retained for a long time?
Kafka offers log compaction, similar in spirit to Redis's log rewriting.
Once enabled:

  1. The broker starts a Cleaner thread pool in the background
  2. The threads periodically merge records with the same key, keeping only the latest value
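Compaction is enabled per topic via the cleanup.policy setting. A sketch using the kafka-configs.sh tool shipped with Kafka (the topic name and broker address are placeholders):

```shell
# Switch a topic from deletion to compaction
# ("user-profiles" and localhost:9092 are assumptions for this sketch)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name user-profiles \
  --alter --add-config cleanup.policy=compact
```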

Kafka's TCP link management

Kafka producers and consumers use the TCP protocol to connect, so under what circumstances will TCP be created or closed?

3 situations in which producers create TCP

1. When a KafkaProducer instance is created, the producer starts a Sender thread in the background, and that thread establishes connections to the brokers when it starts running.

2. When updating metadata, there are two more situations

  1. When the producer sends a message to a topic that does not exist yet, the broker informs it that the topic is unknown, and the producer sends a metadata request to Kafka to fetch the latest metadata (TCP connections are created at this point)
  2. The producer periodically refreshes metadata according to the metadata.max.age.ms parameter (default five minutes)

3. When sending messages, there is no doubt that TCP needs to be opened

2 situations when the producer/consumer closes TCP

  1. The user explicitly calls producer.close() / consumer.close()
  2. Kafka closes it automatically.
    This is governed by the producer-side parameter connections.max.idle.ms (default 9 minutes): if a TCP connection carries no request for that long, it is closed.
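Both timing parameters mentioned above are plain client configuration; a sketch showing them with their default values:

```properties
# Producer client settings (values shown are the defaults)
metadata.max.age.ms=300000       # refresh metadata every 5 minutes
connections.max.idle.ms=540000   # close a TCP connection that has been idle for 9 minutes
```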

3 situations when consumers create TCP

  1. When issuing a FindCoordinator request (sent to the broker the consumer currently considers least loaded)
  2. When connecting to the Coordinator
  3. When consuming messages (connecting to the leader of each assigned partition)

Note: the TCP connection is not created when the KafkaConsumer instance is created, but when the poll() method is called.

Kafka controller component Controller

What is the Controller: any broker in the cluster can act as the Controller, but only one can hold the role at a time. When brokers start, each tries to create a /controller node in ZooKeeper; the first broker to create the node successfully becomes the Controller.
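One way to see which broker currently holds the role is to read the znode directly, using the ZooKeeper shell bundled with Kafka (the address is an assumption for this sketch):

```shell
# Print the /controller znode, which records the current controller's broker id
bin/zookeeper-shell.sh localhost:2181 get /controller
```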

The role of Controller

  1. Topic management: creating and deleting topics, adding partitions, etc.
  2. Partition reassignment.
  3. Preferred leader election: a mechanism for changing leaders to relieve brokers that carry too much load.
  4. Cluster membership management: broker additions, crashes, shutdowns, etc.
  5. Data service: the Controller holds the complete cluster metadata, and the other brokers receive metadata update requests from the Controller to refresh their caches.

Preferred replica:
For example, if a partition has three replicas 0, 1, 2, then replica 0 is the preferred replica:
it is simply the first one in the replica list.
Preferred leader election:
the process of moving the leader of a given partition back to its preferred replica
(mainly to prevent too many leaders from piling up on one broker and overloading it)
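On recent Kafka versions (2.4+), a preferred leader election can be triggered manually with the kafka-leader-election.sh tool; a sketch (the broker address is an assumption):

```shell
# Move leadership back to the preferred replicas for all partitions
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --all-topic-partitions
```

The broker can also do this automatically when auto.leader.rebalance.enable is set to true.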

Example:
If broker2 and broker3 are shut down, leadership of every partition falls to broker1.
When they come back online, the leaders stay on broker1 by default, so broker1 hosts all 3 leaders and comes under excessive pressure.
After configuring preferred leader election and testing again, the result is different: leadership is moved back to the preferred replicas, and broker1 no longer hosts all 3 leaders. This is the so-called preferred election.

Origin blog.csdn.net/Zong_0915/article/details/107593356