Pulsar load balancing and transaction_coordinator_assign

Background and current situation

Which broker a TC is loaded on depends on which broker the corresponding transaction_coordinator_assign-partition-${TC ID} partition is loaded on.

By default, transaction_coordinator_assign has 16 partitions, so there are 16 TCs by default. The number of TCs should be set to a reasonable value based on the number of machines/brokers in the cluster.
To balance the pressure across the cluster and improve service availability, TCs should be spread as evenly as possible across all machines in the cluster, rather than concentrating them all on a few machines.
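
To see where each TC currently lives, one can look up the owner broker of every transaction_coordinator_assign partition. Below is a minimal sketch using the Pulsar Admin Java API; the admin service URL is an assumption and should be replaced with your own.

```java
import java.util.Map;
import java.util.TreeMap;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class TcPlacementInspector {
    private static final String TC_TOPIC =
            "persistent://pulsar/system/transaction_coordinator_assign";

    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build()) {
            // Number of partitions == number of TCs (16 by default).
            int partitions = admin.topics()
                    .getPartitionedTopicMetadata(TC_TOPIC).partitions;
            Map<String, Integer> tcCountPerBroker = new TreeMap<>();
            for (int tcId = 0; tcId < partitions; tcId++) {
                // The broker that owns this partition is the broker hosting TC ${tcId}.
                String owner = admin.lookups().lookupTopic(TC_TOPIC + "-partition-" + tcId);
                tcCountPerBroker.merge(owner, 1, Integer::sum);
            }
            tcCountPerBroker.forEach((broker, count) ->
                    System.out.println(broker + " carries " + count + " TC(s)"));
        }
    }
}
```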

  • Availability: for example, if all 16 TCs are loaded on only two or three brokers, restarting those brokers during a rolling upgrade will make a large number of TCs unavailable at the same time, which clearly affects clients.
  • Load balancing: a broker carrying TCs also incurs a certain resource cost, since each TC needs to coordinate with clients, TBs, and TPs. If a broker carries too many TCs, the load on that machine will be too high.

transaction_coordinator_assign currently belongs to the pulsar/system namespace.
At present, we do not perform load shedding for bundles in the pulsar/system namespace: when a bundle is loaded for the first time, hash assignment is used to spread it as evenly as possible across the currently available brokers, and no load shedding is performed on these bundles afterwards. A bundle is only reassigned when it is unloaded, either because the broker owning it restarts or because an administrator runs an admin command to unload it, which triggers a new bundle assignment.
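
As noted above, an administrator-triggered unload forces a fresh bundle assignment. A minimal sketch using the Pulsar Admin Java API (the admin endpoint is an assumption): unloading the pulsar/system namespace releases all of its bundles, which are then re-assigned across the brokers that are online at that moment.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class ReassignSystemBundles {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build()) {
            // Unload every bundle of pulsar/system; each bundle is then re-assigned
            // (hash allocation) across the currently available brokers.
            admin.namespaces().unload("pulsar/system");
        }
    }
}
```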

Existing problems

This leads to the following problem.
During a rolling restart there is a certain time interval between broker restarts, and generally one broker is fully back before the next one is restarted, so the distribution stays reasonable.
However, if none of the brokers are online at first and the whole cluster is then started, all TCs end up loaded on the first broker that starts.
As shown in the figure below, the cluster was shut down for a period of time and then started again, causing all 16 TCs to be loaded on broker1. After broker2~broker5 started up later, no TC was assigned to them.
[Figure: all 16 TCs loaded on broker1 after the cluster cold start; broker2~broker5 carry none]
This problem only occurs when the cluster is started from scratch. There is no problem during a rolling restart or rolling upgrade, because the cluster always has a large number of brokers available, which ensures that the transaction_coordinator_assign-partition-${TC ID} partitions are distributed as evenly as possible across all brokers.

The current workaround for this problem is relatively simple: after the cluster is started, restart broker1, so that all TCs are reloaded onto the other brokers.
As shown in the figure below:
[Figure: after restarting broker1, the 16 TCs are redistributed across the remaining brokers]
There is another case. For example, if the cluster starts with only 2 brokers, the 16 TCs are allocated to those two brokers; if the cluster is later expanded to 5 machines, the TCs will likewise not be allocated to the 3 newly added brokers.
Again, a rolling restart of the cluster resolves this.

Dynamic load balancing

Analysis of whether to perform dynamic load balancing on transaction_coordinator_assign
As mentioned earlier, we currently do not perform load shedding on bundles in the pulsar/system namespace, so transaction_coordinator_assign does not get dynamically rebalanced.
Is dynamic load balancing necessary here?

Dynamic load balancing is definitely beneficial. The root cause of the problems above is that the current scheme cannot adapt to new brokers joining the cluster.

Bundle load metrics

However, all current load balancing algorithms estimate how much load a bundle will cause based on its traffic throughput and message rate when allocating bundles. This works for ordinary business topics, but not for bundles under pulsar/system, because transaction_coordinator_assign itself carries no traffic; the load it causes comes from the operation of the TCs themselves.
Because shedding algorithms generally pick bundles in order from high load to low load to unload and then place them on low-load brokers, the bundles under pulsar/system will basically never be selected for shedding. In other words, enabling dynamic load balancing for them would be almost the same as not enabling it.

Therefore, even if dynamic load balancing is enabled, a separate load estimation logic is needed for these bundles. So how should the load be estimated?

First of all, when a client uses transactions, it establishes connections with all TCs and then selects a TC in round-robin fashion for each new transaction. For example, it first creates transaction txn1 on TC 1, and then creates the next transaction txn2 on TC 2.
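
As an illustration of this round-robin behaviour, here is a minimal client sketch. It assumes a broker started with transactionCoordinatorEnabled=true and a local service URL; in current Pulsar versions the TC id corresponds to the most significant bits of the transaction id.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.transaction.Transaction;

public class TxnRoundRobinDemo {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed service URL
                .enableTransaction(true)
                .build()) {
            // Open two transactions back to back; the client picks TCs round-robin,
            // so they are normally coordinated by two different TCs.
            Transaction txn1 = client.newTransaction()
                    .withTransactionTimeout(1, TimeUnit.MINUTES).build().get();
            Transaction txn2 = client.newTransaction()
                    .withTransactionTimeout(1, TimeUnit.MINUTES).build().get();
            System.out.println("txn1 handled by TC " + txn1.getTxnID().getMostSigBits());
            System.out.println("txn2 handled by TC " + txn2.getTxnID().getMostSigBits());
            // Clean up: abort both transactions.
            txn1.abort().get();
            txn2.abort().get();
        }
    }
}
```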

Therefore, in theory, the load caused by each TC in the cluster is almost the same, so we can use the number of TCs it contains as the load unit of a bundle.
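
A minimal sketch of such an estimator, under the assumption that the list of topics owned by a bundle is available from elsewhere (BundleTcLoadEstimator and its input are hypothetical, not an actual Pulsar interface): the load of a bundle is simply the number of transaction_coordinator_assign partitions it contains.

```java
import java.util.List;

public final class BundleTcLoadEstimator {
    private static final String TC_PARTITION_PREFIX =
            "persistent://pulsar/system/transaction_coordinator_assign-partition-";

    // Load of a pulsar/system bundle = number of TC partitions that fall into it,
    // instead of the usual throughput / message-rate based estimate.
    public static int estimateTcLoad(List<String> topicsInBundle) {
        return (int) topicsInBundle.stream()
                .filter(topic -> topic.startsWith(TC_PARTITION_PREFIX))
                .count();
    }
}
```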

Broker load metrics

There is also a question of priorities. For example, suppose broker1 carries all the TCs but no business traffic, so its CPU load (resource usage, or score) is relatively low, while broker2 and broker3 carry no TCs but most of the traffic, so their CPU load is relatively high.
When we allocate TCs, should we move some TCs onto broker2 and broker3?
Yes, we should: we first make sure the TCs are evenly distributed across the three brokers, and then let the original dynamic load balancing algorithm balance the business traffic.

The current load balancing algorithm uses BrokerData to measure the load of the broker itself, i.e. it identifies low-load brokers according to CPU load and other resource usage.
Therefore, when we load balance TCs, we also need a new broker load metric, and it is equally simple: the number of TCs the broker carries.

At this point, the framework of a dynamic load balancing algorithm for TCs is in place; the specific implementation is up to each algorithm.
Referring to the implementation of AvgShedder, here is an example:
Collect and sort the number of TCs carried by all brokers to obtain a queue, numbering the brokers by the number of TCs they carry as broker 1 ~ broker N, where broker 1 carries the most TCs and broker N carries the fewest.
Calculate the difference M between the TC counts of broker 1 and broker 2. Then, if a bundle meeting the following condition can be found on broker 1:

  • if the number of transaction_coordinator_assign-partition-${TC ID} partitions it contains is less than or equal to M, then the bundle can be switched to broker N (see the sketch after this list).
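
Below is a hedged sketch of this idea. All types here (BrokerTcStats, BundleTcStats, the plannedMoves output) are hypothetical placeholders rather than the actual Pulsar shedder interface; the code only illustrates the selection logic described above.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public final class TcAwareShedderSketch {

    record BundleTcStats(String bundle, int tcCount) {}

    record BrokerTcStats(String broker, List<BundleTcStats> bundles) {
        int totalTcCount() {
            return bundles.stream().mapToInt(BundleTcStats::tcCount).sum();
        }
    }

    /** Plan at most one bundle move from the most TC-loaded broker to the least loaded one. */
    static void rebalanceOnce(List<BrokerTcStats> brokers,
                              Map<String, String> plannedMoves /* bundle -> target broker */) {
        if (brokers.size() < 2) {
            return;
        }
        // Sort brokers by the number of TCs they carry, descending.
        List<BrokerTcStats> sorted = brokers.stream()
                .sorted(Comparator.comparingInt(BrokerTcStats::totalTcCount).reversed())
                .toList();
        BrokerTcStats broker1 = sorted.get(0);                  // carries the most TCs
        BrokerTcStats broker2 = sorted.get(1);
        BrokerTcStats brokerN = sorted.get(sorted.size() - 1);  // carries the fewest TCs

        // M: difference between the top two brokers, as described in the text.
        int m = broker1.totalTcCount() - broker2.totalTcCount();

        // Pick a bundle on broker 1 containing at most M TC partitions and move it to broker N.
        broker1.bundles().stream()
                .filter(b -> b.tcCount() > 0 && b.tcCount() <= m)
                .findFirst()
                .ifPresent(b -> plannedMoves.put(b.bundle(), brokerN.broker()));
    }
}
```

A full implementation would, of course, also need to plug into the shedder interface used by the broker and coexist with the thresholds of the ordinary traffic-based algorithm.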
