RPC Talk: The Rate Limiting Problem

RPC calls between microservices often rely on rate limiting, but in many cases we use a very simple strategy, or an engineer simply picks a limit value by gut feeling.

This article discusses the problems with current approaches to RPC rate limiting and some possible solutions.

Why rate limiting is needed

Avoiding cascading failures
Even if a service has been stress tested, the traffic it receives in production and the load it can actually sustain are not fixed. If the service has no self-protection mechanism, then once traffic exceeds the expected load, the excess is passed on to the service's downstream dependencies, causing a chain reaction or even an avalanche.

Providing reliable response times
Callers generally set a timeout. If a service is so congested that it exceeds that timeout, then even a correct response is worthless to the client by the time it arrives.

A service's commitment to its callers covers both the content of the response and the time it takes. Rate limiting lets the service proactively discard traffic beyond its capacity, so that it maintains effective response times at its rated load.

Traditional Solutions
Leaky bucket


Advantages:

  • Can strictly enforce the egress traffic rate

Shortcomings:

  • Cannot absorb bursty traffic
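To make this concrete, here is a minimal leaky-bucket sketch in Python (the class and parameter names are illustrative, not from any particular library): occupancy drains at a fixed rate, and anything that would overflow the bucket is rejected.

```python
import time

class LeakyBucket:
    """Leaky bucket sketch: occupancy drains at a fixed `rate` (req/s);
    a request that would overflow `capacity` is rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.water = 0.0               # current bucket occupancy
        self.last = time.monotonic()   # last time we drained

    def allow(self):
        now = time.monotonic()
        # Drain whatever has leaked out since the last call.
        self.water = max(0.0, self.water - (now - self.last) * self.rate)
        self.last = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False
```

A burst larger than `capacity` is cut off immediately, which is exactly the shortcoming noted above: the bucket has no way to accommodate legitimate short bursts.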

Token bucket

Advantages:

  • Statistically maintains a specific average rate
  • Locally allows short bursts of traffic to pass
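The token bucket is a small variation on the same idea (again a sketch with illustrative names): tokens accrue at a fixed rate up to a cap, and each request spends one, so short bursts up to the cap can pass.

```python
import time

class TokenBucket:
    """Token bucket sketch: tokens accrue at `rate` per second up to
    `capacity`; each request spends one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so an initial burst passes
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that both sketches still hard-code `rate` and `capacity`, which is precisely the problem the next section examines.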

Existing problems
Both traditional approaches require specifying a fixed value for the load a service can accept, but in a modern microservice architecture, a service's load capacity is constantly changing. Some common reasons:

  • It varies as new code changes performance
  • It varies as the performance of the downstream services it depends on changes
  • It varies with the performance of the machines (CPU/disk) the service is deployed on
  • It varies with the number of nodes the service is deployed on
  • It varies as business requirements change
  • It varies with the time of day

Manually declaring a service's allowable load, even if the value can be changed dynamically in a configuration center, is unsustainable to maintain, and the specific number depends heavily on personal experience and judgment. People's own biases creep into the choice: the server side conservatively underestimates its capacity, while the client side overstates its needs. Over time, the hand-set value drifts away from reality.

What is service load

When we send a request to a service, we care about two things:

  • The number of simultaneous concurrent requests the service can support
  • The response time of the service

The number of concurrent requests
On the server side, several metrics are often confused:

  • The number of current connections
  • The number of requests currently accepted
  • The number of requests currently being processed concurrently


There is a 1:N relationship between connections and requests. In modern server implementations, the resources consumed by a connection itself are very small (e.g. Java's Netty, Go's net package), and for intranet services using multiplexing, an increase in requests does not necessarily mean an increase in connections.

For traffic-shaping purposes, some servers do not hand a request to the handler immediately upon receipt; they first add it to a queue and execute it once a worker becomes idle. So there are two kinds of requests here: accepted requests and in-process requests.

QPS, by contrast, is a statistical metric: it only tells you how many requests passed per second.

At any given moment, what most directly determines server load is generally the current number of concurrently in-process requests.
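Since in-process concurrency is the metric that matters, it helps to track it explicitly rather than infer it from QPS. A minimal sketch (the `InflightGauge` name and context-manager shape are my own, not from any framework):

```python
import threading

class InflightGauge:
    """Tracks the number of requests currently being processed,
    as distinct from connection count or accepted-request count."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = 0

    def __enter__(self):
        # Entering a handler: one more request is in process.
        with self._lock:
            self._inflight += 1
        return self

    def __exit__(self, *exc):
        # Leaving the handler (even on error): request is done.
        with self._lock:
            self._inflight -= 1

    @property
    def value(self):
        with self._lock:
            return self._inflight
```

Usage is simply `with gauge: handle(request)`, so the counter stays accurate even when the handler raises.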

Service response time

When an online service responds to a request, the work it does falls into two categories:

  • Computation: time depends on CPU frequency; essentially a fixed cost (ignoring frequency scaling)
  • Waiting: time depends on the current number of concurrent requests; not fixed
      • Waiting in queue for the server to have an idle thread/coroutine to handle the request
      • Waiting for other threads to release contended resources (locks, or an occupied CPU)
      • Waiting for IO (storage, network) to return (ignoring downstream jitter, this is generally a fixed cost)

From this analysis, a service's final response time is: RT = computation time + waiting time.

Load capacity estimation

We can abstract a microservice as a pipe with an inlet and an outlet, as shown in the figure below:

(figure: a service modeled as a water pipe, traffic flowing in one end and out the other)

The "load capacity" of the pipe in the middle is simply its volume (cross-sectional area × length), a perfectly quantifiable metric.

What we encounter online is much like this pipe. What we need is a quantitative method, analogous to computing the pipe's volume, to estimate the service's load capacity.

Now observe a service instance over a one-second window. Two values are easy to obtain:

  • QPS: the number of requests in that second, in req/s.
  • AvgRT: the average request response time in that second, in ms.

According to Little's law:

In a stable system, the long-run average number of customers L equals the long-run effective arrival rate λ multiplied by the average time W a customer spends in the system.

This law lets us compute the throughput of a stable, resource-constrained system.

Applying this law to our system gives Throughput = QPS * (AvgRT / 1000), where AvgRT / 1000 converts milliseconds to seconds.
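The arithmetic is trivial but worth pinning down (the function name here is my own):

```python
def estimated_inflight(qps, avg_rt_ms):
    """Little's law, L = lambda * W: QPS in req/s, AvgRT in ms.

    Returns the average number of requests concurrently in the system,
    which the article calls the Throughput value."""
    return qps * (avg_rt_ms / 1000.0)

# For instance, a service handling 2000 req/s at an average RT of 50 ms
# holds about 2000 * 0.05 = 100 requests in flight at any moment.
```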

Viewed from another angle: if we ensure that the number of requests currently being processed by the server (inflight) does not exceed this value, then in theory the average response time of each request should stay around AvgRT (inflow rate ≈ outflow rate). This computed value is the estimate we want of the service's current load. And only when the service actually starts to feel load pressure can its current load be taken as its load capacity.

The relationship between Inflight, RT, and Throughput

For most online services, the relationship between Inflight, RT, and Throughput has two stages:

  • While there is no resource contention (CPU/disk/network/memory), RT stays roughly constant as inflight grows, and Throughput increases.
  • Once inflight grows to the point of resource contention, RT rises as inflight rises, while Throughput stays roughly flat, because limited resources can only do a limited amount of work.

From these observations we can draw three plots:
(figure: Inflight vs RT, Inflight vs Throughput, and RT vs Throughput)

The first two plots reflect the server's observed behavior; the third eliminates the shared x-axis (inflight) and relates RT to Throughput directly. That third plot captures the essence of rate limiting: trading RT loss for throughput gain.

A more realistic picture looks like this:

(figure: a realistic RT vs Throughput curve, flattening as RT grows)

The lower the slope, the more cost-effective that portion of RT loss is. Our rate limiting strategy amounts to finding this optimal operating point.

Engineering Practice
The analysis above is only a derivation from a theoretical model. Real engineering applications need to account for the following practicalities.

When to start throttling

In the early stage, when there is no resource contention, there is no need for flow control at all. We therefore need a heuristic indicator that marks the service as having entered a busy, contended state and triggers the flow-control logic. Common choices include:

  • OS load1
  • CPU usage
  • Average RT
  • Thread count
  • QPS
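As one example of such a trigger, a gate on OS load1 might look like this (Unix-only, and the threshold value is purely an assumption to be tuned per machine):

```python
import os

LOAD1_THRESHOLD = 8.0  # illustrative threshold; tune for your hardware

def should_throttle():
    """Heuristic trigger: engage the limiter only once OS load1
    signals resource contention. Before that, let everything pass."""
    load1, _load5, _load15 = os.getloadavg()
    return load1 > LOAD1_THRESHOLD
```

The same gate shape works for any of the indicators listed above; the point is that the adaptive limit described below should only be enforced once the trigger fires.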

Choice of RT

We derived a formula earlier: RT = computation time + waiting time, where waiting time = queuing time + contention time.

However, RT is a per-request metric; computing Throughput requires a statistical RT value. Whether to use AvgRT, MinRT, or P95 RT is a detail, but an extremely important one.

For a system with extremely stable performance, such as a switch or router, packets do not affect one another, so for this kind of system RT = computation time + queuing time, and we want to minimize queuing time (it is pure overhead: if everyone queues, congestion only worsens). In this case MinRT is the right choice, since it is closest to the fixed cost. This is also the value Google's BBR algorithm uses.

Business services differ: requests compete not only for CPU and storage, but possibly for locks as well. This makes contention time highly uncertain, with frequent and unpredictable jitter. Using MinRT here would underestimate the service's concurrency capacity.

An example makes this intuitive: a university admits 1,000 graduate students each year into a nominally 2-year program, but the exams are so hard that students take 3 years on average to graduate. Using the nominal 2 years to compute how many graduate students the university holds at once would clearly give the wrong answer.

Does that mean we can simply use AvgRT? Not quite. It is hard to pin the limit at exactly the right point, and even if we could, it would leave no margin for error. Rate limiting therefore tends to slightly underestimate the system's capacity. In its open-source microservice framework Kratos, Bilibili uses the smallest AvgRT across the sampling windows; Alibaba's Sentinel uses the true minimum RT.

Once an RT metric has been chosen to match the business characteristics, multiplying it by the measured QPS yields the Throughput value. Then, simply rejecting requests whenever Inflight > Throughput gives an adaptive, principled rate limiting policy.
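Putting the pieces together, here is a sketch of such an adaptive limiter in the spirit of the Kratos/Sentinel approach described above. All names are illustrative, and the window-sampling plumbing (how `qps` and `avg_rt_ms` are measured per window) is assumed to exist externally:

```python
class AdaptiveLimiter:
    """Adaptive limit sketch: reject when inflight exceeds
    max_qps * min_rt, both taken from per-window samples."""

    def __init__(self):
        self.inflight = 0       # maintained by the serving path
        self.max_qps = 0.0      # highest windowed QPS observed
        self.min_rt_ms = 0.0    # smallest windowed AvgRT observed

    def update_window(self, qps, avg_rt_ms):
        # Keep the most optimistic QPS and the most conservative RT,
        # deliberately underestimating capacity a little (see text above).
        self.max_qps = max(self.max_qps, qps)
        if self.min_rt_ms == 0.0:
            self.min_rt_ms = avg_rt_ms
        else:
            self.min_rt_ms = min(self.min_rt_ms, avg_rt_ms)

    def allow(self):
        limit = self.max_qps * (self.min_rt_ms / 1000.0)
        # Before any samples exist (limit == 0), let everything through.
        return limit == 0 or self.inflight < limit
```

A production version would also apply the busy-state trigger from the previous section, decay stale window samples, and decrement `inflight` when requests complete.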

Origin: blog.csdn.net/kalvin_y_liu/article/details/130004162