Algorithm and practice of intelligent online blocking

1. Background

Releases (going online) are one of the leading causes of system failures; in some companies they account for up to half of all incidents. Detecting potential failures in time, blocking the release, and rolling it back can substantially reduce system failures.

Many platforms already support blocking changes based on manually configured rules: engineers select metrics on the platform and set thresholds for them, and when a metric breaks through its threshold during or after a release, the associated service is considered abnormal and the release is blocked. However, thresholds must be set and adjusted by hand as metrics drift, so this approach rarely scales to a large number of metrics, which limits its effectiveness.

We therefore want an algorithm that detects metric anomalies automatically, so that we can cover far more metrics and improve the coverage, convenience, and effectiveness of change blocking. This article systematically introduces the algorithms and engineering practices behind our intelligent online blocking service.

2. Solution overview

In general, what we monitor are a service's golden metrics, which fall into three categories: average latency, availability, and QPM (queries per minute). When a service exposes multiple APIs, each API has its own three golden metrics, so a service's golden metrics are the union of the golden metrics of all of its APIs.

A release can cause two kinds of service anomalies: either the changed service itself becomes abnormal, or the changed service stays healthy but a service directly related to it becomes abnormal. During a release, we therefore inspect the golden metrics of both the changed service and the services directly related to it.

Our solution is to have users configure a metric list for each service; the algorithm checks every metric in the list during a release and blocks the release if an anomaly is found. Each service's metric list contains the golden metrics of the service itself as well as those of its associated services.

In this article we focus mainly on availability, because it is the core metric for judging service stability and often helps us find faults quickly and accurately.

3. Algorithm principle

So how, concretely, do we check availability metrics? And since these metrics fluctuate even in normal operation, how do we define an "abnormal fluctuation"? These questions are answered in detail below.

The availability of an API is essentially its success rate or error rate. Suppose the total number of requests to an API is x and the number of errors is y. The traditional approach is to set an empirical threshold on the error rate (i.e., y/x), but when request volume is small (for example, at night), the error rate fluctuates wildly, as shown in the figure below:

[Figure: request count, error count, and error rate of an example API]

The three plots above show an API's request count, error count, and error rate. The red box on the left marks a real fault, but if we set an error-rate threshold based on this fault, we find that the error rate easily exceeds it whenever total request volume is low. This is another shortcoming of manually configured blocking rules. The binomial distribution handles this problem much better.

3.1 General method of error rate indicator detection

First, let us review the binomial distribution, a well-known discrete probability distribution. Its mathematical definition is as follows:

Consider n independent, identically distributed Bernoulli trials (random experiments repeated independently under the same conditions, each with exactly two possible outcomes), and let event A occur with probability p in each trial. If X denotes the number of times A occurs in the n trials, then the discrete probability distribution of the random variable X is the binomial distribution.

The binomial distribution is a natural fit for monitoring error rates, because a request, like a coin toss, has only two possible outcomes: success or failure. Using it effectively solves the problem above. Assume the number of errors y follows a binomial distribution with parameters x (total number of requests) and p0 (baseline error rate). Then, by the definition of the binomial distribution:

P \left\{ y;\, x, p_0 \right\} = \binom{x}{y} p_0^{y} (1-p_0)^{x-y}

P \left\{ y > y_j;\, x_j, p_0 \right\} = \sum_{k=y_j+1}^{x_j} \binom{x_j}{k} p_0^{k} (1-p_0)^{x_j-k}

To simplify the computation, we can further approximate y_j by a normal distribution. By the properties of the binomial distribution, this normal distribution has mean x_j·p0 and variance x_j·p0·(1−p0), so the standardized value z of y_j under the standard normal distribution is:

z = \frac{y_j - x_j p_0}{\sqrt{x_j p_0 (1-p_0)}} = \frac{\frac{y_j}{x_j} - p_0}{\sqrt{p_0 (1-p_0)}} \sqrt{x_j}

Using the second formula above, in a real release we can take the total number of requests in the five minutes after the release as x_j and the number of errors as y_j. Given the API's baseline error rate p0, we can compute the probability that the error count exceeds the observed value y_j in the window after the release. If this probability is very small, we can confidently conclude that the release is abnormal. After the normal-approximation simplification, we can instead compare z directly against an empirically chosen threshold (for example, 8) to decide whether the release is abnormal.
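As a concrete sketch (in Python, with illustrative numbers rather than our production code), the simplified z-score check described above might look like this:

```python
import math

def availability_z(x_j, y_j, p0):
    """Standardized score of the observed error count y_j under a
    Binomial(x_j, p0) model, via the normal approximation above."""
    return (y_j - x_j * p0) / math.sqrt(x_j * p0 * (1 - p0))

def is_abnormal(x_j, y_j, p0, z_threshold=8.0):
    """Flag the release when the post-release error count is implausibly
    high relative to the baseline error rate p0."""
    return availability_z(x_j, y_j, p0) > z_threshold
```

For example, with 10,000 requests, a 1% baseline error rate, and 200 observed errors, z is roughly 10.1, which exceeds a threshold of 8 and would block the release; 110 errors gives z of about 1.0 and would pass.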

3.2 How do we determine the baseline error rate p0?

3.2.1 Three ways to determine the baseline error rate p0

The next question is how to determine the baseline error rate p0. We use three different methods, each producing its own p0.

First, and most intuitively, we can compute a baseline error rate from the data just before the release. We build a binomial model from this p0, and if, under that model, the probability of the error count exceeding the values observed for a period after the release is too low, the release is considered abnormal.

Second, we found that some metrics' error rates fluctuate with a daily period, as shown in Figure 1 below:

In such cases, if a release happens to start at a trough and end at a peak, a baseline error rate computed from the 20 minutes before the release can easily cause the release to be misjudged as abnormal. In practice, we therefore also compute a second baseline error rate p0 from the same post-release time window on the previous day, and apply the same decision logic as with the first p0.

Third, we also found cases where the error rate changes noticeably across the release, yet its absolute value remains small (for example, rising from 1/1000 to 2/1000). In such cases we believe the release should be allowed rather than blocked. We therefore compute a third baseline error rate p0 from a longer span of historical data. Computing this p0 involves the KDE and LDA algorithms and is relatively involved; it is described in detail later.

Furthermore, if during a rollout we can separately obtain the request and error counts of the instances that have already been updated and of those that have not, we can compute a baseline error rate p0 from the not-yet-updated instances and then evaluate how improbable the updated instances' error count is under it. This enables real-time blocking while the rollout is still in progress.
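A minimal sketch of this instance-level comparison (a hypothetical helper, assuming request and error counts can be aggregated per instance group; the clamp value is an illustrative choice):

```python
import math

def realtime_block(deployed, undeployed, z_threshold=8.0):
    """deployed / undeployed: (total_requests, error_count) tuples
    aggregated over instances that have / have not yet been updated."""
    x_d, y_d = deployed
    x_u, y_u = undeployed
    # Baseline from the instances still on the old version; clamped so a
    # fault-free old version does not yield a degenerate p0 of zero.
    p0 = max(y_u / x_u, 1e-6)
    z = (y_d - x_d * p0) / math.sqrt(x_d * p0 * (1 - p0))
    return z > z_threshold
```

Here an updated group with 150 errors in 5,000 requests, against a not-yet-updated group at a 1% error rate, would be blocked, while 55 errors would pass.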

To sum up, the following diagram illustrates how the binomial distribution is used to judge whether a release is abnormal:

3.2.2 Determining the global p0 with the KDE algorithm

When we have a long span of historical data, we can compute a global baseline error rate p0 from it. A simple idea is to take the mean of the whole data set, but in practice many error-rate series contain frequent spikes, as shown in Figure 2 below:

If we use the mean of the series above as the binomial baseline, we are likely to wrongly block a release whose post-release window happens to coincide with a spike. The natural next idea is to take a high quantile of the data set (say, the 99.9th percentile) as the baseline error rate, but when historical data is scarce, a high quantile may land at or near the data set's maximum, and such a value also performs poorly as a baseline error rate.

The KDE algorithm solves this problem nicely. Its idea is to treat each point in the data set as its own distribution and sum these distributions to obtain a distribution for the whole data set, from which we take a high quantile as the global threshold. For example, if we treat each point as a Gaussian, then as the standard deviation we assign to each Gaussian grows, the overall distribution evolves as illustrated below:

In statistics, kernel density estimation (KDE) is a non-parametric method for estimating the probability density function (PDF) of a random variable, and we use it to estimate the distribution of the data set. In the example above, the distribution assumed for each point corresponds to the choice of kernel, and the standard deviation of the Gaussian corresponds to the choice of bandwidth. A family of classic algorithms can find a near-optimal bandwidth; interested readers can learn more here: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

Once we have estimated the distribution (PDF) of the historical error-rate data with KDE, we can derive its cumulative distribution function (CDF) and the inverse of that function (ICDF: inverse cumulative distribution function). The value of the ICDF at 0.999 is the global threshold we want, i.e., the third baseline error rate p0 mentioned above.
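As an illustrative sketch (using SciPy's `gaussian_kde` with its default bandwidth; the grid resolution and padding are arbitrary choices, not production values), the global threshold could be computed by inverting the estimated CDF numerically:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_global_threshold(error_rates, q=0.999):
    """Estimate the PDF of historical error rates with a Gaussian-kernel
    KDE, then invert the CDF numerically to obtain ICDF(q)."""
    kde = gaussian_kde(error_rates)
    lo, hi = float(np.min(error_rates)), float(np.max(error_rates))
    pad = 0.1 * (hi - lo) if hi > lo else 1.0
    # Evaluate the estimated PDF on a grid slightly wider than the data
    grid = np.linspace(lo - pad, hi + pad, 2000)
    pdf = kde(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]  # normalize so the CDF ends at exactly 1
    # ICDF(q): the first grid point where the CDF reaches q
    return float(grid[np.searchsorted(cdf, q)])
```

The returned value sits in the upper tail of the smoothed distribution rather than pinned to the sample maximum, which is exactly the behavior we want when history is short.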

So why is the ICDF evaluated at 0.999 equivalent to the 99.9th percentile of the data set, and why is it a reasonable global threshold? The following two diagrams give an intuitive picture:

The first plot above is a data set's probability density function (PDF); the second shows its cumulative distribution function (CDF) and inverse cumulative distribution function (ICDF). The CDF is the integral of the PDF, and the integral of the PDF over its whole domain is exactly 1; the ICDF is the inverse function of the CDF, so its domain is [0, 1]. To pick a global threshold for this data set, we evaluate the ICDF at 0.999, which corresponds to the x-value at which the CDF reaches 0.999, that is, the x-value at which the area under the PDF curve equals 0.999, as illustrated below:

The x-value marked by the red line in the figure above is the ICDF at 0.999; intuitively, this is indeed a reasonable global threshold.

Overall, the flow for computing the global threshold is as follows:

3.2.3 Using the LDA algorithm to remove outliers

Computing the global threshold with KDE alone has a weakness: if the historical data contains abnormal points (for example, a high error rate caused by a past fault), the resulting global threshold is dragged abnormally high by those outliers, which clearly degrades detection, as shown in the figure below:

In this case we first use the LDA algorithm to segment the historical data and remove its outliers.

LDA (linear discriminant analysis) is a statistical method used mainly for classification and dimensionality reduction. Its general form and applications are fairly involved; interested readers can start from Wikipedia: https://en.wikipedia.org/wiki/Linear_discriminant_analysis

In the intelligent online blocking service, we apply a simple one-dimensional form of LDA when computing the global threshold: we sort the time-series data by error rate and cut it into two segments, discarding the segment that may contain abnormal values. Of course, if LDA would cut away too much data, that actually indicates the data set has no significant outliers, and we use the original data set without segmentation. The idea is illustrated below:

The mathematical principle of LDA is as follows:

Suppose we have a data set:

x_1, x_2, x_3, \ldots, x_n

Sort the dataset:

x_1 \leq x_2 \leq x_3 \leq \ldots \leq x_n

Taking x_i as a split point, divide the data set into two sets:

X_L: \left\{ x_1, x_2, \ldots, x_i \right\}, \quad X_R: \left\{ x_{i+1}, x_{i+2}, \ldots, x_n \right\}

Compute the mean of each set:

\bar{x}_l, \ \bar{x}_r

Compute the total within-set scatter (sum of squared deviations):

s_w = \sum_{k=1}^{i} (x_k - \bar{x}_l)^2 + \sum_{k=i+1}^{n} (x_k - \bar{x}_r)^2

Iterate over i and find the value of i that minimizes s_w. Plotting s_w (vertical axis) against i (horizontal axis) for a real data set gives the following picture:

[Figure: s_w as a function of the split point i on a real data set]

The minimum of the blue curve in the figure above is the split point we are looking for; the data to the right of this point can be treated as outliers and cut off.
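The split described above can be sketched as follows (the 0.7 keep-ratio guard is an illustrative choice for the "cutting too much" rule, not our production parameter):

```python
def lda_split(values, min_keep=0.7):
    """Sort the data, try every split point i, and keep the split that
    minimizes the summed within-segment scatter s_w. If the best split
    would keep less than min_keep of the data, assume there are no
    significant outliers and return the full set."""
    xs = sorted(values)
    n = len(xs)
    best_i, best_sw = None, float("inf")
    for i in range(1, n):  # candidate segments xs[:i] and xs[i:]
        left, right = xs[:i], xs[i:]
        ml, mr = sum(left) / i, sum(right) / (n - i)
        sw = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sw < best_sw:
            best_sw, best_i = sw, i
    if best_i < min_keep * n:
        return xs          # LDA would cut too much: no real outliers
    return xs[:best_i]     # drop the right-hand segment as outliers
```

On a series of small error rates with a couple of fault-induced spikes, the spikes end up in the right-hand segment and are dropped; on data without outliers, the best split lands near the middle and the guard keeps the full set.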

After removing the outliers and feeding the remaining data into the KDE algorithm, the global threshold returns to a more reasonable range, as shown in the figure below:

3.3 Summary

This section described how the binomial distribution, LDA, and KDE algorithms combine to decide, effectively and automatically, whether a service is abnormal during a release. In essence, we compare the post-release state of the service (or, mid-rollout, the state of the already-updated instances) against each of our baseline states (pre-release, the same period yesterday, and the global threshold) through the binomial distribution, looking for a benign explanation of the metric's fluctuation. If no such explanation can be found against any of the three baseline states, the release is genuinely abnormal.
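Putting the pieces together, a hypothetical sketch of the final decision: block only when the release looks abnormal against every baseline error rate, since a single benign baseline is enough to explain the fluctuation away.

```python
import math

def should_block(x_j, y_j, baselines, z_threshold=8.0):
    """baselines: the three baseline error rates (pre-release window,
    same period yesterday, KDE global threshold)."""
    def z(p0):
        p0 = min(max(p0, 1e-9), 1 - 1e-9)  # guard degenerate baselines
        return (y_j - x_j * p0) / math.sqrt(x_j * p0 * (1 - p0))
    # Abnormal only if the observation is extreme under ALL baselines.
    return all(z(p0) > z_threshold for p0 in baselines)
```

For example, 300 errors in 10,000 requests is extreme against baselines of 1% and 1.2% but not against a global threshold of 2%, so the release passes; 400 errors exceeds the threshold against all three and is blocked.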

4. Engineering practice

Having covered the algorithm principles, this section introduces the engineering practice of the intelligent online blocking service. Because computing the global threshold requires a lot of data and computation, it must be trained offline; the overall architecture is therefore split into two parts, offline training and online detection, as shown in the figure below:

Before the service is enabled, engineers configure the mapping between services and their APIs and each API's availability metrics. Once configuration is complete, we train the global thresholds described above offline and store them in a database. When a release occurs, the SRE operations platform calls our intelligent online blocking service, passing the service name, the release's start and end timestamps, and other parameters. The service reads the relevant API metric data and the offline-computed global thresholds from the database, judges whether the release is abnormal, and returns the result to the SRE operations platform. The detailed online detection flow is as follows:

At present, the intelligent online blocking service has been deployed and is open to some business teams for trial use. Going forward, we plan to keep strengthening it in terms of metric coverage, configuration flexibility, and algorithm accuracy. We welcome everyone to discuss it with us.

5. References

An article introducing KDE programming practices: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

An article introducing the principle of KDE algorithm: https://mglerner.github.io/posts/histograms-and-kernel-density-estimation-kde-2.html?p=28

LDA algorithm Wikipedia: https://en.wikipedia.org/wiki/Linear_discriminant_analysis


Origin blog.csdn.net/qq_42859864/article/details/128707629