Research on the Mechanism of Time Series Anomaly Detection

ADDOPS team, Ji Xinpu, 360 Cloud Computing

Editor's note

This article comes from the ADDOPS team. Its author, Ji Xinpu, is mainly responsible for automated and intelligent operations on the 360 HULK cloud platform. The article proposes an efficient LVS traffic anomaly detection algorithm that helps operations colleagues more accurately identify abnormal conditions such as sudden increases in business traffic. I hope it deepens everyone's understanding of anomaly detection. The author will follow up with a series of articles on putting machine learning into practice in operations, so stay tuned.


Preface

Double Eleven has just passed, and Alibaba and JD have been eagerly showing off how impressive their technology is. Understandably, each company has its own set of strategies for the different scenarios it faces during the Singles' Day (Guanggun Festival) shopping event. Imagine if Tmall's homepage were unreachable on Double Eleven: "Daddy Ma" would surely lose something like 100 million yuan. To prevent that, besides aggressively adding capacity, a sound anomaly detection mechanism is also extremely important.


Anomaly detection covers many scenarios, such as hardware failure detection and traffic anomaly detection. In this blog we focus on time series anomaly detection, for which many algorithms exist. The most popular in industry is a common statistical method, the 3σ principle, which flags a point according to how far it deviates from the mean. Ordinary regression methods fit a curve and measure how far a new point deviates from the fitted curve, and some have even applied CNNs and RNNs to detecting abnormal points.


Detecting abnormal LVS traffic with ordinary fixed thresholds is not very effective. This article proposes a new detection algorithm; the rest of it focuses on what we learned in practice.


1

Data analysis

Looking at LVS traffic data from the past 7 days, we can roughly divide the trends into two types:


One is periodic data, as shown in the figure below. In this case we need to consider the impact of periodicity on the data.

image


The other is random data without periodicity, as shown in the figure below. It needs a different detection strategy from the periodic case.

image


2

Research on detection mechanism

Since the curves show two kinds of trend, periodic and non-periodic, our detection mechanism needs to handle both.


Below we will introduce each algorithm in detail.


 

Algorithms

Short-term chain ratio (SS)


For a time series (a sequence of values of the same statistical indicator arranged in order of occurrence), the value at time T depends strongly on the value at time T-1. For example, if traffic is high at 8:00, it is very likely to be high at 8:01 as well, whereas the traffic at 07:01 has little influence on 8:01.


First, we can exploit the fact that data within the most recent time window (T) follows a certain trend. For example, if we set T to 7, we compare the detection value (now_value) with each of the past 7 points (denoted i). Each time the difference exceeds the threshold, we add 1 to a count. If the count exceeds the count_num we set, we consider the point an anomaly; that is, the point is flagged when count > count_num.

The rule above involves two parameters, threshold and count_num. How to obtain threshold is covered in the next section; count_num can be set as needed. If you want detection to be sensitive to anomalies, set count_num smaller; if you want it to be less sensitive, set count_num larger.
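The counting rule can be sketched in a few lines of Python. This is only an illustration, not the team's production code: the function name `short_term_anomaly` and the sample values are hypothetical, and the use of an absolute difference (so that both spikes and drops are counted) is an assumption, since the text only says the difference is compared with the threshold.

```python
from typing import Sequence


def short_term_anomaly(now_value: float,
                       window: Sequence[float],
                       threshold: float,
                       count_num: int) -> bool:
    """Return True when now_value deviates from more than count_num window points."""
    # Assumption: absolute difference, so both sudden increases and decreases count.
    count = sum(1 for v in window if abs(now_value - v) > threshold)
    return count > count_num


# Hypothetical traffic values for the most recent T = 7 points.
recent = [100, 102, 98, 101, 99, 103, 100]
print(short_term_anomaly(150, recent, threshold=20, count_num=5))  # True
print(short_term_anomaly(104, recent, threshold=20, count_num=5))  # False
```

With count_num=5, a point must differ from at least 6 of the 7 recent points to be flagged; lowering count_num makes the check more sensitive.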


Dynamic threshold


There are many ways to set a dynamic threshold in the industry. Here we introduce the one we use for LVS traffic anomaly detection. Threshold setting usually refers to the mean, maximum, and minimum over a past period of time, and we do the same: take the average (avg), maximum (max), and minimum (min) of a past period (such as the window T), and then take the smaller of max - avg and avg - min, that is, threshold = min(max - avg, avg - min). Taking the minimum makes the filter condition looser, so that more values satisfy it and fewer anomalies go unreported.
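As a minimal sketch of this threshold rule (the function name `dynamic_threshold` and the sample window are hypothetical):

```python
from typing import Sequence


def dynamic_threshold(window: Sequence[float]) -> float:
    """threshold = min(max - avg, avg - min) over the given window."""
    avg = sum(window) / len(window)
    return min(max(window) - avg, avg - min(window))


# Hypothetical window: avg = 14, max - avg = 6, avg - min = 4.
print(dynamic_threshold([10, 12, 20]))  # 4.0
```

Because the smaller of the two distances is used, the threshold stays tight even when a single past spike has inflated the maximum.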


Long-term chain ratio (LS)


The short-term chain ratio above looks only at short-term data, which is not enough on its own. We also need to refer to the overall trend of the data over a longer period.


Usually a curve is fitted to the data to reflect its trend. If a new point breaks the trend and makes the curve unsmooth, that point is abnormal. There are many curve-fitting methods, such as regression and moving averages. In this article we use EWMA, the exponentially weighted moving average. In EWMA, each new average is obtained by correcting the previous average with the actual value of the current point. Within each EWMA value the data points carry different weights, and more recent data carries higher weight.
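To make the update rule concrete, here is a small pure-Python sketch of the recursive form of EWMA. The function name `ewma` and the smoothing factor `alpha` are illustrative choices; in pandas, the `com` parameter corresponds to alpha = 1 / (1 + com), and this recursion matches pandas' behaviour when `adjust=False` (the pandas default, `adjust=True`, uses a normalized weighted sum that approaches the same values as the series grows).

```python
from typing import List, Sequence


def ewma(values: Sequence[float], alpha: float) -> List[float]:
    """Recursive EWMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 = x_0."""
    smoothed = [float(values[0])]
    for x in values[1:]:
        # The previous average is corrected by the current actual value.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(ewma([10, 20, 30], alpha=0.5))  # [10.0, 15.0, 22.5]
```

A larger alpha (smaller com) makes the average follow recent points more closely; a smaller alpha gives a smoother, slower-moving curve.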


With the average in hand, we can use the 3-sigma rule to decide whether a new input falls outside the tolerance range: if the actual value deviates from the average by more than three standard deviations, an alarm can be raised.

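import pandas as pd

# data: a pandas Series of traffic values, oldest first
expAverage = data.ewm(com=50).mean()  # exponentially weighted moving average
stdDev = data.ewm(com=50).std()       # exponentially weighted standard deviation
# 3-sigma check on the newest point
if abs(data.values[-1] - expAverage.values[-1]) > 3 * stdDev.values[-1]:
    print("Exception")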


Year-on-year (chain)


Many monitoring items have a certain periodicity, most commonly a one-day period. For example, LVS traffic is lowest at 4 in the morning and highest at 11 in the evening. To take this periodicity into account, we select the data of a monitoring item for the past 14 days. For a given time of day this yields 14 points as reference values, which we denote x_i, where i = 1, ..., 14.


We first consider a static threshold method to decide whether the input is abnormal (a sudden increase or a sudden decrease). If the input is smaller than the minimum value at the same time of day over the past 14 days multiplied by a threshold, it is considered an abnormal point (a sudden decrease); if it is greater than the maximum value at the same time of day over the past 14 days multiplied by a threshold, it is considered an abnormal point (a sudden increase).

# history: the values at the same time of day over the past 14 days
if new_value > max(history) * max_threshold:
    print("sudden increase")
if new_value < min(history) * min_threshold:
    print("sudden decrease")

The static threshold method relies on historical experience. In practice, how to choose max_threshold and min_threshold remains open to discussion. Following the empirical approach of the dynamic threshold above, basing them on the average value is a better idea.
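The year-on-year check can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: the helper names `same_time_history` and `year_on_year_check` are hypothetical, samples are assumed to be evenly spaced with a fixed number of points per day, and the threshold values 1.5 and 0.5 are made up for the example.

```python
from typing import List, Optional, Sequence


def same_time_history(series: Sequence[float], points_per_day: int,
                      days: int = 14) -> List[float]:
    """Values at the same time of day as the next (new) sample, one per previous day.

    `series` holds past samples only, oldest first, evenly spaced.
    """
    n = len(series)
    if n < days * points_per_day:
        raise ValueError("not enough history")
    return [series[n - k * points_per_day] for k in range(1, days + 1)]


def year_on_year_check(new_value: float, history: Sequence[float],
                       max_threshold: float, min_threshold: float) -> Optional[str]:
    """Static-threshold comparison against the same-time-of-day history."""
    if new_value > max(history) * max_threshold:
        return "sudden increase"
    if new_value < min(history) * min_threshold:
        return "sudden decrease"
    return None


# Toy example: 2 samples per day, 3 days of history (real use would be 14 days).
series = [10, 50, 12, 52, 11, 51]
history = same_time_history(series, points_per_day=2, days=3)
print(history)                                    # [11, 12, 10]
print(year_on_year_check(30, history, 1.5, 0.5))  # sudden increase
print(year_on_year_check(4, history, 1.5, 0.5))   # sudden decrease
print(year_on_year_check(11, history, 1.5, 0.5))  # None
```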


Year-on-year amplitude (CA)


The year-on-year method cannot detect anomalies in the situation shown in the figure below. For example, suppose today is 11.18 and the historical curve of the past 14 days is inevitably much lower than today's curve. If a glitch occurs today and the curve drops, it is still much higher than the curves of the past 14 days. The year-on-year method cannot detect such a failure, so how can we improve it? One intuition is that although the two curves are at different heights, they "look alike". How do we make use of this resemblance? The answer is the amplitude.


How do we calculate the amplitude at time t? We use [x(t) - x(t-1)] / x(t-1). For example, if the traffic at time t is 900 bit and the traffic at time t-1 is 1000 bit, the amplitude is -10%, a drop of 10%. Referring to the data of the past 14 days gives 14 amplitude values, and their absolute values serve as the baseline for amplitudethreshold. If the absolute value of the amplitude of the detected point m, [m(t) - m(t-1)] / m(t-1), is greater than amplitudethreshold and the amplitude is greater than 0, we consider that a sudden increase occurred at that time; if the absolute value is greater than amplitudethreshold and the amplitude is less than 0, we consider it a sudden decrease.
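A minimal sketch of the amplitude check follows. The function names are hypothetical, the 14 historical amplitudes are invented sample values, and deriving amplitudethreshold as the largest absolute historical amplitude is an assumption made for illustration; the text only says the threshold is based on the absolute values of the 14 amplitudes.

```python
from typing import Optional, Sequence


def amplitude(current: float, previous: float) -> float:
    """Relative change from the previous point: [x(t) - x(t-1)] / x(t-1)."""
    return (current - previous) / previous


def amplitude_check(amp: float, amplitude_threshold: float) -> Optional[str]:
    """Classify the point by the size and sign of its amplitude."""
    if abs(amp) > amplitude_threshold:
        return "sudden increase" if amp > 0 else "sudden decrease"
    return None


# Hypothetical amplitudes at the same time of day over the past 14 days.
past_amplitudes = [0.02, -0.03, 0.01, -0.02, 0.04, -0.01, 0.03,
                   -0.02, 0.01, 0.02, -0.04, 0.03, -0.01, 0.02]
# Assumption: use the largest absolute historical amplitude as the threshold.
amplitude_threshold = max(abs(a) for a in past_amplitudes)

amp = amplitude(900, 1000)
print(amp)                                        # -0.1
print(amplitude_check(amp, amplitude_threshold))  # sudden decrease
```

Because the amplitude is a relative change, the check still works when today's curve sits far above the historical curves.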


 

Algorithm combination

Four methods were introduced above. SS and LS check non-periodic data, while chain and CA check periodic data. How should these four methods be selected and combined? We describe two approaches below:


1. Choose the method according to periodicity. First check whether the sequence is periodic. If it is, take the left branch of the flow in the figure (the periodic detection methods); if not, take the right branch (the non-periodic detection methods).

image

This raises the question of how to detect whether the data is periodic. Differencing can be used: for example, take the data of the last two days and subtract one day from the other. If the data is periodic, the differencing removes the fluctuation, and a variance threshold on the result then decides whether the data is periodic. If the data fluctuates over a relatively large range, it can be normalized (for example with z-score) before differencing.


2. Do not distinguish periodicity at all, and simply let the majority decide: run all the methods and follow the majority verdict. This approach is easy to understand, so I won't explain it further here.

image
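Both combination strategies can be sketched briefly. This is an illustration under assumptions, not the team's implementation: the function names, the sample data, and the variance threshold of 0.1 are hypothetical, the two days are z-score normalized together before differencing, and a tie in the vote is treated as "not an anomaly".

```python
from statistics import mean, pstdev, pvariance
from typing import List, Sequence


def z_score(values: Sequence[float]) -> List[float]:
    """Normalize values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def is_periodic(day1: Sequence[float], day2: Sequence[float],
                variance_threshold: float) -> bool:
    """Difference two consecutive days; a small variance suggests a daily period."""
    normalized = z_score(list(day1) + list(day2))
    half = len(day1)
    diffs = [b - a for a, b in zip(normalized[:half], normalized[half:])]
    return pvariance(diffs) < variance_threshold


def choose_methods(periodic: bool) -> List[str]:
    """Strategy 1: pick the detectors that match the data's periodicity."""
    return ["chain", "CA"] if periodic else ["SS", "LS"]


def majority_vote(votes: Sequence[bool]) -> bool:
    """Strategy 2: anomaly only if more than half of the detectors say so."""
    return sum(votes) > len(votes) / 2


# Hypothetical samples: two similar days, then two unrelated days.
print(is_periodic([10, 50, 90, 50], [11, 52, 88, 49], 0.1))  # True
print(is_periodic([10, 50, 90, 50], [80, 15, 30, 95], 0.1))  # False
print(choose_methods(True))                                  # ['chain', 'CA']
print(majority_vote([True, True, False, True]))              # True
```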


3

Summary

This article introduced the methods we use for LVS anomaly detection. They may not be enough to cover every scenario, and you will need to keep enriching the algorithms on this basis to get better results. Theory has to be combined with practice: each specific problem needs its own analysis before theory can be applied.


