Notes on "Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning"

Below are my reading notes for the paper "Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning" - Jeanva

ABSTRACT

However, even though dozens of anomaly detectors have been proposed over the years, deploying them to a given service remains a great challenge, requiring manually and iteratively tuning detector parameters and thresholds. This paper tackles this challenge through a novel approach based on supervised machine learning.
The paper tackles the problem of manually tuning detector parameters and thresholds.

With our proposed system, Opprentice (Operators’ apprentice), operators’ only manual work is to periodically label the anomalies in the performance data with a convenient tool.
Training on KPI streams still requires manual labeling.

Then the features and the labels are used to train a random forest classifier to automatically select the appropriate detector-parameter combinations and the thresholds.
A random forest automatically selects the appropriate detector-parameter combinations and thresholds.

For three different service KPIs in a top global search engine, Opprentice can automatically satisfy or approximate a reasonable accuracy preference (recall ≥ 0.66 and precision ≥ 0.66).
An accuracy goal is preset; the system either satisfies it automatically or finds the best approximation.

Keywords

Anomaly Detection; Tuning Detectors; Machine Learning

1. INTRODUCTION

there exists no convenient method to automatically match operators’ practical detection requirements with the capabilities of different detectors
There is still no convenient way to match detectors' capabilities with practical detection requirements.

operators are used to specify simple requirements for detection accuracy and manually spot-check anomalies occasionally. As a result, services either settle with simple static thresholds (e.g., Amazon Cloud Watch Alarms [24]), intuitive to operators although unsatisfying in detection performance, or, after time-consuming manual tuning by algorithm designers, end up with a detector specifically tailored for the given service, which might not be directly applicable to other services.
Operators tend to settle for simple conditions, so detection performance is mediocre, and hand-tuned detectors do not transfer to other services.

the first step for the anomaly detection practitioner is to collect the requirements from the service operators. This step encounters Definition Challenges: it is difficult to precisely define anomalies in reality
Anomalies themselves are hard to define precisely.

In addition, it is often impossible for the operators to quantitatively define anomalies,
It is hard for operators to quantify anomalies.

Detector Challenges: In order to provide a reasonable detection accuracy, selecting the most suitable detector requires both the algorithm expertise and the domain knowledge about the given service KPI.
Only with algorithm expertise and domain knowledge combined can a good detector be found.

Our approach relies on two key observations. First, it is straightforward for operators to visually inspect the time series data and label anomaly cases they identified
Operators find it natural to spot time series anomalies visually on charts.

The second key observation is that the anomaly severities measured by different detectors can naturally serve as the features in machine learning, so each detector can serve as a feature extractor
The anomaly severities reported by different detectors can serve as features for learning.

Specifically, multiple detectors are applied to the KPI data in parallel to extract features. Then the features and the labels are used to train a machine learning model, i.e., random forests [28], to automatically select the appropriate detector-parameter combinations and the thresholds. The training objective is to maximally satisfy the operators’ accuracy preference.
Operators supply the labels, the detectors supply the features, and a random forest is trained to reach the accuracy the operators require.

More importantly, Opprentice takes operators only tens of minutes to label data.

We believe this is the first anomaly detection framework that does not require manual detector selection, parameter configuration, or threshold tuning.
Claimed as a first.

2. BACKGROUND

2.1 KPIs and KPI Anomalies

Beyond the physical meanings, the characteristics of these KPI data are also different.

Since we have to hide the absolute values, we use the coefficient of variation (\(C_{v}\)) to measure the dispersions, which equals the standard deviation divided by the mean.
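
In symbols, with \(\sigma\) the standard deviation and \(\mu\) the mean:

\(C_{v} = \frac{\sigma}{\mu}\)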

2.2 Problem and Goal

The KPI data labeled by operators are the so-called “ground truth”.
Only what operators label counts as ground truth.

The fundamental goal of anomaly detection is to be accurate, e.g., identifying more anomalies in the ground truth, and avoiding false alarms.
Find as many ground-truth anomalies as possible while reducing false positives.

For example, the operators we worked with specified “recall ≥ 0.66 and precision ≥ 0.66” as the accuracy preference, which is considered as the quantitative goal of Opprentice in this paper.
The goal is quantified.

As anomalies are relatively few in the data, it is difficult for those detectors to achieve both high recall and precision.

precision and recall are often conflicting. The trade-off between them is often adjusted according to real demands. For example, busy operators are more sensitive to precision, as they do not want to be frequently disturbed by many false alarms.
Busy operators want high precision.

On the other hand, operators would care more about recall if a KPI, e.g., revenue, is critical, even at the cost of a little lower precision.
For KPIs where anomalies have a large negative impact, recall should be set higher.

Opprentice has one qualitative goal: being automatic enough so that the operators would not be involved in selecting and combining suitable detectors, or tuning them
The qualitative goal: operators need not be involved in selecting, combining, or tuning detectors.

3. OPPRENTICE OVERVIEW

3.1 Core Ideas

Opprentice approaches the above problem through supervised machine learning. Supervised machine learning can be used to automatically build a classification model from historical data, and then classify future data based on the model.
Supervised learning.

we use existing basic anomaly detectors to quantify anomalous level of the data from their own perspectives, respectively.
Each detector quantifies the anomalous level of the data from its own perspective.

The results of the detectors are used as the features of the data. The features and operators’ labels together form the training set. Then a machine learning algorithm takes advantage of a certain technique to build a classification model.
The detectors' outputs serve as the features.

Opprentice is the first framework that uses machine learning to automatically combine and tune existing detectors to satisfy operators’ detection requirements (anomaly definitions and the detection accuracy preference). Furthermore, to the best of our knowledge, this is the first time that different detectors are modeled as the feature extractors in machine learning
The algorithm is claimed to be the first to automatically combine and tune basic detectors.

3.2 Addressing Challenges in Machine Learning

When learning from such “imbalanced data”, the classifier is biased towards the large (normal) class and ignores the small (anomaly) class. We solve this problem in §4.5 through adjusting the machine learning classification threshold (cThld henceforth).
The data-imbalance problem is handled by adjusting the classification threshold.

some of the features would be either irrelevant to the anomalies or redundant with each other. We solve this problem by using an ensemble learning algorithm, i.e., random forests
Irrelevant and redundant features are handled by random forests.

4. DESIGN

4.1 Architecture

First, before the system starts up, operators specify an accuracy preference (recall ≥ x and precision ≥ y), which we assume does not change in this paper. This preference is later used to guide the automatic adjustment of the cThld.
The preset goal is used to adjust the cThld.

Second, the operators use a convenient tool to label anomalies in the historical data at the beginning and label the incoming data periodically (e.g., weekly). All the data are labeled only once
A convenient visual tool is used for labeling.

4.2 Labeling Tool

the data of the last day and the last week are also shown in light colors

4.3 Detectors

4.3.1 Detectors As Feature Extractors

we represent different detectors with a unified model
data point \(\rightarrow\) severity \(\rightarrow\) {1, 0}
All detectors are represented by one unified model.

First, when a detector receives an incoming data point, it internally produces a non-negative value, called severity to measure how anomalous that point is

For example, Holt-Winters [6] uses the residual error (i.e., the absolute difference between the actual value and the forecast value of each data point) to measure the severity
Holt-Winters uses the residual error.

historical average [5] assumes the data follow Gaussian distribution, and uses how many times of standard deviation the point is away from the mean as the severity.
The historical average detector uses the number of standard deviations away from the mean.
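
A minimal sketch of these two severity measures (my own Python, not the paper's code; the handling of the historical window is simplified):

```python
import numpy as np

def holt_winters_severity(actual, forecast):
    """Residual error: absolute difference between the actual value
    and the Holt-Winters forecast for this point."""
    return abs(actual - forecast)

def historical_average_severity(point, history):
    """How many standard deviations the point lies from the mean of
    the historical window (assumes roughly Gaussian data)."""
    history = np.asarray(history, dtype=float)
    mu, sigma = history.mean(), history.std()
    return abs(point - mu) / sigma if sigma > 0 else 0.0
```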

Afterwards, a detector further needs a threshold to translate the severity into a binary output, i.e., anomaly (1) or not (0). We call this threshold the severity threshold: sThld henceforth.

Since the severity describes the anomalous level of data, it is natural to deem the severity as the anomaly feature.
The severity is treated as the anomaly feature.

We call a detector with specific sampled parameters a (detector) configuration. Thus a configuration acts as a feature extractor
One parameter configuration of a detector is one feature extractor.
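
The unified model can be sketched as follows (my illustration; the helper names are not from the paper). A traditional detector binarizes the severity with an sThld, while Opprentice keeps the raw severities as the feature vector:

```python
import numpy as np

def to_binary(severity, sthld):
    """A traditional detector: severity -> {1, 0} via its sThld."""
    return 1 if severity >= sthld else 0

def extract_features(point, history, configurations):
    """Opprentice's use of the same model: one feature per detector
    configuration, namely the severity it reports for this point.
    Each configuration is a callable (point, history) -> severity."""
    return np.array([config(point, history) for config in configurations])
```

For example, `historical_average_severity` from the sketch above, bound to several different window lengths, would count as several configurations.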

4.3.2 Choosing Detectors

When choosing detectors, we have two general requirements. First, the detectors should fit the above model, or they should be able to measure the severities of data.

Second, since anomalies should be detected timely, we require that the detectors can be implemented in an online fashion.
This requires that once a data point arrives, its severity should be calculated by the detectors without waiting for any subsequent data.
Detectors must work online: as soon as a data point arrives, its severity can be computed immediately.

4.3.3 Sampling Parameters

We have two strategies to sample the parameters of detectors. The first one is to sweep the parameter space.

For example, EWMA (Exponentially Weighted Moving Average) [11], a prediction based detector, has only one weight parameter \(\alpha \in\) [0, 1]. As \(\alpha\) goes up, the prediction relies more upon the recent data than the historical data.
For EWMA (Exponentially Weighted Moving Average), the larger \(\alpha\) is, the more the prediction relies on recent data.
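
A small sketch of the EWMA forecast (my code); under this detector the severity would be the absolute difference between the actual point and the forecast:

```python
def ewma_forecast(series, alpha):
    """EWMA prediction for the next point: larger alpha means the
    forecast tracks recent data more closely than historical data."""
    forecast = series[0]
    for x in series[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

# Severity of a new point under this detector configuration:
# severity = abs(new_point - ewma_forecast(history, alpha))
```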

On the other hand, the parameters of some complex detectors, e.g., ARIMA (Autoregressive Integrated Moving Average) [10], can be less intuitive. To deal with such detectors, we estimate their “best” parameters from the data, and generate only one set of parameters, or one configuration for each detector. Besides, since the data characteristics can change over time, it is also necessary to update the parameter estimates periodically
For complex detectors, the "best" parameters are estimated directly from the data, and re-estimated periodically.

4.4 Machine Learning Algorithm

4.4.1 Considerations and Choices

in our problem, there are redundant and irrelevant features, caused by using detectors without careful evaluation.
a promising algorithm should be less-parametric and insensitive to its parameters, so that Opprentice can be easily applied to different data sets.
Because the features are not carefully pre-evaluated, a suitable algorithm is one that is insensitive to its own parameters.

4.4.2 Random Forest

Preliminaries: decision trees. A decision tree [41] is a popular learning algorithm as it is simple to understand and interpret
At a high level, it provides a tree model with various if-then rules to classify data.

The numbers on branches, e.g., 3 for time series decomposition, are the feature split points.
The feature split points carry concrete numeric values on the branches.

In the decision tree, a feature is more important for classification if it is closer to the root.
Features closer to the root are more important.

There are two major problems of decision tree. One is that the greedy feature selection at each step may not lead to a good final classifier; the other is that the fully grown tree is very sensitive to noisy data and features, and would not be general enough to classify future data, which is called overfitting.
Two problems with decision trees: greedy feature selection is not globally optimal, and fully grown trees overfit.

A random forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g., individual decision trees) can together form a strong learner
A combination of weak learners can form a strong learner.

First, each tree is trained on subsets sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time.

All the trees are fully grown in this way without pruning. The random forest then combines those trees by majority vote.
No pruning is needed, which also removes a tree-depth parameter to tune.

By default, the random forest uses 50% as the classification threshold (i.e., cThld).
The default classification threshold is 50%.
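
A minimal sketch of this with scikit-learn (an assumed implementation; the paper names no library). The synthetic `X` and `y` are placeholders for the severity features and the operator labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))             # stand-in severity features
y = (rng.random(1000) < 0.05).astype(int)   # stand-in labels, ~5% anomalies

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

vote_ratio = clf.predict_proba(X)[:, 1]     # fraction of trees voting "anomaly"
cthld = 0.5                                 # the default; Opprentice tunes this
pred = (vote_ratio >= cthld).astype(int)
```

Thresholding the vote ratio at an adjustable cThld generalizes the default 50% majority vote.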

4.5 Configuring cThlds

4.5.1 PC-Score: A Metric to Select Proper cThlds

We need to configure cThlds rather than using the default one (e.g., 0.5) for two reasons
The cThld here is the random forest's vote ratio; the default is 50%.

Configuring cThlds is a general method to trade off between precision and recall [31]. In consequence, we should configure the cThld of random forests properly to satisfy the operators’ preference.
The operators' trade-off between precision and recall determines the threshold.

PR curves are widely used to evaluate the accuracy of a binary classifier [45], especially when the data is imbalanced
A PR curve plots precision against recall for every possible cThld of a machine learning algorithm (or for every sThld of a basic detector).
The PR curve plots the result at every possible threshold.

F-Score based method, which selects the point that maximizes F-Score = \(\frac{2 \cdot precision \cdot recall}{precision + recall}\);
The F-Score combines the two.

we develop a simple but effective accuracy metric based on F-Score, namely PC-Score (preference-centric score), to explicitly take operators’ preference into account when deciding cThlds.
The PC-Score takes the operators' preference into account.

we calculate its PC-Score as follows
\(PC\text{-}Score(r,p)=\left\{ \begin{array}{ll} \frac{2 \cdot r \cdot p}{r+p}+1 &, if\ r \geq R\ and\ p \geq P\\ \frac{2 \cdot r \cdot p}{r+p} &, otherwise \end{array}\right.\)

In order to identify the point satisfying operators’ preference (if one exists), we add an incentive constant of 1 to the F-Score if r ≥ R and p ≥ P. Since the F-Score is no more than 1, this incentive constant ensures that the points satisfying the preference must have a PC-Score larger than those that do not
Adding 1 guarantees that points satisfying the operators' preference always score higher than those that do not; the best point is then chosen among the satisfying ones, and even if none satisfies the preference, the point with the largest F-Score is still selected.
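
A sketch of cThld selection by PC-Score (my code; `eval_fn` is a hypothetical helper that returns (recall, precision) for a candidate cThld on labeled data):

```python
def pc_score(r, p, R=0.66, P=0.66):
    """F-Score, plus an incentive constant of 1 when the operators'
    preference (recall >= R and precision >= P) is satisfied."""
    f = 2 * r * p / (r + p) if (r + p) > 0 else 0.0
    return f + 1 if (r >= R and p >= P) else f

def select_cthld(candidates, eval_fn, R=0.66, P=0.66):
    """Return the cThld candidate with the largest PC-Score."""
    return max(candidates, key=lambda c: pc_score(*eval_fn(c), R=R, P=P))
```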

4.5.2 EWMA Based cThld Prediction

These cThlds are the best ones we can configure for detecting those data, and are called best cThlds. However, in online detection, we need to predict cThlds for detecting future data
Online detection requires predicting the cThld for future data.

To this end, an alternative method is k-fold cross-validation. In each test (k tests in total), a classifier is trained using k − 1 of the subsets and tested on the remaining one with a cThld candidate. The candidate that achieves the best average PC-Score across the k tests is used for future detection.
With k-fold cross-validation, the candidate with the best average PC-Score across the k tests becomes the cThld.

the best cThlds can differ greatly over weeks. As a result, in the cross-validation, the cThld that achieves the highest average performance over all the historical data might not be similar to the best cThld of the future week.
However, the best cThlds turn out to differ greatly across weeks.

Hence, we adopt EWMA [11] to predict the cThld of the ith week (or the ith test set) based on the historical best cThlds.
\(cThld_{i}^{p}=\left\{ \begin{array}{ll} \alpha \cdot cThld_{i-1}^{b}+(1-\alpha) \cdot cThld_{i-1}^{p} &, i>1\\ 5\text{-}fold\ prediction &, i=1 \end{array}\right.\)
EWMA predicts this week's cThld as a weighted sum of last week's best cThld and last week's predicted cThld.

\(cThld_{i-1}^{b}\) is the best cThld of the \((i - 1)^{th}\) week. \(cThld_{i}^{p}\) is the predicted cThld of the \(i^{th}\) week, and also the one used for detecting the \(i^{th}\)-week data.

We use \(\alpha\) = 0.8 in this paper to quickly catch up with the cThld variation
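
The prediction rule is a one-liner; a sketch (my code, with the \(i = 1\) 5-fold case left as a placeholder):

```python
def predict_cthld(prev_best, prev_pred, alpha=0.8):
    """cThld_i^p = alpha * cThld_{i-1}^b + (1 - alpha) * cThld_{i-1}^p."""
    return alpha * prev_best + (1 - alpha) * prev_pred

# Weekly loop: after labeling week i-1, compute its best cThld, then
# predict the cThld used to detect week i.
# pred = five_fold_prediction(...)   # i = 1 (placeholder)
# for best in weekly_best_cthlds:
#     pred = predict_cthld(best, pred)
```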

5. EVALUATION

5.1 Data sets

These data are labeled by the operators from the search engine using our labeling tool. 7.8%, 2.8%, and 7.4% of the data points are labeled as anomalies for PV, #SR, and SRT, respectively
The proportion of anomalies in each dataset.

5.2 Detector and Parameter Choices

Two of the detectors were already used by the search engine we studied before this study. One is namely “Diff”, which simply measures anomaly severities using the differences between the current point and the point of last slot, the point of last day, and the point of last week.
"Diff" 检测器

The other one, namely “MA of diff”, measures severities using the moving average of the difference between current point and the point of last slot. This detector is designed to discover continuous jitters.
"MA of diff"检测器

Among these detectors, there are two variants of detectors using MAD (Median Absolute Deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severities. This patch can improve the robustness to missing data and outliers
MAD is more robust to missing data and outliers.
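
A sketch of the MAD patch (my code): deviations are measured around the median, in units of the median absolute deviation, instead of around the mean in units of the standard deviation:

```python
import numpy as np

def mad_severity(point, history):
    """Deviation around the median in units of MAD (median absolute
    deviation); more robust to missing data and outliers than the
    mean/standard-deviation version."""
    history = np.asarray(history, dtype=float)
    med = np.median(history)
    mad = np.median(np.abs(history - med))
    return abs(point - med) / mad if mad > 0 else 0.0
```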

5.3 Accuracy of Random Forests

Alternatively, we use the area under the PR curve (AUCPR) [50] as the accuracy measure. The AUCPR is a single number summary of the detection performance over all the possible thresholds

5.3.1 Random Forests vs. Basic Detectors and Static Combinations of Basic Detectors

The result shows that random forests significantly outperform the two static combination methods, and perform similarly to or even better than the most accurate basic detector for each KPI
Random forests beat the static combinations and, for each KPI, perform about as well as (or better than) the best single detector.

5.3.2 Random Forests vs. Other Algorithms

We also compare random forests with several other machine learning algorithms: decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes

The result demonstrates that random forests are quite robust to irrelevant and redundant features in practice
The more features there are, the more evident the advantage of random forests becomes.

6. DISCUSSION

Anomaly detection, not troubleshooting. Sometimes, although the operators admit the anomalies in the KPI curve, they tend to ignore them as they know that the anomalies are caused by some normal activities as expected, such as service upgrades and predictable social events.
Anomaly detection is not troubleshooting.

For example, the troubleshooting system may find that the anomalies are due to normal system upgrades and suggest operators to ignore them.

Detection across the same types of KPIs. Some KPIs are of the same type and operators often care about similar types of anomalies for them. Note that, in order to reuse the classifier for the data of different scales, the anomaly features extracted by basic detectors should be normalized.
KPIs of the same type can reuse the classifier after the features are normalized.

Dirty data. A well known problem is that detectors are often affected by “dirty data”. Dirty data refer to anomalies or missing points in data, and they can contaminate detectors and cause errors of detectors.
The dirty-data problem.

We address this problem in three ways. (a) Some of our detectors, e.g., weighted MA and SVD, can generate anomaly features only using recent data. Thus, they can quickly get rid of the contamination of dirty data.

(b) We take advantage of MAD [3, 15] to make some detectors, such as TSD, more robust to dirty data. (c) Since Opprentice uses many detectors simultaneously, even if a few detectors are contaminated, Opprentice could still automatically select and work with the remaining detectors.

Learning limitations. Another issue is that a learning based approach is limited by the anomalies within a training set. For example, anomalies can be rare, and new types of anomalies might appear in the future [16]. We solve this problem by incrementally retraining the classifier to gather more anomaly cases and learn emerging types of anomalies
Anomalies are rare; incremental retraining gathers more anomaly cases and learns emerging types.

Reposted from www.cnblogs.com/learninglife/p/10815205.html