Anomaly Detection

Foreword

       This chapter continues from the previous one, where we introduced two unsupervised learning problems, clustering and dimensionality reduction. In this chapter we introduce another unsupervised learning problem: anomaly detection.

       Finally, if I have misunderstood anything, I hope you will point it out. Thank you!

Chapter 13 Anomaly Detection

13.1 Problem motivation

       First of all, why does the anomaly detection problem arise? Let me illustrate with an example. Suppose we describe aircraft engines with features such as x1 = heat generated and x2 = vibration intensity, and we collect a dataset of m engines:

{x^{(1)},x^{(2)},...,x^{(m)}}. Assume all the engines in this dataset are normal, and that we now receive data for a new engine, x_{test}. As shown in Figure 1, we plot the new data together with the known data. If the new point lies among the known data, we predict that it is normal; if it is far from the known dataset, we predict that it is anomalous. This is a simple illustration of the anomaly detection process.

                                                                         Figure 1 The known dataset and the new data

      We just said that a new point near our dataset is predicted to be normal and a point far away is predicted to be anomalous, but how close counts as near, and how distant counts as far? Given our dataset, as shown in Figure 2, we build a probability model p(x) and use it to score new data. The closer a point is to the center of the dataset, the more likely it is to be normal, i.e. p(x_{test}) will be large; conversely, the farther it is from the dataset, the less likely it is to be normal and the smaller p(x_{test}) will be. We therefore set an anomaly threshold \varepsilon: when p(x_{test}) < \varepsilon, the point is flagged as anomalous; when p(x_{test}) \geq \varepsilon, it is considered normal.

                                                                             Figure 2 The dataset

      We have just seen how the probability function p(x) decides whether a new example is anomalous; now let us look at some practical applications of anomaly detection. Besides the aircraft engine example above, we can detect anomalous Internet users, such as scammers, from the information they search for. We construct p(x) from features x such as the words a normal user typically searches for; unusual search terms have very small probability under the model built from the known data, so by fixing an anomaly threshold and checking p(x) < \varepsilon we obtain a set of suspects. This does not identify anomalous users with one hundred percent certainty, but it narrows the scope, and we can then investigate the flagged results further. Another application is monitoring computers in a data center. We can define features such as x1 = memory usage, x2 = number of disk accesses per second, and x3 = CPU load, build a probability model p(x) from these variables, and whenever a machine satisfies p(x) < \varepsilon conclude that it may have a problem and needs maintenance.

13.2 Gaussian distribution

       Here is a brief introduction to the Gaussian distribution, because later we will use it in the anomaly detection algorithm; if you already know the Gaussian distribution, you can skip directly to the next section. The Gaussian distribution is also commonly called the normal distribution. If x \in R follows a Gaussian distribution with mean \mu and variance \sigma^{2}, we write x \sim N(\mu, \sigma^{2}), where \sigma is the standard deviation, and the probability model is p(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(x-\mu)^2}{2\sigma^{2}}). The graph, shown in Figure 3, is symmetric; the center is the mean \mu, and the variance measures how short and wide or tall and thin the bell is.

                                                                           Figure 3 The Gaussian distribution

      Let us look at a few examples to build intuition for how \mu and \sigma affect the shape. As shown in Figure 4, when \mu is 0 in every case, the axis of symmetry of the Gaussian is x = 0. When \sigma is halved, the curve becomes sharper, and when \sigma is doubled, the curve becomes shorter and wider; so the smaller \sigma, the taller and thinner the curve, and the larger \sigma, the shorter and wider. One thing to note is that the area under the curve is always 1, a basic fact about probability distributions. If we change only \mu, the whole curve simply shifts so that its new axis of symmetry is at \mu.

                                                                             Figure 4 The effect of changing \mu and \sigma on the Gaussian distribution

     Now, given our dataset {x^{(1)},x^{(2)},...,x^{(m)}}, how do we fit a Gaussian distribution? We need its two parameters, the mean \mu and the variance \sigma^{2}: the mean is \mu =\frac{1}{m}\sum_{i=1}^{m}x^{(i)} and the variance is \sigma ^{2}=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu )^2, which gives us the distribution x\sim N(\mu ,\sigma ^{2}). That concludes our introduction to the Gaussian distribution.
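       As a small illustration, here is a minimal Python sketch of these two estimates and the density formula (the sample data and function names are made up just to exercise the code):

```python
import numpy as np

def fit_gaussian(x):
    """Estimate mu and sigma^2 from a 1-D sample, per the formulas above."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # note: divides by m, not m - 1
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """p(x; mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# A made-up sample, roughly N(5, 4): the estimates should come out near 5 and 4.
sample = np.random.randn(1000) * 2 + 5
mu, sigma2 = fit_gaussian(sample)
print(mu, sigma2, gaussian_pdf(5.0, mu, sigma2))
```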

13.3 An anomaly detection algorithm based on the Gaussian distribution

       Having laid this groundwork on the Gaussian distribution, I will now show how to use it to build the anomaly detection algorithm. For a training set

{x^{(1)},x^{(2)},...,x^{(m)}}, suppose each example has n features, x \in R^{n}, and assume each feature x_{j} follows a Gaussian distribution, so x_{1}\sim N(\mu _{1},\sigma _{1}^{2}), x_{2}\sim N(\mu _{2},\sigma _{2}^{2}), ..., x_{n}\sim N(\mu _{n},\sigma _{n}^{2}). Each feature then has a corresponding density p(x_{1};\mu _{1},\sigma _{1}^{2}), p(x_{2};\mu _{2},\sigma _{2}^{2}), ..., p(x_{n};\mu _{n},\sigma _{n}^{2}), and the overall p(x)=p(x_{1};\mu _{1},\sigma _{1}^{2})p(x_{2};\mu _{2},\sigma _{2}^{2})...p(x_{n};\mu _{n},\sigma _{n}^{2}) is the product of the individual densities. Multiplying like this implicitly assumes the features are independent; in practice the algorithm works well even when some features are not, so there is no need to agonize over whether every feature is independent. We can write p(x) compactly as p(x)=\prod_{j=1}^{n}p(x_{j};\mu _{j},\sigma _{j}^{2}) (here \prod is the product symbol).

       Let us restate the whole algorithm (a code sketch follows the three steps):

1. Choose features x_{j} that you think may be indicative of anomalous examples.

2. From the dataset {x^{(1)},x^{(2)},...,x^{(m)}}, use \mu _{j}=\frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)} and \sigma_{j} ^{2}=\frac{1}{m}\sum_{i=1}^{m}(x_{j}^{(i)}-\mu_{j} )^2 to obtain \mu _{1},\mu _{2},...,\mu _{n},\sigma _{1}^{2},\sigma _{2}^{2},...,\sigma _{n}^{2}.

3. Given a new example x, compute p(x)=\prod_{j=1}^{n}p(x_{j};\mu _{j},\sigma _{j}^{2})=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi }\sigma _{j}}exp(-\frac{(x_{j}-\mu _{j})^2}{2\sigma _{j}^{2}}); if p(x)<\varepsilon, flag it as an anomaly.
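       Here is a minimal Python sketch of these three steps, assuming the training data is a NumPy array with one example per row (the function names are my own, not from the course):

```python
import numpy as np

def estimate_parameters(X):
    """Step 2: per-feature mean and variance. X has shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # divides by m, matching the formula above
    return mu, sigma2

def p(X, mu, sigma2):
    """Step 3: p(x) as the product of the per-feature Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

def detect(X_train, X_new, epsilon):
    """Flag the rows of X_new whose density falls below epsilon."""
    mu, sigma2 = estimate_parameters(X_train)
    return p(X_new, mu, sigma2) < epsilon
```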

       Let us work through a concrete example. As shown in Figure 5, we have a set of data in which each example has two features, x1 and x2, both assumed Gaussian. Computing their parameters separately gives the means and standard deviations \mu _{1}=5,\sigma _{1}=2;\mu _{2}=3,\sigma _{2}=1; the two individual densities are plotted in Figure 6. The overall p(x) is their product, shown in Figure 7, where the height at each point represents the probability density. Now suppose we have two new examples x_{test}^{(1)} and x_{test}^{(2)}, for which we compute p(x_{test}^{(1)})=0.0426 and p(x_{test}^{(2)})=0.0021, and we have set \varepsilon =0.02. Since p(x_{test}^{(1)})=0.0426\geq \varepsilon and p(x_{test}^{(2)})=0.0021< \varepsilon, we conclude that x_{test}^{(1)} is normal and x_{test}^{(2)} is anomalous. The region outside the large circle in Figure 5 corresponds to the part of Figure 7 where the height is essentially 0, which matches our earlier conclusion: the smaller p(x), the more likely the point is anomalous, i.e. the farther it is from the dataset.

                                                                       Figure 5 An anomaly detection example

                                                                    Figure 6 The probability distributions of x1 and x2

                                                                                   Figure 7 The distribution of p(x)
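       As a usage example, the following sketch plugs the parameters \mu _{1}=5, \sigma _{1}=2, \mu _{2}=3, \sigma _{2}=1 into the density; the coordinates of the two test points are hypothetical, since the text above gives only the resulting probabilities:

```python
import numpy as np

mu = np.array([5.0, 3.0])
sigma2 = np.array([2.0, 1.0]) ** 2   # sigma_1 = 2, sigma_2 = 1

def p(x):
    d = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod()

epsilon = 0.02
x_test_1 = np.array([4.5, 2.8])   # hypothetical point near the center
x_test_2 = np.array([8.0, 1.0])   # hypothetical point far from the center
print(p(x_test_1) >= epsilon)     # True: flagged as normal
print(p(x_test_2) < epsilon)      # True: flagged as anomalous
```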

13.4 Developing and evaluating an anomaly detection system

       We have now seen how to use the Gaussian distribution for anomaly detection, but how do we evaluate whether the system is reasonable, that is, how do we judge whether the examples we flag as anomalous really are anomalous? Suppose we have some labeled data, so we know which examples are normal and which are anomalous. This sounds similar to the supervised learning discussed earlier, but the two are different and should not be confused; I will compare them in detail later. We use y=0 for a normal example and y=1 for an anomalous one, and we split the data into three sets as before. We have a training set x^{(1)},x^{(2)},...,x^{(m)} used to fit the probability function p(x); a cross-validation set (x_{cv}^{(1)},y_{cv}^{(1)}),(x_{cv}^{(2)},y_{cv}^{(2)}),...,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})}); and a test set (x_{test}^{(1)},y_{test}^{(1)}),(x_{test}^{(2)},y_{test}^{(2)}),...,(x_{test}^{(m_{test})},y_{test}^{(m_{test})}).

        Take the original aircraft engine problem as an example. Suppose we have 10000 good engines and 20 anomalous ones. We can split the data as follows: a training set of 6000 good engines (which we may regard as y=0), and a cross-validation set and a test set each containing 2000 good engines (y=0) and 10 anomalous engines (y=1), with no overlap between the two. Some people instead keep the training set of 6000 good engines but use the same 4000 good engines for both the cross-validation set and the test set, with the same 10 anomalous engines in each. I do not recommend the latter: reusing one set of data this way is bad, because we need the cross-validation set to choose suitable parameters and the test set to measure how reasonable the whole system is, so it is best that the two sets contain different data.

       From the training set we obtain p(x); then, for an x in the cross-validation or test set, we predict y=\left\{\begin{matrix} 1 &if & p(x)<\varepsilon (anomaly)\\ 0 &if & p(x)\geq \varepsilon (normal) \end{matrix}\right.. Because this data is clearly skewed, mostly normal examples with very few anomalous ones, we cannot use ordinary classification accuracy. In an earlier chapter I introduced ways of handling skewed classes, such as counting true positives, false positives, false negatives, and true negatives, or using precision/recall, or the F1 score; we use one of these to evaluate the predictions and choose the best \varepsilon on the cross-validation set, as sketched below.
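       Here is a minimal sketch of choosing \varepsilon by maximizing the F1 score; p_cv is assumed to hold the density values p(x) for the cross-validation examples (computed as in the earlier sketch) and y_cv their 0/1 labels:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 computed from true/false positives and false negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds between the smallest and largest p values."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        f1 = f1_score(y_cv, (p_cv < eps).astype(int))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```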

13.5 Anomaly detection versus supervised learning

      As noted above, the two should not be confused, so when should we use anomaly detection and when supervised learning? Let us compare them. 1. In anomaly detection we plainly have very few y=1 examples and a large number of y=0 examples, whereas in supervised learning we have large numbers of both y=0 and y=1 examples. 2. In anomaly detection it is hard to find, in the known data, examples similar to a new anomaly; that is, from the given anomalous examples it is hard to summarize what anomalies look like, and for a new anomalous example we may well find nothing like it in the known anomalous data. In supervised learning, by contrast, there are so many y=1 examples that we can describe the rough appearance of the positive class, and for a new positive example it is relatively easy to find similar examples in the training set.

       Classifying the examples used earlier: fraud detection, manufacturing (e.g. aircraft engines), and monitoring machines in a data center are anomaly detection problems, while email spam classification, weather prediction, and cancer classification are supervised learning problems.

13.6 What features to choose

        First, some features are not Gaussian, so we need some simple preprocessing. For example, for the data distribution shown on the left of Figure 8, we can take the logarithm log(x) to obtain the distribution on the right of Figure 8, which is close to Gaussian.

                                                                         Figure 8 Taking the logarithm log(x) makes the data approximately Gaussian
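        A minimal sketch of this kind of transform (the skewed sample and the candidate exponents are illustrative assumptions; in practice you would inspect histograms to pick one):

```python
import numpy as np

# A made-up right-skewed feature; log or a fractional power often brings
# such a feature closer to Gaussian.
x = np.random.exponential(scale=2.0, size=1000)
x_log = np.log(x + 1e-3)    # small constant guards against log(0)
x_root = x ** 0.5           # an alternative: try exponents like 0.5, 0.1, ...
```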
        Regarding error analysis for anomaly detection: we would like p(x) to be large when an example is normal and small when it is anomalous, but quite often p(x) comes out large in both cases. As shown in Figure 9, with only feature x1, suppose we have an anomalous example; computing p(x), we find p(x)>\varepsilon and misjudge it as normal. If we add a feature x2, as shown in Figure 10, we see that the example lies outside the range of the normal data and can correctly flag it as anomalous. So feature choice matters: we should choose values that make anomalies easy to separate, typically values that become unusually large or small under an anomaly. Sometimes we can add a feature such as x3 = x2/x1 to capture how x2 and x1 vary relative to each other, as sketched after the figures below.

                                                                                   Figure 9 With only feature x1

                                                                                           Figure 10 After adding feature x2
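      A minimal sketch of adding such a ratio feature (the interpretation of x1 and x2 as data-center metrics is an illustrative assumption):

```python
import numpy as np

def add_ratio_feature(X):
    """Append x3 = x2 / x1 as a new column; X has columns [x1, x2]."""
    ratio = X[:, 1] / (X[:, 0] + 1e-8)   # small constant avoids division by zero
    return np.column_stack([X, ratio])

# E.g. x1 = network traffic, x2 = CPU load: a machine stuck in a loop may show
# high CPU load with low traffic, so x3 = x2/x1 becomes unusually large even
# when neither feature is extreme on its own.
```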
13.7 The multivariate Gaussian distribution

       Having discussed how to choose features, let us now talk about the multivariate Gaussian distribution. Previously, with multiple features, we computed each feature's density separately and multiplied them to obtain the final p(x); here we no longer compute each one separately but compute p(x) directly. First, two parameters: \mu \in R^{n} and \Sigma \in R^{n\times n} (the covariance matrix). These play the same roles as the mean and variance described earlier, only now in matrix form. The density is p(x;\mu ,\Sigma )=\frac{1}{(2\pi )^{\frac{n}{2}}\left | \Sigma \right |^{\frac{1}{2}}}exp(-\frac{1}{2}(x-\mu )^{\top }\Sigma ^{-1}(x-\mu )).
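       A minimal NumPy sketch of evaluating this density at a single point (the function name is my own):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Evaluate p(x; mu, Sigma) per the formula above, for x in R^n."""
    n = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Sanity check: the standard bivariate Gaussian has density 1/(2*pi) ~ 0.159
# at its mean.
print(multivariate_gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```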

       Let us use pictures to build intuition for how \mu and \Sigma affect the distribution. As shown in Figure 11, \mu =\begin{bmatrix} 0\\0 \end{bmatrix} and each \Sigma is a diagonal matrix with equal values on the diagonal. The peak of every plot is at (0,0), which is determined by \mu; the larger the diagonal values of \Sigma, the shorter and wider the shape, and the smaller they are, the taller and thinner, the same conclusion as for the univariate Gaussian.

                                                                                     Figure 11 \mu =\begin{bmatrix} 0\\0 \end{bmatrix}, varying \Sigma

       As shown in Figure 12, when we change the diagonal entries of \Sigma so that they are no longer equal, the top view is no longer a circle but an ellipse. This is easy to understand: each diagonal entry of \Sigma corresponds to one feature, so each controls the spread along its own feature.

      

                                                                         Figure 12 Making the diagonal entries of \Sigma unequal

      As shown in Figure 13, when \Sigma has nonzero off-diagonal entries, i.e. \Sigma is no longer a diagonal matrix, the whole shape becomes tilted rather than axis-aligned; here the off-diagonal entries are positive.

                                                                                 Figure 13 \Sigma is no longer a diagonal matrix

       When the off-diagonal entries of \Sigma are negative, the shape tilts in the opposite direction, as shown in Figure 14.

                                                                        Figure 14 Negative off-diagonal entries of \Sigma

       So far we have only changed \Sigma; finally, consider the effect of \mu. As shown in Figure 15, just as in the univariate case, changing \mu shifts the whole distribution: the center moves to the value of \mu.

                                                                                          Figure 15 Changing the value of \mu

13.8 Anomaly detection with the multivariate Gaussian distribution

       Having introduced the multivariate Gaussian distribution, our aim is of course to use it to detect anomalies. For a given dataset {x^{(1)},x^{(2)},...,x^{(m)}}, we use \mu =\frac{1}{m}\sum_{i=1}^{m}x^{(i)} and \Sigma =\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu )(x^{(i)}-\mu )^{\top } to estimate the two parameters \mu and \Sigma of p(x), then compute p(x)=\frac{1}{(2\pi )^{\frac{n}{2}}\left | \Sigma \right |^{\frac{1}{2}}}exp(-\frac{1}{2}(x-\mu )^{\top }\Sigma ^{-1}(x-\mu )) for a new example, and finally flag it as anomalous when p(x) < \varepsilon.
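       A minimal sketch of this procedure, reusing the density function from the sketch in the previous section (again, the names are my own):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mu and Sigma from X of shape (m, n), per the formulas above."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]
    return mu, Sigma

def is_anomaly(x, mu, Sigma, epsilon):
    """Flag x as anomalous when p(x; mu, Sigma) < epsilon."""
    return multivariate_gaussian_pdf(x, mu, Sigma) < epsilon  # pdf from 13.7 sketch
```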

        Earlier we introduced a way of obtaining p(x) by first computing each feature's p(x_{j};\mu _{j},\sigma _{j}^{2}) and then multiplying; what is the connection between the two methods? The original model corresponds to taking \sigma _{1}^{2},\sigma _{2}^{2},...,\sigma _{n}^{2} and constructing the diagonal covariance matrix \Sigma =\begin{bmatrix} \sigma _{1}^{2} & 0 & \cdots & 0 \\ 0 & \sigma _{2}^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma _{n}^{2} \end{bmatrix}. Comparing the two: the original model does not easily capture relationships between variables, so you have to construct new variables by hand, such as x3 = x2/x1, to take the relationship between two variables into account, whereas the multivariate Gaussian captures such correlations automatically. On the other hand, when m is small, in particular less than n, the original model can still work well while the multivariate Gaussian cannot, because we need the inverse of \Sigma and \Sigma is not invertible when m < n; the usual rule of thumb is to use the multivariate Gaussian only when m \geq 10n. Each has its strengths, and in practice you choose the more appropriate method for the problem at hand.
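        As a quick check of this connection, the following sketch (with made-up data) verifies that, with a diagonal \Sigma built from the per-feature variances, the multivariate density equals the product of the univariate densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up data: three features with different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * [1.0, 2.0, 0.5] + [0.0, 5.0, -1.0]
mu, sigma2 = X.mean(axis=0), X.var(axis=0)

x = X[0]
p_independent = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2))
                        / np.sqrt(2 * np.pi * sigma2))
p_multivariate = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma2))
print(np.isclose(p_independent, p_multivariate))   # prints True
```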

 

 
