Density estimation:

$$\text{if}\ p(x)=\begin{cases} <\varepsilon, & \text{anomaly} \\ \geq\varepsilon, & \text{normal} \end{cases}$$
Anomaly detection examples:
(1) Fraud detection: $x^{(i)}$ = features of user $i$'s activities; model $p(x)$ from data; identify unusual users by checking which have $p(x)<\varepsilon$
(2) Manufacturing
(3) Monitoring computers in a data center: $x^{(i)}$ = features of machine $i$
Mean (determines the center): $\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)}$
Variance (determines the width): $\sigma^2=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right)^2$
Example
3 Algorithm
Choose features $x_i$ that you think might be indicative of anomalous examples
Fit parameters $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$:

$$\begin{aligned} \mu_j&=\frac{1}{m}\sum_{i=1}^m x^{(i)}_j \\ \sigma^2_j&=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}_j-\mu_j\right)^2 \end{aligned}$$
Given new example $x$, compute $p(x)$:

$$p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma^2_j}\right)$$

Anomaly if $p(x)<\varepsilon$
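The three steps above can be sketched in NumPy (a minimal illustration; the function names are my own, not from the original notes):

```python
import numpy as np

def fit_gaussian_params(X):
    """Step 2: estimate per-feature mean and variance from training data X (m x n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # uses the 1/m normalization, matching the formulas above
    return mu, sigma2

def p(X, mu, sigma2):
    """Step 3: product of independent univariate Gaussian densities per example."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exponent = -((X - mu) ** 2) / (2 * sigma2)
    return np.prod(coef * np.exp(exponent), axis=1)

def detect(X, mu, sigma2, eps):
    """Flag examples whose density falls below the threshold epsilon."""
    return p(X, mu, sigma2) < eps
```

Note that the per-feature products become very small as $n$ grows, so real implementations often sum log-densities instead.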
4 Developing and Evaluating an Anomaly Detection System
From the training set, estimate each feature's mean and variance and construct the $p(x)$ function.
On the cross-validation set, try different values of $\varepsilon$ as the threshold, predict whether each example is anomalous, and choose $\varepsilon$ according to the F1 score (or the precision/recall trade-off).
After selecting $\varepsilon$, make predictions on the test set and compute the system's F1 score (or its precision and recall).
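The $\varepsilon$-selection loop on the cross-validation set can be sketched as follows (a sketch, assuming `p_val` holds the densities $p(x)$ of the CV examples and `y_val` their labels, with 1 = anomaly):

```python
import numpy as np

def select_epsilon(p_val, y_val):
    """Scan candidate thresholds over the range of CV densities and return the
    epsilon with the best F1 score. y_val: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        preds = (p_val < eps).astype(int)
        tp = np.sum((preds == 1) & (y_val == 1))  # true positives
        fp = np.sum((preds == 1) & (y_val == 0))  # false positives
        fn = np.sum((preds == 0) & (y_val == 1))  # false negatives
        if tp == 0:
            continue  # F1 undefined / zero, skip
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1
```

F1 is used rather than accuracy because the CV set is heavily skewed toward normal examples.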
5 Anomaly Detection vs. Supervised Learning
| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples ($y=1$) | Larger number of positive and negative examples |
| Many different "types" of anomalies; hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set |
| Fraud detection; Manufacturing; Monitoring machines in a data center | Email spam classification; Weather prediction; Cancer classification |
6 Choosing What Features to Use
6.1 Non-Gaussian Features
Transform the data to look more Gaussian:
(1) Logarithm: $x \leftarrow \log(x+c)$, where $c$ is a non-negative constant
(2) Power: $x \leftarrow x^c$, where $c$ is a fraction between $0$ and $1$
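Both transforms can be checked empirically; here is a sketch on a hypothetical right-skewed feature (the data and threshold choices are illustrative, not from the notes):

```python
import numpy as np

def skewness(v):
    """Third standardized moment; near 0 for Gaussian-looking data."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

# Hypothetical right-skewed feature, e.g. request latencies
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)

x_log = np.log(x + 1.0)  # (1) log transform with c = 1
x_pow = x ** 0.5         # (2) power transform with c = 0.5
```

In practice one plots a histogram of the transformed feature and adjusts $c$ until the distribution looks roughly bell-shaped.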
6.2 Error Analysis for Anomaly Detection
Want $p(x)$ large for normal examples $x$, and $p(x)$ small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples.
6.3 Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly
New, more informative features can often be obtained by combining related features.
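As a toy illustration (the feature names and values are hypothetical): a machine with high CPU load but low network traffic may look normal in each raw feature alone, while their ratio makes the unusual combination stand out:

```python
import numpy as np

# Hypothetical monitoring features for four machines
cpu_load = np.array([0.30, 0.40, 0.35, 0.95])
network_traffic = np.array([0.50, 0.60, 0.55, 0.05])

# A machine stuck in an infinite loop shows high CPU but little traffic;
# the combined feature isolates exactly that combination.
ratio = cpu_load / network_traffic
```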
7 Multivariate Gaussian Distribution
It allows you to capture cases where you would expect two different features to be positively or negatively correlated.
7.1 Algorithm
$$\begin{aligned} \mu&=\frac{1}{m}\sum_{i=1}^m x^{(i)} \\ \Sigma&=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right){\left(x^{(i)}-\mu\right)}^T \\ p(x;\mu,\Sigma)&=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \end{aligned}$$
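A direct NumPy sketch of these formulas (function names are my own):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mu (n,) and Sigma (n x n) with the 1/m normalization."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = (D.T @ D) / m  # equivalent to the sum of outer products above
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """Density of the multivariate Gaussian at each row of X."""
    n = mu.size
    det = np.linalg.det(Sigma)
    inv = np.linalg.inv(Sigma)
    D = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), row-wise
    expo = -0.5 * np.sum((D @ inv) * D, axis=1)
    return np.exp(expo) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))
```

Because $\Sigma$ models correlations, a point lying along the data's correlation direction gets a higher density than an equally distant point off that direction.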
7.2 Comparison of the Original Gaussian Model and the Multivariate Gaussian Model
| Original Gaussian model | Multivariate Gaussian model |
| --- | --- |
| Manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large $n$) | Computationally more expensive |
| OK even if $m$ is small | Must have $m > n$, or else $\Sigma$ is non-invertible |
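The $m > n$ requirement can be demonstrated numerically: with $m \le n$ the sample covariance has rank at most $m-1$ (the centered rows sum to zero), so $\Sigma$ is singular and cannot be inverted. A small sketch:

```python
import numpy as np

# m = 3 examples of n = 5 features: too few examples for the covariance
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))

D = X - X.mean(axis=0)       # centered data; rows sum to zero
Sigma = (D.T @ D) / X.shape[0]

rank = np.linalg.matrix_rank(Sigma)  # at most m - 1 = 2, far below n = 5
```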