Density estimation:

$$\text{if}\ p(x)=\begin{cases} <\varepsilon, & \text{anomaly} \\ \geq\varepsilon, & \text{normal} \end{cases}$$
Anomaly detection examples:
(1) Fraud detection: $x^{(i)}$ = features of user $i$'s activities; model $p(x)$ from data; identify unusual users by checking which have $p(x)<\varepsilon$
(2) Manufacturing
(3) Monitoring computers in a data center: $x^{(i)}$ = features of machine $i$
Mean (determines the center): $\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)}$
Variance (determines the width): $\sigma^2=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right)^2$
Example
3 Algorithm
Choose features $x_i$ that you think might be indicative of anomalous examples
Fit parameters $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$:

$$\begin{aligned} \mu_j&=\frac{1}{m}\sum_{i=1}^m x^{(i)}_j \\ \sigma^2_j&=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}_j-\mu_j\right)^2 \end{aligned}$$
Given new example $x$, compute $p(x)$:

$$p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma^2_j}\right)$$

Anomaly if $p(x)<\varepsilon$
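The three steps above can be sketched in NumPy (a minimal illustration; the function names are my own, not from the original notes):

```python
import numpy as np

def fit_gaussian_params(X):
    """Step 2: estimate per-feature mean and variance from training data X (m x n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # uses the 1/m normalization, matching the formulas above
    return mu, sigma2

def p(X, mu, sigma2):
    """Step 3: product of independent univariate Gaussian densities per example."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exponent = -((X - mu) ** 2) / (2 * sigma2)
    return np.prod(coef * np.exp(exponent), axis=1)

def detect(X, mu, sigma2, eps):
    """Flag examples whose density falls below the threshold epsilon."""
    return p(X, mu, sigma2) < eps
```

Note that the per-feature products become very small as $n$ grows, so real implementations often sum log-densities instead.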
4 Developing and Evaluating an Anomaly Detection System
From the training set, estimate each feature's mean and variance and construct the $p(x)$ function.
On the cross-validation set, try different values of $\varepsilon$ as the threshold, predict whether each example is anomalous, and choose $\varepsilon$ according to the F1 score (or the precision/recall trade-off).
After selecting $\varepsilon$, make predictions on the test set and compute the system's F1 score (or its precision and recall).
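The $\varepsilon$-selection loop on the cross-validation set can be sketched as follows (a sketch, assuming `p_val` holds the densities $p(x)$ of the CV examples and `y_val` their labels, with 1 = anomaly):

```python
import numpy as np

def select_epsilon(p_val, y_val):
    """Scan candidate thresholds over the range of CV densities and return the
    epsilon with the best F1 score. y_val: 1 = anomaly, 0 = normal."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        preds = (p_val < eps).astype(int)
        tp = np.sum((preds == 1) & (y_val == 1))  # true positives
        fp = np.sum((preds == 1) & (y_val == 0))  # false positives
        fn = np.sum((preds == 0) & (y_val == 1))  # false negatives
        if tp == 0:
            continue  # F1 undefined / zero, skip
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1
```

F1 is used rather than accuracy because the CV set is heavily skewed toward normal examples.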
5 Anomaly Detection vs. Supervised Learning
| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples ($y=1$) | Larger number of positive and negative examples |
| Many different "types" of anomalies; hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set |
| Fraud detection; Manufacturing; Monitoring machines in a data center | Email spam classification; Weather prediction; Cancer classification |
6 Choosing What Features to Use
6.1 Non-Gaussian Features
Transform the data to look more Gaussian:
(1) Logarithm: $x \leftarrow \log(x+c)$, where $c$ is a non-negative constant
(2) Power: $x \leftarrow x^c$, where $c$ is a fraction between $0$ and $1$
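Both transforms can be checked empirically; here is a sketch on a hypothetical right-skewed feature (the data and threshold choices are illustrative, not from the notes):

```python
import numpy as np

def skewness(v):
    """Third standardized moment; near 0 for Gaussian-looking data."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

# Hypothetical right-skewed feature, e.g. request latencies
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)

x_log = np.log(x + 1.0)  # (1) log transform with c = 1
x_pow = x ** 0.5         # (2) power transform with c = 0.5
```

In practice one plots a histogram of the transformed feature and adjusts $c$ until the distribution looks roughly bell-shaped.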
6.2 Error Analysis for Anomaly Detection
Want $p(x)$ large for normal examples $x$, and $p(x)$ small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples.
6.3 Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly
New, more informative features can often be obtained by combining related features.
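As a toy illustration (the feature names and values are hypothetical): a machine with high CPU load but low network traffic may look normal in each raw feature alone, while their ratio makes the unusual combination stand out:

```python
import numpy as np

# Hypothetical monitoring features for four machines
cpu_load = np.array([0.30, 0.40, 0.35, 0.95])
network_traffic = np.array([0.50, 0.60, 0.55, 0.05])

# A machine stuck in an infinite loop shows high CPU but little traffic;
# the combined feature isolates exactly that combination.
ratio = cpu_load / network_traffic
```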
7 Multivariate Gaussian Distribution
It allows you to capture cases where you would expect two different features to be positively or negatively correlated.
7.1 Algorithm
$$\begin{aligned} \mu&=\frac{1}{m}\sum_{i=1}^m x^{(i)} \\ \Sigma&=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right){\left(x^{(i)}-\mu\right)}^T \\ p(x;\mu,\Sigma)&=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \end{aligned}$$
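A direct NumPy sketch of these formulas (function names are my own):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mu (n,) and Sigma (n x n) with the 1/m normalization."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = (D.T @ D) / m  # equivalent to the sum of outer products above
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """Density of the multivariate Gaussian at each row of X."""
    n = mu.size
    det = np.linalg.det(Sigma)
    inv = np.linalg.inv(Sigma)
    D = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), row-wise
    expo = -0.5 * np.sum((D @ inv) * D, axis=1)
    return np.exp(expo) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))
```

Because $\Sigma$ models correlations, a point lying along the data's correlation direction gets a higher density than an equally distant point off that direction.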
7.2 Comparison of the Original Gaussian Model and the Multivariate Gaussian Model
| Original Gaussian model | Multivariate Gaussian model |
| --- | --- |
| Manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large $n$) | Computationally more expensive |
| OK even if $m$ is small | Must have $m > n$, or else $\Sigma$ is non-invertible |
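The $m > n$ requirement can be demonstrated numerically: with $m \le n$ the sample covariance has rank at most $m-1$ (the centered rows sum to zero), so $\Sigma$ is singular and cannot be inverted. A small sketch:

```python
import numpy as np

# m = 3 examples of n = 5 features: too few examples for the covariance
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))

D = X - X.mean(axis=0)       # centered data; rows sum to zero
Sigma = (D.T @ D) / X.shape[0]

rank = np.linalg.matrix_rank(Sigma)  # at most m - 1 = 2, far below n = 5
```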