Python Outlier Detection

Anomaly Detection

  • What are Outliers?
  • Statistical Methods for Univariate Data
  • Using Gaussian Mixture Models
  • Fitting an elliptic envelope
  • Isolation Forest
  • Local Outlier Factor
  • Using a clustering method like DBSCAN

Outliers

  • New data that doesn't belong to the general trend (or distribution) of the existing data is known as an outlier.
  • Data belonging to the general trend is known as an inlier.
  • Learning models are impacted by the presence of outliers.
  • Anomaly detection is another use of outlier detection, in which we look for unusual behaviour.
  • Data detected as outliers can be deleted from the dataset.
  • Outliers can also be marked before using the data in learning methods (a short sketch of both options follows this list).
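
Both options reduce to a boolean mask. A minimal sketch, assuming outlier_mask comes from whichever detector is used (here just the illustrative 3-sigma rule):

import numpy as np
data = np.random.normal(size=1000)
outlier_mask = np.abs(data - data.mean()) > 3 * data.std()  # illustrative detector
cleaned = data[~outlier_mask]            # option 1: delete the outliers
labels = np.where(outlier_mask, -1, 1)   # option 2: mark them (-1 = outlier) and keep everything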

Statistical Methods for Univariate Data

  • Using Standard Deviation Method - zscore
  • Using Interquartile Range Method - IQR

Using Standard Deviation Method

  • If univariate data follows a Gaussian distribution, we can use the standard deviation to figure out how far each point lies from the mean
import numpy as np
from scipy.stats import zscore
data = np.random.normal(size=1000)
  • Adding more outliers: overwrite the last five values with points far in the right tail
data[-5:] = [3.5, 3.6, 4, 3.56, 4.2]
  • Detecting outliers: flag points more than 3 standard deviations from the mean
data[np.abs(zscore(data)) > 3]
array([3.05605991, 3.5       , 3.6       , 4.        , 3.56      ,
       4.2       ])
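
For intuition, the same test can be written without scipy; the z-score is just the distance from the mean in units of standard deviation:

z = (data - data.mean()) / data.std()  # equivalent to scipy's zscore (both default to ddof=0)
data[np.abs(z) > 3]                    # selects the same points as above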

Using Interquartile Range

  • For univariate data that does not follow a Gaussian distribution, the IQR is a way to detect outliers
from scipy.stats import iqr
data = np.random.normal(size=1000)
data[-5:] = [-2, 9, 11, -3, -21]
iqr_value = iqr(data)
# Tukey fences: 1.5 * IQR below the 25th and above the 75th percentile
lower_threshold = np.percentile(data, 25) - iqr_value * 1.5
upper_threshold = np.percentile(data, 75) + iqr_value * 1.5
print(upper_threshold)
print(lower_threshold)
2.743958884560146
-2.7240836639852306
data[np.where(data < lower_threshold)]
array([ -3.15722416,  -2.72563369,  -2.84349424,  -3.        ,
       -21.        ])
data[np.where(data > upper_threshold)]
array([ 2.83383323,  2.7536317 ,  2.98728378,  2.7889204 ,  9.        ,
       11.        ])

Using Gaussian Mixture Models

  • Data might contain more than one peak in its distribution.
  • Trying to fit such multimodal data with a unimodal model won't give a good fit.
  • GMM allows fitting such multimodal data.
  • Configuration involves the number of components in the data, n_components.
  • covariance_type controls the shape of each cluster
    • full : each cluster is modeled as an ellipse with arbitrary orientation
    • spherical : clusters are spherical, as in kmeans
    • diag : cluster ellipses are aligned to the axes
  • We will see how GMM can be used to find outliers (see the score_samples sketch after the plots below)
# Number of samples per component
n_samples = 500

# Generate a random sample with two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
C2 = np.array([[1., -0.1], [2.7, .2]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.dot(np.random.randn(n_samples, 2), C2)]
import matplotlib.pyplot as plt
%matplotlib inline
# Overwrite the last five points with a small far-away blob of outliers
X[-5:] = [[4, -1], [4.1, -1.1], [3.9, -1], [4.0, -1.2], [4.0, -1.3]]
plt.scatter(X[:, 0], X[:, 1], s=5)

[Figure: scatter plot of the two-component sample with the injected outlier blob]

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3)  # the two generating components, plus one that can absorb the outlier blob
gmm.fit(X)
pred = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], s=10, c=pred)

[Figure: scatter plot colored by GMM component assignment]
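
The component coloring above does not by itself flag outliers. A minimal sketch of one way to do it, using score_samples to get each point's log-likelihood under the fitted mixture and cutting at a low percentile (the 1% threshold is illustrative, not from the original notebook):

scores = gmm.score_samples(X)         # per-sample log-likelihood under the mixture
threshold = np.percentile(scores, 1)  # illustrative cut-off: flag the least likely 1%
outliers = X[scores < threshold]
plt.scatter(X[:, 0], X[:, 1], s=5)
plt.scatter(outliers[:, 0], outliers[:, 1], s=20, c='r')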

Fitting an Elliptical Envelope

  • The assumption here is that regular data comes from a known distribution (a Gaussian distribution).
  • Inlier location & covariance are estimated using Mahalanobis distances, which are less impacted by outliers.
  • Calculate a robust covariance fit of the data (a ranking sketch follows the plots below).
from sklearn.datasets import make_blobs
X,_ = make_blobs(n_features=2, centers=2, cluster_std=2.5, n_samples=1000)
plt.scatter(X[:,0], X[:,1],s=10)

[Figure: scatter plot of the two generated blobs]

from sklearn.covariance import EllipticEnvelope
ev = EllipticEnvelope(contamination=.1)  # assume ~10% of points are outliers
ev.fit(X)
cluster = ev.predict(X)  # 1 = inlier, -1 = outlier
plt.scatter(X[:, 0], X[:, 1], s=10, c=cluster)

[Figure: scatter plot colored by EllipticEnvelope inlier/outlier prediction]
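
Besides the hard predict labels, points can be ranked by how far outside the robust fit they fall. A short sketch using decision_function (negative scores lie outside the envelope):

scores = ev.decision_function(X)    # negative = outside the fitted ellipse
worst = X[np.argsort(scores)[:10]]  # the 10 most anomalous points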

Isolation Forest

  • Based on RandomForest.
  • Useful for detecting outliers in high-dimensional datasets.
  • The algorithm randomly selects a feature and then randomly splits it, partitioning the data recursively.
  • Random partitioning produces noticeably shorter paths for anomalies.
  • When a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies (see the scoring sketch after the plot below).
rng = np.random.RandomState(42)

# Generate train data: two tight clusters around (2, 2) and (-2, -2)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
from sklearn.ensemble import IsolationForest
data = np.r_[X_train, X_test, X_outliers]
iso = IsolationForest(contamination='auto')  # the deprecated `behaviour` argument was removed in recent scikit-learn
iso.fit(data)
pred = iso.predict(data)  # 1 = inlier, -1 = outlier
plt.scatter(data[:, 0], data[:, 1], s=10, c=pred)

[Figure: scatter plot colored by IsolationForest prediction]
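
The labels come from thresholding an anomaly score derived from average path lengths; the raw score is available via score_samples. A short sketch:

scores = iso.score_samples(data)                # lower = shorter average paths = more anomalous
most_anomalous = data[np.argsort(scores)[:10]]  # the 10 most isolated samples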

Local Outlier Factor

  • Based on nearest neighbours.
  • Suited for moderately high-dimensional datasets.
  • LOF computes a score reflecting the degree of abnormality of each data point.
  • LOF Calculation
    • Local density is estimated from the k-nearest neighbours.
    • The LOF of each point is the ratio of the average local density of its k-nearest neighbours to its own local density.
    • An abnormal point is expected to have a smaller local density than its neighbours, i.e. an LOF well above 1.
  • LOF tells you not only whether a point is an outlier but how strong an outlier it is relative to the rest of the data (see the ranking sketch after the plot below).
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=25, contamination=.1)
pred = lof.fit_predict(data)              # 1 = inlier, -1 = outlier
s = np.abs(lof.negative_outlier_factor_)  # LOF score: larger = more abnormal
plt.scatter(data[:, 0], data[:, 1], s=s*10, c=pred)  # marker size encodes the LOF score

[Figure: scatter plot with marker size proportional to LOF score, colored by prediction]
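
Since negative_outlier_factor_ stores the negated LOF of each sample, sorting it in ascending order ranks points from most to least abnormal; a small usage sketch:

ranked = np.argsort(lof.negative_outlier_factor_)  # most negative (largest LOF) first
top10_outliers = data[ranked[:10]]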

Outlier Detection using DBSCAN

  • DBSCAN is a density-based clustering method.
  • It groups data points that are close to each other.
  • Unlike centroid-based methods such as kmeans, it doesn't assign points by distance to a cluster centre.
  • Data not close enough to any cluster is not assigned to one; such points can be treated as anomalies (see the extraction sketch after the plot below).
  • eps controls the neighbourhood radius used when deciding whether a point is part of a cluster.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=.3)
dbscan.fit(data)
# Unclustered (noise) points get the label -1; marker size reuses the LOF scores
plt.scatter(data[:, 0], data[:, 1], s=s*10, c=dbscan.labels_)

[Figure: scatter plot colored by DBSCAN cluster labels]
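
The anomaly candidates are exactly the points DBSCAN left unclustered; a minimal extraction sketch:

noise = data[dbscan.labels_ == -1]  # -1 is DBSCAN's label for noise points
print(len(noise), "potential anomalies")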



Reposted from blog.csdn.net/sinat_23971513/article/details/105243561