Python Outlier Detection

Anomaly Detection

  • What are Outliers?
  • Statistical Methods for Univariate Data
  • Using Gaussian Mixture Models
  • Fitting an elliptic envelope
  • Isolation Forest
  • Local Outlier Factor
  • Using a clustering method like DBSCAN

Outliers

  • New data that doesn't belong to the general trend (or distribution) of the existing data is known as an outlier.
  • Data belonging to the general trend is known as an inlier.
  • Learning models are impacted by the presence of outliers.
  • Anomaly detection is another use of outlier detection, in which we look for unusual behaviour.
  • Data detected as outliers can be deleted from the dataset.
  • Outliers can also be marked before using the data in learning methods (a short sketch of both options follows this list).
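
Both options reduce to a boolean mask. A minimal sketch, assuming outlier_mask comes from whichever detector is used (here just the illustrative 3-sigma rule):

import numpy as np
data = np.random.normal(size=1000)
outlier_mask = np.abs(data - data.mean()) > 3 * data.std()  # illustrative detector
cleaned = data[~outlier_mask]            # option 1: delete the outliers
labels = np.where(outlier_mask, -1, 1)   # option 2: mark them (-1 = outlier) and keep everything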

Statistical Methods for Univariate Data

  • Using Standard Deviation Method - zscore
  • Using Interquartile Range Method - IQR

Using Standard Deviation Method

  • If univariate data follows a Gaussian distribution, we can use the standard deviation to figure out how far each point lies from the mean
import numpy as np
from scipy.stats import zscore
data = np.random.normal(size=1000)
  • Adding more outliers: overwrite the last five values with points far in the right tail
data[-5:] = [3.5, 3.6, 4, 3.56, 4.2]
  • Detecting outliers: flag points more than 3 standard deviations from the mean
data[np.abs(zscore(data)) > 3]
array([3.05605991, 3.5       , 3.6       , 4.        , 3.56      ,
       4.2       ])
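
For intuition, the same test can be written without scipy; the z-score is just the distance from the mean in units of standard deviation:

z = (data - data.mean()) / data.std()  # equivalent to scipy's zscore (both default to ddof=0)
data[np.abs(z) > 3]                    # selects the same points as above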

Using Interquartile Range

  • For univariate data that does not follow a Gaussian distribution, the IQR is a way to detect outliers
from scipy.stats import iqr
data = np.random.normal(size=1000)
data[-5:] = [-2, 9, 11, -3, -21]
iqr_value = iqr(data)
# Tukey fences: 1.5 * IQR below the 25th and above the 75th percentile
lower_threshold = np.percentile(data, 25) - iqr_value * 1.5
upper_threshold = np.percentile(data, 75) + iqr_value * 1.5
print(upper_threshold)
print(lower_threshold)
2.743958884560146
-2.7240836639852306
data[np.where(data < lower_threshold)]
array([ -3.15722416,  -2.72563369,  -2.84349424,  -3.        ,
       -21.        ])
data[np.where(data > upper_threshold)]
array([ 2.83383323,  2.7536317 ,  2.98728378,  2.7889204 ,  9.        ,
       11.        ])

Using Gaussian Mixture Models

  • Data might contain more than one peak in its distribution.
  • Trying to fit such multimodal data with a unimodal model won't give a good fit.
  • GMM allows fitting such multimodal data.
  • Configuration involves the number of components in the data, n_components.
  • covariance_type controls the shape of each cluster
    • full : each cluster is modeled as an ellipse with arbitrary orientation
    • spherical : clusters are spherical, as in kmeans
    • diag : cluster ellipses are aligned to the axes
  • We will see how GMM can be used to find outliers (see the score_samples sketch after the plots below)
# Number of samples per component
n_samples = 500

# Generate a random sample with two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
C2 = np.array([[1., -0.1], [2.7, .2]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.dot(np.random.randn(n_samples, 2), C2)]
import matplotlib.pyplot as plt
%matplotlib inline
# Overwrite the last five points with a small far-away blob of outliers
X[-5:] = [[4, -1], [4.1, -1.1], [3.9, -1], [4.0, -1.2], [4.0, -1.3]]
plt.scatter(X[:, 0], X[:, 1], s=5)

[Figure: scatter plot of the two-component sample with the injected outlier blob]

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3)  # the two generating components, plus one that can absorb the outlier blob
gmm.fit(X)
pred = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], s=10, c=pred)

[Figure: scatter plot colored by GMM component assignment]
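
The component coloring above does not by itself flag outliers. A minimal sketch of one way to do it, using score_samples to get each point's log-likelihood under the fitted mixture and cutting at a low percentile (the 1% threshold is illustrative, not from the original notebook):

scores = gmm.score_samples(X)         # per-sample log-likelihood under the mixture
threshold = np.percentile(scores, 1)  # illustrative cut-off: flag the least likely 1%
outliers = X[scores < threshold]
plt.scatter(X[:, 0], X[:, 1], s=5)
plt.scatter(outliers[:, 0], outliers[:, 1], s=20, c='r')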

Fitting an Elliptical Envelope

  • The assumption here is that regular data comes from a known distribution (a Gaussian distribution).
  • Inlier location & covariance are estimated using Mahalanobis distances, which are less impacted by outliers.
  • Calculate a robust covariance fit of the data (a ranking sketch follows the plots below).
from sklearn.datasets import make_blobs
X,_ = make_blobs(n_features=2, centers=2, cluster_std=2.5, n_samples=1000)
plt.scatter(X[:,0], X[:,1],s=10)

[Figure: scatter plot of the two generated blobs]

from sklearn.covariance import EllipticEnvelope
ev = EllipticEnvelope(contamination=.1)  # assume ~10% of points are outliers
ev.fit(X)
cluster = ev.predict(X)  # 1 = inlier, -1 = outlier
plt.scatter(X[:, 0], X[:, 1], s=10, c=cluster)

[Figure: scatter plot colored by EllipticEnvelope inlier/outlier prediction]
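
Besides the hard predict labels, points can be ranked by how far outside the robust fit they fall. A short sketch using decision_function (negative scores lie outside the envelope):

scores = ev.decision_function(X)    # negative = outside the fitted ellipse
worst = X[np.argsort(scores)[:10]]  # the 10 most anomalous points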

Isolation Forest

  • Based on RandomForest.
  • Useful for detecting outliers in high-dimensional datasets.
  • The algorithm randomly selects a feature and then randomly splits it, partitioning the data recursively.
  • Random partitioning produces noticeably shorter paths for anomalies.
  • When a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies (see the scoring sketch after the plot below).
rng = np.random.RandomState(42)

# Generate train data: two tight clusters around (2, 2) and (-2, -2)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
from sklearn.ensemble import IsolationForest
data = np.r_[X_train, X_test, X_outliers]
iso = IsolationForest(contamination='auto')  # the deprecated `behaviour` argument was removed in recent scikit-learn
iso.fit(data)
pred = iso.predict(data)  # 1 = inlier, -1 = outlier
plt.scatter(data[:, 0], data[:, 1], s=10, c=pred)

[Figure: scatter plot colored by IsolationForest prediction]
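
The labels come from thresholding an anomaly score derived from average path lengths; the raw score is available via score_samples. A short sketch:

scores = iso.score_samples(data)                # lower = shorter average paths = more anomalous
most_anomalous = data[np.argsort(scores)[:10]]  # the 10 most isolated samples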

Local Outlier Factor

  • Based on nearest neighbours.
  • Suited for moderately high-dimensional datasets.
  • LOF computes a score reflecting the degree of abnormality of each data point.
  • LOF Calculation
    • Local density is estimated from the k-nearest neighbours.
    • The LOF of each point is the ratio of the average local density of its k-nearest neighbours to its own local density.
    • An abnormal point is expected to have a smaller local density than its neighbours, i.e. an LOF well above 1.
  • LOF tells you not only whether a point is an outlier but how strong an outlier it is relative to the rest of the data (see the ranking sketch after the plot below).
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=25, contamination=.1)
pred = lof.fit_predict(data)              # 1 = inlier, -1 = outlier
s = np.abs(lof.negative_outlier_factor_)  # LOF score: larger = more abnormal
plt.scatter(data[:, 0], data[:, 1], s=s*10, c=pred)  # marker size encodes the LOF score

[Figure: scatter plot with marker size proportional to LOF score, colored by prediction]
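
Since negative_outlier_factor_ stores the negated LOF of each sample, sorting it in ascending order ranks points from most to least abnormal; a small usage sketch:

ranked = np.argsort(lof.negative_outlier_factor_)  # most negative (largest LOF) first
top10_outliers = data[ranked[:10]]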

Outlier Detection using DBSCAN

  • DBSCAN is a density-based clustering method.
  • It groups data points that are close to each other.
  • Unlike centroid-based methods such as kmeans, it doesn't assign points by distance to a cluster centre.
  • Data not close enough to any cluster is not assigned to one; such points can be treated as anomalies (see the extraction sketch after the plot below).
  • eps controls the neighbourhood radius used when deciding whether a point is part of a cluster.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=.3)
dbscan.fit(data)
# Unclustered (noise) points get the label -1; marker size reuses the LOF scores
plt.scatter(data[:, 0], data[:, 1], s=s*10, c=dbscan.labels_)

[Figure: scatter plot colored by DBSCAN cluster labels]
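
The anomaly candidates are exactly the points DBSCAN left unclustered; a minimal extraction sketch:

noise = data[dbscan.labels_ == -1]  # -1 is DBSCAN's label for noise points
print(len(noise), "potential anomalies")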



Reposted from blog.csdn.net/sinat_23971513/article/details/105243561