Data might contain more than one peak in its distribution.
Trying to fit such multi-modal data with a unimodal model won't give a good fit.
A Gaussian Mixture Model (GMM) can fit such multi-modal data.
Configuration involves the number of components in the data, n_components.
covariance_type controls the shape of each cluster:
full: the cluster is modeled as an ellipse oriented in an arbitrary direction
spherical: the cluster is spherical, like k-means
diag: the cluster's ellipse is aligned to the axes
We will see below how a Gaussian model can be used to find outliers (a GMM sketch follows the EllipticEnvelope example).
import numpy as np

# Number of samples per component
n_samples = 500

# Generate a random sample with two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
C2 = np.array([[1., -0.1], [2.7, .2]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          np.dot(np.random.randn(n_samples, 2), C2)]
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.covariance import EllipticEnvelope

# Fit a robust Gaussian (elliptic envelope), marking ~10% of points as outliers
ev = EllipticEnvelope(contamination=.1)
ev.fit(X)
cluster = ev.predict(X)  # 1 = inlier, -1 = outlier
plt.scatter(X[:,0], X[:,1], s=10, c=cluster)
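To connect this back to GMM, here is a minimal sketch (assuming scikit-learn's GaussianMixture and reusing the X generated above) that fits a two-component mixture and flags the lowest-likelihood points as outliers; the 10% cutoff is an arbitrary choice for illustration.

from sklearn.mixture import GaussianMixture

# Fit a two-component mixture; covariance_type='full' allows arbitrarily
# oriented elliptical clusters (try 'spherical' or 'diag' as well)
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)

# Per-sample log-likelihood under the fitted mixture; low values are
# candidate outliers (bottom 10% used here purely as an illustrative cutoff)
scores = gmm.score_samples(X)
outlier = scores < np.percentile(scores, 10)
plt.scatter(X[:,0], X[:,1], s=10, c=outlier)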
Isolation Forest
Based on random forests
Useful for detecting outliers in high-dimensional datasets.
The algorithm randomly selects a feature and a random split value, then partitions recursively.
Random partitioning produces noticeably shorter paths for anomalies.
When a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]

# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
from sklearn.ensemble import IsolationForest

data = np.r_[X_train, X_test, X_outliers]

# contamination='auto' lets the model pick a threshold
# (the old behaviour='new' flag is no longer needed on current scikit-learn)
iso = IsolationForest(contamination='auto')
iso.fit(data)
pred = iso.predict(data)  # 1 = inlier, -1 = outlier
plt.scatter(data[:,0], data[:,1], s=10, c=pred)
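The path-length intuition can also be read off numerically: the fitted forest exposes a per-sample score via score_samples (a sketch; lower values correspond to shorter average path lengths, i.e. stronger anomalies).

# Per-sample anomaly scores: lower (more negative) means more anomalous
scores = iso.score_samples(data)
plt.scatter(data[:,0], data[:,1], s=10, c=scores)
plt.colorbar(label='score_samples (lower = more anomalous)')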
Local Outlier Factor
Based on nearest neighbours
Suited for moderately high-dimensional datasets
LOF computes a score reflecting the degree of abnormality of each data point.
LOF Calculation
Local density is estimated from the k-nearest neighbors.
The LOF of a data point is the ratio of the average local density of its k-nearest neighbors to its own local density.
An abnormal data point is expected to have a smaller local density than its neighbors.
LOF tells you not only that a point is an outlier, but how much of an outlier it is relative to the rest of the data.
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=25, contamination=.1)
pred = lof.fit_predict(data)  # 1 = inlier, -1 = outlier

# negative_outlier_factor_ is -LOF; a larger magnitude means more abnormal
s = np.abs(lof.negative_outlier_factor_)
plt.scatter(data[:,0], data[:,1], s=s*10, c=pred)
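By default, LocalOutlierFactor only scores the data it was fit on. For scoring unseen observations, scikit-learn offers a novelty mode; the sketch below reuses X_train and X_test from the Isolation Forest example and is illustrative only.

# novelty=True enables predict / score_samples on unseen data
lof_novelty = LocalOutlierFactor(n_neighbors=25, novelty=True)
lof_novelty.fit(X_train)
print(lof_novelty.predict(X_test))        # 1 = inlier, -1 = outlier
print(lof_novelty.score_samples(X_test))  # lower = more abnormal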
Outlier Detection using DBSCAN
DBSCAN is a density-based clustering method.
It groups data points that are close to each other.
It doesn't compute distances to cluster centroids the way k-means does.
Data that is not close enough to any cluster is left unassigned, and these unassigned points can be treated as anomalies.
eps controls how close a point must be to a cluster to be considered part of it.
from sklearn.cluster import DBSCAN

# eps: maximum distance for two points to be considered neighbors
dbscan = DBSCAN(eps=.3)
dbscan.fit(data)
# Points labeled -1 are noise, i.e. not assigned to any cluster
plt.scatter(data[:,0], data[:,1], s=s*10, c=dbscan.labels_)
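To pull out the candidate anomalies explicitly, the noise points are simply the ones DBSCAN labels -1 (a short follow-up using the model fitted above).

# Noise points (label -1) are the candidate anomalies
outliers = data[dbscan.labels_ == -1]
print(len(outliers), 'points were not assigned to any cluster')
plt.scatter(data[:,0], data[:,1], s=10, c='lightgray')
plt.scatter(outliers[:,0], outliers[:,1], s=20, c='red')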