去极值Detect Outliers的几种方案：MAD、3sigma

异常值检测Detect Outliers

In statistics, outliers are data points that don’t belong to a certain population. It is an abnormal observation that lies far away from other values. An outlier is an observation that diverges from othervise well-structured data.

There are several ways to detect anomalies.

Detect Outlier这个概念，更多是用在machine learning处理数据时。

去极值是一个更广泛的概念，极值是异常值的一种，先找出极值（异常值），再去掉极值，算是一个完整的“去极值”过程。
Detect Anomalies
1.Standard Deviation

For a data distribution is approximately normal then about 68% of the data values lie within one standard deviation of the mean and about 95% are within two standard deviations, and about 99.7% lie within three standard deviations.
2.Boxplots

Interquartile Range
3.DBScan Clustring

DBScan is a clustering algorithm that’s used cluster data into groups.
4.Isolation Forest

Isolation Forest is an unsupervised learning algorithm that belongs to the ensemble decision trees family.
5.Robust Random Cut Forest

Random Cut Forest (RCF) algorithm is Amazon’s unsupervised algorithm for detecting anaomalies.
6.Minimum Covariance Determinant
7.Local Outlier Factor

The local outlier factor(LOF) is a technique that attempts to harness the idean of nearest neighbors for outlier detection.
8.One-Class SVM
9.Z-Score
Anomaly Detection vs. Outlier detection

Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting abnormal observations.
Univariate vs. Multivariate

Univariate outlier can be found when looking at a distribution of values in a single feature space.

Multivariate outliers can be found in a n-dimensional space (of n-features).
MAD vs. 3sigma
MAD

Median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data.

This indicator was initially developed by statisticians but is relatively unknown in psychology.

MAD is defined as the median of the absolute deviations from the data’s median.
$MAD=median(|X_i-\hat X|)$
$\hat X=median(X)$

The median (M) is , like the mean, a measure of central tendency but offers the advantage of being very insensitive to the presence of outliers. One indicator of this insensitivity is the “breakdown point”
$3\sigma$

This method of the mean plus or minus three SD is based on the characteristics of a normal distribution for which 99.87% of the data appear within this range.

Three problems can be identified when using the mean as the central tendency indicator:
1. It asumes that the distribution is normal (outliers included).
2. The mean and standard deviation are strongly impacted by outliers.
3. It’s very unlikely to detect outliers in small samples
References

去极值Detect Outliers的几种方案：MAD、3sigma

异常值检测Detect Outliers

Detect Anomalies

1.Standard Deviation

2.Boxplots

3.DBScan Clustring

4.Isolation Forest

5.Robust Random Cut Forest

6.Minimum Covariance Determinant

7.Local Outlier Factor

8.One-Class SVM

9.Z-Score

Anomaly Detection vs. Outlier detection

Univariate vs. Multivariate

MAD vs. 3sigma

MAD

References

猜你喜欢