去极值Detect Outliers的几种方案:MAD、3sigma

  • 异常值检测Detect Outliers

    In statistics, outliers are data points that don’t belong to a certain population. It is an abnormal observation that lies far away from other values. An outlier is an observation that diverges from othervise well-structured data.

    There are several ways to detect anomalies.

    Detect Outlier这个概念,更多是用在machine learning处理数据时。

    去极值是一个更广泛的概念,极值是异常值的一种,先找出极值(异常值),再去掉极值,算是一个完整的“去极值”过程。

  • Detect Anomalies

  • 1.Standard Deviation

    For a data distribution is approximately normal then about 68% of the data values lie within one standard deviation of the mean and about 95% are within two standard deviations, and about 99.7% lie within three standard deviations.

  • 2.Boxplots

    Interquartile Range

  • 3.DBScan Clustring

    DBScan is a clustering algorithm that’s used cluster data into groups.

  • 4.Isolation Forest

    Isolation Forest is an unsupervised learning algorithm that belongs to the ensemble decision trees family.

  • 5.Robust Random Cut Forest

    Random Cut Forest (RCF) algorithm is Amazon’s unsupervised algorithm for detecting anaomalies.

  • 6.Minimum Covariance Determinant
  • 7.Local Outlier Factor

    The local outlier factor(LOF) is a technique that attempts to harness the idean of nearest neighbors for outlier detection.

  • 8.One-Class SVM
  • 9.Z-Score
  • Anomaly Detection vs. Outlier detection

    Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting abnormal observations.

  • Univariate vs. Multivariate

    Univariate outlier can be found when looking at a distribution of values in a single feature space.

    Multivariate outliers can be found in a n-dimensional space (of n-features).

  • MAD vs. 3sigma

  • MAD

    Median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data.

    This indicator was initially developed by statisticians but is relatively unknown in psychology.

    MAD is defined as the median of the absolute deviations from the data’s median.
    M A D = m e d i a n ( ∣ X i − X ^ ∣ ) MAD=median(|X_i-\hat X|) MAD=median(XiX^)
    X ^ = m e d i a n ( X ) \hat X=median(X) X^=median(X)

    The median (M) is , like the mean, a measure of central tendency but offers the advantage of being very insensitive to the presence of outliers. One indicator of this insensitivity is the “breakdown point”

  • 3 σ 3\sigma 3σ

    This method of the mean plus or minus three SD is based on the characteristics of a normal distribution for which 99.87% of the data appear within this range.

    Three problems can be identified when using the mean as the central tendency indicator:

    1. It asumes that the distribution is normal (outliers included).
    2. The mean and standard deviation are strongly impacted by outliers.
    3. It’s very unlikely to detect outliers in small samples
  • References

  1. 5 Ways to Detect Outliers/Anomalies That Every Data Scientist Should Know (Python Code)
  2. Ways to Detect and Remove the Outliers
  3. How to Remove Outliers for Machine Learning
  4. 4 Automatic Outlier Detection Algorithms in Python
  5. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median
  6. A Brief Overview of Outlier Detection Techniques
  7. 1.3.5.17.Detection of Outliers

猜你喜欢

转载自blog.csdn.net/The_Time_Runner/article/details/109171606
MAD
今日推荐