20 Common Data Value Anomaly Inspection Methods


Data value anomalies are values in a data set that are inconsistent with the rest of the data. These outliers may be caused by data entry errors, measurement bias, or other unknown factors. Because abnormal values can distort the results of data analysis and modeling, they need to be detected and dealt with.

Classification of inspection methods

Common data value anomaly inspection methods can be grouped according to different criteria. Some of these groupings are listed below:

  • Classification based on statistical methods and machine learning methods:

Statistical methods: Z-Score test, Grubbs test, Dixon test, boxplot test, etc.

Machine learning methods: Isolation Forest, One-Class SVM, LOF, ABOD, HBOS, COF, CBLOF, etc.

  • Classification based on data distribution:

Methods that assume a normal distribution: Z-Score test, Grubbs test, Dixon test, etc.

Methods that do not assume a normal distribution: boxplot test, Isolation Forest, One-Class SVM, LOF, ABOD, HBOS, COF, CBLOF, etc.

  • Classification based on distance and density:

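Methods based on distance, in the broad sense of how far a point lies from the rest of the data: Grubbs test, Dixon test, etc. Isolation Forest is sometimes grouped here as well, although it isolates points with random splits rather than by computing distances.

Methods based on density or local neighborhood structure: LOF, HBOS, COF, CBLOF, etc. ABOD is often listed with them, although it scores points by angles rather than by density.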

These groupings are not fixed: the same method can fall into different groups depending on the criterion, and some methods have features of several groups at once. In a specific application, the data distribution, the number of dimensions, and the type of anomaly expected should all be considered when selecting an anomaly detection method.

20 methods for testing data value anomalies

1. Box plot test

A boxplot is a way to visualize the distribution of data. It displays the median, the first quartile (Q1), and the third quartile (Q3) of a data set, with whiskers reaching to the smallest and largest values that are not outliers. In the boxplot, any data point more than 1.5 times the interquartile range (IQR = Q3 - Q1) below Q1 or above Q3 is considered an outlier.
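
The 1.5 x IQR rule can be sketched in a few lines of Python. This is only an illustration: the function name iqr_outliers and the sample data are made up, and the "inclusive" quartile method is one of several conventions, so fences can differ slightly between tools.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # made-up sample
print(iqr_outliers(data))
```

On this sample only the value 102 falls outside the fences.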

2. Grubbs test

The Grubbs test is a statistical test for a single outlier in a univariate data set. It assumes that the data are normally distributed and calculates the statistic G, which is the largest absolute deviation from the sample mean divided by the sample standard deviation. The most extreme data point is considered an outlier if G exceeds a critical value derived from the t-distribution for the chosen significance level.
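
As a rough sketch, the Grubbs statistic G is the largest absolute deviation from the sample mean divided by the sample standard deviation, and it is compared with a critical value built from a t-distribution quantile. The function names and data below are hypothetical. The Python standard library has no t-distribution, so the quantile t_crit is left as an input the caller must supply.

```python
import math
import statistics

def grubbs_statistic(values):
    """Return the most extreme value and its Grubbs statistic G."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    suspect = max(values, key=lambda v: abs(v - mean))
    return suspect, abs(suspect - mean) / std

def grubbs_critical_value(n, t_crit):
    """Two-sided critical value for sample size n.

    t_crit is assumed to be the upper alpha / (2 * n) quantile of the
    t-distribution with n - 2 degrees of freedom, obtained elsewhere,
    e.g. scipy.stats.t.ppf(1 - alpha / (2 * n), n - 2) if SciPy is available.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 14.0]  # made-up sample
suspect, g = grubbs_statistic(data)
print(suspect, round(g, 3))
```

The suspect value is declared an outlier only if G is larger than the critical value for the sample size and significance level in use.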

3. Z-score test
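Z-score test is a standard deviation-based method for detecting outliers in a data set. The method first calculates the mean and standard deviation of the dataset, and then calculates the Z-score for each data point, that is, its distance from the mean measured in standard deviations. A data point is commonly considered an outlier if the absolute value of its Z-score exceeds 3. Because the mean and standard deviation are themselves pulled by extreme values, the test works best when the data are roughly normal and outliers are few.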
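
A minimal sketch of the Z-score rule follows. The function name and the sample are invented for illustration, and the population standard deviation is used here as an assumption; using the sample standard deviation instead changes the scores slightly.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up sample: 19 readings near 50 and one extreme reading.
data = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50,
        51, 49, 50, 52, 48, 50, 51, 49, 50, 120]
print(zscore_outliers(data))
```

With the population standard deviation, no Z-score in a sample of n points can exceed the square root of n - 1, so a threshold of 3 can never be crossed in samples of 10 points or fewer.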

4. Tukey's test
Tukey's test (Tukey's fences) is a method based on the quartiles and the interquartile range, and it is the rule behind the box plot test. With IQR = Q3 - Q1, a data point below Q1 - k * IQR or above Q3 + k * IQR is considered an outlier. The usual choice is k = 1.5, and k = 3 is often used to mark extreme outliers.

5. Cook's distance test
Cook's distance test is a method for detecting influential points in regression models, especially multiple linear regression. For each data point it measures how much the fitted regression changes when that point is removed, combining the size of the point's residual with its leverage. If a data point has a significantly greater influence on the regression than other data points, it is flagged and examined as a possible outlier.
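
For a straight-line fit with one predictor, Cook's distance has a closed form that needs no matrix library. The sketch below uses D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, s^2 the residual variance, and p = 2 the number of coefficients. The function name, the data, and the 4 / n cutoff (a common rule of thumb, not a strict test) are assumptions for illustration.

```python
def cooks_distances(xs, ys):
    """Cook's distance of every point in a simple linear regression y = a + b * x."""
    n = len(xs)
    p = 2  # number of fitted coefficients: intercept and slope
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)  # residual variance s^2
    leverages = [1 / n + (x - x_mean) ** 2 / sxx for x in xs]
    return [e * e / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]

# Made-up data: roughly y = 2x + 1, with the last response far too large.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 17.2, 18.9, 35.0]
distances = cooks_distances(xs, ys)
flagged = [i for i, d in enumerate(distances) if d > 4 / len(xs)]
print(flagged)
```

Only the last point (index 9) exceeds the 4 / n cutoff here, because it has both a large residual and high leverage.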

6. Mahalanobis distance test
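The Mahalanobis distance test is a method for detecting whether there are outliers in a multivariate data set. This method calculates the distance of each data point from the sample mean, scaled by the sample covariance matrix so that correlations between variables and differences in their variances are taken into account. A data point is considered an outlier if its distance is significantly greater than that of other data points; a quantile of the chi-square distribution is often used as the cutoff for the squared distance.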
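
The sketch below computes Mahalanobis distances for two-dimensional points only, so that the 2 x 2 covariance matrix can be inverted by hand without a linear algebra library. The function name and the data are illustrative assumptions. It ranks the points by distance and does not apply a cutoff.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2 x 2.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Made-up data: points close to the line y = x, plus one point far off it.
points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 5.1),
          (6, 5.8), (7, 7.1), (8, 8.0), (2, 7)]
distances = mahalanobis_2d(points)
print(distances.index(max(distances)))
```

The point (2, 7) is not extreme in either coordinate taken alone, but it breaks the correlation between the two, which is why it gets the largest Mahalanobis distance.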

7. Hampel test
Hampel test is a method based on the median and the median absolute deviation (MAD), which is used to detect whether there are outliers in the data set. This method first calculates the median of the data set, then the absolute difference between each data point and the median, and takes the median of those differences as the MAD. If the absolute difference of a data point exceeds a specific threshold, commonly 3 times the MAD scaled by 1.4826, the data point is considered an outlier. Because the median and the MAD are barely affected by extreme values, the test is more robust than the Z-score test.
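
A short sketch of a median-based check of this kind is shown below. The function name and data are made up. The factor 1.4826 makes the MAD comparable to a standard deviation for normal data, and the threshold of 3 is a common but adjustable choice.

```python
import statistics

def hampel_outliers(values, threshold=3.0):
    """Return the values lying more than threshold scaled MADs from the median."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # more than half of the values are identical
    scaled_mad = 1.4826 * mad
    return [v for v, d in zip(values, deviations) if d > threshold * scaled_mad]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # made-up sample
print(hampel_outliers(data))
```

On this sample only 102 is flagged, and the extreme value has almost no effect on the median or the MAD used to judge it.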

8. LOF (Local Outlier Factor) test
LOF test is a density-based method for detecting outliers in a data set. This method calculates a local outlier factor for each data point by comparing the density around that data point with the density around its nearest neighbors. A factor close to 1 means the point is about as dense as its neighbors; a data point is considered an outlier if its local outlier factor is significantly larger than 1.

9. Isolation Forest test
Isolation Forest test is a method based on an ensemble of random trees for detecting outliers in the data set. Each tree is built on a random subsample of the data by repeatedly selecting a random feature and a random split value, which partitions the points until each one is isolated. Outliers are few and different, so they tend to be isolated after fewer splits. A data point is considered an outlier if its average path length across the trees is significantly shorter than that of other data points.

10. HBOS (Histogram-Based Outlier Score) test
HBOS test is a histogram-based method for detecting outliers in a data set. The method first divides the range of each feature into intervals and counts the number of data points in each interval. The frequency of the interval in which each data point falls is then calculated and used to score that data point. A data point is considered an outlier if the frequencies of its intervals are significantly lower than those of the other data points.


11. One-class SVM (Support Vector Machine) test
One-class SVM test is a method based on support vector machines, which is used to detect whether there are outliers in the data set. The method learns a boundary that encloses most of the data, and then evaluates its decision function on each data point. A data point is considered an outlier if its decision value is negative, that is, if it falls outside the learned boundary.

12. LOCI (Local Correlation Integral) test
Local Correlation Integral test is a method based on local neighborhood counts to detect whether there are outliers in the data set. For each data point, the method counts the neighbors within a small radius and compares this count with the average count of the points in a larger surrounding neighborhood, repeating the comparison over a range of radii. The relative difference is called the multi-granularity deviation factor (MDEF). A data point is considered an outlier if, at some radius, its MDEF is significantly larger than the normal variation of MDEF in its neighborhood, typically more than 3 standard deviations.

13. Ridge regression test
Ridge regression test is a regression model-based method for detecting outliers in a data set. This method fits a ridge regression model (linear regression with an L2 penalty on the coefficients) to the dataset and detects outliers based on the prediction error of the model. A data point is considered an outlier if its absolute prediction error is significantly higher than the others.

14. Robust PCA (Robust Principal Component Analysis) test
Robust PCA test is a method based on principal component analysis to detect whether there are outliers in the data set. The method decomposes the data into a low-rank part that captures the main structure and a sparse residual part that absorbs gross deviations, and uses the residuals to detect outliers. A data point is considered an outlier if its residuals are significantly larger than the others.

15. MCD (Minimum Covariance Determinant) test
The MCD test is a method based on a robust covariance matrix, which is used to detect whether there are outliers in the data set. The method estimates the mean and covariance from the subset of data points whose covariance matrix has the smallest determinant, so that extreme values do not distort the estimate, and then calculates the Mahalanobis distance of every data point with respect to these robust estimates. A data point is considered an outlier if its Mahalanobis distance is significantly greater than that of other data points.


16. LOF (Local Outlier Factor) test
LOF test, already introduced as method 8, is a local density-based method for detecting outliers in a data set. In terms of computation, the method first calculates the local density of each data point from the distances to its k nearest neighbors, and then calculates the local outlier factor as the ratio of the neighbors' average density to the point's own density. A data point is considered an outlier if its local outlier factor is significantly higher than that of other data points.
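
The two steps, local density first and then the factor relative to the neighbors, can be sketched in plain Python. This is a simplified version under stated assumptions: the function name and data are made up, all points are assumed distinct, and ties at the k-th neighbor distance are cut off at exactly k neighbors, which the full LOF definition does not do.

```python
import math

def lof_scores(points, k=3):
    """Simplified Local Outlier Factor for a small list of points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])             # the k nearest neighbors of i
        k_dist.append(dist[i][order[k - 1]])    # distance to the k-th neighbor
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Made-up data: a tight group of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=3)
print(scores.index(max(scores)))
```

The points in the group get scores close to 1, while the distant point at index 5 gets a score several times larger.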

17. ABOD (Angle-based Outlier Detection) test
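ABOD test is an angle-based method for detecting whether there are outliers in the data set. For each data point, the method considers the angles formed between that point and every pair of other data points, weighted by the distances involved, and calculates the variance of these angles. A point inside a cluster sees the other points in many different directions, so the variance is large; a point far from the rest sees them all in a similar direction, so the variance is small. A data point is considered an outlier if its angle variance is significantly smaller than that of other data points.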
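
The angle-based idea can be sketched as follows. In the usual ABOD formulation a small variance of the distance-weighted angles marks an outlier, because a far-away point sees all other points in nearly the same direction. The code is a simplified, brute-force variant: the function name and data are hypothetical, the points are assumed distinct, and a plain variance is used instead of the weighted variance of the original definition.

```python
import itertools
import statistics

def abod_scores(points):
    """Variance of distance-weighted angle terms per point; low means outlying."""
    scores = []
    for i, a in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        terms = []
        for b, c in itertools.combinations(others, 2):
            ab = [bj - aj for aj, bj in zip(a, b)]
            ac = [cj - aj for aj, cj in zip(a, c)]
            dot = sum(u * v for u, v in zip(ab, ac))
            # Dot product scaled by both squared lengths, so distant pairs count less.
            terms.append(dot / (sum(u * u for u in ab) * sum(v * v for v in ac)))
        scores.append(statistics.pvariance(terms))
    return scores

# Made-up data: a tight group of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = abod_scores(points)
print(scores.index(min(scores)))
```

The cost grows with the cube of the number of points, so this form is only practical for small data sets.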

18. HBOS (Histogram-based Outlier Score) test
HBOS test, already introduced as method 10, is a histogram-based method for detecting outlier points in the data set. This method first builds a histogram for each feature by dividing its range into several intervals, and then looks up the height of the interval into which each data point falls. Finally, an outlier score is calculated for each data point by combining the inverse heights over all features, so that points in sparsely populated intervals receive high scores. A data point is considered an outlier if its outlier score is significantly higher than that of other data points.
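
A minimal sketch of a histogram-based score is given below, assuming equal-width bins and a score equal to the sum over features of log(1 / normalized bin height). The function name, the bin count, and the data are illustrative choices, not fixed parts of the method.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Histogram-based outlier score per row; higher means more anomalous."""
    scores = [0.0] * len(rows)
    for f in range(len(rows[0])):
        column = [row[f] for row in rows]
        low, high = min(column), max(column)
        width = (high - low) / n_bins or 1.0  # avoid zero width for constant features
        # Bin index of every value; the maximum value is clamped into the last bin.
        bins = [min(int((v - low) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            height = counts[b] / tallest  # normalized so the tallest bin is 1
            scores[i] += math.log(1 / height)
    return scores

# Made-up data: six similar rows and one row that is unusual in both features.
rows = [(1.0, 10.0), (1.1, 10.2), (0.9, 9.9), (1.2, 10.1),
        (1.0, 10.0), (1.1, 9.8), (5.0, 20.0)]
scores = hbos_scores(rows)
print(scores.index(max(scores)))
```

Each feature is scored independently, which keeps the method fast but means it cannot see anomalies that only show up in the combination of features.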

19. COF (Connectivity-based Outlier Factor) test
COF test is a connectivity-based method used to detect whether there are outliers in the data set. This method first links each data point to its nearest neighbors step by step and calculates the average chaining distance along this path, and then calculates the COF score as the ratio of the point's average chaining distance to the mean average chaining distance of its neighbors. A data point is considered an outlier if its COF score is significantly higher than the others.

20. CBLOF (Clustering-based Local Outlier Factor) test
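CBLOF test is a clustering-based method for detecting whether there are outliers in the data set. The method first clusters the data set and separates the clusters into large and small ones, and then calculates the local outlier factor of each data point from the size of its cluster and its distance to the center of its own cluster or, for points in small clusters, to the nearest large cluster. A data point is considered an outlier if its local outlier factor is significantly higher than that of other data points.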

The above are the twenty common data value anomaly detection methods. In practical applications, different data sets and different kinds of anomalies may require different inspection methods, and the method and its thresholds need to be selected and adjusted using domain knowledge and the characteristics of the data.

Origin blog.csdn.net/tuposky/article/details/130436190