Outlier detection methods (Z-score, DBSCAN, Isolation Forest)

The quality of data preprocessing largely determines the quality of a model's analysis results (garbage in, garbage out).

Outlier detection is a very important part of data preprocessing, and there are many different methods for it. A classical statistical method, for example, treats data more than three standard deviations away from the mean as outliers.

Unlike deduplication or missing-value handling, outlier detection involves a certain degree of subjectivity. So I would like to ask the experts: which values or methods do you usually rely on to detect outliers?

Author: Alibaba Cloud Yunqi Community
Link: https://www.zhihu.com/question/38066650/answer/549125707
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.

Four common outlier detection methods: Numeric Outlier, Z-Score, DBSCAN and Isolation Forest

When training a machine learning algorithm or applying statistical techniques, erroneous values and outliers can be a serious problem. They are usually the result of measurement errors or abnormal system conditions, and therefore do not describe the characteristics of the underlying system. Best practice is to run an outlier removal step before proceeding with further analysis.

In some cases, however, outliers carry information about localized anomalies in the overall system. Detecting them is therefore a valuable process in its own right, because it can provide additional information about the data set.

There are many techniques for detecting outliers, and you can then choose whether to remove them from the data set. This blog post shows four of the most commonly used outlier detection techniques, implemented in KNIME Analytics Platform.

Data set and outlier detection problem

The data set used to test and compare the outlier detection techniques is the airline data set. It contains information on US domestic flights between 2007 and 2012, such as departure time, arrival time, origin airport, destination airport, air time, departure delay, arrival delay, flight number and so on. Some of these columns may contain outliers.

From the original data set, we extracted a random sample of 1,500 flights departing from Chicago O'Hare Airport (ORD) in 2007 and 2008.

To show how the selected outlier detection techniques work, we focus on finding outliers in the average arrival delay per airport, calculated over all flights landing at a given airport. We are looking for airports that show unusual average arrival delays.

Four outlier detection techniques

Numeric Outlier

The Numeric Outlier method is the simplest nonparametric outlier detection technique for a one-dimensional feature space. Outliers are calculated by means of the IQR (interquartile range).

The first and third quartiles (Q1, Q3) are calculated. An outlier is then a data point $x_i$ that lies outside the interquartile range:

$$x_i > Q_3 + k \cdot \mathrm{IQR} \quad \text{or} \quad x_i < Q_1 - k \cdot \mathrm{IQR}$$

where $\mathrm{IQR} = Q_3 - Q_1$ and $k \ge 0$.

With the interquartile multiplier k = 1.5, the lower and upper limits correspond to the whiskers of a typical box plot. This technique was implemented with the Numeric Outliers node in a KNIME Analytics Platform workflow (see Figure 1).
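Outside KNIME, the same IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not the Numeric Outliers node itself: the function name `iqr_outliers` and the delay values are hypothetical, and NumPy's default linear interpolation is assumed for the quartiles.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    the range [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical mean arrival delays in minutes, one value per airport.
delays = [5.0, 7.5, 6.0, 8.0, 4.5, 9.0, 6.5, 180.0]
mask = iqr_outliers(delays)
print(mask)  # only the last value (180.0) is flagged
```

For this sample the quartiles are 5.75 and 8.25, so the limits are 2.0 and 12.0 and only the 180-minute value falls outside them.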

Z-score

Z-score is a parametric outlier detection method for a one-dimensional or low-dimensional feature space. The technique assumes that the data follow a Gaussian distribution. Outliers are the data points in the tails of the distribution, and are therefore far from the mean. How far depends on a threshold $z_{thr}$ set on the normalized data points $z_i$, which are calculated with the formula:

$$z_i = \frac{x_i - \mu}{\sigma}$$

where $x_i$ is a data point, $\mu$ is the mean of all points $x_i$ and $\sigma$ is the standard deviation of all points $x_i$.

After normalization, an outlier is a data point whose normalized value has an absolute value greater than $z_{thr}$:

$$|z_i| > z_{thr}$$

$z_{thr}$ is commonly set to 2.5, 3.0 or 3.5. This technique was implemented with the Row Filter node in the KNIME workflow (see Figure 1).
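As a rough Python counterpart of the Z-score rule, the sketch below normalizes the values and flags those whose absolute z-score exceeds the threshold. The function name `zscore_outliers` and the synthetic delays are assumptions made for illustration. The population standard deviation is used here, which is one possible reading of "standard deviation of all points".

```python
import numpy as np

def zscore_outliers(values, z_thr=3.0):
    """Return a boolean mask that is True where |z_i| > z_thr."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0)
    z = (values - mu) / sigma
    return np.abs(z) > z_thr

# Synthetic data: 50 roughly Gaussian delays plus one extreme value.
rng = np.random.default_rng(0)
delays = np.append(rng.normal(loc=6.0, scale=2.0, size=50), 180.0)
mask = zscore_outliers(delays)
print(delays[mask])
```

Note that with very few data points no value can reach a z-score of 3 at all, which is why the sample here is larger than a handful of values.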

 
 
DBSCAN

This technique is based on the DBSCAN clustering method. DBSCAN is a nonparametric, density-based outlier detection method for a one-dimensional or multi-dimensional feature space.

In DBSCAN clustering, every data point is classified as a core point, a border point or a noise point.

  • Core points are data points that have at least MinPts neighboring data points within a distance ℇ;
  • Border points are neighbors of a core point within the distance ℇ, but have fewer than MinPts neighbors within the distance ℇ;
  • All other data points are noise points, which are also identified as outliers.

Outlier detection therefore depends on the required number of neighbors MinPts, the distance ℇ and the selected distance measure, such as Euclidean or Manhattan. This technique was implemented with the DBSCAN node in the KNIME workflow (see Figure 1).
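The noise-point idea can be tried with scikit-learn's `DBSCAN`, as in the sketch below. This is an illustration under assumptions, not the KNIME DBSCAN node: the delay values are hypothetical, `eps` stands in for ℇ and `min_samples` for MinPts. In scikit-learn, `min_samples` counts the point itself, which may differ from other implementations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical mean arrival delays in minutes, one row per airport.
delays = np.array(
    [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, -20.0, 180.0]
).reshape(-1, 1)

# eps plays the role of the distance ℇ, min_samples the role of MinPts.
db = DBSCAN(eps=1.5, min_samples=3, metric="euclidean").fit(delays)

# scikit-learn marks noise points with the cluster label -1.
is_outlier = db.labels_ == -1
print(delays[is_outlier].ravel())  # the isolated values -20.0 and 180.0
```

Because density is evaluated on both sides of the main cluster, the early-arrival value (-20.0) is flagged as noise just like the large delay.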

Isolation Forest

This is a nonparametric method for large data sets in a one-dimensional or multi-dimensional feature space. An important concept in this method is the isolation number.

The isolation number is the number of splits needed to isolate a data point. This number of splits is determined by the following steps:

  • A point "a" to isolate is selected randomly;
  • A random data point "b" is selected that lies between the minimum and maximum values and is different from "a";
  • If the value of "b" is lower than the value of "a", the value of "b" becomes the new lower limit;
  • If the value of "b" is greater than the value of "a", the value of "b" becomes the new upper limit;
  • The procedure is repeated as long as there are data points other than "a" between the upper and lower limits.

Isolating an outlier requires fewer splits than isolating a non-outlier, i.e. an outlier has a lower isolation number than a non-outlier point. A data point is therefore defined as an outlier if its isolation number is below a threshold.
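The splitting procedure can be simulated directly for one-dimensional data. The sketch below is a simplified illustration of the steps listed above, not scikit-learn's tree-based implementation. The function names and delay values are hypothetical, and "b" is drawn uniformly between the current limits.

```python
import random

def isolation_number(data, a, rng):
    """Count the random splits needed until `a` is the only point
    left between the lower and upper limits."""
    lower, upper = min(data), max(data)
    splits = 0
    # Repeat while any point other than `a` is still inside the limits.
    while any(lower <= x <= upper and x != a for x in data):
        b = rng.uniform(lower, upper)
        if b == a:
            continue  # "b" must differ from "a"
        if b < a:
            lower = b  # "b" becomes the new lower limit
        else:
            upper = b  # "b" becomes the new upper limit
        splits += 1
    return splits

rng = random.Random(42)
delays = [5.0, 7.5, 6.0, 8.0, 4.5, 9.0, 6.5, 180.0]

def mean_isolation_number(a, trials=500):
    """Average the isolation number of `a` over repeated random runs."""
    return sum(isolation_number(delays, a, rng) for _ in range(trials)) / trials

outlier_score = mean_isolation_number(180.0)
inlier_score = mean_isolation_number(6.5)
print(outlier_score, inlier_score)
```

Averaged over many runs, the extreme value needs far fewer splits than a value in the middle of the cluster, which always needs at least two (one from each side).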

The threshold is defined based on the estimated percentage of outliers in the data, which is the starting point of this outlier detection algorithm. Illustrated explanations of the Isolation Forest technique are available online.

This technique was implemented in the KNIME workflow with a few lines of Python code inside a Python Script node.

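```python
from sklearn.ensemble import IsolationForest
import pandas as pd

# input_table is supplied by the KNIME Python Script node.
clf = IsolationForest(max_samples=100, random_state=42)
# Keep the mean arrival delay as a one-column DataFrame.
table = pd.concat([input_table['Mean(ArrDelay)']], axis=1)
clf.fit(table)
# predict() returns 1 for inliers and -1 for outliers.
output_table = pd.DataFrame(clf.predict(table))
```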

The Python Script node is part of the KNIME Python Integration, which allows you to write or import Python code into a KNIME workflow.
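To experiment with the same estimator outside KNIME, a self-contained variant might look like the following sketch. The synthetic `input_table` is a stand-in for the table that the Python Script node would provide. The `contamination=0.1` setting is an assumption that mirrors the 10% estimated percentage of outliers reported in the results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the KNIME input table: 99 ordinary airports
# plus one airport with a very large mean arrival delay.
rng = np.random.default_rng(0)
input_table = pd.DataFrame(
    {"Mean(ArrDelay)": np.append(rng.normal(6.0, 2.0, size=99), 180.0)}
)

# contamination is the estimated share of outliers in the data.
clf = IsolationForest(max_samples=100, contamination=0.1, random_state=42)
labels = clf.fit_predict(input_table[["Mean(ArrDelay)"]])  # 1 inlier, -1 outlier

output_table = input_table.assign(is_outlier=labels == -1)
print(output_table[output_table["is_outlier"]])
```

Because `contamination` fixes the share of points that get flagged, roughly 10% of the rows are marked as outliers here, including ordinary values at the edges of the cluster. Choosing that percentage is part of the modeling decision.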

Implementation in a KNIME workflow

KNIME Analytics Platform is open-source software for data science that covers all data needs, from data ingestion and data blending to data visualization, from machine learning algorithms to data applications, from reporting to deployment, and so on. It is based on a graphical user interface for visual programming, which makes it very intuitive and easy to use and considerably reduces the learning time.

It is also designed to be open to different data formats, data types, data sources, data platforms and external tools (e.g. R and Python), and it includes many extensions for the analysis of unstructured data such as text, images or graphs.

The computing units in KNIME Analytics Platform are small colorful blocks called "nodes". Assembling nodes into a pipeline, one after the other, implements a data processing application. A pipeline is also called a "workflow".

Given all these characteristics, we chose it to implement the four outlier detection techniques described above. Figure 1 shows the outlier detection workflow, which:

  • 1. Reads the data sample inside the Read data metanode;
  • 2. Preprocesses the data and calculates the average arrival delay per airport inside the Preproc metanode;
  • 3. In the next metanode, called Density of delay, normalizes the data and compares the density of the normalized average arrival delays with the density of a standard normal distribution;
  • 4. Detects outliers using the four selected techniques;
  • 5. Displays the outlier airports on a map of the United States in the MapViz metanode, using the KNIME integration with Open Street Maps.
Figure 1: Workflow implementing the four outlier detection techniques: Numeric Outlier, Z-score, DBSCAN and Isolation Forest
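Steps 2 and 3 of the workflow, the per-airport average and its normalization, could be approximated in pandas as sketched below. The airport codes, the delay values and the column names `Dest` and `ArrDelay` are assumptions made for illustration. Only the output column name `Mean(ArrDelay)` is taken from the Python Script code shown earlier.

```python
import pandas as pd

# Hypothetical flight records; "Dest" and "ArrDelay" are assumed column names.
flights = pd.DataFrame({
    "Dest": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "ArrDelay": [170.0, 190.0, 60.0, 80.0, -12.0, -8.0],
})

# Step 2: average arrival delay per destination airport.
mean_delay = (
    flights.groupby("Dest")["ArrDelay"]
    .mean()
    .rename("Mean(ArrDelay)")
    .reset_index()
)

# Step 3: z-score normalization of the per-airport averages.
col = mean_delay["Mean(ArrDelay)"]
mean_delay["Normalized"] = (col - col.mean()) / col.std(ddof=0)
print(mean_delay)
```

The resulting one-column-per-airport table is the kind of input that each of the four detection techniques operates on.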

Detected outliers

Figures 2-5 show the outlier airports detected by the different techniques. Blue circles represent airports with no outlier behavior, while red squares represent airports with outlier behavior. The average arrival delay defines the size of the markers.

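Some airports are consistently identified as outliers by all four techniques: Spokane International Airport (GEG), University of Illinois Willard Airport (CMI) and Columbia Metropolitan Airport (CAE). Spokane International Airport (GEG) is the biggest outlier, with a very large average arrival delay (180 minutes). Some other airports, however, are identified only by some of the techniques. For example, Louis Armstrong New Orleans International Airport (MSY) is spotted only by the Isolation Forest and DBSCAN techniques.

For this particular problem, the Z-Score technique identifies the smallest number of outliers, while the DBSCAN technique identifies the largest number of outlier airports. Only the DBSCAN method (MinPts = 3, ℇ = 1.5, Euclidean distance measure) and the Isolation Forest technique (estimated percentage of outliers 10%) find outliers in the early-arrival direction.

Figure 2: Outlier airports detected by the Numeric Outlier technique
Figure 3: Outlier airports detected by the Z-score technique
Figure 4: Outlier airports detected by the DBSCAN technique
Figure 5: Outlier airports detected by the Isolation Forest technique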

  

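Summary

This post described and implemented four different outlier detection techniques in a one-dimensional space: the average arrival delay for all US airports between 2007 and 2008. The four techniques investigated are Numeric Outlier, Z-Score, DBSCAN and Isolation Forest. Some of them work for a one-dimensional feature space, some for low-dimensional spaces and some for high-dimensional spaces. Some techniques require normalization and a check for a Gaussian distribution of the inspected dimension. Some require a distance measure, and some require the calculation of the mean and standard deviation. Three airports are identified as outliers by all of the outlier detection techniques. However, only some of the techniques (namely DBSCAN and Isolation Forest) can identify outliers in the left tail of the distribution, i.e. those airports where, on average, flights arrive earlier than their scheduled arrival time. The appropriate detection technique should therefore be chosen according to the specific problem.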

Origin www.cnblogs.com/webRobot/p/11965155.html