Outlier detection based on two-step clustering

Please indicate the source when reprinting: http://www.cnblogs.com/tiaozistudy/p/anomaly_detection.html

This article describes the principle of the outlier detection algorithm in IBM SPSS Modeler 18.0 and how to use the "Anomaly" node (see Figure 1). The outlier detection algorithm in SPSS Modeler is based on cluster analysis. As shown in Figure 2, the sample points in the figure can first be clustered into three classes. The three sample points $A$, $B$ and $C$ each belong to the class nearest to them, yet unlike the other points in their respective classes, these three points lie far from their classes, so they can be identified as outliers on this basis.

 

Figure 1: The "Anomaly" node

 

Figure 2: Schematic diagram of outlier detection

1. Outlier detection algorithm idea

According to the above analysis, the outlier detection algorithm is divided into three stages. The first stage, clustering: the sample points are clustered into several classes. The second stage, computation: on the basis of the first-stage clustering, the anomaly measurement indices of all sample points are calculated from the distances. The third stage, diagnosis: on the basis of the second-stage indices, the final outliers are determined and the reason for each anomaly is analyzed, i.e., in which variable directions the outlier deviates. The three stages are discussed below.

1.1. Stage 1: Clustering

This stage clusters all sample points by means of the two-step clustering algorithm (refer to the related content on the two-step clustering algorithm). The two-step clustering algorithm consists of two steps: in the first step, a large number of scattered data samples are condensed into a manageable number of sub-clusters by building a clustering feature (CF) tree; in the second step, starting from these sub-clusters, agglomerative hierarchical clustering merges the sub-clusters one by one until the desired number of clusters is reached.

The two-step clustering algorithm can handle outliers. Potential outliers are screened out before the CF tree is thinned (rebuilt), and misidentified outliers are re-inserted into the CF tree after the thinning step.

  • Screening for potential outliers. Before the CF tree is thinned, find the leaf entry that contains the most data samples among all leaf entries in the current CF tree, and record its sample count $N_{\max}$. Given the predetermined scale parameter $\alpha$, if the number of data samples contained in a leaf entry is less than $\alpha N_{\max}$, that leaf entry is marked as a potential outlier and removed from the current CF tree.
  • Re-insertion of misidentified outliers. After the CF tree is thinned, the potential outliers are processed one by one. If a potential outlier can be absorbed into the CF tree without increasing the size of the current tree, it is considered a misidentified outlier and is inserted back into the current CF tree.

After all data samples in the dataset $\mathfrak D$ have been inserted into the CF tree, the entries that are still potential outliers are regarded as the final outliers. These outliers will be assigned to the clusters produced by the second-step agglomerative method.
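The screening rule above can be sketched as follows (a minimal illustration, not the SPSS Modeler implementation; the function name and data layout are invented for the example):

```python
def screen_potential_outliers(leaf_counts, alpha):
    """Flag leaf entries whose sample count is below alpha * N_max.
    leaf_counts: samples per leaf entry; alpha: the noise level in (0, 0.5].
    Returns (kept_indices, potential_outlier_indices)."""
    threshold = alpha * max(leaf_counts)
    kept, outliers = [], []
    for i, n in enumerate(leaf_counts):
        (outliers if n < threshold else kept).append(i)
    return kept, outliers

# Entries with fewer than 0.25 * 120 = 30 samples are set aside:
print(screen_potential_outliers([120, 95, 3, 110, 2], alpha=0.25))  # ([0, 1, 3], [2, 4])
```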

The parameters related to this stage in the "Anomaly" node are found on its "Expert" tab (see Figure 3):

  • Adjustment coefficient. Specify a number greater than 0 to adjust the relative weights of continuous and categorical variables when computing distances. The larger the value, the greater the weight of the continuous variables.
  • Automatically calculate the number of peer groups. Indicates that the number of clusters is determined automatically; the minimum and maximum allowed numbers of clusters must be specified.
  • Specify the number of peer groups. Check this option to specify the number of clusters directly.
  • Noise level. Corresponds to $\alpha$ above; the noise level ranges from 0 to 0.5.
  • Impute missing values. If this option is checked, missing values of continuous variables are replaced by the variable mean, and for categorical variables missing values are treated as a valid new category; if unchecked, any sample with missing values is excluded from the analysis.

Figure 3: The "Expert" tab of the "Anomaly" node

1.2. Stage 2: Computation

The task of the second stage is to calculate the anomaly measurement indices of each sample on the basis of the first-stage clustering. The calculation is based on the log-likelihood distance (refer to the related content on the log-likelihood distance). For a sample point $s$, the outlier detection algorithm computes the following indices.

(1) Find the cluster $C_j$ to which the sample point $s$ belongs. The GDI (Group Deviation Index) of $s$ is the log-likelihood distance between $\{s\}$ and $C_j \setminus \{s\}$:

\begin{equation}\label{Eq.1}
GDI_s = d(\{s\}, C_j \setminus \{s\}) = \zeta_{C_j \setminus \{s\}} + \zeta_{\{s\}} - \zeta_{C_j}
\end{equation}

where

\begin{equation*}
\zeta_{C_j} = -N_j \left ( \frac12 \sum_{k=1}^{D_1} \ln (\hat \sigma^2_{jk} + \hat \sigma^2_k) + \sum_{k=1}^{D_2} \hat E_{jk} \right )
\end{equation*}

\begin{equation*}
\hat E_{jk} = -\sum_{l=1}^{\epsilon_k} N_{jkl}/N_j \ln (N_{jkl}/N_j)
\end{equation*}

For the specific meaning of the symbols in the formula, see "Log-Likelihood Distance". The GDI reflects the increase in within-cluster dispersion caused by adding the sample point $s$ to the cluster $C_j$. Therefore, the larger the GDI, the more likely the sample point is an outlier.
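Formula (1) can be sketched numerically for continuous variables only (the categorical entropy terms $\hat E_{jk}$ are omitted); this is an illustrative reconstruction from the formulas above, not SPSS Modeler's own code:

```python
import numpy as np

def zeta(cluster, global_var):
    """The zeta term of the log-likelihood distance, continuous variables only.
    cluster: (N, D) array; global_var: (D,) overall variances, the
    sigma_k^2 term that keeps the log finite for singleton clusters."""
    n = cluster.shape[0]
    within_var = cluster.var(axis=0)  # ML (biased) variance per variable
    return -n * 0.5 * np.sum(np.log(within_var + global_var))

def gdi(point_idx, cluster, global_var):
    """Group Deviation Index of one sample s: the distance between {s}
    and the cluster with s removed, per formula (1)."""
    s = cluster[point_idx:point_idx + 1]
    rest = np.delete(cluster, point_idx, axis=0)
    return zeta(rest, global_var) + zeta(s, global_var) - zeta(cluster, global_var)

rng = np.random.default_rng(0)
cluster = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)), [[8.0, 8.0]]])
global_var = cluster.var(axis=0)
scores = [gdi(i, cluster, global_var) for i in range(len(cluster))]
print(int(np.argmax(scores)))  # the far-away point at index 20 has the largest GDI
```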

(2) According to the definition of $\zeta_{C_j}$, it can be decomposed into a sum of per-variable terms:

\begin{equation*}
\zeta_{C_j} = \sum_{k=1}^{D_1 + D_2} \zeta_{C_j}^{(k)}
\end{equation*}

where

\begin{equation*}
\zeta_{C_j}^{(k)} =
\begin{cases}
- \frac{N_j}2 \ln (\hat \sigma^2_{jk} + \hat \sigma^2_k) , & \text{variable } k \text{ is continuous} \\
\sum_{l=1}^{\epsilon_k} N_{jkl} \ln (N_{jkl}/N_j), & \text{variable } k \text{ is categorical}
\end{cases}
\end{equation*}

Further define the Variable Deviation Index (VDI) of the sample point $s$ on variable $k$:

\begin{equation}\label{Eq.2}
VDI_s^{(k)} = \zeta_{C_j \setminus \{s\}}^{(k)} + \zeta_{\{s\}}^{(k)} - \zeta_{C_j}^{(k)}
\end{equation}

Therefore, $GDI_s = \sum_k VDI_s^{(k)}$: the VDI represents each variable's "contribution" to the group deviation index GDI.
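The per-variable decomposition of formula (2) can be sketched in the same way, again for continuous variables only (an illustrative sketch, not SPSS Modeler's own code):

```python
import numpy as np

def zeta_k(cluster, global_var, k):
    """Per-variable zeta contribution of a cluster on variable k
    (continuous case of the decomposition above)."""
    n = cluster.shape[0]
    return -n * 0.5 * np.log(cluster[:, k].var() + global_var[k])

def vdi(point_idx, cluster, global_var, k):
    """Variable Deviation Index of one sample on variable k, formula (2)."""
    s = cluster[point_idx:point_idx + 1]
    rest = np.delete(cluster, point_idx, axis=0)
    return (zeta_k(rest, global_var, k) + zeta_k(s, global_var, k)
            - zeta_k(cluster, global_var, k))

# The last point is typical on variable 0 but extreme on variable 1,
# so its VDI on variable 1 should dominate:
rng = np.random.default_rng(1)
cluster = np.vstack([rng.normal(0.0, 1.0, size=(15, 2)), [[0.0, 9.0]]])
global_var = cluster.var(axis=0)
vdis = [vdi(15, cluster, global_var, k) for k in range(2)]
print(vdis[1] > vdis[0])  # True
```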

(3) Calculate the anomaly index AI (Anomaly Index).

For the sample point $s$, its AI is defined as:

\begin{equation}\label{Eq.3}
AI_s = \frac{GDI_s}{\frac{1}{|C_j|} \sum_{t \in C_j} GDI_t}
\end{equation}

AI is a relative index and more intuitive than GDI: it is the ratio of the within-cluster deviation caused by the sample point $s$ to the average deviation caused by the sample points in cluster $C_j$. The larger the value, the more likely $s$ is an outlier.
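Formula (3) reduces to a simple normalization; a minimal sketch (the function name and values are illustrative):

```python
def anomaly_index(gdi_values):
    """Formula (3): each sample's GDI divided by the mean GDI of its cluster.
    gdi_values: GDI of every sample in one cluster."""
    mean_gdi = sum(gdi_values) / len(gdi_values)
    return [g / mean_gdi for g in gdi_values]

# The last sample's deviation is 2.5 times the cluster average:
print(anomaly_index([0.5, 0.6, 0.4, 2.5]))  # [0.5, 0.6, 0.4, 2.5]
```

Note that since the mean here happens to be 1.0, the AI values equal the GDI values in this toy input.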

(4) Calculate the Variable Contribution Measure (VCM).

For the sample point $s$, the contribution index of the variable $k$ is defined as

\begin{equation}\label{Eq.4}
VCM_s^{(k)} = \frac{VDI_s^{(k)}}{GDI_s}
\end{equation}

VCM is a relative index, more intuitive than VDI; it reflects each variable's share of the "contribution" to the within-group deviation. The larger the value, the more likely the corresponding variable caused the sample point $s$ to be an outlier.
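Formula (4) is a simple ratio per variable; a minimal sketch (variable names and values are hypothetical):

```python
def vcm(vdi_values, gdi_value):
    """Formula (4): each variable's share of the sample's GDI.
    vdi_values: {variable: VDI} for one sample; gdi_value: its GDI."""
    return {k: v / gdi_value for k, v in vdi_values.items()}

# "income" accounts for 75% of this hypothetical sample's deviation:
print(vcm({"income": 1.5, "age": 0.5}, gdi_value=2.0))  # {'income': 0.75, 'age': 0.25}
```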

1.3. Stage 3: Diagnosis

In the second stage, the GDI, VDI, AI, and VCM of all sample points are calculated. In this stage, outliers are determined from the rankings of these indices, and the reason for each anomaly is analyzed.

  1. Sort the AI values in descending order; the sample points in the top $m$ positions are candidate outliers. The AI value at position $m$ serves as the cutoff for judging outliers: points with larger values are outliers, points with smaller values are not.
  2. For each outlier, sort its VDI values in descending order; the top-ranked variables are the main reasons the point is anomalous.
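The two steps above can be sketched together (a minimal illustration; the data structures, sample names, and variable names are invented for the example):

```python
def diagnose(ai, vdi, m):
    """Rank samples by AI (descending); the top m are flagged as outliers,
    and for each one the variable with the largest VDI is reported.
    ai: {sample: AI}; vdi: {sample: {variable: VDI}}."""
    ranked = sorted(ai, key=ai.get, reverse=True)
    return {s: max(vdi[s], key=vdi[s].get) for s in ranked[:m]}

ai = {"s1": 1.1, "s2": 3.2, "s3": 0.8, "s4": 2.4}
vdi = {"s1": {}, "s2": {"income": 0.9, "age": 0.3},
       "s3": {}, "s4": {"age": 1.2, "income": 0.1}}
print(diagnose(ai, vdi, 2))  # {'s2': 'income', 's4': 'age'}
```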

The parameters related to the second and third stages in the "Anomaly" node are found on its "Model" tab (see Figure 4):

  • Minimum anomaly index level. Specifies the threshold for the anomaly index AI: a sample point $s$ whose $AI_s$, computed by formula (3), exceeds this threshold is declared an outlier.
  • Percentage of most anomalous records in the training data. Specifies what percentage of the samples are outliers.
  • Number of most anomalous records in the training data. Specifies how many samples are outliers.
  • Number of anomaly fields to report ($l$). For an outlier $s$, the $l$ variables with the largest $VDI_s^{(k)}$ values (computed by formula (2)) are reported in the model results.

Figure 4: The "Model" tab of the "Anomaly" node

2. Example stream

Figure 5: Anomaly detection example flow

Referring to the data and streams in SPSS Modeler Data Mining Methods and Applications [1], the stream shown in Figure 5 is constructed. Open the model nugget in the stream (see Figures 6 and 7).

The "Model" tab results in Figure 6 show that all samples are clustered into two classes (called "peer groups"): the first contains 498 samples, among which 5 outliers were found; the second contains 1691 samples, among which 1 outlier was found. For the 5 outliers in the first class, the 3 variables with the largest VCM values (see formula (4)) were found for each, forming the upper table in Figure 6. The 3 largest-VCM variables of all 5 outliers include "basic cost"; in other words, the variable "basic cost" contributes substantially to all 5 outliers, with an average VCM of 0.165. Another variable worth noting is "free part": although it contributes strongly to only 3 of the outliers, its average VCM reaches 0.325, meaning it contributes very strongly to those 3 outliers.

The "Summary" tab in Figure 7 gives the AI cutoff for judging outliers, "Outlier Index Cutoff: 1.52328". The AI values of the 6 outliers found, computed by formula (3), are not less than this value, while the AI values of all other sample points are below it.

Figure 6: The "Model" tab of the "Anomaly" model nugget

 

Figure 7: The "Summary" tab of the "Anomaly" model nugget

In the "Settings" tab of the model nugget, select "Discard records" → "Non anomalous" and then run a table on the anomalous data to obtain the outlier output shown in Figure 8. This table contains 5 kinds of new variables:

Table 1: Description of the new variables

  • $O-Anomaly: whether the sample is an outlier (T: yes, F: no)
  • $O-AnomalyIndex: the anomaly index AI of the sample, computed by formula (3)
  • $O-PeerGroup: the peer group (cluster) to which the sample is assigned
  • $O-Field-n: the name of the variable with the n-th largest VCM, where VCM is computed by formula (4)
  • $O-FieldImpact-n: the VCM value of the variable with the n-th largest VCM

Taking the stream in Figure 5 as an example: because "Number of anomaly fields to report" in Figure 4 is set to 3, $n$ in $O-Field-n and $O-FieldImpact-n can be at most 3. Looking at the first row in Figure 8, the sample has $AI = 1.530$ and belongs to the first cluster; the variable with the largest VCM is "income", and the sample's VCM under the variable "income" is 0.200.

Figure 8: Outlier output result table

References

[1] Xue Wei, Chen Huange. SPSS Modeler Data Mining Methods and Applications [M]. Beijing: Electronic Industry Press, 2014.

 
