Anomaly detection: linear models

I. Introduction

The attributes of real data sets are usually highly correlated. Such dependencies make it possible to predict attributes from one another, and the concepts of prediction and anomaly detection are closely related: after all, outliers are values that are inconsistent with the expected (or predicted) values of a particular model. Linear models focus on using inter-attribute dependencies to achieve this goal. In the classical statistical literature, this process is called regression modeling.
Regression modeling is a parametric form of correlation analysis. Some forms of correlation analysis try to predict dependent variables from other independent variables, while other forms of correlation analysis summarize the entire data in the form of latent variables. An example of the latter is the principal component analysis method. These two forms of modeling are very useful in different scenarios of outlier analysis. The former is more useful for complex data types such as time series, while the latter is more useful for regular multidimensional data types.
The main assumption of the linear model is that (normal) data is embedded in a low-dimensional subspace. Therefore, data points that do not conform to this embedding model are considered outliers. In the linear method, the goal is to find a low-dimensional subspace where the behavior of anomalous points is very different from other points.

II. Linear regression model

In linear regression, the observations in the data are modeled by a set of linear equations. Specifically, the different dimensions of the data are related to one another through a set of linear equations whose coefficients must be learned in a data-driven manner. Since the number of observations is usually much larger than the dimensionality of the data, this system of equations is over-determined and cannot be solved exactly (i.e., with zero error). The coefficients are therefore learned so as to minimize the squared deviation of the data points from the values predicted by the linear model. The precise choice of error function determines whether one particular variable is treated specially (the error is measured in the predicted value of that variable) or whether all variables are treated uniformly (the error is measured as the distance from the estimated low-dimensional plane). These different choices of error function do not lead to the same model; in fact, especially in the presence of outliers, the resulting models can behave very differently.
Regression analysis is generally regarded as an important application in the field of statistics. In the classic version of this application, the value of a specific dependent variable must be learned from a set of independent variables, a common situation in time series analysis. The independent variables are also called explanatory variables. This is a recurring theme for contextual data types, in which some attributes (for example, time, spatial location, or adjacent sequence values) are treated as independent, while other attributes (for example, temperature or other environmental measurements) are treated as dependent. For plain multidimensional data types, all dimensions are handled in the same way, and the best-fit linear relationship among all attributes is estimated.
Consider a domain, such as temporal or spatial data, in which the attributes are divided into contextual and behavioral attributes. In this case, a specific behavioral attribute value is usually predicted as a linear function of the behavioral attributes in its contextual neighborhood in order to determine the deviation from its expected value. This is achieved by constructing a multidimensional data set from the temporal or spatial data, in which a specific behavioral attribute value (such as the temperature at the current time) is treated as the dependent variable and the behavioral values in its neighborhood (for example, the temperatures in the preceding window) are treated as independent variables. The estimated deviations are then used to quantify outliers, as in the sketch below. In this setting, outliers are defined by the error in the predicted dependent variable, and anomalies in the relationships among the independent variables are considered less important. The optimization therefore focuses on minimizing the prediction error of the dependent variable in order to build a model of the normal data, and values that deviate from this model are flagged as outliers.
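The following is a minimal sketch of this dependent-variable scheme (an illustration, not code from the original post): each value of a univariate series is predicted from its previous w values with ordinary least squares, and the standardized residual of the dependent variable is used as the outlier score. The synthetic series and the window length w are assumptions made purely for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
values = np.sin(np.linspace(0, 20, 300)) + 0.1 * rng.randn(300)  # synthetic series (assumed data)
values[150] += 2.0                                               # injected anomaly for the demo
w = 5  # window length: number of preceding values used as independent variables

# Each row of X holds the w preceding values; the target y is the current value.
X = np.column_stack([values[i:len(values) - w + i] for i in range(w)])
y = values[w:]

model = LinearRegression().fit(X, y)
residual = np.abs(y - model.predict(X))

# Standardized residuals of the dependent variable serve as outlier scores.
scores = (residual - residual.mean()) / residual.std()
print(np.argsort(scores)[-3:] + w)  # time indices of the three most anomalous values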
What is usually called anomaly detection does not give special treatment to any variable; outliers are defined with respect to the overall distribution of the underlying data points. A more general form of regression modeling is therefore required: one that treats all variables equally and determines the best-fit hyperplane by minimizing the projection error of the data onto that hyperplane. In this case, suppose we have a set of variables $X_1, X_2, \ldots, X_d$; the corresponding regression plane is as follows:

$$a_1 X_1 + a_2 X_2 + \cdots + a_d X_d + a_{d+1} = 0$$
For the convenience of the subsequent calculations, the following normalization constraint is imposed on the coefficients:

$$\sum_{i=1}^{d} a_i^2 = 1$$
The squared L2 norm of the deviations of the data points from this plane is used as the objective function to be minimized:

$$L = \sum_{j=1}^{N} \left( a_1 x_{j1} + a_2 x_{j2} + \cdots + a_d x_{jd} + a_{d+1} \right)^2$$

where $x_{j1}, \ldots, x_{jd}$ are the coordinates of the $j$-th of the $N$ data points.
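A hedged sketch of how this optimization can be solved (an illustration, not code from the original post): under the unit-norm constraint, the minimizing coefficient vector is the eigenvector of the data's covariance matrix with the smallest eigenvalue, the intercept makes the plane pass through the data mean, and the absolute distance of each point to the fitted plane serves as its outlier score. The toy 3-D data set is invented for the example.

import numpy as np

def best_fit_plane_scores(X):
    # Fit the plane a.x + a_{d+1} = 0 with sum(a_i^2) = 1 by least squares and
    # return each point's absolute distance to it as an outlier score.
    mu = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    a = eigvecs[:, 0]        # eigenvalues are ascending, so this is the smallest-variance direction
    a_d1 = -a @ mu           # the optimal plane passes through the data mean
    return np.abs(X @ a + a_d1)

# Toy example: points close to a 2-D plane in 3-D, plus one point far off the plane.
rng = np.random.RandomState(42)
basis = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, -0.3]])
X = rng.randn(200, 2) @ basis + 0.05 * rng.randn(200, 3)
X = np.vstack([X, [0.0, 0.0, 3.0]])   # off-plane outlier
scores = best_fit_plane_scores(X)
print(scores.argmax())                # -> 200, the index of the injected outlier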

III. PCA

The least squares formulation above simply tries to find the (d−1)-dimensional hyperplane that best fits the data, with the error measured along the direction orthogonal to that hyperplane. Principal component analysis is a generalization of this problem: it finds the optimal representation hyperplane of any dimension. Specifically, the PCA method can determine the k-dimensional hyperplane (for any k < d) that minimizes the squared projection error along the remaining (d−k) dimensions. The least squares solution above is the special case of principal component analysis obtained by setting k = d−1.
The main properties of principal component analysis related to anomaly detection are as follows:

  • If the top k eigenvectors are selected (those with the k largest eigenvalues), then among all k-dimensional hyperplanes, the hyperplane defined by these eigenvectors and passing through the data mean is the one for which the mean squared distance of the data points to the hyperplane is as small as possible.
  • If the data is transformed into the axis system defined by the orthonormal eigenvectors, the variance of the transformed data along each eigenvector dimension equals the corresponding eigenvalue. In this new representation, the covariances of the transformed data are zero.
  • Since the variance of the transformed data along the eigenvectors with small eigenvalues is very low, a significant deviation of the transformed data from the mean along these directions may indicate an outlier.
    Such deviations can be aggregated into a single outlier score by normalizing the squared deviation along each eigenvector by its eigenvalue:

    $$\mathrm{Score}(\bar{X}) = \sum_{j=1}^{d} \frac{\left| (\bar{X} - \bar{\mu}) \cdot \bar{e}_j \right|^2}{\lambda_j}$$

    where $\bar{\mu}$ is the mean of the data, $\bar{e}_j$ is the $j$-th eigenvector, and $\lambda_j$ is the corresponding eigenvalue.
Compared with regression on a particular dependent variable, principal component analysis handles the presence of a few outliers more robustly. This is because principal component analysis computes the error with respect to the optimal hyperplane rather than with respect to a specific variable. When a few more outliers are added to the data, the optimal hyperplane usually does not change so much that it affects which points are selected as outliers. This approach is therefore more likely to select the correct outliers, because the model of the normal data is more accurate to begin with.
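A minimal numpy sketch of this normalized score (an illustration, not code from the original post): the data is projected onto the eigenvectors of its covariance matrix, and the squared deviation from the mean along each direction is divided by the corresponding eigenvalue, so deviations along low-variance directions dominate the score. The correlated toy data is an assumption made for the example.

import numpy as np

def pca_outlier_scores(X, eps=1e-12):
    # Sum over eigenvector directions of (squared deviation from the mean) / eigenvalue.
    mu = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    proj = (X - mu) @ eigvecs              # coordinates in the eigenvector axis system
    return (proj ** 2 / (eigvals + eps)).sum(axis=1)

# Correlated 2-D data plus one point that is unremarkable in each coordinate
# separately but breaks the correlation structure.
rng = np.random.RandomState(0)
x1 = rng.randn(300)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.randn(300)])
X = np.vstack([X, [0.0, 2.0]])
print(pca_outlier_scores(X).argmax())      # -> 300, the index of the added point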

IV. Examples

1. Data visualization

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
## Load the data set
path = './'
Train_data = pd.read_csv(path + 'breast-cancer-unsupervised-ad.csv')
## Briefly inspect the first rows of the data
Train_data.head()
## Briefly inspect the last rows of the data
Train_data.tail()
## Use describe() to get familiar with the summary statistics of the data
Train_data.describe()
## Use info() to check the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0 f0 367 non-null float64
1 f1 367 non-null float64
2 f2 367 non-null float64
3 f3 367 non-null float64
4 f4 367 non-null float64
5 f5 367 non-null float64
6 f6 367 non-null float64
7 f7 367 non-null float64
8 f8 367 non-null float64
9 f9 367 non-null float64
10 f10 367 non-null float64
11 f11 367 non-null float64
12 f12 367 non-null float64
13 f13 367 non-null float64
14 f14 367 non-null float64
15 f15 367 non-null float64
16 f16 367 non-null float64
17 f17 367 non-null float64
18 f18 367 non-null float64
19 f19 367 non-null float64
20 f20 367 non-null float64
21 f21 367 non-null float64
22 f22 367 non-null float64
23 f23 367 non-null float64
24 f24 367 non-null float64
25 f25 367 non-null float64
26 f26 367 non-null float64
27 f27 367 non-null float64
28 f28 367 non-null float64
29 f29 367 non-null float64
30 label 367 non-null object
dtypes: float64(30), object(1)
memory usage: 89.0+ KB
numeric_features = ['f' + str(i) for i in range(30)]
## Correlation analysis
numeric = Train_data[numeric_features]
correlation = numeric.corr()
f, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(correlation, square=True)
plt.title('Correlation of Numeric Features', y=1, size=16)
plt.show()
[Figure: correlation heatmap of the 30 numeric features]
## Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=6, sharex=False, sharey=False)
g = g.map(sns.distplot, "value", hist=False, rug=True)
[Figure: distribution of each numeric feature]
sns.set()
## A pairwise plot of all 30 features would produce 30 x 30 = 900 subplots,
## which is far too dense to read, so only the first 6 features are shown here.
sns.pairplot(Train_data[numeric_features[:6]], size=2, kind='scatter', diag_kind='kde')
plt.savefig('correlation.png')
plt.show()
[Figure: pairwise scatter plots of the first 6 features]
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2,  # dimension of the embedded (output) space
            init='pca',      # initialization of the embedding: 'random', 'pca', or a numpy array
            random_state=0)
result = tsne.fit_transform(numeric)
x_min, x_max = np.min(result, 0), np.max(result, 0)
result = (result - x_min) / (x_max - x_min)
label = Train_data['label']
fig = plt.figure(figsize = (7, 7))
#f , ax = plt.subplots()
color = {'o': 0, 'n': 7}
for i in range(result.shape[0]):
    plt.text(result[i, 0], result[i, 1], str(label[i]),
             color=plt.cm.Set1(color[label[i]] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
plt.xticks([])
plt.yticks([])
plt.title('Visualization of data dimension reduction')
plt.show()
[Figure: t-SNE visualization of the data, with each point drawn as its label character]

2. Generate example data with the pyod library and detect outliers with its PCA module

from pyod.models.pca import PCA
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
# Generate sample data
X_train, y_train, X_test, y_test = generate_data(
    n_train=1000,        # number of training points
    n_test=250,          # number of testing points
    n_features=2,        # number of features
    contamination=0.05,  # percentage of outliers
    random_state=29)
# Train the PCA detector
clf_name = 'PCA'
clf = PCA()
clf.fit(X_train)
#get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores
#get the prediction on the test data
y_test_pred = clf.predict(X_test) # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test) # outlier scores
#evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,y_test_pred, show_figure=True, save_figure=False)

On Training Data:
PCA ROC:0.997, precision @ rank n:0.94

On Test Data:
PCA ROC:0.9793, precision @ rank n:0.75
[Figure: inliers and outliers in the training and test sets, as produced by pyod's visualize()]

Origin blog.csdn.net/weixin_43595036/article/details/112783064