Problem motivation
Model
Given a training set, how do we detect whether a certain input x is anomalous?
First, build a model p(x) from the training set. A new data point is flagged as anomalous when p(x) falls below a threshold ε, which means it lies far from the center of the data; otherwise it is treated as normal.
Fraud detection is the most common application of anomaly detection. Each user i is represented by a feature vector, with features such as the number of logins, the number of clicks on a certain page, the number of posts, and so on. A model is fitted to these features, and fraud is then identified with a threshold. Similarly, anomaly detection can be used to identify abnormal products.
Gaussian distribution
Anomaly detection algorithm based on the Gaussian distribution
Suppose each feature of the sample data follows its own Gaussian distribution; the model is then the joint distribution, i.e. the product, of these per-feature distributions. Strictly speaking, multiplying the probabilities assumes that the features are independent, but in practice, if the sample size is large enough, the independence assumption turns out not to matter much.
The procedure is as follows: first select the features that may be needed; fit the parameters of each feature, i.e. the mean and variance of its distribution (these can be collected into vectors); build the model p(x) as the product of all the per-feature distributions; then, for a new sample point x, compute p(x) with the model and check whether it is smaller than the threshold ε.
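The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the synthetic training data, and the threshold value 1e-4 are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the per-feature mean and variance from training data X (m x n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # maximum-likelihood estimate (divides by m)
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x): product of the per-feature univariate Gaussian densities."""
    X = np.atleast_2d(X)
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    """Flag every sample whose density falls below the threshold epsilon."""
    return density(X, mu, sigma2) < epsilon

# Synthetic training data with two features (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(1000, 2))
mu, sigma2 = fit_gaussian(X_train)

epsilon = 1e-4  # hypothetical threshold; in practice it is tuned on labeled data
X_new = np.array([[5.0, 10.0],    # near the center of the data
                  [12.0, 25.0]])  # far from the center
flags = is_anomaly(X_new, mu, sigma2, epsilon)
print(flags)
```

The first point sits at the center of the training data and gets a high density, while the second is several standard deviations away in both features and falls below the threshold.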
When the data has two features, fit the parameters of each one; the model can then be drawn as a three-dimensional surface whose height is p(x), as in the figure above.
Development and evaluation of an anomaly detection system
When choosing features, for example when deciding whether to add a new feature, a single-number evaluation metric is very valuable. Train the model both with and without the feature and compute the metric in each case; comparing the two numbers tells you whether adding the feature improves the algorithm.
Suppose there are 10,000 normal samples and 20 anomalous samples, and the system is evaluated as described above. The training set is used to compute the feature parameters and build the model. The samples can be split among the training, validation, and test sets in different proportions, but the same data must not be used as both the validation set and the test set.
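One possible split is sketched below. The 60/20/20 split of the normal samples and the even split of the anomalous samples between validation and test are assumptions made for illustration, since the notes do not fix the proportions; the data itself is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 10,000 normal samples and 20 anomalous samples, 2 features each.
X_normal = rng.normal(size=(10_000, 2))
X_anom = rng.normal(loc=6.0, size=(20, 2))

# Shuffle the rows before splitting.
X_normal = rng.permutation(X_normal)
X_anom = rng.permutation(X_anom)

# Training set: normal samples only (assumed 60% of the normal data).
X_train = X_normal[:6000]

# Validation set: 20% of the normal data plus half of the anomalies.
X_val = np.vstack([X_normal[6000:8000], X_anom[:10]])
y_val = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

# Test set: the remaining normal data plus the other half of the anomalies.
X_test = np.vstack([X_normal[8000:], X_anom[10:]])
y_test = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

print(X_train.shape, X_val.shape, X_test.shape)
```

No sample appears in more than one set, so the validation and test sets stay disjoint.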
First build the model: fit a Gaussian distribution to each feature on the training set, then multiply the distributions to obtain p(x). Because the samples actually come with labels y, the labels can help us judge the quality of the model. Once the model is built, evaluate it on the validation set: feed each validation sample x into the model and predict its label from the threshold, so that a sample with p(x) at or above the threshold is predicted normal and a sample with p(x) below the threshold is predicted anomalous. Then compare the predictions with the actual labels and compute evaluation metrics such as precision, recall, and F-score.
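For reference, precision, recall, and F-score are conventionally defined as follows when the anomalous class (y = 1) is taken as positive. The symbols TP, FP, and FN (true positives, false positives, false negatives) do not appear in the notes and are introduced here for illustration.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```

With only 20 anomalies among 10,000 normal samples, a model that predicts "normal" for everything is correct almost all the time, which is why metrics of this kind are more informative than plain classification accuracy here.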
To choose the model's threshold ε, try different values of ε and select the one that gives the highest F-score on the validation set.
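The threshold search can be sketched as a simple sweep. The helper names, the number of candidate thresholds, and the toy validation densities below are assumptions for illustration; `p_val` stands for the model's density values on the validation set.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 with the anomalous class (label 1) treated as positive."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(p_val, y_val, n_steps=1000):
    """Try evenly spaced thresholds and keep the one with the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), n_steps):
        y_pred = (p_val < eps).astype(int)  # below threshold -> anomaly
        f1 = f1_score(y_val, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Toy validation densities and labels (illustrative only).
p_val = np.array([0.30, 0.25, 0.28, 0.002, 0.31, 0.001])
y_val = np.array([0, 0, 0, 1, 0, 1])
best_eps, best_f1 = select_threshold(p_val, y_val)
print(best_eps, best_f1)
```

In this toy data the two anomalies have much lower densities than the normal points, so the sweep finds a threshold that separates them perfectly. On real data the best F-score is usually below 1.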
Now that we have labeled data, why not use supervised methods such as linear regression or logistic regression to identify the anomalies?
Anomaly detection vs. supervised learning
Anomaly detection is suited to cases where there are very few positive samples (y = 1) and a very large number of negative samples (y = 0). Because the positive samples are so few, they cannot cover every cause of an anomaly. A supervised learner trained on them cannot learn all of these causes, and new kinds of anomalies that have not been observed yet, and therefore cannot be modeled, may appear in the future. In contrast, anomaly detection models the large number of negative samples, so any deviation from the model can be identified as anomalous, whatever its cause. The spam classification example mentioned earlier under supervised learning works because there is a very large number of spam emails, so the common characteristics of spam can be learned and modeled.
Thus, when there are very few positive samples (anomalies), model the negative (normal) data with anomaly detection and treat points that deviate from it as anomalies. When there are many positive samples, a supervised learning algorithm can learn from them effectively, so supervised learning can be used to identify the anomalies instead.
Choosing features for anomaly detection
In anomaly detection we assume that each feature follows a Gaussian distribution, estimate the parameters from the training set, build the model by multiplying the per-feature densities, and then evaluate it on the validation set. In practice, however, many features are not Gaussian. Such features can be transformed so that their distribution is closer to Gaussian. (The algorithm often still works without the transformation when there are enough samples, but the model usually performs better with it.) There are many possible transformations, such as taking the logarithm or the square root of the feature; by adjusting the exponent, the distribution of the data can be brought closer to Gaussian.
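As a rough illustration of such transformations, the sketch below compares the sample skewness of a right-skewed feature before and after a few candidate transforms. The synthetic log-normal feature and the use of skewness as a quick check are assumptions of this example; in practice one would usually also look at a histogram.

```python
import numpy as np

def skewness(x):
    """Sample skewness: close to 0 for a symmetric (e.g. Gaussian) distribution."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# Synthetic right-skewed, strictly positive feature (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Candidate transforms; the exponents are values to experiment with.
candidates = {
    "raw": x,
    "log(x)": np.log(x),
    "sqrt(x)": np.sqrt(x),
    "x ** (1/3)": x ** (1.0 / 3.0),
}

for name, values in candidates.items():
    print(f"{name:12s} skewness = {skewness(values):.3f}")
```

For this particular feature the logarithm gives the most symmetric result, because the data was generated as log-normal. For real features the best transform has to be found by trying several.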
We want p(x) to be large for normal samples and small for anomalous samples. One approach is to build an initial model first and then perform error analysis on it: when the model performs poorly, work out the likely reasons and choose suitable features based on them. A common problem is that, with the existing features, p(x) is large for both normal and anomalous points; in that case, new features can be added to help detect the anomalies.
We can also construct our own features based on the problem at hand.
Multivariate Gaussian distribution
An extension of anomaly detection
The green point in the upper-left corner of the figure is an anomaly: normally, when CPU load is low, memory usage should also be low, but this point does not follow that pattern. If CPU load and memory usage are considered separately, as in the two plots on the right, the point does not stand out: in terms of CPU load, many points have a lower value, and in terms of memory usage, many points have a higher value. The anomaly detection algorithm therefore cannot identify this point. The reason is that the per-feature Gaussian model separates the points along the magenta circles on the left: the closer a point is to the center of the circles, the more normal it is considered, and points far outside the circles are considered anomalous. This ignores the relationship between the features.
To address this shortcoming, there is an improved anomaly detection algorithm based on the multivariate Gaussian distribution.
The multivariate Gaussian distribution
The multivariate Gaussian model does not treat each feature as a separate Gaussian distribution. Instead, it models all the features with a single distribution whose parameters are the mean vector and the covariance matrix of the samples. As the parameters change, the distribution changes as shown in the following examples:
When the variances of both features change together
When the variance of only one feature changes
When the two features are highly correlated
The off-diagonal elements of the covariance matrix reflect the correlation between the two features: the larger their value, the stronger the correlation, and the sample distribution looks like the figure above. Similarly, when the off-diagonal elements are negative, the two features are negatively correlated, and the sample distribution is as follows:
When the two features are negatively correlated
When the mean changes, the peak of the distribution moves with it; that is, changing the mean shifts the center of the entire distribution:
The multivariate Gaussian distribution as the mean changes
Anomaly detection with the multivariate Gaussian distribution
Parameter estimation for the multivariate Gaussian distribution
For the multivariate Gaussian distribution, the parameters to be estimated are the mean vector μ and the covariance matrix Σ.
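The estimation formulas are missing from the notes. The standard maximum-likelihood estimates, which are presumably what was shown here, are given below, with m the number of training samples and x^{(i)} the i-th sample.

```latex
\[
\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad
\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^{T}
\]
```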
The multivariate Gaussian model
Once the parameters are determined, the model p(x) is the multivariate Gaussian density with those parameters. Given a new sample x, it is flagged as anomalous when p(x) is smaller than the threshold ε.
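A minimal NumPy sketch of this model follows. The function names, the synthetic correlated data (standing in for two features such as CPU load and memory usage), and the threshold 1e-3 are assumptions for illustration. The density is the standard multivariate Gaussian formula, computed in log space for numerical stability.

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Maximum-likelihood estimates of the mean vector and covariance matrix."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / m
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / sqrt((2 pi)^n |Sigma|)."""
    X = np.atleast_2d(X)
    n = mu.shape[0]
    Xc = X - mu
    # Solve Sigma z = (x - mu) rather than forming the explicit inverse.
    solved = np.linalg.solve(Sigma, Xc.T).T
    mahal = np.sum(Xc * solved, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    log_p = -0.5 * (n * np.log(2.0 * np.pi) + logdet + mahal)
    return np.exp(log_p)

# Synthetic training data: two strongly positively correlated features.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=5000)
mu, Sigma = fit_multivariate_gaussian(X_train)

X_new = np.array([[1.5, 1.5],    # follows the correlation
                  [-1.5, 1.5]])  # breaks it: low first feature, high second
epsilon = 1e-3  # hypothetical threshold
p = multivariate_density(X_new, mu, Sigma)
flags = p < epsilon
print(p, flags)
```

Both new points are equally far from the mean in each individual feature, so a per-feature model would score them identically. The multivariate model flags only the second one, because it contradicts the learned correlation.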
Relationship to the univariate Gaussian model
The univariate Gaussian model is actually the special case of the multivariate Gaussian distribution in which the features are independent of each other, i.e. the covariance matrix is diagonal.
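This special case can be checked numerically. The sketch below, with arbitrarily chosen illustrative values for the mean, variances, and test point, evaluates the product of univariate densities and the multivariate density with a diagonal covariance matrix at the same point.

```python
import numpy as np

# Arbitrary illustrative parameters for two independent features.
mu = np.array([1.0, -2.0])
sigma2 = np.array([0.5, 2.0])
x = np.array([1.3, -0.5])
n = len(mu)

# Univariate model: product of the per-feature Gaussian densities.
p_uni = np.prod(
    np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
)

# Multivariate model with the variances on the diagonal of Sigma.
Sigma = np.diag(sigma2)
d = x - mu
p_multi = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / np.sqrt(
    (2.0 * np.pi) ** n * np.linalg.det(Sigma)
)

print(p_uni, p_multi, np.isclose(p_uni, p_multi))
```

The two values agree up to floating-point error, since a diagonal covariance matrix makes the quadratic form split into a sum of per-feature terms and the determinant into a product of the variances.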
Traditional Gaussian model vs. multivariate Gaussian model
With the traditional (per-feature) Gaussian model, unusual combinations of feature values can only be caught if you manually create features that capture the relationship between the original features; if you do, the traditional model is adequate. If you do not want to construct such features yourself, the multivariate Gaussian model is the better fit, because it captures the relationships between features automatically. The traditional model also works when the training set is small, whereas the multivariate model needs a large amount of training data: the number of training samples m should be much greater than the number of features n (m > 10n is generally recommended), otherwise the covariance matrix can be singular. A further advantage of the traditional model is that it is cheap to compute, while the computational cost of the multivariate model grows with the number of features.
If you run into a singular covariance matrix when using the multivariate Gaussian model, there are two likely causes: either the amount of data is too small, so the number of samples does not sufficiently exceed the number of features, or there are redundant features, i.e. features that are linearly dependent on each other.
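Both causes can be checked before fitting the model. The sketch below is one possible diagnostic, not part of the original notes: the function name and returned keys are hypothetical, and the example data deliberately contains a redundant third feature equal to the sum of the first two.

```python
import numpy as np

def covariance_diagnostics(X):
    """Check the two likely causes of a singular covariance matrix."""
    m, n = X.shape
    Xc = X - X.mean(axis=0)
    # The covariance matrix has the same rank as the centered data matrix.
    rank = int(np.linalg.matrix_rank(Xc))
    return {
        "m": m,
        "n": n,
        "below_rule_of_thumb": m <= 10 * n,  # the m > 10n guideline is not met
        "rank": rank,
        "redundant_features": rank < n,      # some features are linearly dependent
    }

# Example with a redundant feature: x3 = x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + x2])

diag = covariance_diagnostics(X)
print(diag)
```

Here there are plenty of samples, but the rank is lower than the number of features, which points to the redundant feature as the cause. Dropping one of the linearly dependent features makes the covariance matrix invertible again.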
References
Andrew Ng, Machine Learning: Anomaly Detection
Andrew Ng Machine Learning notes: Anomaly Detection
Andrew Ng Machine Learning notes (Chinese version): Anomaly Detection