Stanford Machine Learning - Lecture 11. Anomaly Detection

I have been watching Andrew Ng's video lectures from the Stanford Open Course on Machine Learning (https://class.coursera.org/ml/class/index).

At the same time, I have been studying the accompanying series of articles by Rachel Zhang, a well-known CSDN blogger.

However, that blog series stops at "Lecture 10: Data Dimensionality Reduction" (http://blog.csdn.net/abcjennifer/article/details/8002329), while three more lectures follow, with more applied content: anomaly detection, large-scale machine learning, and photo OCR. For completeness, I will write up these last three lectures myself, borrowing the writing style of blogger Rachel Zhang.

 

This column (Machine Learning) covers single-variable linear regression, multi-variable linear regression, the Octave tutorial, logistic regression, regularization, neural networks, machine learning system design, SVMs (Support Vector Machines), clustering, dimensionality reduction, anomaly detection, large-scale machine learning, and more. Most of the content comes from Andrew Ng's lectures in the Stanford Open Course on Machine Learning, with references to other books. (https://class.coursera.org/ml/class/index)

 

 

Lecture 11. Anomaly Detection

 

===============================

(1) What is anomaly detection?

(2) Gaussian distribution (normal distribution)

(3) Anomaly detection algorithm

(4) Designing an anomaly detection system

(5) How to select or design features

(6) Anomaly detection vs. supervised learning

(7) Multivariate Gaussian distribution and its application in anomaly detection

 

 

=====================================

 

(1) What is anomaly detection?

 

Let's start with an example:

Most aircraft engines produced are in a normal (OK) state, but abnormal ones must be screened out. When a new engine comes off the line, we need a model to judge whether it is OK or anomalous. Problems like this are anomaly detection problems.

x1 and x2 are two features of the engine, and the data distribution is shown in Figure 1 below.

Suppose the learned model is p(x), which describes the probability that a sample x is non-anomalous, and we set a probability threshold ε: when p(x) falls below the threshold, x is judged anomalous. This is illustrated in Figure 2 below.

 

Figure 1 illustrates the anomaly detection problem

 

Figure 2 Density estimation principle

 

 

 

(2) Gaussian distribution (normal distribution)

The Gaussian distribution, also called the normal distribution, is familiar to all of us. The training samples for anomaly detection are all non-anomalous. We assume the features of these samples follow a Gaussian distribution and, on that basis, estimate a probability model, which we then use to estimate how likely a test sample is to be non-anomalous.

What is a Gaussian distribution? As shown below:
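For reference, the univariate Gaussian density with mean μ and variance σ² is:

$$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$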

That is, a Gaussian distribution has two model parameters: the mean and the variance (equivalently, the standard deviation).

The following figure gives an example of how these two model parameters relate to the shape of the Gaussian curve:

The larger the standard deviation, the flatter the curve and the more spread out the data; conversely, the smaller the standard deviation, the more sharply peaked the curve and the more concentrated the data.

The mean and the variance (standard deviation) are estimated from the m training samples with the following two formulas:
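As a reconstruction of those two formulas (using the 1/m convention from the lecture):

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad \sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$$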

 

(3) Anomaly detection algorithm

Assuming each dimension of the training data follows a Gaussian distribution, we design the training process of the anomaly detection algorithm as follows:
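In symbols: assuming the n features are independent, the model fitted during training is the product of per-feature Gaussians:

$$p(x)=\prod_{j=1}^{n}p\left(x_j;\mu_j,\sigma_j^2\right)=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$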

The corresponding prediction process:
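Here is a minimal Octave sketch of both steps, assuming X is an m-by-n matrix of normal training samples, Xtest holds the samples to score, and epsilon is the chosen threshold (all three names are illustrative):

```octave
% Training: estimate a per-feature Gaussian from the normal samples in X
mu = mean(X);                      % per-feature means (1 x n)
sigma2 = mean((X - mu) .^ 2);      % per-feature variances (1 x n), 1/m convention

% Prediction: p(x) is the product of the per-feature Gaussian densities
p = prod(exp(-(Xtest - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2), 2);
isAnomaly = p < epsilon;           % flag samples whose probability falls below epsilon
```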

 

(4) Designing an anomaly detection system

To design an anomaly detection system, we need to consider:

1. Data: In anomaly detection, the training set contains only normal samples, but to measure system performance we also need some abnormal samples. That means we need a batch of labeled data, where normal samples are labeled 0 and abnormal samples are labeled 1. Note that only samples with label 0 are used during training.

2. Data grouping: we need three sets of data: train, cross-validation, and test.

3. Evaluation system: In anomaly detection most of the data is normal, so the 0 and 1 samples are seriously imbalanced. The system's performance cannot be described simply by classification error rate or accuracy. As mentioned earlier, for such skewed classes we need measures such as precision, recall, and F-measure (their definitions are recalled below).
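For reference, writing TP, FP, FN for true positives, false positives, and false negatives:

$$\text{precision}=\frac{TP}{TP+FP},\qquad \text{recall}=\frac{TP}{TP+FN},\qquad F_1=\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$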

Next: the first question, "data", has been covered above. For the second question, "data grouping", let's take an example:

If there are 10,000 normal samples and 20 abnormal samples in all our aircraft engine data, the recommended grouping ratio is as follows:

  • train set: 6000 normal samples
  • cross validation set: 2000 normal samples, 10 abnormal samples
  • test set: 2000 normal samples, 10 abnormal samples

So, regarding the third question, "system evaluation":

  • Fit a probability distribution model on the train set
  • Predict each sample's label (0 or 1) on the validation/test set
  • Compute evaluation metrics, possibly including: true positives, false positives, true negatives, false negatives, precision, recall, F1-score
  • In addition, this evaluation procedure can also be used to choose the classification probability threshold, i.e., the selection of epsilon in the figure above (see the sketch after this list)
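As a sketch of that last bullet: one common way to pick epsilon is to scan candidate thresholds on the cross-validation set and keep the one with the best F1 score. Here pval holds the predicted probabilities and yval the 0/1 labels; both names are illustrative.

```octave
bestEpsilon = 0;
bestF1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;    % scan 1000 candidate thresholds
for epsilon = min(pval):stepsize:max(pval)
    pred = pval < epsilon;                    % predicted anomaly labels
    tp = sum((pred == 1) & (yval == 1));
    fp = sum((pred == 1) & (yval == 0));
    fn = sum((pred == 0) & (yval == 1));
    prec = tp / (tp + fp);
    rec  = tp / (tp + fn);
    F1 = (2 * prec * rec) / (prec + rec);
    if F1 > bestF1                            % keep the threshold with the best F1
        bestF1 = F1;
        bestEpsilon = epsilon;
    end
end
```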

 

(5) How to select or design features

Let's talk about how the features in the anomaly detection problem should be selected or designed.

This part covers two aspects: first, data transformation; second, adding more discriminative features.

1. Data transformation

We know that the anomaly detection system above is built on the assumption that each data dimension follows a Gaussian distribution. So what if the original data does not follow a Gaussian distribution? The way out is to first apply some transformation to the original data, which is effectively designing a new feature. An example is shown below:

Top left: the histogram of the data x basically conforms to a Gaussian distribution

Bottom left: the histogram of this data does not conform to a Gaussian distribution

Bottom right: after applying the log(x) transformation to x, the histogram basically conforms to a Gaussian distribution

There are many transformations similar to log(x); several functions such as those shown in the upper right can be tried in experiments.

The following example from the experiments illustrates how to do such a data transformation, as shown in the next few pictures; the code is Octave:

The initial data looks like this:

The 50 in hist(x,50) is just a parameter of the hist function (the number of histogram bins), so don't worry about it here.

OK, let's try to find a function to do the transformation; the data becomes the following:

It looks more and more like a Gaussian distribution; let's try adjusting the function's parameter:

Great, this is closer to a Gaussian distribution.

Try other functions too; sometimes one of them does the job in a single step, haha:

Assign the transformed, Gaussian-looking data to a new variable xNew, and use xNew to estimate the two parameters of the Gaussian distribution. Done!
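Putting the whole experiment together, here is a minimal Octave sketch, assuming x is the raw feature vector (the particular exponents tried are just examples):

```octave
hist(x, 50);           % inspect the raw histogram (50 bins)
hist(x .^ 0.5, 50);    % try a square-root transform
hist(x .^ 0.1, 50);    % adjust the exponent until the shape looks Gaussian
hist(log(x), 50);      % a log transform may also work in one step

xNew = log(x);                     % keep the transform that looks most Gaussian
mu = mean(xNew);                   % then estimate the Gaussian parameters from xNew
sigma2 = mean((xNew - mu) .^ 2);
```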

 

2. Add more discriminative features

In addition, since we rely on a probability threshold to separate normal from abnormal samples, we naturally want abnormal samples to get small probability values and normal samples to get large ones. A problem that easily arises is that when a test sample's predicted probability is neither clearly large nor clearly small, but right around the threshold, the prediction is quite likely to be wrong. As shown below:

At the position of the green X sample, the predicted probability is about as large as that of the normal samples around it, so it is difficult to make a correct judgment.

At this point, if we have a feature in another dimension that distinguishes normal from abnormal samples better near the green X, we may be able to judge the green X correctly. As shown in the figure below, after adding the feature x2, we find that the probability p(x2) of the green X sample along the x2 dimension is very small, so its product with p(x1) is naturally small too. Thus, where feature x1 fails to discriminate, feature x2 helps the model identify the sample correctly.

 

To sum up, pay attention to two points when selecting features:

1. When the feature data does not conform to a Gaussian distribution, inspect the histogram of the data and try transforming it with a variety of functions, so that the histogram comes to match a Gaussian shape.

2. When the current features are not discriminative enough, design and add more discriminative features to make the model more discriminative.

(6) Anomaly detection vs. supervised learning

Having introduced anomaly detection, some readers may ask: since it also uses positive and negative samples and then predicts a label for each test sample, what are the differences and connections between anomaly detection and supervised learning? What kinds of problems suit an anomaly detection model, and what kinds call for supervised learning? In this section we analyze the characteristics of and differences between the two; once you know the differences, you will naturally know which problem scenarios each method suits.

The characteristics and differences of the two are listed below:

1. In terms of data volume:

 

Anomaly detection: As mentioned earlier, anomaly detection data belongs to a skewed class: the numbers of positive and negative samples are very unbalanced, with many negative samples (normal) and few positive samples (anomalous); most of the data is normal.

Supervised learning: There are many positive and many negative samples, because the learning algorithm needs to see as many of both as possible to make the learned model discriminative.

2. Data distribution:

 

Anomaly detection: As mentioned earlier, the anomaly detection algorithm assumes that each dimension of the training data follows a Gaussian distribution, and only negative samples participate in training. Positive (anomalous) samples, by contrast, are varied: their distribution is very uneven, and different positive samples resemble each other very little; these are the many different "types" of anomalies that Ng mentions. This is not hard to understand. Put plainly: normal samples are all alike, while each anomaly is anomalous in its own way.

Supervised learning: The positive and negative samples are evenly distributed, and data of the same category show strong similarity and correlation.

3. Model training:

Anomaly detection: Given the characteristics of data volume and distribution above, positive (anomalous) samples are not a good reference for the algorithm: they are few in number and varied in style, so they are unsuitable for training. Negative (normal) samples are different: they fit a Gaussian distribution, and it is easy to learn the distribution of their features. Then why are labeled positive samples used at all when designing an anomaly detection system? Note: they take part in neither training nor Gaussian parameter estimation; they are used only for evaluation on the validation and test sets.

Supervised learning: Both positive and negative samples participate in the training process, again because of the data characteristics analyzed above.

In summary, the key to choosing between anomaly detection and supervised learning models is to understand the characteristics of the data at hand, including data volume, data distribution, and so on.

 

(7) Multivariate Gaussian distribution and its application in anomaly detection

1. Motivation

Why do we need the concept of a multivariate Gaussian distribution? To motivate it, here is an example where a multivariate Gaussian distribution is more reliable than the per-dimension Gaussian model above.

As shown in the figure below, suppose x1 and x2 have a linear relationship as in the left figure (ignore the linear relationship for now; it is explained in detail later). For the green X sample, p(x1) and p(x2) are both within the threshold range, i.e., neither is small enough to mark it as anomalous (as shown in the figure on the right), so their product quite possibly fails the anomaly condition as well. Specifically, in the left figure, the farther out the purple circles, the smaller the probability of being a normal sample, with the highest probability at the center. The green X test sample is about as far from the center as several red X training samples, i.e., it receives a similar probability value, so the green point is judged normal, just like the red ones.

2. The multivariate Gaussian model formula

Unlike the Gaussian distribution model introduced above, which assumes the data dimensions are mutually independent, the multivariate Gaussian distribution treats all dimensions of the data as a whole.

It also has two parameters: the mean vector and the covariance matrix. The formula is shown in the figure below, where |Sigma| denotes the determinant of the covariance matrix Sigma.
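For reference, for n-dimensional x with mean vector μ and covariance matrix Σ, the density is:

$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$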

Note that the linear relationship between x1 and x2 mentioned above is actually encoded in the covariance matrix. In other words, the difference between the multivariate Gaussian distribution and independent univariate Gaussians is that the former models the relationships between different data dimensions, while the latter treats them as unrelated and independent.

The following examples illustrate the connection between the multivariate Gaussian model and the relationships in the data:

(1) As shown in the figure below, when the diagonal elements of the covariance matrix are equal, the model's surface in three-dimensional space spreads out equally in both dimensions.

 

(2) As shown in the figure below, when the diagonal elements are unequal, the surface spreads by different amounts in the two dimensions. In the figure on the right, the probability value decreases faster along the x2 direction than along the x1 direction.

 

(3) The transformations in the following figures are also very important, though they are not explained in detail here. By observing the values of the covariance matrix and the mean vector together with the corresponding plots, you can understand the connection between the multivariate Gaussian model and the data relationships.

 

 

 

3. Anomaly detection algorithm modeled with a multivariate Gaussian

The algorithm is shown in the following figure:

Looking at the right-hand image in the figure above: when the multivariate Gaussian model is used and two data dimensions have a linear relationship, that relationship is easily captured and correctly guides the judgment of test samples.
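A minimal Octave sketch of this multivariate variant, reusing the illustrative X, Xtest, and epsilon from before:

```octave
[m, n] = size(X);
mu = mean(X);                          % mean vector (1 x n)
Sigma = ((X - mu)' * (X - mu)) / m;    % covariance matrix (n x n), captures correlations

D = Xtest - mu;                        % centered test samples
% density of each row of Xtest under N(mu, Sigma); D / Sigma is D * inv(Sigma)
p = exp(-0.5 * sum((D / Sigma) .* D, 2)) / ((2 * pi) ^ (n / 2) * sqrt(det(Sigma)));
isAnomaly = p < epsilon;               % epsilon chosen on the validation set as before
```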

 

4. The connection between the multivariate Gaussian model and multiple independent univariate Gaussian models

It is not hard to see that the latter is in fact a special case of the former; their relationship is as follows:

If and only if the covariance matrix is a diagonal matrix whose diagonal elements are the variances of the individual dimensions, the multivariate Gaussian model is equivalent to multiple independent univariate Gaussian models. As shown below:
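In symbols, the special case is:

$$\Sigma=\begin{pmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2\end{pmatrix}\quad\Longrightarrow\quad p(x;\mu,\Sigma)=\prod_{j=1}^{n}p\left(x_j;\mu_j,\sigma_j^2\right)$$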

That is to say, in this special case the features of the different dimensions are independent of one another, with no linear or nonlinear relationship between them, so the three-dimensional surface merely spreads by different widths along different dimensions rather than tilting at roughly 45° as in the examples above.

 

Knowing the connection between the two, we should ask: when should we use the multivariate Gaussian model, and when should we assume independent univariate Gaussian models? Their differences are summarized below (the former refers to the multivariate Gaussian model, the latter to the univariate Gaussian models):

1. Feature selection:

For the same set of known data {x1, x2}, suppose x1 and x2 individually do not discriminate well. If the latter model is chosen, then, as introduced in the feature-selection section, we may need to manually design a more discriminative feature (such as x1/x2); if the former model is used, the multivariate Gaussian automatically learns the relationship between x1 and x2, so no manual feature design is needed.

2. Computing efficiency:

The latter is computationally faster because it involves no expensive matrix operations such as inverting the covariance matrix. Therefore, when the feature dimension n is very large (n = 10,000 or 100,000 is generally considered very large), the latter model is preferred; with the former, the covariance matrix would be n×n and the inversion operation very slow.

3. Regarding the sample size m and the sample dimension n:

To sum up: when m is relatively small, the latter model is more suitable, because a small number of samples cannot easily capture complex relationships between the features; when m >> n (e.g., m = 10n) and the covariance matrix is invertible, the former, the multivariate Gaussian model, can be used.
