[Quick reading of the paper] DEEP SEMI-SUPERVISED ANOMALY DETECTION

This paper on anomaly detection is one of several articles that my summer camp instructor provided for me to read. My PPT and draft are as follows.

(PS: This is the first time I have come across this type of paper and my reading has been rough, so there may be things I have misunderstood or gotten wrong. Please forgive me.)

1. Background

The papers I will introduce are all about anomaly detection, so I first want to explain what anomaly detection is.

In layman's terms, the purpose of anomaly detection is to pick out samples from a data set that the model considers abnormal (for example, samples that do not meet expectations or do not follow the common pattern). In the figure on the right, blue points are normal data and orange points are abnormal data, and what we have to do is distinguish them. For example, SVM-style classification separates classes with a dividing hyperplane; as shown in the figure, points inside the circle are normal and points outside the circle are abnormal.

So for unlabeled data sets, the traditional approach is unsupervised: learn a compact description of the data. Clustering methods find it very hard to learn anomalies directly, because abnormal data does not cluster together the way normal data does; each anomaly tends to be peculiar in its own way. Instead, we should learn the characteristics of the normal data, that is, one-class learning: anything that cannot be judged as normal is classified as abnormal.

Therefore, training must rest on a premise: most of the data is normal, and the normal data has some unknown but learnable structure. Otherwise, with no label constraints, there would be nothing to learn from.

2. Starting point and key points

The title of the first paper, Deep Semi-Supervised Anomaly Detection, already indicates that it uses semi-supervised learning instead of traditional unsupervised learning.

The researchers point out that in real-world problems the data set is rarely completely unlabeled; a small number of labels are usually available. This leads to the idea of exploiting that labeled portion through semi-supervised learning. Moreover, having labels means there are not only samples marked as normal, but also samples marked as abnormal.

As just mentioned, existing clustering-style methods learn little from abnormal samples. Because of this, traditional semi-supervised AD methods either use only the samples labeled as normal, or treat labeled anomalies in a very narrow sense. This raises the question of how to design a general method that can utilize all labeled samples.

Judging from this, the paper aims at two goals: one is to use both labeled and unlabeled data, and the other is to handle samples labeled as abnormal.

3. Related work

In the related work section, the author introduces ideas from information theory, which are often used in anomaly detection.

Mutual information is a measure of the dependence between two random variables.

Supervised learning wants the latent representation Z to share little information with the input X and much information with the label Y, so that the model learns towards the label Y, as shown in (1).
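
Based on that description, formula (1) is presumably the standard information bottleneck objective; a hedged reconstruction (the notation and the trade-off parameter α are my assumptions, not copied from the slides):

```latex
% Information bottleneck (formula (1)): make Z carry little information
% about the input X but much information about the label Y;
% \alpha > 0 trades off the two terms.
\min_{p(z \mid x)} \; I(X; Z) - \alpha \, I(Z; Y)
```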

However, unsupervised learning has no labels, so there is no Y and the formula must be rewritten; this leads to the information maximization (Infomax) principle, shown in (2).

The first term maximizes the mutual information between X and its latent representation Z, while the second term is a regularizer on Z. The intent of this formula is to learn a compact structure for the data, so that anything falling outside it is an anomaly, rather than learning a normal class and an abnormal class separately as supervised classification would. Autoencoders are an application of this principle: they try to learn an identity mapping, and by setting a threshold on the reconstruction error the boundary between normal and abnormal can be drawn.
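
Correspondingly, formula (2) should be the Infomax-style objective with a regularizer on the representation; again a hedged reconstruction (the exact sign and placement of the hyperparameter may differ from the slides):

```latex
% Infomax (formula (2)): maximize the information the representation Z
% retains about the input X, subject to a regularizer R on Z.
\max_{p(z \mid x)} \; I(X; Z) + \alpha \, R(Z), \qquad \alpha > 0
```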

The method proposed by the authors is based on the second formula, but since this time some labels Y are available, they rewrite it and change the regularization term to R(Z, Y); in the actual method this regularization term is based on entropy, as shown in (3).
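
So formula (3) presumably keeps the Infomax form but lets the regularizer depend on the labels; a hedged reconstruction:

```latex
% Semi-supervised variant (formula (3)): the regularizer now also uses
% the available labels Y. In the method it is chosen to be entropy-based:
% low entropy for the latent distribution of normal data, high entropy
% for that of anomalies.
\max_{p(z \mid x)} \; I(X; Z) + \alpha \, R(Z, Y)
```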

Next, the author introduces Deep SVDD, the unsupervised method that the proposed approach builds on. Deep SVDD measures the distance between a neural network's output and a known hypersphere center c, and it aims to learn a transformation that minimizes the volume of a closed hypersphere in the output space Z centered at c.

In layman's terms, after training the model tends to map normal samples close to c and abnormal samples farther away, so normal and abnormal samples can be separated by setting a distance threshold.
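
To make this concrete, here is a minimal PyTorch-style sketch of the Deep SVDD idea as described above; the encoder phi, the center c, and the threshold are placeholders, and details such as weight decay on the network are omitted.

```python
import torch

def deep_svdd_loss(phi, x, c):
    """One-class Deep SVDD objective (sketch): mean squared distance
    of the mapped points to the fixed hypersphere center c."""
    z = phi(x)                               # map inputs into output space Z
    return ((z - c) ** 2).sum(dim=1).mean()  # pull every point toward c

def anomaly_score(phi, x, c):
    """Score a sample by its distance to c; larger means more anomalous."""
    with torch.no_grad():
        return ((phi(x) - c) ** 2).sum(dim=1)

# Usage sketch: flag samples whose distance exceeds a chosen threshold.
# scores = anomaly_score(phi, x_test, c)
# is_anomaly = scores > threshold
```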

4. Methods and experiments

The author then points out that geometrically minimizing the volume of the hypersphere can be interpreted as a process of entropy minimization.

Combined with the rewritten information-theoretic formula above, this means the objective can be realized through entropy minimization, and the formula can be extended so that labeled data is also used during training.

The new term must ensure that the latent distribution of normal data has low entropy, while the latent distribution of abnormal data has high entropy. This yields the final formula.

For labeled normal data, that is, ŷ = +1, a quadratic loss is applied to the distance from the mapped point to the center c, so minimization tends to pull it closer to the center. For labeled abnormal data, that is, ŷ = -1, the inverse of that distance is penalized instead, so anomalies are pushed farther from the center.

This is consistent with the common assumption that anomalies are not concentrated. In this way, the loss function contains both an unlabeled part and a labeled part (positive or negative), so all of the data is utilized.
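
A minimal sketch of this semi-supervised loss in the same PyTorch style, assuming an encoder phi, an unlabeled batch x_u, a labeled batch x_l with labels y_l in {+1, -1}, and a weighting hyperparameter eta; it follows the description above rather than the paper's exact equation, and weight decay is again omitted.

```python
import torch

def deep_sad_loss(phi, x_u, x_l, y_l, c, eta=1.0, eps=1e-6):
    """Semi-supervised hypersphere loss (sketch).

    Unlabeled samples are treated as mostly normal and pulled toward c.
    Labeled normal samples (y = +1) get the same quadratic distance loss;
    labeled anomalies (y = -1) are penalized by the inverse of the
    distance, which pushes them away from the center.
    """
    # Unlabeled term: squared distance to the center.
    dist_u = ((phi(x_u) - c) ** 2).sum(dim=1)
    loss_u = dist_u.mean()

    # Labeled term: distance^y, i.e. the distance itself for y = +1
    # and its inverse for y = -1 (eps avoids division by zero).
    dist_l = ((phi(x_l) - c) ** 2).sum(dim=1)
    loss_l = ((dist_l + eps) ** y_l.float()).mean()

    return loss_u + eta * loss_l
```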

In the experiments, the author varies several factors such as the ratio of labeled data, the pollution (contamination) ratio, and the number of anomaly classes. The conclusion is that on more complex, partially labeled, and contaminated data, the proposed method achieves better results than traditional methods.

 


Origin blog.csdn.net/weixin_42569673/article/details/107979072