2023 Huashu Cup Mathematical Modeling Ideas - Case: Anomaly Detection

Question ideas

(Ideas will be shared on CSDN as soon as the competition questions are released.)

https://blog.csdn.net/dc_sinor?type=blog

1. Introduction – About Anomaly Detection

Anomaly detection (outlier detection) plays an important role in scenarios such as:

  • data preprocessing
  • virus/Trojan detection
  • industrial product quality inspection
  • network traffic monitoring

In these scenarios, anomalous data make up only a small fraction of the whole, so classification algorithms such as SVM and logistic regression are not applicable, because:

Supervised learning algorithms require large numbers of both positive and negative samples, so that the algorithm has enough examples to learn their characteristics, and they assume that future samples follow the same distribution as the training samples.

The following compares the typical application areas of anomaly detection and supervised learning:

Anomaly detection

  • credit card fraud
  • defect detection in manufactured products
  • data center machine anomaly detection
  • intrusion detection

Supervised learning

  • spam identification
  • news categorization

2. Anomaly detection algorithms

1. Based on statistical distribution

A simple approach is to look at the distribution of the data, for example with a kernel density estimate (KDE), and flag values that fall in the low-density tails. Taking the trading volume of a stock as an example:

import tushare
from matplotlib import pyplot as plt

# daily price/volume history for stock 600680 (tushare legacy API)
df = tushare.get_hist_data("600680")
# trading volume over the most recent 90 trading days
v = df[-90:].volume
# kernel density estimate of the volume distribution
v.plot(kind="kde")
plt.show()

Judging from the KDE of the past three months, a trading volume greater than 200,000 can be considered anomalous.

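To turn the threshold read off the plot into an explicit rule, a common statistical criterion is the 3-sigma rule: flag values more than three standard deviations from the mean. A minimal sketch, assuming `v` is the 90-day volume Series from the snippet above:

# assume `v` is the 90-day volume Series computed above
mu, sigma = v.mean(), v.std()
# 3-sigma rule: values outside mean ± 3*std are flagged as anomalies
anomalies = v[(v < mu - 3 * sigma) | (v > mu + 3 * sigma)]
print(anomalies)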

2. Box plot analysis

import tushare
from matplotlib import pyplot as plt

df = tushare.get_hist_data("600680")
v = df[-90:].volume
# a box plot makes the quartiles and whisker bounds explicit
v.plot(kind="box")
plt.show()

From the box plot it can be seen that trading volumes below 20,000 or above 80,000 fall outside the whiskers and deserve extra vigilance!
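The box-plot whiskers correspond to the interquartile range (IQR) rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged. A minimal sketch, again assuming `v` is the volume Series from the snippet above:

# assume `v` is the 90-day volume Series computed above
q1, q3 = v.quantile(0.25), v.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# points outside the whisker bounds are treated as outliers
outliers = v[(v < lower) | (v > upper)]
print(lower, upper)
print(outliers)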

3. Distance/density-based methods

A typical algorithm is the Local Outlier Factor (LOF). It introduces the notions of k-distance, k-distance neighborhood, reachability distance, local reachability density, and local outlier factor to identify outliers.

Intuitively, as shown in Figure 2, the points of set C1 are fairly uniform in spacing, density, and dispersion, and can be treated as one cluster; the points of set C2 can likewise be treated as a cluster. Points o1 and o2 are relatively isolated and can be regarded as anomalies (outliers). The question is how to make the algorithm general enough to identify outliers for both C1 and C2, whose density distributions differ greatly. LOF achieves this goal.

[Figure 2: two clusters C1 and C2 of different densities, with isolated points o1 and o2]
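scikit-learn provides an implementation of this algorithm as sklearn.neighbors.LocalOutlierFactor. A minimal sketch on synthetic data shaped like Figure 2 (the cluster parameters here are made up for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# a dense cluster (C1), a sparse cluster (C2), and a few isolated points
C1 = 0.2 * rng.randn(100, 2)
C2 = 1.0 * rng.randn(100, 2) + 5
o = rng.uniform(low=-4, high=9, size=(5, 2))
X = np.r_[C1, C2, o]

# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
y_pred = lof.fit_predict(X)
# the LOF score itself (larger means more anomalous)
scores = -lof.negative_outlier_factor_
print(np.where(y_pred == -1)[0])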

4. Based on the idea of partitioning

A typical algorithm is the Isolation Forest. The idea is:

Suppose we split the data space with a random hyperplane: one cut produces two subspaces (imagine cutting a cake in two with a knife). We then keep cutting each subspace with a new random hyperplane, and the loop continues until every subspace contains only one data point. Intuitively, dense clusters need many cuts before the cutting stops, while low-density points end up isolated in a subspace of their own very early.

The algorithm therefore partitions the space with hyperplanes and in the process builds a binary-tree-like structure (an isolation tree):

[Figure: an isolation tree built by recursive random splits]

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-8, high=8, size=(20, 2))

# fit the model
clf = IsolationForest(max_samples=100*2, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# plot the line, the samples, and the nearest vectors to the plane
xx, yy = np.meshgrid(np.linspace(-8, 8, 50), np.linspace(-8, 8, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green', s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-8, 8))
plt.ylim((-8, 8))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()

