2023 Higher Education Society Cup Mathematical Modeling Ideas - Case: Anomaly Detection

Question ideas

(Ideas will be shared on CSDN as soon as the competition problems are released)

https://blog.csdn.net/dc_sinor?type=blog

1. Introduction – About anomaly detection

Anomaly detection (outlier detection) plays an important role in scenarios such as:

  • Data preprocessing
  • Virus and Trojan detection
  • Industrial product quality inspection
  • Network traffic monitoring

In these scenarios, abnormal samples make up only a small fraction of the data, so classification algorithms such as SVM and logistic regression are not applicable, because:

Supervised learning algorithms require large numbers of both positive and negative samples, so that the algorithm has enough examples to learn the characteristics of each class, and they assume that future samples follow the same distribution as the training samples.

The typical application areas of anomaly detection and of supervised learning are as follows:

Anomaly detection

  • Credit card fraud detection
  • Manufacturing defect detection
  • Data center machine anomaly detection
  • Intrusion detection

Supervised learning

  • Spam identification
  • News categorization

2. Anomaly detection algorithms

1. Kernel density analysis

import tushare
from matplotlib import pyplot as plt

# Fetch daily bars for stock 600680 (tushare's legacy get_hist_data API)
df = tushare.get_hist_data("600680")
# Kernel density estimate of trading volume over the most recent 90 rows
v = df[-90:].volume
v.plot(kind="kde")
plt.show()

Over the past three months, a daily trading volume above 200,000 can be considered abnormal (a sudden volume spike is a risk signal worth watching).
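This thresholding idea can be sketched without a live tushare download. The following uses synthetic volume data and a 3-sigma rule as one common statistical cutoff (the series values and injected spike positions are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Hypothetical 90-day volume series: mostly around 120k shares, non-negative
volume = pd.Series(rng.normal(120_000, 30_000, 90).clip(min=0))
volume.iloc[[10, 55]] = [300_000, 350_000]   # injected abnormal spikes

# Flag days more than 3 standard deviations above the mean (3-sigma rule)
threshold = volume.mean() + 3 * volume.std()
anomalies = volume[volume > threshold]
print(anomalies)
```

Only the two injected spike days exceed the 3-sigma cutoff; on real data the cutoff can also be read off the KDE plot, as done above.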


2. Box plot analysis

import tushare
from matplotlib import pyplot as plt

df = tushare.get_hist_data("600680")
v = df[-90:].volume
# A box plot exposes outliers beyond the whiskers (the 1.5 * IQR rule)
v.plot(kind="box")
plt.show()

From the box plot we can see that if the daily trading volume is below 20,000 or above 80,000, extra vigilance is warranted!
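The box-plot rule can also be computed directly instead of read off the chart. A minimal sketch on synthetic volume data (the numbers are illustrative, not the 20,000/80,000 figures above):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
# Hypothetical 90-day volume series centered around 50k shares
volume = pd.Series(rng.normal(50_000, 10_000, 90))
volume.iloc[[3, 70]] = [5_000, 120_000]      # injected low/high anomalies

# Box-plot (Tukey) rule: outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = volume.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = volume[(volume < low) | (volume > high)]
print(low, high, outliers)
```

These bounds are exactly what the box plot's whiskers encode, so the two approaches flag the same points.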

3. Based on distance/density

A typical algorithm is the Local Outlier Factor (LOF) algorithm. It finds outliers by computing, for each point, the "k-distance", "reachability distance", "local reachability density", and finally the "local outlier factor" itself.

For a visual intuition, consider Figure 2. The points in set C1 are fairly uniform in spacing, density, and dispersion, and can be regarded as one cluster; the points in set C2 can likewise be regarded as a cluster. Points o1 and o2 are relatively isolated and can be regarded as anomalies (outliers). The question is how to make the algorithm general enough to identify outliers for sets like C1 and C2 whose densities differ greatly. LOF achieves exactly this.
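A hedged sketch of LOF on data shaped like Figure 2, using scikit-learn's LocalOutlierFactor (the cluster coordinates and the two isolated points are invented to mirror C1, C2, o1, o2):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# Two clusters of very different density, plus two isolated points
C1 = 0.2 * rng.randn(100, 2)                 # dense cluster around the origin
C2 = 1.0 * rng.randn(100, 2) + [5, 5]        # sparse cluster around (5, 5)
o = np.array([[0.0, 4.0], [-4.0, 0.0]])      # isolated points, like o1 and o2
X = np.vstack([C1, C2, o])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
print(np.where(labels == -1)[0])             # indices flagged as outliers
```

Because LOF scores each point relative to the density of its own neighborhood, both isolated points are flagged even though C1 and C2 themselves have very different densities; the raw scores are available in `lof.negative_outlier_factor_`.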


4. Based on partitioning

A typical algorithm is the Isolation Forest. The idea is:

Suppose we split the data space with a random hyperplane: one cut produces two subspaces (imagine cutting a cake in half with a knife). We then keep cutting each subspace with further random hyperplanes until every subspace contains only one data point. Intuitively, dense clusters require many cuts before the process stops, whereas points in low-density regions end up alone in a subspace very early.

The algorithm therefore repeatedly partitions the space with hyperplanes, building a binary-tree-like structure in the process:
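As a toy illustration of this cutting idea, here is a hand-rolled random-partition depth in one dimension (`isolation_depth` is a hypothetical helper for intuition, not sklearn's implementation):

```python
import random

def isolation_depth(x, data, depth=0, max_depth=50):
    """Count how many random cuts are needed to isolate the value x within data."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    cut = random.uniform(lo, hi)
    # Keep only the side of the cut that x falls into, then cut again
    same_side = [d for d in data if (d <= cut) == (x <= cut)]
    return isolation_depth(x, same_side, depth + 1, max_depth)

random.seed(0)
cluster = [random.gauss(0, 1) for _ in range(200)]
data = cluster + [10.0]                      # one far-away point

def avg_depth(x, trees=100):
    return sum(isolation_depth(x, data) for _ in range(trees)) / trees

# The isolated point needs far fewer cuts on average than a cluster point
print(avg_depth(10.0), avg_depth(cluster[0]))
```

scikit-learn's IsolationForest, used below, builds many such random trees and converts the average isolation depth into an anomaly score.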


import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-8, high=8, size=(20, 2))

# fit the model
clf = IsolationForest(max_samples=100*2, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# plot the line, the samples, and the nearest vectors to the plane
xx, yy = np.meshgrid(np.linspace(-8, 8, 50), np.linspace(-8, 8, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green', s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-8, 8))
plt.ylim((-8, 8))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()


Modeling information

Data Sharing: The Strongest Modeling Data


Origin blog.csdn.net/dc_sinor/article/details/132583820