1. Introduction – About anomaly detection
Anomaly detection (outlier detection) plays an important role in scenarios such as:
- Data preprocessing
- Virus and Trojan detection
- Industrial product quality inspection
- Network traffic monitoring
Since abnormal samples account for only a small fraction of the data in these scenarios, classification algorithms such as SVM and logistic regression are not applicable, because:
Supervised learning algorithms are suited to problems with large numbers of both positive and negative samples, so that there are enough examples for the algorithm to learn their characteristics, and they assume that future samples follow the same distribution as the training samples.
The following compares the typical application domains of anomaly detection and supervised learning:
Anomaly detection
- Credit card fraud
- Defect detection in manufacturing
- Data center machine anomaly detection
- Intrusion detection
Supervised learning
- Spam identification
- News categorization
2. Anomaly detection algorithms
1. Kernel density estimation
import tushare
from matplotlib import pyplot as plt

# Daily history for stock 600680; keep the last 90 trading days
df = tushare.get_hist_data("600680")
v = df[-90:].volume

# Kernel density estimate of the volume distribution
v.plot(kind="kde")
plt.show()
Over the past three months, a trading volume above 200,000 can be taken as a sign that something abnormal has happened (and with volume like that, mind the risk...).
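Such a cutoff can also be derived from the data itself. Below is a minimal sketch using the common 3σ rule on the same series (the 3σ rule is an assumption of this sketch, not something the article prescribes):

import tushare

df = tushare.get_hist_data("600680")
v = df[-90:].volume

# Flag days whose volume lies more than 3 standard deviations above the mean
threshold = v.mean() + 3 * v.std()  # assumed 3-sigma cutoff
print("threshold:", threshold)
print(v[v > threshold])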
2. Box plot analysis
import tushare
from matplotlib import pyplot as plt

df = tushare.get_hist_data("600680")
v = df[-90:].volume

# Box plot of the volume distribution over the last 90 trading days
v.plot(kind="box")
plt.show()
Broadly speaking, the box plot tells us that if the stock's trading volume falls below 20,000 or rises above 80,000, you should be more vigilant!
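Those vigilance bounds correspond to the box plot's whiskers. A minimal sketch of the conventional 1.5 × IQR fence computation on the same series (the 1.5 multiplier is the usual default, assumed here):

import tushare

df = tushare.get_hist_data("600680")
v = df[-90:].volume

# Quartiles and the 1.5 * IQR fences that box plot whiskers are built on
q1, q3 = v.quantile(0.25), v.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("vigilance bounds:", lower, upper)
print(v[(v < lower) | (v > upper)])  # points outside the fences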
3. Based on distance/density
A typical algorithm is: "Local Outlier Factor Algorithm-Local Outlier Factor". This algorithm can reach distance", as well as "local reachability density, local reachability density" and "local outlier factor, local outlier factor" to find outliers.
Use a visual intuition to feel it, as shown in Figure 2. For the points in the C1 set, the overall spacing, density, and dispersion are relatively uniform, and can be considered to be the same cluster; for the points in the C2 set, they can also be considered to be a cluster. Points o1 and o2 are relatively isolated and can be considered as abnormal points or discrete points. The current question is how to realize the versatility of the algorithm to meet the requirements of outlier identification for sets such as C1 and C2 with very different density dispersion. LOF can achieve our goals.
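scikit-learn ships an LOF implementation. A minimal sketch on synthetic data (the two clusters below merely stand in for C1 and C2; they are generated here, not taken from Figure 2):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# A tight cluster and a loose cluster of very different densities,
# plus two isolated points playing the role of o1 and o2
C1 = 0.3 * rng.randn(100, 2)
C2 = 2.0 * rng.randn(100, 2) + 8
X = np.vstack([C1, C2, [[4.0, -4.0], [14.0, 16.0]]])

clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)  # -1 marks outliers, +1 marks inliers
print(X[y_pred == -1])       # the isolated points are expected here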
4. Based on the idea of partitioning
A typical algorithm is the Isolation Forest. The idea is as follows:
Suppose we split the data space with a random hyperplane; one cut produces two subspaces (imagine cutting a cake in half with a knife). We then keep cutting each subspace with further random hyperplanes, looping until every subspace contains only one data point. Intuitively, high-density clusters survive many cuts before all their points are isolated, while low-density points tend to end up alone in a subspace very early.
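As a hedged illustration of this isolation idea (path_length below is a hypothetical helper written for this sketch, not a library function):

import numpy as np

def path_length(x, X, rng, depth=0, max_depth=10):
    # Recursively apply random axis-aligned cuts and return the depth
    # at which the point x ends up alone in its subspace
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.randint(X.shape[1])      # pick a random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)      # pick a random cut value
    # Keep only the side of the cut that contains x
    side = X[:, f] < split if x[f] < split else X[:, f] >= split
    return path_length(x, X[side], rng, depth + 1, max_depth)

rng = np.random.RandomState(0)
X = np.vstack([0.3 * rng.randn(100, 2), [[6.0, 6.0]]])  # cluster + one outlier
for x in (X[0], X[-1]):
    avg = np.mean([path_length(x, X, rng) for _ in range(50)])
    print(x, "average isolation depth:", avg)

On average, the cluster point takes many cuts to isolate, while the outlier at (6, 6) is isolated after only a few; that difference in path length is exactly the signal an Isolation Forest scores.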
The full algorithm repeats this hyperplane partitioning to build an ensemble of such binary trees; the scikit-learn demo below puts it to use:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data: four Gaussian blobs
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some regular novel observations from the same blobs
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some abnormal novel observations spread uniformly
X_outliers = rng.uniform(low=-8, high=8, size=(20, 2))

# Fit the model; each tree is grown on a subsample of 200 points
clf = IsolationForest(max_samples=200, random_state=rng)
clf.fit(X_train)
# predict returns +1 for inliers and -1 for outliers
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# Plot the anomaly-score contours and the three sample groups
xx, yy = np.meshgrid(np.linspace(-8, 8, 50), np.linspace(-8, 8, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red')
plt.axis('tight')
plt.xlim((-8, 8))
plt.ylim((-8, 8))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()