Notes based on Andrew Ng's Stanford "Machine Learning" videos. The videos do not go into much detail; for the theory, consult Li Hang's "Statistical Learning Methods". What follows is only an outline.
1 Anomaly Detection
Model the data with a probability distribution \(p(x)\); a new sample is flagged as anomalous when \(p(x_{test})\) is small.
e.g.
- Fraud detection: identifying abnormal user behavior
- Manufacturing
- Monitoring computers in a data center
1.1 Gaussian (Normal) Distribution \(x \sim N(\mu, \sigma^2)\)
\(\mu\): the mean; controls the center of the bell curve
\(\sigma^2\): the variance; controls the width of the bell curve
\(p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})\)
Parameter Estimation
\(\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}\)
\(\sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)^2\)
Density Estimation
\(p(x)=p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma_2^2)p(x_3;\mu_3,\sigma_3^2)\cdots p(x_n;\mu_n,\sigma_n^2)=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)\)
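The parameter and density estimation above can be sketched in a few lines of NumPy (the synthetic data and variable names here are illustrative, not from the course):

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature mean and variance of an m x n data matrix (1/m normalization)."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) = product over features j of N(x_j; mu_j, sigma_j^2)."""
    p = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return p.prod(axis=1)

# Fit on (mostly normal) synthetic data, then score two new samples
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
mu, sigma2 = fit_gaussian(X)
p_test = density(np.array([[0.0, 0.0], [6.0, 6.0]]), mu, sigma2)
# p_test[1] (the far-away point) comes out much smaller than p_test[0]
```

Note the \(\frac{1}{m}\) normalization matches the estimator above; using \(\frac{1}{m-1}\) makes little difference for large \(m\).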
1.2 Anomaly Detection Algorithm
- Choose features (see Section 1.4)
- Estimate the parameters \(\mu_j, \sigma_j^2\) (or \(\mu, \Sigma\); see Section 1.5)
- For a new sample, compute the density \(p(x)\); if \(p(x) < \epsilon\), flag it as anomalous
Development and evaluation
Training set: 60% of the normal (non-anomalous) samples; estimate the feature means and variances and build the \(p(x)\) function
Cross-validation set: 20% of the normal samples + 50% of the anomalous samples; used to choose \(\epsilon\), selected by the \(F_1\) score
Test set: the remaining 20% of normal samples + 50% of anomalous samples
Metrics:
- TP,FN,FP,TN
- Precision / recall
- \(F_1\)-score
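Choosing \(\epsilon\) amounts to sweeping candidate thresholds over the cross-validation set and keeping the one with the best \(F_1\). A sketch (the function name and the candidate grid are illustrative; `p_cv` holds the densities on the CV set and `y_cv` the labels, with 1 = anomaly):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Pick the epsilon that maximizes F1 on the cross-validation set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)       # 1 = flagged as anomalous
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                          # F1 undefined / zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

For example, `select_epsilon(p_cv, y_cv)` on a CV set where the anomalies have the lowest densities returns a threshold that separates them cleanly.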
1.3 Anomaly Detection vs. Supervised Learning
Anomaly detection | Supervised learning |
---|---|
Very few positive examples (anomalies, \(y=1\)) and a large number of negative examples (\(y=0\)) | Large numbers of both positive and negative examples |
Many different types of anomalies; it is hard for an algorithm to learn what anomalies look like from so few positive examples. | Enough positive examples for the algorithm to learn what they look like. |
Future anomalies may look nothing like any anomaly seen so far. | Future positive examples are likely to be similar to those in the training set. |
1.4 Choosing Features
Non-Gaussian features: apply a log, square root, or similar transform so the feature's distribution looks closer to Gaussian
Anomalous samples that score like normal ones (comparable \(p(x)\)): examine the misclassified examples and create new features that separate them
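A quick sketch of the transform idea on a right-skewed feature (the synthetic exponential data and the skewness check are illustrative, not from the course):

```python
import numpy as np

# A right-skewed, clearly non-Gaussian feature
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10000)

# Candidate transforms that pull the long tail in
x_log = np.log(x + 1)      # log(x + c); the constant c avoids log(0)
x_sqrt = np.sqrt(x)

def skewness(v):
    """Sample skewness; roughly 0 for a symmetric, Gaussian-like shape."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

# skewness(x_log) and skewness(x_sqrt) are much closer to 0 than skewness(x)
```

In practice one tries a few transforms (and constants \(c\)) and keeps whichever makes the feature's histogram look most bell-shaped.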
1.5 Multivariate Gaussian Distribution
Instead of modeling each \(p(x_j)\) separately, model \(p(x)\) all at once; the parameter \(\mu\) is an \(n\)-dimensional vector, and \(\Sigma\) is an \(n \times n\) covariance matrix
\(p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2})\)
Parameter Estimation
\(\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}\)
\(\Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T\)
The per-feature Gaussian model used earlier is a special case of the multivariate Gaussian, namely the case where all off-diagonal elements of \(\Sigma\) are zero
Advantages: automatically captures correlations between features (the original model requires manually creating new combined features)
Disadvantages: computationally more expensive; requires \(m > n\), otherwise the covariance matrix is not invertible (rule of thumb: \(m \ge 10n\))
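A sketch of the multivariate model in plain NumPy (`scipy.stats.multivariate_normal` would also work; the correlated synthetic data is illustrative):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """mu: n-vector; Sigma: n x n covariance (1/m normalization, as above)."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / m
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """Multivariate Gaussian density for each row of X."""
    n = mu.size
    D = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    # einsum computes the quadratic form (x - mu)^T Sigma^{-1} (x - mu) per row
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', D, inv, D))

# Two strongly correlated features: the off-diagonal of Sigma is nonzero,
# so points off the correlation line get low density with no manual features
rng = np.random.default_rng(0)
z = rng.standard_normal(1000)
X = np.column_stack([z, 0.9 * z + 0.1 * rng.standard_normal(1000)])
mu, Sigma = fit_multivariate_gaussian(X)
```

Here a point like \((1.0, -0.9)\), far from the correlation line, scores a much lower density than \((1.0, 0.9)\) on it, which is exactly the relationship the per-feature model would miss.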