Outlier detection with machine learning

1. iForest (Isolation Forest) algorithm

This is the recommended method for detecting abnormal values when the sample size is large.

Principle analysis: like a random forest, iForest is composed of a large number of trees. The trees in iForest are called isolation trees, abbreviated iTrees. An iTree is not the same as a decision tree, and its construction is even simpler, because it is a completely random process. The specific procedure is as follows. First, suppose there are N data points in total. To build one iTree, ψ samples are drawn uniformly from the N data points (usually sampling without replacement) as the training set for that tree. From these samples, one feature is selected at random, and a split value is chosen uniformly at random within the range of that feature (between its minimum and maximum). The samples are then divided in two: samples whose value on that feature is less than the split value go to the left child node, and samples whose value is greater than or equal to the split value go to the right child node. This produces one split condition and a left and a right data set, and the same process is repeated recursively on both sides until a termination condition is reached. There are two termination conditions: either the data at a node cannot be divided further (it contains only one sample, or all its samples are identical), or the tree reaches a height of log2(ψ). This is where iTrees differ from decision trees: the algorithm limits the height of the tree. Of course we could leave the height unlimited, but for efficiency the algorithm only needs to grow to a depth of log2(ψ).
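To make the construction process concrete, here is a minimal sketch of building a single iTree. The function name build_itree, the dictionary-based node representation and the height_limit argument are illustrative assumptions, not sklearn's internal implementation.

import numpy as np

# Minimal sketch of building one iTree (illustrative only).
# X is a 2-D array of the psi sub-sampled rows; height_limit = ceil(log2(psi)).
def build_itree(X, current_height, height_limit):
    # Termination: the node cannot be split further, or the height limit log2(psi) is reached
    if current_height >= height_limit or len(X) <= 1 or np.all(X == X[0]):
        return {"type": "leaf", "size": len(X)}
    # Randomly pick a feature, then a split value uniformly between its min and max
    q = np.random.randint(X.shape[1])
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:  # the chosen feature is constant here, cannot split on it
        return {"type": "leaf", "size": len(X)}
    p = np.random.uniform(lo, hi)
    left, right = X[X[:, q] < p], X[X[:, q] >= p]
    return {
        "type": "internal", "feature": q, "split": p,
        "left": build_itree(left, current_height + 1, height_limit),
        "right": build_itree(right, current_height + 1, height_limit),
    }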

Second, once all the iTrees have been built, we can score test data. To predict for a test point, we walk it down each iTree according to the branching conditions until it reaches a leaf node, and record the path length h(x) traversed along the way, i.e. the number of edges passed from the root node through the internal nodes to the leaf. Finally, h(x) is plugged into the anomaly score formula (for the exact formula see https://www.cnblogs.com/pinard/p/9314198.html) to compute an anomaly score for each test point. If the score is close to 1, the point is very likely an anomaly; if the score is much smaller than 0.5, the point can basically be judged as normal data; and if all the scores are around 0.5, the data contains no obviously abnormal samples.
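As a rough sketch of how such a score behaves (the helper names c and anomaly_score are assumptions; the constants follow the usual iForest formula s(x, ψ) = 2^(−E[h(x)]/c(ψ)) with c(ψ) = 2H(ψ−1) − 2(ψ−1)/ψ and H(i) ≈ ln(i) + 0.5772):

import numpy as np

# c(psi): average path length of an unsuccessful BST search over psi samples
def c(psi):
    if psi > 2:
        harmonic = np.log(psi - 1) + 0.5772156649  # H(i) ~= ln(i) + Euler-Mascheroni constant
        return 2.0 * harmonic - 2.0 * (psi - 1) / psi
    return 1.0 if psi == 2 else 0.0

def anomaly_score(avg_path_length, psi):
    # avg_path_length is E[h(x)], the path length averaged over all iTrees for one test point
    return 2.0 ** (-avg_path_length / c(psi))

# A very short path (the point is isolated quickly) pushes the score toward 1, i.e. likely anomaly;
# a path close to c(psi) gives a score near 0.5, i.e. an unremarkable point.
print(anomaly_score(2.0, 256))      # roughly 0.87
print(anomaly_score(c(256), 256))   # exactly 0.5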

 

Usage in sklearn:

 

The algorithm essentially needs no configuration and can be used out of the box; the commonly used parameters are the following (noticeably fewer than for a random forest):

n_estimators: the number of iTrees to build, default 100

max_samples: the sub-sample size ψ drawn for each tree, default 'auto' (256, or the full sample size if smaller)

max_features: the number of features used per tree, default all features; for high-dimensional data only a subset of the features may be selected

 

from sklearn.ensemble import IsolationForest

ilf = IsolationForest()   # default parameters as described above
ilf.fit(X)                # X is the training data; no labels are passed
s = ilf.predict(X)        # array of 1 (normal) and -1 (possible outlier)

 

predict returns an array containing only the elements 1 and -1; a value of -1 marks a point that may be an outlier.
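If the raw scores are wanted in addition to the 1/-1 labels, IsolationForest also exposes score_samples; a small usage sketch on toy data (the parameter values, the toy data and the filtering step are just illustrative assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).randn(500, 2)   # toy data for illustration
ilf = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
ilf.fit(X)

labels = ilf.predict(X)        # 1 = considered normal, -1 = possible outlier
scores = ilf.score_samples(X)  # the lower (more negative), the more abnormal the point
X_clean = X[labels == 1]       # keep only the samples not flagged as outliers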

2. OneClassSVM, another commonly used outlier detection algorithm

This is the recommended outlier detection method when there are few samples.

It is an unsupervised learning method. Here we only explain one particular idea behind it, SVDD. For SVDD, we expect all samples to belong to the positive (non-anomalous) class, and it uses a hypersphere rather than a hyperplane to do the division: the algorithm obtains a spherical boundary around the data in feature space and tries to minimize the volume of this hypersphere, thereby minimizing the influence of outlier data. For a new data point z, if the distance from z to the center is less than or equal to the radius r, the point lies inside the hypersphere and is not abnormal; if it falls outside the hypersphere, it is an outlier. Usage in sklearn:

from sklearn.svm import OneClassSVM

clf = OneClassSVM()       # unsupervised: no class labels are passed
clf.fit(X)                # X is the training data
yhat = clf.predict(X)     # 1 = considered normal, -1 = possible outlier
print(yhat)

Note: since this is an unsupervised learning method, no list of class labels is passed; X is the training data. The return value, as with iForest, is an array containing only 1 and -1, where -1 marks a possible outlier. After detection, the flagged samples can be screened out of the data.
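A rough sketch of that screening step on toy data (the nu value and the toy data here are only illustrative assumptions, not recommendations):

import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.RandomState(0).randn(100, 2)   # toy data for illustration
clf = OneClassSVM(nu=0.05)                   # nu bounds the fraction of training points flagged
clf.fit(X)
yhat = clf.predict(X)                        # 1 = inside the boundary, -1 = possible outlier
X_clean = X[yhat == 1]                       # screen the flagged samples out of the data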

 


Origin: www.cnblogs.com/dyl222/p/11122226.html