引言

在异常检测领域中，我们常常需要决定新观察的点是否属于与现有观察点相同的分布（则它称为inlier），或者被认为是不同的（称为outlier）。在这里，必须做出两个重要的区别：

异常值检测，outlier detection：

训练数据包含异常值，这些异常值被定义为远离其他异常值的观察值。因此，异常检测估计器试图适应训练数据最集中的区域，忽略不正常的观察。

新颖点检测，novelty detection：

训练数据不受异常值的污染，我们有兴趣检测新观察是否是异常值。在这种情况下，异常值也被称为新颖点（a novelty）。

异常值检测和新颖性检测都属于异常检测，都是用来检测异常的、不常见的一些观察值。

异常值检测是一种无监督的方法，新颖点检测是一种半监督的异常检测方法。在异常值检测的情况下，异常值不能形成密集的簇，因为异常检测的估计器假设异常值总是位于低密度区域。相反，在新颖性检测的背景下，新颖点可以形成密集的簇，只要它们处于训练数据的低密度区域中。

scikit-learn框架提供了一套机器学习工具，可用于新颖性或异常值检测。这个策略是通过从数据中以无人监督的方式学习对象来实现的：

estimator.fit(X_train)

新的观测值，可以用如下方法来做个“预测”，即判断：

estimator.predict(X_test)

这里，正常的点（内部点，inliers）的标签是1，异常点是-1。预测方法利用估计器计算的原始评分函数的阈值。可以通过score_samples方法访问该评分函数，而阈值可以通过contamination参数来控制。

decision_function方法也可以从评分函数中定义，负值是异常值，非负值是内部点：

estimator.decision_function(X_test)

需要注意的是， neighbors.LocalOutlierFactor只有一个fit_predict方法，不支持predict, decision_function 和score_samples方法！因为这个估计器原本是用于异常值的检测，可以通过negative_outlier_factor_属性访问训练样本的异常分数。如果你真的想使用neighbors.LocalOutlierFactor进行新奇检测，即预测标签或计算新看不见的数据的异常得分，你可以在拟合估算器之前将新颖性参数设置为True来实例化估算器。在这种情况下，fit_predict不可用。

`异常值检测（Outlier` Detection）

下图是scikit-learn中离群检测算法的比较。局部异常因子（LOF）不显示黑色的决策边界，因为当用于异常值检测时，它没有predict方法应用于新的数据！

ensemble.IsolationForest（孤立森林）和neighbors.LocalOutlierFactor（局部异常因子）在这里的数据集上表现得相当好。已知svm.OneClassSVM对异常值敏感，因此对异常值检测的效果不佳。最后，covariance.EllipticEnvelope假设数据是高斯数并学习椭圆。

`新颖性检测（`Novelty Detection`）`

新颖性检测一般是这样一种场景：考虑具有p维特征的、具有相同分布的n个观察值构成的数据集。现在考虑我们再向该数据集添加一个新的观察值。若新观察值与其他观察值的结果有不同，我们可以怀疑它是否regular，即它是不是来自同一个分布。或者恰恰相反，若它另与其他值非常相似，我们无法将其与原始观察值区分开来。这是新颖性检测工具和方法所解决的问题。

一般来说，新颖点检测将学习一个粗略的边界，界定初始观测分布的轮廓，绘制于p维空间空间中。然后，如果新来的观察值位于边界划分的子空间内，则认为它们来自与初始观测相同的种类。否则，如果他们位于边境之外，我们可以说他们是不正常的。

One-Class SVM是一种新颖性检测的手段，在sklearn框架下，它可以从svm模块中调用到（svm.OneClassSVM）。它需要选择内核和标量参数来定义边界。通常选择RBF内核，尽管没有确切的公式或算法来设置其带宽参数。这是scikit-learn实现中的默认值。ν参数，也称为one-class SVM的边界，对应于在边界外发现新的但有规律的观察的概率。

`sklearn.svm`.OneClassSVM

类信息：

class sklearn.svm.OneClassSVM(kernel=’rbf’, degree=3, gamma=’auto_deprecated’, coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, random_state=None)[source]

一种用于异常值检测的无监督方法。

其参数的具体含义请查看官方使用文档：

它的属性有：

最常用的两个方法，训练和预测：

由于是无监督方法，在fit的时候，不需要喂给它label！

程序实例：

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Generate train data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

# plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Novelty Detection")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')

s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s)
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s)
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s)
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([a.collections[0], b1, b2, c],
           ["learned frontier", "training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left",
           prop=matplotlib.font_manager.FontProperties(size=11))
plt.xlabel(
    "error train: %d/200 ; errors novel regular: %d/40 ; "
    "errors novel abnormal: %d/40"
    % (n_error_train, n_error_test, n_error_outliers))
plt.show()

参考链接：

https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection

See One-class SVM with non-linear kernel (RBF) for visualizing the frontier learned around some data by a svm.OneClassSVM object.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM

Estimating the support of a high-dimensional distribution Schölkopf, Bernhard, et al. Neural computation 13.7 (2001): 1443-1471.