sklearn: A detailed guide to Standardization, Scaling, and Normalization in sklearn.preprocessing (introduction and usage)
Contents
Introduction to Standardization, Scaling, and Normalization
1、Standardization, or mean removal and variance scaling
1.1、Scaling features to a range
1.3、Scaling data with outliers
Introduction to Standardization, Scaling, and Normalization
Reference: https://scikit-learn.org/stable/modules/preprocessing.html
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers are highlighted in the scikit-learn example "Compare the effect of different scalers on data with outliers".
1、Standardization, or mean removal and variance scaling
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data, that is, Gaussian with zero mean and unit variance. In practice we often ignore the shape of the distribution and just transform the data: center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features as expected. The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
X_scaled = preprocessing.scale(X_train)
print(X_scaled)
Scaled data has zero mean and unit variance:
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline.
The scaler instance can then be used on new data to transform it the same way it did on the training set. It is also possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler.
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler)
print(scaler.mean_)
print(scaler.scale_)
print(scaler.transform(X_train))
X_test = [[-1., 1., 0.]]
print(scaler.transform(X_test))
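The text above notes that StandardScaler fits well into the early steps of a sklearn.pipeline.Pipeline. The following is a minimal sketch of that idea, not part of the original article: the synthetic data from make_classification and the choice of LogisticRegression as the downstream estimator are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only; the pipeline then
# reapplies the same mean and scale to the test split when scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(accuracy)
```

Wrapping the scaler in a pipeline keeps the test data out of the fitted statistics, which is the reason for computing the mean and standard deviation on the training set alone.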
1.1、Scaling features to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively. The motivations for using this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.
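As a small illustration of range scaling, the sketch below applies MinMaxScaler and MaxAbsScaler to the same toy matrix used earlier in this article. It uses the default settings of both classes; treat it as an example rather than a recommended configuration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Rescale each feature (column) to the default range [0, 1]
min_max_scaler = MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X_train)
print(X_minmax)

# Divide each feature by its maximum absolute value, giving values in [-1, 1];
# zeros stay zero because no shift is applied
max_abs_scaler = MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_maxabs)
```

Because MaxAbsScaler only divides and never shifts, the zero entries of the input remain zero, which is the property the text refers to for sparse data.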
1.2、Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus is rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.
Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns formats (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the toarray method of sparse matrices is another option.
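The sketch below illustrates scaling sparse input without centering it. The toy matrix is an assumption for illustration. It relies on two behaviors of scikit-learn: MaxAbsScaler works directly on scipy.sparse input, and StandardScaler accepts sparse input when with_mean=False is passed. The last lines show the toarray alternative for data small enough to densify.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Small toy matrix in CSR format (illustrative data, not from the article)
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 4.],
                                       [3., 0., 0.]]))

# Scaling without centering keeps the zero entries, so the result stays sparse
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(sparse.issparse(X_maxabs), sparse.issparse(X_std))

# If the data is small enough, densify first and then center as usual
X_dense = X_sparse.toarray()
X_centered = StandardScaler().fit_transform(X_dense)
print(X_centered.mean(axis=0))
```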
1.3、Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
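To make the effect of an outlier concrete, here is a minimal sketch comparing RobustScaler with StandardScaler on a single feature. The five values are made up for illustration. With its default settings RobustScaler centers on the median and scales by the interquartile range, so one extreme value barely affects how the remaining points are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme outlier (illustrative data)
X = np.array([[1.], [2.], [3.], [4.], [1000.]])

robust = RobustScaler().fit(X)
X_robust = robust.transform(X)

# For comparison: mean/variance scaling is dominated by the outlier
X_standard = StandardScaler().fit_transform(X)

print(robust.center_, robust.scale_)
print(X_robust.ravel())
print(X_standard.ravel())
```

In the StandardScaler output the four ordinary points are squeezed close together, while in the RobustScaler output they keep a usable spread.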
References: Further discussion on the importance of centering and scaling data is available in this FAQ: Should I normalize/standardize/rescale the data?
1.4、Scaling vs Whitening
It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features. To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the linear correlation across features.
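The following sketch shows what whitening does in practice, using sklearn.decomposition.PCA with whiten=True. The correlated toy data is generated here only for illustration and is not from the original article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: the first two features are strongly correlated (illustrative only)
rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# whiten=True rescales the principal components to unit variance,
# so the transformed features are uncorrelated with equal scale
pca = PCA(whiten=True)
X_white = pca.fit_transform(X)

cov = np.cov(X_white, rowvar=False)
print(np.round(cov, 3))
```

The printed covariance matrix of the transformed data should be close to the identity, which is what "removing the linear correlation across features" amounts to.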
1.5、Centering kernel matrices
If you have a kernel matrix of a kernel K that computes a dot product in a feature space defined by a function ϕ, a KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by ϕ followed by removal of the mean in that space.
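For a linear kernel, ϕ is the identity, which makes the statement above easy to check by hand. The sketch below (toy data assumed for illustration) centers a linear kernel matrix with KernelCenterer and compares it with the kernel computed from explicitly mean-centered features.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Linear kernel: K[i, j] is the dot product of samples i and j
K = X @ X.T

# Center the kernel matrix without ever forming the centered features
K_centered = KernelCenterer().fit_transform(K)

# Reference: center the features explicitly, then recompute the kernel
X_centered = X - X.mean(axis=0)
K_reference = X_centered @ X_centered.T

print(np.allclose(K_centered, K_reference))
```

With a nonlinear kernel the feature map is usually not available explicitly, which is the situation KernelCenterer is meant for.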
2、Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
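The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm. The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless, as this operation treats samples independently). This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline.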
The normalizer instance can then be used on sample vectors as any transformer. Note: L2 normalization is also known as spatial sign preprocessing.
X = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)
normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
print(normalizer)
print(normalizer.transform(X))
print(normalizer.transform([[-1., 1., 0.]]))
Sparse input
For sparse input, the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix).
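As a quick check of the sparse behavior, the sketch below passes a matrix in Compressed Sparse Columns format to normalize and inspects the result. The toy matrix is the one used earlier in this article; the choice of CSC as the input format is an assumption made to show the conversion.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Start from CSC on purpose to show the conversion to CSR
X_sparse = sparse.csc_matrix(np.array([[1., -1., 2.],
                                       [2., 0., 0.],
                                       [0., 1., -1.]]))

X_l2 = normalize(X_sparse, norm='l2')
X_l1 = normalize(X_sparse, norm='l1')
print(X_l2.format)

# Each row (sample) now has unit norm under the chosen norm
row_l2 = np.sqrt(np.asarray(X_l2.multiply(X_l2).sum(axis=1)).ravel())
row_l1 = np.asarray(abs(X_l1).sum(axis=1)).ravel()
print(row_l2, row_l1)
```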