Data analysis: normalization, standardization and centralization / zero-mean

1 Concepts
  Normalization: 1) Converts data into decimals within (0, 1) or (-1, 1). It was proposed mainly for convenience of data processing: mapping the data into the 0-to-1 range makes processing simpler and faster. 2) It turns a dimensional expression into a dimensionless one, so that indicators with different units or orders of magnitude can be compared and weighted. Normalization is thus a way to simplify calculation: an expression with units is transformed into a dimensionless, pure number.
  Standardization: In machine learning we may have to deal with different kinds of data, such as audio signals and image pixel values, and such data may be high-dimensional. After standardization every feature has a mean of 0 (the mean of each feature is subtracted from its raw values) and a standard deviation of 1. This method is widely used in many machine learning algorithms (for example: support vector machines, neural networks, and logistic regression).
  Centralization: the mean becomes 0; there is no requirement on the standard deviation.
  Difference between normalization and standardization: normalization converts feature values onto the same scale, mapping the samples into the [0, 1] or [-1, 1] interval; it is determined only by the extreme values of the variable (interval scaling is one kind of normalization). Standardization processes the data column by column of the feature matrix: using the z-score it converts each feature toward a standard normal distribution, which is tied to the overall sample distribution, so every sample point influences the standardization. What they have in common is that both remove the error caused by differing units or scales, and both are linear transformations that scale the vector X proportionally and then translate it.
  Difference between standardization and centralization: standardization subtracts the mean from the raw score and then divides by the standard deviation, while centralization only subtracts the mean. Hence the usual procedure is to center first and then standardize.
  Dimensionless: my understanding is that in some way we remove the units of the actual measurements, thereby simplifying the calculation.
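To make the three definitions above concrete, here is a minimal numpy sketch (my own toy values, not from the original post) applying each transform to a small 1-D feature:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

x_minmax = (x - x.min()) / (x.max() - x.min())   # normalization: squeezes values into [0, 1]
x_zscore = (x - x.mean()) / x.std()              # standardization: mean 0, standard deviation 1
x_center = x - x.mean()                          # centralization: mean 0, spread unchanged

print(x_minmax)
print(x_zscore.mean(), x_zscore.std())           # approximately 0.0 and 1.0
print(x_center.mean())                           # approximately 0.0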

2 Why normalization / standardization?
  As mentioned earlier, normalization/standardization is essentially a linear transformation. Linear transformations have many good properties which guarantee that transforming the data does not "invalidate" it but can instead improve its behavior; these properties are the premise of normalization/standardization. One particularly important property: a linear transformation does not change the ordering of the original data values.
(1) Required by the solving procedure of some models
  1) When gradient descent is used to solve an optimization problem, normalization/standardization can speed up gradient descent, i.e. improve the model's convergence speed. As shown in the left figure, without normalization/standardization the contour lines tend to be elongated ellipses, and the iterations are likely to follow a zig-zag path (perpendicular to the long axis), so many iterations are needed before convergence. In the right figure, where both features are normalized, the contour lines become nearly circular and gradient descent converges much faster (a numeric sketch of this appears after point 2 below).

[Figure: elliptical contour lines vs. circular contour lines]
[Figure: gradient descent trajectories]
  2) Some classifiers need to compute distances between samples (e.g. Euclidean distance), for example KNN. If one feature has a very large value range, the distance computation is dominated by that feature, which may contradict the actual situation (for example, the feature with the small value range may in reality be the more important one).
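As a rough numeric illustration of point 1), the following sketch (my own toy setup, not taken from the post) fits a two-feature linear regression with plain gradient descent, once on raw features whose scales differ by a factor of about 1000 and once after z-score standardization; with the same number of steps, the raw version is still far from the optimum:

import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.uniform(0, 1, n),        # feature with range ~[0, 1]
                     rng.uniform(0, 1000, n)])    # feature with range ~[0, 1000]
y = 3.0 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(0, 0.01, n)

def mse_after_gd(X, y, steps=500):
    """Gradient descent on the MSE (with intercept) using the largest safe step size."""
    A = np.column_stack([np.ones(len(X)), X])     # add an intercept column
    H = 2 * A.T @ A / len(y)                      # Hessian of the MSE
    lr = 1.0 / np.linalg.eigvalsh(H).max()        # step size ~ 1/L, safely stable
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * 2 * A.T @ (A @ w - y) / len(y)
    return np.mean((A @ w - y) ** 2)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score standardization
print("raw features:          MSE =", mse_after_gd(X, y))      # still far from the optimum
print("standardized features: MSE =", mse_after_gd(X_std, y))  # close to the noise level (~1e-4)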
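And for point 2), a toy example (my own numbers, with assumed feature ranges) showing how a large-scale feature dominates the Euclidean distance until the features are min-max scaled:

import numpy as np

a = np.array([5000.0, 2.0])     # [monthly income, number of rooms]
b = np.array([5100.0, 2.0])     # same rooms, slightly higher income
c = np.array([5000.0, 5.0])     # same income, very different number of rooms

# Raw Euclidean distances: the income axis dominates, so b looks far from a (100.0)
# while c looks close (3.0), purely because income is numerically much larger.
print(np.linalg.norm(a - b), np.linalg.norm(a - c))

# After min-max scaling each feature to [0, 1] (assumed ranges for this sketch),
# both features contribute on comparable scales and the ordering flips.
mins = np.array([3000.0, 1.0])
maxs = np.array([20000.0, 6.0])

def scale(v):
    return (v - mins) / (maxs - mins)

print(np.linalg.norm(scale(a) - scale(b)), np.linalg.norm(scale(a) - scale(c)))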

(2) Making data dimensionless
  For example, the number of houses and the income: from the business perspective we know the two are equally important, so both should be normalized. This normalization is decided at the business level.

(3) Avoiding numerical problems
  Values that are too large can cause numerical problems.

3 Data preprocessing
3.1 Normalization
(1) Min-max normalization
  x' = (x - x_min) / (x_max - x_min)

(2) Mean normalization
  x' = (x - μ) / (x_max - x_min)
  Both (1) and (2) have a drawback: when new data is added, x_max and x_min may change, so they have to be recomputed.
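A small sketch of that drawback (my own example values): one new sample outside the old [min, max] changes the min-max mapping of every existing value.

import numpy as np

x = np.array([10.0, 20.0, 30.0])
old = (x - x.min()) / (x.max() - x.min())               # 10 -> 0.0, 20 -> 0.5, 30 -> 1.0

x_new = np.append(x, 100.0)                             # a new, larger observation arrives
new = (x - x_new.min()) / (x_new.max() - x_new.min())   # the same three points, remapped
print(old)   # [0.   0.5  1.  ]
print(new)   # [0.   ~0.11 ~0.22]  -- every previously scaled value must be recomputed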

(3) Non-linear normalization
  1) Logarithmic transformation: y = log10(x)
  2) Arctangent transformation: y = atan(x) * 2 / π
  Non-linear normalization is often used when the data is very spread out, with some values very large and others very small. The original values are mapped through some mathematical function; such methods include log, exponential, tangent, etc. The curve of the non-linear function has to be chosen according to the data distribution, e.g. log(V, 2) versus log(V, 10).
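A minimal sketch of the two mappings above (my own example values; the log transform assumes positive inputs):

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 100000.0])

y_log = np.log10(x)                    # log transform compresses large values
y_atan = np.arctan(x) * 2 / np.pi      # maps (0, +inf) into (0, 1)

print(y_log)    # [0. 1. 2. 3. 5.]
print(y_atan)   # approaches 1 quickly as x grows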

3.2 Standardization
(1) Z-score standardization (standard deviation standardization / zero-mean standardization)
  x' = (x - μ) / σ

3.3 Centralization
  x' = x - μ
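Both 3.2 and 3.3 can be done with scikit-learn's StandardScaler, shown here as a short sketch (with_std=False gives centering only):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])

X_zscore = StandardScaler().fit_transform(X)                  # z-score: mean 0, std 1 per column
X_center = StandardScaler(with_std=False).fit_transform(X)    # centering only: mean 0 per column

print(X_zscore.mean(axis=0), X_zscore.std(axis=0))   # ~[0, 0] and [1, 1]
print(X_center.mean(axis=0))                         # ~[0, 0]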

4 When to use normalization and when to use standardization?
  (1) If there is a requirement on the range of the output, use normalization.
  (2) If the data is fairly stable, without extreme maximum or minimum values, use normalization.
  (3) If the data contains outliers and a lot of noise, use standardization; through centering it can indirectly reduce the influence of outliers and extreme values.
  One Zhihu answerer mentions his personal experience: in general, I personally suggest trying standardization first, and other methods, such as normalization or more sophisticated ones, only when there are requirements on the output. Many methods can bring the output range to [0, 1]; if we have an assumption about the distribution of the data, a more effective approach is to transform with the corresponding probability density function. Taking the Gaussian distribution as an example, we can first compute the Gaussian error function, denoted here erfc(·), and then transform with the following formula:

[Formula image: transformation based on the Gaussian error function erfc(·)]
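Since the original formula appears only as an image, the sketch below uses a common form of this idea, which is my reading of the description rather than necessarily the exact formula from the answer: push an (assumed) Gaussian feature through its CDF written with the complementary error function, which maps it into (0, 1).

import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)        # feature assumed to be Gaussian

# Gaussian CDF written with erfc (assumed form): Phi(x) = 0.5 * erfc(-(x - mu) / (sigma * sqrt(2)))
mu, sigma = x.mean(), x.std()
x01 = 0.5 * erfc(-(x - mu) / (sigma * np.sqrt(2)))   # values now lie in (0, 1)
print(x01.min(), x01.max())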
  Another blog post mentions the author's experience: 1) In classification and clustering algorithms, when distance is used to measure similarity, or when PCA is used for dimensionality reduction, the second method (z-score standardization) performs better. 2) When no distance measure or covariance computation is involved, or the data does not follow a normal distribution, the first method (min-max normalization) or other normalization methods can be used. For example, in image processing, after converting an RGB image to a grayscale image its values are restricted to the range [0, 255].

 

5 Which models require normalization/standardization?
(1) SVM
  Different models make different assumptions about the feature distribution. For instance, when an SVM uses a Gaussian (RBF) kernel, all dimensions share one variance (a single kernel bandwidth), which effectively assumes the feature distribution is spherical; feeding it elliptical inputs hurts it, so simple normalization is not even good enough and a dose of whitening is what really helps. With tree-based models, by contrast, each dimension computes its own split points, so it does not matter.
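A minimal scikit-learn sketch of the usual remedy (my own example, not from the post): put a StandardScaler in front of an RBF-kernel SVM; on a dataset whose features have very different scales, the scaled pipeline typically scores much higher. A whitening step could be added with PCA(whiten=True) if desired.

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)      # features whose scales differ by orders of magnitude

raw = SVC(kernel="rbf")                                    # no scaling
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # z-score, then RBF SVM

print("raw   :", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())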

(2) KNN
  Models that measure distances generally need normalization/standardization when the feature values differ greatly in scale; otherwise the "large numbers swallow the small numbers".

(3) Neural networks
  1) Numerical problems
  Normalization/standardization can avoid some unnecessary numerical problems. The magnitude of the input variables may not seem large enough to cause numerical problems, but in fact they are not that hard to trigger. The non-linear region of tansig (tanh) is roughly [-1.7, 1.7], which means that for a neuron to be effective, the quantity w1x1 + w2x2 + b inside tansig(w1x1 + w2x2 + b) should be on the order of 1 (the order of magnitude of 1.7). When the input is large, the weights must therefore be small; multiplying one large number by one small number is exactly what triggers numerical problems.
  Suppose your input is 421. You may not consider that a very large number, but the effective weight will then be around 1/421, say 0.00243; if you evaluate 421·0.00243 == 0.421·2.43 in MATLAB, you will find the two are not equal. That is a numerical problem.
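The same comparison behaves the same way under IEEE-754 double precision in Python:

# Both products are 1.02303 in exact arithmetic, but the roundings can land on
# adjacent doubles, so the equality test fails.
a = 421 * 0.00243      # large input times a tiny effective weight
b = 0.421 * 2.43       # the same product with both factors rescaled
print(a == b)          # False here: the two roundings differ
print(abs(a - b))      # a difference on the order of machine epsilon (~2e-16)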

  2) Required by the solving procedure
  a. Initialization: at initialization we want every neuron to start in an effective state. The tansig function has good non-linearity in [-1.7, 1.7], so we want both the function's input and the initial weights to fall in a reasonable range, so that every neuron is effective at the start. (If the weights are initialized in [-1, 1] while the inputs are not normalized and are too large, the neurons will saturate.)
  b. Gradient: taking a three-layer BP network (input - hidden - output) as an example, the gradient of the input-to-hidden weights has the form 2·e·w·(1 - a²)·x (e is the error, w is the hidden-to-output weight, a is the hidden neuron's activation, x is the input). If the outputs are of a very large magnitude, e becomes very large; similarly, to map the hidden layer (magnitude around 1) onto the output layer, w must also be large; and if x is large as well, the gradient formula shows that the product of the three becomes enormous, which creates numerical problems for the gradient update.
  c. Learning rate: from b. we know the gradient can be very large, so the learning rate must be very small. The choice of the (initial) learning rate therefore has to take the range of the inputs into account; it is simpler to just normalize the data, so that the learning rate no longer needs to be tuned to the data range. The gradient of the hidden-to-output weights can be written as 2·e·a, while the gradient of the input-to-hidden weights is 2·e·w·(1 - a²)·x; influenced by x and w, the two gradients have different magnitudes, so they need learning rates of different magnitudes. A learning rate suited to w1 may be far too small for w2: using the rate suited to w1 makes progress along w2 extremely slow and wastes a lot of time, while using the rate suited to w2 is too large for w1 and never finds a good solution for w1. With a fixed learning rate and un-normalized data, the consequences are easy to imagine.
  d. Search trajectory: already explained (see the contour discussion in Section 2).
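A small numeric sketch of the two gradient expressions quoted in b. and c. (made-up values for e, w and a, held fixed for simplicity), showing how an unscaled input inflates the input-to-hidden gradient but not the hidden-to-output one:

# e: error, w: hidden-to-output weight, a: hidden activation (toy values)
e, w, a = 0.5, 0.8, 0.6

for x in (421.0, 0.421):                     # raw input vs. the same input scaled down
    g_hidden_out = 2 * e * a                 # hidden-to-output gradient: independent of x
    g_in_hidden = 2 * e * w * (1 - a ** 2) * x   # input-to-hidden gradient: proportional to x
    print("x =", x, "-> grad(hidden->out) =", g_hidden_out, ", grad(in->hidden) =", g_in_hidden)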
  
(4) PCA
  PCA is computed from the covariance of the features, so it requires centering, and features on larger scales dominate the principal components; in practice z-score standardization is usually applied first (cf. the experience quoted in Section 4).
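A minimal sketch (synthetic data of my own) of why PCA usually needs standardization: without it, the large-scale feature absorbs nearly all of the explained variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 300),        # feature with std ~1
                     rng.normal(0, 1000, 300)])    # feature with std ~1000

print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# ~[1.0, 0.0]: the first component is essentially just the large-scale feature

X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_)
# ~[0.5, 0.5]: after standardization both features contribute comparably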

References:
What is the difference between standardization and normalization? - Zhihu: https://www.zhihu.com/question/20467170
R -- data standardization, normalization and centering: https://zhuanlan.zhihu.com/p/33727799
What is the role of "normalization" in feature engineering? - Zhihu: https://www.zhihu.com/question/20455227
Why do neural networks need normalization: http://nnetinfo.com/nninfo/showText.jsp?id=37

Author: brucep3
Link: https://www.jianshu.com/p/95a8f035c86c
Source: Jianshu
The copyright belongs to the author. For commercial reproduction please contact the author for authorization; for non-commercial reproduction please cite the source.
