The Difference Between Normalization and Standardization

Author: Komatsu
Link: https://www.zhihu.com/question/20467170/answer/463834078
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.
 

Normalization: a linear transformation that scales the data into a specific range of values without changing the shape of the data distribution.

1. Min-max normalization: scales the value range to [0, 1] without changing the data distribution;

$$x' = \frac{x - \min}{\max - \min}$$

2. Z-score normalization: scales values to be centered around 0 without changing the data distribution;

$$x' = \frac{x - \mu}{\sigma}$$

Standardization: a nonlinear transformation that changes the distribution of the data so that it matches some target distribution (such as the normal distribution).

1. Box-Cox standardization:

$$x' = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln x, & \lambda = 0 \end{cases}$$
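As a quick illustration, below is a minimal sketch of the Box-Cox transform using SciPy's `scipy.stats.boxcox`. The log-normal toy data is made up for illustration; Box-Cox requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

# Skewed, strictly positive toy data (log-normal), for illustration only.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox estimates the lambda that brings the data closest to normal
# and returns the transformed values together with that lambda.
x_transformed, lam = stats.boxcox(x)
print("estimated lambda:", lam)
print("skewness before:", stats.skew(x), "after:", stats.skew(x_transformed))
```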

 


Author: Shadowless Caprice
Date: January 2016
Source: https://zhaokv.com/machine_learning/2016/01/normalization-and-standardization.html
Disclaimer: All rights reserved. Please cite the source when reprinting.

In machine learning and data mining we often hear two terms: normalization and standardization. What exactly are they? What benefits do they bring? How exactly are they used? This article addresses these questions one by one.

I. What Are They

1. Normalization

The most common method maps the original data into [0, 1] via a linear transformation; the transformation function is:

$$x' = \frac{x - \min}{\max - \min}$$

where $\min$ is the sample minimum and $\max$ is the sample maximum. Note that in streaming-data scenarios the maximum and minimum change over time. Moreover, the maximum and minimum are easily affected by outliers, so this method is not very robust and is only suitable for traditional small-scale, precise-data scenarios.
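A minimal numpy sketch of min-max normalization (the function name is mine, not from the source); the toy array includes an outlier to show the robustness issue just described.

```python
import numpy as np

def min_max_normalize(x):
    """Map x linearly into [0, 1]: x' = (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 acts as an outlier
print(min_max_normalize(x))
# roughly [0, 0.01, 0.02, 0.03, 1] -- the outlier squeezes the rest toward 0
```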

2. Standardization

The most common method is z-score standardization; after processing, the data have mean 0 and standard deviation 1. The method is:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation; both can be estimated from the available samples. When enough samples are available the estimates are fairly stable, which makes this method suitable for modern, noisy, big-data scenarios.
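A corresponding numpy sketch for z-score standardization (again, the function name is mine):

```python
import numpy as np

def z_score(x):
    """x' = (x - mu) / sigma, with mu and sigma estimated from the sample."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=1000)
z = z_score(x)
print(z.mean(), z.std())  # ~0 and 1 after the transform
```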

II. What Do They Bring

The rationale for normalization is very simple: different variables often have different units and scales, and normalization removes the effect of scale on the final result, making different variables comparable. For example, suppose two people differ by 10 kg in weight and 0.02 m in height. When measuring how different the two people are, the weight gap will completely drown out the height gap; after normalization this problem disappears, as the sketch below shows.
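To make the weight/height example concrete, here is a small sketch (the specific numbers are invented): before scaling, the Euclidean distance is dominated by the weight axis; after per-column min-max normalization both axes are on comparable scales.

```python
import numpy as np

# Two people as (weight in kg, height in m): 10 kg apart, 0.02 m apart.
a = np.array([70.0, 1.80])
b = np.array([80.0, 1.82])
print(np.linalg.norm(a - b))  # ~10.00002: height is completely drowned out

# Min-max normalize each column over a small sample of people.
people = np.array([[50.0, 1.50], [70.0, 1.80], [80.0, 1.82], [90.0, 1.95]])
scaled = (people - people.min(axis=0)) / (people.max(axis=0) - people.min(axis=0))
print(np.linalg.norm(scaled[1] - scaled[2]))  # both dimensions now contribute
```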

The rationale for standardization is more involved. A standardized value expresses how many standard deviations the original value lies from the mean; it is a relative quantity, so it also removes the effect of scale. In addition, it brings two extra benefits: the mean becomes 0 and the standard deviation becomes 1.

What is the benefit of a mean of 0? It centers the data around 0 (obviously), and data centered around 0 brings many conveniences. For example, performing SVD on centered data is equivalent to performing PCA on the original data, and many functions in machine learning, such as Sigmoid, Tanh, and Softmax, are centered around 0 (though not necessarily symmetric).
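The SVD-vs-PCA equivalence above is easy to check numerically; here is a small numpy sketch (all names are mine): the right singular vectors of the centered data match the eigenvectors of the covariance matrix up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features

# PCA: eigenvectors of the covariance matrix (eigh sorts eigenvalues ascending).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Compare directions (descending order, sign-insensitive): near-identity matrix.
print(np.abs(eigvecs[:, ::-1].T @ Vt.T))
# The singular values also recover the eigenvalues: S^2 / (N - 1).
print(S**2 / (len(X) - 1), eigvals[::-1])
```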

What is the benefit of a standard deviation of 1? This one is more complicated. The distance between two points $x_i$ and $x_{i'}$ is often expressed as

$$D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot d_j(x_{ij}, x_{i'j}); \qquad \sum_{j=1}^{p} w_j = 1$$

where $d_j(x_{ij}, x_{i'j})$ is the distance between the two points on attribute $j$, and $w_j$ is the weight of that attribute's distance within the total distance. Note that setting $w_j = 1, \forall j$ does not make every attribute contribute equally to the final result. For a given dataset, the average distance over all pairs of points is a fixed value, i.e.

$$\bar{D} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot \bar{d}_j$$

is a constant, where

$$\bar{d}_j = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} d_j(x_{ij}, x_{i'j})$$

We can see that the $j$-th variable's influence on the overall average distance is $w_j \cdot \bar{d}_j$, so setting $w_j \sim 1/\bar{d}_j$ makes every attribute contribute equally to the average distance over the whole dataset. Now let $d_j$ be the squared Euclidean distance (the squared 2-norm), one of the most common distance measures; then

$$\bar{d}_j = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} (x_{ij} - x_{i'j})^2 = 2 \cdot \mathrm{var}_j$$

where $\mathrm{var}_j$ is the sample estimate of $\mathrm{Var}(X_j)$. In other words, each variable's importance is proportional to its variance on the dataset. If we make every variable's standard deviation 1 (i.e., variance 1), every variable carries the same importance when computing distances.
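The identity $\bar{d}_j = 2 \cdot \mathrm{var}_j$ can be verified directly; here is a minimal numpy check (toy data, using the biased variance with ddof=0 to match the $1/N^2$ averaging):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two columns with very different variances (scales 1 and 5).
X = rng.normal(size=(500, 2)) * np.array([1.0, 5.0]) + 2.0

# Average squared difference over all N^2 ordered pairs, per dimension j.
diffs = X[:, None, :] - X[None, :, :]     # shape (N, N, p)
d_bar = (diffs ** 2).mean(axis=(0, 1))    # (1/N^2) * sum over i, i'

print(d_bar)               # per-dimension average pairwise squared distance
print(2 * X.var(axis=0))   # 2 * (biased) sample variance -- matches d_bar
```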

III. How to Use Them

Whenever computing distances between points is involved, applying normalization or standardization improves the final result, sometimes making a qualitative difference. So how should one choose between normalization and standardization? From the previous section: if all variables should be treated alike and play an equal role in the final distance computation, choose standardization; if you want to preserve the latent weighting relationships reflected by the standard deviations of the original data, choose normalization. In addition, standardization is better suited to modern, noisy, big-data scenarios.
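In practice these transforms are usually applied through library scalers. A minimal usage sketch with scikit-learn (assuming `sklearn` is available), where `MinMaxScaler` performs the normalization above and `StandardScaler` the standardization:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[70.0, 1.80], [80.0, 1.82], [50.0, 1.50], [90.0, 1.95]])

X_norm = MinMaxScaler().fit_transform(X)   # each column mapped into [0, 1]
X_std = StandardScaler().fit_transform(X)  # each column to mean 0, std 1

# In a real pipeline, fit the scaler on training data only and reuse the
# fitted scaler (scaler.transform) on test data to avoid information leakage.
```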
