Linear Regression -- Feature Scaling

Feature scaling means transforming the data (each feature) onto a common scale. Two common scaling methods:

Standardization
Normalization

Standardization

Standardization subtracts the column mean from each value and then divides by the column's standard deviation, so the transformed data has mean 0 and standard deviation 1. In Python, suppose df has a column named height. You can create a standardized height with the following statement:

df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

This creates a new "standardized" column. Each entry in the new column is the original value minus the column mean, divided by the column's standard deviation. Each standardized value can be interpreted as the number of standard deviations the original height lies above or below the mean height. This is the most common feature scaling technique.
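
As a quick sanity check, the new column should come out with mean approximately 0 and standard deviation 1. A minimal sketch with hypothetical toy data (not part of the exercise below):

import pandas as pd

# Hypothetical toy data to illustrate standardization
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

print(df["height_standard"].mean())  # ~0 (up to floating-point error)
print(df["height_standard"].std())   # 1.0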
Normalization

The second feature scaling method is the well-known normalization. Normalization compresses the data into the range between 0 and 1. Continuing the standardization example above, you can normalize the data with the following Python statement:

df["height_normal"] = (df["height"] - df["height"].min()) / (df["height"].max() - df['height'].min())

When should you scale features?

In many machine learning algorithms, the scale of the data has a large effect on the predicted results, especially in the following two specific cases:

Making predictions with distance-based features
Incorporating regularization

Distance-based features

In a later lesson, you will see a common supervised learning technique based on the distances between points: support vector machines (SVMs). Another distance-based method is the k-nearest neighbors algorithm (also known as k-NN). When using either of these techniques, leaving the data unscaled can lead to drastically different (and potentially misleading) predictions.

Therefore, when making predictions with these distance-based techniques, you must scale the features first.
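
To see why, consider how a single large-scale feature can dominate Euclidean distance. A minimal sketch with hypothetical two-feature points:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical points: feature 1 spans a small range, feature 2 a huge one
X = np.array([[1.0, 100000.0],
              [2.0, 100100.0],
              [1.1, 150000.0]])

# Raw distances from point 0 are driven almost entirely by feature 2
print(np.linalg.norm(X[0] - X[1]))  # ~100
print(np.linalg.norm(X[0] - X[2]))  # ~50000

# After standardization, both features contribute comparably,
# and the nearest neighbor of point 0 flips
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))  # ~2.22
print(np.linalg.norm(X_std[0] - X_std[2]))  # ~2.14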
Regularization

When you start using regularization in your models, you will again need to scale your features. In regularized linear regression, the scale of a feature strongly affects how much the regularization penalizes that feature's coefficient. If one feature ranges from 0 to 10 while another ranges from 0 to 1,000,000, applying regularization without first scaling the data will unfairly penalize the small-scale feature. Compared with a large-scale feature, a small-scale feature needs a larger coefficient to have the same effect on the outcome (think about how ab = ba for two numbers a and b: the feature value and the coefficient play symmetric roles in their product, so a smaller value forces a larger coefficient to reach the same contribution). So, when two features reduce the net error by the same amount, regularization drops the small-scale feature with the large coefficient, since that shrinks the regularization term the most.

This again shows that you should scale your features before applying regularization.
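
A minimal synthetic sketch of this effect (hypothetical data, not the quiz data below): two features contribute equally to y, but on wildly different scales. Without scaling, the default Lasso zeroes out the small-scale feature; after standardization, both coefficients survive:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200
small = rng.uniform(0, 0.1, size=n)        # small-scale feature (0 to 0.1)
large = rng.uniform(0, 1_000_000, size=n)  # large-scale feature (0 to 1,000,000)
X = np.column_stack([small, large])

# Both features matter equally for y, but the small-scale one needs a
# coefficient of 200 to say so, while the large-scale one needs only 0.00002.
y = 200.0 * small + 0.00002 * large + rng.normal(0, 0.5, size=n)

# Without scaling: the L1 penalty on the large coefficient the small-scale
# feature would need outweighs its error reduction, so Lasso zeroes it out,
# while the large-scale feature keeps its nearly-free tiny coefficient.
print(Lasso().fit(X, y).coef_)  # approximately [0.0, 2e-05]

# With standardization first, both features are treated fairly and both
# coefficients survive at comparable magnitudes.
X_scaled = StandardScaler().fit_transform(X)
print(Lasso().fit(X_scaled, y).coef_)  # two comparable nonzero values (~4.8)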

A useful article on the importance of feature scaling when using regularization.

That article points out that feature scaling can also speed up the convergence of machine learning algorithms, an important consideration when you scale up machine learning applications.

Practice feature scaling with the quiz below.

Perform the following steps:

  1. Load the data

    The data is saved in the file 'data.csv'. Note that the data file has no header row (the pasted file below and the header=None argument in the solution code reflect this).
    Split the data so that the six predictor features (the first six columns) are stored in X and the single outcome variable (the last column) is stored in y.

  2. Scale the features via standardization

    Create an instance of sklearn's StandardScaler and assign it to the variable scaler.
    Use the .fit_transform() method to fit the scaling parameters on the predictor feature array; it also returns the predictor variables in standardized form. Store these standardized values in X_scaled.

  3. Fit the data with Lasso-regularized linear regression

    Create an instance of sklearn's Lasso class and assign it to the variable lasso_reg. You don't need to set any parameter values: use the defaults for this exercise.
    Use the Lasso object's .fit() method to fit the regression model. Make sure you fit it on the standardized data generated in the previous step (X_scaled), not on the original data.

  4. Inspect the regression model's coefficients

    Use the Lasso object's .coef_ attribute to retrieve the coefficients of the fitted regression model.
    Store the regression coefficients in the variable reg_coef.

Click the Test button to run the exercise; the coefficients will be printed out. Answer the question that follows based on what you observe.

# TODO: Add import statements
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Assign the data to predictor and outcome variables
# TODO: Load the data
train_data = pd.read_csv('data.csv', header=None)
X = train_data.iloc[:,:-1]
y = train_data.iloc[:,-1]

# TODO: Create the standardization scaling object.
scaler = StandardScaler()

# TODO: Fit the standardization parameters and scale the data.
X_scaled = scaler.fit_transform(X)

# TODO: Create the linear regression model with lasso regularization.
lasso_reg = Lasso()

# TODO: Fit the model.
lasso_reg.fit(X_scaled, y)

# TODO: Retrieve and print out the coefficients from the regression model.
reg_coef = lasso_reg.coef_
print(reg_coef)

data.csv:

1.25664,2.04978,-6.23640,4.71926,-4.26931,0.20590,12.31798
-3.89012,-0.37511,6.14979,4.94585,-3.57844,0.00640,23.67628
5.09784,0.98120,-0.29939,5.85805,0.28297,-0.20626,-1.53459
0.39034,-3.06861,-5.63488,6.43941,0.39256,-0.07084,-24.68670
5.84727,-0.15922,11.41246,7.52165,1.69886,0.29022,17.54122
-2.86202,-0.84337,-1.08165,0.67115,-2.48911,0.52328,9.39789
-7.09328,-0.07233,6.76632,13.06072,0.12876,-0.01048,11.73565
-7.17614,0.62875,-2.89924,-5.21458,-2.70344,-0.22035,4.42482
8.67430,2.09933,-11.23591,-5.99532,-2.79770,-0.08710,-5.94615
-6.03324,-4.16724,2.42063,-3.61827,1.96815,0.17723,-13.11848
8.67485,1.48271,-1.31205,-1.81154,2.67940,0.04803,-9.25647
4.36248,-2.69788,-4.60562,-0.12849,3.40617,-0.07841,-29.94048
9.97205,-0.61515,2.63039,2.81044,5.68249,-0.04495,-20.46775
-1.44556,0.18337,4.61021,-2.54824,0.86388,0.17696,7.12822
-3.90381,0.53243,2.83416,-5.42397,-0.06367,-0.22810,6.05628
-12.39824,-1.54269,-2.66748,10.82084,5.92054,0.13415,-32.91328
5.75911,-0.82222,10.24701,0.33635,0.26025,-0.02588,17.75036
-7.12657,3.28707,-0.22508,13.42902,2.16708,-0.09153,-2.80277
7.22736,1.27122,0.99188,-8.87118,-6.86533,0.09410,33.98791
-10.31393,2.23819,-7.87166,-3.44388,-1.43267,-0.07893,-3.18407
-8.25971,-0.15799,-1.81740,1.12972,4.24165,-0.01607,-20.57366
13.37454,-0.91051,4.61334,0.93989,4.81350,-0.07428,-12.66661
1.49973,-0.50929,-2.66670,-1.28560,-0.18299,-0.00552,-6.56370
-10.46766,0.73077,3.93791,-1.73489,-3.26768,0.02366,23.19621
-1.15898,3.14709,-4.73329,13.61355,-3.87487,-0.14112,13.89143
4.42275,-2.09867,3.06395,-0.45331,-2.07717,0.22815,10.29282
-3.34113,-0.31138,4.49844,-2.32619,-2.95757,-0.00793,21.21512
-1.85433,-1.32509,8.06274,12.75080,-0.89005,-0.04312,14.54248
0.85474,-0.50002,-3.52152,-4.30405,4.13943,-0.02834,-24.77918
0.33271,-5.28025,-4.95832,22.48546,4.95051,0.17153,-45.01710
-0.07308,0.51247,-1.38120,7.86552,3.31641,0.06808,-12.63583
2.99294,2.85192,5.51751,8.53749,4.30806,-0.17462,0.84415
1.41135,-1.01899,2.27500,5.27479,-4.90004,0.19508,23.54972
3.84816,-0.66249,-1.35364,16.51379,0.32115,0.41051,-2.28650
3.30223,0.23152,-2.16852,0.75257,-0.05749,-0.03427,-4.22022
-6.12524,-2.56204,0.79878,-3.36284,1.00396,0.06219,-9.10749
-7.47524,1.31401,-3.30847,4.83057,1.00104,-0.19851,-7.69059
5.84884,-0.53504,-0.19543,10.27451,6.98704,0.22706,-29.21246
6.44377,0.47687,-0.08731,22.88008,-2.86604,0.03142,10.90274
6.35366,-2.04444,1.98872,-1.45189,-1.24062,0.23626,4.62178
6.85563,-0.94543,5.16637,2.85611,4.64812,0.29535,-7.83647
1.61758,1.31067,-2.16795,8.07492,-0.17166,-0.10273,0.06922
3.80137,1.02276,-3.15429,6.09774,3.18885,-0.00163,-16.11486
-6.81855,-0.15776,-10.69117,8.07818,4.14656,0.10691,-38.47710
-6.43852,4.30120,2.63923,-1.98297,-0.89599,-0.08174,20.77790
-2.35292,1.26425,-6.80877,3.31220,-6.17515,-0.04764,14.92507
9.13580,-1.21425,1.17227,-6.33648,-0.85276,-0.13366,-0.17285
-3.02986,-0.48694,0.24329,-0.38830,-4.70410,-0.18065,15.95300
3.27244,2.22393,-1.96640,17.53694,1.62378,0.11539,-4.29743
-4.44346,-1.96429,0.22209,15.29785,-1.98503,0.40131,4.07647
-2.61294,-0.24905,-4.02974,-23.82024,-5.94171,-0.04932,16.50504
3.65962,1.69832,0.78025,9.88639,-1.61555,-0.18570,9.99506
2.22893,-4.62231,-3.33440,0.07179,0.21983,0.14348,-19.94698
-5.43092,1.39655,-2.79175,0.16622,-2.38112,-0.09009,6.49039
-5.88117,-3.04210,-0.87931,3.96197,-1.01125,0.08132,-6.01714
0.51401,-0.30742,6.01407,-6.85848,-3.61343,-0.15710,24.56965
4.45547,2.34283,0.98094,-4.66298,-3.79507,0.37084,27.19791
0.05320,0.27458,6.95838,7.50119,-5.50256,0.06913,36.21698
4.72057,0.17165,4.83822,-1.03917,4.11211,-0.14773,-6.32623
-11.60674,-1.15594,-10.23150,0.49843,0.32477,-0.14543,-28.54003
-7.55406,0.45765,10.67537,-15.12397,3.49680,0.20350,11.97581
-1.73618,-1.56867,3.98355,-5.16723,-1.20911,0.19377,9.55247
2.01963,-1.12612,1.16531,-2.71553,-5.39782,0.01086,21.83478
-1.68542,-1.08901,-3.55426,3.14201,0.82668,0.04372,-13.11204
-3.09104,-0.23295,-5.62436,-3.03831,0.77772,0.02000,-14.74251
-3.87717,0.74098,-2.88109,-2.88103,3.36945,-0.30445,-18.44363
-0.42754,-0.42819,5.02998,-3.45859,-4.21739,0.25281,29.20439
8.31292,2.30543,-1.52645,-8.39725,-2.65715,-0.30785,12.65607
8.96352,2.15330,7.97777,-2.99501,2.19453,0.11162,13.62118
-0.90896,-0.03845,11.60698,5.39133,1.58423,-0.23637,13.73746
2.03663,-0.49245,4.30331,17.83947,-0.96290,0.10803,10.85762
-1.72766,1.38544,1.88234,-0.58255,-1.55674,0.08176,16.49896
-2.40833,-0.00177,2.32146,-1.06438,2.92114,-0.05635,-8.16292
-1.22998,-1.81632,-2.81740,12.29083,-1.40781,-0.15404,-6.76994
-3.85332,-1.24892,-6.24187,0.95304,-3.66314,0.02746,-0.87206
-7.18419,-0.91048,-2.41759,2.46251,-5.11125,-0.05417,11.48350
5.69279,-0.66299,-3.40195,1.77690,3.70297,-0.02102,-23.71307
5.82082,1.75872,1.50493,-1.14792,-0.66104,0.14593,11.82506
0.98854,-0.91971,11.94650,1.36820,2.53711,0.30359,13.23011
1.55873,0.25462,2.37448,16.04402,-0.06938,-0.36479,-0.67043
-0.66650,-2.27045,6.40325,7.64815,1.58676,-0.11790,-3.12393
4.58728,-2.90732,-0.05803,2.27259,2.29507,0.13907,-16.76419
-11.73607,-2.26595,1.63461,6.21257,0.73723,0.03777,-7.00464
-2.03125,1.83364,1.57590,5.52329,-3.64759,0.06059,23.96407
4.63339,1.37232,-0.62675,13.46151,3.69937,-0.09897,-13.66325
-0.93955,-1.39664,-4.69027,-5.30208,-2.70883,0.07360,-0.26176
3.19531,-1.43186,3.82859,-9.83963,-2.83611,0.09403,14.30309
-0.66991,-0.33925,-0.26224,-6.71810,0.52439,0.00654,-2.45750
3.32705,-0.20431,-0.61940,-5.82014,-3.30832,-0.13399,9.94820
-3.01400,-1.40133,7.13418,-15.85676,3.92442,0.29137,-0.19544
10.75129,-0.08744,4.35843,-9.89202,-0.71794,0.12349,12.68742
4.74271,-1.32895,-2.73218,9.15129,0.93902,-0.17934,-15.58698
3.96678,-1.93074,-1.98368,-12.52082,7.35129,-0.30941,-40.20406
2.98664,1.85034,2.54075,-2.98750,0.37193,0.16048,9.08819
-6.73878,-1.08637,-1.55835,-3.93097,-3.02271,0.11860,6.24185
-4.58240,-1.27825,7.55098,8.83930,-3.80318,0.04386,26.14768
-10.00364,2.66002,-4.26776,-3.73792,-0.72349,-0.24617,0.76214
-4.32624,-2.30314,-8.16044,4.46366,-3.33569,-0.01655,-10.05262
-1.90167,-0.15858,-10.43466,4.89762,-0.64606,-0.14519,-19.63970
2.43213,2.41613,2.49949,-8.03891,-1.64164,-0.63444,12.76193

Reposted from blog.csdn.net/JackLi31742/article/details/105463499