Sklearn - Linear Regression


Notes on sklearn's linear regression documentation:


from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the Boston housing dataset
boston = datasets.load_boston()
boston_features = boston.data
boston_target = boston.target
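Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. To follow along on a recent version, the replacement suggested in the deprecation notice rebuilds the same arrays from the original data source (a sketch; it downloads the data at runtime and assumes pandas is available):

import pandas as pd
import numpy as np

# Fetch the raw Boston data and reassemble the feature matrix and target
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
boston_features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
boston_target = raw_df.values[1::2, 2]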


Fitting a straight line

boston_features
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])
# Select only the first two features
features = boston_features[:, 0:2]
features
array([[6.3200e-03, 1.8000e+01],
       [2.7310e-02, 0.0000e+00],
       [2.7290e-02, 0.0000e+00],
       ...,
       [6.0760e-02, 0.0000e+00],
       [1.0959e-01, 0.0000e+00],
       [4.7410e-02, 0.0000e+00]])
# Create a linear regression object and fit the model
rgs = LinearRegression()
model = rgs.fit(features, boston_target)
# View the intercept
model.intercept_
#   22.485628113468223


# View the feature weights (coefficients)
model.coef_
# array([-0.35207832,  0.11610909])

# First value of the target vector, multiplied by 1000 (prices are in $1000s)
boston_target[0] * 1000 # 24000.0

# Prediction for the first observation, multiplied by 1000
model.predict(features)[0] * 1000
#    24573.366631705547
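As a quick sanity check, a prediction is just the intercept plus the dot product of the coefficients with the feature values. A minimal sketch reusing the fitted model above:

import numpy as np

# Manually reconstruct the first prediction: y_hat = intercept + coef · x
manual = model.intercept_ + np.dot(model.coef_, features[0])
manual * 1000  # should match model.predict(features)[0] * 1000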

Handling interaction effects between features

from sklearn.preprocessing import PolynomialFeatures
# Create interaction terms (interaction_only=True keeps only products of distinct
# features; with just two features this yields x1, x2, and x1*x2)
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)

features_interaction = interaction.fit_transform(features)

rgs = LinearRegression()
model = rgs.fit(features_interaction, boston_target)

# Features of the first observation
features[0]
#    array([6.32e-03, 1.80e+01])

import numpy as np

# Multiply the first and second feature of each observation
interaction_term = np.multiply(features[:, 0], features[:, 1])
# View the interaction term for the first observation
interaction_term[0] #    0.11376

# The first observation now holds both original features plus their product
features_interaction[0]
#    array([6.3200e-03, 1.8000e+01, 1.1376e-01])
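To see which columns the transformer generated, scikit-learn 1.0+ exposes get_feature_names_out. A sketch; the column names passed in are illustrative, matching the first two Boston columns CRIM and ZN:

# List the generated columns (requires scikit-learn >= 1.0)
interaction.get_feature_names_out(["CRIM", "ZN"])
# e.g. array(['CRIM', 'ZN', 'CRIM ZN'], dtype=object)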

Fitting a nonlinear relationship

# Select a single feature
features = boston_features[:, 0:1]

# Create polynomial features x^2 and x^3
polynomial = PolynomialFeatures(degree=3, include_bias=False)

features_polynomial = polynomial.fit_transform(features)

# Create a linear regression object and fit the model
rgs = LinearRegression()
model = rgs.fit(features_polynomial, boston_target)

# First observation's feature value
features[0]
# array([0.00632])

# Raised to the second power
features[0] ** 2
# array([3.99424e-05])

# Raised to the third power
features[0] ** 3
# array([2.52435968e-07])

# View all three features of the first observation: x, x^2, x^3
features_polynomial[0]
# array([6.32000000e-03, 3.99424000e-05, 2.52435968e-07])
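Although the fitted curve is nonlinear in x, the model is still linear in its parameters, so the prediction can be reconstructed by hand. A minimal sketch reusing the fitted model and the numpy import above:

# y_hat = intercept + b1*x + b2*x^2 + b3*x^3
x = features[0, 0]
model.intercept_ + np.dot(model.coef_, [x, x**2, x**3])
# should equal model.predict(features_polynomial)[0]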

Reducing variance with regularization

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Standardize the features (regularization penalizes coefficient size, so scale matters)
scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)

# Create a ridge regression object with an alpha value
rgs = Ridge(alpha=0.5)

model = rgs.fit(features_std, boston_target)

from sklearn.linear_model import RidgeCV

# RidgeCV picks the best alpha from the candidates via cross-validation
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

# Fit the linear regression
model_cv = regr_cv.fit(features_std, boston_target)
# View the model coefficients
model_cv.coef_
# array([-0.91987132,  1.06646104,  0.11738487,  0.68512693, -2.02901013,
#         2.68275376,  0.01315848, -3.07733968,  2.59153764, -2.0105579 ,
#        -2.05238455,  0.84884839, -3.73066646])

# View the selected alpha
model_cv.alpha_
# 1.0
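Ridge shrinks coefficients toward zero as alpha grows, without making them exactly zero. A small sketch over the standardized features above makes this visible (the alpha grid is illustrative):

# Total coefficient magnitude shrinks as the penalty grows
for a in [0.1, 1.0, 10.0, 100.0]:
    coefs = Ridge(alpha=a).fit(features_std, boston_target).coef_
    print(a, np.abs(coefs).sum())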

Reducing features with lasso regression

We want to simplify the linear regression model by reducing the number of features it uses.

from sklearn.linear_model import Lasso

scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)

# Create a lasso regression object with an alpha value
rgs = Lasso(alpha=0.5)

model = rgs.fit(features_std, boston_target)
# View the coefficients
# Many of them are exactly 0, meaning the corresponding features are not used in the model
model.coef_
# array([-0.11526463,  0.        , -0.        ,  0.39707879, -0.        ,
#         2.97425861, -0.        , -0.17056942, -0.        , -0.        ,
#        -1.59844856,  0.54313871, -3.66614361])

# Setting alpha to a much larger value drives every coefficient to zero,
# so the model uses no features at all
rgs_10 = Lasso(alpha=10)

model_10 = rgs_10.fit(features_std, boston_target)
model_10.coef_
# array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])
  • Using this property, you can put, say, 100 features into the feature matrix and then tune the lasso hyperparameter until the model uses only, for example, the 10 most important ones (see the sketch after this list);
  • Doing so reduces model variance while improving interpretability (fewer features are easier to explain).
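A minimal sketch of that tuning loop, counting the surviving features for each candidate alpha (the grid itself is an illustrative assumption):

# Count the non-zero coefficients for each candidate alpha
for a in [0.01, 0.1, 0.5, 1.0, 5.0]:
    coefs = Lasso(alpha=a).fit(features_std, boston_target).coef_
    print(f"alpha={a}: {np.sum(coefs != 0)} features used")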

2023-04-02 (Sunday), light rain, fresh air

Reposted from blog.csdn.net/lovechris00/article/details/129912163