Sklearn - Linear Regression


Notes on sklearn's linear regression documentation:


from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the Boston housing dataset
boston = datasets.load_boston()
boston_features = boston.data
boston_target = boston.target
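Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. To follow along on a recent version, the replacement suggested in the deprecation notice rebuilds the same arrays from the original data source (a sketch; it downloads the data at runtime and assumes pandas is available):

import pandas as pd
import numpy as np

# Fetch the raw Boston data and reassemble the feature matrix and target
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
boston_features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
boston_target = raw_df.values[1::2, 2]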


Fitting a straight line

boston_features
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])
# Select only the first two features
features = boston_features[:, 0:2]
features
array([[6.3200e-03, 1.8000e+01],
       [2.7310e-02, 0.0000e+00],
       [2.7290e-02, 0.0000e+00],
       ...,
       [6.0760e-02, 0.0000e+00],
       [1.0959e-01, 0.0000e+00],
       [4.7410e-02, 0.0000e+00]])
# Create a linear regression object and fit the model
rgs = LinearRegression()
model = rgs.fit(features, boston_target)
# View the intercept
model.intercept_
#   22.485628113468223


# View the feature weights (coefficients)
model.coef_
# array([-0.35207832,  0.11610909])

# First value of the target vector, multiplied by 1000 (prices are in $1000s)
boston_target[0] * 1000 # 24000.0

# Prediction for the first observation, multiplied by 1000
model.predict(features)[0] * 1000
#    24573.366631705547
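As a quick sanity check, a prediction is just the intercept plus the dot product of the coefficients with the feature values. A minimal sketch reusing the fitted model above:

import numpy as np

# Manually reconstruct the first prediction: y_hat = intercept + coef · x
manual = model.intercept_ + np.dot(model.coef_, features[0])
manual * 1000  # should match model.predict(features)[0] * 1000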

Handling interaction effects between features

from sklearn.preprocessing import PolynomialFeatures
# Create interaction terms (interaction_only=True keeps only products of distinct
# features; with just two features this yields x1, x2, and x1*x2)
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)

features_interaction = interaction.fit_transform(features)

rgs = LinearRegression()
model = rgs.fit(features_interaction, boston_target)

# Features of the first observation
features[0]
#    array([6.32e-03, 1.80e+01])

import numpy as np

# Multiply the first and second feature of each observation
interaction_term = np.multiply(features[:, 0], features[:, 1])
# View the interaction term for the first observation
interaction_term[0] #    0.11376

# The first observation now holds both original features plus their product
features_interaction[0]
#    array([6.3200e-03, 1.8000e+01, 1.1376e-01])
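To see which columns the transformer generated, scikit-learn 1.0+ exposes get_feature_names_out. A sketch; the column names passed in are illustrative, matching the first two Boston columns CRIM and ZN:

# List the generated columns (requires scikit-learn >= 1.0)
interaction.get_feature_names_out(["CRIM", "ZN"])
# e.g. array(['CRIM', 'ZN', 'CRIM ZN'], dtype=object)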

Fitting a nonlinear relationship

# Select a single feature
features = boston_features[:, 0:1]

# Create polynomial features x^2 and x^3
polynomial = PolynomialFeatures(degree=3, include_bias=False)

features_polynomial = polynomial.fit_transform(features)

# Create a linear regression object and fit the model
rgs = LinearRegression()
model = rgs.fit(features_polynomial, boston_target)

# First observation's feature value
features[0]
# array([0.00632])

# Raised to the second power
features[0] ** 2
# array([3.99424e-05])

# Raised to the third power
features[0] ** 3
# array([2.52435968e-07])

# View all three features of the first observation: x, x^2, x^3
features_polynomial[0]
# array([6.32000000e-03, 3.99424000e-05, 2.52435968e-07])
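Although the fitted curve is nonlinear in x, the model is still linear in its parameters, so the prediction can be reconstructed by hand. A minimal sketch reusing the fitted model and the numpy import above:

# y_hat = intercept + b1*x + b2*x^2 + b3*x^3
x = features[0, 0]
model.intercept_ + np.dot(model.coef_, [x, x**2, x**3])
# should equal model.predict(features_polynomial)[0]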

Reducing variance with regularization

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Standardize the features (regularization penalizes coefficient size, so scale matters)
scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)

# Create a ridge regression object with an alpha value
rgs = Ridge(alpha=0.5)

model = rgs.fit(features_std, boston_target)

from sklearn.linear_model import RidgeCV

# RidgeCV picks the best alpha from the candidates via cross-validation
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])

# Fit the linear regression
model_cv = regr_cv.fit(features_std, boston_target)
# View the model coefficients
model_cv.coef_
# array([-0.91987132,  1.06646104,  0.11738487,  0.68512693, -2.02901013,
#         2.68275376,  0.01315848, -3.07733968,  2.59153764, -2.0105579 ,
#        -2.05238455,  0.84884839, -3.73066646])

# View the selected alpha
model_cv.alpha_
# 1.0
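Ridge shrinks coefficients toward zero as alpha grows, without making them exactly zero. A small sketch over the standardized features above makes this visible (the alpha grid is illustrative):

# Total coefficient magnitude shrinks as the penalty grows
for a in [0.1, 1.0, 10.0, 100.0]:
    coefs = Ridge(alpha=a).fit(features_std, boston_target).coef_
    print(a, np.abs(coefs).sum())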

Reducing features with lasso regression

We want to simplify the linear regression model by reducing the number of features it uses.

from sklearn.linear_model import Lasso

scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)

# Create a lasso regression object with an alpha value
rgs = Lasso(alpha=0.5)

model = rgs.fit(features_std, boston_target)
# View the coefficients
# Many of them are exactly 0, meaning the corresponding features are not used in the model
model.coef_
# array([-0.11526463,  0.        , -0.        ,  0.39707879, -0.        ,
#         2.97425861, -0.        , -0.17056942, -0.        , -0.        ,
#        -1.59844856,  0.54313871, -3.66614361])

# Setting alpha to a much larger value drives every coefficient to zero,
# so the model uses no features at all
rgs_10 = Lasso(alpha=10)

model_10 = rgs_10.fit(features_std, boston_target)
model_10.coef_
# array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])
  • Using this property, you can put, say, 100 features into the feature matrix and then tune the lasso hyperparameter until the model uses only, for example, the 10 most important ones (see the sketch after this list);
  • Doing so reduces model variance while improving interpretability (fewer features are easier to explain).
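A minimal sketch of that tuning loop, counting the surviving features for each candidate alpha (the grid itself is an illustrative assumption):

# Count the non-zero coefficients for each candidate alpha
for a in [0.01, 0.1, 0.5, 1.0, 5.0]:
    coefs = Lasso(alpha=a).fit(features_std, boston_target).coef_
    print(f"alpha={a}: {np.sum(coefs != 0)} features used")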

2023-04-02 (Sunday), light rain, fresh air

Reposted from blog.csdn.net/lovechris00/article/details/129912163