Using LinearRegression (normal equation) and SGDRegressor (stochastic gradient descent) for Boston house price prediction, with sklearn.metrics.mean_squared_error(y_true, y_pred)

1. Linear Regression API

  • sklearn.linear_model.LinearRegression(fit_intercept=True): optimized via the normal equation
    • fit_intercept: whether to fit an intercept (bias) term
    • LinearRegression.coef_: regression coefficients
    • LinearRegression.intercept_: intercept (bias)
  • sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01): SGDRegressor implements stochastic gradient descent learning; it supports different loss functions and regularization penalties to fit linear regression models
    • loss: loss type; loss="squared_loss" is ordinary least squares (renamed to "squared_error" in scikit-learn 1.0)
    • fit_intercept: whether to fit an intercept (bias) term
    • learning_rate: string, optional, the learning-rate schedule
      • 'constant': eta = eta0
      • 'optimal': eta = 1.0 / (alpha * (t + t0))
      • 'invscaling': eta = eta0 / pow(t, power_t) [default for SGDRegressor]
        • power_t=0.25: defined in the parent class
      • for a constant learning rate, use learning_rate='constant' and specify the rate with eta0
    • SGDRegressor.coef_: regression coefficients
    • SGDRegressor.intercept_: intercept (bias)
  • sklearn provides these two APIs; choose whichever suits your needs
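As a minimal sketch of the two estimators above, the following fits both on a small synthetic dataset (the data and coefficient values are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Toy data: y = 3x + 2 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X.ravel() + 2 + rng.normal(scale=0.01, size=100)

# Normal-equation solution
lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)

# Stochastic gradient descent with a constant learning rate (eta = eta0)
sgd = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=5000, random_state=0)
sgd.fit(X, y)

print(lr.coef_, lr.intercept_)    # close to [3.] and 2
print(sgd.coef_, sgd.intercept_)  # also close, but depends on the schedule
```

The normal equation recovers the coefficients almost exactly; SGD approaches them iteratively, so its result varies with `eta0`, the schedule, and `max_iter`.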

2. Boston House Price Prediction

2.1 Dataset description

Number of instances: 506
Number of attributes: 13 numeric or categorical predictive attributes
The median value (the 14th attribute) is usually the learning target
Attribute information (in order):

  1. CRIM: per-capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise)
  5. NOX: nitric oxide concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built before 1940
  8. DIS: weighted distances to five Boston employment centers
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  13. LSTAT: percentage of lower-status population (low-income share)
  14. MEDV: median value of owner-occupied homes in $1000s

Of these, CHAS is a discrete value (whether the tract bounds the Charles River); all other attributes are continuous.

Missing attribute values: none
Creators: Harrison, D. and Rubinfeld, D.L.

2.2 Evaluation and analysis

Analysis: the features in this regression problem are on very different scales, which can strongly affect the result, so the data should be standardized.

  • data splitting and standardization
  • regression prediction
  • evaluation of the linear regression model
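The standardization step above can be sketched on toy numbers (values made up for illustration). The key point, which the full examples below rely on, is that the scaler is fitted on the training set only and the test set is transformed with the same mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[2.5], [10.0]])

transfer = StandardScaler()
x_train_std = transfer.fit_transform(x_train)  # learns mean/std from the training set
x_test_std = transfer.transform(x_test)        # reuses the training mean/std

print(transfer.mean_)  # [2.5]
print(x_test_std)      # 2.5 maps to 0 because it equals the training mean
```

Calling fit_transform on the test set instead would leak test-set statistics into the preprocessing and make train and test features inconsistent.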

Regression performance evaluation

Mean Squared Error (MSE) evaluation metric

MSE=\frac{1}{m} \sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}

where \hat{y}^{(i)} is the predicted value and y^{(i)} is the true value of the i-th sample

  • sklearn.metrics.mean_squared_error(y_true, y_pred): mean squared error regression loss
    • y_true: true values
    • y_pred: predicted values
    • return: a floating-point number
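The formula and the sklearn helper can be checked against each other on a small hand-made example (toy values chosen for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Manual computation straight from the formula
mse_manual = np.mean((y_pred - y_true) ** 2)
# sklearn's implementation
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # 0.375 0.375
```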

2.3 Normal-equation optimization

Code using the normal-equation solver:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression  # import the model
from sklearn.metrics import mean_squared_error


def linear_model1():   # linear regression: normal equation
    # 1. load the data
    data = load_boston()
    # print(data)   # inspect the data
    # 2. split the dataset
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=22)
    # 3. feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set statistics; do not refit on the test set
    # 4. machine learning: linear regression (normal equation)
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. model evaluation
    # 5.1 predictions, coefficients, and intercept
    y_predict = estimator.predict(x_test)
    print("Predictions:\n", y_predict)
    print("Model coefficients:\n", estimator.coef_)
    print("Model intercept:\n", estimator.intercept_)
    score = estimator.score(x_test, y_test)
    print('R^2 score:', score)
    # 5.2 evaluation
    # mean squared error
    error = mean_squared_error(y_test, y_predict)
    print("MSE:\n", error)


if __name__ == '__main__':
    linear_model1()
------------------------------------------------------
Output:
Predictions:
 [27.79728567 30.90056436 20.70927059 31.59515005 18.71926707 18.46483447
 20.7090385  18.01249201 18.18443754 32.26228416 20.45969144 27.30025768
 15.04218041 19.25382799 36.18076812 18.45209512  7.73077544 17.33936848
 29.40094704 23.32172471 18.43837789 33.31097321 28.38611788 17.43787678
 34.25179785 26.06150404 34.65387545 26.07481562 19.13116067 12.66351087
 30.00302966 14.70773445 36.82392563  9.08197058 15.06703028 16.68218611
  7.99793409 19.41266159 39.15193917 27.42584071 24.24171273 16.93863931
 38.03318373  6.63678428 21.51394405 24.41042009 18.86273557 19.87843319
 15.71796503 26.48901546  8.09589057 26.90160249 29.19481155 16.86472843
  8.47361081 34.87951213 32.41546229 20.50741461 16.27779646 20.32570308
 22.82622646 23.45866662 19.01451735 37.50382701 23.61872796 19.43409925
 12.98316226  6.99153964 40.99988893 20.87265869 16.74869905 20.79222071
 39.90859398 20.20645238 36.15225857 26.80056368 19.20376894 19.60725424
 24.04458577 20.45114082 30.47485108 19.09694834 22.55307626 30.77038574
 26.2119968  20.48073193 28.53910224 20.16485961 25.94461242 19.13440772
 24.98211795 22.84782867 19.18212763 18.88071352 14.49151931 17.78587168
 24.00230395 16.01304321 20.51185516 26.1867442  20.64288449 17.35297955]
Model coefficients:
 [-0.73088157  1.13214851 -0.14177415  0.86273811 -2.02555721  2.72118285
 -0.1604136  -3.36678479  2.5618082  -1.68047903 -1.67613468  0.91214657
 -3.79458347]
Model intercept:
 22.57970297029704
R^2 score: 0.7636568232153278
MSE:
 20.95597975396346

2.4 Stochastic gradient descent optimization

Code using stochastic gradient descent:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor  # import the model
from sklearn.metrics import mean_squared_error


def linear_model1():   # linear regression: gradient descent
    # 1. load the data
    data = load_boston()
    # print(data)   # inspect the data
    # 2. split the dataset
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=22)
    # 3. feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the training-set statistics; do not refit on the test set
    # 4. machine learning: linear regression (gradient descent)
    estimator = SGDRegressor(max_iter=1000)  # optional parameters: learning_rate="constant", eta0=0.1
    estimator.fit(x_train, y_train)
    # 5. model evaluation
    # 5.1 predictions, coefficients, and intercept
    y_predict = estimator.predict(x_test)
    print("Predictions:\n", y_predict)
    print("Model coefficients:\n", estimator.coef_)
    print("Model intercept:\n", estimator.intercept_)
    score = estimator.score(x_test, y_test)
    print('R^2 score:', score)
    # 5.2 evaluation
    # mean squared error
    error = mean_squared_error(y_test, y_predict)
    print("MSE:\n", error)


if __name__ == '__main__':
    linear_model1()
--------------------------------------------------------
Output:
Predictions:
 [27.83246143 30.88161189 20.95432    31.45884213 18.90925761 18.45917348
 20.90390675 17.86489232 18.22187674 32.16423006 20.74622179 27.02917088
 15.15330812 19.44252401 36.21385874 18.29591891  8.05902515 17.45662698
 29.43176546 23.26800519 18.40708287 33.11232958 28.04759069 17.28697425
 34.04417864 25.93173669 34.01544881 25.87379276 19.0817398  13.35425655
 29.8038706  13.62694089 36.71012991  9.64018296 15.20666928 16.37591592
  8.13807324 19.21248282 38.83533902 27.56476042 24.20613861 16.99058386
 38.31995441  6.48680166 21.27848416 24.13211464 19.23114948 20.08241424
 15.48439508 26.52060401  9.16924982 26.48884249 29.10616856 16.88112873
  8.6002894  34.64518763 31.44042482 21.37712748 16.27340122 20.05326982
 22.90189135 23.27151853 19.19993177 37.02482951 24.47042046 19.32200853
 13.12788319  6.90005885 41.20296986 20.77692349 16.52125048 20.78254188
 39.83026312 20.41243682 35.97145063 26.71062411 19.7893679  19.87802844
 23.96525037 21.71779398 30.67652286 18.90746059 22.46569409 30.35593801
 26.56083175 20.42805978 28.43908174 20.15769996 26.25308188 18.0554561
 24.44439573 22.62991121 19.18708049 19.63198957 14.62405071 17.6798955
 23.7063984  16.06260271 20.29155128 26.22744611 20.54511949 17.41100032]
Model coefficients:
 [-0.59947749  0.93207663 -0.41252482  0.89679016 -1.81068943  2.80178787
 -0.205792   -3.214616    1.80637319 -0.88243966 -1.63129733  0.90483779
 -3.76952324]
Model intercept:
 [22.5715974]
R^2 score: 0.7592182017426476
MSE:
 21.34954162015491

2.5 Warning note

Running the code raises the following warning: the dataset has been deprecated over an ethical problem. For this tutorial it can be ignored.

FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

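Since load_boston is removed in scikit-learn 1.2, the same pipeline can be exercised on a different built-in regression dataset. scikit-learn suggests fetch_california_housing or the Ames housing data as replacements; the sketch below uses the bundled load_diabetes dataset instead, chosen here only because it ships with scikit-learn and needs no download:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same pipeline as above, on the bundled diabetes regression dataset
data = load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22)

# Standardize: fit on the training set, reuse its statistics for the test set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Linear regression via the normal equation
estimator = LinearRegression()
estimator.fit(x_train, y_train)

y_predict = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
error = mean_squared_error(y_test, y_predict)
print("R^2:", score)
print("MSE:", error)
```

Only the dataset changes; the split/standardize/fit/evaluate steps are identical to the Boston examples.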

Learning navigation: http://xqnav.top/


Reposted from blog.csdn.net/qq_43874317/article/details/128249536