Using the linear regression class LinearRegression and the stochastic gradient descent class SGDRegressor; Boston house price prediction; sklearn.metrics.mean_squared_error(y_true, y_pred)

1. Linear regression API

  • sklearn.linear_model.LinearRegression(fit_intercept=True): optimized via the normal equation
    • fit_intercept: whether to calculate the bias (intercept)
    • LinearRegression.coef_: regression coefficients
    • LinearRegression.intercept_: bias
  • sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01): the SGDRegressor class implements stochastic gradient descent learning; it supports different loss functions and regularization penalties to fit linear regression models
    • loss: loss type; loss="squared_loss" corresponds to ordinary least squares
    • fit_intercept: whether to calculate the bias (intercept)
    • learning_rate: string, optional, the learning rate schedule
      • 'constant': eta = eta0
      • 'optimal': eta = 1.0 / (alpha * (t + t0))
      • 'invscaling': eta = eta0 / pow(t, power_t) [default]
        • power_t=0.25: defined in the parent class
      • For a constant learning rate, set learning_rate='constant' and use eta0 to specify its value
    • SGDRegressor.coef_: regression coefficients
    • SGDRegressor.intercept_: bias
  • sklearn thus provides two implementations of linear regression; choose whichever fits the task (a minimal usage sketch follows this list)
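
As a minimal sketch of how both estimators are used (the toy data and variable names here are illustrative, not from the original post):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Toy data: y = 2*x1 + 3*x2 + 1 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(0, 0.01, 100)

# Normal-equation solver
lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)
print(lr.coef_, lr.intercept_)   # close to [2, 3] and 1

# Stochastic gradient descent with a constant learning rate
sgd = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000)
sgd.fit(X, y)
print(sgd.coef_, sgd.intercept_)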

2. Boston house price forecast

2.1 Dataset description

Number of instances: 506
Number of attributes: 13, numerical or categorical;
the median home value (the 14th attribute) is usually the learning target.
Attribute information (in order):

  1. CRIM: per-capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 square feet
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
  5. NOX: nitric oxide concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built before 1940
  8. DIS: weighted distances to five Boston employment centers
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  13. LSTAT: percentage of lower-status population, i.e. the proportion of low-income groups
  14. MEDV: median value of owner-occupied homes in $1000s

Among these, CHAS is a discrete value (whether the tract bounds the Charles River); all the rest are continuous values.

Missing attribute values: none
Creators: Harrison, D. and Rubinfeld, D.L.

2.2 Evaluation Analysis

Analysis: the features in this regression problem are on very different scales, which can skew the results, so the data needs to be standardized.

  • data splitting and standardization (a short sketch of this step follows the list)
  • regression prediction
  • evaluation of the linear regression model
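
A minimal sketch of the splitting and standardization step (using the same split parameters as the full code below; the key point is to fit the scaler on the training set only and reuse its statistics on the test set):

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22)

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)  # learn mean/std from training data
x_test = transfer.transform(x_test)        # apply the same statistics to test data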

Regression Performance Evaluation

Mean Squared Error (MSE) Evaluation Mechanism

MSE=\frac{1}{m} \sum_{i=1}^{m}\left(y^{i}-\bar{y}^{i}\right)^{2}

where y^{i} is the predicted value and \bar{y}^{i} is the true value

  • sklearn.metrics.mean_squared_error(y_true, y_pred): mean squared error regression loss (a quick example follows this list)
    • y_true: true value
    • y_pred: predicted value
    • return: floating point result
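
As a quick check of the formula against the library call (the numbers are made up for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5])   # real values
y_pred = np.array([2.5, 5.0, 4.0])   # predicted values

mse_manual = np.mean((y_pred - y_true) ** 2)     # hand computation of the formula
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)   # both print 0.8333...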

2.3 Normal equation optimization

The code for the normal equation is as follows 

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression  # import the model class
from sklearn.metrics import mean_squared_error


def linear_model1():   # Linear regression: normal equation
    # 1. Load the data
    data = load_boston()
    # print(data)   # inspect the data
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=22)
    # 3. Feature engineering: standardization (fit on train, reuse on test)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Machine learning: linear regression (normal equation)
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # 5.1 Predictions, coefficients, and bias
    y_predict = estimator.predict(x_test)
    print("Predicted values:\n", y_predict)
    print("Model coefficients:\n", estimator.coef_)
    print("Model bias:\n", estimator.intercept_)
    score = estimator.score(x_test, y_test)
    print('R² score:', score)
    # 5.2 Evaluation
    # Mean squared error
    error = mean_squared_error(y_test, y_predict)
    print("MSE:\n", error)


if __name__ == '__main__':
    linear_model1()
------------------------------------------------------
Output:
Predicted values:
 [27.79728567 30.90056436 20.70927059 31.59515005 18.71926707 18.46483447
 20.7090385  18.01249201 18.18443754 32.26228416 20.45969144 27.30025768
 15.04218041 19.25382799 36.18076812 18.45209512  7.73077544 17.33936848
 29.40094704 23.32172471 18.43837789 33.31097321 28.38611788 17.43787678
 34.25179785 26.06150404 34.65387545 26.07481562 19.13116067 12.66351087
 30.00302966 14.70773445 36.82392563  9.08197058 15.06703028 16.68218611
  7.99793409 19.41266159 39.15193917 27.42584071 24.24171273 16.93863931
 38.03318373  6.63678428 21.51394405 24.41042009 18.86273557 19.87843319
 15.71796503 26.48901546  8.09589057 26.90160249 29.19481155 16.86472843
  8.47361081 34.87951213 32.41546229 20.50741461 16.27779646 20.32570308
 22.82622646 23.45866662 19.01451735 37.50382701 23.61872796 19.43409925
 12.98316226  6.99153964 40.99988893 20.87265869 16.74869905 20.79222071
 39.90859398 20.20645238 36.15225857 26.80056368 19.20376894 19.60725424
 24.04458577 20.45114082 30.47485108 19.09694834 22.55307626 30.77038574
 26.2119968  20.48073193 28.53910224 20.16485961 25.94461242 19.13440772
 24.98211795 22.84782867 19.18212763 18.88071352 14.49151931 17.78587168
 24.00230395 16.01304321 20.51185516 26.1867442  20.64288449 17.35297955]
Model coefficients:
 [-0.73088157  1.13214851 -0.14177415  0.86273811 -2.02555721  2.72118285
 -0.1604136  -3.36678479  2.5618082  -1.68047903 -1.67613468  0.91214657
 -3.79458347]
Model bias:
 22.57970297029704
R² score: 0.7636568232153278
MSE:
 20.95597975396346
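
For intuition about what LinearRegression solves internally: the normal equation has the closed form theta = (X^T X)^{-1} X^T y once a bias column is appended to X. A minimal numpy sketch on made-up data (not the author's code):

import numpy as np

# Made-up data generated from y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Append a column of ones so the bias is learned as an extra coefficient
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(theta)  # approximately [2.0, 1.0] -> coefficient 2, bias 1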

2.4 Stochastic Gradient Descent Optimization 

The code using stochastic gradient descent is as follows

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor  # import the model class
from sklearn.metrics import mean_squared_error


def linear_model2():   # Linear regression: gradient descent
    # 1. Load the data
    data = load_boston()
    # print(data)   # inspect the data
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=22)
    # 3. Feature engineering: standardization (fit on train, reuse on test)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Machine learning: linear regression (gradient descent)
    estimator = SGDRegressor(max_iter=1000)  # optional: learning_rate="constant", eta0=0.1
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # 5.1 Predictions, coefficients, and bias
    y_predict = estimator.predict(x_test)
    print("Predicted values:\n", y_predict)
    print("Model coefficients:\n", estimator.coef_)
    print("Model bias:\n", estimator.intercept_)
    score = estimator.score(x_test, y_test)
    print('R² score:', score)
    # 5.2 Evaluation
    # Mean squared error
    error = mean_squared_error(y_test, y_predict)
    print("MSE:\n", error)


if __name__ == '__main__':
    linear_model2()
--------------------------------------------------------
Output:
Predicted values:
 [27.83246143 30.88161189 20.95432    31.45884213 18.90925761 18.45917348
 20.90390675 17.86489232 18.22187674 32.16423006 20.74622179 27.02917088
 15.15330812 19.44252401 36.21385874 18.29591891  8.05902515 17.45662698
 29.43176546 23.26800519 18.40708287 33.11232958 28.04759069 17.28697425
 34.04417864 25.93173669 34.01544881 25.87379276 19.0817398  13.35425655
 29.8038706  13.62694089 36.71012991  9.64018296 15.20666928 16.37591592
  8.13807324 19.21248282 38.83533902 27.56476042 24.20613861 16.99058386
 38.31995441  6.48680166 21.27848416 24.13211464 19.23114948 20.08241424
 15.48439508 26.52060401  9.16924982 26.48884249 29.10616856 16.88112873
  8.6002894  34.64518763 31.44042482 21.37712748 16.27340122 20.05326982
 22.90189135 23.27151853 19.19993177 37.02482951 24.47042046 19.32200853
 13.12788319  6.90005885 41.20296986 20.77692349 16.52125048 20.78254188
 39.83026312 20.41243682 35.97145063 26.71062411 19.7893679  19.87802844
 23.96525037 21.71779398 30.67652286 18.90746059 22.46569409 30.35593801
 26.56083175 20.42805978 28.43908174 20.15769996 26.25308188 18.0554561
 24.44439573 22.62991121 19.18708049 19.63198957 14.62405071 17.6798955
 23.7063984  16.06260271 20.29155128 26.22744611 20.54511949 17.41100032]
Model coefficients:
 [-0.59947749  0.93207663 -0.41252482  0.89679016 -1.81068943  2.80178787
 -0.205792   -3.214616    1.80637319 -0.88243966 -1.63129733  0.90483779
 -3.76952324]
Model bias:
 [22.5715974]
R² score: 0.7592182017426476
MSE:
 21.34954162015491

2.5 Warning statement

The warning below appears because the dataset has an ethical problem and has been deprecated; for this exercise it can be ignored

FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.
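
If you want to avoid the deprecated loader, one drop-in alternative (a sketch, not part of the original post) is the California housing dataset, which exposes the same (data, target) interface:

from sklearn.datasets import fetch_california_housing

# Downloads on first use; 8 numerical features instead of Boston's 13,
# so the rest of the pipeline above works unchanged
data = fetch_california_housing()
print(data.data.shape, data.target.shape)  # (20640, 8) (20640,)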

Learning navigation: http://xqnav.top/
