1. Import the packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()  # deprecated in scikit-learn 1.0 and removed in 1.2; requires scikit-learn < 1.2
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print(boston.DESCR)
2. Get the data (no data cleaning is done here)
x = boston.data
y = boston.target
x.shape # (506, 13)
y.shape # (506,)
3. Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
4. Model training
reg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
5. Model parameters: the coefficient of each feature, plus the intercept
# the coefficient corresponding to each feature
reg.coef_
array([-7.56857766e-02,  4.93306230e-02,  6.85902135e-02,  2.55876122e+00,
       -1.60400649e+01,  4.09692993e+00,  6.55718540e-03, -1.41742836e+00,
        2.92373287e-01, -1.41859462e-02, -9.68019957e-01,  1.16809189e-02,
       -5.33536333e-01])
# the intercept
reg.intercept_
32.926954792283404
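The coefficients and intercept fully determine the model: a prediction is just the dot product of a sample's features with coef_, plus intercept_. The sketch below checks this against predict() on synthetic data (a hypothetical stand-in, since load_boston is unavailable in newer scikit-learn versions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data standing in for the Boston features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)

# A prediction is the features dotted with coef_, plus intercept_
manual = X @ reg.coef_ + reg.intercept_
print(np.allclose(manual, reg.predict(X)))  # True
```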
6. Model accuracy evaluation index: R square
# R Squared(r2 score)
reg.score(X_test, y_test)
0.6336069713055628
R2_score can be understood as the proportion of the target's variance that the model explains, measured against a baseline model that always predicts the mean:
- R2_score = 1: the predictions match the true values exactly, with no error;
- R2_score = 0: the numerator equals the denominator, so the model performs no better than the mean baseline;
- R2_score < 0: the model is worse than simply predicting the mean; in that case the features likely have no linear relationship with the target.
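The cases above follow from the formula R² = 1 − SSE/SST, where SSE is the model's squared error and SST is the squared error of the mean baseline. A minimal sketch, using synthetic data in place of the Boston dataset, that computes R² by hand and compares it with sklearn's r2_score and reg.score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (a stand-in for the Boston data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.7, 0.0]) + 5.0 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# R^2 = 1 - SSE / SST: model error relative to the mean-baseline error
sse = np.sum((y_test - y_pred) ** 2)          # model's squared error
sst = np.sum((y_test - y_test.mean()) ** 2)   # baseline (mean) squared error
r2_manual = 1 - sse / sst

print(np.isclose(r2_manual, r2_score(y_test, y_pred)))   # True
print(np.isclose(r2_manual, reg.score(X_test, y_test)))  # True
```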