Machine Learning Basics 08: Regression Algorithm Matrix Analysis (Based on the Boston House Price Dataset)

Regression algorithms often use matrices to represent data and model parameters. Linear regression, one of the most common regression algorithms, can be expressed in matrix form.

Consider a simple linear regression model: y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept. This model can be written in matrix form as follows:

(Figure: the simple linear model written in matrix form, y = X·w.)

In this matrix expression, the column vector on the left holds the dependent variable y; each row of the matrix on the right holds the independent variable x together with a constant term 1; and the model parameters m and b are stacked into a column vector w, so that y = X·w.

The optimal parameters m and b are found by minimizing the residuals (the differences between the observed values and the model's predictions), which usually involves matrix computations such as the least-squares method.
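
To make this concrete, here is a minimal NumPy sketch (not from the original article; the synthetic data are an illustrative assumption) that fits y = mx + b by least squares using the design matrix described above:

import numpy as np

# Toy data drawn from y = 2x + 1 plus noise (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# Design matrix: each row is [x_i, 1], matching the matrix form above.
A = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem min ||A @ [m, b] - y||^2.
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print("m = %.3f, b = %.3f" % (m, b))  # close to the true slope 2 and intercept 1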

Other, more complex regression algorithms, such as multiple linear regression, ridge regression, and Lasso regression, can also be derived and solved through matrix representations, which make the calculations more compact and easier to follow.
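
As one example, ridge regression keeps a closed-form matrix solution, w = (X^T X + alpha*I)^(-1) X^T y. The sketch below is an illustration under stated assumptions (centered synthetic data so the intercept can be dropped; alpha is an assumed regularization strength), not the article's code:

import numpy as np

# Synthetic, centered data: centering X and y lets us omit the intercept.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)
y -= y.mean()

alpha = 1.0  # assumed regularization strength
# Closed-form ridge solution: solve (X^T X + alpha*I) w = X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(w)  # coefficients shrunk toward zero relative to ordinary least squares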

Next, three metrics for evaluating regression algorithms in machine learning are presented:

  1. Mean Absolute Error (MAE)
  2. Mean Squared Error (MSE)
  3. Coefficient of determination (R2)

The examples use the Boston house price dataset for hands-on experiments.

Dataset download address:

https://github.com/selva86/datasets/blob/master/BostonHousing.csv

Dataset introduction:

Boston house price prediction is a regression task, predicting a continuous value, and a classic machine learning case study.
(Figure: preview of the BostonHousing.csv dataset; it contains 506 samples, 13 feature columns, and the target column medv, the median house price.)

Mean absolute error (MAE)

The mean absolute error is the average of the absolute differences between the predicted and observed values: MAE = (1/n) Σ|yᵢ − ŷᵢ|. Unlike the mean error, positive and negative deviations cannot cancel each other out because absolute values are taken, so the MAE better reflects the true magnitude of the prediction error.
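
As a quick illustration (the toy numbers below are assumptions, not from the article), MAE can be computed by hand and checked against scikit-learn's mean_absolute_error:

import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MAE = mean(|y_i - yhat_i|)
mae_manual = np.mean(np.abs(y_true - y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))  # both 0.5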

The cross-validation code on the Boston dataset is as follows:


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Data preprocessing: load the dataset (adjust the path to your environment)
path = r'D:\down\BostonHousing.csv'
data = pd.read_csv(path)

array = data.values
X = array[:, 0:13]  # the 13 feature columns
Y = array[:, 13]    # the target column (medv, median house price)

# 10-fold cross-validation with a fixed seed for reproducibility
n_splits = 10
seed = 7
kfold = KFold(n_splits=n_splits, random_state=seed, shuffle=True)

model = LinearRegression()

# scikit-learn negates error metrics so that higher scores are better
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)

print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))


The execution results are as follows:


MAE: -3.387 (0.667)

The value is negative because cross_val_score always maximizes its scoring function, so scikit-learn negates error metrics (hence neg_mean_absolute_error); the actual MAE is about 3.387, and 0.667 is the standard deviation across the ten folds.

Mean squared error (MSE)

The mean squared error is the average of the squared differences between predicted and observed values: MSE = (1/n) Σ(yᵢ − ŷᵢ)². It measures the average error and reflects the degree of variation in the data. The root mean squared error (RMSE) is the arithmetic square root of the MSE. The smaller the MSE, the more accurately the prediction model describes the experimental data.
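
A short sketch (toy values assumed for illustration) showing the relationship between MSE and RMSE using scikit-learn's mean_squared_error:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = mean((y_i - yhat_i)^2); RMSE is its square root.
mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))  # 0.375 and about 0.612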

The code is as follows:


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Data preprocessing (adjust the path to your environment)
path = r'D:\down\BostonHousing.csv'
data = pd.read_csv(path)

array = data.values
X = array[:, 0:13]  # the 13 feature columns
Y = array[:, 13]    # the target column (medv)

# Same 10-fold cross-validation setup as before
n_splits = 10
seed = 7
kfold = KFold(n_splits=n_splits, random_state=seed, shuffle=True)

model = LinearRegression()

# Only the scoring metric changes: negated mean squared error
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)

print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))




The result of the operation is as follows:

MSE: -23.747 (11.143)

The sign is flipped for the same reason as above; the actual MSE is about 23.747, which corresponds to an RMSE of roughly 4.87.

Coefficient of determination (R2)

The coefficient of determination reflects the proportion of the total variation in the dependent variable that the independent variables can explain through the regression relationship: R2 = 1 − SS_res/SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares. The larger the goodness of fit, the more of the dependent variable's variation the independent variables explain, and the more densely the observation points cluster around the regression line.

If R2 is 0.8, the regression relationship explains 80% of the variation in the dependent variable; in other words, if the independent variables could be held constant, the variation in the dependent variable would be reduced by 80%.
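
To see the definition at work, here is a small sketch (toy values assumed, not from the article) that computes R2 = 1 − SS_res/SS_tot by hand and compares it with scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both about 0.949
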
Characteristics of the coefficient of determination (R2):

  • The coefficient of determination is a non-negative statistic.
  • Its value ranges over 0 ≤ R2 ≤ 1.
  • It is a function of the sample observations and is therefore a random variable that varies with random sampling. For this reason, the statistical reliability of the coefficient of determination should also be tested.

The code is as follows:


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Data preprocessing (adjust the path to your environment)
path = r'D:\down\BostonHousing.csv'
data = pd.read_csv(path)

array = data.values
X = array[:, 0:13]  # the 13 feature columns
Y = array[:, 13]    # the target column (medv)

# Same 10-fold cross-validation setup as before
n_splits = 10
seed = 7
kfold = KFold(n_splits=n_splits, random_state=seed, shuffle=True)

model = LinearRegression()

# R2 is already "higher is better", so no sign flip is needed
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)

print("R2: %.3f (%.3f)" % (results.mean(), results.std()))




The execution results are as follows:

R2: 0.718 (0.099)

R2, also known as the coefficient of determination, is an indicator of a regression model's goodness of fit. Its value ranges from 0 to 1: the closer to 1, the better the fit; the closer to 0, the poorer the fit.

In this result, "R2: 0.718" means the model's goodness of fit is 0.718, which can be roughly read as the model explaining about 71.8% of the variance of the target variable. The "(0.099)" is the standard deviation of the R2 scores across the ten folds, indicating how much the score varies from fold to fold (rather than a confidence interval).
