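A Linear Regression Example in Python (with Figures)
Simple Linear Regression
Let's implement a simple linear regression example in Python.
Assume that the two variables below are linearly related. We therefore try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable, (x).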
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
y | 1 | 3 | 2 | 5 | 7 | 8 | 8 | 9 | 10 | 12 |
In general, we define:
x as the feature vector, i.e. x = [x_1, x_2, …, x_n],
y as the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in the example above, n = 10).
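Before fitting anything, it can help to check the assumption that x and y are linearly related. The snippet below is a minimal sketch that stores the table above in NumPy arrays and computes the Pearson correlation with `np.corrcoef`; using the correlation as a sanity check is an addition here, not a step from the original walkthrough.

```python
import numpy as np

# The ten observations from the table above.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print("n =", x.size)
print("Pearson r = {:.3f}".format(r))
```

A value of r close to 1 suggests that a straight line is a reasonable model for this data.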
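The following Python script plots the dataset above as a scatter plot, estimates the regression coefficients, and draws the fitted line.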
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 26 18:48:49 2020
@author: Bean029
"""
import numpy as np
import matplotlib.pyplot as plt


def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return b_0, b_1


def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m",
                marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()


def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)


if __name__ == "__main__":
    main()
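The resulting scatter plot looks like this:
Our task now is to find the line that best fits the scatter plot above, so that we can predict the response for any new feature value (that is, an x value that is not present in the dataset).
This line is called the regression line.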
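To make the prediction step concrete, here is a minimal sketch that fits the same ten points with `np.polyfit` (a degree-1 least-squares fit) and evaluates the line at a value outside the dataset. `np.polyfit` is swapped in here as a cross-check on the hand-written `estimate_coef` function, and `x_new = 10` is a hypothetical input chosen only for illustration.

```python
import numpy as np

# Same observations as in the script above.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
b_1, b_0 = np.polyfit(x, y, 1)

# Hypothetical new feature value that does not appear in the dataset.
x_new = 10
y_new = b_0 + b_1 * x_new

print("b_0 = {:.4f}, b_1 = {:.4f}".format(b_0, b_1))
print("predicted y at x = {}: {:.4f}".format(x_new, y_new))
```

The slope and intercept should agree with the values printed by `estimate_coef`, since both approaches minimize the same sum of squared residuals.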
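Multiple Linear Regression
Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to the observed data.
Clearly, it is simply an extension of simple linear regression.
Consider a dataset with p features (or independent variables) and one response (or dependent variable).
In addition, the dataset contains n rows (observations).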
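As a rough illustration of this setup, the sketch below fits a linear equation to a tiny dataset with n = 5 rows and p = 2 features using `np.linalg.lstsq`. The data is a hypothetical toy example, generated without noise from y = 1 + 2*x1 + 3*x2 so that the recovered coefficients are easy to verify; it is not taken from the original post.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, p = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Response built as y = 1 + 2*x1 + 3*x2, with no noise.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(X.shape[0]), X])

# Solve the least-squares problem: minimize ||X_design @ beta - y||^2.
beta, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept:", beta[0])
print("feature coefficients:", beta[1:])
```

Each row of `X` is one observation and each column is one feature, which is the same layout scikit-learn expects for its feature matrix.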
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 26 18:53:13 2020
@author: Bean029
"""
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# load the boston dataset
# NOTE: load_boston was removed in scikit-learn 1.2, so this script
# needs an older release (scikit-learn < 1.2) to run as written
boston = datasets.load_boston(return_X_y=False)
# defining feature matrix(X) and response vector(y)
X = boston.data
y = boston.target
# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)
# create linear regression object
reg = linear_model.LinearRegression()
# train the model using the training sets
reg.fit(X_train, y_train)
# regression coefficients
print('Coefficients: \n', reg.coef_)
# R^2 score (coefficient of determination): 1 means perfect prediction
print('R^2 score: {}'.format(reg.score(X_test, y_test)))
# plot for residual error
## setting plot style
plt.style.use('fivethirtyeight')
## plotting residual errors (predicted minus actual) in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color="red", s=10, label='Train data')
## plotting residual errors (predicted minus actual) in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color="blue", s=10, label='Test data')
## plotting line for zero residual error
plt.hlines(y=0, xmin=0, xmax=50, linewidth=2)
## plotting legend
plt.legend(loc='upper right')
## plot title
plt.title("Residual errors")
## function to show plot
plt.show()
Output of the program: