Comparison of linear regression models for Boston house price prediction (ordinary method vs. sklearn library method)


Work hard to not be mediocre~

The biggest reason to learn is to escape mediocrity. Start a day earlier and life gains a little more excitement; start a day later and you carry one more day of mediocrity's troubles.

Linear regression is a statistical learning method that models a dependent variable as a linear function of one or more independent variables. It assumes the relationship between them is linear and fits the data by minimizing the sum of squared residuals.
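In matrix form, the least-squares objective and its closed-form minimizer can be written as below. The notation is an assumption chosen to match the code later in this article: $X$ is the feature matrix with a leading column of ones, $y$ is the target vector, $\theta$ is the parameter vector, and $m$ is the number of samples.

```latex
J(\theta) = \lVert X\theta - y \rVert_2^2
          = \sum_{i=1}^{m} \left( x_i^{\top}\theta - y_i \right)^2

\nabla_\theta J(\theta) = 2\, X^{\top} (X\theta - y) = 0
\;\Longrightarrow\;
\hat{\theta} = (X^{\top} X)^{-1} X^{\top} y
```

The last step assumes that $X^{\top}X$ is invertible, which requires the columns of $X$ to be linearly independent.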

Predicting Boston housing prices is a classic application of linear regression. The goal is to predict house prices in the Boston area from features such as the number of rooms, the crime rate, and the student-faculty ratio.

With the sklearn library, the linear regression workflow has three main steps: data preparation, model training, and prediction.

1. Data preparation:
First, we need a dataset of Boston housing prices. The sklearn library used to provide one through load_boston; that function was deprecated in scikit-learn 1.0 and removed in 1.2, which is why the code later in this article loads the data from a file or a URL instead.
Then, we divide the dataset into features (X) and targets (y), where X contains all independent variables and y contains the corresponding dependent variable (house price).
If needed, we can preprocess the data before training, for example by standardizing or normalizing the features.

2. Model training:
Use the linear regression model (LinearRegression) in the sklearn library for model training.
Call the fit() method, take the feature data X and target data y as input, and fit the linear regression model.
During the training process, the model will automatically calculate the best regression coefficients and intercepts to minimize the sum of squares of the residuals.

3. Prediction:
Use the trained linear regression model to make predictions.
Call the predict() method with the feature data to be predicted as input to get the prediction results.
The results are estimates of house prices in the Boston area for the given feature values.
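The three steps above can be sketched end to end as follows. This is a minimal illustration, not the article's own code: the data is synthetic (an assumption standing in for the housing data), and the use of StandardScaler is just one possible preprocessing choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption, not the real dataset):
# 200 samples, 3 features, target = 5 + 2*x0 - 1*x1 + 0.5*x2 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1. Data preparation: split, then standardize using training statistics only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 2. Model training: fit() estimates the coefficients and the intercept.
model = LinearRegression()
model.fit(X_train_s, y_train)

# 3. Prediction: predict() returns one estimate per row of the input.
y_pred = model.predict(X_test_s)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on the test set:", model.score(X_test_s, y_test))
```

The scaler is fitted on the training split only, so no information from the test split leaks into the preprocessing.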

Let's first look at a simple linear regression simulation:

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. Linear regression simulation
## 1.1 Read the simulated data from ex0_1.csv
df = pd.read_csv("ex0_1.csv")
data = df.values
X = data[:, 0].reshape((-1, 1))
y = data[:, 1].reshape((-1, 1))

## 1.2 Draw a scatter plot
plt.scatter(X, y)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Scatter Plot')
plt.show()

## 1.3 Normal equation solution: inv(X.T @ X) @ X.T @ y
X_b = np.c_[np.ones((len(X), 1)), X]  # add the bias (intercept) column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

## 1.4 Plot the fitted line
plt.scatter(X, y)
plt.plot(X, X_b @ theta, color='red')  # predictions are computed from X_b
plt.xlabel('X')
plt.ylabel('y')
plt.title('Fitted Line')
plt.show()

## 1.5 Fitted function
intercept, slope = theta[0][0], theta[1][0]
print("Fitted function: y = {:.2f} + {:.2f}x".format(intercept, slope))

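## 1.6 Solve with gradient descent
def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):  # avoid shadowing the built-in iter()
        h = X.dot(theta)              # current predictions
        loss = h - y                  # residuals
        gradient = X.T.dot(loss) / m  # gradient of the mean squared error (scaled by 1/2)
        theta -= alpha * gradient
    return theta

alpha = 0.01  # learning rate
num_iters = 1000  # number of iterations
theta = np.zeros((X_b.shape[1], 1))  # initialize the parameters
theta = gradient_descent(X_b, y, theta, alpha, num_iters)
print("Parameters found by gradient descent:", theta)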
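Whether gradient descent reaches the normal equation solution depends on the learning rate and the number of iterations. The sketch below checks this on synthetic data, since ex0_1.csv is not included here. The data-generating line y = 3 + 2x and the values of alpha and num_iters are assumptions picked for illustration.

```python
import numpy as np

# Synthetic data (an assumption standing in for ex0_1.csv): y = 3 + 2x + noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = 3.0 + 2.0 * X + rng.normal(scale=0.1, size=(100, 1))
X_b = np.c_[np.ones((len(X), 1)), X]  # add the bias column

# Closed-form solution: solve (X^T X) theta = X^T y without forming the inverse.
theta_ne = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent on the mean squared error (scaled by 1/2)."""
    m = len(y)
    theta = theta.copy()  # leave the caller's array untouched
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta


# alpha and num_iters are illustrative values picked for this synthetic data.
theta_gd = gradient_descent(X_b, y, np.zeros((2, 1)), alpha=0.5, num_iters=2000)

print("normal equation: ", theta_ne.ravel())
print("gradient descent:", theta_gd.ravel())
print("max difference:  ", np.abs(theta_ne - theta_gd).max())
```

With too small a learning rate or too few iterations, the two results can still differ noticeably, so comparing them is a quick convergence check.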

Next, we predict Boston house prices in two ways: first with a hand-written normal equation solver, then with the LinearRegression class from sklearn.

# -*- coding: utf-8 -*-
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

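# 2.1 Data preparation
with open("data.save", "rb") as f:  # close the file once the data is loaded
    data = pickle.load(f)
X = data[:, :-1]  # all columns except the last are features
y = data[:, -1]   # the last column is the target (house price)
print(data)

# 2.2 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=420)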

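# 2.3 Define and call the LR function (normal equation)
def LR(X, y):
    ones = np.ones((X.shape[0], 1))
    X_ones = np.hstack((ones, X))  # prepend a column of ones for the intercept
    XT = X_ones.T
    XTX = np.dot(XT, X_ones)
    XTX_1 = np.linalg.inv(XTX)     # inverse of X^T X
    XTX_1XT = np.dot(XTX_1, XT)
    W = np.dot(XTX_1XT, y)         # W[0] is the intercept, W[1:] are the coefficients
    return W

W = LR(X_train, y_train)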

# 2.4 Prediction
ones = np.ones((X_test.shape[0], 1))
X_test_ones = np.hstack((ones, X_test))
y_predict = np.dot(X_test_ones, W)

# 2.5 Evaluation
# mean_squared_error returns the MSE, not its square root
print("Mean squared error:", mean_squared_error(y_test, y_predict))


# Plot the relationship between the number of rooms and the house price
plt.scatter(X_train[:, 5], y_train, label='Training data')
plt.scatter(X_test[:, 5], y_test, label='Test data')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('House price')
plt.title('Number of rooms vs. price')
plt.legend()
plt.show()
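Note that mean_squared_error returns the mean squared error (MSE), not the root mean squared error (RMSE). If the RMSE is wanted, one version-independent way is to take the square root with NumPy. The sketch below uses toy values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values chosen for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_hat)  # mean of squared errors: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                      # same units as the target

print("MSE: ", mse)
print("RMSE:", rmse)
```

The RMSE is often easier to interpret because it is expressed in the same units as the house price itself.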

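Note that the dataset is loaded directly from a local file (data.save) here, because load_boston is no longer available in newer versions of scikit-learn.

The following version loads the data another way, from its original source URL, and trains the model with sklearn: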

# -*- coding: utf-8 -*-
# 3. Implementation using the methods in sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset; each record spans two lines in the raw file
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)  # raw string avoids an invalid-escape warning
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=38)

## 3.1 Train the model
model = LinearRegression()
model.fit(X_train, y_train)

## 3.2 Prediction
y_pred = model.predict(X_test)

## 3.3 Evaluation
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)
r2 = model.score(X_test, y_test)  # coefficient of determination
print("Goodness of fit (R^2):", r2)


# Plot the relationship between the number of rooms and the house price
plt.scatter(X_train[:, 5], y_train, label='Training data')
plt.scatter(X_test[:, 5], y_test, label='Test data')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('House price')
plt.title('Number of rooms vs. price')
plt.legend()
plt.show()
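The two scripts above use different data sources and different random_state values (420 and 38), so their printed errors are not directly comparable. To compare the two methods fairly, both can be run on the same split. The sketch below does this on synthetic data (an assumption, not the Boston dataset), with a compact version of the LR function shown earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def LR(X, y):
    """Normal equation solver, same idea as the LR function shown earlier."""
    X_ones = np.hstack((np.ones((X.shape[0], 1)), X))
    return np.linalg.inv(X_ones.T @ X_ones) @ X_ones.T @ y


# Synthetic data (an assumption; not the Boston dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 1.5 + X @ np.array([0.8, -2.0, 0.3, 1.1]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=38
)

# Hand-written solution.
W = LR(X_train, y_train)
y_pred_ne = np.hstack((np.ones((X_test.shape[0], 1)), X_test)) @ W

# sklearn solution on the same split.
model = LinearRegression().fit(X_train, y_train)
y_pred_sk = model.predict(X_test)

mse_ne = mean_squared_error(y_test, y_pred_ne)
mse_sk = mean_squared_error(y_test, y_pred_sk)
print("intercepts:", W[0], model.intercept_)
print("max coefficient difference:", np.abs(W[1:] - model.coef_).max())
print("MSE (normal equation):", mse_ne)
print("MSE (sklearn):        ", mse_sk)
```

On well-conditioned data like this, both methods solve the same least-squares problem and should agree up to floating-point error. Differences are more likely to appear when the features are nearly collinear and the explicit matrix inverse becomes numerically unstable.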
