Linear regression (for learning use only, to be supplemented)

Table of contents

1. Definition

2. A simple example


1. Definition

Linear regression is a classic statistical method used to model the linear relationship between independent and dependent variables. Its basic idea is to find an optimal straight line under the condition of given independent variables, so that the residual sum of squares between the straight line and the observed data is minimized.

A linear regression model can be expressed as:

y = β0 + β1x1 + β2x2 + ... + βpxp + ε

Among them, y is the dependent variable, x1, x2, ..., xp are the independent variables, β0, β1, β2, ..., βp are the coefficients of the linear regression model, and ε is the random error.

The establishment of a linear regression model is divided into two stages: model training and model prediction.

1. In the model training phase

A known set of training data is required to estimate the coefficients of a linear regression model. The most commonly used method is the method of least squares, which estimates the coefficients of the model by minimizing the sum of squared residuals. Specifically, the goal of the least squares method is to make

S = Σ(yi - ŷi) ^ 2 is minimized, where yi is the observed value and ŷi is the predicted value estimated by the linear regression model. A complete linear regression model can be obtained by solving for the estimated values ​​of the coefficients.

2. In the model prediction stage

The trained model can be used to make predictions on new unknown data. Specifically, by substituting the new independent variable value into the linear regression model, the corresponding predicted value of the dependent variable is obtained. Predicted values ​​can be used to describe the relationship between independent and dependent variables, or for tasks such as classification and diagnosis.

Linear regression is widely used in many practical problems, such as housing price forecast, sales forecast, risk assessment, etc. Its advantage is that it is simple to calculate and easy to understand and implement. However, linear regression also has some limitations, such as only dealing with linear relationships, being sensitive to outliers, etc. Therefore, in practical applications, it is necessary to select an appropriate regression method according to the specific problem.

2. A simple example

A common linear regression example is house price prediction. Suppose we have a data set that contains the characteristics of different houses (such as size, number of bedrooms, number of bathrooms, distance from the city center, etc.) and the corresponding prices. We can use a linear regression model to model the relationship between housing prices and these features in order to make price predictions for new homes.

Specifically, we can express the house price y as a linear combination of the area x1, the number of bedrooms x2, the number of bathrooms x3 and the distance from the city center x4, namely:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

Among them, β0, β1, β2, β3, β4 are the coefficients of the linear regression model, ε is the random error.

During the training phase, we need to estimate these coefficients using a known dataset. The most commonly used method is the method of least squares, which estimates the coefficients of the model by minimizing the sum of squared residuals. Specifically, the goal of the least squares method is to make

S = Σ(yi - ŷi) ^ 2

Minimize, where yi is the known house price and ŷi is the predicted house price estimated by the linear regression model. By solving for the estimated values ​​of the coefficients, we can obtain the complete linear regression model.

In the prediction phase, we can use the trained model to predict the price of new houses. Specifically, we input the eigenvalues ​​of the new houses and substitute them into the linear regression model to obtain the corresponding predicted housing prices.

It should be noted that the linear regression model is sensitive to outliers, so the data needs to be cleaned and processed before use to avoid unreasonable results.

Code

import numpy as np
import matplotlib.pyplot as plt

# 训练集
X = np.array([[1400, 3, 2],
              [1600, 3, 2.5],
              [1700, 4, 2.5],
              [1800, 4, 3],
              [2000, 4, 3.5]])
y = np.array([245000, 312000, 279000, 308000, 345000])

# 添加常数列
X = np.c_[np.ones(X.shape[0]), X]

# 使用最小二乘法估计系数
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# 输出估计的系数
print("估计的系数:", theta)

# 预测
X_new = np.array([1, 1600, 3, 2.5]).reshape(1, -1)
y_pred = X_new @ theta
print("预测的房价:", y_pred)

# 绘制模型
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 1], X[:, 2], y, c='r', marker='o')
ax.set_xlabel('Size')
ax.set_ylabel('Bedrooms')
ax.set_zlabel('Price')

x1 = np.linspace(1400, 2000, 10)
x2 = np.linspace(3, 4, 10)
x1, x2 = np.meshgrid(x1, x2)
y_pred = theta[0] + theta[1] * x1 + theta[2] * x2
ax.plot_surface(x1, x2, y_pred, alpha=0.5)

plt.show()

operation result

估计的系数: [38290.76923077    82.87362637 -6477.27272727 35769.23076923]
预测的房价: [284667.58185425]

where X is the training set and y is the corresponding house price. When estimating coefficients using the least squares method, a constant column needs to be added first to facilitate the estimation of the intercept. When predicting, we can input a new feature value for prediction and get the corresponding house price prediction value. Finally, we can also plot the visualization results of the model to better understand the performance of the model.

Guess you like

Origin blog.csdn.net/z377989129/article/details/129967765