Linear regression
Foreword
Keep one distinction in mind: a classification algorithm predicts a discrete target value, while a regression algorithm predicts a continuous target value.
What is linear regression?
Definition: Linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable. The model is a linear combination of one or more parameters called regression coefficients.
Univariate linear regression: only one independent variable is involved
Multiple linear regression: two or more independent variables are involved
Note: The features should be standardized before fitting a linear regression, so that a feature with a large numeric range does not end up with an outsized influence on the final result
Examples
Single feature:
- Try to find values k and b that satisfy:
- House price = house area * k + b (b is the offset, which makes the model more general for a single feature)
Multiple features: (house size, house location, ...)
- Try to find values k1, k2, ..., b that satisfy:
- House price = house area * k1 + house location * k2 +… + b
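To make the house price example concrete, here is a minimal sketch that recovers k and b from a handful of points. The data and the values 3, 10 and 50 are made up for illustration; only LinearRegression, coef_ and intercept_ come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical single-feature data generated from: price = 3 * area + 50
area = np.array([[50.0], [80.0], [100.0], [120.0]])
price = 3.0 * area.ravel() + 50.0

model = LinearRegression().fit(area, price)
k = model.coef_[0]      # slope for the single feature
b = model.intercept_    # offset
print(k, b)

# Hypothetical two-feature data: price = 3 * area + 10 * location + 50
features = np.array([[50.0, 1.0], [80.0, 3.0], [100.0, 2.0], [120.0, 5.0]])
price2 = 3.0 * features[:, 0] + 10.0 * features[:, 1] + 50.0

model2 = LinearRegression().fit(features, price2)
print(model2.coef_, model2.intercept_)  # k1, k2 and b
```

Because the toy data contains no noise, the fitted values match the generating k, k1, k2 and b up to floating point error.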
Linear relationship model
Try to find a combination of attributes and weights to predict the result:
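In a common notation (assumed here, since the text does not define symbols), with features x_1, ..., x_n, weights w_1, ..., w_n and offset b, the model can be written as:

```latex
h_w(x) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = w^{\mathsf{T}} x + b
```

In the house price example, w_1 and w_2 play the role of k1 and k2.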
However, the predicted value always differs from the true value by some error, so a loss function is needed to measure that error
Loss function (least squares method)
How do we find the weights W of the model that minimize the loss? (The goal is to find the W value corresponding to the smallest loss)
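A common way to write the least squares loss, assuming m training samples (x^{(i)}, y^{(i)}) and a model prediction h_w(x) (symbols not defined in the original text), is the sum of squared errors:

```latex
J(w) = \sum_{i=1}^{m} \bigl( h_w(x^{(i)}) - y^{(i)} \bigr)^2
```

Minimizing J(w) over w gives the weights with the smallest total squared error on the training data.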
Optimization
The normal equation of least squares (not recommended)
Disadvantages:
- When there are too many features, solving becomes too slow
- The normal equation cannot be used for more complex algorithms (logistic regression, etc.)
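As a rough illustration of the normal equation, the sketch below solves w = (XᵀX)⁻¹Xᵀy with NumPy on synthetic data. The data and the variable names are assumptions for this example, and this is not necessarily how scikit-learn solves the problem internally.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3 * x1 + 2 * x2 + 5
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, 2.0]) + 5.0

# Append a column of ones so the offset b is learned as one more weight.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: (X^T X) w = X^T y, solved without forming the inverse.
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(w)  # weights for x1, x2, followed by the offset
```

Solving this system costs on the order of n³ operations for n features, which is where the slowdown with many features comes from.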
The visual graph of the loss function (single variable example) is as follows:
Gradient descent for least squares (❤️ ❤️ ❤️)
Intuition: move step by step in the direction in which the loss function decreases, updating the W value at each step, until the lowest point of the function is reached. (It is a process of constant iteration.)
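The iteration described above can be sketched as plain batch gradient descent on the mean squared error. The function name gradient_descent, the learning rate, the iteration count and the synthetic data are all choices made for this sketch; SGDRegressor works on individual samples and uses a learning rate schedule, so this is an illustration of the idea rather than its implementation.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iters=500):
    """Minimize the mean squared error of X @ w + b against y."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y                 # prediction minus truth
        grad_w = 2.0 / m * (X.T @ error)      # d(MSE)/dw
        grad_b = 2.0 / m * error.sum()        # d(MSE)/db
        w -= learning_rate * grad_w           # step against the gradient
        b -= learning_rate * grad_b
    return w, b

# Synthetic data (hypothetical): y = 3 * x1 + 2 * x2 + 5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, 2.0]) + 5.0

w, b = gradient_descent(X, y)
print(w, b)
```

The features here already have roughly unit scale; with features on very different scales the same learning rate may diverge or converge slowly, which is one reason standardization matters for gradient descent.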
Background knowledge
sklearn linear regression normal equation API
- sklearn.linear_model.LinearRegression()
coef_: regression coefficients
sklearn linear regression gradient descent API
- sklearn.linear_model.SGDRegressor()
coef_: regression coefficients
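A quick way to see both estimators and their coef_ attribute side by side is to fit them on the same standardized data. The synthetic data and the random_state value are assumptions for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): y = 3 * x1 + 2 * x2 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, 2.0]) + 1.0

# Standardize the features before fitting, as recommended for gradient descent.
X_std = StandardScaler().fit_transform(X)

normal_eq = LinearRegression().fit(X_std, y)
sgd = SGDRegressor(random_state=0).fit(X_std, y)

print("LinearRegression coef_:", normal_eq.coef_)
print("SGDRegressor coef_:    ", sgd.coef_)
```

The two sets of coefficients should be close but not identical, because SGDRegressor is iterative and applies a small L2 penalty by default.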
Code demo
Normal equation case
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def mylinear():
    """
    Predict house prices with linear regression.
    :return: None
    """
    # Load the data.
    # NOTE: load_boston was removed in scikit-learn 1.2, so this demo needs an older release.
    lb = load_boston()

    # Split the dataset into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    print(y_train, y_test)

    # Standardize both the features and the target.
    # The scaler expects 2-D input, so the 1-D target is reshaped.
    std_x = StandardScaler()
    # Features
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    # Target
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    # Estimator: solve with the normal equation
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print(lr.coef_)

    # Predict the test set prices and map them back to the original scale
    y_predict = std_y.inverse_transform(lr.predict(x_test))
    print("Predicted price for each sample in the test set:", y_predict)

    return None


if __name__ == "__main__":
    mylinear()
Gradient descent
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def mylinear():
    """
    Predict house prices with linear regression.
    :return: None
    """
    # Load the data.
    # NOTE: load_boston was removed in scikit-learn 1.2, so this demo needs an older release.
    lb = load_boston()

    # Split the dataset into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    print(y_train, y_test)

    # Standardize both the features and the target.
    # The scaler expects 2-D input, so the 1-D target is reshaped.
    std_x = StandardScaler()
    # Features
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    # Target
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    # Estimator: solve with the normal equation
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print(lr.coef_)

    # Predict the test set prices and map them back to the original scale
    y_lr_predict = std_y.inverse_transform(lr.predict(x_test))
    print("Normal equation, predicted price for each test sample:", y_lr_predict)

    # House price prediction with gradient descent.
    # The learning_rate parameter defaults to 'invscaling'. From the scikit-learn docs:
    #   'constant':   eta = eta0
    #   'optimal':    eta = 1.0 / (alpha * (t + t0)),
    #                 where t0 is chosen by a heuristic proposed by Leon Bottou.
    #   'invscaling': eta = eta0 / pow(t, power_t)  [default]
    #   'adaptive':   eta = eta0, as long as the training keeps decreasing.
    #                 Each time n_iter_no_change consecutive epochs fail to decrease the
    #                 training loss by tol or fail to increase validation score by tol if
    #                 early_stopping is True, the current learning rate is divided by 5.
    sgd = SGDRegressor()
    # SGDRegressor expects a 1-D target, so flatten the standardized y
    sgd.fit(x_train, y_train.ravel())
    print(sgd.coef_)

    # predict() returns a 1-D array; reshape it to 2-D before inverse_transform
    y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test).reshape(-1, 1))
    print("Gradient descent, predicted price for each test sample:", y_sgd_predict)

    return None


if __name__ == "__main__":
    mylinear()
Regression performance evaluation
Mean squared error regression loss
mean_squared_error(y_true, y_pred)
- y_true: true value
- y_pred: predicted value
- return: floating point result
Note: Both the true values and the predicted values must be on the original scale, that is, the values before standardization
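As a small worked example, the mean squared error is the average of the squared differences between true and predicted values. The four numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices on the original scale
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors: 0.25, 0.0, 2.25, 1.0 -> mean = 3.5 / 4
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # both 0.875
```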
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error


def mylinear():
    """
    Predict house prices with linear regression.
    :return: None
    """
    # Load the data.
    # NOTE: load_boston was removed in scikit-learn 1.2, so this demo needs an older release.
    lb = load_boston()

    # Split the dataset into a training set and a test set
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    print(y_train, y_test)

    # Standardize both the features and the target.
    # The scaler expects 2-D input, so the 1-D target is reshaped.
    std_x = StandardScaler()
    # Features
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    # Target
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    # Estimator: solve with the normal equation
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print(lr.coef_)

    # Predict the test set prices and map them back to the original scale
    y_lr_predict = std_y.inverse_transform(lr.predict(x_test))
    # print("Normal equation, predicted price for each test sample:", y_lr_predict)
    print("Mean squared error of the normal equation:",
          mean_squared_error(std_y.inverse_transform(y_test), y_lr_predict))

    # House price prediction with gradient descent.
    # The learning_rate parameter defaults to 'invscaling'. From the scikit-learn docs:
    #   'constant':   eta = eta0
    #   'optimal':    eta = 1.0 / (alpha * (t + t0)),
    #                 where t0 is chosen by a heuristic proposed by Leon Bottou.
    #   'invscaling': eta = eta0 / pow(t, power_t)  [default]
    #   'adaptive':   eta = eta0, as long as the training keeps decreasing.
    #                 Each time n_iter_no_change consecutive epochs fail to decrease the
    #                 training loss by tol or fail to increase validation score by tol if
    #                 early_stopping is True, the current learning rate is divided by 5.
    sgd = SGDRegressor()
    # SGDRegressor expects a 1-D target, so flatten the standardized y
    sgd.fit(x_train, y_train.ravel())
    print(sgd.coef_)

    # predict() returns a 1-D array; reshape it to 2-D before inverse_transform
    y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test).reshape(-1, 1))
    # print("Gradient descent, predicted price for each test sample:", y_sgd_predict)
    # Score the SGD predictions (the original passed y_lr_predict here by mistake)
    print("Mean squared error of gradient descent:",
          mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))

    return None


if __name__ == "__main__":
    mylinear()
Summary
Features: Linear regression is the simplest and easiest to use regression model.
Its use is restricted to some extent. Nevertheless, when the relationship between the features is unknown, linear regression is still the first choice for most systems.
Small-scale data: LinearRegression (cannot address overfitting) and others
Large-scale data: SGDRegressor