Linear Regression [Brief Summary of Machine Learning Notes]

Definitions and Formulas

Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target value).

General formula: h(w) = w_1x_1 + w_2x_2 + w_3x_3 + ... + b = w^T x + b

There are two main kinds of model in linear regression: one captures a linear relationship between features and target, the other a nonlinear relationship.

Losses and Optimization for Linear Regression

1. Loss
    Least squares
2. Optimization
    Normal equation
    Gradient descent
3. Normal equation -- solves everything in one step
    Uses the matrix inverse and transpose to solve directly
    Only suitable when the numbers of samples and features are small
4. Gradient descent -- step by step
    Analogy:
        Mountain -- a differentiable function
        Foot of the mountain -- the minimum of the function
    Concept of the gradient
        Single variable -- tangent line
        Multiple variables -- vector
    Two parameters to pay attention to in gradient descent
        α -- the step size
            Step size too small -- descending the mountain is too slow
            Step size too large -- easy to jump over the minimum point
        Why the gradient carries a minus sign
            The gradient points in the direction of fastest ascent; the minus sign gives the direction of fastest descent
5. Gradient descent vs. the normal equation:
    Gradient descent                      Normal equation
    Requires choosing a learning rate     Not required
    Requires iterative solving            Solved in one computation
    Usable when the number of features is large    Must compute the equation; time complexity is high, O(n^3)
6. Choice:
    Small-scale data:
        LinearRegression (cannot address overfitting)
        Ridge regression
    Large-scale data:
        SGDRegressor

1 Loss function

J(θ) = (h_w(x_1) − y_1)² + (h_w(x_2) − y_2)² + ... + (h_w(x_n) − y_n)²

     = ∑_{i=1}^{n} (h_w(x_i) − y_i)²

  • yi is the true value of the i-th training sample
  • h(x_i) is the prediction for the i-th training sample, computed from its combination of feature values
  • Least Squares

2 Optimization algorithm

How to find W in the model to minimize the loss? (The purpose is to find the W value corresponding to the minimum loss)

Two optimization algorithms commonly used in linear regression

2.1 Normal equation

w = (X^T X)^{-1} X^T y

Understanding: X is the feature matrix and y is the vector of target values. This gives the best result directly.

Disadvantage: when there are too many features or they are complex, solving is too slow and may fail to produce a result.

Derivation of Normal Equation

J(θ) = (h_w(x_1) − y_1)² + (h_w(x_2) − y_2)² + ... + (h_w(x_n) − y_n)²

     = ∑_{i=1}^{n} (h_w(x_i) − y_i)²

     = (Xw − y)²

where y is the vector of true values, X is the feature matrix, and w is the weight vector

To find the w that minimizes the loss: since y and X are known, J is a quadratic function of w, so we take the derivative directly; the point where the derivative is zero is the minimum.

Take the derivative and set it to zero (the step-by-step derivation appears in the figure in the original post). Note: in the step from equation (1) to equation (2), X is an m-row, n-column matrix and is not guaranteed to have an inverse; however, X^T X is a square matrix, so (assuming it is invertible) its inverse can be used.

The step from equation (5) to equation (6) is similar.
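To make the formula concrete, here is a minimal numpy sketch of the normal equation on a tiny made-up dataset (the numbers and the appended bias column are assumptions for illustration only):

import numpy as np

# tiny made-up dataset: 4 samples, 2 features (values are arbitrary; here y = 2*x1 + 3*x2)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])

# append a column of ones so the bias b is learned as one more weight
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(w)  # last entry is the bias; np.linalg.pinv would be a more stable choice in practice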

2.2 Gradient Descent

1 Full gradient descent (FG)
    Uses the average error over all samples as the objective function for each update
2 Stochastic gradient descent (SG)
    Uses only a single sample for each update
3 Mini-batch gradient descent (mini-batch)
    Uses a subset of the samples for each update
4 Stochastic average gradient descent (SAG)
    Maintains a stored gradient for every sample and uses the average of these stored values in later updates
  • 1. Gradient descent for a single-variable function

Suppose there is a single-variable function: J(θ) = θ²

Derivative of the function: ∇J(θ) = 2θ

Initialization: starting point θ0 = 1

Learning rate: α = 0.4

Iterative calculation of gradient descent (each step is shown in the figure in the original post): after four iterations, i.e., four steps, we essentially reach the lowest point of the function, the bottom of the mountain.
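Carrying out the arithmetic with the setup above (θ0 = 1, α = 0.4, ∇J(θ) = 2θ) and the update rule θ ← θ − α·∇J(θ):

θ1 = 1 − 0.4 × 2 × 1 = 0.2
θ2 = 0.2 − 0.4 × 2 × 0.2 = 0.04
θ3 = 0.04 − 0.4 × 2 × 0.04 = 0.008
θ4 = 0.008 − 0.4 × 2 × 0.008 = 0.0016

which is already very close to the minimum at θ = 0.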

  • 2. Gradient descent for a multi-variable function

Suppose there is an objective function: J(θ) = θ₁² + θ₂²

We want to compute the minimum of this function by gradient descent. By inspection, the minimum is at the point (0, 0), but now we will compute it step by step with the gradient descent algorithm. Assume the initial starting point is θ0 = (1, 3).

Initial learning rate: α = 0.1

The gradient of the function is: ∇J(θ) = ⟨2θ₁, 2θ₂⟩

Perform multiple iterations (the figure in the original post traces the sequence of points converging toward (0, 0)).

The gradient descent update formula is:

θ = θ − α·∇J(θ)

α is called the learning rate or step size in the gradient descent algorithm.

Adding a minus sign before the gradient means moving in the opposite direction of the gradient, i.e., the direction of fastest descent.
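A minimal Python sketch of these iterations, assuming the setup above (J(θ) = θ₁² + θ₂², starting point (1, 3), α = 0.1); the iteration count of 50 is an arbitrary choice:

import numpy as np

theta = np.array([1.0, 3.0])   # starting point θ0 = (1, 3)
alpha = 0.1                    # learning rate

def grad(t):
    # gradient of J(θ) = θ1^2 + θ2^2 is (2θ1, 2θ2)
    return 2 * t

for _ in range(50):
    theta = theta - alpha * grad(theta)   # move against the gradient

print(theta)  # approaches the minimum at (0, 0)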

  • Comparison of gradient descent and the normal equation:

    Gradient descent                          Normal equation
    Requires choosing a learning rate         Not required
    Requires iterative solving                Solved in one computation
    Usable when the number of features is large    Must compute the equation; time complexity is high, O(n^3)

Choose:

  • Small-scale data:

    • LinearRegression (cannot address overfitting)
    • Ridge regression
  • Large-scale data: SGDRegressor

  • Full gradient descent algorithm (FG)

  • Stochastic gradient descent algorithm (SG)

  • Stochastic average gradient descent algorithm (SAG)

  • Mini-batch gradient descent algorithm

Full gradient descent algorithm (FG)

It calculates the errors of all samples in the training set, sums them, and takes the average as the objective function.

The weight vector moves in the opposite direction of its gradient, thereby reducing the current objective function the most.

Batch gradient descent is slow because we need to compute all gradients on the entire dataset when performing each update. At the same time, batch gradient descent cannot handle datasets that exceed the memory capacity limit.

The batch gradient descent method also cannot update the model online, that is, new samples cannot be added during the running process.

It computes the gradient of the loss function with respect to the parameter θ over the entire training data set:

θ = θ − α · ∇_θ J(θ)
Stochastic Gradient Descent Algorithm (SG)

Since FG has to compute the errors of all samples for every weight update, and real problems often have hundreds of millions of training samples, its efficiency is low and it can easily get stuck in a local optimum. The stochastic gradient descent algorithm was therefore proposed.

The objective function of each round is no longer the error over all samples but the error of a single sample: each update computes the gradient of the objective for one sample only, then the next sample is drawn and the process repeats, until the loss function stops decreasing or falls below some tolerable threshold.

This process is simple and efficient, and can usually help prevent the update iterations from converging to poor local optima. Its iterative form is:

θ = θ − α · ∇_θ J(θ; x(i), y(i))

where x(i) represents the feature values of one training sample and y(i) represents its label value.

However, since SG uses only one sample per iteration, it can easily fall into a local optimum if it encounters noise.

Mini-batch gradient descent algorithm (mini-batch)

The small-batch gradient descent algorithm is a compromise between FG and SG, which takes into account the advantages of the above two methods to a certain extent.

Each time a small sample set is randomly selected from the training sample set, and FG is used to iteratively update the weights on the extracted small sample set.

The number of sample points contained in the extracted small sample set is called batch_size, which is usually set to a power of 2, which is more conducive to GPU acceleration.

In particular, if batch_size = 1 it becomes SG; if batch_size = n it becomes FG. Its iterative form is:

θ = θ − α · ∇_θ J(θ; x(i:i+batch_size), y(i:i+batch_size))

Stochastic average gradient descent algorithm (SAG)

In the SG method, although the problem of high computational cost is avoided, the SG effect is often unsatisfactory for big data training because each round of gradient update is completely independent of the data and gradient of the previous round.

The stochastic average gradient algorithm overcomes this problem. It keeps an old gradient in memory for each sample; at each step it randomly selects the i-th sample, updates that sample's stored gradient while keeping the others unchanged, and then uses the average of all the stored gradients to update the parameters.

In this way, each round of update only needs to calculate the gradient of one sample, and the calculation cost is equivalent to SG, but the convergence speed is much faster.

Comparison of four gradients

(1) Since the FG method uses the entire data set for each round of updates, it costs the most time and requires the most memory.

(2) SAG performs poorly in the early stage of training, and the optimization speed is slow. This is because we often set the initial gradient to 0, and each round of SAG gradient update combines the previous round of gradient values.

(3) Considering the number of iterations and running time, SG performance is very good. It can quickly get rid of the initial gradient value in the early stage of training and quickly reduce the average loss function to a very low level. However, it should be noted that when using the SG method, the step size must be carefully selected, otherwise it is easy to miss the optimal solution.

(4) Mini-batch combines the "boldness" of SG and the "carefulness" of FG, and its performance sits exactly between the two. In the current field of machine learning, mini-batch is the most commonly used gradient descent algorithm, precisely because it avoids both the high computational cost of FG and the unstable convergence of SG.
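As a rough sketch of how the three update loops differ for linear regression, here is a comparison on synthetic data; the data, learning rate, iteration counts, and batch size are assumptions chosen only to show the structure of each loop:

import numpy as np

rng = np.random.RandomState(0)
X = 3 * rng.rand(200, 1)
y = 8 + 4 * X[:, 0] + rng.randn(200)           # roughly y = 8 + 4x plus noise
Xb = np.hstack([X, np.ones((200, 1))])         # append a bias column

def gradient(w, Xb, y):
    # gradient of the mean squared error (1/n) * ||Xb w - y||^2
    return 2 / len(y) * Xb.T @ (Xb @ w - y)

eta = 0.05   # learning rate

# full gradient descent (FG): every update uses all samples
w_fg = np.zeros(2)
for _ in range(1000):
    w_fg -= eta * gradient(w_fg, Xb, y)

# stochastic gradient descent (SG): every update uses a single random sample
w_sg = np.zeros(2)
for _ in range(1000):
    i = rng.randint(len(y))
    w_sg -= eta * gradient(w_sg, Xb[i:i + 1], y[i:i + 1])

# mini-batch gradient descent: every update uses a small random batch
w_mb = np.zeros(2)
for _ in range(1000):
    idx = rng.randint(len(y), size=16)          # batch_size = 16
    w_mb -= eta * gradient(w_mb, Xb[idx], y[idx])

print(w_fg, w_sg, w_mb)   # each lands roughly near the true parameters [4, 8]; SG is the noisiest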

Simulating gradient descent

import numpy as np
import matplotlib.pyplot as plt

plot_x = np.linspace(-1,6,141)

plot_y = (plot_x-2.5)**2-1

plt.plot(plot_x,plot_y)
plt.show()


# derivative of J
def dJ(theta):
    return 2*(theta-2.5)


def J(theta):
    return (theta-2.5)**2-1

eta=0.1  # learning rate
epsilon=1e-8

theta=0
theta_history=[theta]
while True:
    gradient=dJ(theta)
    last_theta=theta
    theta=theta-eta*gradient
    theta_history.append(theta)

    if abs(J(theta)-J(last_theta))<epsilon :
        break
print('x at the minimum:',theta)
print('y at the minimum:',J(theta))
plt.plot(plot_x,J(plot_x))
plt.plot(np.array(theta_history),J(np.array(theta_history)),color='r',marker='+')
plt.show()


LinearRegression and SGDRegressor

class sklearn.linear_model.LinearRegression(*, fit_intercept=True,normalize='deprecated', copy_X=True, n_jobs=None, positive=False)
  • Optimized via the normal equation
  • fit_intercept : default True; whether to calculate the model's intercept. When False, the data is expected to be centered.
    normalize : default False; whether to normalize. If fit_intercept is set to False, this parameter is ignored. If True, the input samples are transformed as (X - X_mean)/||X||; if normalize=False, you can use sklearn.preprocessing.StandardScaler to normalize before training the model.
    copy_X : default True; otherwise X may be overwritten.
    n_jobs : default 1, the number of CPUs used; -1 means use all CPUs.
  • coef_ : regression coefficients
  • intercept_ : bias (intercept)
  • predict(X) : predict values
  • score(X, y, sample_weight=None) : returns R², equal to 1 - ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
class sklearn.linear_model.SGDRegressor(loss='squared_error', *, penalty='l2',alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)
  • The SGDRegressor class implements stochastic gradient descent learning and supports different loss functions and regularization penalty terms to fit linear regression models.
  • loss: loss type
    • loss=”squared_error”: ordinary least squares method
  • fit_intercept: whether to calculate offset
  • learning_rate : string, optional
    • learning rate schedule
    • 'constant': eta = eta0
    • 'optimal': eta = 1.0 / (alpha * (t + t0))
    • 'invscaling': eta = eta0 / pow(t, power_t) [default for SGDRegressor]
      • power_t=0.25: defined in the parent class
    • For a constant learning rate, use learning_rate='constant' and use eta0 to specify the learning rate.
    • eta0: double, the initial learning rate when learning_rate is 'constant' or 'invscaling'. The default value is 0.01. This parameter is unused when learning_rate='optimal'.
  • SGDRegressor.coef_: regression coefficient
  • SGDRegressor.intercept_: bias

The advantages of SGD:
  • Efficient
  • Easy to implement (many opportunities for code tuning)
The disadvantages of SGD:
  • SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations.
  • SGD is sensitive to feature scaling.

from sklearn.linear_model import LinearRegression

x = [[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

# instantiate the estimator
estimator = LinearRegression()
# train with the fit method
estimator.fit(x,y)

print(estimator.coef_)
print(estimator.predict([[100, 80]]))
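For comparison, a minimal sketch of the SGDRegressor counterpart, reusing the x and y lists defined just above. Feature scaling is added because, as noted earlier, SGD is sensitive to it; the resulting coefficients will differ from LinearRegression and depend on the random state:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# scale the features first, then fit with stochastic gradient descent
sgd = make_pipeline(StandardScaler(),
                    SGDRegressor(loss="squared_error", max_iter=1000, random_state=0))
sgd.fit(x, y)

print(sgd.named_steps['sgdregressor'].coef_)   # coefficients in the scaled feature space
print(sgd.predict([[100, 80]]))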

Linear regression model evaluation

Verify the regression model with several metrics:

SSE (sum of squared errors): the sum of squares due to error
MSE (mean squared error): (1/m) ∑_{i=1}^{m} (y_i − ŷ_i)²
RMSE (root mean squared error): sqrt( (1/m) ∑_{i=1}^{m} (y_i − ŷ_i)² ) = sqrt(MSE)
MAE (mean absolute error): (1/m) ∑_{i=1}^{m} |y_i − ŷ_i|

R² (coefficient of determination):

R² = 1 − ∑_{i=1}^{m} (y_i − ŷ_i)² / ∑_{i=1}^{m} (y_i − ȳ)²

Dividing the numerator and denominator by m gives the equivalent form:

R² = 1 − MSE(ŷ, y) / Var(y)

In fact, the coefficient of determination expresses the quality of a fit through the variation in the data. Its normal range is [0, 1]; the closer it is to 1, the stronger the explanatory power of the model's variables for y and the better the model fits the data.

The larger R² is, the better. When the prediction model makes no errors at all, R² = 1. If R² < 0, the learned model is worse than the baseline model (always predicting the mean).
Note: R² < 0 may indicate that the data has no linear relationship at all.
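A small check of the R² definition on made-up numbers (the two arrays are hypothetical), comparing the manual formula with sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])     # hypothetical true values
y_pred = np.array([2.5, 5.0, 7.5, 9.5])     # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)                   # manual R^2 (0.9625 here)
print(r2_score(y_true, y_pred))              # same value from sklearn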

from sklearn.metrics import mean_squared_error   # mean squared error
from sklearn.metrics import mean_absolute_error  # mean absolute error
from sklearn.metrics import r2_score             # R squared
# usage (y_test are the true values, y_predict the predictions)
mean_squared_error(y_test,y_predict)
mean_absolute_error(y_test,y_predict)
r2_score(y_test,y_predict)
# Simple linear regression (univariate linear regression)
# (1) Data example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus']=False

rng = np.random.RandomState(1)
xtrain = 10 * rng.rand(30)
ytrain = 8 + 4 * xtrain + rng.rand(30)
# np.random.RandomState → random seed; for a given generator, the same seed always produces the same sequence
# generate random data x and y
# underlying relationship: y = 8 + 4*x


fig = plt.figure(figsize =(12,3))
ax1 = fig.add_subplot(1,2,1)
plt.scatter(xtrain,ytrain,marker = '.',color = 'k')
plt.grid()
plt.title('Scatter plot of the sample data')
# scatter plot of the samples

model = LinearRegression()
model.fit(xtrain[:,np.newaxis],ytrain)
# x[:,np.newaxis] → reshape the array to (n,1)

print('weights:',model.coef_)
print('bias:',model.intercept_)

xtest = np.linspace(0,10,1000)
ytest = model.predict(xtest[:,np.newaxis])
# create test data xtest and compute ytest from the fitted line
# model.predict → predict

ax2 = fig.add_subplot(1,2,2)
plt.scatter(xtrain,ytrain,marker = '.',color = 'k')
plt.plot(xtest,ytest,color = 'r')
plt.grid()
plt.title('Linear regression fit')
# plot the scatter and the fitted regression line

# weights: [4.00448414]
# bias: 8.447659499431026


# Simple linear regression (univariate linear regression)
# (2) Errors

rng = np.random.RandomState(8)
xtrain = 10 * rng.rand(15)
ytrain = 8 + 4 * xtrain + rng.rand(15) * 30
model.fit(xtrain[:,np.newaxis],ytrain)
xtest = np.linspace(0,10,1000)
ytest = model.predict(xtest[:,np.newaxis])
# create sample data and fit it

plt.plot(xtest,ytest,color = 'r',linestyle = '--')  # fitted line
plt.scatter(xtrain,ytrain,marker = '.',color = 'k')  # scatter plot of the sample data
ytest2 = model.predict(xtrain[:,np.newaxis])  # y values of the sample x on the fitted line
plt.scatter(xtrain,ytest2,marker = 'x',color = 'g')   # scatter plot of ytest2
plt.plot([xtrain,xtrain],[ytrain,ytest2],color = 'gray')  # error lines
plt.grid()
plt.title('Errors')
# draw the chart


Polynomial regression

numpy.polyfit implements univariate polynomial fitting

numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)

[polyfit] fits a polynomial curve
[polyval] evaluates a polynomial at given points
[poly1d] builds a polynomial object from coefficients, ordered from the highest degree to the lowest
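A tiny check of how these three fit together; the sample points are an assumed example chosen to lie exactly on y = 2x² + 1:

import numpy as np

coef = np.polyfit([0, 1, 2, 3], [1, 3, 9, 19], 2)   # fit a degree-2 polynomial
p = np.poly1d(coef)                                  # wrap the coefficients as a polynomial object
print(np.polyval(coef, 4))                           # evaluate at x = 4 (≈ 33) ...
print(p(4))                                          # ... same value via the poly1d object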

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.random.uniform(-3, 3, size=100) # sample uniformly from [low, high); the interval is closed on the left, open on the right
X = x.reshape(-1, 1) # n rows, 1 column
y = 0.5 + x**2 + x + 2 + np.random.normal(0, 1, size=100)  # a quadratic relationship with added noise

coef2 = np.polyfit(x,y, 2) # the last argument is the degree
poly_fit2 = np.poly1d(coef2)

plt.scatter(x, y)
plt.plot(np.sort(x),poly_fit2(x)[np.argsort(x)], color='r',label="2nd-order fit")

print(poly_fit2)


import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus']=False
 
x = np.array([-4,-3,-2,-1,0,1,2,3,4,5,6,7,8,9,10])
y = np.array(2*(x**4) + x**2 + 9*x + 2) # assume the target y follows this formula exactly
#y = np.array([300,500,0,-10,0,20,200,300,1000,800,4000,5000,10000,9000,22000])

# coef holds the coefficients, poly_fit is the fitted polynomial function
coef1 = np.polyfit(x,y, 1)
poly_fit1 = np.poly1d(coef1)
plt.plot(x, poly_fit1(x), 'g',label="1st-order fit")
print(poly_fit1)

coef2 = np.polyfit(x,y, 2)
poly_fit2 = np.poly1d(coef2)
plt.plot(x, poly_fit2(x), 'b',label="2nd-order fit")
print(poly_fit2)

coef3 = np.polyfit(x,y, 3)
poly_fit3 = np.poly1d(coef3)
plt.plot(x, poly_fit3(x), 'y',label="3rd-order fit")
print(poly_fit3)

coef4 = np.polyfit(x,y, 4)
poly_fit4 = np.poly1d(coef4)
plt.plot(x, poly_fit4(x), 'k',label="4th-order fit")
print(poly_fit4)

coef5 = np.polyfit(x,y, 5)
poly_fit5 = np.poly1d(coef5)
plt.plot(x, poly_fit5(x), 'r:',label="5th-order fit")
print(poly_fit5)
print(poly_fit5)
 
plt.scatter(x, y, color='black')
plt.legend(loc=2)
plt.show()


Polynomial regression in sklearn

Polynomial regression can be seen as preprocessing the data and adding new features to it, so the class used lives in sklearn.preprocessing:

class sklearn.preprocessing.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')

Use the class sklearn.preprocessing.PolynomialFeatures to construct features. The construction method is to multiply features together (each feature with itself and with the others); this is what "polynomial features" means.
For example: with two features a and b, the degree-2 polynomial features are [1, a, b, a², ab, b²].
PolynomialFeatures has 3 main parameters, illustrated in the sketch below:
degree: controls the degree of the polynomial;
interaction_only: default False; if True, only interaction terms are produced, so the pure powers a² and b² are excluded from the combined features;
include_bias: default True; if True the result contains the zero-power term, i.e., a column of all 1s.
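A short sketch of how degree, interaction_only and include_bias change the output for two features a and b (the tiny input matrix is just an assumed example):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

ab = np.array([[2.0, 3.0]])   # one sample with two features a=2, b=3

# default: degree=2, include_bias=True -> [1, a, b, a^2, ab, b^2]
print(PolynomialFeatures(degree=2).fit_transform(ab))                           # [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True drops the pure powers a^2 and b^2 -> [1, a, b, ab]
print(PolynomialFeatures(degree=2, interaction_only=True).fit_transform(ab))    # [[1. 2. 3. 6.]]

# include_bias=False drops the leading column of ones -> [a, b, a^2, ab, b^2]
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(ab))       # [[2. 3. 4. 6. 9.]]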

Univariate polynomial regression in sklearn

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(1)
x = np.random.uniform(-3, 3, size=100) # sample uniformly from [low, high); the interval is closed on the left, open on the right
X = x.reshape(-1, 1) # n rows, 1 column
y = 0.5 + x**2 + x + 2 + np.random.normal(0, 1, size=100)
# degree controls the highest power of the polynomial features
poly = PolynomialFeatures(degree=2)
poly.fit(X)
X2 = poly.transform(X)
# X2.shape
# output: (100, 3)

X2[:5,:]
#array([[ 1.        , -0.49786797,  0.24787252],
#       [ 1.        ,  1.32194696,  1.74754377],
#       [ 1.        , -2.99931375,  8.99588298],
#       [ 1.        , -1.18600456,  1.40660683],
#       [ 1.        , -2.11946466,  4.49213042]])

In X2, the first column is the constant term, which can be seen as adding a column of x to the power 0; the second column is the linear term (the original feature X); the third column is the quadratic term (the feature X squared).

After the features are ready, train:

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X2, y)
y_predict = reg.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

print(reg.coef_)
print(reg.intercept_)


Multivariate polynomial regression in sklearn

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11).reshape(5, 2)    # a 5-row, 2-column matrix with 10 elements
poly = PolynomialFeatures()
poly.fit(X)
# transform X into a data set containing terms up to the square of X
X2 = poly.transform(X)
# 5 rows, 6 columns

print(X2.shape)
print(X2)

>>>
(5, 6)
[[  1.   1.   2.   1.   2.   4.]
 [  1.   3.   4.   9.  12.  16.]
 [  1.   5.   6.  25.  30.  36.]
 [  1.   7.   8.  49.  56.  64.]
 [  1.   9.  10.  81.  90. 100.]]

It can be seen that when the data is 2-dimensional, polynomial preprocessing generates 6-dimensional data.

The first column is clearly the 0-degree term; the second and third columns are the original X matrix; the fourth column is the square of the second column (the first column of the original X); the fifth column is the product of the second and third columns; the sixth column is the square of the third column (the second column of the original X).

From this, you can guess what would happen if the data is 3-dimensional:

poly = PolynomialFeatures(degree=3)
poly.fit(X)
x3 = poly.transform(X)
x3.shape  #(5, 10)
x3 

# output
array([[   1.,    1.,    2.,    1.,    2.,    4.,    1.,    2.,    4.,    8.],
       [   1.,    3.,    4.,    9.,   12.,   16.,   27.,   36.,   48.,    64.],
       [   1.,    5.,    6.,   25.,   30.,   36.,  125.,  150.,  180.,    216.],
       [   1.,    7.,    8.,   49.,   56.,   64.,  343.,  392.,  448.,    512.],
       [   1.,    9.,   10.,   81.,   90.,  100.,  729.,  810.,  900.,    1000.]])

PolynomialFeatures adds every possible feature combination up to the given degree, so the number of generated features grows very quickly with the degree and the input dimension. This also causes certain problems (see the sketch below).
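A quick way to see this growth is to count the number of generated features for a few (n_features, degree) combinations; the specific values below are arbitrary:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for n_features in (2, 5, 10):
    for degree in (2, 3, 5):
        X = np.zeros((1, n_features))                     # a dummy sample, only the shape matters
        n_out = PolynomialFeatures(degree=degree).fit_transform(X).shape[1]
        print(f"n_features={n_features}, degree={degree} -> {n_out} polynomial features")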

Pipeline

In specific programming practice, you can use the pipeline in sklearn to integrate operations.

First we review the process of polynomial regression:

  • Generate the corresponding polynomial features from the original data with PolynomialFeatures
  • The polynomial features may also need feature normalization
  • Feed the data to linear regression

Pipeline puts these steps together. The parameters are passed in a list, and each element in the list is a step in the pipeline. Each element is a tuple, the first element of the tuple is the name (string), and the second element is the instantiation.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(1)
x = np.random.uniform(-3, 3, size=100) 
X = x.reshape(-1, 1) # n行1列
y = 0.5 + x**2 + x + 2 + np.random.normal(0, 1, size=100)

poly_reg = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('std_scale', StandardScaler()),
    ('lin_reg', LinearRegression())
])  
poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)

plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

In fact, there is nothing new about polynomial regression as an algorithm: it is entirely based on the idea of linear regression. The key is to add new features to the data, where the new features are polynomial combinations of the original features. In this way, nonlinear problems can be solved.

This idea is exactly the opposite of the dimensionality-reduction idea of PCA: polynomial regression increases the dimension, and after adding the new features it can fit the data better.
Underfitting

Bias and Variance

Bias and variance are defined as follows:

  • Bias: bias measures the deviation between the model's predictions and the actual values. For example, if a model's accuracy is 96%, it has low bias; conversely, if the accuracy is only 70%, it has high bias.
  • Variance: variance describes how much the model's predictions on the training data fluctuate (how dispersed they are) across different stages of training. Mathematically, it can be understood as the average of the squared differences between each prediction and the mean prediction. Usually, early in training the model complexity is low and the variance is low; as training proceeds, the model gradually fits the training data, the complexity rises, and the variance gradually increases.

Corresponding to four situations:

  • Low bias, low variance: the ideal outcome of training. The predictions (the blue points in the usual bullseye picture) fall mostly within the bullseye and are tightly clustered;
  • Low bias, high variance: the biggest problem faced in deep learning, overfitting. The model fits the training data too well, so its generalization ability is poor and accuracy drops sharply on the test set;
  • High bias, low variance: this is often the initial stage of training;
  • High bias, high variance: the worst case for training, with poor accuracy and widely dispersed predictions.

Model error:

Model error = bias + variance + unavoidable error (noise). Generally speaking, as the complexity of the model increases, the variance will gradually increase and the bias will gradually decrease.

Reasons for bias variance:

If a model is biased, the main reason may be that the assumptions about the problem itself are incorrect, or it is underfitted. For example: using linear regression for non-linear problems; or using features that have nothing to do with the problem, such as using students' names to predict test scores, which will lead to high deviations.

High variance means that a small perturbation in the data greatly affects the model. That is, the model has not fully learned the essence of the problem but has learned a lot of noise. Usually the cause is an overly complex model, such as high-degree polynomial regression, i.e., overfitting.

Some algorithms are inherently high-variance, such as the kNN algorithm. Non-parametric learning algorithms are generally high-variance because they make no assumptions about the data.

Other algorithms are inherently high-bias, such as linear regression. Parametric learning algorithms are often high-bias because they make strong assumptions about the data.

Balance of bias and variance:

Bias and variance are often contradictory. Reducing bias will increase variance; reducing variance will increase bias.

This requires maintaining a balance between bias and variance.

Taking the polynomial regression model as an example, we can choose different polynomial degrees to observe the impact of the polynomial degree on the model bias & variance:
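The original post shows a figure here; as a stand-in, a minimal sketch (the toy data and degree range are assumptions) that plots training and test MSE against the polynomial degree, the usual way to visualize this trade-off:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x ** 2 + x + 2 + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

degrees = range(1, 16)
train_err, test_err = [], []
for d in degrees:
    model = Pipeline([('poly', PolynomialFeatures(degree=d)),
                      ('scale', StandardScaler()),
                      ('lin', LinearRegression())])
    model.fit(X_train, y_train)
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    test_err.append(mean_squared_error(y_test, model.predict(X_test)))

# low degrees underfit (high bias, both errors high); very high degrees start to overfit (variance grows)
plt.plot(degrees, train_err, label='train MSE')
plt.plot(degrees, test_err, label='test MSE')
plt.xlabel('polynomial degree')
plt.legend()
plt.show()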

We need to know that bias and variance cannot be completely avoided, we can only minimize their impact.

  1. To limit bias, we need to choose the correct model as far as possible. If we keep using a linear model for a problem that is nonlinear, high bias is unavoidable no matter what.
  2. With the correct model, we need to choose the data set size carefully. Usually, the larger the data set the better, but once the data set is large enough to be representative of the whole population, adding more data no longer improves the model and only increases the computation. However, a training set that is too small is certainly bad: it leads to overfitting, high model complexity, large variance, and models that change a lot when trained on different data sets.
  3. Finally, choose an appropriate model complexity. A highly complex model usually fits the training data well.

In fact, in the field of machine learning, the main challenge comes from variance. Ways to deal with high variance include:

  • Reduce model complexity
  • Reduce data dimensions; reduce noise
  • Increase the number of samples
  • Use the validation set

Underfitting and Overfitting

Underfitting
    Performs poorly on the training set and poorly on the test set
    Remedies:
        Keep learning
        1. Add other feature terms
        2. Add polynomial features
Overfitting
    Performs well on the training set but poorly on the test set
    Remedies:
        1. Re-clean the data set
        2. Increase the amount of training data
        3. Regularization
        4. Reduce the feature dimension
Regularization
    Prevents overfitting by constraining the coefficients of the high-order terms
        L1 regularization
            Intuition: drives the coefficients of the high-order terms directly to 0
            Lasso regression
        L2 regularization
            Intuition: shrinks the coefficients of the high-order terms to very small values
            Ridge regression

Regularized linear models

During learning, some of the features provided by the data drive up the model complexity or carry too much influence, so the algorithm tries to reduce the impact of such features (or even remove a feature's influence entirely) while learning. This is regularization.

Note: during this adjustment, the algorithm does not know in advance which feature is influential; it simply adjusts the parameters to obtain an optimized result.

Both ridge regression and LASSO regression address overfitting during model training
1. Ridge Regression
    Adds a squared (L2) penalty on the coefficients
    which limits the size of the coefficient values
    The smaller α is, the larger the coefficients; the larger α is, the smaller the coefficients
2. Lasso Regression
    Applies an absolute-value (L1) penalty to the coefficients
    Because the absolute value is not differentiable at its vertex, the computation produces many zeros, and the final result is a sparse set of coefficients
3. Elastic Net
    A combination of the two above
    Controlled by a ratio r: r = 0 gives ridge regression; r = 1 gives Lasso regression
(A small comparison sketch of the three penalties follows below.)
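As referenced at the end of the outline, here is a small sketch contrasting the three penalties on the same polynomial features (the toy data, degree, and α values are assumptions); the point to notice is how many coefficients Lasso drives exactly to zero:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3 + rng.normal(0, 1, size=100)

for name, reg in [('Ridge', Ridge(alpha=1.0)),
                  ('Lasso', Lasso(alpha=0.1, max_iter=10000)),
                  ('ElasticNet', ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000))]:
    model = Pipeline([('poly', PolynomialFeatures(degree=10, include_bias=False)),
                      ('scale', StandardScaler()),
                      ('reg', reg)])
    model.fit(X, y)
    coef = model.named_steps['reg'].coef_
    # count the coefficients that the penalty has set exactly to zero
    print(name, 'zero coefficients:', np.sum(coef == 0), 'of', coef.size)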

1 Ridge Regression (also known as Tikhonov regularization)

Ridge regression is a regularized version of linear regression: a regularization term is added to the cost function of ordinary linear regression.

To fit the data while keeping the model weights as small as possible, the ridge regression cost function is:

J(θ) = MSE(θ) + α ∑_{i=1}^{n} θ_i²

  • α = 0: ridge regression degenerates into ordinary linear regression
  • α: the new hyperparameter introduced to balance the two parts of the new loss function; it is the coefficient of the regularization term and determines how much weight the objective places on making each θ_i as small as possible;
  • If α = 0: no regularization is added to the objective function;
  • If α = +∞: the MSE part of the objective accounts for a negligible share, and the main optimization task is to make every θ_i as small as possible;

2 Lasso Regression(Lasso regression)

Lasso regression is another regularized version of linear regression. The regular term is the ℓ1 norm of the weight vector.

Cost function of Lasso regression:

J(θ) = MSE(θ) + α ∑_{i=1}^{n} |θ_i|
[Note]

  • The cost function of Lasso Regression is non-differentiable at θ_i = 0.
  • Solution: use a subgradient vector in place of the gradient at θ_i = 0.
  • Subgradient vector of Lasso Regression: g(θ) = ∇_θ MSE(θ) + α·(sign(θ_1), sign(θ_2), ..., sign(θ_n))ᵀ


Lasso Regression has a very important property: it tends to completely eliminate unimportant weights.

For example: when the value of α is relatively large, the high-order polynomial degenerates into quadratic or even linear: the weight of the high-order polynomial feature is set to 0.

In other words, Lasso Regression can automatically perform feature selection and output a sparse model (only a few features have non-zero weights).

3 Elastic Net (elastic network)

The elastic network is a compromise between ridge regression and Lasso regression, controlled by the mix ratio r :

  • r=0: elastic network becomes ridge regression
  • r=1: The elastic network is Lasso regression

The cost function of the elastic network:

J(θ) = MSE(θ) + r·α ∑_{i=1}^{n} |θ_i| + ((1 − r)/2)·α ∑_{i=1}^{n} θ_i²

Generally speaking, we should avoid using plain, unregularized linear regression and should apply some regularization to the model. So how do we choose the regularization method?

summary:

  • Commonly used: Ridge regression

  • Assume that only a small number of features are useful:

    • elastic network
    • Lasso
    • In general, elastic networks are more widely used. Because when the feature dimension is higher than the number of training samples, or the features are strongly correlated, the performance of Lasso regression is unstable.
  • api:

    • from sklearn.linear_model import Ridge, ElasticNet, Lasso
      

Ridge Regression API

class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, normalize='deprecated', copy_X=True, max_iter=None, tol=0.001, solver='auto', positive=False, random_state=None)
  • Linear regression with l2 regularization
  • alpha: Regularization strength, also called λ, float type, default is 1.0. Regularization improves the condition of the problem and reduces the variance of estimates.
  • solver: solving method, str type, default is auto. Optional parameters are: auto, svd, cholesky, lsqr, sparse_cg, sag
    • auto automatically selects a solver based on the data type.
    • sag: if the data set and the number of features are both relatively large, choose this stochastic average gradient optimization. It is usually faster than other solvers when n_samples and n_features are both large. Note that fast convergence of sag is only guaranteed when the features have approximately the same scale; you can preprocess the data with a scaler from sklearn.preprocessing.
    • svd uses the singular value decomposition of X to calculate the Ridge coefficients. More stable than cholesky for singular matrices
    • cholesky uses the standard scipy.linalg.solve function to obtain closed form solutions.
    • sparse_cg uses the conjugate gradient solver found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more suitable than cholesky for large scale data (possibility of setting tol and max_iter).
    • lsqr uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest, but may not be available in older scipy versions. It uses an iterative procedure.
  • normalize: whether the data is normalized
    • normalize=False: preprocessing.StandardScaler can be called before fit to normalize data
  • Ridge.coef_: regression weights
  • Ridge.intercept_: The regression bias
    The Ridge method is equivalent to SGDRegressor(penalty='l2', loss='squared_error'), except that SGDRegressor implements plain stochastic gradient descent learning; Ridge is recommended (it implements SAG).
class sklearn.linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0), *, fit_intercept=True, normalize='deprecated', scoring=None, cv=None, gcv_mode=None, store_cv_values=False, alpha_per_target=False)

parameter:

  • alphas

Type: numpy array of shape [n_alphas]
Description: array of alpha values to try. Regularization strength must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. In other models, such as LogisticRegression or LinearSVC, α corresponds to C⁻¹ (the inverse of C).

  • cv: an object usable as a cross-validation generator, or an iterable yielding train/test splits.
  • RidgeCV is linear regression with l2 regularization and built-in cross-validation (see the sketch below).
  • The stronger the regularization, the smaller the weight coefficients.
  • The weaker the regularization, the larger the weight coefficients.
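A minimal usage sketch of RidgeCV on made-up data (the candidate alphas and the data-generating coefficients are assumptions):

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3 + rng.normal(0, 1, size=100)

# RidgeCV tries each candidate alpha using (generalized) cross-validation
reg = RidgeCV(alphas=(0.1, 1.0, 10.0))
reg.fit(X, y)

print('chosen alpha:', reg.alpha_)
print('coefficients:', reg.coef_)
print('intercept:', reg.intercept_)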

Simulating data using ridge regression

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
# np.random.uniform(-3, 3, size=100): draw 100 samples uniformly at random from [-3, 3)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3. + np.random.normal(0, 1, size=100)

plt.scatter(x, y)
plt.show()

insert image description here

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# polynomial regression wrapped in a pipeline
def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])



np.random.seed(666)
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.metrics import mean_squared_error

poly_reg = PolynomialRegression(degree=20)
poly_reg.fit(X_train, y_train)

y_poly_predict = poly_reg.predict(X_test)
print(mean_squared_error(y_test, y_poly_predict))



X_plot = np.linspace(-3, 3, 100).reshape(100, 1)
y_plot = poly_reg.predict(X_plot)

plt.scatter(x, y)
plt.plot(X_plot[:, 0], poly_reg.predict(X_plot), color='r')
plt.axis([-3, 3, 0, 6])
plt.show()

insert image description here

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

def plot_model(model):
    X_plot = np.linspace(-3, 3, 100).reshape(100, 1)
    y_plot = model.predict(X_plot)

    plt.scatter(x, y)
    plt.plot(X_plot[:, 0], model.predict(X_plot), color='r')
    plt.axis([-3, 3, 0, 6])
    plt.show()
    

# use ridge regression through a pipeline
def RidgeRegression(degree, alpha):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('ridge_reg', Ridge(alpha=alpha))
    ])
# degree = 20, α = 0.0001

ridge1_reg = RidgeRegression(20, 0.0001)
ridge1_reg.fit(X_train, y_train)

y1_predict = ridge1_reg.predict(X_test)
print(mean_squared_error(y_test, y1_predict))
# output: 1.323349275406402 (mean squared error)
plot_model(ridge1_reg)


# degree = 20, α = 1

ridge2_reg = RidgeRegression(20, 1)
ridge2_reg.fit(X_train, y_train)

y2_predict = ridge2_reg.predict(X_test)
print(mean_squared_error(y_test, y2_predict))
# output: 1.1888759304218461 (mean squared error)
plot_model(ridge2_reg)


# degree = 20, α = 100
ridge3_reg = RidgeRegression(20, 100)
ridge3_reg.fit(X_train, y_train)

y3_predict = ridge3_reg.predict(X_test)
print(mean_squared_error(y_test, y3_predict))
# output: 1.3196456113086197 (mean squared error)

plot_model(ridge3_reg)


# degree = 20, alpha = 1000000 (effectively infinite)

ridge4_reg = RidgeRegression(20, 1000000)
ridge4_reg.fit(X_train, y_train)

y4_predict = ridge4_reg.predict(X_test)
print(mean_squared_error(y_test, y4_predict))
# output: 1.8404103153255003

plot_model(ridge4_reg)

When α = 1000000 (effectively infinite), the fitted curve is almost a horizontal straight line, because when α is very large, essentially only the added regularization term matters in the objective function.

Lasso Regression API

class sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, normalize='deprecated', precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')

Simulated data using LASSO regression

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3. + np.random.normal(0, 1, size=100)

from sklearn.model_selection import train_test_split

np.random.seed(666)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# fit the data with polynomial regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])

# helper that wraps the plotting code
def plot_model(model):
    X_plot = np.linspace(-3, 3, 100).reshape(100, 1)
    y_plot = model.predict(X_plot)

    plt.scatter(x, y)
    plt.plot(X_plot[:, 0], model.predict(X_plot), color='r')
    plt.axis([-3, 3, 0, 6])
    plt.show()
# polynomial regression and plot
from sklearn.metrics import mean_squared_error

poly_reg = PolynomialRegression(degree=20)
poly_reg.fit(X_train, y_train)

y_poly_predict = poly_reg.predict(X_test)
print(mean_squared_error(y_test, y_poly_predict))
# output: 167.9401085999025 (mean squared error)

plot_model(poly_reg)


# improve the model with LASSO Regression

from sklearn.linear_model import Lasso

# use LASSO regression through a pipeline to improve the polynomial regression model
def LassoRegression(degree, alpha):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('lasso_reg', Lasso(alpha=alpha))
    ])
# degree = 20, α = 0.01

lasso1_reg = LassoRegression(20, 0.01)
lasso1_reg.fit(X_train, y_train)

y1_predict = lasso1_reg.predict(X_test)
print(mean_squared_error(y_test, y1_predict))
# output: 1.149608084325997 (mean squared error)

plot_model(lasso1_reg)


# degree = 20, α = 0.1

lasso2_reg = LassoRegression(20, 0.1)
lasso2_reg.fit(X_train, y_train)

y2_predict = lasso2_reg.predict(X_test)
print(mean_squared_error(y_test, y2_predict))
# output: 1.1213911351818648 (mean squared error)

plot_model(lasso2_reg)


# degree = 20, α = 1
lasso3_reg = LassoRegression(20, 1)
lasso3_reg.fit(X_train, y_train)

y3_predict = lasso3_reg.predict(X_test)
print(mean_squared_error(y_test, y3_predict))
# output: 1.8408939659515595 (mean squared error)

plot_model(lasso3_reg)

Analysis

  • Here α = 0.01, much larger than the first α value used in Ridge Regression. This is because in RidgeRegression() the regularization term is θ², and squaring makes the term larger, so α needs to be very small to control the size of the regularization term; in LassoRegression() the regularization term is |θ|, which is much smaller than the ridge term, so α can be comparatively larger;

  • When implementing machine learning algorithms in practice, you need to experiment constantly, look at the results, and gradually build up experience; with each tuning method you slowly learn roughly which range of values works better for each parameter;

  • When α = 1, the degree of regularization applied by LassoRegression() is already quite high;
    "degree of regularization" here refers to how much the fitting curve jitters up and down;

  • In real machine learning practice, the goal is to find the best point between no regularization at all and excessive regularization;

Compare Ridge Regression and LASSO Regression

1) With the Ridge-improved polynomial regression, as α changes the fitted curve stays a curve until it finally flattens into an almost horizontal straight line; that is, the model's variables keep nonzero coefficients, so it is hard to obtain a sloped straight line.

2) With the Lasso-improved polynomial regression, as α changes the fitted curve quickly becomes a sloped straight line and finally flattens into an almost horizontal straight line; the model is more inclined toward a straight line.


  • Characteristic of LASSO: it tends to drive part of the θ values to exactly 0, meaning LASSO regards the features whose θ = 0 as completely useless and the remaining features (θ ≠ 0) as useful, so LASSO can be used for feature selection;

  • When used for feature selection, LASSO may also set the coefficient θ of some genuinely useful features to 0, losing information; by comparison, Ridge is more accurate.

  • But if there are very many sample features, for example degree = 100 in polynomial regression, using LASSO can shrink the number of effective features.


Origin blog.csdn.net/qq_45694768/article/details/121040770