Introductory Research on Machine Learning (12) - Linear Regression

Table of Contents

Linear regression

Definition

Generalized linear model

Linear relationship of linear model

Non-linear relationship of linear model

Solving the linear regression model

Loss function

Normal equation

Aside: some mathematical background

Solving process

Gradient descent

Aside: some mathematical background

Solving process

Distinguish the two concepts

Gradient descent classification

Corresponding API in sklearn

Linear regression example

1. Normal equation

2. Gradient descent

Regression performance evaluation

API in sklearn

The calculation corresponding to the above example

Summary



Regression: If the target value is continuous data, it is a regression problem. The usual application scenarios are as follows:

House price forecast

Sales forecast

Loan limit forecast

Linear regression

Definition

Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (feature values) and a dependent variable (target value).

In short, the goal is to find a specific functional relationship between the feature values and the target value; that function is the linear model. In standard notation its general form is

h(w) = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b = w^T x + b

where w = (w_1, w_2, ..., w_n) are the weights (regression coefficients), x = (x_1, x_2, ..., x_n) are the feature values, and b is the bias.

If there is only one feature, it is called univariate regression; if there are multiple feature values, it is called multivariate regression. For example: Suppose the influencing factors of house price can be expressed by the following functional relationship:

House price = 0.02 × distance to the city center + 0.04 × nitric oxide concentration in the city + (−0.12) × average price of owner-occupied homes + 0.254 × city crime rate

Then we can call this functional relationship a linear model.
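As a quick illustration, the prediction of such a model is simply the weighted sum of the features. The feature values below are made up purely for the sake of the example:

# Coefficients from the (illustrative) house-price model above
weights = {"distance_to_center": 0.02, "nitric_oxide": 0.04,
           "avg_owner_home_price": -0.12, "crime_rate": 0.254}
# Made-up feature values for one house
features = {"distance_to_center": 5.0, "nitric_oxide": 0.5,
            "avg_owner_home_price": 30.0, "crime_rate": 0.1}

# A linear model's prediction is the weighted sum of the feature values
predicted_price = sum(weights[k] * features[k] for k in weights)
print(predicted_price)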

Generalized linear model

There are also generalized linear models: even though the relationship they describe is non-linear, they are still called linear models.

So there are two main cases in linear regression: one is a linear relationship, the other is a non-linear relationship.

Linear relationship of linear model

For example, when we plot house area against house price, the data can be fitted with a straight line through a linear model. That line is a linear model, and the relationship is a linear relationship.

When there are two feature values, the relationship between the feature values and the target value can be fitted with a plane; the linear model in this case still describes a linear relationship.

 

Non-linear relationship of linear model

In other cases, the feature value x and the target value y cannot be fitted with a simple straight line but with a curve. Then x and y are clearly no longer in a linear relationship, yet the model can still be a linear model.

A model is a linear model as long as it satisfies either of the following conditions:

(1) The independent variables x1, x2, ... appear only to the first power, for example y = w1*x1 + w2*x2 + ... + b.

In this case the feature values and the target value are in a linear relationship.

(2) The parameters w1, w2, ... appear only to the first power, for example y = w1*x1 + w2*x1^2 + w3*x1^3 + ... + b.

Here the feature values and the target value are not in a linear relationship, but since the parameters are still first-order, the model is still a linear model.

In summary, a linear relationship requires the independent variables to be first-order; a linear relationship always corresponds to a linear model, but a linear model does not necessarily describe a linear relationship.
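A minimal sketch of case (2), assuming a quadratic relationship between x and y: the relationship is non-linear, but because the model stays linear in its parameters it can still be fit with ordinary LinearRegression after expanding the features (the data below is generated purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# y depends on x non-linearly, but the model stays linear in the parameters w
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + 1.0 * x.ravel() + 2.0

x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # columns [x, x^2]
model = LinearRegression().fit(x_poly, y)
print(model.coef_, model.intercept_)  # approximately [1.0, 0.5] and 2.0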

Solving the linear regression model

In linear regression, it is assumed that there is a linear relationship between the feature values and the target value. So how is this linear model obtained?

Goal: once the parameters of the linear model are found, the model is solved and can be used for prediction. Our goal is therefore to solve for the weights and the bias, also called the regression coefficients or regression parameters.

Solution ideas:

Assume that the price of a house and the factors influencing it really do have the following relationship:

True house price = 0.02 × distance to the city center + 0.04 × nitric oxide concentration in the city + (−0.12) × average price of owner-occupied homes

We can start by assigning arbitrary parameters to all the factors, for example:

Predicted house price = 0.25 × distance to the city center + 0.14 × nitric oxide concentration in the city + 0.42 × average price of owner-occupied homes.

There is clearly an error between these assumed parameters and the true ones. If we have a method that keeps shrinking this error until it is close to 0, then the parameters of the model at that point are the parameters we are looking for.

The function that measures the error between the true values and the predicted values is called the loss function (also cost function or objective function). Here the loss function is the sum of the squared differences between all true values and predicted values. Solving the linear model therefore amounts to finding the minimum of this loss function, i.e. the point at which the error is smallest.

Loss function

In standard notation the loss can be written as J(w, b) = (h(x_1) - y_1)^2 + (h(x_2) - y_2)^2 + ... + (h(x_m) - y_m)^2, where y_1, y_2, ..., y_m are the true values of the samples and h(x_i) is the value predicted by the model.

With the loss function in hand, solving the linear model means reducing the loss and finding its minimum; the w (and b) at which the loss reaches its minimum are the regression coefficients of the linear model we are looking for.
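A minimal numpy sketch of this sum-of-squares loss for a single-feature model; the sample arrays and the guessed parameters are illustrative:

import numpy as np

def squared_loss(w, b, x, y):
    """Sum of squared errors between the predictions h(x) = w*x + b and the true values y."""
    predictions = w * x + b
    return np.sum((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])
print(squared_loss(2.0, 1.0, x, y))  # loss for the guess w=2.0, b=1.0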

There are two optimization methods when solving the minimum value of this loss function:

  • (1) Normal equation
  • (2) Gradient descent

The following two methods are introduced separately:

Normal equation

PS several mathematical knowledge:

(1) Minimum

To find the minimum of a function, we can first take its derivative; setting the derivative to 0 gives the extremum of the function. For example, take the (arbitrary, purely illustrative) equation y = (x - 2)^2 + 1 and solve for its minimum:

  • Take the derivative of the equation: y' = 2(x - 2).

  • Set the derivative to 0: this gives x = 2; substituting back into the original equation gives the minimum value y = 1.

(2) Matrix inversion

If two matrices satisfy A × B = E, where E is the identity matrix, then B is the inverse of A, written B = A^(-1) (and likewise A is the inverse of B).
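For example, with numpy (the matrix A is arbitrary, chosen only to be invertible):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.linalg.inv(A)  # B is the inverse of A
print(A @ B)          # approximately the identity matrix E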

Solving process

With the above mathematical background, let's look at the so-called normal equation, also known as the least squares method: the minimum of the loss function is found by solving an equation, i.e. by taking the derivative of the loss function and setting it to 0. The loss function above can be written in matrix form as

J(w) = (Xw - y)^T (Xw - y)

Taking the derivative of this expression with respect to w and setting the derivative to 0:

2 X^T (Xw - y) = 0

Solving this for w gives X^T X w = X^T y, and finally:

w = (X^T X)^(-1) X^T y

where X is the matrix of feature values (usually augmented with a column of ones so that the bias is included in w) and y is the vector of target values. Substituting the feature values and target values of the samples into this formula gives w directly.
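A minimal numpy sketch of this closed-form solution; a column of ones is appended to X so the bias b is estimated together with the weights (the data is randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w, true_b = np.array([0.02, 0.04, -0.12]), 1.5
y = X @ true_w + true_b + rng.normal(scale=0.01, size=100)

X_b = np.hstack([X, np.ones((100, 1))])        # append a bias column of ones
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y     # w = (X^T X)^-1 X^T y
print(w)                                       # the last entry is the bias b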

Advantage: the best result is obtained directly by solving the equation.

Disadvantage: when there are too many features, solving becomes very slow (inverting X^T X is expensive) and the final result may not be obtainable.

Usage scenario: When the amount of data is relatively small

Gradient descent

PS little knowledge of mathematics

Gradient: a vector along which the directional derivative of a function at a given point is largest; in other words, at that point the function changes fastest along the direction of the gradient, and the rate of change is given by the magnitude (modulus) of the gradient.

In a univariate function, the gradient is the derivative of the function, representing the slope of the tangent of the function at a given point;

In a multivariate function, the gradient is a vector, and a vector has a direction: the direction of the gradient is the direction in which the function rises fastest at a given point. Gradient descent therefore moves in the opposite direction of the gradient, the direction in which the function drops fastest, toward a local minimum.

Solving process

Gradient descent is the process of following the partial derivatives (the gradient) down to the lowest point. Put simply: start with a randomly chosen pair w and b, then keep trying and improving until the minimum of J(θ) is obtained; in other words, the solution slides down a little at a time along the tangent direction.

Since the initial point is chosen at random, if the function has a wavy shape the lowest point reached may well not be the minimum of all the low points (the global minimum); this can be mitigated by restarting from several random initial points.

The loss function mentioned above can be pictured as a parabola: gradient descent walks step by step down its slope toward the minimum.

There are various versions of the gradient descent formula; the following form makes the concept easiest to grasp:

(1) The update formulas for w and b. Initially, give w and b random (non-zero) values and substitute them into:

w := w - α · ∂J(w, b)/∂w
b := b - α · ∂J(w, b)/∂b

Here α is the learning rate, i.e. the step length of each descent along the slope. It has to be set manually when calling the API (it is a hyperparameter). The value should be neither too large nor too small; values such as 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001... are worth trying: if it seems too large, reduce it; if too small, increase it.

The partial derivatives ∂J/∂w and ∂J/∂b give the slope, so moving against them is the direction in which the loss descends.

(2) The updated w and b are substituted into the update formulas again. This repeats until a set number of iterations is reached, or until the difference between the previous values of w and b and the current result is smaller than some threshold; the result is then taken as the optimal solution.

Advantages: In the case of a relatively large data set, the computational complexity can be reduced

Disadvantages: if it is a non-convex function, it is likely to fall into a local minimum, and the obtained solution is not the global optimal solution

If the step size is too small, the function converges slowly; if it is too large, the optimal solution is easily overshot and missed.

Applicable scenarios: When the training data set is very large, better results can be found.
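A minimal sketch of gradient descent for a single-feature linear model, using the squared-error loss described above (the data and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=200)  # roughly y = 3x + 5 plus noise

w, b = 0.0, 0.0    # initial guess
alpha = 0.01       # learning rate (step size)
for _ in range(2000):
    error = w * x + b - y            # h(x) - y for every sample
    grad_w = 2 * np.mean(error * x)  # partial derivative of the mean loss w.r.t. w
    grad_b = 2 * np.mean(error)      # partial derivative w.r.t. b
    w -= alpha * grad_w              # step against the gradient
    b -= alpha * grad_b

print(w, b)  # should end up close to 3.0 and 5.0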

Distinguish the two concepts

Convex function: Pick a point at random, gradient descent can go to the lowest point

Non-convex function: there are many local minima, gradient descent may fall into the local minima

Gradient descent classification

  • Batch gradient descent (BGD)

Batch Gradient Descent

The gradient is computed over the entire data set: all samples are averaged to obtain each descent step. This improves accuracy, but computing the gradient over all the data for every update makes the computation more expensive.

Advantages: global optimal solution, conducive to parallelism

Disadvantages: When the amount of sample data is too large, the calculation speed is slow

  • Small batch gradient descent (MBGD)

mini-batch Gradient Descent

Divide the data into several batches and update the parameters in batches. A set of data jointly determines the step size of each gradient descent.

Advantages: reduce computational overhead and reduce randomness

  • Stochastic gradient descent (SGD)

Each update randomly selects a single sample to compute the gradient, and that value alone determines the step of this update.

Advantages: fast calculation speed

Disadvantages: poor convergence behavior; needs many hyperparameters, such as the regularization parameter and the number of iterations; and it is sensitive to feature standardization (scaling).

  • GD

Gradient Descent (GD): the original gradient descent algorithm; all sample values must be used to compute each gradient.

  • SAG

Stochastic Average Gradient (SAG): because plain stochastic gradient descent converges slowly, SAG-type improvements were proposed; ridge regression and logistic regression in sklearn provide SAG-based solvers.

  • the difference

Batch gradient descent (BGD): uses all the data to compute the gradient, so the descent curve is smooth throughout.

Stochastic gradient descent (SGD): randomly selects a single sample per update, so the step sizes are unstable.

Stochastic gradient descent shows a larger error than batch gradient descent, but the error becomes smaller and smaller as the number of iterations increases. Stochastic gradient descent can be regarded as a special case of mini-batch gradient descent (with batch size 1).
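A minimal sketch of how the three variants differ only in which samples feed each gradient step, reusing the single-feature setup from the sketch above (the batch size and data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=200)

def gradient(w, b, xs, ys):
    """Gradient of the mean squared error over the chosen samples."""
    error = w * xs + b - ys
    return 2 * np.mean(error * xs), 2 * np.mean(error)

w, b, alpha = 0.0, 0.0, 0.01

# Batch GD: every sample contributes to the step
gw, gb = gradient(w, b, x, y)

# Mini-batch GD: a random batch (here 32 samples) per step
idx = rng.choice(len(x), size=32, replace=False)
gw, gb = gradient(w, b, x[idx], y[idx])

# Stochastic GD: a single randomly chosen sample per step
i = rng.integers(len(x))
gw, gb = gradient(w, b, x[i:i + 1], y[i:i + 1])

w, b = w - alpha * gw, b - alpha * gb  # one update with whichever gradient was used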

Corresponding API in sklearn

We have already mentioned that the minimum value of the loss function can be solved through normal equations and gradient descent, and the corresponding API is also provided in sklearn.

  • (1) Normal equation
 sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True,
                 n_jobs=None)

The parameters are as follows:

parameter: meaning
fit_intercept: whether to calculate the bias (intercept); default True
normalize: whether to normalize the features; default False
copy_X: whether to copy X; default True (X is copied); if False, X may be overwritten
n_jobs: number of parallel jobs; default None (equivalent to 1)

Among the returned parameters:

parameter meaning
coef_ Regression coefficient, which is w in the linear model
intercept_ Bias, which is b in the linear model
  • (2) Stochastic gradient descent
 sklearn.linear_model.SGDRegressor(loss="squared_loss", penalty="l2", alpha=0.0001,
                 l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=1e-3,
                 shuffle=True, verbose=0, epsilon=DEFAULT_EPSILON,
                 random_state=None, learning_rate="invscaling", eta0=0.01,
                 power_t=0.25, early_stopping=False, validation_fraction=0.1,
                 n_iter_no_change=5, warm_start=False, average=False)

Support different loss functions and regularization penalty terms to fit linear regression models. 

The parameters are as follows:

parameter: meaning
loss: type of loss. Default "squared_loss" (ordinary least squares); 'huber' (a modified squared loss that is less sensitive to outliers); 'epsilon_insensitive' (ignores errors smaller than epsilon); 'squared_epsilon_insensitive'
penalty: regularization penalty term. Default 'l2'; other options are 'none', 'l1' and 'elasticnet'
alpha: controls the strength of the regularization; default 0.0001
l1_ratio: the mixing ratio between the l1 (lasso-style) and l2 (ridge-style) penalties, used when penalty='elasticnet'
fit_intercept: whether to calculate the bias; default True
max_iter: maximum number of iterations (passes over the data)
tol: threshold used to decide whether the iterations have converged
shuffle: whether to shuffle the training data after each epoch; default True
verbose: verbosity level
epsilon: the epsilon used by the 'huber' and epsilon-insensitive losses
random_state: random seed
learning_rate: the learning-rate schedule; default 'invscaling', i.e. eta = eta0 / pow(t, power_t)
eta0: initial learning rate; default 0.01
power_t: exponent of the inverse-scaling schedule; default 0.25
early_stopping: whether to stop training early based on a held-out validation score
validation_fraction: fraction of the training data set aside for validation when early_stopping is True
n_iter_no_change: number of iterations with no improvement to wait before stopping early
warm_start: whether to reuse the previous solution as the initialization when fit is called again
average: whether to average the SGD weights across updates

Among the returned attributes:

parameter: meaning
coef_: regression coefficients, i.e. w in the linear model
intercept_: bias, i.e. b in the linear model
n_iter_: the actual number of iterations performed

Linear regression example

We use the Boston house-price dataset that ships with sklearn to see how the two APIs above are used.

The Boston house-price dataset contains the following:

There are 13 feature values and 506 samples. The features are, in order: per-capita crime rate by town; proportion of residential land zoned for lots over 25,000 square feet; proportion of non-retail business acres per town; Charles River dummy variable (1 if the tract bounds the river, 0 otherwise); nitric oxide concentration (parts per 10 million); average number of rooms per dwelling; proportion of owner-occupied units built before 1940; weighted distance to five Boston employment centers; index of accessibility to radial highways; full-value property tax rate per $10,000; pupil-teacher ratio by town; 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town; and the percentage of the population of lower socioeconomic status.

 

1. Normal equation

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

def linear():
    # 1) Load the dataset
    boston = load_boston()
    # 2) Split the dataset into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3) Standardize the features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4) Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5) Predict
    y_predict = estimator.predict(x_test)
    print("Linear model weights w:", estimator.coef_)
    print("Linear model bias b:", estimator.intercept_)

    return

The output is as follows:

Linear model weights w: [-0.64817766  1.14673408 -0.05949444  0.74216553 -1.95515269  2.70902585
 -0.07737374 -3.29889391  2.50267196 -1.85679269 -1.75044624  0.87341624
 -3.91336869]
Linear model bias b: 22.62137203166228

2. Gradient descent

def sgd():
    # 1) Load the dataset
    boston = load_boston()
    # 2) Split the dataset into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3) Standardize the features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4) Estimator
    estimator = SGDRegressor(fit_intercept=True, eta0=0.001)
    estimator.fit(x_train, y_train)
    # 5) Predict
    y_predict = estimator.predict(x_test)
    print("Linear model weights w:", estimator.coef_)
    print("Linear model bias b:", estimator.intercept_)

    return

The output is as follows:

Linear model weights w: [-0.40728413  0.70209168 -0.5460648   0.81482482 -1.29062491  3.01039296
 -0.22184241 -2.66819966  1.19623538 -0.56932281 -1.64938944  0.90486251
 -3.79375431]
Linear model bias b: [22.60833855]

Regression performance evaluation

The mean squared error (MSE) is used to evaluate the performance of linear regression. Its formula is

MSE = (1/m) · ((ŷ_1 - y_1)^2 + (ŷ_2 - y_2)^2 + ... + (ŷ_m - y_m)^2)

where ŷ_i is the predicted value and y_i is the true value. From the definition, the difference from the loss function above is that the mean squared error divides by the number of samples m.

The smaller the mean squared error, the better the model.

API in sklearn

 sklearn.metrics.mean_squared_error(y_true, y_pred,
                       sample_weight=None,
                       multioutput='uniform_average')

The parameters are as follows:

parameter: meaning
y_true: true values
y_pred: predicted values
sample_weight: sample weights
multioutput: 'uniform_average' computes the mean squared error averaged over all outputs and returns a single number; 'raw_values' returns the mean squared error of each output column as an array with the same number of columns
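A short usage sketch of the two multioutput options (the arrays are made up):

from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([[3.0, 100.0], [5.0, 120.0]])
y_pred = np.array([[2.5, 110.0], [5.5, 118.0]])

print(mean_squared_error(y_true, y_pred))                            # a single averaged number
print(mean_squared_error(y_true, y_pred, multioutput='raw_values'))  # one MSE per output column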

The calculation corresponding to the above example

    # 6) Model evaluation
    error = mean_squared_error(y_test, y_predict)

Comparing the mean squared errors of the two runs:

Normal equation error: 20.627513763095408
Gradient descent error: 21.670171812236077

With gradient descent using the default values, the normal equation clearly has the smaller mean squared error, but we can reduce the gradient descent error by tuning the learning rate eta0, the number of iterations max_iter, and the learning-rate schedule learning_rate.
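For example, one might try something along the following lines, reusing the train/test split from the sgd() example above (the specific values are illustrative, not tuned results):

# Illustrative hyperparameter choices; the values are not tuned results
estimator = SGDRegressor(learning_rate="constant", eta0=0.001, max_iter=10000)
estimator.fit(x_train, y_train)
error = mean_squared_error(y_test, estimator.predict(x_test))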

Summary

Gradient descent | Normal equation
Requires choosing a learning rate | No learning rate needed; solves directly
Requires an iterative solution | The result is obtained in a single computation
Usable when the number of features is large; suited to large-scale data | Needs to solve the matrix equation, so it is suited to small-scale data (ridge regression likewise targets small-scale data)
  | Cannot address the fitting problem, so it is used less often

 

 

Origin blog.csdn.net/nihaomabmt/article/details/103476956