Linear regression: linear and nonlinear relationships, derivatives of common functions, loss function and optimization algorithms, the normal equation, and gradient descent for univariate and multivariate functions

1. Overview of Linear Regression

Linear regression: an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (feature values) and a dependent variable (target value)

Characteristics: the case with only one independent variable is called univariate regression; the case with more than one independent variable is called multiple regression

A relationship is established between the feature values and the target value, which can be understood as a linear model

2. Analysis of the relationship between the characteristics of linear regression and the target

There are two main kinds of relationships in linear regression: linear and nonlinear. Since at most a plane can be drawn, the examples below use a single feature or two features

Linear relationship

The univariate linear relationship and multivariate linear relationship are shown in the figure below

The relationship between a single feature and the target value is a line, and the relationship between two features and the target value is a plane

Nonlinear relationship

A nonlinear regression equation can be understood as y = w_{1}x_{1} + w_{2}x_{2}^{2} + w_{3}x_{3}^{2}, as shown below

3. Linear regression API

  • sklearn.linear_model.LinearRegression()
    • LinearRegression.coef_: regression coefficient

A simple usage example is as follows

from sklearn.linear_model import LinearRegression

x = [[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]   # [regular grade, exam grade]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]   # final grade
estimator = LinearRegression()   # instantiate the API
estimator.fit(x, y)   # train with the fit method
print('Coefficients:', estimator.coef_)   # weights of the regular grade and the exam grade, i.e. their proportions
print('Prediction:', estimator.predict([[90, 85]]))   # predict the final grade
------------------------------------------------------------------
Output:
Coefficients: [0.3 0.7]
Prediction: [86.5]

4. Derivatives of Common Functions
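For reference, the most commonly used derivatives (standard calculus results) are:

(C)^{\prime}=0, \quad \left(x^{n}\right)^{\prime}=n x^{n-1}, \quad \left(e^{x}\right)^{\prime}=e^{x}, \quad (\ln x)^{\prime}=\frac{1}{x}, \quad (\sin x)^{\prime}=\cos x, \quad (\cos x)^{\prime}=-\sin x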

Arithmetic rules for derivatives (sum, difference, product, quotient)
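For reference, the standard rules are:

(u \pm v)^{\prime}=u^{\prime} \pm v^{\prime}, \qquad (u v)^{\prime}=u^{\prime} v+u v^{\prime}, \qquad \left(\frac{u}{v}\right)^{\prime}=\frac{u^{\prime} v-u v^{\prime}}{v^{2}} \ (v \neq 0)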

5. Loss function and optimization algorithm

5.1 Loss function

The total loss is defined as

J(w)=\sum_{i=1}^{m}\left(h\left(x_{i}\right)-y_{i}\right)^{2}

 That is, the sum of the squares of the differences between each predicted value and the actual value, where

  • y_{i} is the true value of the i-th training sample
  • h(x_{i}) is the result of plugging the feature values of the i-th training sample into the prediction function, i.e. the predicted value
  • This loss function is also known as the least squares method
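A minimal NumPy sketch of this loss; the names X, y, w are illustrative, and a linear model without an intercept is assumed:

import numpy as np

def squared_error_loss(X, y, w):
    # h(x_i) for every sample under a linear model
    predictions = X @ w
    # sum of squared differences between predicted and true values
    return np.sum((predictions - y) ** 2)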

5.2 Optimization algorithm (optimization reduces total loss)

The purpose of optimization: find the value of w that corresponds to the minimum loss

There are two optimization algorithms commonly used in linear regression

1. Normal equation

Using the transpose and inverse of matrices, the solution is obtained in closed form:

w=\left(X^{T} X\right)^{-1} X^{T} y

where X is the feature value matrix and y is the target value matrix; substituting X and y gives the best result directly

Disadvantages: only suitable when the numbers of samples and features are small; when there are too many or too complex features, solving becomes too slow and may not produce a result

The derivation process of the normal equation

Convert this loss function into a matrix notation

\begin{aligned} J(\theta) & =\left(h_{w}\left(x_{1}\right)-y_{1}\right)^{2}+\left(h_{w}\left(x_{2}\right)-y_{2}\right)^{2}+\cdots+\left(h_{w}\left(x_{m}\right)-y_{m}\right)^{2} \\ & =\sum_{i=1}^{m}\left(h_{w}\left(x_{i}\right)-y_{i}\right)^{2} \\ & =(y-X w)^{2} \end{aligned}

where y is the true value matrix, X is the eigenvalue matrix, and w is the weight matrix

To solve for the w that minimizes the loss: since y and X are known, J is a quadratic function of w, so we differentiate it directly; the point where the derivative equals zero is the minimum

The derivation is as follows

\begin{aligned} \frac{\partial J(w)}{\partial w} &= 2 X^{T}(X w-y)=0 \\ X^{T} X w &= X^{T} y \\ \left(X^{T} X\right)^{-1}\left(X^{T} X\right) w &= \left(X^{T} X\right)^{-1} X^{T} y \\ w &= \left(X^{T} X\right)^{-1} X^{T} y \end{aligned}

In the derivation above, X is a matrix with m rows and n columns and is generally not guaranteed to have an inverse. Left-multiplying by X^{T} produces the square matrix X^{T}X, which can be inverted (when it is invertible), allowing w to be solved for
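A minimal NumPy sketch of the normal equation, reusing the small score dataset from the API example above; np.linalg.pinv is used instead of a plain inverse in case X^{T}X is singular, and the added column of ones is an assumption so that an intercept is fitted as sklearn does:

import numpy as np

X = np.array([[80, 86], [82, 80], [85, 78], [90, 90],
              [86, 82], [82, 90], [78, 80], [92, 94]], dtype=float)
y = np.array([84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4])

Xb = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend a column of ones for the intercept
w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y           # w = (X^T X)^{-1} X^T y
print('intercept and coefficients:', w)            # this data is fit almost exactly by [0, 0.3, 0.7]
print('prediction for [90, 85]:', np.array([1.0, 90, 85]) @ w)   # close to the 86.5 above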

2. Gradient descent

Take descending a mountain as an example: from the current position, find the steepest direction and walk a short distance downhill along it; then repeat the same procedure from the new position, and eventually you reach the valley.

Gradient descent: a differentiable function represents the mountain, and the minimum of the function is the bottom of the valley. The fastest way down is to find the steepest direction at the current position and step down along it. For a function, this means computing the gradient at the current point and moving in the direction opposite to the gradient, which makes the function value decrease fastest, because the gradient points in the direction in which the function changes fastest. Repeating this process, recomputing the gradient each time, eventually reaches a local minimum

  • In a univariate function, the gradient is simply the derivative of the function, which is the slope of the tangent line at a given point
  • In a multivariate function, the gradient is a vector; it has a direction, and that direction is the direction in which the function rises fastest at the given point
  • Since the gradient points in the direction of fastest increase at a given point, the opposite direction of the gradient is the direction of fastest decrease at that point

Univariate Function Gradient Descent

Suppose there is a univariate function: J(θ) = \theta^{2}

  • Differentiation of a function: J'(θ) = 2θ
  • Initialization, starting at: \theta ^{0} = 1
  • Learning rate: α = 0.4

The iterative calculation process of gradient descent is as follows

\begin{aligned} \theta^{0} & =1 \\ \theta^{1} & =\theta^{0}-\alpha * J^{\prime}\left(\theta^{0}\right) \\ & =1-0.4 * 2 \\ & =0.2 \\ \theta^{2} & =\theta^{1}-\alpha * J^{\prime}\left(\theta^{1}\right) \\ & =0.04 \\ \theta^{3} & =0.008 \\ \theta^{4} & =0.0016 \end{aligned}
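A minimal sketch of the same iteration in Python; the printed values match the hand computation above up to floating-point rounding:

theta = 1.0      # starting point theta^0
alpha = 0.4      # learning rate
for step in range(1, 5):
    grad = 2 * theta               # J'(theta) = 2 * theta
    theta = theta - alpha * grad   # gradient descent update
    print(step, theta)             # roughly 0.2, 0.04, 0.008, 0.0016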

Gradient Descent for Multivariate Functions

Suppose there is an objective function:\mathrm{J}(\theta)=\theta_{1}^{2}+\theta_{2}^{2}

We now use gradient descent to find the minimum of this function. By inspection, the minimum is at the point (0, 0). Starting gradient descent step by step, suppose the initial point is \theta^{0} = (1, 3)

The initial learning rate is: α = 0.1

The gradient of the function is\nabla J(\theta)=<2 \theta_{1}, 2 \theta_{2}>

Performing multiple iterations:

\begin{aligned}\Theta^{0}&=(1,3)\\\Theta^{1}&=\Theta^{0}-\alpha\nabla J(\Theta)\\&=(1,3)-0.1(2,6)\\&=(0.8,2.4)\\\Theta^{2}&=(0.8,2.4)-0.1(1.6,4.8)\\&=(0.64,1.92)\\\Theta^{3}&=(0.512,1.536)\\\Theta^{4}&=(0.4096,1.2288000000000001)\\\vdots&\\\Theta^{10}&=(0.10737418240000003,\ 0.32212254720000005)\\\vdots&\\\Theta^{50}&=\left(1.1417981541647683\mathrm{e}{-05},\ 3.425394462494306\mathrm{e}{-05}\right)\\\vdots&\\\Theta^{100}&=\left(1.6296287810675902\mathrm{e}{-10},\ 4.888886343202771\mathrm{e}{-10}\right)\end{aligned}
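The multivariate case written as a minimal NumPy sketch; whether a printout matches the table above digit for digit depends on how iterations are counted, but each component shrinks by the factor 1 - 2α = 0.8 per step and converges toward (0, 0):

import numpy as np

theta = np.array([1.0, 3.0])   # starting point Theta^0
alpha = 0.1                    # learning rate
for step in range(1, 11):
    grad = 2 * theta               # gradient of J = theta1^2 + theta2^2
    theta = theta - alpha * grad   # move against the gradient
    print(step, theta)             # (0.8, 2.4), (0.64, 1.92), ... toward (0, 0)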

Both the univariate and multivariate cases follow the same update rule

Gradient Descent formula

\theta_{i}=\theta_{i}-\alpha \frac{\partial}{\partial \theta_{i}} J(\theta)

α is called the learning rate or step size in gradient descent. It controls how far each step moves, ensuring the steps are not so large that they overshoot the lowest point, while also not being so small that progress is too slow

The minus sign before the gradient means moving in the opposite direction of the gradient. Since the gradient points in the direction in which the function rises fastest at the current point, the direction of fastest descent is the negative gradient direction, hence the minus sign

| Gradient descent | Normal equation |
| --- | --- |
| Requires choosing a learning rate | Not required |
| Requires iterative solving | Solved in a single computation |
| Usable even with a large number of features | Needs to solve the equation; high time complexity, O(n^{3}) |

With an optimization algorithm such as gradient descent, regression gains the ability to "learn automatically"

(Animation: the optimization process of gradient descent)

Which to choose?

  • Small-scale data: LinearRegression (does not address overfitting) or ridge regression
  • Large-scale data: SGDRegressor (see the sketch below)
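A hedged sketch of the large-scale option, reused on the small score dataset from above purely for illustration; SGDRegressor is scale-sensitive, so it is wrapped with StandardScaler, and its coefficients will only approximate the [0.3, 0.7] found earlier:

from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x = [[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

# standardize features first: stochastic gradient descent diverges easily on raw score values
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
model.fit(x, y)
print('prediction for [90, 85]:', model.predict([[90, 85]]))   # expected to land near the 86.5 above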

Learning navigation: http://xqnav.top/


Origin blog.csdn.net/qq_43874317/article/details/128232829