Principles of deep learning ----- linear regression + gradient descent

Series Article Directory

Principles of deep learning ----- linear regression + gradient descent
Principles of deep learning ----- the logistic regression algorithm
Principles of deep learning ----- fully connected neural networks
Principles of deep learning ----- convolutional neural networks
Principles of deep learning ----- recurrent neural networks (RNN, LSTM)
Time series forecasting ----- single-feature electricity load forecasting based on BP, LSTM, and CNN-LSTM neural networks
Time series forecasting (multi-feature) ----- multi-feature electricity load forecasting based on BP, LSTM, and CNN-LSTM neural networks


Series of teaching videos

Quick introduction to deep learning with hands-on practice
[Hands-on tutorial] Single-feature electricity load forecasting based on a BP neural network
[Hands-on tutorial] Single-feature electricity load forecasting based on RNN and LSTM neural networks
[Hands-on tutorial] Single-feature electricity load forecasting based on a CNN-LSTM neural network
[Multi-feature forecasting] Multi-feature electricity load forecasting based on a BP neural network
[Multi-feature forecasting] Multi-feature electricity load forecasting based on RNN and LSTM
[Multi-feature forecasting] Multi-feature electricity load forecasting based on a CNN-LSTM network



Foreword

  Linear regression is a must-learn algorithm for getting started with machine learning and deep learning. Although its principle is simple, it contains several important basic ideas of machine learning, and many more powerful nonlinear models can be obtained from linear models by introducing hierarchical structures or high-dimensional mappings. At the same time, the core idea of machine learning and deep learning is optimization: constantly searching for the most suitable parameters. In particular, understanding how to solve for parameters with the gradient descent method is of great help when learning neural networks later.


1. Linear regression model

  Suppose we are given a dataset $D=\left\{\left(\boldsymbol{x}_1, y_1\right),\left(\boldsymbol{x}_2, y_2\right), \ldots,\left(\boldsymbol{x}_m, y_m\right)\right\}$, where $\boldsymbol{x}_i=\left(x_{i1}; x_{i2}; \ldots; x_{id}\right)$ and $y_i \in \mathbb{R}$. Linear regression tries to learn a linear model that predicts the real-valued output as accurately as possible.
  In layman's terms, linear regression seeks the linear relationship between the attributes and the result. The linear regression model can be written as
$$f(x)=w_1 x_1+w_2 x_2+\cdots+w_n x_n+b$$
or, in vector form,
$$f(\boldsymbol{x})=\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b$$
From the formulas above, fitting a linear regression model means finding an optimal set of $w_i$ and $b$ that determine a linear model approximating, as closely as possible, the relationship between the existing data $\boldsymbol{x}_i$ and the outputs $y_i$.
  The theory above may be hard for beginners to grasp at first, so let us consider the simplest case, in which the dataset has only one feature attribute. The model then becomes
$$f(x)=w x+b$$
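
To make the formula concrete, here is a minimal sketch in Python (the weights, bias, and input below are made up for illustration) showing that $f(\boldsymbol{x})=\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}+b$ is just a dot product plus a bias; the single-feature case $f(x)=wx+b$ is the same formula with a single weight:

```python
import numpy as np

# Made-up weights and bias for a model with 3 input features
w = np.array([0.5, -1.2, 3.0])
b = 0.7

def predict(x, w, b):
    """General linear model: f(x) = w^T x + b."""
    return np.dot(w, x) + b

x = np.array([2.0, 0.5, 1.0])
print(predict(x, w, b))  # 0.5*2 - 1.2*0.5 + 3.0*1 + 0.7 = 4.1 (up to rounding)
```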


1.2. Case 1

  Let's use a simple example to understand the regression model.
  Suppose we have the following data: Xiao Ming's daily study time and his final exam scores, shown in the table below. We would like to know how many points Xiao Ming will get on the final exam if he studies for 4 hours a day.

| Daily study time (hours) | Test score |
| --- | --- |
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | ? |

  This is clearly a regression task: we want to predict a specific value. If we plot daily study time against test score, a pattern emerges from the figure: as the study time increases, the final exam score gets higher.
[Figure: daily study time plotted against test score]
  We address this problem with the single-feature linear model described above. The data has only one input, Xiao Ming's study time, and the output is the exam score, so the model is
$$f(x)=w x+b$$
To simplify the model for later understanding and calculation, we keep only a single parameter $w$ to express the relationship between input and output (this is not rigorous, but it makes the subsequent calculations easier), so the model simplifies to
$$f(x)=w x$$
  Our goal is now to find an optimal $w$ that expresses the relationship between input and output. But what kind of $w$ is best? Since the model is supposed to approximate the relationship between study time and test score as closely as possible, the best $w$ is the one that makes the difference between the predicted test score and the real test score (also called the true value or label) as small as possible, ideally 0. If the difference is 0, then this $w$ describes the relationship between input and output exactly (obviously this is unrealistic, because data gathered in real life contains noise, so the relationship between the data always carries some error).
  We use a formula to measure the error between the output and the true value:
$$\text{loss}=(f(x)-y)^2=(w x-y)^2$$
Of course, we have many data points, so we sum the errors between the true values and the outputs over all the data and take the average. This function is the mean squared error, and it is also the loss function of the linear regression model:
$$J(w)=\frac{1}{2 m} \sum_{i=1}^m\left(f\left(x_i\right)-y_i\right)^2$$
We now have an index to evaluate whether a given $w$ is optimal: the smaller the mean squared error, the better the $w$, and the $w$ with the smallest mean squared error (not necessarily 0, since real data contains noise) is the optimal one.
  We now need a way to find a reasonable $w$. Let us start with the most naive method, exhaustive search: we try different values of $w$ and compute the mean squared error for each. It turns out that when $w = 2$ the mean squared error is 0, which is exactly the optimal value we are looking for. The calculation is shown in the figure below:
[Figure: mean squared error computed for different candidate values of w]
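
To make the exhaustive search concrete, here is a small sketch (assuming the toy data from the table above) that scans a grid of candidate $w$ values and computes the loss for each; the loss reaches 0 at $w = 2$:

```python
import numpy as np

# Toy data from the table above: daily study time -> exam score
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def loss(w):
    """J(w) = 1/(2m) * sum((w*x_i - y_i)^2); the 1/2 factor does not change the minimizer."""
    return 0.5 * np.mean((w * x - y) ** 2)

# Exhaustive (brute-force) search over a grid of candidate w values
candidates = np.linspace(0.0, 4.0, 41)  # 0.0, 0.1, ..., 4.0
losses = [loss(w) for w in candidates]
best = candidates[int(np.argmin(losses))]
print(best, loss(best))  # best w is 2.0, where the loss is 0
```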
  Exhaustive search can clearly find the optimal $w$ here, but the problems we face in reality are far more complicated and often have multiple input features, which means multiple $w$ values are needed to express the relationship between input and output. Each additional $w$ adds a dimension, and the difficulty of exhaustive search grows with every added dimension. (This is also why $b$ was dropped from the calculation: here $b$ turns out to be 0 in the end, but of course I only knew that because I knew the answer in advance; try not to discard the parameter $b$ in real tasks.) So exhaustive search can find the optimal $w$, but it is unrealistic for multi-feature inputs. Is there another way to find the optimal $w$? Obviously there is, and one such method is the gradient descent method.


2. Least squares method

  Before discussing the gradient descent method, let us think about a problem. The loss function is a quadratic function, and its graph is shown in the figure below.
[Figure: graph of the quadratic loss function]
  This quadratic function obviously has a minimum point, which is also an extreme point of the function. So can we differentiate the function, set the derivative to 0, and solve for the extreme point? The point where the derivative is 0 would then be the optimal $w$ we have been looking for. Let us try to differentiate the loss function.
  The loss function is
$$J(w)=\frac{1}{2 m} \sum_{i=1}^m\left(f\left(x_i\right)-y_i\right)^2$$
Taking its derivative with respect to $w$:
$$\begin{aligned} \frac{\partial J(w)}{\partial w}&=\frac{\partial \frac{1}{2 m} \sum_{i=1}^m\left(w x_i-y_i\right)^2}{\partial w} \\ &=\frac{1}{2 m} \sum_{i=1}^m \frac{\partial\left(w x_i-y_i\right)^2}{\partial w} \\ &=\frac{1}{2 m} \sum_{i=1}^m 2\left(w x_i-y_i\right) \frac{\partial\left(w x_i-y_i\right)}{\partial w} \\ &=\frac{1}{m} \sum_{i=1}^m\left(w x_i-y_i\right) x_i \end{aligned}$$
Setting the derivative of the loss function to 0 and solving gives
$$w=\frac{\sum_{i=1}^m x_i y_i}{\sum_{i=1}^m x_i^2}$$
This is the expression for the $w$ that minimizes the loss function, and plugging in the existing data gives $w = 2$. Clearly this method is much better than exhaustive search: there is no need to guess blindly within a range, which would leave the final result uncertain. If that is the case, why do we still need the gradient descent method to find the optimal $w$? Isn't the least squares method already the best? The answer is obviously no.
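
As a quick numerical check of the closed-form expression above, here is a short sketch on the same toy data (an illustration, not part of the original derivation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # daily study time
y = np.array([2.0, 4.0, 6.0])  # exam score

# Setting dJ/dw = (1/m) * sum((w*x_i - y_i) * x_i) to zero gives
# w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(w)  # 2.0
```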
  We already wrote the vector form of the linear regression model:
$$f(\boldsymbol{x})=\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}+b$$
To simplify the derivation, we absorb the parameter $b$ into the vector $\boldsymbol{w}$. The data feature matrix then gains a column of ones:
$$\mathbf{X}=\left(\begin{array}{ccccc} x_{11} & x_{12} & \ldots & x_{1 d} & 1 \\ x_{21} & x_{22} & \ldots & x_{2 d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m 1} & x_{m 2} & \ldots & x_{m d} & 1 \end{array}\right)$$
and the parameter vector becomes
$$\boldsymbol{w}=\left(\begin{array}{c} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_d \\ b \end{array}\right)$$
so the model can be written as
$$f(\mathbf{X})=\mathbf{X} \boldsymbol{w}$$
Here $\mathbf{X}$ is an $m \times (d+1)$ matrix and $\boldsymbol{w}$ is a column vector, and the least squares method finds the optimal $\boldsymbol{w}$ as follows:
$$\begin{aligned} J(w) &=\frac{1}{2}(f(\mathbf{X})-Y)^{\top}(f(\mathbf{X})-Y) \\ &=\frac{1}{2}(X w-Y)^{\top}(X w-Y) \\ &=\frac{1}{2}\left(w^{\top} X^{\top}-Y^{\top}\right)(X w-Y) \\ &=\frac{1}{2}\left(w^{\top} X^{\top} X w-Y^{\top} X w-w^{\top} X^{\top} Y+Y^{\top} Y\right) \end{aligned}$$
To differentiate $J(w)$ we need the following matrix-calculus identities:
$$\frac{\partial A B}{\partial B}=A^{\top}, \quad \frac{\partial A^{\top} B}{\partial A}=B, \quad \frac{\partial X^{\top} A X}{\partial X}=2 A X \ (\text{for symmetric } A)$$
Applying these rules:
$$\begin{aligned} \frac{\partial J(w)}{\partial w} &=\frac{1}{2}\left(\frac{\partial w^{\top} X^{\top} X w}{\partial w}-\frac{\partial Y^{\top} X w}{\partial w}-\frac{\partial w^{\top} X^{\top} Y}{\partial w}\right) \\ &=\frac{1}{2}\left[2 X^{\top} X w-\left(Y^{\top} X\right)^{\top}-X^{\top} Y\right] \\ &=\frac{1}{2}\left[2 X^{\top} X w-2 X^{\top} Y\right] \\ &=X^{\top} X w-X^{\top} Y \end{aligned}$$
Setting $\frac{\partial J(w)}{\partial w}=0$ and solving:
$$\begin{aligned} X^{\top} X w-X^{\top} Y &=0 \\ X^{\top} X w &=X^{\top} Y \\ w &=\left(X^{\top} X\right)^{-1} X^{\top} Y \end{aligned}$$
So the optimal $\boldsymbol{w}$ can also be obtained by the least squares method. However, in real tasks $X^{\top} X$ is often not a full-rank matrix, which means $X^{\top} X$ is not invertible. In that case the least squares method cannot give us the optimal parameters, so it cannot be applied to every model.
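
Here is a sketch of the normal-equation solution $w=(X^{\top}X)^{-1}X^{\top}Y$ with NumPy on made-up multi-feature data (the feature values, true weights, and noise level below are invented for illustration). In practice `np.linalg.solve` or `np.linalg.lstsq` is preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: m = 100 samples, d = 3 features
m, d = 100, 3
X_raw = rng.normal(size=(m, d))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 3.0
Y = X_raw @ true_w + true_b + 0.01 * rng.normal(size=m)  # small noise

# Absorb the bias b into w by appending a column of ones to X
X = np.hstack([X_raw, np.ones((m, 1))])

# Normal equation: solve (X^T X) w = X^T Y without forming an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(w, 2))  # approximately [ 1.5  -2.   0.5  3. ]
```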
  In fact, whether in machine learning or deep learning, we want the model to keep learning useful things from the data samples rather than solving for the answer in a single step; a one-shot solution is not in the spirit of how these models are trained. So let us look at the method most widely used in deep learning for solving neural network parameters: the gradient descent method.


3. Gradient descent method

  We explained the least squares method above and found that it cannot find the optimal $w$ in every scenario. We therefore introduce a new method, the gradient descent method, to solve for the optimal $w$.
  Imagine the following scene. On a dark night a person is walking down a mountain, but he cannot see the surrounding terrain at all and can only feel it with his hands. So he comes up with a method: he feels the slope of the ground around him, and if the slope in some direction goes downward and is the steepest, he steps to the spot his hand just touched. Repeating this step by step, he finally reaches the bottom of the mountain. You can picture it as in the figure below, where the black dot is the person.
[Figure: a point descending a hill step by step along the steepest downward direction]
  The scenario above vividly describes how the gradient descent method searches for the optimal parameters.
  First of all, suppose the loss function is differentiable; our goal is to find the parameter at which this differentiable function attains its minimum value.
In the scene just described, the key step is finding the direction in which the slope goes downhill most steeply. For a differentiable function, the derivative gives the gradient of the function. The gradient is a vector that points in the direction in which the function increases fastest; the opposite direction of the gradient is therefore the direction in which the function decreases fastest.
Another important detail in the scene is that each step the person takes down the mountain is an ordinary walking step. Imagine instead that one step were enormous, large enough to jump from one mountain to the next: the person would never reach the bottom and would keep bouncing back and forth between the two hills.
  Therefore, when using the gradient descent method to solve for parameters, the update step must not be too large, otherwise the parameter value that minimizes the loss function may be skipped over; it must not be too small either, otherwise the search for the optimal parameters becomes slow and wastes computing resources. The size of this step is called the learning rate. The parameter update formula of gradient descent is
$$w=w-\alpha \frac{\partial J(w)}{\partial w}$$
where $w$ is a model parameter and $\alpha$ is the learning rate. The process of finding the optimal parameters is shown in the figure below: by repeatedly computing the gradient, $w$ keeps approaching the minimum of the loss function until the optimal parameters are found.
[Figure: gradient descent steps approaching the minimum of the loss function]
  Let's use a practical example to see how the parameters are updated.


3.2. Case 2

  Suppose the loss function is
$$J(w)=4 w^2$$
First $w$ is initialized randomly, say $w_0=4$, and the learning rate is set to 0.1. The derivative of the loss function is
$$w_0=4, \quad \alpha=0.1, \quad \frac{\partial J(w)}{\partial w}=8 w$$
The first update of $w$ is computed as
$$\begin{aligned} w_1 &=w_0-0.1 \times \frac{\partial J(w)}{\partial w} \\ &=4-0.1 \times 8 \times 4 \\ &=0.8 \end{aligned}$$
The subsequent updates of $w$ are
$$\begin{aligned} w_2 &=0.8-0.1 \times 8 \times 0.8=0.16 \\ w_3 &=0.16-0.1 \times 8 \times 0.16=0.032 \\ w_4 &=0.032-0.1 \times 8 \times 0.032=0.0064 \end{aligned}$$
With this learning rate the value of $w$ gets closer and closer to the optimal value of 0. If the learning rate were much larger (here, anything above 0.25), each step would overshoot the minimum and the iterates would move farther and farther away from it, so you must be careful when setting the learning rate. A common empirical choice is 0.01 or 0.001.
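
The updates above can be reproduced in a few lines of code. The sketch below also tries a larger learning rate to show the overshooting behaviour (for $J(w)=4w^2$ the update is $w \leftarrow (1-8\alpha)w$, so any $\alpha > 0.25$ diverges):

```python
def gradient_descent_1d(w0, lr, steps):
    """Minimize J(w) = 4 * w**2 by gradient descent; the derivative is dJ/dw = 8 * w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * 8 * w  # w <- w - lr * dJ/dw
        history.append(w)
    return history

# Learning rate 0.1: the iterates shrink toward the minimum at w = 0
print(gradient_descent_1d(4.0, 0.1, 4))  # approx [4.0, 0.8, 0.16, 0.032, 0.0064]

# Learning rate 0.3 (> 0.25): each step overshoots and the magnitude grows
print(gradient_descent_1d(4.0, 0.3, 4))  # approx [4.0, -5.6, 7.84, -10.98, 15.37]
```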
  Now let us return to the example of Xiao Ming's exam score and study time from the beginning of the article, and see how to solve for the optimal $w$ with gradient descent. Note that the gradient computed at each step is the average of the gradients over all the data:
$$\frac{\partial J(w)}{\partial w}=\frac{1}{m} \frac{\partial \sum_{i=1}^m\left(f\left(x_i\right)-y_i\right)^2}{\partial w}$$
Computing according to this formula, with the initial value $w_0=4$ and the learning rate set to 0.01:
$$\begin{aligned} &w_0=4 \\ &w_1=4-0.01 \times \frac{\partial J(w)}{\partial w}=3.813 \\ &w_2=3.644 \\ &w_3=3.490 \\ &\vdots \\ &w_{99}=2.000111 \end{aligned}$$
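
A sketch that reproduces this iteration, assuming the same toy data, $w_0=4$, and a learning rate of 0.01, with the gradient averaged over all samples as in the formula above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # daily study time
y = np.array([2.0, 4.0, 6.0])  # exam score

w = 4.0    # initial value, as in the text
lr = 0.01  # learning rate

for step in range(1, 101):
    # (1/m) * d/dw sum_i (w*x_i - y_i)^2  =  mean of 2*(w*x_i - y_i)*x_i
    grad = np.mean(2 * (w * x - y) * x)
    w = w - lr * grad
    if step in (1, 2, 3, 100):
        print(step, round(w, 4))
# prints roughly: 1 3.8133, 2 3.6441, 3 3.4906, ..., 100 2.0001
```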
  From the calculation above, we can see that after 100 rounds of gradient updates $w$ is already very close to the optimal value of 2. Such a result clearly meets our needs. The solution process is shown in the figure below: the result keeps approaching the point where the loss is 0, and the position of the red point is the position of the optimal parameter.
[Figure: loss curve with the red point marking the optimal parameter]
  Therefore, by repeatedly computing the gradient, the gradient descent method can find the parameter that minimizes the loss. The examples above all involve a single feature. Can gradient descent also find the optimal parameters when the data has two or more features? The answer is yes, as shown in the figure below (when there are more than two parameters the dimension exceeds 3 and the picture can no longer be drawn). Finding the optimal values of two parameter weights works the same way as for one parameter: take the partial derivative of the loss function with respect to each parameter, then iterate with the gradient descent update formula until the loss function becomes stable and no longer decreases.
[Figure: gradient descent on a loss surface over two parameters]


Summary

  Finally, let us summarize the characteristics of the gradient descent method and the least squares method.
  The gradient descent method has a wide range of applications. The parameters of the deep neural networks covered later are all solved with gradient descent, and such networks often have tens of thousands of parameters or more; however, for small amounts of data gradient descent has no speed advantage.
  The least squares method is often faster when the data and the number of features are small, but once the scale reaches a certain level the gradient descent method is faster, because the normal equation requires inverting a matrix, and matrix inversion has time complexity on the order of $n^3$.
  Finally, I would like to reiterate that linear regression is a very representative algorithm. Learning it well will be of great help for deep learning later on, and the gradient descent method in particular is at the core of everything that follows.


Original article: blog.csdn.net/didiaopao/article/details/126483324