Optimization method - least squares method and gradient descent method

Table of contents

Series Article Directory

1. Problems

2. Summary of Experimental Ideas

1. Experimental tools and algorithms

2. Experimental data

3. Experimental objectives

4. Experimental steps

3. The introduction of the least squares problem

1. Sample least squares problem

2. Least squares solution and mathematical modeling

3. Introduction of relevant linear algebra knowledge

3.1 Gradient

3.2 Inverse of matrix

3.3 QR Decomposition

4. The least squares method

1. Definition

2. Mathematical modeling

2.1 Objective function

2.2 The solution of the least squares method

2.3 Significance of column vector space

3. Target solution derivation

4. Normal equation

4.1 Solving normal equations by Gram matrix

4.2 Solving normal equations by QR decomposition

5. Programming practice

5.1 QR Decomposition

5.2 Find the optimal solution  

5. Gradient descent method

1. Definition

2. Objective function derivation

3. Operation and algorithm flow

4. Programming practice

4.1 Number of iterations

4.2 "Relative closeness" between adjacent iterative solutions

5. Analysis and error comparison of solutions in different situations

5.1 Analysis of Different Algorithms

5.2 Error Analysis

5.3 Efficiency comparison

6. The influence of different languages and platforms on the solution

6. Theoretical Supplement and Application Expansion

1. Least square method

1.1 Linear regression definition and algorithm steps

1.2 Application of least square method

2. Gradient descent method

2.1 BP neural network

2.2 Application of Gradient Descent Method

7. Experimental summary

1. Summary of solving the least squares problem

2. References


Series Article Directory

This series of blogs focuses on the concept, principle and code practice of optimization methods (if you have any questions, please discuss and point them out in the comment area, or contact me directly by private message).

The code can be copied in full, but it is most meaningful to understand the principle and process and reproduce it yourself!

Chapter 1  Optimization Method - K-means to Achieve Handwritten Digit Image Clustering_@李敬如的博客-CSDN博客

Chapter 2  Optimization Method - QR Decomposition_@李敬如的博客-CSDN博客

Chapter 3  Optimization Method - Least Squares Method


Synopsis

    This blog mainly introduces the principle and workflow of the least squares method and the gradient descent method. Using MATLAB and PyCharm respectively, the given optimization model is solved with the least squares method and with the gradient descent method under different iteration stop conditions; the errors between the resulting solutions are analyzed and compared, and some theory and applications are added (the data set and the Python and MATLAB code are attached).


1. Problems

Read the matrix A and the vector b from the attached "MatrixA_b.mat" file, and establish the least squares optimization model for the matrix A, the vector b, and the unknown vector x:

\min _{x}\|A x-b\|_{2}^{2}

1) Find the exact solution of the optimization model through the normal equation of the least square method;

2) Use the gradient descent method to iteratively obtain an "approximate solution" of the model, and analyze the error between the "approximate solution" and the "exact solution" under different iteration stop conditions.

2. Summary of Experimental Ideas

1. Experimental tools and algorithms

    In this experiment, MATLAB and PyCharm were used to implement the least squares method, the gradient descent method with different iteration stop conditions, and related methods for solving the given optimization model, and to analyze and compare the errors between the solutions.

2. Experimental data

    In this experiment, the optimization model \min _{x}\|A x-b\|_{2}^{2} built from the given matrix A (50x40) and vector b (50x1) is used as the object of study, and some public data sets are used in the extended explorations.

3. Experimental objectives

    This experiment requires solving a given optimization model (a least squares problem) with different methods and analyzing and comparing the errors of the solutions. In addition, relevant theory is supplemented and applications of the algorithms are practiced.

4. Experimental steps

    The general process of this experiment is shown in Table 1:

Table 1 Experiment 3 process

1. Summary of Experimental Ideas

2. The introduction of the least squares problem

3. Derivation and solution of the least squares method

4. Derivation and solution of gradient descent method

5. Analysis and error comparison of solutions in different situations

6. Theoretical expansion and application practice

3. The introduction of the least squares problem

1. Sample least squares problem

    Before solving the least squares problem, we need to define it and model it mathematically. This part therefore introduces a two-dimensional example, shown in Figure 1, and a practical measurement problem, shown in Figure 2:

Figure 1 Example of two-dimensional least squares problem

Figure 2 Example of practical least squares problem

    Analysis: for the problem in Figure 1, no straight line passes through the three points A, B, and C at the same time. For the problem in Figure 2, no set of values x1, x2, x3 satisfies all the conditions exactly.

2. Least squares solution and mathematical modeling

    Least squares problem: because of various errors, it is generally impossible to find a set of solutions that satisfies all the problem conditions exactly (no line or hyperplane fits all of the available data).

    Solution: the core idea of the least squares method is to find an approximate solution that comes as close as possible to the goal of the original problem, i.e. to make the residual vector r = Ax - b as small as possible under a chosen measure. The mathematical model of the least squares problem is shown in Figure 3:

Figure 3 Least squares problem model

3. Introduction of relevant linear algebra knowledge

    Different methods will be used later to solve the least squares problem, so this part supplements the core related linear algebra knowledge.

3.1 Gradient

    The gradient is a vector: it points in the direction along which the directional derivative of a function at a given point attains its maximum value, i.e. the direction in which the function changes fastest at that point, and the magnitude of that maximum rate of change is the modulus of the gradient.

    The general form of the gradient is shown in Equation 1:

\nabla f(z)=\left[\begin{array}{c} \frac{\partial f}{\partial z_{1}}(z) \\ \vdots \\ \frac{\partial f}{\partial z_{n}}(z) \end{array}\right]

Equation 1 gradient solution example
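    As a small illustration (this example is not from the original text), take f(z) = z_1^{2} + 3 z_1 z_2; applying Equation 1 gives

\nabla f(z)=\left[\begin{array}{c} 2 z_{1}+3 z_{2} \\ 3 z_{1} \end{array}\right]

so at the point z = (1, 1) the function increases fastest in the direction (5, 3), with rate \sqrt{34}.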

3.2 Inverse of matrix

    When a matrix X satisfies XA=I, X is called the left inverse of A, and the right inverse can be defined similarly.

    Inverse of a matrix: if a matrix A has both a left inverse and a right inverse, then they must be equal; in that case X is called the inverse of A (and A is said to be non-singular), denoted A^{-1}.

    Judging whether the inverse exists: five commonly used criteria are listed in Table 2:

Table 2 Common methods for judging the existence of an inverse matrix

1. If the determinant of the matrix is not 0, it is invertible

2. If the rank of the n×n matrix is n, it is invertible

3. If there exists a matrix B such that AB = BA = I, it is invertible

4. If the homogeneous equation AX = 0 has only the zero solution, it is invertible

5. If the non-homogeneous linear equation AX = b has a unique solution, it is invertible

    The common proof framework of matrix inverse is shown in Figure 4:

Figure 4 Proof frame of matrix inverse

    Supplement : Property (a) is true for any matrix A, and property (b) is true for matrix A.

    Computing the inverse: in programming, matrix inversion is generally done with library functions packaged by each language. For example, in MATLAB inv() inverts a matrix (see: Matrix inversion - MATLAB inv - MathWorks India) and pinv() computes the pseudo-inverse (see: Moore-Penrose pseudo-inverse - MATLAB pinv - MathWorks India).
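    As a quick illustration (not part of the original experiment), the NumPy counterparts of MATLAB's inv() and pinv() are np.linalg.inv() and np.linalg.pinv(); the small matrices below are made up for the example:

import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])                    # invertible (determinant = 10)
A_inv = np.linalg.inv(A)                      # raises LinAlgError if A is singular
print(np.allclose(A @ A_inv, np.eye(2)))      # True: A * A^{-1} = I

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                    # rank 1, so inv(B) would fail
B_pinv = np.linalg.pinv(B)                    # Moore-Penrose pseudo-inverse always exists
print(np.allclose(B @ B_pinv @ B, B))         # True: defining property of the pseudo-inverse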

3.3 QR Decomposition

    QR decomposition is an algorithm that factors a matrix A into a matrix Q with orthonormal columns and an upper triangular matrix R (with nonzero diagonal elements). This decomposition improves the efficiency with which computers solve linear systems, least squares problems, and constrained least squares problems, and effectively reduces the computational complexity. The form of the QR decomposition is shown in Figure 5.

Figure 5 QR decomposition definition form

    In terms of principle, QR decomposition has three implementation methods: Gram-Schmidt, Householder, and Givens. Based on my Experiment 2, for denser matrices the Householder QR decomposition offers higher efficiency and better stability.
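    As a quick sketch (illustrative, not the experiment's own code), NumPy's np.linalg.qr (which, like MATLAB's qr, is Householder-based via LAPACK) can be used to verify the factorization A = QR:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))        # same shape as the experiment's matrix A

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 50x40, R is 40x40 upper triangular
print(np.allclose(A, Q @ R))             # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(40)))  # True: the columns of Q are orthonormal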

4. The least squares method

    In this part, the definition of the least squares method, mathematical modeling, derivation of the target solution, and model solution are explained in detail.

1. Definition

    Least squares is a mathematical optimization technique. It finds the best-fitting function for the data by minimizing the sum of squared errors: the unknown parameters are determined so that the sum of squared errors between the fitted values and the actual data is as small as possible.

    The least squares method can also be used for curve fitting, and some other optimization problems can be expressed in least squares form by minimizing energy or maximizing entropy. It is widely used in data processing tasks such as error estimation, uncertainty analysis, system identification, prediction, and forecasting.

2. Mathematical modeling

2.1 Objective function

    Combining the least squares problem model and the definition of the least squares method, we model the least squares method mathematically: for a given A ∈ R^{m×n} and b ∈ R^{m}, solve for x ∈ R^{n} that minimizes the objective function shown in Equation 2:

f(x)=\|A x-b\|_{2}^{2}

Equation 2 Objective function of the least squares method

2.2 The solution of the least squares method

    Combining the principle of the least squares method, a solution x of the objective function in Equation 2 should satisfy the condition in Equation 3:

Equation 3 Condition satisfied by the least squares solution

    Analysis: when the residual r = Ax - b = 0, x is an exact solution of the linear system Ax = b; otherwise, x is the approximate solution of the system with the smallest sum of squared errors.

2.3 Significance of column vector space

    For a solution x that satisfies the objective function in Equation 2, its meaning in terms of the column space of A is shown in Figure 6:

Figure 6 Significance of least squares column vector space

Analysis: as shown in Figure 6, Ax is the vector in range(A) closest to b, and the residual r = Ax - b is orthogonal (perpendicular) to the range space range(A); equivalently, A^{T}(Ax - b) = 0, which leads directly to the normal equation below.

3. Target solution derivation

For the objective function of Equation 2, we need to obtain the optimal solution x satisfying the condition of Equation 3. Since the objective function f(x) is differentiable, the optimal solution x satisfies the gradient condition ∇f(x) = 0 (Equation 4).
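    For reference, the gradient condition can be written out explicitly (a standard computation):

f(x)=\|A x-b\|_{2}^{2}=x^{T} A^{T} A x-2 b^{T} A x+b^{T} b, \qquad \nabla f(x)=2 A^{T} A x-2 A^{T} b=2 A^{T}(A x-b)

Setting \nabla f(x)=0 then yields the normal equation of the next section.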

4. Normal equation

    From the definition of the least squares method and the gradient derivation above, finding the optimal solution of the objective function means solving ∇f(x) = 0, which gives the equation shown in Equation 9:

A^{T} A x=A^{T} b

Equation 9 Least squares normal equation

    Analysis: the normal equation in Equation 9 is equivalent to ∇f(x) = 0 with f(x)=\|A x-b\|_{2}^{2}, and every solution of the least squares problem satisfies the normal equation. If the columns of A are linearly independent, then A^{T}A is non-singular and the normal equation (and hence the original problem) has a unique solution.

    There are generally three methods for solving the normal equation: solving it directly, solving via the Gram matrix, and solving via QR decomposition. The latter two are explained in detail below:

4.1 Solving normal equations by Gram matrix

    The general process of solving the normal equation via the Gram matrix is shown in Table 3:

    Tip: due to rounding, the Gram matrix can end up being a singular matrix.
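    A minimal NumPy sketch of method ① (illustrative only, not the experiment's code; the helper name lstsq_gram is made up here): form the Gram matrix G = A^{T}A and, assuming A has full column rank so that G is symmetric positive definite, solve G x = A^{T}b via a Cholesky factorization:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lstsq_gram(A, b):
    """Solve min ||Ax - b||_2^2 via the normal equations A^T A x = A^T b."""
    G = A.T @ A                    # Gram matrix (n x n), assumed positive definite
    c = A.T @ b                    # right-hand side of the normal equations
    L = cho_factor(G)              # Cholesky factorization of G
    return cho_solve(L, c)

# illustrative usage with random data of the experiment's dimensions
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
b = rng.standard_normal(50)
x = lstsq_gram(A, b)
print(np.allclose(A.T @ (A @ x - b), 0, atol=1e-8))  # residual is orthogonal to range(A)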

4.2 Solving normal equations by QR decomposition

Method ② is more stable than method ① because it avoids forming the Gram matrix. The general process of solving the normal equation via QR decomposition is shown in Table 4:
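    A corresponding sketch of method ② (again illustrative; the helper name lstsq_qr is made up): factor A = QR, then solve the upper-triangular system R x = Q^{T}b by back substitution, which avoids forming A^{T}A:

import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(A, b):
    """Solve min ||Ax - b||_2^2 via a (reduced) QR factorization of A."""
    Q, R = np.linalg.qr(A)                 # A = QR with R upper triangular
    return solve_triangular(R, Q.T @ b)    # back substitution for R x = Q^T b

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
b = rng.standard_normal(50)
print(np.allclose(lstsq_qr(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))  # True: matches NumPy's solver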

5. Programming practice

    Following the requirements of experimental task 1), this part obtains the exact solution of the given optimization model by programming the normal equation of the least squares method.

5.1 QR Decomposition

    Import the given matrix A and vector b in the experiment, and perform QR decomposition (Householder) on A. The algorithm flow is shown in Table 5:

    In terms of code implementation, the MATLAB library function [Q,R] = qr(A) performs a Householder-based QR decomposition (usage: QR decomposition - MATLAB qr - MathWorks China); you can also build your own QR decomposition function, whose decomposition and stability analysis code can be found in: Optimization Method - QR Decomposition_@李敬如的博客-CSDN博客.

5.2 Find the optimal solution  

    After obtaining the Q and R factors of the given matrix A (different QR routines return differently arranged factors, which may need converting), the optimal solution is computed from Q, R and b according to Equation 10 (the inverse matrix can be obtained with the inv() function). The resulting optimal solution x_least is saved for later comparison; the code is as follows:

x_least = inv(R)*Q'*b; % exact solution; equivalently R\(Q'*b), which avoids the explicit inverse

5. Gradient descent method

    Besides the least squares method, the gradient descent method is also commonly used to approximate the optimal solution of an optimization problem, especially when the columns of A ∈ R^{m×n} are linearly dependent or when n is very large. This part explains in detail the definition, mathematical modeling, target solution derivation, and model solution of the gradient descent method.

1. Definition

    Gradient descent is a first-order optimization algorithm. To find a local minimum of a function with gradient descent, one iteratively steps a specified distance from the current point in the direction opposite to the gradient (or an approximate gradient) at that point. That is, gradient descent produces a sequence x^1, x^2, ..., x^k → x*, where x^k is the k-th iterate and the update x^{k+1} is expected to satisfy f(x^{k+1}) < f(x^k). The core principle is shown in Figure 12:

Figure 12 The core principle of the gradient descent method

2. Objective function derivation

3. Operation and algorithm flow

    For the optimization problem \min _{x \in \mathbb{R}^{n}} \frac{1}{2}\|A x-b\|_{2}^{2}, A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^{m}, the algorithm flow obtained from the principle of the gradient descent method and the target solution derivation is summarized in Table 6:

    Among them, there are generally two kinds of iteration stop conditions: a fixed number of iterations, and the "relative closeness" between adjacent iterative solutions.
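    For reference, the step size used in the code below comes from an exact line search along the negative gradient: with p^{k}=A^{T}\left(A x^{k}-b\right), minimizing f\left(x^{k}-\alpha p^{k}\right) over \alpha gives (a standard derivation)

\alpha_{k}=\frac{\left\|p^{k}\right\|_{2}^{2}}{\left\|A p^{k}\right\|_{2}^{2}}, \qquad x^{k+1}=x^{k}-\alpha_{k} p^{k}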

4. Programming practice

    Following the requirements of experimental task 2), this part uses the gradient descent method to find an "approximate solution" of the given optimization model. The core code is as follows:

%% Gradient descent method
tol = 0.01;                        % threshold for the relative-closeness stop condition
x = zeros(40,1);                   % initial solution x(0) = 0
for k = 1:30                       % fixed iteration count (or use the stop condition below)
    f(1,k) = 0.5*norm(A*x-b,2)^2;              % objective function value
    p = A'*(A*x-b);                            % gradient
    a = norm(p,2)^2 / norm(A*p,2)^2;           % exact line-search step size
    y = x - a*p;                               % y is x(k+1)
    temp(1,k) = norm(x-y,2)/norm(x,2);         % relative closeness between adjacent iterates
    error(1,k) = norm(x_least-x,2);            % error against the exact solution
%     if norm(x-y,2)/norm(x,2) < tol            % alternative stop condition
%         break
%     end
    x = y;                                     % iterate
end
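    For reference, a NumPy version of the same loop (an illustrative sketch, not the attached Python code; the function name gradient_descent is made up, and A, b, and the exact solution x_least are assumed to be NumPy arrays):

import numpy as np

def gradient_descent(A, b, x_least, n_iter=30, tol=None):
    """Exact-line-search gradient descent for min 0.5*||Ax - b||_2^2.

    Stops after n_iter iterations, or earlier when the relative closeness
    ||x^k - x^{k+1}|| / ||x^k|| drops below tol (if tol is given).
    """
    x = np.zeros(A.shape[1])
    errors = []                                        # ||x_least - x^k||_2 per iteration
    for k in range(n_iter):
        p = A.T @ (A @ x - b)                          # gradient of 0.5*||Ax - b||^2
        alpha = (p @ p) / np.linalg.norm(A @ p) ** 2   # exact line-search step size
        x_new = x - alpha * p
        errors.append(np.linalg.norm(x_least - x))
        rel = np.linalg.norm(x - x_new) / max(np.linalg.norm(x), 1e-12)  # guard against x = 0
        if tol is not None and rel < tol:
            return x_new, errors
        x = x_new
    return x, errors

With the experiment's data, gradient_descent(A, b, x_least, n_iter=30) should correspond to the iteration-count variant of Section 4.1, and gradient_descent(A, b, x_least, n_iter=1000, tol=0.01) to the relative-closeness variant of Section 4.2.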

4.1 Number of iterations

    In this part, the number of iterations is used as the stop condition. To find a suitable iteration count, the influence of different iteration counts on the solution (the change of the objective function value) is observed and analyzed. In this experiment, the relationship between the number of iterations and the objective function value is shown in Figure 14:

Figure 14 The relationship between the number of iterations and the value of the objective function

    Analysis: Figure 14 shows that beyond about 30 iterations the objective function levels off as the number of iterations increases. Therefore, choosing 30 iterations as the stop condition is a good choice for this experiment.

4.2 "Relative closeness" between adjacent iterative solutions

    In this part, the "relative closeness" between adjacent iterative solutions is used as the iteration stop condition. In this experiment, the formula: is taken \left\|x^{k}-x^{k+1}\right\|_{2} /\left\|x^{k}\right\|_{2}as an evaluation standard. In order to explore the optimal threshold , the effect of different thresholds on the solution of the objective (the change of the objective function value) should be observed and analyzed. In this experiment, the relationship between the number of iterations and the "relative proximity" between adjacent iterative solutions is shown in Figure 15:

Figure 15 The relationship between the number of iterations and the "relative closeness" between adjacent iterative solutions

    Analysis: combining Figure 15 with Figure 14, the fluctuation of the "relative closeness" between adjacent iterates decreases as the number of iterations increases. After analysis, a "relative closeness" threshold of 0.01 is chosen as the termination condition of the gradient descent method in this experiment.

5. Analysis and error comparison of solutions in different situations

    Different methods and languages for solving the least squares problem yield different results and efficiencies, so this part makes a comparative analysis.

5.1 Analysis of Different Algorithms

The solutions and the corresponding efficiencies obtained by the least squares method and the gradient descent method differ, which can be explained from the principles and workflows of the two algorithms.

    For the least squares method, the core is to take the partial derivatives and set them to zero, which yields the theoretical "exact solution". The final step is solving a system of equations, whose computational cost is relatively large.

    The gradient descent method can be viewed as a simpler substitute for solving that final system of equations; it is essentially an algorithm that iteratively approaches the target "exact solution" along the gradient direction with a chosen step size. Error arises because the iteration starts from an initial solution, often far from the "exact solution"; each iteration chooses its direction and step length to reduce this error as much as possible, but the final solution still differs from the "exact solution" by some amount.

    In general, the least squares method obtains a globally optimal closed-form solution, while the gradient descent method optimizes the parameters step by step through iterative updates, and in general its final result is only guaranteed to be a local optimum.

5.2 Error Analysis

    This part compares the "approximate solutions" obtained by gradient descent under the two different stop conditions with the "exact solution" obtained by the least squares method, using the distance error \|x_{\text{least}}-x\|_{2}. The error between the iterative solution and the exact solution is shown in Figure 16, and the error between the approximate solutions and the exact solution is shown in Table 7:

Figure 16 Error relationship between iterative solution and exact solution

    Analysis: as can be seen from Figure 16, with the initialization x^{(0)} = 0 and the error measured by \|x_{\text{least}}-x\|_{2}, the error between the gradient descent iterate and the exact least squares solution decreases as the number of iterations increases, from 2.0007 at 0 iterations to 0.752 at 100 iterations.

Table 7 The error between the approximate solution and the exact solution in this experiment

    Analysis: as can be seen from Table 7, with the initialization x^{(0)} = 0 and the error measured by \|x_{\text{least}}-x\|_{2}, the approximate solutions obtained by the two gradient descent variants used in this experiment (stop after 30 iterations; stop when \left\|x^{k}-x^{k+1}\right\|_{2}/\left\|x^{k}\right\|_{2} < 0.01) differ from the exact (closed-form optimal) least squares solution by 1.33798527 and 1.491785332 respectively. The differences between the objective function values obtained by the two gradient descent variants and the least squares value are 0.084586155 and 0.1191079252 respectively.

5.3 Efficiency comparison

    To compare the efficiency of the different methods, the three methods above are used to solve the least squares problem given in the experiment, each run 20 times. The running times are summarized in Table 8, and the efficiency comparison is shown in Figure 17:

Table 8 Summary of the average running time of different methods for solving the least squares problem

Figure 17 Comparison of the efficiency of different methods for solving the least squares problem

    Analysis: Table 8 and Figure 17 show that, whichever gradient descent variant is used, its average running time is lower than that of the least squares method.

    Combining correctness and efficiency, although the least squares method finds a more accurate solution, it needs a longer running time. Therefore, for a given problem, one of the two methods should be chosen according to the nature of the problem.

    Specifically, the least squares method needs to compute a matrix inverse, which is time-consuming, and the inversion can also cause numerical instability, so this way of computing is sometimes not advisable in applications.

    In contrast, although the gradient descent method has its drawbacks and may need a relatively large number of iterations, the amount of computation per step is relatively small. Moreover, on the least squares problem its convergence is guaranteed. Therefore, when the amount of data is large, the gradient descent method (or, in practice, other better iterative methods) is more worth using.

6. The influence of different languages ​​and platforms on the solution

To explore the influence of different languages and platforms on the solution of the least squares problem, the least squares method and the two gradient descent variants were reimplemented in Python in PyCharm 2021. The specific code is given in the attachment.

    The three methods implemented in MATLAB and Python were used to solve the problem with the given matrix A (50x40) and vector b (50x1); each method on each platform was run 20 times to obtain the average running time. The data are summarized in Table 8, and the comparison is shown in Figure 18:

Table 8 Average running time of solving the least squares problem with different languages and different methods

Figure 18 Comparison of the efficiency of solving the least squares problem with different languages and different methods

    Analysis: Table 8 and Figure 18 show that, for each method of solving the least squares problem, the running time under MATLAB is slightly lower than under Python, i.e. the MATLAB implementation is more efficient.

6. Theoretical Supplement and Application Expansion

    Beyond solving the optimization model built from the matrix and vector in this experiment, the least squares method and the gradient descent method are widely used in other areas. This part makes a simple attempt and practice.

1. Least square method

1.1 Linear regression definition and algorithm steps

Linear regression and its detailed applications can be found in: Machine Learning - LR (Linear Regression), LRC (Linear Regression Classification) and Face Recognition

    Regression and linear regression: regression analysis is a predictive modeling technique that studies the relationship between independent and dependent variables. Linear regression is the most basic regression algorithm: it fits roughly linear data with a line (or plane) model with as little loss as possible, so that the fitted model predicts the data well. The general algorithm flow is shown in Table 9:

Table 9 Linear regression algorithm flow

Input : dataset

Process :

1. Variable screening and control

2. Do scatter plot and correlation analysis on the data with normal distribution

3. Determine the parameters by minimizing the loss function, and obtain (fit) the regression equation

4. Continuously test the model, optimize parameters, and obtain the optimal regression equation

5. Use the regression equation to make predictions

Output: regression equation

1.2 Application of least square method

    According to the definition of linear regression and Table 9, the least squares method is often used to fit the data in linear regression problems, including (but not limited to) fitting a straight line or hyperplane via the normal equation and then predicting new data. In this part, a practical example explores the application of the least squares method in linear regression.
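    For the single-variable case used below, the least squares estimates have the familiar closed form (these are exactly the quantities Lxx, Lxy, b1, b0 computed in the MATLAB code):

b_{1}=\frac{L_{x y}}{L_{x x}}=\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)^{2}}, \qquad b_{0}=\bar{y}-b_{1} \bar{x}, \qquad \hat{y}=b_{0}+b_{1} x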

    Description of the problem : Explore the relationship between student grades and student learning time

    Linear regression implementation: take study time as the variable and grade as the predicted value, establish a regression equation, minimize the loss function with the least squares method, obtain and verify the regression equation, and then use it for prediction. The core code is as follows:

%% Application of the least squares method
x=[23.80,27.60,31.60,32.40,33.70,34.90,43.20,52.80,63.80,73.40];
y=[41.4,51.8,61.70,67.90,68.70,77.50,95.90,137.40,155.0,175.0];
figure
plot(x,y,'r*') % scatter plot of the raw data
xlabel('x (student study time)','fontsize',12)
ylabel('y (student grade)','fontsize',12)
set(gca,'linewidth',2)
% least squares fit
Lxx=sum((x-mean(x)).^2);
Lxy=sum((x-mean(x)).*(y-mean(y)));
b1=Lxy/Lxx;
b0=mean(y)-b1*mean(x);
y1=b1*x+b0; % fitted line, used for fitting and prediction
hold on
plot(x,y1,'linewidth',2);
m2=LinearModel.fit(x,y); % built-in linear regression, for comparison

The data and fitting results are shown in Figure 20:

Figure 20 Data and fitting results

    Analysis: as can be seen from Figure 20, the model fits the data well and the relationship is fairly linear. To predict values not shown in the plot, simply substitute the corresponding study time as x into the regression model (equation). This verifies the correctness of the least squares method in the linear regression application.

2. Gradient descent method

2.1 BP neural network

    For details on BP neural network and its application, please refer to: Machine Learning - Deep Neural Network Practice (FCN, CNN, BP

    A BP neural network is a simple neural network whose core idea is to imitate the working principle of the human brain with a mathematical model. Its bionic structure is shown in Figure 21:

Figure 21 Topological diagram of BP neural network

    The structure of a BP neural network consists of three kinds of layers: the input layer at the front, the hidden layers in the middle (there can be several hidden layers, each with multiple neurons), and the output layer at the end. The general workflow is shown in Table 10:

Table 10 BP neural network process

Input : dataset

Process :

1. The input layer receives the input; each input neuron then passes its weighted value to every hidden layer neuron.

2. Each hidden neuron sums the values passed from the input neurons together with its own bias b, passes the result through an activation function (commonly the tansig function), and then passes the weighted result on to the output layer.

3. Each output neuron sums the values transmitted by the hidden neurons with its own bias b (possibly followed by one more transformation); this sum is the output value.

Output: corresponding result

2.2 Application of Gradient Descent Method

    According to the definition of the neural network and Table 10, parameter updating is an important step in such algorithms. For the BP neural network, the gradient descent method is commonly used to update the parameters: the gradients of the parameters are computed by backpropagation and then used to optimize them. In this part, a practical example explores the application of gradient descent in a BP neural network.
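    Concretely, for each mini-batch every layer moves its weights and biases a small step against their gradients with learning rate \eta (this is what the update method of each layer implements in the code below):

w \leftarrow w-\eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b-\eta \frac{\partial L}{\partial b}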

    Description of the problem: classification of the iris data (mapping the four feature attributes of an iris to three classes)

    BP neural network implementation: a four-layer BP neural network is used in this practice: the first layer is the input layer, the second and third layers are hidden layers, and the fourth layer is the output layer.

    The input layer has four neurons (one per feature attribute), the output layer has three neurons (the probabilities of the three classes), and the two hidden layers are empirically set to twenty-five neurons each.

    Neurons in adjacent layers are fully connected, and each connection carries a weight w. Except for the input layer, every neuron has a bias b and an activation function f, and a loss function measures the error of the final output.

    The weights w and biases b are initialized with random numbers, the hidden layer activation is the ReLU function, the output layer activation is the softmax function (for classification), and the loss function is the cross-entropy error (one-hot encoding is used for the classes, so the cross-entropy error is appropriate).

    The parameter update method is stochastic gradient descent: the gradients of the parameters are computed by backpropagation and then used to optimize the parameters. The code is as follows:

# Training set: 50% of the 150 iris samples
# Network structure: input layer (4) + hidden layer (25) + hidden layer (25) + output layer (3)
#   (note: the code below actually sets n_mid = 10 hidden neurons per layer)
# Hidden layer activation: ReLU; output layer activation: softmax
# Loss function: cross-entropy error
# Parameter update: stochastic gradient descent

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the iris data set
iris_data = datasets.load_iris()
input_data = iris_data.data
correct = iris_data.target
n_data = len(correct)

# Preprocess the data
# Standardize to zero mean and unit variance
ave_input = np.average(input_data, axis=0)
std_input = np.std(input_data, axis=0)
input_data = (input_data - ave_input) / std_input
print(input_data)

# Convert labels to one-hot encoding
correct_data = np.zeros((n_data, 3))
for i in range(n_data):
    correct_data[i, correct[i]] = 1.0
print(correct_data)
# Split into training and test sets (even/odd indices)
index = np.arange(n_data)
index_train = index[index % 2 == 0]
index_test = index[index % 2 != 0]

input_train = input_data[index_train, :]
input_test = input_data[index_test, :]

correct_train = correct_data[index_train, :]
corre_test = correct_data[index_test, :]

n_train = input_train.shape[0]
n_test = input_test.shape[0]

# Set hyperparameters
n_in = 4
n_mid = 10
n_out = 3

wb_width = 0.1
eta = 0.1
epoch = 100

batch_size = 8
interval = 100


# Network layer implementations
class Baselayer:
    def __init__(self, n_upper, n):
        self.w = wb_width * np.random.randn(n_upper, n)
        self.b = wb_width * np.random.randn(n)

    def update(self, eta):  # one gradient descent step on this layer's parameters
        self.w = self.w - eta * self.grad_w
        self.b = self.b - eta * self.grad_b


class MiddleLayer(Baselayer):
    def forward(self, x):
        self.x = x
        self.u = np.dot(x, self.w) + self.b
        self.y = np.where(self.u <= 0, 0, self.u)  # ReLU activation

    def backward(self, grad_y):
        delta = grad_y * np.where(self.u <= 0, 0, 1.0)  # derivative of ReLU
        self.grad_w = np.dot(self.x.T, delta)
        self.grad_b = np.sum(delta, axis=0)
        self.grad_x = np.dot(delta, self.w.T)


class OutputLayer(Baselayer):
    def forward(self, x):
        self.x = x
        u = np.dot(x, self.w) + self.b
        self.y = np.exp(u) / np.sum(np.exp(u), axis=1, keepdims=True)  # softmax activation

    def backward(self, t):
        delta = self.y - t
        self.grad_w = np.dot(self.x.T, delta)
        self.grad_b = np.sum(delta, axis=0)
        self.grad_x = np.dot(delta, self.w.T)


# Instantiate the layers
middle_layer_1 = MiddleLayer(n_in, n_mid)
middle_layer_2 = MiddleLayer(n_mid, n_mid)
output_layer = OutputLayer(n_mid, n_out)


# Helper functions: forward pass, backpropagation, parameter update, error
def forward_propagation(x):
    middle_layer_1.forward(x)
    middle_layer_2.forward(middle_layer_1.y)
    output_layer.forward(middle_layer_2.y)


def back_propagation(t):
    output_layer.backward(t)
    middle_layer_2.backward(output_layer.grad_x)
    middle_layer_1.backward(middle_layer_2.grad_x)


def update_wb():
    middle_layer_1.update(eta)
    middle_layer_2.update(eta)
    output_layer.update(eta)


def get_error(t, batch_size):
    return -np.sum(t * np.log(output_layer.y + 1e-7)) / batch_size


train_error_x = []
train_error_y = []
test_error_x = []
test_error_y = []

# Training loop

n_batch = n_train // batch_size

for i in range(epoch):
    # Record training and test errors
    forward_propagation(input_train)
    error_train = get_error(correct_train, n_train)

    forward_propagation(input_test)
    error_test = get_error(corre_test, n_test)

    train_error_x.append(i)
    train_error_y.append(error_train)

    test_error_x.append(i)
    test_error_y.append(error_test)

    index_random = np.arange(n_train)

    np.random.shuffle(index_random)

    for j in range(n_batch):
        mb_index = index_random[j * batch_size:(j + 1) * batch_size]
        x = input_train[mb_index, :]
        t = correct_train[mb_index, :]

        forward_propagation(x)

        back_propagation(t)

        update_wb()

plt.plot(train_error_x, train_error_y, label="Train")

plt.plot(test_error_x, test_error_y, label="Test")

plt.legend()

plt.xlabel("epoch")

plt.ylabel("error")

plt.show()

The relationship between the classification results and the epoch, with the parameters updated by different gradient descent variants, is shown in Figure 22:

Figure 22 The relationship between the classification results and epoch under the update parameters of different gradient descent methods

    Analysis: Figure 22 shows that, whichever gradient descent variant is used to update the parameters, the errors on the training and test sets decrease as the epoch increases and follow a similar trend. With stochastic gradient descent, however, the fluctuation and error are larger, while updating the BP network parameters with adaptive gradient descent (Adagrad) is more stable, and the fit on both data sets is good (good classification, small error).

7. Experimental summary

1. Summary of solving the least squares problem

(1) This experiment mainly introduced two methods for solving the least squares problem: the least squares method and the gradient descent method. A brief comparison of the two is summarized in Table 11:

Table 11 Comparison and summary of methods for solving least squares problems

Method: Least squares method

Principle: set ∇f(x) = 0 and solve via the normal equation

Advantage: the obtained solution is relatively accurate

Defects: 1. sensitive to outliers; 2. the matrix inversion has high complexity; 3. not well suited to nonlinear data

Method: Gradient descent

Principle: iterative, gradually approaching the exact solution

Advantage: relatively high efficiency

Defects: 1. the obtained solution is a local optimum and may stagnate at a local optimum; 2. near the minimum point a zigzag phenomenon appears and the convergence speed drops

(2) Different methods were used to solve the optimization model given in the experiment. In this experiment, efficiency from high to low is gradient descent method > least squares method, and accuracy of the solution from high to low is least squares method > gradient descent method. Following the summary in Table 11, the choice of solution method for a practical problem should be determined by the type of data and the requirements of the task.

(3) From the optimization perspective, both the least squares method and the gradient descent method have certain shortcomings, and there is still room for improvement in the mathematical derivation. Many other optimization methods for solving the least squares problem exist and are also worth studying.

(4) Different languages and platforms have a certain impact on the efficiency of solving the least squares problem. Generally, as the size of the matrix and vector increases, the same method runs more efficiently under MATLAB than under Python; the choice should take both the data and personal familiarity into account.

(5) The least squares method and the gradient descent method have many applications, such as fitting data by linear regression and updating neural network parameters, and the two are connected in both theory and practice.

2. References

1. Optimization Method - Least Squares_Obviously Easy to Prove's Blog - CSDN Blog

2. Optimization Method - QR Decomposition_@李敬如的博客 - CSDN Blog

3. Gradient descent method to solve a simple demo of a BP neural network_Old Cake Explanation - BP Neural Network Blog

4. Machine Learning - LR (Linear Regression), LRC (Linear Regression Classification) and Face Recognition

5. BP Neural Network Iris Classification in Python: Stochastic Gradient Descent and Adagrad (Adaptive Gradient Descent Method)
