Gradient descent algorithms for linear regression and their low-level implementation in Python


Gradient descent is a commonly used optimization algorithm that minimizes a loss function through repeated iteration. Depending on how much data is used for each update and how the update direction is chosen, gradient descent can be divided into batch gradient descent (Batch Gradient Descent, BGD), stochastic gradient descent (Stochastic Gradient Descent, SGD), mini-batch gradient descent (Mini-batch Gradient Descent, MBGD), the conjugate gradient method (Conjugate Gradient, CG), and so on.

1 Batch Gradient Descent (BGD)

Batch Gradient Descent (BGD) is the most common form of gradient descent: in each iteration it computes the gradients over all training samples and uses them to update the model parameters. For the detailed derivation, please refer to the previous blog post: The principle of linear regression gradient descent and the implementation of the underlying code based on Python.
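
To make the update concrete, the following is a minimal sketch of one vectorized BGD step for linear regression. The names X (design matrix), y (targets), w (weights), and lr (learning rate) are illustrative assumptions and are not taken from the original post.

import numpy as np

def bgd_step(X, y, w, lr):
    # Average gradient of the squared error over ALL m samples:
    # grad = (1/m) * X^T (X w - y)
    m = X.shape[0]
    grad = X.T @ (X @ w - y) / m
    return w - lr * grad   # one full-batch parameter update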

Advantages of BGD include:

  1. For convex optimization problems, convergence to the global optimum is guaranteed.
  2. Matrix (vectorized) operations can be used to speed up the computation.
  3. For dense datasets, BGD is computationally efficient.

Disadvantages of BGD include:

  1. Every iteration processes all training samples, so the computational cost is high and it is not suitable for large-scale datasets.
  2. It may get stuck in local optima, especially in non-convex optimization problems.
  3. For sparse datasets, BGD is computationally inefficient, because much of the work is spent on uninformative entries.
  4. BGD is also inefficient for online or real-time learning on streaming data, since the entire dataset must be processed for every iteration.

2 Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a commonly used gradient descent algorithm that updates the model parameters by computing the gradient of a single, randomly chosen training sample in each iteration.
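
As an illustration (not code from the original post), one SGD step might look like the following, where X, y, w, and lr are the same assumed names as in the BGD sketch above:

import numpy as np

def sgd_step(X, y, w, lr):
    # Pick one sample at random and update with its gradient only.
    i = np.random.randint(X.shape[0])
    grad = (X[i] @ w - y[i]) * X[i]   # gradient of the squared error of sample i
    return w - lr * grad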

Advantages of SGD include:

  1. It can handle large-scale, sparse, or real-time streaming data, because only one sample is processed at a time and each update is cheap.
  2. It can escape local optima, because the update direction varies from step to step.
  3. For non-convex optimization problems, SGD may perform better precisely because of this ability to escape local optima.

Disadvantages of SGD include:

  1. Convergence to the global optimum is not guaranteed, because the update direction is noisy.
  2. On dense datasets, SGD can be slow overall because of the many small, frequent updates and data reads required.
  3. The iterates may oscillate or jitter around the optimum, slowing convergence.

The stochastic gradient descent algorithm is well suited to large-scale, sparse, or real-time streaming problems, and it can escape local optima. However, for problems that are small, dense, or require a guaranteed global optimum, SGD may not perform as well as batch gradient descent. In addition, its convergence can be slowed by oscillation or jitter and may require additional tuning.

3 Mini-batch Gradient Descent (MBGD)

Mini-batch Gradient Descent (MBGD) is a gradient descent algorithm between Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD): in each iteration it computes the gradient over a small subset (mini-batch) of the training samples and uses it to update the model parameters.
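
A minimal sketch of one mini-batch update, using the same assumed names as the earlier sketches plus an illustrative batch_size parameter (not from the original post):

import numpy as np

def mbgd_step(X, y, w, lr, batch_size=10):
    # Sample a random mini-batch and average the gradient over it.
    idx = np.random.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size
    return w - lr * grad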

Advantages of MBGD include:

  1. Matrix (vectorized) operations can be used to speed up the computation within each mini-batch.
  2. For dense datasets, MBGD is computationally efficient.
  3. For sparse datasets, its computational efficiency is also relatively high.
  4. For convex optimization problems it can converge to the global optimum.
  5. It can escape local optima, because the update direction varies from batch to batch.

Disadvantages of MBGD include:

  1. The mini-batch size must be set manually; a poor choice can hurt convergence speed and accuracy.
  2. For large-scale, sparse, or real-time streaming problems, MBGD may not be as computationally efficient as SGD, although it is still more efficient than BGD.

The mini-batch gradient descent algorithm is a compromise that balances computational efficiency and convergence speed to a certain extent, and it is suitable for training most deep learning models. However, the mini-batch size needs to be chosen case by case to obtain the best results.

4 Conjugate Gradient (CG)

The conjugate gradient method (Conjugate Gradient, CG) is an iterative method for problems with a special matrix structure (in particular, symmetric positive definite systems) that can converge quickly to the global optimum. Unlike plain gradient descent, each update direction is not simply the negative gradient: the new direction is a linear combination of the current gradient direction and the previous search direction. The iterative process of the CG algorithm can be described by the following steps:

  1. Randomly initialize the model parameters.
  2. Compute the gradient and use it as the initial search direction.
  3. Update model parameters along the search direction.
  4. Compute the new gradient, and compute a new search direction that is conjugate to the previous search directions.
  5. Repeat steps 3-4 until a predetermined number of iterations or an error threshold is reached.

Advantages of the CG algorithm include:

  1. It converges quickly to the global optimum, especially for symmetric positive definite systems.
  2. It does not need to store all historical gradients, which saves memory.
  3. The step size along each search direction is computed in closed form, so parameters such as a learning rate do not need to be set.

Disadvantages of the CG algorithm include:

  1. It is only applicable to certain problem structures, in particular symmetric positive definite systems.
  2. For non-convex optimization problems, CG may not perform as well as other gradient descent algorithms.
  3. For very large-scale problems it may be limited by the extra memory needed to store temporary vectors.

The conjugate gradient method is suitable for symmetric positive definite systems, converges quickly to the global optimum, and does not require a learning rate. However, for non-convex optimization problems and very large-scale problems, its performance may be limited.
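
The steps above can be made concrete with a minimal sketch of the linear CG method applied to the least-squares normal equations X^T X w = X^T y. The function name conjugate_gradient and its parameters are illustrative assumptions, not code from the original post.

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=None):
    # Solves A x = b (equivalently minimizes 0.5 x^T A x - b^T x),
    # assuming A is symmetric positive definite.
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x            # residual = negative gradient at x
    d = r.copy()             # initial search direction = steepest descent
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # exact step size along d (no learning rate)
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:  # stop once the gradient is small enough
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # new direction, conjugate to the previous ones
        r = r_new
    return x

For linear regression one would take A = X.T @ X and b = X.T @ y; for the one-parameter dataset built in section 5.1 below these reduce to scalars and CG converges in a single step.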

5 Low-level code examples for different gradient methods

5.1 Constructing the dataset

Here we assume that the dataset has only one variable x, and that the relationship between x and y is y = 8x. Next, 100 data points will be constructed, with x taking values in np.arange(0, 10, 0.1).

import numpy as np

EXAMPLE_NUM = 100       # number of data points
BATCH_SIZE = 10         # mini-batch size used by MBGD
TRAIN_STEP = 150        # number of training iterations
LEARNING_RATE = 0.0001  # learning rate
X_INPUT = np.arange(EXAMPLE_NUM) * 0.1   # x values: 0.0, 0.1, ..., 9.9
Y_OUTPUT_CORRECT = 8 * X_INPUT + np.random.uniform(low=-10, high=10, size=EXAMPLE_NUM)  # y = 8x plus per-point noise in (-10, 10)

def train_func(X, K):
    result = K * X   # the model: y_hat = K * x
    return result

Here EXAMPLE_NUM is the number of data points; BATCH_SIZE is the number of samples used in each mini-batch gradient descent step; TRAIN_STEP is the number of iterations; X_INPUT is the constructed range of x values; Y_OUTPUT_CORRECT is the corresponding true y value, generated from the mapping y = 8x with uniform noise in (-10, 10) added to each point. Finally, train_func is the model function used later by the gradient descent routines to find the best value of k.

5.2 BGD regression

k_BGD = 0.0          # initial value of k
k_BGD_RECORD = [0]   # record of k after each training step
for step in range(TRAIN_STEP):
    SUM_BGD = 0
    for index in range(len(X_INPUT)):
        # accumulate the gradient contribution of every sample
        SUM_BGD += (train_func(X_INPUT[index], k_BGD) - Y_OUTPUT_CORRECT[index]) * X_INPUT[index]
    k_BGD -= LEARNING_RATE * SUM_BGD
    k_BGD_RECORD.append(k_BGD)

k_BGD is the initial value of k, and k_BGD_RECORD records how k changes during gradient descent. The outer loop runs over the 150 training steps; the inner loop computes the gradient contribution of each sample in turn and sums all of them. SUM_BGD therefore corresponds to the summation term in the derivative of the loss function.

The loss function is: $\frac{1}{2m} \sum (kx - y_{true})^2$
Its derivative with respect to $k$ is: $\frac{1}{m} \sum (kx - y_{true}) \times x$

The factor $\frac{1}{m}$ is not written in the code, because it is a constant and can be folded into the learning rate. (So compared with stochastic gradient descent, the learning rate of batch gradient descent should be smaller, otherwise it will learn too fast.)

5.3 SGD regression

k_SGD = 0.0
k_SGD_RECORD = [0]
for step in range(TRAIN_STEP*10):
    # pick one random sample and update k with its gradient only
    index = np.random.randint(len(X_INPUT))
    SUM_SGD = (train_func(X_INPUT[index], k_SGD) - Y_OUTPUT_CORRECT[index]) * X_INPUT[index]
    k_SGD -= LEARNING_RATE * SUM_SGD
    if step % 10 == 0:
        k_SGD_RECORD.append(k_SGD)

SGD has only a single loop, because it uses only one sample per update; to keep the comparison with the other methods reasonable, its number of iterations is expanded by a factor of 10. k is updated after the gradient of each individual sample is computed. Since each update changes k only slightly, we record the value of k only every 10 steps.

5.4 MBGD regression

k_MBGD = 0.0
k_MBGD_RECORD = [0]
for step in range(TRAIN_STEP):
    SUM_MBGD = 0
    # randomly choose the starting index of a contiguous mini-batch
    index_start = np.random.randint(len(X_INPUT) - BATCH_SIZE)
    for index in np.arange(index_start, index_start+BATCH_SIZE):
        SUM_MBGD += (train_func(X_INPUT[index], k_MBGD) - Y_OUTPUT_CORRECT[index]) * X_INPUT[index]
    k_MBGD -= LEARNING_RATE * SUM_MBGD
    k_MBGD_RECORD.append(k_MBGD)

The MBGD code is very similar to the BGD code, except that each step uses only BATCH_SIZE samples. Note that the randomly chosen starting index of the mini-batch is drawn from the range 0 to len(X_INPUT) - BATCH_SIZE, so that the selected batch never runs past the end of the data.

5.5 Comparative plots of the different methods

import matplotlib.pyplot as plt

plt.plot(np.arange(TRAIN_STEP+1), np.array(k_BGD_RECORD), label='BGD')  
plt.plot(np.arange(TRAIN_STEP+1), k_SGD_RECORD, label='SGD')  
plt.plot(np.arange(TRAIN_STEP+1), k_MBGD_RECORD, label='MBGD')  
plt.legend()  
plt.ylabel('K')  
plt.xlabel('step')  
plt.show()

[Figure: evolution of the estimated K value over training steps for BGD, SGD, and MBGD]

It can be seen that BGD converges fastest, because each of its updates uses ten times as much data as the other two methods. In general the BGD curve is smoother, while the other two methods occasionally deviate from the correct value. However, BGD cannot avoid local optima in general; since this function has no local optima, all three methods end up fitting reasonably well.


Origin blog.csdn.net/nkufang/article/details/129728761