Derivation of the gradient descent algorithm (machine learning must-read 02)

Key points:

1. The core of gradient descent is to take the derivative at the current point, compute the next value by moving the current value by the learning rate (step size) times that derivative, and then take the derivative at the new point, repeating until convergence.

1 Unconstrained Optimization Problem

An unconstrained optimization problem refers to selecting the optimal solution, according to some criterion, from all possible candidate solutions to a problem. Mathematically, optimization studies the minimization or maximization of a functional $J(\theta)$ over a given set $S$. In a broad sense, optimization includes mathematical programming, graphs and networks, combinatorial optimization, inventory theory, decision theory, queuing theory, optimal control, and so on. In a narrow sense, optimization refers only to mathematical programming.
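Written out explicitly (a conventional restatement; this notation is not spelled out in the original text), the unconstrained minimization problem is simply

$$\min_{\theta \in S} J(\theta)$$

with no equality or inequality constraints on $\theta$ other than membership in the set $S$.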

1.1 Gradient descent

Gradient descent is not a specific algorithm for one regression task the way multiple linear regression is; it is a very general optimization algorithm that helps many machine learning algorithms (all unconstrained optimization problems) find the optimal solution. "General" means that many machine learning algorithms, and even deep learning, use gradient descent to find the optimal solution. The purpose of all optimization algorithms is to solve for the model parameters $\theta$ as quickly as possible, and gradient descent is a classic and commonly used one.

  The reason the $\theta$ previously solved with the normal equation is the optimal solution is that the MSE loss function is a convex function. However, not every machine learning loss function is convex: setting the derivative to 0 can yield many extrema, and a unique solution cannot be determined that way.

Previously we set the derivative to 0 and solved directly for the lowest point $\theta$; the gradient descent method instead approaches the optimal solution step by step!
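For reference, the closed-form result alluded to above is the standard normal equation for linear regression with MSE loss (textbook form, quoted here for context rather than taken from the original article):

$$\hat{\theta} = (X^{\top}X)^{-1} X^{\top} y$$

It requires $X^{\top}X$ to be invertible and becomes expensive when there are many features, which is another reason an iterative method such as gradient descent is attractive.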

1.2 Gradient descent formula

The core of gradient descent is to take the derivative at the current point, compute the next value by moving the current value by the learning rate (step size) times that derivative, and then take the derivative at the new point.
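Since this section is titled "Gradient descent formula", here is the standard update rule in the usual notation (the original text does not write it out, so this is the conventional form): for each parameter $w_j$,

$$w_j^{(t+1)} = w_j^{(t)} - \eta \, \frac{\partial J(\theta)}{\partial w_j}$$

where $\eta$ is the learning rate discussed in the next subsection; the minus sign moves $w_j$ against the gradient, i.e. downhill.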

1.3 Learning rate

According to the gradient descent formula above, $\eta$ is the learning rate. If you set a large learning rate, each adjustment to $w_j$ will be larger; set a small learning rate and each adjustment will be smaller. However, if the steps are too big there will be problems; as the saying goes, if the stride is too long, it is easy to pull a muscle! With a large learning rate you may step right across the curve to the other side (jumping from the left half of the curve to the right half), then step back across again on the next update, causing the iterates to oscillate back and forth. If the steps are too small, you creep forward like a snail, which increases the total number of iterations.

Setting the learning rate is a science. Generally we set it to a relatively small positive number; 0.1, 0.01, 0.001, and 0.0001 are all common values (and then we adjust as appropriate). Under normal circumstances the learning rate stays fixed during the whole iteration process, but it can also be scheduled to shrink as the number of iterations increases, because the closer we get to the valley, the smaller the steps we can take to reach the lowest point more accurately and avoid stepping past it. Some deep learning optimization algorithms also control and adjust the learning rate on their own; those strategies are covered one by one in the code walkthroughs later in this series.
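As a small illustration of a shrinking learning rate (a minimal sketch, not code from the original article; the starting point, base step, and decay constant are all assumptions), a schedule that decays with the iteration count could look like this:

```python
# Same toy function used later in this article, and its derivative
f = lambda x: (x - 3.5) ** 2 - 4.5 * x + 10
d = lambda x: 2 * (x - 3.5) - 4.5

x = 10.0          # arbitrary starting point (assumption)
base_step = 0.4   # deliberately large initial step (assumption)
for t in range(50):
    step = base_step / (1 + 0.1 * t)   # learning rate shrinks as iterations increase
    x -= step * d(x)                   # gradient descent update

print(x)   # approaches the minimum at x = 5.75
```

Early iterations take big strides toward the valley; later iterations take small, careful steps so the iterate does not overshoot the lowest point.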

1.4 Randomly pick the initial value

  • If $\theta$ is initialized randomly and the algorithm starts from the left side of the curve, it will converge to a local minimum instead of the global minimum;

  • If $\theta$ is initialized randomly and the algorithm starts from the right side of the curve, it will take a long time to cross the plateau (a flat region of the function with a very small gradient), and if it stops too early, it will never reach the global minimum;

  The linear regression model's MSE loss function happens to be a convex function. Convexity guarantees that there is only one global minimum; it is also a continuous function whose slope does not change abruptly, so no matter where you start, gradient descent can approach the global minimum.
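To connect this back to the model parameters $\theta$ of linear regression (a minimal sketch, not code from the original article; the synthetic data, learning rate, and iteration count are assumptions), batch gradient descent on the MSE loss could look like this:

```python
import numpy as np

# Synthetic data: y = 4 + 3x + noise (assumed purely for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=100)

X_b = np.c_[np.ones((100, 1)), X]   # prepend a bias column of ones
theta = rng.normal(size=2)          # random initialization of w_0, w_1
eta = 0.1                           # learning rate
m = len(y)

for _ in range(1000):
    # Gradient of the MSE loss: (2/m) * X^T (X theta - y)
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients        # the update rule from section 1.2

print(theta)   # should end up close to [4, 3], the values used to generate the data
```

Because the MSE loss here is convex, the random initialization does not matter: any starting $\theta$ descends toward the same global minimum.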

1.5 Gradient Descent Steps

The gradient descent process is the process of "guessing" the correct answer:

  • 1. "Blind guess": randomly generate $\theta$, i.e. a set of values $w_0, w_1, \dots, w_n$, drawn from normally distributed data with mean $\mu = 0$ and variance $\sigma = 1$.

  • 2. Find the gradient g. The gradient represents the slope of the tangent line at a certain point of the curve. Going down the tangent line is equivalent to descending in the steepest direction of the slope

  • 3. If g < 0, $\theta$ becomes larger; if g > 0, $\theta$ becomes smaller. # Calculate the next value based on the slope

  • 4. Determine whether the algorithm has converged. If it has converged, exit the iteration; if not, go back to step 2 and repeat steps 2~4. The criterion for convergence is: as the iteration proceeds, the loss function Loss changes only very slightly or no longer changes at all, at which point it is considered converged. # Check whether the target precision has been reached. These steps are implemented in code in the next section.

2. Code to simulate gradient descent

2.1 Define the function to solve

import numpy as np
import matplotlib.pyplot as plt

# Function to solve (minimize)
f = lambda x: (x - 3.5) ** 2 - 4.5 * x + 10
# Its derivative
d = lambda x: 2 * (x - 3.5) - 4.5
# Learning rate (step size)
step = 0.1

2.2 Randomly pick the initial value

# Randomly pick an initial value
last_x = np.random.randint(0, 12, size=1)[0]
# Slope (derivative) at that point
dif = d(last_x)
# Take one step so that the "previous" and "current" values differ before the loop
x = last_x - dif * step
# Example output: previous value: 11, slope: 10.5, current value x: 9.95
print(f'previous value: {last_x}, slope: {dif}, current value x: {x}')

2.3 Iterative loop

# Precision (tolerance): real computation always carries some error, so we define
# an acceptable tolerance ourselves and use it to judge whether we have converged
precision = 1e-4
x_ = [x]
print(f'Start iterating, target precision: {precision}')
num = 0   # iteration counter
while True:
    num += 1
    # Exit condition: the required precision has been reached
    if np.abs(x - last_x) < precision:
        break
    # Remember the previous value, then step downhill (subtract to move toward the minimum)
    last_x = x
    x -= step * d(x)
    x_.append(x)
    print(f'------> iteration: {num:2}, slope: {d(last_x):.5f}, '
          f'difference: {(x - last_x):.5f}, previous value: {last_x:.5f}, updated value: {x:.5f}')
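A brief sanity check on the numbers (reasoning added here, not in the original): since d(x) = 2(x - 5.75), the update x -= step * d(x) with step = 0.1 shrinks the distance to the minimum at x = 5.75 by a factor of 0.8 each iteration, so with a starting point between 0 and 12 the loop typically meets the 1e-4 tolerance after a few dozen iterations.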

2.4 Visualize the calculation process

# Visualize the data
plt.rcParams['font.family'] = 'Kaiti'   # CJK font from the original post; optional if unavailable
plt.figure(figsize=(9, 6))
x = np.linspace(5.75 - 5, 5.75 + 5, 100)
y = f(x)
plt.plot(x, y, color='green')
plt.title('Gradient Descent', size=24, pad=15)
x_ = np.array(x_)
y_ = f(x_)
plt.scatter(x_, y_, color='red')
plt.show()

Notice:

  1. Gradient descent has a certain amount of error; it is not a perfect analytic solution~

  2. Within the allowable error range, a machine learning model obtained by gradient descent is well worth using!

  3. The step size of gradient descent cannot be too big; as the saying goes, if the stride is too long, it is easy to pull a muscle!

  4. The precision (tolerance) can be adjusted according to the actual situation.

  5. Inside the while True loop, gradient descent keeps iterating.

  6. The exit condition of the while loop is: the absolute value of the difference between x after the update and the previous x is less than the specified precision!


Origin blog.csdn.net/March_A/article/details/134102228