Machine Learning - Gradient Descent Algorithm

4.3.3 Gradient Calculation

    Gradient descent is widely used in machine learning, whether in linear regression or logistic regression. The gradient of a function at a point is the direction along which the directional derivative takes its maximum value, that is, the direction in which the function increases fastest. The main purpose of gradient descent is to iteratively find the minimum of the objective function, or at least converge to a minimum.

This article starts from a downhill scenario: it first presents the basic idea of the gradient descent algorithm, then explains the principle of the algorithm mathematically and why the gradient is used, and finally implements a simple example of the gradient descent algorithm.

    The basic idea of the gradient descent method can be compared to walking down a mountain. Imagine this scenario: a person is stranded on a mountain and needs to get down, that is, to find the lowest point of the mountain, the valley. But dense fog on the mountain has reduced visibility, so the path down cannot be seen, and the climber must use the information around them to find the way down step by step. The gradient descent algorithm can help here: take the current position as the reference, find the steepest downhill direction at that position, take a step in that direction, then repeat from the new position, again finding the steepest descent and stepping along it, until finally reaching the lowest point. The same idea applies to climbing up the mountain, in which case it becomes the gradient ascent algorithm.

The basic process of gradient descent is very similar to the downhill scene.

First, we have a differentiable function, which represents the mountain. Our goal is to find its minimum value, which is the bottom of the mountain. Following the scenario above, the fastest way down is to find the steepest direction at the current position and walk downhill along it. For a function, this means computing the gradient at the given point and then moving in the direction opposite to the gradient, which makes the function value decrease fastest, because the gradient direction is the direction in which the function changes fastest (explained in detail later). By repeating this procedure, computing the gradient and stepping against it, we eventually reach a local minimum, just like the descent down the mountain. Finding the gradient determines the steepest direction; it is the means of measuring direction in the scene. So why is the gradient direction the steepest direction? Let us start from differentiation:

 

First, consider a question: given a function f(x) as shown above, find the minimum value of f(x). From university mathematics we know that we can differentiate the function, solve for its stationary points, and compare the function values at those points to obtain the minimum. But this is obviously not the approach we want: we are studying machine learning, and our goal is to let the computer obtain the result.

  Besides the analytical method above, college mathematics also offers numerical methods, such as the following downhill method.

Step 1: Assign an initial value x = x_0

Step 2: Randomly generate an increment ∆x (the direction of the increment is also random)

Step 3: If f(x + ∆x) < f(x), then set x = x + ∆x

Step 4: Repeat steps 2 and 3 until convergence
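The four steps above can be sketched in Python as follows (the test function, starting point, and step size are illustrative assumptions, not from the original):

```python
import random

def downhill_1d(f, x0, step=0.1, max_iter=10000):
    """Random-increment downhill method: try a random step, keep it only if f decreases."""
    x = x0
    for _ in range(max_iter):
        dx = random.uniform(-step, step)  # Step 2: random increment, random direction
        if f(x + dx) < f(x):              # Step 3: accept only downhill moves
            x = x + dx
    return x

# Example: minimize f(x) = (x - 2)^2 starting from x0 = 10 (illustrative choice)
random.seed(0)
x_min = downhill_1d(lambda t: (t - 2) ** 2, 10.0)
print(x_min)  # close to the true minimizer x = 2
```

Note that every accepted step strictly decreases f, so the iterate can only move toward the minimum, but many random proposals are wasted, which is exactly the inefficiency discussed below.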

For a multivariate function f(x, y) (taking the bivariate case as an example), there is an analogous downhill algorithm:

Step 1: Assign initial values x = x_0, y = y_0

Step 2: Randomly generate increments ∆x, ∆y (their directions are also random)

Step 3: If f(x + ∆x, y + ∆y) < f(x, y), then set x = x + ∆x, y = y + ∆y

Step 4: Repeat steps 2 and 3 until convergence
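The bivariate version follows the same pattern; here is a minimal sketch (the test function and starting point are assumed examples):

```python
import random

def downhill_2d(f, x0, y0, step=0.1, max_iter=20000):
    """Random-increment downhill method for a bivariate function f(x, y)."""
    x, y = x0, y0
    for _ in range(max_iter):
        dx = random.uniform(-step, step)  # random increments in both coordinates
        dy = random.uniform(-step, step)
        if f(x + dx, y + dy) < f(x, y):   # accept the move only if f decreases
            x, y = x + dx, y + dy
    return x, y

# Example: minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose minimum is at (1, -2)
random.seed(0)
x_min, y_min = downhill_2d(lambda u, v: (u - 1) ** 2 + (v + 2) ** 2, 5.0, 5.0)
print(x_min, y_min)  # close to (1, -2)
```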

Using this downhill method to find a minimum, we can identify several problems:

  1. The choice of the initial value x_0 strongly affects the convergence speed
  2. The direction of the increment ∆ is randomly generated, which is inefficient; moreover, if the step size is too small the computation is slow, and if it is too large the extremum is easily overshot
  3. It easily falls into a local optimum (when the function has multiple troughs)
  4. It cannot handle "plateau" and "ridge" cases

(Figure: the "plateau" situation)

The downhill method can be improved in one key respect: use the properties of the function itself to determine the increment direction and search step size of each iteration.

Note that after the k-th iteration the independent variable takes the value x = x_k. Expanding f(x) at x_k with the Taylor formula gives:

f(x) = f(x_k) + f'(x_k)(x − x_k) + R(x)

After the (k+1)-th iteration we want f(x_{k+1}) < f(x_k) (since we are seeking the minimum), so we need

f(x_{k+1}) − f(x_k) ≈ f'(x_k)(x_{k+1} − x_k) < 0

If f'(x_k) > 0, we need x_{k+1} − x_k < 0; if f'(x_k) < 0, we need x_{k+1} − x_k > 0.

Based on this requirement, we can construct an x_{k+1} that satisfies it:

x_{k+1} − x_k = −f'(x_k)

Substituting into f(x_{k+1}) − f(x_k) ≈ f'(x_k)(x_{k+1} − x_k),

we can verify by the following inequality that the iteration direction is always correct, that is, the downhill direction:

f(x_{k+1}) − f(x_k) ≈ −[f'(x_k)]² < 0, so the function value decreases at every iteration.

In addition, if |f'(x_k)| is very large, x_{k+1} will deviate far from x_k and the Taylor expansion will no longer hold, so a small coefficient γ > 0 is usually introduced to reduce the offset:

x_{k+1} − x_k = −γ f'(x_k)

So we get the iterative formula

x_{k+1} = x_k − γ f'(x_k)

For multivariate functions, an analogous analysis using the multivariate Taylor formula yields the iterative formula

x_{k+1} = x_k − γ g(x_k)

where x = (x_1, …, x_n)^T is an n-dimensional vector, g(x) = ∇f(x) is the gradient (vector) of the multivariate function f(x), and γ > 0 is called the step size or learning rate. This optimization method is therefore called the gradient descent method.
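The multivariate iteration x_{k+1} = x_k − γ g(x_k) can be sketched as follows (the quadratic test function, its gradient, and all parameter values are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, tol=1e-8, max_iter=10000):
    """Iterate x_{k+1} = x_k - gamma * g(x_k) until the update becomes negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = gamma * grad(x)            # step against the gradient direction
        x = x - step
        if np.linalg.norm(step) < tol:    # stop when the update is tiny
            break
    return x

# Assumed test function f(x, y) = (x - 1)^2 + 2*(y + 3)^2, with the gradient below
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
x_star = gradient_descent(grad, [10.0, 10.0])
print(x_star)  # approximately [1, -3]
```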

Comparing the gradient descent method with the downhill method:

  1. It solves the problem of choosing the update direction of the independent variable in each iteration, so it is computationally more efficient
  2. The step size can be determined by one-dimensional line search, that is, by minimizing the univariate function f(x_k − γ g(x_k)) over γ > 0 (γ being the only variable). In practice, a small constant step size is usually set based on experience.
  3. The gradient descent method itself cannot resolve the global-versus-local optimum problem caused by the choice of initial value.
  4. To find the maximum of a function, the derivation is similar but the sign in the iterative formula flips: x_{k+1} = x_k + γ g(x_k). This is called the gradient ascent method.
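Point 2 can be illustrated with a crude line search that scans a grid of candidate γ values and keeps the one minimizing f(x_k − γ g(x_k)); the test function, grid, and iteration count here are assumed for illustration:

```python
import numpy as np

def line_search_step(f, x, g, gammas):
    """Return the gamma from a candidate grid that minimizes f(x - gamma * g)."""
    values = [f(x - gamma * g) for gamma in gammas]
    return gammas[int(np.argmin(values))]

# Assumed quadratic test function with minimum at (1, -3), and its gradient
f = lambda v: (v[0] - 1) ** 2 + 2 * (v[1] + 3) ** 2
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])

x = np.array([10.0, 10.0])
gammas = np.linspace(0.01, 1.0, 100)  # candidate step sizes
for _ in range(50):
    g = grad(x)
    gamma = line_search_step(f, x, g, gammas)  # best candidate step this iteration
    x = x - gamma * g
print(x)  # approximately [1, -3]
```

A grid scan is the simplest possible line search; in practice more refined schemes (e.g. backtracking) are used, but as point 2 notes, a small constant step chosen by experience is the most common choice.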

4.3.4 Gradient descent method to find the minimum value

    We use Python to implement a simple gradient descent algorithm to find the minimum of a loss function and fit a regression line. The scenario is a simple linear regression example: suppose we have a series of points (a data set), as shown in the figure below:

We will use gradient descent to fit a straight line to these points. First, we need to define a cost function; here we choose the mean squared error cost function (also known as the squared error cost function): J = (1/(2m)) · Σ (h(x_i) − y_i)²

The terms of the formula are as follows:

  (1) m is the number of data points in the data set, that is, the number of samples

  (2) ½ is a constant factor: when the gradient is computed, the 2 brought down from the square cancels this ½, so no extra constant coefficient remains, which simplifies subsequent calculations without affecting the result

  (3) y is the true y-coordinate of each point in the data set, that is, the label

  (4) h is our prediction function (hypothesis function), which computes a predicted y value from each input x, namely h(x) = θ0 + θ1·x
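Assuming the (missing) formulas are the standard ones matching these annotations, that is, J(θ0, θ1) = (1/(2m)) · Σ (h(x_i) − y_i)² with h(x) = θ0 + θ1·x, the cost and its gradient can be sketched as follows (the data set here is an illustrative noiseless line, not the article's data):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """Prediction function h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared-error cost J = (1 / (2m)) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    err = hypothesis(theta0, theta1, x) - y
    return np.sum(err ** 2) / (2 * m)

def gradients(theta0, theta1, x, y):
    """Partial derivatives of J; the 1/2 cancels the 2 from differentiating the square."""
    m = len(x)
    err = hypothesis(theta0, theta1, x) - y
    return np.sum(err) / m, np.sum(err * x) / m

# Fit the noiseless line y = 2x + 1 (illustrative data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
t0, t1, gamma = 0.0, 0.0, 0.1
for _ in range(5000):
    g0, g1 = gradients(t0, t1, x, y)
    t0, t1 = t0 - gamma * g0, t1 - gamma * g1
print(t0, t1)  # approximately (1, 2)
```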

With the cost function, its gradient, and the form of the prediction function clarified, we can start writing code. For ease of understanding, we simplify further and assume the cost function is the convex function y = 0.5 * (x − 0.25)^2, which opens upward. The full code is as follows:

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['font.family'] = 'sans-serif'
mpl.rcParams['font.sans-serif'] = 'SimHei'
mpl.rcParams['axes.unicode_minus'] = False

# Construct a one-dimensional loss function
def f1(x):
    return 0.5 * (x - 0.25) ** 2  # assumed loss function, opening upward

def h1(x):
    return 0.5 * 2 * (x - 0.25)  # first derivative of f1

# Solve with the gradient descent method
GD_X = []  # x value of each iterate
GD_Y = []  # y value of each iterate
x = 4  # x0, the position where the iteration starts
alpha = 0.5  # step size (learning rate)
f_change = f1(x)  # change in y between iterations, used as the stopping criterion
f_current = f_change  # current y value
GD_X.append(x)
GD_Y.append(f_current)

iter_number = 0  # iteration counter
while f_change > 1e-10 and iter_number < 100:  # stop when the change in y is tiny; cap iterations
    iter_number += 1
    x = x - alpha * h1(x)  # one gradient descent update
    tmp = f1(x)  # y value after this iteration
    f_change = np.abs(f_current - tmp)  # how much y changed in this iteration
    f_current = tmp  # update the current y value
    GD_X.append(x)
    GD_Y.append(f_current)

print('The final result: (%.5f, %.5f)' % (x, f_current))
print('The number of iterations is: %d' % iter_number)
print(GD_X)

# Build data for plotting
X = np.arange(-4, 4.5, 0.05)
Y = np.array(list(map(lambda t: f1(t), X)))

# Plot the loss curve and the iteration trajectory
plt.figure(facecolor='w')
plt.plot(X, Y, 'r-', linewidth=2)
plt.plot(GD_X, GD_Y, 'bo--', linewidth=2)
plt.title('Loss function $y=0.5*(x-0.25)^2$; \nLearning rate: %.3f; Final solution: (%.3f, %.3f); Number of iterations: %d'
          % (alpha, x, f_current, iter_number))
plt.show()
```

The result of running the code is as follows:

 

The code is quoted from "Mathematics of Vernacular Machine Learning"


Origin blog.csdn.net/xsh_roy/article/details/121454674