Gradient descent (repost)

A plain-language introduction to the gradient descent method and its implementation

  • A scenario for gradient descent
  • Gradient
  • Mathematical explanation of the gradient descent algorithm
  • Examples of the gradient descent algorithm
  • Gradient descent algorithm implementation
  • Further reading

This article starts from a downhill scenario: it first presents the basic idea of the gradient descent algorithm, then explains the principle of gradient descent mathematically, and finally implements a simple example of the gradient descent algorithm!

A scenario for gradient descent

The basic idea of the gradient descent method can be illustrated by analogy with walking down a mountain. Assume the following scenario: a person is trapped on a mountain and needs to get down (i.e. to find the lowest point of the mountain, the valley). However, the mountain is covered in thick fog and visibility is very low, so the path down cannot be seen; he must use the information around him to find a way down. This is where the gradient descent algorithm can help. Concretely, he takes his current position as a reference, finds the steepest direction at that position, and walks downhill along it (if the goal were instead the summit, he would walk in the steepest upward direction). After walking a certain distance, he repeats the same procedure, and eventually he reaches the valley.


 
[Figure: the downhill scenario]

We can further assume that the steepest direction cannot be seen at a glance but has to be measured with a sophisticated instrument, and that the person happens to have such an instrument. So every time he has walked some distance, he stops to measure the steepest direction at his current location, which is time-consuming. To reach the bottom before the sun goes down, he must reduce the number of measurements as much as possible. This is a trade-off: measuring frequently guarantees that the downhill direction stays correct but costs a lot of time, while measuring too rarely risks going off track. He therefore needs a suitable measuring frequency, one that keeps the direction correct without costing too much time!

Gradient descent

The basic process of gradient descent is very similar to the downhill scenario.


First of all, we have a differentiable function. This function represents the mountain. Our goal is to find the minimum of this function, which is the base of the mountain. According to the scenario above, the fastest way down is to find the steepest direction at the current position and then walk downhill along it; for a function, this corresponds to finding the gradient at a given point and then moving in the direction opposite to the gradient, which makes the value of the function decrease fastest, because the gradient direction is the direction in which the function changes fastest (explained later).
So we repeat this method: compute the gradient again and again, and finally we reach a local minimum, which is analogous to our walk down the mountain. Computing the gradient determines the steepest direction, which corresponds to the measuring instrument in the scenario. Why is the gradient direction the steepest direction? We start with the differential.

Differential

The meaning of the differential (derivative) can be viewed from different angles; the two most common are:

  • The slope of the tangent line at a point on the graph of the function
  • The rate of change of the function
A few examples of differentiation:


     
[Formulas: examples of single-variable derivatives]

The examples above are derivatives of single-variable functions. When a function has several variables, we have multivariable differentiation, i.e. we differentiate with respect to each variable separately.


 
[Formulas: examples of multivariable (partial) derivatives]
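As a small illustration of differentiating with respect to each variable separately (my own sketch, not from the original article), the snippet below uses the sympy library on an assumed example function f(x, y) = x²·y + y³:

import sympy as sp

# An assumed two-variable example function (for illustration only).
x, y = sp.symbols('x y')
f = x**2 * y + y**3

# Differentiate with respect to each variable separately.
df_dx = sp.diff(f, x)   # -> 2*x*y
df_dy = sp.diff(f, y)   # -> x**2 + 3*y**2

print(df_dx, df_dy)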

Gradient

The gradient is in fact the generalization of the derivative to multivariable functions.
For example:


 
[Formula: an example gradient]

We can see that the gradient is obtained by differentiating with respect to each variable separately and then collecting the results, separated by commas, inside angle brackets <>; this shows that the gradient is in fact a vector.

The gradient is a very important concept in calculus. Its meaning, mentioned earlier, is:

  • For a single-variable function, the gradient is just the derivative, representing the slope of the tangent line at a given point
  • For a multivariable function, the gradient is a vector; a vector has a direction, and the gradient's direction points in the direction in which the function rises fastest at the given point

This also explains why we need to compute the gradient at every step! To reach the bottom of the hill, we need to observe the steepest direction at every step, and the gradient tells us exactly that. The gradient direction is the direction in which the function rises fastest at a given point, so the opposite direction of the gradient is the direction in which the function falls fastest at that point, which is exactly what we need. So we just keep walking against the gradient direction, and we arrive at a local lowest point!


 
[Figure]

Mathematical explanation of the gradient descent algorithm

We have spent quite a few paragraphs above on the scenario, its assumptions, the basic idea of the gradient descent algorithm, and the concept of the gradient. Now we begin to explain the calculation process and the idea of the gradient descent algorithm mathematically!


 
Θ1 = Θ0 − α∇J(Θ0)

The meaning of this formula is: J is a function of Θ, and our current position is the point Θ0; from this point we want to walk to the minimum point of J, which is the base of the mountain. First we determine the direction to move in, which is the reverse of the gradient; then we walk a distance given by the step size α; once this step is finished, we have arrived at the point Θ1!


 
[Figure]
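To make the formula concrete, here is a minimal one-step sketch (my own addition, with an assumed gradient function grad_J and assumed values for Θ0 and α):

import numpy as np

def grad_J(theta):
    # Assumed example gradient: here J(theta) = theta[0]**2 + theta[1]**2, so grad J = 2 * theta.
    return 2 * theta

theta0 = np.array([1.0, 3.0])              # current position Theta0 (assumed)
alpha = 0.1                                # learning rate / step size (assumed)

theta1 = theta0 - alpha * grad_J(theta0)   # walk against the gradient
print(theta1)                              # the new position Theta1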

Here are a few common questions about this formula:

  • What does α mean?
    In the gradient descent algorithm, α is called the learning rate or step size. It means that we control the distance of each step through α, making sure the steps are not so large that we stride right past the lowest point, but also not so small that we are still far from the bottom when the sun goes down. So the choice of α is often very important in gradient descent! α should be neither too large nor too small: too small, and it may take very long to reach the lowest point; too large, and we may overshoot the lowest point! (A small sketch after this list illustrates both cases.)
 
[Figure: the effect of the learning rate]
  • Why is the gradient multiplied by a negative number?
    Putting a minus sign in front of the gradient means moving in the opposite direction of the gradient! As mentioned at the start, the gradient direction is the direction in which the function grows fastest at a point, while we need to move in the direction of fastest descent, which is naturally the negative gradient direction; hence the minus sign.
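As a rough illustration of the learning rate trade-off described above (my own sketch, assuming the simple function J(θ) = θ², whose minimum is at θ = 0):

def descend(alpha, theta=1.0, steps=10):
    """Run a few gradient descent steps on the assumed function J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # J'(theta) = 2 * theta
    return theta

print(descend(alpha=0.01))   # too small: after 10 steps theta is still far from 0
print(descend(alpha=0.4))    # suitable: theta ends up very close to 0
print(descend(alpha=1.1))    # too large: theta overshoots and keeps growing, i.e. it diverges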

Examples of the gradient descent algorithm

Now that we have a basic understanding of the calculation process of the gradient descent algorithm, let's look at a few small examples, starting with a single-variable function.

Gradient descent on a single-variable function

Suppose we have a single-variable function:


 
[Formula: the single-variable function J(θ)]

Its derivative is:


 
[Formula: the derivative of J(θ)]

We initialize at the starting point:
 
[Formula: the starting point θ0]

The learning rate is:


 
[Formula: the learning rate α]

According to the gradient descent formula:
 
[Formula: the gradient descent update rule]

We begin the iterative process of gradient descent:
 
[Formulas: the successive iteration values of θ]

As shown, after four operations, i.e. four steps, we have basically reached the lowest point of the function, the base of the mountain.
 
[Figure: the descent steps plotted on the function curve]
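The concrete function and numbers of this example live in the images above, which are not reproduced here. As a stand-in, the following minimal sketch runs the same four steps assuming J(θ) = θ², a starting point of θ = 1 and a learning rate of 0.4 (all assumed values, not necessarily those of the original figures):

theta = 1.0    # assumed starting point
alpha = 0.4    # assumed learning rate

for step in range(4):
    theta = theta - alpha * 2 * theta    # J'(theta) = 2 * theta
    print(step + 1, theta)
# theta shrinks quickly toward the minimum of the assumed function at theta = 0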

Gradient descent on a multivariable function

Suppose we have an objective function:


 
[Formula: the objective function J(Θ)]

Now we want to compute the minimum of this function by gradient descent. By inspection we can see that the minimum is actually the point (0, 0). Nevertheless, let's compute it step by step with the gradient descent algorithm!
We assume the initial starting point is:


 
[Formula: the initial point Θ0]

The initial learning rate is:
 
[Formula: the learning rate α]

The gradient of the function is:


 
[Formula: the gradient ∇J(Θ)]

Multiple iterations:
 
[Formulas: the successive iteration values of Θ]

We find that we have basically approached the minimum point of the function:
 
[Figure: the descent path toward the minimum at (0, 0)]
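Again, the concrete objective function and numbers are in the images above. As a stand-in, here is a minimal sketch that iterates toward the (0, 0) minimum mentioned above, assuming the objective J(Θ) = θ₁² + θ₂² and assumed values for the starting point and the learning rate:

import numpy as np

def grad(theta):
    # Gradient of the assumed objective J(theta) = theta[0]**2 + theta[1]**2.
    return 2 * theta

theta = np.array([1.0, 3.0])    # assumed starting point
alpha = 0.1                     # assumed learning rate

for _ in range(50):
    theta = theta - alpha * grad(theta)

print(theta)    # both components are now very close to the minimum at (0, 0)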

Gradient descent algorithm implementation

Below we will implement a simple gradient descent algorithm in Python. The scenario is a simple linear regression example: suppose we have a series of points, as shown in the figure below.

 
[Figure: scatter plot of the data points]

 

We will use the gradient descent method to fit a straight line to these points!

First, we need to define a cost function; here we use the mean squared error cost function:

 
J(Θ) = (1 / 2m) · Σ (hΘ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²,   where the sum runs over the m data points

 

In this formula:

  • m is the number of points in the data set
  • ½ is a constant factor: when taking the gradient, the 2 coming from differentiating the square cancels it, so no extra constant coefficient is left over, which makes the later computation easier and does not affect the result
  • y is the true y-coordinate of each point in the data set
  • h is our prediction function, which computes the predicted y value from each input x according to Θ, i.e.


     
    hΘ(x⁽ⁱ⁾) = Θ₀ + Θ₁·x⁽ⁱ⁾

We can see that the cost function has two variables, so this is a multivariable gradient descent problem. To find the gradient of the cost function, we differentiate with respect to each of the two variables:


 
∂J/∂Θ₀ = (1/m) · Σ (hΘ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
∂J/∂Θ₁ = (1/m) · Σ (hΘ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾

Having defined the cost function, its gradient, and the form of the prediction function, we can start writing code. Before that, it is worth pointing out that, to make the code easier to write, we convert all the formulas into matrix form. Matrix computation in Python is very convenient, and the resulting code is very simple.

To convert to matrix form, first look at the form of the prediction function:


 
hΘ(x⁽ⁱ⁾) = Θ₀ + Θ₁·x⁽ⁱ⁾

It has two parameters. To write this formula in matrix form, we add an extra dimension x₀ to every point, whose value is fixed at 1; this dimension is the one multiplied by Θ₀. This lets us treat the computation uniformly as matrix operations:


 
hΘ(x⁽ⁱ⁾) = Θ₀·x₀⁽ⁱ⁾ + Θ₁·x₁⁽ⁱ⁾  (with x₀⁽ⁱ⁾ = 1), i.e. h = XΘ in matrix form

We then also convert the gradient of the cost function into matrix-vector form:


 
∇J(Θ) = (1/m) · Xᵀ (XΘ − y)
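As a quick sanity check (my own addition, not part of the original article), the sketch below compares the per-parameter gradient formulas with the matrix-vector form on a tiny made-up data set; the numbers are arbitrary, and the two results should agree:

import numpy as np

# Tiny made-up data set with the dummy x0 = 1 column added.
X = np.hstack((np.ones((4, 1)), np.arange(1, 5).reshape(4, 1)))
y = np.array([2.0, 4.0, 6.0, 8.0]).reshape(4, 1)
theta = np.array([[0.5], [1.5]])
m = X.shape[0]

# Per-parameter form: differentiate with respect to Theta0 and Theta1 separately.
h = np.dot(X, theta)
grad_loop = np.array([[np.sum(h - y) / m],
                      [np.sum((h - y) * X[:, [1]]) / m]])

# Matrix-vector form: (1/m) * X^T (X Theta - y).
grad_matrix = (1. / m) * np.dot(np.transpose(X), np.dot(X, theta) - y)

print(np.allclose(grad_loop, grad_matrix))   # True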

Coding time

First of all, we need to define the data set and the learning rate.

import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m+1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
    12, 11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

Next we define the cost function and the gradient of the cost function, both in matrix-vector form.

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / (2 * m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / m) * np.dot(np.transpose(X), diff)

The last part is the core of the algorithm: the gradient descent iteration.

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

When every component of the gradient is smaller than 1e-5, the search has entered a relatively flat region, similar to the floor of the valley; further iterations have little effect, so at that point we can exit the loop!
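The loop above stops only when the gradient threshold is reached. As a defensive variant (my own addition, not part of the original code), one could also cap the number of iterations so the loop is guaranteed to terminate even if the learning rate is chosen badly; it reuses the gradient_function defined above:

def gradient_descent_capped(X, y, alpha, max_iter=100000, tol=1e-5):
    '''Gradient descent with an iteration cap in addition to the gradient threshold.'''
    theta = np.array([1, 1]).reshape(2, 1)
    for _ in range(max_iter):
        gradient = gradient_function(theta, X, y)
        if np.all(np.absolute(gradient) <= tol):
            break
        theta = theta - alpha * gradient
    return theta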

The complete code is as follows

import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m+1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
    12, 11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / (2 * m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / m) * np.dot(np.transpose(X), diff)

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

optimal = gradient_descent(X, y, alpha)
print('optimal:', optimal)
print('error function:', error_function(optimal, X, y)[0,0])

Running the code produces the following result:


 
[Output: the optimal Θ and the final value of the error function]

The fitted straight line is as follows


 
[Figure: the fitted straight line over the data points]
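The original figure is not reproduced here; a minimal sketch of how one might draw the points and the fitted line with matplotlib (assuming the X, y and optimal variables from the code above) could look like this:

import matplotlib.pyplot as plt

# Scatter the raw points (x1 column of X against y) and draw the fitted line X * optimal.
plt.scatter(X[:, 1], y.flatten(), label='data points')
plt.plot(X[:, 1], X.dot(optimal).flatten(), color='red', label='fitted line')
plt.legend()
plt.show()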

Summary

So far, we have finished the basic introduction to the idea and process of the gradient descent algorithm, and used a Python case study to implement a simple gradient descent algorithm that fits a straight line!
Finally, let's return to the scenario assumed at the beginning of the article:
the person walking downhill represents the back-propagation algorithm; the downhill path represents the parameters Θ that the algorithm keeps searching for; the steepest direction at the current point on the mountain is the direction of the gradient of the cost function at that point; and the tool used to observe the steepest direction is differentiation. The time interval before the next observation is the learning rate α defined in our algorithm.
We can see that the assumptions of the scenario correspond nicely to the gradient descent algorithm!

Further reading


Origin www.cnblogs.com/rswss/p/11447017.html