Gradient descent formula derivation


1. Gradient

 In calculus, we take the partial derivative of a multivariate function with respect to each of its parameters, and the vector formed by these partial derivatives is the gradient. For example, for the function $f(x, y)$, taking partial derivatives with respect to $x$ and $y$ gives the gradient vector $(\frac{∂f}{∂x}, \frac{∂f}{∂y})^T$, also written $\nabla f(x_0, y_0)$ when evaluated at a point $(x_0, y_0)$. For a function of three parameters, the gradient is $(\frac{∂f}{∂x}, \frac{∂f}{∂y}, \frac{∂f}{∂z})^T$, and so on.

 So what is the point of finding this gradient vector? Geometrically, the gradient points in the direction in which the function changes fastest. Specifically, for the function $f(x, y)$ at the point $(x_0, y_0)$, the direction of the gradient vector $(\frac{∂f}{∂x_0}, \frac{∂f}{∂y_0})^T$ is the direction in which $f(x, y)$ increases fastest. In other words, moving along the gradient makes it easiest to find the maximum of the function; conversely, moving in the opposite direction, $(-\frac{∂f}{∂x_0}, -\frac{∂f}{∂y_0})^T$, the function decreases fastest, which makes it easiest to find the minimum.
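
 As a quick illustration (not part of the original derivation), here is a minimal NumPy sketch that estimates a gradient numerically by central finite differences; the function $f(x, y) = x^2 + y^2$ and the evaluation point are assumptions chosen for demonstration:

```python
import numpy as np

def f(v):
    # Illustrative function f(x, y) = x^2 + y^2; its true gradient is (2x, 2y).
    x, y = v
    return x ** 2 + y ** 2

def numerical_gradient(func, point, h=1e-6):
    """Estimate the gradient of func at `point` by central finite differences."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (func(point + step) - func(point - step)) / (2 * h)
    return grad

print(numerical_gradient(f, [1.0, 2.0]))  # ≈ [2. 4.], i.e. (∂f/∂x, ∂f/∂y) at (1, 2)
```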

2. Gradient descent and gradient ascent

 In machine learning, when minimizing a loss function, we can use gradient descent to solve for the minimum of the loss function and the corresponding model parameter values iteratively, step by step. Conversely, if we need to find the maximum of the loss function, we use gradient ascent.

 Gradient descent and gradient ascent are interchangeable. For example, if we need to find the minimum of a loss function $f(\theta)$, we can solve it iteratively with gradient descent; but we could equally well find the maximum of $-f(\theta)$ instead, in which case gradient ascent comes into play.

 The gradient descent method is described in detail below.

3. Detailed explanation of gradient descent algorithm

3.1 Intuitive explanation of gradient descent

 Let's first look at an intuitive explanation of gradient descent. Suppose we are somewhere on a large mountain and do not know the way down, so we decide to take it one step at a time: at each position we compute the gradient, and take a step downhill along the negative gradient direction, that is, in the steepest descending direction from where we stand; then we repeat the process from the new position, again stepping in the steepest downhill direction. We continue step by step until we feel we have reached the foot of the mountain. Of course, if we proceed this way, we may not reach the foot of the mountain at all, but only a low point of some local part of the mountain.

 It can be seen from the above explanation that gradient descent does not necessarily find the global optimal solution; it may only find a local optimum. Of course, if the loss function is convex, the solution obtained by gradient descent is guaranteed to be the global optimum.

3.2 Related concepts of gradient descent

 Before getting into the details of the gradient descent algorithm, let's take a look at some related concepts.

  • Step size (learning rate): the step size determines how far we move in the negative gradient direction during each iteration of gradient descent. In the downhill example above, the step size is the length of the step taken along the steepest downhill direction from the current position.
  • Feature: the input part of a sample. For example, for two single-feature samples $(x^{(0)}, y^{(0)})$ and $(x^{(1)}, y^{(1)})$, the feature of the first sample is $x^{(0)}$ and its output is $y^{(0)}$.
  • Hypothesis function: in supervised learning, the function used to fit the input samples, denoted $h_{\theta}(x)$. For example, for $m$ single-feature samples $(x^{(i)}, y^{(i)}) (i=1,2,3,...m)$, one possible fitting function is: $h_{\theta}(x)=\theta_0 + \theta_1x$
  • Loss function: to evaluate how good a model is, a loss function is usually used to measure the goodness of fit. Minimizing the loss function means the fit is best, and the corresponding model parameters are the optimal parameters. In linear regression, the loss function is usually the square of the difference between the sample output and the hypothesis function. For example, for $m$ samples $(x^{(i)}, y^{(i)}) (i=1,2,3,...m)$, using linear regression the loss function is: $$J(\theta_0, \theta_1)=\sum_{i=1}^m(h_\theta(x_i)-y_i)^2$$ where $x_i$ is the feature of the $i$-th sample, $y_i$ is the output corresponding to the $i$-th sample, and $h_\theta(x_i)$ is the hypothesis function (a short numerical sketch follows this list).
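
 To make these concepts concrete, here is a small sketch (with made-up sample values) that evaluates the hypothesis function $h_\theta(x)=\theta_0+\theta_1x$ and the squared-error loss above:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def squared_error_loss(theta0, theta1, x, y):
    # J(theta_0, theta_1) = sum_i (h_theta(x_i) - y_i)^2
    return np.sum((hypothesis(theta0, theta1, x) - y) ** 2)

x = np.array([1.0, 2.0, 3.0])   # made-up sample features
y = np.array([2.0, 4.1, 5.9])   # made-up sample outputs
print(squared_error_loss(0.0, 2.0, x, y))   # loss for theta_0 = 0, theta_1 = 2
```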

3.3 Detailed Algorithm of Gradient Descent

 The gradient descent algorithm can be described in algebraic form or in matrix form (also called the vector form). If you are not familiar with matrix analysis, the algebraic form is easier to understand, while the matrix form is more concise and, thanks to the use of matrices, makes the implementation logic clearer at a glance. The algebraic form is introduced first, followed by the matrix form.

3.3.1 Algebraic description of gradient descent

  • Prerequisites: Confirm the hypothesis function and loss function of the optimized model.
    • For example, for linear regression, suppose the hypothesis function is $h_\theta(x_1, x_2, ...x_n) = \theta_0+\theta_1x_1+...+\theta_nx_n$, where $\theta_i (i=0,1,2,...n)$ are the model parameters and $x_i (i=1,2,...n)$ are the $n$ feature values of each sample. Adding the convention $x_0=1$, the hypothesis function can be written as: $$h_\theta(x_0, x_1,...x_n)=\sum_{i=0}^n\theta_ix_i$$ and the loss function can then be written as: $$J(\theta_0, \theta_1,...\theta_n)=\frac1{2m}\sum_{j=0}^m(h_\theta(x_0^{(j)}, x_1^{(j)},...x_n^{(j)})-y_j)^2$$
  • Initialization of algorithm parameters: mainly initialize $\theta_0, \theta_1,...\theta_n$, the termination threshold $\epsilon$, and the step size $\alpha$. In the absence of prior knowledge, all $\theta$ are usually initialized to 0 and the step size to 1, and these values are then refined during tuning.
  • Algorithmic process:
  1. Determine the gradient of the loss function at the current position. For $\theta_i$, its gradient expression is as follows: $$\frac ∂{∂ \theta_i}J(\theta_0,\theta_1,\theta_2,...\theta_n)$$
  2. Multiply the gradient of the loss function by the step size to get the descent distance from the current position: $$\alpha \frac ∂{∂ \theta_i}J(\theta_0,\theta_1,\theta_2,...\theta_n)$$
  3. Check whether, for every $\theta_i$, the descent distance is less than $\epsilon$. If so, the algorithm terminates and the current $\theta^T$ is the final result. Otherwise, go to step 4.
  4. Update all $\theta$ as follows, then go back to step 1: $$\theta_i=\theta_i-\alpha \frac∂{∂\theta_i}J(\theta_0,\theta_1,\theta_2,...\theta_n)$$
    • Let's compute $\frac∂{∂\theta_i}J(\theta_0,\theta_1,\theta_2,...\theta_n)$ for the linear-regression loss:
      $$J(\theta_0,\theta_1,...\theta_n)=\frac1{2m}\sum_{j=0}^m(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)^2$$
      $$\frac∂{∂\theta_i}J(\theta_0,\theta_1,\theta_2,...\theta_n)=\frac1{m}\sum_{j=0}^m(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)x_i^{(j)}$$
      So the iteration formula is as follows (a code sketch appears after this list): $$\theta_i=\theta_i-\alpha\frac1{m}\sum_{j=0}^m(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)x_i^{(j)}$$
      From this example, we can see that the gradient direction of the current point is jointly determined by all samples, which will be discussed in the fourth section below. The main difference between the variants of the gradient descent method is the method of taking samples. In the current section, we use all samples.
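
 As a concrete illustration of the algebraic procedure above, here is a minimal sketch of batch gradient descent for linear regression that performs the per-parameter update and stops once every descent distance is below $\epsilon$; the toy data, step size, and tolerance are illustrative assumptions:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, eps=1e-6, max_iter=10000):
    """Algebraic (component-wise) batch gradient descent for linear regression.
    X has shape (m, n+1) with a leading column of ones for theta_0."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)            # initialize all theta to 0
    for _ in range(max_iter):
        residual = X @ theta - y          # h_theta(x^(j)) - y_j for every sample
        steps = np.empty(n_plus_1)
        for i in range(n_plus_1):         # one update per parameter theta_i
            grad_i = np.sum(residual * X[:, i]) / m
            steps[i] = alpha * grad_i
        if np.all(np.abs(steps) < eps):   # all descent distances below epsilon
            break
        theta -= steps                    # simultaneous update of all theta_i
    return theta

# Illustrative data: y ≈ 1 + 2x
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1 + 2 * np.arange(5.0)
print(batch_gradient_descent(X, y))       # approaches [1, 2]
```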

3.3.2 Matrix description of gradient descent method

  1. Prerequisites: similar to 3.3.1, the hypothesis function and loss function of the model to be optimized must be confirmed. For linear regression, the matrix expression of the hypothesis function $h_\theta(x_1, x_2, ...x_n) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$ is: $$h_\theta(X)=X\theta$$ where $h_\theta(X)$ is an $m×1$ vector, $X$ is an $m×(n+1)$ matrix (with a leading column of ones corresponding to $\theta_0$), and $\theta$ is an $(n+1)×1$ vector. Here $m$ is the number of samples and $n+1$ the number of features.
    The loss function expression is:
    $$J(\theta)=\frac{1}{2}(X\theta-Y)^T(X\theta-Y)$$
    where Y is the output vector of the sample, and the dimension is $m×1$.

  2. Algorithm parameter initialization: the $\theta$ vector can be initialized to a default value or to a previously tuned value. The termination threshold $\epsilon$ and the step size $\alpha$ are the same as in 3.3.1.
  3. Algorithmic process:
    1. Determine the gradient of the loss function at the current position. For the $\theta$ vector, its gradient expression is as follows: $$\frac{∂}{∂\theta}J(\theta)$$
    2. Multiply the gradient of the loss function by the step size to get the distance that the current position falls, that is, $\alpha\frac{∂}{∂\theta}J(\theta)$.
    3. Check whether, for every component of the $\theta$ vector, the descent distance is less than $\epsilon$. If so, the algorithm terminates and the current $\theta$ vector is the final result. Otherwise, go to step 4.
    4. Update the $\theta$ vector, and its update expression is as follows. After the update, go to step 1. $$\theta=\theta-\alpha\frac{∂}{∂\theta}J(\theta)$$

 Let's use linear regression as an example to describe the calculation. First, two matrix-calculus formulas that will be needed (here $x$ denotes a column vector):
$$\frac{∂}{∂x}x^Tx=2x$$
$$\frac{∂}{∂\theta}X\theta=X^T$$
 Now let's derive the partial derivative of the loss function:
$$\frac ∂{∂\theta}J(\theta)=\frac{∂(X\theta-Y)}{∂\theta}\frac{∂J(\theta)}{∂(X\theta-Y)}=X^T(X\theta-Y)$$
 So the final iteration formula is: $$\theta=\theta-\alpha X^T(X\theta-Y)$$
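
 As an illustration, here is a vectorized NumPy sketch of this matrix-form iteration. It scales the gradient by $\frac1m$ (a common convention not present in the $J(\theta)$ above) so that the step size behaves like in the algebraic version; the toy data and hyperparameters are assumptions:

```python
import numpy as np

def gradient_descent_matrix(X, Y, alpha=0.1, eps=1e-6, max_iter=10000):
    """Matrix-form gradient descent: theta <- theta - alpha * X^T (X theta - Y).
    The 1/m scaling is added here so alpha behaves like in the algebraic version."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(max_iter):
        grad = X.T @ (X @ theta - Y) / m     # X^T (X theta - Y), averaged over samples
        step = alpha * grad
        if np.all(np.abs(step) < eps):       # every descent distance below epsilon
            break
        theta -= step
    return theta

# Illustrative data: same y ≈ 1 + 2x example as before
X = np.column_stack([np.ones(5), np.arange(5.0)])
Y = 1 + 2 * np.arange(5.0)
print(gradient_descent_matrix(X, Y))         # approaches [1, 2]
```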

3.4 Algorithm Tuning for Gradient Descent

 When using gradient descent, some tuning is needed. What needs to be tuned?

  1. Step size selection. In the earlier algorithm description, the step size was taken to be 1, but the actual value depends on the data. You can try several values, from large to small, run the algorithm with each, and compare the iterative behaviour (the final loss, or AUC). If the loss function keeps decreasing, the value is effective; otherwise the step size should be adjusted (usually reduced). As mentioned earlier, if the step size is too large, the iterations move too fast and may even overshoot the optimal solution; if it is too small, the iterations are too slow and the algorithm takes a very long time to finish. So the step size needs to be tried several times to find a good value.
  2. Initial value selection for the parameters. Different initial values may lead to different minima, so gradient descent in general only finds a local minimum; of course, if the loss function is convex, the minimum found is guaranteed to be the global optimum. Because of the risk of local optima, the algorithm should be run several times with different initial values, keeping track of the smallest loss obtained, and the initial value that minimizes the loss function should be chosen.
  3. Normalization. Because different features of the samples have different value ranges, the iterations may be very slow. To reduce the influence of feature scale, the feature data can be standardized: for each feature $x$, compute its expectation $\overline{x}$ and standard deviation $std(x)$, and then transform it into: $$\frac{x-\overline{x}}{std(x)}$$
    The new expectation of each feature is 0 and the new variance is 1, which greatly speeds up convergence (a short sketch follows).
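
 For example, a short sketch of this standardization applied column-wise to a small, made-up feature matrix:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                   # made-up features with very different scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # (x - mean) / std, per feature column
print(X_std.mean(axis=0), X_std.std(axis=0))   # ≈ 0 and 1 for every column
```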

4. The big family of gradient descent methods (BGD, SGD, MBGD)

4.1 Batch Gradient Descent

 The batch gradient descent method is the most commonly used form of gradient descent. Specifically, all of the samples are used to update the parameters. This corresponds to the linear-regression gradient descent in 3.3.1 above; in other words, the gradient descent method of 3.3.1 is batch gradient descent.
$$\theta_i=\theta_i-\alpha\frac1{m}\sum_{j=0}^m(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)x_i^{(j)}$$
Since we have $m$ samples, all of them are used when computing the gradient.

4.2 Stochastic Gradient Descent

 The stochastic gradient descent method is similar to batch gradient descent. The difference is that instead of using all $m$ samples when computing the gradient, only a single sample $j$ is selected. The corresponding update formula is:
$$\theta_i=\theta_i-\alpha(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)x_i^{(j)}$$

 Stochastic gradient descent and the batch gradient descent of 4.1 are two extremes: one uses all the data for each gradient step, the other uses a single sample. The advantages and disadvantages of each are therefore very pronounced. In terms of training speed, stochastic gradient descent uses only one sample per iteration, so it trains very fast, while batch gradient descent cannot keep up when the sample size is large. In terms of accuracy, stochastic gradient descent determines the gradient direction from a single sample, so the resulting solution may well not be optimal. In terms of convergence, because stochastic gradient descent iterates on one sample at a time, the iteration direction fluctuates a lot, and it cannot converge quickly to the local optimum.
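
 A minimal sketch of stochastic gradient descent for the same linear-regression setting, updating with one randomly chosen sample at a time (the number of passes and the learning rate are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=100, seed=0):
    """Stochastic gradient descent: each update uses a single random sample j."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        for j in rng.permutation(m):             # visit samples in random order
            error = X[j] @ theta - y[j]          # h_theta(x^(j)) - y_j
            theta -= alpha * error * X[j]        # update using this one sample only
    return theta

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1 + 2 * np.arange(5.0)
print(sgd(X, y))                                 # approaches [1, 2]
```
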
 So, is there a middle ground that balances the advantages and drawbacks of the two methods? Yes: the mini-batch gradient descent method.

4.3 Mini-batch Gradient Descent

 The mini-batch gradient descent method is a compromise between batch gradient descent and stochastic gradient descent: for $m$ samples, we use $q$ of them per iteration, with $1 < q < m$. Generally $q=10$ can be used, and of course the value of $q$ can be adjusted according to the number of samples. The corresponding update formula is (a code sketch follows):
$$\theta_i=\theta_i-\alpha\frac1{q}\sum_{j=t}^{t+q-1}(h_\theta(x_0^{(j)},x_1^{(j)},...x_n^{(j)})-y_j)x_i^{(j)}$$
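
 A sketch of mini-batch gradient descent along the same lines; here $q=2$ only because the toy data set has just 5 samples (as noted above, $q=10$ is a common default):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, q=2, epochs=200, seed=0):
    """Mini-batch gradient descent: each update averages the gradient over q samples."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, q):
            batch = order[start:start + q]       # indices t .. t+q-1 of this mini-batch
            error = X[batch] @ theta - y[batch]  # residuals on the mini-batch
            theta -= alpha * (X[batch].T @ error) / len(batch)
        # each pass uses every sample exactly once, q at a time
    return theta

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1 + 2 * np.arange(5.0)
print(minibatch_gd(X, y))                        # approaches [1, 2]
```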
