Machine Learning Notes (3): Gradient Descent and Data Normalization

1. Gradient Descent

Gradient descent is often used in machine learning to minimize a loss function $J$; it is a search-based optimization method. Its counterpart, gradient ascent, is used to maximize a utility function.

"The concept of gradient" : The gradient is the partial derivative of the function on each of its independent variables ( ∂ J ∂ θ \frac{\partial{J}}{\partial\theta}θJ), a vector Δ J = ( ∂ J ∂ θ 0 , ∂ J ∂ θ 1 , ⋯ , ∂ J ∂ θ n ) \Delta J=(\frac{\partial J}{\partial\ theta_0}, \frac{\partial J}{\partial\theta_1},\cdots,\frac{\partial J}{\partial\theta_n})ΔJ=(θ0J,θ1J,,θnJ) .
For a function of a single variable, the gradient reduces to the derivative $\frac{dJ}{d\theta}$; on a straight line the derivative is the slope, and on a curve it is the slope of the tangent line.
[Figure: a curve of $J(\theta)$ with a marked point where the gradient is positive, with arrows for the gradient direction (red) and its opposite (blue)]
The gradient tells how much $J$ changes when $\theta$ changes by a unit amount, and it also indicates a direction: the direction in which $J$ increases. Consider the blue point in the figure above, where the gradient is positive. Using the "sign to indicate direction", a positive derivative points along the positive $\theta$ axis, and a negative gradient points along the negative $\theta$ axis. Once this direction is known, we know in which direction $\theta$ must move for $J$ to increase; if we want $J$ to decrease, $\theta$ should move in the opposite direction from what the gradient tells us. Let the "red" arrow represent the direction the gradient points, and the "blue" arrow the opposite direction. The goal of "gradient descent" is to search for a position where the function value is as small as possible, so $\theta$ should move along the blue arrow. How is this done? With the update rule: $\theta = \theta - \eta\nabla J,\qquad \nabla J=\left(\frac{\partial J}{\partial\theta_0}, \frac{\partial J}{\partial\theta_1},\cdots,\frac{\partial J}{\partial\theta_n}\right)$, where:

  • $\eta$ is called the learning rate
  • the value of $\eta$ affects how quickly the optimal solution is reached
  • an inappropriate value of $\eta$ may prevent the optimal solution from being found at all
  • $\eta$ is a hyperparameter of the gradient descent method
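As a minimal sketch of the update rule $\theta=\theta-\eta\nabla J$ in the one-dimensional case (the function names and example loss are illustrative, not from the original notes):

```python
def gradient_descent(grad, theta0, eta=0.1, n_iters=1000, eps=1e-8):
    """Repeatedly step theta against the gradient of J."""
    theta = theta0
    for _ in range(n_iters):
        step = eta * grad(theta)
        theta -= step                # move opposite to the gradient direction
        if abs(step) < eps:          # stop once updates become negligible
            break
    return theta

# Example: J(theta) = (theta - 2.5)**2, whose gradient is 2 * (theta - 2.5)
theta_min = gradient_descent(lambda t: 2 * (t - 2.5), theta0=0.0)
```

With this convex $J$, the iterate approaches the minimizer $\theta=2.5$; a learning rate that is too large would instead make the updates diverge.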

1.1. Batch Gradient Descent (BGD)

The gradient is computed using all of the data, averaged over the samples (without averaging, the magnitude would scale with the dataset size and be meaningless). The advantage is that few iterations are needed to converge; the disadvantage is that every iteration must process the entire dataset, which takes a lot of memory and time.
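A sketch of batch gradient descent for linear regression with MSE loss (the helper name and synthetic data are illustrative assumptions):

```python
import numpy as np

def bgd_linreg(X, y, eta=0.1, n_iters=500):
    """Batch GD: each update averages the gradient over ALL m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = 2.0 / m * X.T @ (X @ theta - y)   # averaged MSE gradient
        theta -= eta * grad
    return theta

# Noise-free synthetic data: y = 3 + 2x; the column of ones is the intercept
x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta = bgd_linreg(X, y)   # approaches [3, 2]
```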

1.2. Stochastic Gradient Descent (SGD)

Only one sample is used to compute the gradient at each step. The advantage is speed; the disadvantages are that it may fall into a local optimum and that the search is relatively blind: it does not always move in the optimal direction, because a single sample can be noisy.
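The same linear-regression example can be rewritten with single-sample updates; a decaying learning rate, as sketched below, is a common way to damp the noise of single-sample gradients (names and the decay schedule are illustrative):

```python
import numpy as np

def sgd_linreg(X, y, eta0=0.1, n_epochs=50, seed=0):
    """Stochastic GD: one randomly chosen sample per parameter update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(m):                    # shuffle each epoch
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])   # single-sample gradient
            theta -= eta0 / (1 + 0.01 * t) * grad       # decaying learning rate
            t += 1
    return theta

x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta_sgd = sgd_linreg(X, y)
```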

1.3. Mini-Batch Gradient Descent (MBGD)

SGD and BGD are two extremes, and MBGD is a compromise between them: each update computes the gradient over a small batch of data.
This method can also fall into a local optimum.
In practice, "SGD" nowadays usually refers to MBGD.
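A sketch of the mini-batch compromise on the same illustrative linear-regression data (batch size and names are assumptions):

```python
import numpy as np

def mbgd_linreg(X, y, batch_size=16, eta=0.1, n_epochs=100, seed=0):
    """Mini-batch GD: average the gradient over a small random batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                 # shuffle, then slice batches
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = 2.0 / len(b) * X[b].T @ (X[b] @ theta - y[b])
            theta -= eta * grad
    return theta

x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta_mb = mbgd_linreg(X, y)
```

Averaging over a batch reduces the gradient noise of pure SGD while still avoiding a full pass over the data per update.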

2. Data normalization

The purpose of normalization is to map the preprocessed data into a fixed range (such as [0, 1] or [-1, 1]), which eliminates the adverse effect of features on very different scales and makes the model easier to converge.

2.1. Min-Max Normalization

Min-max normalization maps all data into the range [0, 1]: $x_{scale}=\frac{x-x_{min}}{x_{max}-x_{min}}$

It is suitable for distributions with obvious boundaries, for example image pixel values in 0-255; it is strongly affected by outliers.
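A sketch of min-max normalization applied per feature column (the function name and data are illustrative):

```python
import numpy as np

def min_max_scale(X):
    """Map each feature column into [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
X_scaled = min_max_scale(X)   # each column now spans [0, 1]
```

Note that a single outlier changes $x_{max}$ or $x_{min}$ and therefore squashes all other values, which is the drawback mentioned above.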

2.2. Mean-Variance Normalization (Standardization)

This is suitable when the data distribution has no clear boundaries and may contain extreme values. Mean-variance normalization maps all data to a distribution with mean 0 and variance 1: $x_{scale}=\frac{x-x_{mean}}{S}$, where $S$ is the standard deviation.
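A sketch of mean-variance normalization per feature column (names and data are illustrative):

```python
import numpy as np

def standardize(X):
    """Rescale each feature column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_std = standardize(X)   # column means become 0, standard deviations become 1
```

Unlike min-max scaling, the result is not confined to a fixed interval, but extreme values shift the statistics far less severely.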

Origin blog.csdn.net/qq_45723275/article/details/123733216