Machine Learning Notes (3): Gradient Descent and Data Normalization

1. Gradient Descent

Gradient descent is often used in machine learning to minimize a loss function $J$; it is a search-based optimization method. Its counterpart, gradient ascent, is used to maximize a utility function.

"The concept of gradient" : The gradient is the partial derivative of the function on each of its independent variables ( ∂ J ∂ θ \frac{\partial{J}}{\partial\theta}θJ), a vector Δ J = ( ∂ J ∂ θ 0 , ∂ J ∂ θ 1 , ⋯ , ∂ J ∂ θ n ) \Delta J=(\frac{\partial J}{\partial\ theta_0}, \frac{\partial J}{\partial\theta_1},\cdots,\frac{\partial J}{\partial\theta_n})ΔJ=(θ0J,θ1J,,θnJ) .
For a function of a single variable, the gradient reduces to the derivative $\frac{dJ}{d\theta}$; on a straight line the derivative is the slope, and on a curve it is the slope of the tangent line.
[Figure: a curve of $J(\theta)$ with a marked point where the gradient is positive, with arrows for the gradient direction (red) and its opposite (blue)]
The gradient tells how much $J$ changes when $\theta$ changes by a unit amount, and it also indicates a direction: the direction in which $J$ increases. Consider the blue point in the figure above, where the gradient is positive. Using the "sign to indicate direction", a positive derivative points along the positive $\theta$ axis, and a negative gradient points along the negative $\theta$ axis. Once this direction is known, we know in which direction $\theta$ must move for $J$ to increase; if we want $J$ to decrease, $\theta$ should move in the opposite direction from what the gradient tells us. Let the "red" arrow represent the direction the gradient points, and the "blue" arrow the opposite direction. The goal of "gradient descent" is to search for a position where the function value is as small as possible, so $\theta$ should move along the blue arrow. How is this done? With the update rule: $\theta = \theta - \eta\nabla J,\qquad \nabla J=\left(\frac{\partial J}{\partial\theta_0}, \frac{\partial J}{\partial\theta_1},\cdots,\frac{\partial J}{\partial\theta_n}\right)$, where:

  • $\eta$ is called the learning rate
  • the value of $\eta$ affects how quickly the optimal solution is reached
  • an inappropriate value of $\eta$ may prevent the optimal solution from being found at all
  • $\eta$ is a hyperparameter of the gradient descent method
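As a minimal sketch of the update rule $\theta=\theta-\eta\nabla J$ in the one-dimensional case (the function names and example loss are illustrative, not from the original notes):

```python
def gradient_descent(grad, theta0, eta=0.1, n_iters=1000, eps=1e-8):
    """Repeatedly step theta against the gradient of J."""
    theta = theta0
    for _ in range(n_iters):
        step = eta * grad(theta)
        theta -= step                # move opposite to the gradient direction
        if abs(step) < eps:          # stop once updates become negligible
            break
    return theta

# Example: J(theta) = (theta - 2.5)**2, whose gradient is 2 * (theta - 2.5)
theta_min = gradient_descent(lambda t: 2 * (t - 2.5), theta0=0.0)
```

With this convex $J$, the iterate approaches the minimizer $\theta=2.5$; a learning rate that is too large would instead make the updates diverge.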

1.1. Batch Gradient Descent (BGD)

The gradient is computed using all of the data, averaged over the samples (without averaging, the magnitude would scale with the dataset size and be meaningless). The advantage is that few iterations are needed to converge; the disadvantage is that every iteration must process the entire dataset, which takes a lot of memory and time.
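A sketch of batch gradient descent for linear regression with MSE loss (the helper name and synthetic data are illustrative assumptions):

```python
import numpy as np

def bgd_linreg(X, y, eta=0.1, n_iters=500):
    """Batch GD: each update averages the gradient over ALL m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = 2.0 / m * X.T @ (X @ theta - y)   # averaged MSE gradient
        theta -= eta * grad
    return theta

# Noise-free synthetic data: y = 3 + 2x; the column of ones is the intercept
x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta = bgd_linreg(X, y)   # approaches [3, 2]
```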

1.2. Stochastic Gradient Descent (SGD)

Only one sample is used to compute the gradient at each step. The advantage is speed; the disadvantages are that it may fall into a local optimum and that the search is relatively blind: it does not always move in the optimal direction, because a single sample can be noisy.
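The same linear-regression example can be rewritten with single-sample updates; a decaying learning rate, as sketched below, is a common way to damp the noise of single-sample gradients (names and the decay schedule are illustrative):

```python
import numpy as np

def sgd_linreg(X, y, eta0=0.1, n_epochs=50, seed=0):
    """Stochastic GD: one randomly chosen sample per parameter update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(m):                    # shuffle each epoch
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])   # single-sample gradient
            theta -= eta0 / (1 + 0.01 * t) * grad       # decaying learning rate
            t += 1
    return theta

x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta_sgd = sgd_linreg(X, y)
```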

1.3. Mini-Batch Gradient Descent (MBGD)

SGD and BGD are two extremes, and MBGD is a compromise between them: each update computes the gradient over a small batch of data.
This method can also fall into a local optimum.
In practice, "SGD" nowadays usually refers to MBGD.
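A sketch of the mini-batch compromise on the same illustrative linear-regression data (batch size and names are assumptions):

```python
import numpy as np

def mbgd_linreg(X, y, batch_size=16, eta=0.1, n_epochs=100, seed=0):
    """Mini-batch GD: average the gradient over a small random batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                 # shuffle, then slice batches
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = 2.0 / len(b) * X[b].T @ (X[b] @ theta - y[b])
            theta -= eta * grad
    return theta

x = np.linspace(0, 2, 100)
X = np.c_[np.ones_like(x), x]
y = 3 + 2 * x
theta_mb = mbgd_linreg(X, y)
```

Averaging over a batch reduces the gradient noise of pure SGD while still avoiding a full pass over the data per update.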

2. Data normalization

The purpose of normalization is to map the preprocessed data into a fixed range (such as [0, 1] or [-1, 1]), which eliminates the adverse effect of features on very different scales and makes the model easier to converge.

2.1. Min-Max Normalization

Min-max normalization maps all data into the range [0, 1]: $x_{scale}=\frac{x-x_{min}}{x_{max}-x_{min}}$

It is suitable for distributions with obvious boundaries, for example image pixel values in 0-255; it is strongly affected by outliers.
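A sketch of min-max normalization applied per feature column (the function name and data are illustrative):

```python
import numpy as np

def min_max_scale(X):
    """Map each feature column into [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
X_scaled = min_max_scale(X)   # each column now spans [0, 1]
```

Note that a single outlier changes $x_{max}$ or $x_{min}$ and therefore squashes all other values, which is the drawback mentioned above.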

2.2. Mean-Variance Normalization (Standardization)

This is suitable when the data distribution has no clear boundaries and may contain extreme values. Mean-variance normalization maps all data to a distribution with mean 0 and variance 1: $x_{scale}=\frac{x-x_{mean}}{S}$, where $S$ is the standard deviation.
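A sketch of mean-variance normalization per feature column (names and data are illustrative):

```python
import numpy as np

def standardize(X):
    """Rescale each feature column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_std = standardize(X)   # column means become 0, standard deviations become 1
```

Unlike min-max scaling, the result is not confined to a fixed interval, but extreme values shift the statistics far less severely.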

Origin blog.csdn.net/qq_45723275/article/details/123733216