Neural Network 06 (Optimization Method)

1. Optimization method

After the network is built and the loss function is designed, the parameters (weights and biases) are updated according to the loss function. This parameter update process is the optimization process of the neural network.

 

2. Gradient descent method

Gradient descent is, put simply, a method for finding the parameters that minimize the loss function. From a mathematical point of view, the gradient points in the direction in which the function increases fastest, and the opposite direction is the one in which it decreases fastest, so the update rule is:

$$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial E}{\partial w}$$

where w is the weight and E is the loss function.

Here η is the learning rate. If the learning rate is too small, each update changes the weights very little, which increases the time cost of training. If the learning rate is too large, an update may jump over the optimal solution, and training may oscillate or diverge instead of converging. One remedy is to let the learning rate change as training progresses.
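The effect of the learning rate can be seen on a toy problem. The sketch below assumes the loss E(w) = w²/2 (whose gradient is simply w); the function name and the three learning rates are chosen only for illustration.

```python
def gradient_descent(w0, learning_rate, steps):
    """Minimize E(w) = w**2 / 2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = w                      # dE/dw for E(w) = w**2 / 2
        w = w - learning_rate * grad  # step against the gradient
    return w

w_small = gradient_descent(1.0, 0.001, 20)  # barely moves away from 1.0
w_good = gradient_descent(1.0, 0.5, 20)     # approaches the minimum at 0
w_large = gradient_descent(1.0, 2.5, 20)    # overshoots; |w| grows every step
print(w_small, w_good, w_large)
```

With the largest learning rate each update multiplies w by -1.5, so the iterate jumps back and forth across the minimum and moves further away every time.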

 

In practice, the mini-batch gradient descent algorithm is most often used. In tf.keras it is implemented as follows:

tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs
)
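# Import the required package
import tensorflow as tf
# Instantiate the optimizer: SGD
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
# Define the parameter to be updated
var = tf.Variable(1.0)
# Define the loss function: takes no arguments and returns the loss value
loss = lambda: (var ** 2) / 2.0
# Compute the gradient of the loss with respect to the parameter
with tf.GradientTape() as tape:
    loss_value = loss()
grads = tape.gradient(loss_value, [var])
# Update the parameter; the step is `-learning_rate * grad`
opt.apply_gradients(zip(grads, [var]))
# Show the updated parameter: 1.0 - 0.1 * 1.0 = 0.9
var.numpy()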

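When training models, there are three basic concepts:

Epoch: one complete pass over the entire training set.
Batch: a subset of the training set used for a single parameter update; its size is the Batch Size.
Iteration: one parameter update, performed on one Batch.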

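In fact, the fundamental difference between the variants of gradient descent lies in the Batch Size. For a training set of N samples:

Batch gradient descent (BGD): Batch Size = N, 1 Batch per Epoch
Stochastic gradient descent (SGD): Batch Size = 1, N Batches per Epoch
Mini-batch gradient descent: Batch Size = B, ceil(N/B) Batches per Epoch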

 

Assume that the data set has 50,000 training samples and that Batch Size = 256 is chosen to train the model.

The number of images to be trained in each Epoch: 50000
The number of Batches in the training set: ceil(50000/256) = 196 (195 full Batches plus a final Batch of 80 images)
The number of Iterations in each Epoch: 196
The number of Iterations in 10 Epochs: 1960
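The same arithmetic can be checked in a few lines of Python. This is only a sketch of the counting; the variable names are illustrative.

```python
import math

num_samples = 50_000
batch_size = 256
epochs = 10

# The last batch is smaller than the others, so round up.
batches_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (batches_per_epoch - 1) * batch_size
# One Iteration is one parameter update on one Batch.
total_iterations = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_iterations)  # 196 80 1960
```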
 

3. Back propagation algorithm (BP algorithm)

The neural network is trained using the backpropagation algorithm. Backpropagation works together with gradient descent: it calculates the gradient of the loss function with respect to every weight in the network, and gradient descent then uses these gradients to update the weights so as to minimize the loss function.

3.1 Forward propagation and back propagation

Forward propagation means that the input data is passed forward through the neural network, layer by layer, until it reaches the output layer.

During training, there is always an error between the result obtained by forward propagation and the true value of the sample, and this error is measured by the loss function. To reduce the error, the partial derivatives of the loss function with respect to each parameter are computed from the output layer back towards the input layer. This is backpropagation.

3.2 Chain Rule

The backpropagation algorithm uses the chain rule to compute gradients and update weights. A complex composite function is split into a series of elementary operations (addition, subtraction, multiplication, division) and elementary functions (exponentials, logarithms, trigonometric functions), and the chain rule is then applied step by step to differentiate the composite function. For simplicity, a composite function that is common in neural networks is used here to illustrate the process. Let the composite function f(x; w, b) be:

$$f(x; w, b) = \frac{1}{\exp(-(wx + b)) + 1}$$

where x is the input data, w is the weight, and b is the bias. We can decompose this composite function into:

$$h_1 = wx, \quad h_2 = h_1 + b, \quad h_3 = -h_2, \quad h_4 = \exp(h_3), \quad h_5 = h_4 + 1, \quad h_6 = \frac{1}{h_5}$$
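One possible decomposition can be sketched in Python. It assumes the composite function is the sigmoid of a linear unit, f(x; w, b) = 1 / (exp(-(wx + b)) + 1); the intermediate names h1 to h6 and the function name are illustrative.

```python
import math

def forward_backward(x, w, b):
    """Return f(x; w, b) together with df/dw and df/db via the chain rule."""
    # Forward pass: one elementary operation per line.
    h1 = x * w
    h2 = h1 + b
    h3 = -h2
    h4 = math.exp(h3)
    h5 = h4 + 1.0
    h6 = 1.0 / h5
    # Backward pass: multiply local derivatives from the output to the input.
    d_h5 = -1.0 / h5 ** 2   # d(1/h5)/dh5
    d_h4 = d_h5 * 1.0       # h5 = h4 + 1
    d_h3 = d_h4 * h4        # d(exp(h3))/dh3 = exp(h3) = h4
    d_h2 = d_h3 * -1.0      # h3 = -h2
    d_b = d_h2 * 1.0        # h2 = h1 + b
    d_h1 = d_h2 * 1.0
    d_w = d_h1 * x          # h1 = x * w
    return h6, d_w, d_b

f, df_dw, df_db = forward_backward(x=1.0, w=0.0, b=0.0)
print(f, df_dw, df_db)  # 0.5 0.25 0.25
```

The result agrees with the closed form f(1 - f)x for df/dw and f(1 - f) for df/db.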

 

3.3 Backpropagation algorithm

The backpropagation algorithm uses the chain rule to compute the gradient of the loss with respect to the weights of each node in the neural network; these gradients are then used to update the weights.

Assume that the current forward propagation process is as follows:

Calculate the loss function and perform backpropagation:

Calculate the gradient values:
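The three steps (forward propagation, loss, gradients) can be sketched on a very small network. Everything below is an assumption made for illustration: one input, one sigmoid hidden unit, one linear output, a squared-error loss, and arbitrary starting values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward propagation: input -> hidden unit -> output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """Backpropagation: apply the chain rule from the loss back to each weight."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_y = y - target             # dE/dy
    d_w2 = d_y * h               # y = w2 * h + b2
    d_b2 = d_y
    d_h = d_y * w2
    d_z = d_h * h * (1.0 - h)    # derivative of the sigmoid
    d_w1 = d_z * x               # z = w1 * x + b1
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, -0.3, 0.8, 0.1]
x, target, learning_rate = 1.5, 1.0, 0.1
grads = backward(params, x, target)
# One gradient descent update using the backpropagated gradients.
new_params = [p - learning_rate * g for p, g in zip(params, grads)]
print(loss(params, x, target), loss(new_params, x, target))
```

After the update the loss on this sample is smaller than before, which is what a single gradient descent step with a small learning rate should achieve.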

3.4 Gradient descent optimization method

When gradient descent is used to train a network, it runs into problems such as saddle points and local minima. So how do we improve SGD? Here we introduce some of the more commonly used improvements.

3.4.1 Momentum algorithm

The momentum algorithm mainly targets saddle point problems. Before introducing the momentum algorithm, let's first look at how the exponentially weighted average is calculated.

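Exponentially weighted average

Given an initial value S_0, the exponentially weighted average is computed as:

$$S_t = \beta S_{t-1} + (1 - \beta) Y_t$$

Here Y_t is the actual value at time t, S_t is the weighted average at time t, and β is the weight. In the figure, the red line is the result of the exponentially weighted average.
With β set to 0.9 as in the figure, the exponentially weighted average is:

$$S_t = 0.9\,S_{t-1} + 0.1\,Y_t = 0.1\,Y_t + 0.1 \cdot 0.9\,Y_{t-1} + 0.1 \cdot 0.9^2\,Y_{t-2} + \dots$$

so older values are weighted by exponentially decreasing factors.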
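A minimal sketch of the exponentially weighted average follows. It assumes the average starts from an initial value of 0, which is one common choice; the function name is illustrative.

```python
def exponentially_weighted_average(values, beta=0.9, initial=0.0):
    """Return S_1..S_n with S_t = beta * S_{t-1} + (1 - beta) * Y_t."""
    averages = []
    s = initial
    for y in values:
        s = beta * s + (1.0 - beta) * y
        averages.append(s)
    return averages

smoothed = exponentially_weighted_average([1.0, 1.0, 1.0])
print(smoothed)  # approximately [0.1, 0.19, 0.271]
```

Because the average starts at 0, the first few values are biased towards 0; this is the bias that Adam's correction terms compensate for.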

Momentum Gradient Descent Algorithm

Gradient descent with momentum calculates the exponentially weighted average of the gradient and uses this value, rather than the raw gradient, to update the parameters. With learning rate α and β usually set to 0.9, the whole procedure is:

$$v_{dw} = \beta v_{dw} + (1 - \beta)\,dw, \qquad v_{db} = \beta v_{db} + (1 - \beta)\,db$$

$$w = w - \alpha\,v_{dw}, \qquad b = b - \alpha\,v_{db}$$

Compared with the original gradient descent algorithm, the descent path is smoother.
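The update can be sketched for a single parameter. The toy loss E(w) = w²/2, the learning rate, and the function name are assumptions made for illustration.

```python
def momentum_step(w, v_dw, grad, learning_rate=0.1, beta=0.9):
    """One momentum update: average the gradient, then step along the average."""
    v_dw = beta * v_dw + (1.0 - beta) * grad
    w = w - learning_rate * v_dw
    return w, v_dw

w, v_dw = 1.0, 0.0
for _ in range(200):
    grad = w  # dE/dw for E(w) = w**2 / 2
    w, v_dw = momentum_step(w, v_dw, grad)
print(w)  # close to the minimum at 0
```

Note that tf.keras.optimizers.SGD with `momentum` set uses a slightly different formulation (the velocity accumulates `-learning_rate * grad` without the `1 - beta` factor), so step sizes are not directly comparable.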

3.4.2 AdaGrad

The AdaGrad algorithm accumulates the element-wise square of the mini-batch stochastic gradient g_t in a state variable s_t. At time step 0, AdaGrad initializes every element of s_0 to 0. At time step t, the mini-batch stochastic gradient g_t is first squared element-wise and then added to s_t:

$$s_t = s_{t-1} + g_t \odot g_t$$

where ⊙ is element-wise multiplication. Next, the learning rate of each element of the parameter vector w is rescaled through element-wise operations:

$$w_t = w_{t-1} - \frac{\alpha}{\sqrt{s_t + \epsilon}} \odot g_t$$

where α is the learning rate and ε is a constant added for numerical stability, such as 10^(-6). The square root, division, and multiplication are all performed element by element. These element-wise operations give each element of the parameter vector its own learning rate.
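A sketch of one AdaGrad step on plain Python lists shows the per-element learning rate. The toy loss E(w) = (w_0² + 100·w_1²)/2, whose gradient is [w_0, 100·w_1], is an assumption chosen so that the two gradient components differ by a factor of 100.

```python
import math

def adagrad_step(w, s, grad, learning_rate=0.1, eps=1e-6):
    """One AdaGrad update, element by element."""
    new_w, new_s = [], []
    for w_i, s_i, g_i in zip(w, s, grad):
        s_i = s_i + g_i * g_i                                 # accumulate squared gradient
        w_i = w_i - learning_rate / math.sqrt(s_i + eps) * g_i
        new_w.append(w_i)
        new_s.append(s_i)
    return new_w, new_s

w, s = [1.0, 1.0], [0.0, 0.0]
grad = [w[0], 100.0 * w[1]]
w, s = adagrad_step(w, s, grad)
print(w, s)  # both elements move by about 0.1 despite very different gradients
```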

3.4.3 RMSProp

In the later iterations of the AdaGrad algorithm, the learning rate becomes so small that it is difficult to reach the optimal solution. To solve this problem, the RMSProp algorithm makes a small modification to AdaGrad.

In AdaGrad, the state variable s_t is the element-wise sum of the squares of all mini-batch stochastic gradients g_t up to time step t. The RMSProp (Root Mean Square Propagation) algorithm instead takes an exponentially weighted moving average of the element-wise squared gradients, with a decay weight 0 ≤ β < 1:

$$s_t = \beta s_{t-1} + (1 - \beta)\, g_t \odot g_t$$

$$w_t = w_{t-1} - \frac{\alpha}{\sqrt{s_t + \epsilon}} \odot g_t$$

where ε is a constant added for numerical stability. As a result, the learning rate of each element no longer keeps decreasing throughout the iterations. RMSProp helps reduce oscillation on the path to the minimum and allows a larger learning rate α, which speeds up learning.
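The difference from AdaGrad can be sketched as follows. The decay weight β = 0.9, the constant gradient of 2.0, and the function name are assumptions made for illustration.

```python
import math

def rmsprop_step(w, s, grad, learning_rate=0.01, beta=0.9, eps=1e-6):
    """One RMSProp update on lists of parameters, element by element."""
    new_w, new_s = [], []
    for w_i, s_i, g_i in zip(w, s, grad):
        s_i = beta * s_i + (1.0 - beta) * g_i * g_i          # moving average, not a sum
        w_i = w_i - learning_rate / math.sqrt(s_i + eps) * g_i
        new_w.append(w_i)
        new_s.append(s_i)
    return new_w, new_s

# Feed a constant gradient of 2.0 and watch the state variable.
w, s = [1.0], [0.0]
for _ in range(200):
    w, s = rmsprop_step(w, s, [2.0])
print(s)  # settles near 2.0 ** 2 = 4.0 instead of growing without bound
```

With AdaGrad the same 200 steps would accumulate s = 800, so the effective learning rate would keep shrinking; here it levels off.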

3.4.4 Adam

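The Adam optimization algorithm (Adaptive Moment Estimation) combines the Momentum and RMSProp algorithms. On top of RMSProp, Adam also keeps an exponentially weighted moving average of the mini-batch stochastic gradient itself.
Assume that each mini-batch is used to calculate dw and db. At the t-th iteration (the update for b is analogous to the one for w):

$$v_{dw} = \beta_1 v_{dw} + (1 - \beta_1)\,dw, \qquad s_{dw} = \beta_2 s_{dw} + (1 - \beta_2)\,dw^2$$

$$\hat{v}_{dw} = \frac{v_{dw}}{1 - \beta_1^t}, \qquad \hat{s}_{dw} = \frac{s_{dw}}{1 - \beta_2^t}$$

$$w = w - \alpha \frac{\hat{v}_{dw}}{\sqrt{\hat{s}_{dw}} + \epsilon}$$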

Recommended values for the parameter settings:
Learning rate α: try a range of values to find a suitable one
β1: the commonly used default value is 0.9
β2: recommended value is 0.999
ε: default value is 1e-8
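One Adam step for a single parameter, using the default values listed above, might look like the sketch below. The bias-corrected form and the toy loss E(w) = w²/2 are assumptions; the function name is illustrative.

```python
import math

def adam_step(w, v, s, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1.0 - beta1) * grad          # momentum-style average of the gradient
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-style average of its square
    v_hat = v / (1.0 - beta1 ** t)                # bias correction for the zero start
    s_hat = s / (1.0 - beta2 ** t)
    w = w - learning_rate * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = 1.0, 0.0, 0.0
for t in range(1, 101):
    grad = w  # dE/dw for E(w) = w**2 / 2
    w, v, s = adam_step(w, v, s, grad, t)
print(w)  # has moved roughly 100 * learning_rate towards 0
```

On the first iteration the bias-corrected ratio is grad / |grad|, so the step size is almost exactly the learning rate regardless of the gradient's magnitude.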

4. Learning rate annealing

When training a neural network, the learning rate generally changes as training progresses. The main reason is that in the later stages of training, a learning rate that is too high causes the loss to oscillate. However, if the learning rate decays too quickly, convergence slows down.

4.1 Piecewise constant decay

Piecewise constant decay sets a different constant learning rate for each pre-defined interval of training steps. The learning rate is larger at the beginning and then becomes smaller and smaller. The interval boundaries need to be adjusted according to the sample size. Generally, the larger the sample size, the smaller the intervals should be.
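A plain-Python sketch of such a schedule is shown below. The boundaries and learning rates are hypothetical, and the convention that a boundary step still belongs to the earlier interval is an assumption. tf.keras offers a comparable ready-made schedule, tf.keras.optimizers.schedules.PiecewiseConstantDecay.

```python
def piecewise_constant(step, boundaries, values):
    """Return the learning rate for a training step.

    values must have one more entry than boundaries: values[i] is used
    while step <= boundaries[i], and values[-1] after the last boundary.
    """
    for boundary, value in zip(boundaries, values):
        if step <= boundary:
            return value
    return values[-1]

boundaries = [1000, 5000]        # hypothetical step counts
values = [0.1, 0.01, 0.001]      # hypothetical learning rates
for step in (0, 1000, 1001, 5000, 5001):
    print(step, piecewise_constant(step, boundaries, values))
```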

4.2 Exponential decay
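
A common form of exponential decay, assumed here since no formula is given, is α_t = α_0 · exp(-k·t), where α_0 is the initial learning rate, k is a decay constant, and t is the training step or epoch. The names and example values in this sketch are illustrative.

```python
import math

def exponential_decay(t, initial_rate, k):
    """Learning rate alpha_t = alpha_0 * exp(-k * t)."""
    return initial_rate * math.exp(-k * t)

rates = [exponential_decay(t, initial_rate=0.1, k=0.05) for t in range(0, 50, 10)]
print(rates)  # starts at 0.1 and shrinks by the same factor every 10 steps
```

The schedule tf.keras.optimizers.schedules.ExponentialDecay expresses the same idea as initial_learning_rate * decay_rate ** (step / decay_steps).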

4.3 1/t decay
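
A common form of 1/t decay, assumed here since no formula is given, is α_t = α_0 / (1 + k·t), where α_0 is the initial learning rate, k is a decay constant, and t is the training step or epoch. The names and example values in this sketch are illustrative.

```python
def inverse_time_decay(t, initial_rate, k):
    """Learning rate alpha_t = alpha_0 / (1 + k * t)."""
    return initial_rate / (1.0 + k * t)

rates = [inverse_time_decay(t, initial_rate=0.1, k=0.1) for t in (0, 10, 30, 90)]
print(rates)  # approximately [0.1, 0.05, 0.025, 0.01]
```

Compared with exponential decay, this schedule drops quickly at first and then flattens out, so the learning rate stays usable for longer.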

Origin blog.csdn.net/peng_258/article/details/132836154