[Deep Learning] 5-1 Learning-related skills - parameter update (Momentum, AdaGrad, Adam)

The purpose of neural network learning is to find the parameters that make the value of the loss function as small as possible. This is the problem of finding the optimal parameters, and the process of solving it is called optimization.

However, the optimization problem for neural networks is very difficult. The parameter space is extremely complex, so the optimal solution cannot be found easily. Moreover, in deep networks the number of parameters is enormous, which makes the optimization problem even harder.

SGD

Previously, in order to find the optimal parameters, we used the gradient (derivative) of the loss function with respect to the parameters as a clue: we update the parameters in the direction of the gradient and repeat this step many times, gradually approaching the optimal parameters. This method is called stochastic gradient descent, abbreviated as SGD.

In mathematical notation, SGD can be written as follows:

W ← W − η ∂L/∂W

Here, W is the weight parameter to be updated, ∂L/∂W is the gradient of the loss function with respect to W, and η is the learning rate; in practice a value such as 0.01 or 0.001 is decided in advance. The ← in the formula means that the value on the left is updated with the value on the right.

SGD is a simple method that just moves a fixed distance in the direction of the gradient. Now, let us implement SGD as a Python class (for later convenience, we implement it as a class named SGD).

class SGD:
	def __init__(self, lr=0.01):
		self.lr = lr

	def update(self, params, grads):
		for key in params.keys():
			params[key] -= self.lr * grads[key]

The initialization argument lr is the learning rate, which is stored as an instance variable. The update(params, grads) method is called repeatedly during training. Its arguments params and grads are dictionary variables (as in the network implementations so far) that hold the weight parameters and their gradients under keys such as 'W1'.
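As a quick check, here is a minimal, self-contained sketch of the class in action on a toy parameter dictionary (the names 'W1' and 'b1' and the numbers are made up purely for illustration; the SGD class defined above is assumed to be available):

import numpy as np

# Hypothetical toy parameters and gradients, just to exercise the interface.
params = {'W1': np.array([[1.0, 2.0], [3.0, 4.0]]), 'b1': np.array([0.5, -0.5])}
grads  = {'W1': np.array([[0.1, 0.1], [0.1, 0.1]]), 'b1': np.array([0.2, 0.2])}

optimizer = SGD(lr=0.1)
optimizer.update(params, grads)
print(params['W1'])  # every element decreased by lr * gradient = 0.1 * 0.1 = 0.01
print(params['b1'])  # [0.48, -0.52]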

Using this SGD class, the parameters of the neural network can be updated as follows (the code below is a pseudocode that cannot actually be run).

network = TwoLayerNet(...)
optimizer = SGD()

for i in range(10000):
	...
	x_batch, t_batch = get_mini_batch(...) # mini-batch
	grads = network.gradient(x_batch, t_batch)
	params = network.params
	optimizer.update(params, grads)
	...

Here the variable name optimizer means "the one who performs optimization", and SGD plays this role here. The update of the parameters is carried out by the optimizer; all we have to do is pass the parameter and gradient information to it. Implementing the optimization class separately like this makes it easy to modularize the functionality: switching to the Momentum method introduced below, for example, only requires replacing optimizer = SGD() with optimizer = Momentum().

Disadvantages of SGD

Although SGD is simple and easy to implement, it can be inefficient on certain problems. The weakness of SGD is that if the shape of the function is anisotropic, that is, stretched out far more in one direction than in another, the search path becomes very inefficient. The root cause of this inefficiency is that the direction of the gradient does not, in general, point toward the minimum. We therefore need methods smarter than SGD, which simply moves in the direction of the gradient. (A concrete sketch of such an elongated function follows below.)
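To make the "elongated shape" concrete, here is a minimal sketch using f(x, y) = x²/20 + y², a standard textbook example of such a function (the function, starting point, learning rate, and step count here are our own choices for illustration, not taken from the text). Its gradient (x/10, 2y) rarely points at the minimum (0, 0), so plain SGD zigzags:

import numpy as np

def f(x, y):
    # Elongated bowl: very shallow along x, steep along y.
    return x**2 / 20.0 + y**2

def grad_f(x, y):
    # Analytic gradient of f; it rarely points straight at (0, 0).
    return np.array([x / 10.0, 2.0 * y])

lr = 0.9                       # illustrative learning rate
pos = np.array([-7.0, 2.0])    # illustrative starting point
for i in range(30):
    pos -= lr * grad_f(*pos)   # plain SGD step
# y overshoots back and forth (zigzag) while x shrinks only slowly.
print(pos, f(*pos))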

To remedy these shortcomings of SGD, the following sections introduce three methods that can replace it: Momentum, AdaGrad, and Adam.

[Figure] The update path of SGD: it moves toward the minimum at (0, 0) in a zigzag pattern, which is inefficient.

Momentum

Momentum means "momentum" in the physical sense. The Momentum method is expressed mathematically as follows:

v ← αv − η ∂L/∂W
W ← W + v

As with SGD, W is the weight parameter to be updated, ∂L/∂W is the gradient of the loss function with respect to W, and η is the learning rate. The new variable v corresponds to velocity in physics: the first formula expresses the physical law that an object receives a force in the direction of the gradient and accelerates under that force. The term αv (α is set to a value such as 0.9, the momentum argument in the code) gradually decelerates the object when it receives no force, playing the role of ground friction or air resistance. The whole update feels like a ball rolling across the ground. The implementation of Momentum is as follows:

class Momentum:
	def __init__(self, lr=0.01, momentum=0.9):
		self.lr = lr
		self.momentum = momentum
		self.v = None

	def update(self, params, grads):
		if self.v is None:
			self.v = {}
			for key, val in params.items():
				self.v[key] = np.zeros_like(val)
		
		for key in params.keys():
			self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
			params[key] += self.v[key]

The instance variable v holds the velocity of the object. Nothing is stored in v at initialization; when update() is called for the first time, v is created as a dictionary with the same structure as the parameters, with every entry initialized to zeros. After that, the two lines in the loop implement the two update formulas above.
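To see how the velocity term accumulates updates that keep pointing in the same direction, here is a tiny numeric sketch (the constant gradient is a made-up value, used only to show the effect; the Momentum class above is assumed to be defined):

import numpy as np

params = {'W': np.array([0.0])}
grads  = {'W': np.array([1.0])}   # constant gradient, for illustration only

opt = Momentum(lr=0.1, momentum=0.9)
for step in range(1, 4):
    opt.update(params, grads)
    print(step, opt.v['W'], params['W'])
# v grows: -0.1, then -0.19, then -0.271, so the steps get larger
# as long as the gradient keeps pointing the same way.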

In the figure below, the update path looks like a ball rolling around in a bowl. Compared with SGD, the "degree" of zigzagging is reduced. This is because, although the force in the x-axis direction is small, it always points in the same direction, so the ball keeps accelerating along x. Conversely, the force in the y-axis direction is large, but it alternates between the positive and negative directions, so the contributions cancel out and the velocity along y stays unstable. As a result, compared with SGD, the path approaches the minimum along the x-axis more quickly, and the zigzagging is weakened.
[Figure] The update path of Momentum.

AdaGrad

One effective technique concerning the learning rate is a method called learning rate decay, which gradually reduces the learning rate as learning progresses. The approach of learning "a lot" at first and then gradually learning "less" is often used in neural network training. Gradually lowering the learning rate in this way amounts to lowering the value of the learning rate for "all" the parameters together.
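As a side note, the simplest form of learning rate decay can be written in a few lines. This sketch (the initial rate, the decay factor 0.95, and the epoch count are arbitrary choices for illustration) lowers a single global learning rate after every epoch:

lr = 0.1
decay = 0.95              # arbitrary decay factor per epoch
for epoch in range(10):
    # ... train for one epoch using the current lr ...
    lr *= decay           # learn "a lot" at first, then gradually "less"
print(lr)                 # roughly 0.0599 after 10 epochs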

AdaGrad develops this idea further: it assigns a "customized" learning rate to each individual parameter, adjusting the learning rate appropriately for every element of the parameters as learning proceeds.

The update rule of AdaGrad is expressed mathematically as follows:

h ← h + ∂L/∂W ⊙ ∂L/∂W
W ← W − η (1/√h) ∂L/∂W

As before, W is the weight parameter to be updated, ∂L/∂W is the gradient of the loss function with respect to W, and η is the learning rate. The new variable h accumulates the sum of the squares of all previous gradient values (the ⊙ symbol denotes element-wise multiplication of matrices). When the parameters are updated, the gradient is scaled by 1/√h, which adjusts the scale of learning for each element: elements of the parameters that have changed a lot (been updated by large amounts) receive a smaller effective learning rate. In other words, the learning rate decays per parameter element, and it decays fastest for the elements whose gradients have been large.

AdaGrad records the sum of squares of all past gradients. Therefore, the further learning progresses, the smaller the updates become; if learning continues indefinitely, the update amount eventually approaches 0 and the parameters are no longer updated at all. The RMSProp method improves on this problem: instead of adding up all past gradients equally, it gradually "forgets" old gradients and gives more weight to the information from recent gradients. Technically, this is called an exponential moving average, which exponentially decays the contribution of past gradients.
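RMSProp is not implemented in the text above, so here is a minimal sketch under the same update(params, grads) interface (the class layout and the decay_rate value of 0.99 are assumptions for illustration):

import numpy as np

class RMSprop:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            # Exponential moving average of the squared gradients:
            # old information decays, recent gradients count for more.
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)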

The implementation process of AdaGrad is as follows:

class AdaGrad:
	def __init__(self, lr=0.01):
		self.lr = lr
		self.h = None

	def update(self, params, grads):
		if self.h is None:
			self.h = {}
			for key, val in params.items():
				self.h[key] = np.zeros_like(val)
		
		for key in params.keys():
			self.h[key] += grads[key] * grads[key]
			params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key])+ 1e-7)

Note that the last line adds the tiny value 1e-7 to the denominator. This prevents division by zero when an element of self.h[key] is still 0. Many deep learning frameworks expose this tiny value as a parameter (often called eps), but here it is fixed at 1e-7.
[Figure] The update path of AdaGrad.

The function value moves efficiently toward the minimum. Because the gradient in the y-axis direction is large at first, the early updates are large, but the accumulated squared gradients then scale down the step in that direction, so the update pace along y shrinks quickly and the zigzagging is suppressed.
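The shrinking update can also be seen numerically. A tiny sketch with a constant gradient (a made-up value, purely for illustration; the AdaGrad class above is assumed to be defined):

import numpy as np

params = {'W': np.array([10.0])}
grads  = {'W': np.array([2.0])}   # constant gradient, for illustration only

opt = AdaGrad(lr=1.0)
prev = params['W'].copy()
for step in range(1, 5):
    opt.update(params, grads)
    print(step, (prev - params['W']).item())  # size of this update
    prev = params['W'].copy()
# The update shrinks every step: 1.0, then about 0.707, 0.577, 0.5, ...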

Adam

The basic idea of the Adam method is to combine Momentum and AdaGrad: like Momentum it keeps a moving average of the gradients, and like AdaGrad it adapts the update scale per parameter. In addition, performing "bias correction" of these moment estimates is also a feature of Adam. Adam has three hyperparameters: the learning rate (lr in the code below), β1 for the first moment, and β2 for the second moment; the standard values β1 = 0.9 and β2 = 0.999 work well in many cases.
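For reference (these formulas are not given in the text above but come from the original Adam paper by Kingma and Ba), the full update rules can be written in the same notation as before, where g = ∂L/∂W, t counts the update steps, and ε is a tiny constant:

m ← β1 m + (1 − β1) g
v ← β2 v + (1 − β2) g ⊙ g
m_hat = m / (1 − β1^t),  v_hat = v / (1 − β2^t)
W ← W − η m_hat / (√v_hat + ε)

The m_hat and v_hat lines are the bias correction: m and v start at 0, so they are biased toward 0 during the first steps, and dividing by (1 − β^t) compensates for that. The simplified class below folds this correction into a rescaled learning rate lr_t instead of correcting m and v explicitly.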
The following is a simplified Adam class implemented in Python:

class Adam:

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        
        self.iter += 1
        lr_t  = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)         
        
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)

[Figure] The update path of Adam.
The update path based on Adam also looks like a ball rolling around in a bowl. Momentum shows a similar movement, but Adam's ball wobbles less from side to side, because the degree of each update is adjusted appropriately.

Which update method to use
So far, we have learned four methods for updating parameters: SGD, Momentum, AdaGrad, and Adam.
Each of the four methods has its own character; there are problems it handles well and problems it does not, so no single method is best in every situation.
SGD is still used in many studies today, and Momentum and AdaGrad are also worth trying. Recently, many researchers and engineers prefer Adam. In this text we will mainly use SGD or Adam.

Comparison of update methods based on the MNIST dataset
Taking handwritten digit recognition as an example, let us compare the four methods introduced above (SGD, Momentum, AdaGrad, and Adam) and confirm how they differ in learning progress.
This experiment uses a 5-layer neural network with 100 neurons in each hidden layer. ReLU is used as the activation function.

Look at the code:

# coding: utf-8
import os
import sys
sys.path.append(os.pardir)  # settings for importing files from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve
from common.multi_layer_net import MultiLayerNet
from common.optimizer import *


# 0: Load the MNIST data ==========
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2000


# 1: Experimental setup ==========
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['AdaGrad'] = AdaGrad()
optimizers['Adam'] = Adam()
#optimizers['RMSprop'] = RMSprop()

networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = MultiLayerNet(
        input_size=784, hidden_size_list=[100, 100, 100, 100],
        output_size=10)
    train_loss[key] = []    


# 2: Start training ==========
for i in range(max_iterations):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)
    
        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)
    
    if i % 100 == 0:
        print( "===========" + "iteration:" + str(i) + "===========")
        for key in optimizers.keys():
            loss = networks[key].loss(x_batch, t_batch)
            print(key + ":" + str(loss))


# 3: Plot the graph ==========
markers = {"SGD": "o", "Momentum": "x", "AdaGrad": "s", "Adam": "D"}
x = np.arange(max_iterations)
for key in optimizers.keys():
    plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
plt.xlabel("iterations")
plt.ylabel("loss")
plt.ylim(0, 1)
plt.legend()
plt.show()

The result graph is as follows
[Figure] Training loss of SGD, Momentum, AdaGrad, and Adam on the MNIST dataset.
From the results in the graph, we can see that the other three methods learn faster than SGD, and their speeds are roughly the same as one another; looking closely, AdaGrad appears to learn slightly faster. Note that the results change depending on hyperparameter settings such as the learning rate and on the structure of the network.


Origin blog.csdn.net/loyd3/article/details/131081179