SGD
SGD updates each weight W by subtracting the product of the learning rate η and the partial derivative of the loss L with respect to W: W ← W − η·∂L/∂W
class SGD:
    """Stochastic Gradient Descent"""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # update each parameter in place: W <- W - lr * dL/dW
        for key in params.keys():
            params[key] -= self.lr * grads[key]
Strategy: move the parameters a step in the direction of the negative gradient
Advantages: simple to implement, as the example below shows
Disadvantages: inefficient when the loss surface is anisotropic, since the gradient then does not point toward the minimum and the search path zigzags
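As a quick illustration, one hypothetical training step with this class (the parameter names W1/b1 and the all-ones gradients are made up for the example):

import numpy as np

params = {"W1": np.random.randn(2, 3), "b1": np.zeros(3)}
grads = {"W1": np.ones((2, 3)), "b1": np.ones(3)}   # dummy gradients

optimizer = SGD(lr=0.1)
optimizer.update(params, grads)   # each entry becomes W - 0.1 * grad, in place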
Momentum
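Momentum keeps a velocity v for each parameter: v ← momentum·v − lr·∂L/∂W, then W ← W + v. The velocity accumulates past gradients, which is exactly what the update method below implements.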
import numpy as np

class Momentum:
    """Momentum SGD"""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # velocity, created lazily to match the shapes of params

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                # np.zeros_like(val) returns an all-zeros array with the same
                # shape as the array you pass in, e.g.:
                #   >>> a = np.arange(24).reshape(4, 6)
                #   >>> b = np.zeros_like(a)
                #   >>> b.shape
                #   (4, 6)
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            # v <- momentum * v - lr * dL/dW, then W <- W + v
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
Strategy: mimic the motion of a physical object, using an accumulated velocity to stabilize the updates
Advantages: moves faster along directions where the gradient is small but consistent (e.g., the x-axis in the classic elongated-bowl example), so it approaches the minimum with less zigzagging
Disadvantages: introduces an extra hyperparameter (the momentum coefficient) that must be tuned along with the learning rate
AdaGrad
Strategy: AdaGrad adapts the learning rate for each parameter element individually; elements whose gradients have been large (i.e., that have been updated a lot) get a progressively smaller learning rate (see the sketch after this list)
Advantages: moves efficiently toward the minimum of the function
Disadvantages: because squared gradients accumulate without bound, after long enough training the update amount decays to 0 and the parameters stop being updated
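A minimal AdaGrad sketch in the same style as the classes above (the accumulator h of squared gradients and the 1e-7 term that guards against division by zero are conventional details assumed here, not taken from this text):

class AdaGrad:
    """AdaGrad: per-element adaptive learning rate"""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            # accumulate squared gradients, then shrink the step by 1/sqrt(h)
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)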
Adam
Adam usually runs faster than SGD, but it has two major drawbacks: its results may fail to converge, and it may not find the globally optimal solution. In other words, on certain problems its generalization is poor and its final performance falls short of SGD's.
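For comparison with the classes above, a minimal Adam sketch under the same params/grads interface (the default hyperparameters lr=0.001, beta1=0.9, beta2=0.999 and the trick of folding the bias correction into the step size are common conventions, assumed rather than taken from this text):

class Adam:
    """Adam: momentum plus per-element adaptive learning rate"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1  # decay rate for the first moment (mean of gradients)
        self.beta2 = beta2  # decay rate for the second moment (mean of squared gradients)
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        self.iter += 1
        # bias correction for both moments, folded into the step size
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        for key in params.keys():
            # equivalent to m = beta1*m + (1-beta1)*grad (and likewise for v)
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)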