Parameter update methods

SGD

Update rule: W ← W − η · ∂L/∂W

The weight W is updated by subtracting the product of the learning rate η and the partial derivative of the loss function with respect to W.

class SGD:

    """Stochastic Gradient Descent (SGD)"""

    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # step each parameter against its gradient, scaled by the learning rate
        for key in params.keys():
            params[key] -= self.lr * grads[key]

Strategy: move in the direction of the negative gradient.
Advantages: simple to implement.
Disadvantages: inefficient on some problems; when the loss surface is elongated (anisotropic), the search path zig-zags.
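
As a quick sketch of how such an optimizer class is used (the parameter and gradient dictionaries below are made up for illustration; a real network would supply its own):

import numpy as np

# made-up parameters and gradients for a single linear layer (illustration only)
params = {'W': np.random.randn(2, 3), 'b': np.zeros(3)}
grads = {'W': np.full((2, 3), 0.5), 'b': np.full(3, 0.1)}

optimizer = SGD(lr=0.01)
optimizer.update(params, grads)  # each entry becomes params[key] - 0.01 * grads[key]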

Momentum

Update rule: v ← αv − η · ∂L/∂W,  W ← W + v  (α is the momentum coefficient, v the velocity)

import numpy as np


class Momentum:

    """Momentum SGD"""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # velocity; created lazily so it matches the parameter shapes

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                # np.zeros_like(val) returns an all-zero array with the same shape as val
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]

(Aside on np.zeros_like: for a = np.arange(24).reshape(4, 6), b = np.zeros_like(a) gives b.shape == (4, 6) with every element 0, i.e. an all-zero array with the same shape as the array a you pass in.)

Strategy: simulate the motion of a physical object, which stabilizes the update path.
Advantages: approaches the minimum faster along the x-axis direction (less zig-zag than plain SGD).
Disadvantages: adds an extra hyperparameter (the momentum coefficient) that has to be tuned.
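
A small comparison sketch, using the two classes above on the quadratic f(x, y) = x**2 / 20 + y**2 (the function, starting point and learning rates here are illustrative choices, not from the original post):

def grad_f(p):
    # gradient of f(x, y) = x**2 / 20 + y**2, an elongated (anisotropic) bowl
    return {'x': p['x'] / 10.0, 'y': 2.0 * p['y']}

for opt in (SGD(lr=0.9), Momentum(lr=0.1, momentum=0.9)):
    p = {'x': -7.0, 'y': 2.0}
    for _ in range(30):
        opt.update(p, grad_f(p))
    print(type(opt).__name__, p['x'], p['y'])  # both head toward the minimum at (0, 0)

With these settings SGD's y coordinate flips sign on every step, while the momentum update changes direction only every several steps, which is the stabilization described above.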

AdaGrad

Update rule: h ← h + ∂L/∂W ⊙ ∂L/∂W,  W ← W − η · (1/√h) · ∂L/∂W  (⊙ is element-wise multiplication)

Strategy: AdaGrad adapts the learning rate for each parameter element individually; elements that have already received large updates (large accumulated gradients) get a smaller learning rate.
Advantages: moves efficiently toward the minimum of the function.
Disadvantages: if learning goes on indefinitely, the accumulated term keeps growing, the update amount shrinks toward 0, and the parameters eventually stop being updated.
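
The post shows no code for AdaGrad; the following is a minimal sketch in the same style as the SGD and Momentum classes above (the 1e-7 constant is an assumption added here to avoid division by zero):

import numpy as np


class AdaGrad:

    """AdaGrad: a per-element adaptive learning rate"""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients, one entry per parameter

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]  # accumulate squared gradients
            # elements with a large accumulated gradient take smaller steps
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)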

Adam

Update rule (exponential moving averages of the gradient and its square, with bias correction):
m ← β1·m + (1 − β1) · ∂L/∂W
v ← β2·v + (1 − β2) · (∂L/∂W)²
W ← W − η · m̂ / (√v̂ + ε),  where m̂ = m / (1 − β1^t) and v̂ = v / (1 − β2^t)
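
No code is given for Adam either; below is a minimal sketch in the same style, implementing the update rule above (the defaults lr=0.001, beta1=0.9, beta2=0.999 and the 1e-7 constant are conventional choices assumed here, and the bias correction is folded into a per-step learning rate):

import numpy as np


class Adam:

    """Adam: Momentum-style first moment combined with AdaGrad-style second moment"""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None  # moving average of the gradient
        self.v = None  # moving average of the squared gradient

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # bias-corrected step size: combines the 1/(1 - beta1^t) and sqrt(1 - beta2^t) factors
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)

        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])      # m = beta1*m + (1-beta1)*grad
            self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])  # v = beta2*v + (1-beta2)*grad**2
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)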

Adam usually runs faster than SGD, but it has two major drawbacks: the result may fail to converge, and it may not find the globally optimal solution. In other words, on some problems its generalization is poor and its final performance falls short of SGD.


Origin blog.csdn.net/weixin_34056162/article/details/90974494