[Paper] An overview of gradient descent optimization algorithms

This article first appeared as a post on Sebastian Ruder's blog on January 16, 2016.

Full text of the paper:
An overview of gradient descent optimization algorithms
0. Abstract
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.
In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.

1. Introduction
Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).
These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.
In section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.
Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.

Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in R^d\) by updating the parameters in the opposite direction of the gradient of the objective function \({\nabla}_{\theta} J({\theta})\) w.r.t. the parameters.
The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
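As a rough sketch of this update rule in code (the quadratic objective, the starting point, the learning rate value, and the iteration count below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Illustrative objective: J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
def grad_J(theta):
    return theta

theta = np.array([3.0, -2.0])   # initial parameters (arbitrary starting point)
eta = 0.1                       # learning rate: the step size toward a (local) minimum

for _ in range(100):
    theta = theta - eta * grad_J(theta)   # step in the opposite direction of the gradient

print(theta)   # moves toward the minimizer [0., 0.]
```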

2. Gradient descent variants
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

2.1 Batch gradient descent
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:
\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)\)    (1)
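A minimal sketch of this update, assuming a simple least-squares cost as \(J(\theta)\); the data X, y, the learning rate, and the epoch count are made-up placeholders, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))        # the entire training dataset
true_theta = np.array([1., 2., 3., 4., 5.])
y = X @ true_theta                        # targets from an assumed linear model

theta = np.zeros(5)                       # parameters to learn
eta = 0.1                                 # learning rate

for epoch in range(200):
    # Gradient of J(theta) = 0.5 * mean((X @ theta - y)**2), computed over ALL examples
    grad = X.T @ (X @ theta - y) / len(y)
    theta = theta - eta * grad            # Eq. (1): one update per full pass over the data

print(theta)   # should end up close to true_theta
```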

Reposted from www.cnblogs.com/yanqiang/p/11301079.html